STAT.203 Applied Regression
MS in Mathematics - 2024
These course slides should not be reproduced or used by others without permission.
Prof. Dr. Md. Rezaul Karim, Department of Statistics and Data Science, JU. MS in Mathematics - 2024 1 / 248
Introduction
1 Introduction
Introduction Learning Outcomes of the Course
Introduction Text and Reference List
Reference list
Lecture Outline I
1 Introduction
2.1 Variable
2.5 Covariance
Lecture Outline II
Lecture Outline IV
3.5 Types of parametric regression analysis
Lecture Outline V
3.11 Real Data Example: Obstetrics Dataset
4.5 Adjusted R²
Lecture Outline VI
4.6 Example
4.7 F-test in Multiple Regression
4.8 ANOVA Table in Regression Analysis
4.9 t-tests in Multiple Regression
4.10 Real Data Example: Hypertension Dataset
4.11 Python Code: Hypertension Dataset
4.12 R Code: Hypertension Dataset
Chapter 1: Correlation and Association Analysis
2.1 Variable
Statistical methods
can be used to summarize or describe a collection of data; this is called descriptive statistics
allow us to make predictions (inferences) from that data; this is called inferential statistics
Chapter 1: Correlation and Association Analysis Variable
Variable
A variable is an attribute or characteristic of interest that varies from respondent to respondent.
Types of Variables
• qualitative variable (also known as categorical variable)
• quantitative variable (also known as numerical variable)
  ▸ discrete variable
  ▸ continuous variable
Chapter 1: Correlation and Association Analysis Level or Scales of Measurement
Chapter 1: Correlation and Association Analysis Summarizing bivariate data
▸ A tabular summary of data for two variables. The classes for one
variable are represented by the rows; the classes for the other variable
are represented by the columns.
Chapter 1: Correlation and Association Analysis Scatter Diagram
Scatter Diagram
A scatter diagram is a graphical tool used to portray the relationship between two variables.
Chapter 1: Correlation and Association Analysis Covariance
$$ s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} $$
(ii). Based on the scatter diagram, describe the observed trend or pattern
in the data.
(iv). Interpret the sample covariance value in terms of the strength and
direction of the relationship between the number of commercial
advertisements and the sales volume.
Sample Covariance
What is the nature of the relationship between x and y?
Sample covariance:
$$ s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} $$
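The sample covariance formula can be checked numerically. The following is a minimal sketch in plain Python; the function name `sample_covariance` and the data values are illustrative assumptions, not taken from the slides.

```python
def sample_covariance(x, y):
    """Sample covariance s_xy with divisor n - 1."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-products of deviations from the means
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return cross / (n - 1)


# Hypothetical data: number of advertisements and sales volume
ads = [2, 5, 1, 3, 4]
sales = [50, 57, 41, 54, 54]

s_xy = sample_covariance(ads, sales)
print("Sample covariance:", round(s_xy, 4))
```

A positive value suggests that the two variables tend to move in the same direction. The magnitude depends on the units of measurement, so it does not by itself indicate the strength of the relationship.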
Chapter 1: Correlation and Association Analysis Correlation Analysis
Correlation Analysis
Correlation analysis is a group of techniques to measure the strength and
direction of the relationship between two variables. There are different
types of correlation coefficients that can be used depending on the nature
of the variables being analyzed and the assumptions of the data. Some of
the common types of correlation coefficients include:
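$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\, s_x s_y} = \frac{s_{xy}}{s_x s_y} $$
where $s_x = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$ (the sample standard deviation), and
analogously for $s_y$.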
Chapter 1: Correlation and Association Analysis Correlation Coefficient
Working Formula
$$ r_{xy} = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sqrt{\left(\sum_{i=1}^{n} x_i^2 - n \bar{x}^2\right)}\,\sqrt{\left(\sum_{i=1}^{n} y_i^2 - n \bar{y}^2\right)}} $$
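The definitional formula and the working formula for the Pearson correlation coefficient should give the same value. The sketch below computes both in plain Python on a small data set; the function names and the data are illustrative assumptions.

```python
import math


def pearson_definition(x, y):
    """Pearson r from sums of squared deviations and cross-products."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


def pearson_working(x, y):
    """Pearson r from the working formula based on raw sums."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    numerator = sum(a * b for a, b in zip(x, y)) - n * x_bar * y_bar
    denominator = math.sqrt(
        (sum(a * a for a in x) - n * x_bar ** 2)
        * (sum(b * b for b in y) - n * y_bar ** 2)
    )
    return numerator / denominator


# Hypothetical data (not from the slides)
x = [2, 5, 1, 3, 4]
y = [50, 57, 41, 54, 54]

r_def = pearson_definition(x, y)
r_work = pearson_working(x, y)
print("r (definition):", round(r_def, 4))
print("r (working formula):", round(r_work, 4))
```

The working formula avoids computing deviations explicitly, which is convenient for hand calculation. With large values it can lose numerical precision, so the deviation form is usually preferable in software.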
Exercises
1 The following sample of observations were randomly selected.
x 4 5 3 6 10
y 4 6 5 7 7
Car   Age (years)   Selling Price ($000)
1     9             8.1
2     7             6.0
3     11            3.6
4     12            4.0
5     8             5.0
6     7             10.0
7     8             7.6
8     11            8.0
9     10            8.0
10    12            6.0
11    6             8.6
12    6             8.0
(a). Draw a scatter diagram.
(b). Determine the correlation coefficient.
(c). Interpret the correlation coefficient. Does it surprise you that the
correlation coefficient is negative?
1 hypothesis
2 level of significance α
3 reject H0 if
$$ T > t_{\alpha/2,\,n-2} \quad \text{or} \quad T < -t_{\alpha/2,\,n-2} $$
4 test statistic
$$ T = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t \text{ distribution with } n-2 \text{ degrees of freedom} $$
5 decision
$$ H_0: \rho = 0 $$
If the null hypothesis is true, the test statistic T follows Student's t
distribution with (n − 2) degrees of freedom, i.e., T ∼ t(n − 2).
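The test statistic and the two-sided rejection rule can be sketched as follows. The values of `r` and `n` are hypothetical, and the critical value is an assumption read from a standard t table (t with 25 degrees of freedom, upper 2.5% point), since the Python standard library has no t distribution.

```python
import math


def correlation_t_statistic(r, n):
    """T = r * sqrt(n - 2) / sqrt(1 - r^2) for testing H0: rho = 0."""
    if n < 3:
        raise ValueError("n must be at least 3")
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


# Hypothetical sample results (not from the slides)
r = 0.6
n = 27

# Assumed critical value t_{0.025, 25} taken from a t table (alpha = 0.05)
t_critical = 2.060

T = correlation_t_statistic(r, n)
reject_h0 = abs(T) > t_critical

print("T =", round(T, 4))
print("Reject H0:", reject_h0)
```

Comparing |T| with the critical value is equivalent to the two inequalities in the rejection rule above.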
Chapter 1: Correlation and Association Analysis R Code: Correlation Matrix
R Code
# Example data (you should replace this with your actual data)
data <- data.frame(
x1 = c(1, 2, 3, 4, 5),
x2 = c(2, 3, 4, 5, 6),
x3 = c(3, 4, 5, 6, 7)
)
# Compute correlation matrix
correlation_matrix <- cor(data)
print(correlation_matrix)
# Compute correlation matrix with 2 decimal places
correlation_matrix <- round(cor(data), 2)
print(correlation_matrix)
Chapter 1: Correlation and Association Analysis Python Code: Correlation Matrix
import pandas as pd

# Example data (you should replace this with your actual data)
data = pd.DataFrame({
    'x1': [1, 2, 3, 4, 5],
    'x2': [2, 3, 4, 5, 6],
    'x3': [3, 4, 5, 6, 7]
})
# Compute correlation matrix
correlation_matrix = data.corr()
print(correlation_matrix)
# Compute correlation matrix with 2 decimal places
correlation_matrix = data.corr().round(2)
print(correlation_matrix)
Chapter 1: Correlation and Association Analysis Rank Correlation
Question (Q1)
Based on the following data, find the rank correlation between marks in the
English and Mathematics courses.
English   56 75 45 71 62 64 58 80 76 61
Maths     66 70 40 60 65 56 59 77 67 63
Solution: The procedure for ranking these scores is as follows:
$$ r_R = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 54}{10(10^2 - 1)} = 0.6727 $$
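The calculation for Q1 can be reproduced with a short script. This is a sketch in plain Python that assumes no tied scores; the helper names are illustrative, and ranks are assigned in ascending order (ranking in descending order gives the same squared differences).

```python
def ranks_no_ties(values):
    """Rank values from 1 (smallest) to n, assuming all values are distinct."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman_no_ties(x, y):
    """Spearman rank correlation 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    n = len(x)
    rank_x, rank_y = ranks_no_ties(x), ranks_no_ties(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1)), sum_d2


# Marks from Question (Q1)
english = [56, 75, 45, 71, 62, 64, 58, 80, 76, 61]
maths = [66, 70, 40, 60, 65, 56, 59, 77, 67, 63]

r_rank, sum_d2 = spearman_no_ties(english, maths)
print("Sum of squared rank differences:", sum_d2)
print("Rank correlation:", round(r_rank, 4))
```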
If more than one group of items shares a common rank, the correction term
m(m² − 1)/12 is added once for each such group, where m is the number of
items in that group:
$$ r_R = 1 - \frac{6\left\{\sum d_i^2 + \frac{m_1}{12}(m_1^2 - 1) + \frac{m_2}{12}(m_2^2 - 1) + \cdots\right\}}{n(n^2 - 1)} $$
Question (Q2)
Based on the following data, find the rank correlation between marks in the
English and Mathematics courses.
English   56 75 45 71 61 64 58 80 76 61
Maths     70 70 40 60 65 56 59 70 67 80
Solution: The procedure for ranking these scores is as follows:
$$ r_R = 1 - \frac{6\left\{104.5 + \frac{2}{12}(2^2 - 1) + \frac{3}{12}(3^2 - 1)\right\}}{10(10^2 - 1)} = 0.3515 $$
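The tie-corrected calculation for Q2 can be sketched in the same way. Tied scores receive the average of the ranks they occupy, and each group of m tied items contributes m(m² − 1)/12. The helper names below are illustrative assumptions.

```python
from collections import Counter


def average_ranks(values):
    """Ascending ranks; tied values share the average of their positions."""
    first_position = {}
    for position, v in enumerate(sorted(values), start=1):
        first_position.setdefault(v, position)
    counts = Counter(values)
    return [first_position[v] + (counts[v] - 1) / 2 for v in values]


def tie_correction(values):
    """Sum of m(m^2 - 1)/12 over all groups of m tied values."""
    return sum(m * (m * m - 1) / 12 for m in Counter(values).values() if m > 1)


def spearman_tie_corrected(x, y):
    n = len(x)
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    correction = tie_correction(x) + tie_correction(y)
    return 1 - 6 * (sum_d2 + correction) / (n * (n * n - 1)), sum_d2


# Marks from Question (Q2)
english = [56, 75, 45, 71, 61, 64, 58, 80, 76, 61]
maths = [70, 70, 40, 60, 65, 56, 59, 70, 67, 80]

r_rank, sum_d2 = spearman_tie_corrected(english, maths)
print("Sum of squared rank differences:", sum_d2)
print("Tie-corrected rank correlation:", round(r_rank, 4))
```

Library routines that compute the Pearson correlation of the averaged ranks may return a slightly different value when ties are present, because that approach is not algebraically identical to this correction formula.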
A study showed that ice cream sales are correlated with homicides
in New York.
▸ As the sales of ice cream rise and fall, so does the number of homicides.
Does the consumption of ice cream cause the deaths?
▸ No. That two things are correlated does not mean that one causes the other.
Rather, ice cream sales and homicides each have a causal relationship with
the weather.
Chapter 1: Correlation and Association Analysis R Code: Rank Correlation
Chapter 1: Correlation and Association Analysis Python Code: Rank Correlation
Chapter 1: Correlation and Association Analysis Kendall Tau Correlation Coefficient
Definition
Formula:
$$ \tau = \frac{n_c - n_d}{\frac{1}{2} n(n - 1)} $$
where $n_c$ is the number of concordant pairs and $n_d$ is the number of
discordant pairs.
Suppose we have the following data on two variables, X and Y, with their
corresponding ranks:
X    Y
10   15
15   10
20   20
25   25
30   40
Using this data, we can calculate the Kendall Tau correlation coefficient as
follows:
Concordant Pairs:
In the context of correlation coefficients such as Kendall Tau,
concordant pairs refer to pairs of observations where the ranks for
both variables follow the same order. In other words, if (Xi, Yi) and
(Xj, Yj) are two pairs of observations, they are considered concordant
if both Xi < Xj and Yi < Yj or if both Xi > Xj and Yi > Yj.
Discordant Pairs: Discordant pairs, on the other hand, refer to pairs
of observations where the ranks for the variables have opposite orders.
In other words, if (Xi, Yi) and (Xj, Yj) are two pairs of observations,
they are considered discordant if Xi < Xj and Yi > Yj or if Xi > Xj and
Yi < Yj.
In the context of calculating correlation coefficients like Kendall Tau,
understanding concordant and discordant pairs is crucial as they form
the basis for determining the strength and direction of association
between two variables based on their ranks.
Example Calculation
So, out of the 10 pairs of observations, there are 9 concordant pairs and 1
discordant pair.
Total number of pairs, with n = 5 observations:
$$ \frac{n(n - 1)}{2} = \frac{5(5 - 1)}{2} = 10 $$
Kendall Tau coefficient (τ):
$$ \tau = \frac{n_c - n_d}{\frac{1}{2} n(n - 1)} = \frac{9 - 1}{\frac{1}{2}(5)(4)} = \frac{8}{10} = 0.8 $$
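Counting concordant and discordant pairs by hand is error-prone, so a short script is a useful check. This is a sketch in plain Python that ignores tied pairs (there are none in this example); the function name is an illustrative assumption. Note that n in the denominator is the number of observations (here 5), so ½n(n − 1) equals the number of pairs (here 10).

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau = (n_c - n_d) / (n(n - 1)/2), ignoring tied pairs."""
    n_c = n_d = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        sign = (xi - xj) * (yi - yj)
        if sign > 0:
            n_c += 1  # both variables order the pair the same way
        elif sign < 0:
            n_d += 1  # the variables order the pair in opposite ways
    n = len(x)
    return (n_c - n_d) / (0.5 * n * (n - 1)), n_c, n_d


# Data from the example above
X = [10, 15, 20, 25, 30]
Y = [15, 10, 20, 25, 40]

tau, n_c, n_d = kendall_tau(X, Y)
print("Concordant pairs:", n_c)
print("Discordant pairs:", n_d)
print("Kendall tau:", tau)
```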
Chapter 1: Correlation and Association Analysis Point-Biserial Correlation Coefficient
$$ r_{pb} = \frac{\bar{X}_1 - \bar{X}_0}{s_X} \sqrt{p(1 - p)} $$
where:
$\bar{X}_1$ and $\bar{X}_0$ are the mean scores of the two groups, p is the
proportion of observations in group 1, and $s_X$ is the standard deviation of
all scores (computed with divisor n).
Question
Suppose we are interested in examining the relationship between students'
exam scores and their gender (male/female). We have the following data:
Solution
$$ r_{pb} = \frac{87.6 - 75.4}{8.67} \sqrt{0.5 \times (1 - 0.5)} \approx 0.70 $$
Therefore, the point-biserial correlation coefficient is approximately 0.70,
indicating a strong positive relationship between exam scores and gender.
Chapter 1: Correlation and Association Analysis Phi Coefficient
Definition
We have the following contingency table for two categorical variables:
                 Variable 2 = 0   Variable 2 = 1   Total
Variable 1 = 0   n00              n01              n0+
Variable 1 = 1   n10              n11              n1+
Total            n+0              n+1              n++
Here, n00 represents the frequency of observations with Variable 1 being in
category 0 and Variable 2 being in category 0, and so on.
Variable 2 = 0 Variable 2 = 1
Variable 1 = 0 20 30
Variable 1 = 1 40 10
n1+ = 40 + 10 = 50
n0+ = 20 + 30 = 50
n+1 = 30 + 10 = 40
n+0 = 20 + 40 = 60
n00 = 20
n01 = 30
n10 = 40
n11 = 10
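$$ \phi = \frac{n_{11} n_{00} - n_{10} n_{01}}{\sqrt{n_{1+}\, n_{0+}\, n_{+1}\, n_{+0}}} = \frac{10 \times 20 - 40 \times 30}{\sqrt{50 \times 50 \times 40 \times 60}} = \frac{-1000}{\sqrt{6000000}} \approx -0.4082 $$
So, the phi coefficient for this example is approximately -0.4082, indicating
a moderate negative association between Variable 1 and Variable 2.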
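The phi coefficient for the 2×2 table above can be verified with a few lines of plain Python. This sketch uses the cross-product form, (n11·n00 − n10·n01) divided by the square root of the product of the four marginal totals; the function name is an illustrative assumption.

```python
import math


def phi_coefficient(n00, n01, n10, n11):
    """Phi coefficient of a 2x2 table from its four cell counts."""
    row0, row1 = n00 + n01, n10 + n11  # n0+, n1+
    col0, col1 = n00 + n10, n01 + n11  # n+0, n+1
    denominator = math.sqrt(row0 * row1 * col0 * col1)
    if denominator == 0:
        raise ValueError("phi is undefined when a marginal total is zero")
    return (n11 * n00 - n10 * n01) / denominator


# Cell counts from the example table
phi = phi_coefficient(n00=20, n01=30, n10=40, n11=10)
print("Phi coefficient:", round(phi, 4))
```

For a 2×2 table, the absolute value of phi equals the square root of the chi-square statistic divided by the total count, which offers an independent check of the result.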
Chapter 1: Correlation and Association Analysis Cramér's V
Definition
             Variable 1
Variable 2   A    B    Total
C            20   30   50
D            10   40   50
Total        30   70   100
$$ E_{11} = \frac{50 \times 30}{100} = 15 $$
$$ E_{12} = \frac{50 \times 70}{100} = 35 $$
$$ E_{21} = \frac{50 \times 30}{100} = 15 $$
$$ E_{22} = \frac{50 \times 70}{100} = 35 $$
Then, compute the chi-square statistic:
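A sketch of this step for the table above, in plain Python. It assumes the usual Pearson chi-square statistic, the sum of (observed − expected)²/expected over all cells, and the usual definition of Cramér's V as the square root of χ² / (n · min(r − 1, c − 1)); neither formula is shown on the surviving slide text, and the function name is illustrative.

```python
import math


def chi_square_and_cramers_v(table):
    """Return (chi-square, Cramér's V) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1  # min(r - 1, c - 1)
    return chi2, math.sqrt(chi2 / (n * k))


# Observed counts: rows C and D, columns A and B
observed = [[20, 30],
            [10, 40]]

chi2, v = chi_square_and_cramers_v(observed)
print("Chi-square:", round(chi2, 4))
print("Cramér's V:", round(v, 4))
```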
Chapter 1: Correlation and Association Analysis The Kappa Statistic
$$ \kappa = \frac{P_o - P_e}{1 - P_e} $$
Where:
$P_o$ is the proportion of observed agreement.
$P_e$ is the proportion of agreement expected by chance.
             Judgment 2
Judgment 1   Yes     No      Total
Yes          a       b       a + b
No           c       d       c + d
Total        a + c   b + d   n
where
$$ P_o = \frac{a + d}{a + b + c + d} $$
$$ P_{Yes} = \frac{a + b}{a + b + c + d} \cdot \frac{a + c}{a + b + c + d} $$
Similarly:
$$ P_{No} = \frac{c + d}{a + b + c + d} \cdot \frac{b + d}{a + b + c + d} $$
Finally,
$$ P_e = P_{Yes} + P_{No} $$
           Reader 1
Reader 2   Present   Absent   Total
Present    20        5        25
Absent     10        15       25
Total      30        20       50
Calculate Cohen's Kappa to assess the agreement between the two readers.
$$ P_o = \frac{a + d}{a + b + c + d} = \frac{20 + 15}{50} = 0.7 $$
$$ P_{Yes} = \frac{a + b}{a + b + c + d} \cdot \frac{a + c}{a + b + c + d} = 0.5 \times 0.6 = 0.3 $$
Similarly:
$$ P_{No} = \frac{c + d}{a + b + c + d} \cdot \frac{b + d}{a + b + c + d} = 0.5 \times 0.4 = 0.2 $$
so that $P_e = P_{Yes} + P_{No} = 0.3 + 0.2 = 0.5$, and
$$ \kappa = \frac{P_o - P_e}{1 - P_e} = \frac{0.7 - 0.5}{1 - 0.5} = 0.4 $$
Chapter 1: Correlation and Association Analysis Python Code: The Kappa Statistic
# Total number of observations in the 2x2 table
total_obs = 50
# Observed agreement (diagonal cells)
P_o = (20 + 15) / total_obs
# Expected agreement
P_present_reader1 = 25 / total_obs
P_absent_reader1 = 25 / total_obs
P_present_reader2 = 30 / total_obs
P_absent_reader2 = 20 / total_obs
P_e = (P_present_reader1 * P_present_reader2
       + P_absent_reader1 * P_absent_reader2)
# Cohen's Kappa
kappa = (P_o - P_e) / (1 - P_e)
print("Cohen's Kappa:", kappa)
Another Example
from sklearn.metrics import cohen_kappa_score
# Example data
doctor1_ratings = [3, 2, 1, 5, 4, 2, 3, 4, 1, 5]
doctor2_ratings = [4, 2, 1, 5, 3, 2, 3, 4, 2, 5]
# Cohen's Kappa for the two sets of ratings
kappa = cohen_kappa_score(doctor1_ratings, doctor2_ratings)
print("Cohen's Kappa:", kappa)
Chapter 1: Correlation and Association Analysis Partial Correlation
Partial Correlation
Partial correlation is a statistical technique used to measure the strength
and direction of the relationship between two variables while controlling for
the influence of one or more additional variables. The formula for
calculating the partial correlation coefficient (denoted as $r_{xy.z}$) between
variables X and Y while controlling for variable Z is given by:
$$ r_{xy.z} = \frac{r_{xy} - r_{xz}\, r_{yz}}{\sqrt{(1 - r_{xz}^2)(1 - r_{yz}^2)}} $$
Where:
$r_{xy}$, $r_{xz}$ and $r_{yz}$ are the pairwise Pearson correlation coefficients
between the indicated variables.
Chapter 1: Correlation and Association Analysis Python Code: The Partial Correlation
import pandas as pd

# Sample data
data = {
    'X': [1, 2, 3, 4, 5],
    'Y': [2, 3, 5, 4, 6],
    'Z': [5, 4, 3, 2, 1]
}
# Create a DataFrame
df = pd.DataFrame(data)
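A first-order partial correlation can also be computed directly from the three pairwise Pearson correlations, using r_xy.z = (r_xy − r_xz·r_yz) / sqrt((1 − r_xz²)(1 − r_yz²)). The sketch below is plain Python with illustrative function names. It uses hypothetical values for Z, because in the sample data above Z = 6 − X exactly, so r_xz = −1 and the denominator of the formula is zero.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


def partial_correlation(x, y, z):
    """First-order partial correlation of x and y controlling for z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    if denominator < 1e-12:
        raise ValueError("undefined: z is perfectly correlated with x or y")
    return (r_xy - r_xz * r_yz) / denominator


# Hypothetical data (Z differs from the sample data above)
X = [1, 2, 3, 4, 5]
Y = [2, 3, 5, 4, 6]
Z = [2, 1, 4, 3, 5]

r_xy_z = partial_correlation(X, Y, Z)
print("r_xy   =", round(pearson(X, Y), 4))
print("r_xy.z =", round(r_xy_z, 4))
```

Here the partial correlation is smaller than the simple correlation between X and Y, which suggests that part of their association is shared with Z.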
Chapter 1: Correlation and Association Analysis Multiple Correlation
Multiple Correlation
Suppose we have an outcome variable y and a set of predictors x1, . . . , xk.
The maximum possible correlation between y and a linear combination of
the predictors c1 x1 + . . . + ck xk is given by the correlation between y and
the regression function β1 x1 + . . . + βk xk and is called the multiple
correlation between y and {x1, . . . , xk}. It is estimated by the Pearson
correlation between y and b1 x1 + . . . + bk xk, where b1, . . . , bk are the
least-squares estimates of β1, . . . , βk. The multiple correlation can also be
shown to be equivalent to
$$ \sqrt{\frac{\text{Reg SS}}{\text{Total SS}}} = \sqrt{R^2} $$
So the multiple correlation is defined as
$$ R = \sqrt{R^2} $$
where R² is the coefficient of determination of the model with these
independent variables.
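The equivalence between the two descriptions of the multiple correlation can be illustrated numerically. The sketch below, in plain Python, fits a least-squares regression with two predictors by solving the 2×2 normal equations on centered data, then computes R both as the square root of Reg SS / Total SS and as the Pearson correlation between y and the fitted values. The function name and the data are hypothetical.

```python
import math


def multiple_correlation_two_predictors(y, x1, x2):
    """Return (sqrt(RegSS/TotalSS), corr(y, fitted)) for y ~ x1 + x2."""
    n = len(y)
    dy = [v - sum(y) / n for v in y]
    d1 = [v - sum(x1) / n for v in x1]
    d2 = [v - sum(x2) / n for v in x2]
    s11 = sum(a * a for a in d1)
    s22 = sum(b * b for b in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * v for a, v in zip(d1, dy))
    s2y = sum(b * v for b, v in zip(d2, dy))
    det = s11 * s22 - s12 ** 2
    if abs(det) < 1e-12:
        raise ValueError("predictors are perfectly collinear")
    # Least-squares slopes from the normal equations (Cramer's rule)
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    fitted_dev = [b1 * a + b2 * b for a, b in zip(d1, d2)]  # fitted minus mean
    reg_ss = sum(f * f for f in fitted_dev)
    total_ss = sum(v * v for v in dy)
    r_from_ss = math.sqrt(reg_ss / total_ss)
    r_from_corr = sum(f * v for f, v in zip(fitted_dev, dy)) / math.sqrt(
        reg_ss * total_ss
    )
    return r_from_ss, r_from_corr


# Hypothetical data
y = [3, 5, 4, 7, 8, 9]
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]

r_ss, r_corr = multiple_correlation_two_predictors(y, x1, x2)
print("R from sums of squares:", round(r_ss, 4))
print("R from correlation with fitted values:", round(r_corr, 4))
```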
Chapter 1: Correlation and Association Analysis Chapter Exercises
2 Suppose you have collected data on the height and weight of ten
individuals:
Height (cm) Weight (kg)
160 55
165 60
170 65
175 70
180 75
185 80
190 85
195 90
200 95
205 100
             Variable 1
Variable 2   A    B    Total
C            20   30   50
D            10   40   50
Total        30   70   100
(i) Convert the ratings into binary labels. For example, you might assign
1 for "Pass" and 0 for "Fail."
(ii) Calculate the observed agreement (Po ) between Rater 1 and Rater 2's
ratings.
Chapter 2: Regression Analysis: Simple Linear Regression
3.5 Types of parametric regression analysis
Research Problem 1
Greene, J., & Touchstone, J. (1963). Urinary tract estriol: An index of placental
function. American Journal of Obstetrics and Gynecology, 85(1), 1-9.
Chapter 2: Regression Analysis: Simple Linear Regression Problem & Motivation
Table 4: Sample data from the Greene-Touchstone study relating birthweight and
estriol level in pregnant women near term
Functional Relation
Suppose g is a known function.
Y = g (X )
Whenever X is known, Y is completely known.
Examples:
(3 ) y = 5 + x 2
1 1
(1 ) y = 2 x (2 ) y = x − 1
2
2
Chapter 2: Regression Analysis: Simple Linear Regression Functional vs. Statistical Relation
Chapter 2: Regression Analysis: Simple Linear Regression Regression Analysis
Regression Analysis
Correlation analysis does not tell us the why and how behind a
relationship; it only tells us that the relationship exists.
Chapter 2: Regression Analysis: Simple Linear Regression Historical Origin of the Term Regression
Chapter 2: Regression Analysis: Simple Linear Regression Types of parametric regression analysis
▸ Poisson Regression
for survival-time (time-to-event) outcomes
Regression Models
Other types of regression analysis:
Ridge Regression
Lasso Regression
Bayesian regression
Nonparametric regression
Additive model
$$ E(y \mid x) = \beta_0 + \beta_1 x $$
$$ y_i = \beta_0 + \beta_1 x_i + e_i; \quad i = 1, 2, \ldots, n \tag{1} $$
Graphical Presentation
$$ y_i = \beta_0 + \beta_1 x_i + e_i = E(y \mid x_i) + e_i $$
where
▸ β̂0 = estimator of β0
▸ β̂1 = estimator of β1
Chapter 2: Regression Analysis: Simple Linear Regression A Probabilistic View of Linear Regression
\[ f_{Y \mid X}(y \mid x) \;\leftarrow\; \{\, \mu(x),\ \sigma^2(x) \,\} \]
Here the conditional density f_{Y∣X}(y ∣x) is specified parametrically, while µ(x) and σ²(x) may be specified parametrically or nonparametrically.
Conditional mean of Y given X = x:
E(Y ∣X = x) = µ(x)
aims:
if
yi = β0 + β1 xi + ei ; i = 1 , 2, . . . , n (2)
Assumptions
1 Linearity: There exists a linear relationship between the independent
variable X and the dependent variable Y.
2 Independence: The observations are independent of each other.
3 Homoscedasticity: The variance of the errors (residuals) is constant
across all levels of the independent variable X. That is, Var(ei ) = σ².
4 Normality: The errors (residuals) follow a normal distribution with
mean 0. That is, E(ei ) = 0.
5 No perfect multicollinearity: There is no perfect linear relationship
between the independent variable X and any other variable.
That is, the ei are iid N(0, σ²), so the yi are independent with
yi ∼ N(β0 + β1 xi , σ²). Therefore, it is also called normal regression.
Chapter 2: Regression Analysis: Simple Linear Regression Parameter Estimation
yi = β0 + β1 xi + ei ; i = 1, 2, . . . , n
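\[ Q = \min_{\beta_0,\beta_1} \sum_{i=1}^{n} e_i^2 = \min_{\beta_0,\beta_1} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \min_{\beta_0,\beta_1} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \]
The estimators for β0 and β1 can be found by solving the following
normal equations:
\[ \frac{\partial Q}{\partial \beta_0} = \frac{\partial}{\partial \beta_0} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 = 0 \tag{3} \]
and
\[ \frac{\partial Q}{\partial \beta_1} = \frac{\partial}{\partial \beta_1} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 = 0 \tag{4} \]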
OLS estimator
From (3), we have
\[ \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)(-1) = 0 \;\Rightarrow\; \sum_{i=1}^{n} y_i - n\beta_0 - \beta_1 \sum_{i=1}^{n} x_i = 0 \]
or
\[ \beta_0 = \bar{y} - \beta_1 \bar{x} \]
From (4), we have
\[ \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)(-x_i) = 0 \;\Rightarrow\; \sum_{i=1}^{n} x_i y_i - \beta_0 \sum_{i=1}^{n} x_i - \beta_1 \sum_{i=1}^{n} x_i^2 = 0 \tag{5} \]
Substituting β0 = ȳ − β1 x̄ into (5) and using ∑ xi = n x̄ gives
\[ \begin{aligned}
& \sum_{i=1}^{n} x_i y_i - (\bar{y} - \beta_1 \bar{x}) \sum_{i=1}^{n} x_i - \beta_1 \sum_{i=1}^{n} x_i^2 = 0 \\
\Rightarrow\; & \sum_{i=1}^{n} x_i y_i - \bar{y} \sum_{i=1}^{n} x_i + \beta_1 \bar{x} \sum_{i=1}^{n} x_i - \beta_1 \sum_{i=1}^{n} x_i^2 = 0 \\
\Rightarrow\; & \sum_{i=1}^{n} x_i y_i - n\bar{y}\bar{x} + n\beta_1 \bar{x}^2 - \beta_1 \sum_{i=1}^{n} x_i^2 = 0 \\
\Rightarrow\; & \sum_{i=1}^{n} x_i y_i - n\bar{y}\bar{x} - \beta_1 \Big( \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \Big) = 0 \\
\Rightarrow\; & \beta_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}
\end{aligned} \]
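Hence, the ordinary least squares (OLS) estimators of β0 and β1 are
\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \qquad \hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} \]
The expression for β̂1 can also be written as
\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = r \left( \frac{s_y}{s_x} \right) \]
because
\[ \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \quad \text{and} \quad \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} \]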
Chapter 2: Regression Analysis: Simple Linear Regression The Estimated Error Variance or Standard Error
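\[ \hat{\sigma}^2 = \frac{\sum_{i} (y_i - \hat{y}_i)^2}{n-2} = \frac{\sum_{i} y_i^2 - \hat{\beta}_0 \sum_{i} y_i - \hat{\beta}_1 \sum_{i} x_i y_i}{n-2} \]
where
ŷi = β̂0 + β̂1 xi
Hence, the standard error of estimate is
\[ \hat{\sigma} = \sqrt{\frac{\sum_{i} (y_i - \hat{y}_i)^2}{n-2}} = \sqrt{\frac{\sum_{i} y_i^2 - \hat{\beta}_0 \sum_{i} y_i - \hat{\beta}_1 \sum_{i} x_i y_i}{n-2}} \]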
To illustrate the least squares method, suppose data were collected from a
sample of 10 Sultan's Dine restaurants located near university
campuses. For the ith observation or restaurant in the sample, xi is the
size of the student population (in thousands) and yi is the quarterly sales
(in thousands of dollars).
(i). Show the relationship between the size of the student population and the
quarterly sales. Comment on the diagram.
(ii). Write down the regression model for this example, and state the
assumptions of the model.
(iii). Write the estimated regression equation and find the least squares
estimates. Interpret the results.
(iv). Draw the regression line.
Simple regression model:
yi = β0 + β1 xi + ei ; i = 1, 2, . . . , 10 (6)
where
▸ yi = Quarterly Sales ($1000s)
▸ xi = Student Population (1000s)
▸ β0 is the intercept
▸ β1 is the slope coefficient
▸ ei is the error term, assumed to be normally distributed with mean 0
and variance σ².
The estimated regression model (or equation) of (6) is
ŷi = β̂0 + β̂1 xi
           xi      yi     xi²    xi yi
            2      58       4      116
            6     105      36      630
            8      88      64      704
            8     118      64      944
           12     117     144     1404
           16     137     256     2192
           20     157     400     3140
           20     169     400     3380
           22     149     484     3278
           26     202     676     5252
Total     140    1300    2528    21040
\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = \frac{2840}{568} = 5 \]
β̂0 = ȳ − β̂1 x̄ = 130 − 5(14) = 60
Thus, the estimated regression equation is
ŷi = 60 + 5xi
The slope of the estimated regression equation (β̂1 = 5) is positive,
implying that as student population increases, sales increase. In fact, we
can conclude that an increase in the student population of 1000 is
associated with an increase of $5000 in expected sales; that is, quarterly
sales are expected to increase by $5 per student.
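The hand calculation can be checked numerically. The following is a minimal sketch in plain Python; the variable names (s_xy, s_xx, beta0_hat, beta1_hat) are illustrative choices, and the data are the ten (xi, yi) pairs from the table.

```python
# Student population (1000s) and quarterly sales ($1000s) for the 10 restaurants.
x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Corrected sum of cross-products and corrected sum of squares.
s_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
s_xx = sum(xi ** 2 for xi in x) - n * x_bar ** 2

beta1_hat = s_xy / s_xx                 # slope
beta0_hat = y_bar - beta1_hat * x_bar   # intercept

print(f"y_hat = {beta0_hat:.0f} + {beta1_hat:.0f} x")
```

Working with the raw sums mirrors the computational formula used in the slides; the centered form with (xi − x̄)(yi − ȳ) gives the same slope.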
ŷ = 60 + 5 × 16 = 140
\[ \hat{\sigma}^2 = \frac{\sum_{i} (y_i - \hat{y}_i)^2}{n-2} = \frac{1530}{10-2} = 191.25 \quad \text{and} \quad \hat{\sigma} = \sqrt{191.25} = 13.829 \]
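As a check on the error variance, the sketch below recomputes the residual sum of squares for the fitted line ŷ = 60 + 5x and divides it by n − 2 = 8. It is plain Python with illustrative variable names; the resulting σ̂ agrees with the value 13.829 used in the interval calculations later in this chapter.

```python
import math

x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]
beta0_hat, beta1_hat = 60.0, 5.0   # estimates from the worked example

# Residuals e_i = y_i - y_hat_i and their sum of squares (SSE).
residuals = [yi - (beta0_hat + beta1_hat * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)

n = len(x)
sigma2_hat = sse / (n - 2)          # two parameters estimated, so n - 2 df
sigma_hat = math.sqrt(sigma2_hat)

print(sse, sigma2_hat, round(sigma_hat, 3))
```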
Coefficient of Determination
The coefficient of determination is the proportion of the total variation
in the dependent variable Y that is explained by the independent
variable(s) in a regression model. It is denoted by R² and defined by
\[ R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2} \]
where
▸ the total sum of squares (proportional to the variance of the data):
SST = ∑i (yi − ȳ)²
▸ the sum of squares of residuals, also called the Residual or Error Sum
of Squares (SSE): SSE = ∑i (yi − ŷi)² = ∑i ei²
For example, suppose R² = 0.9027. This implies that 90.27% of the
variability of the dependent variable is explained and the remaining
9.73% of the variability is still unexplained by the regression model.
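The definition translates directly into code. The sketch below, in plain Python with illustrative names, applies it to the restaurant sales data with the fitted line ŷ = 60 + 5x; it is meant only as a numerical illustration of R² = 1 − SSE/SST.

```python
x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]

y_bar = sum(y) / len(y)
y_hat = [60 + 5 * xi for xi in x]   # fitted values

sst = sum((yi - y_bar) ** 2 for yi in y)                 # total variation
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))    # unexplained variation
r2 = 1 - sse / sst

print(f"SST = {sst:.0f}, SSE = {sse:.0f}, R^2 = {r2:.4f}")
```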
Chapter 2: Regression Analysis: Simple Linear Regression Coefficient of Determination
For the Sultan's Dine restaurants Sales Dataset given in Example 3.1, we have
Classroom Practice
x    4    5    3    6    10
y    4    6    5    7    7
(d). Find the value of the coefficient of determination and interpret your
results.
 xi    yi    xi²    yi²    xi yi
  4     4     16     16     16
  5     6     25     36     30
  3     5      9     25     15
  6     7     36     49     42
 10     7    100     49     70
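\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = 0.3630 \quad \text{and} \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 3.7671 \]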
when xi is 7.
∑i xi = 28 ∑i yi = 29
SST= 6.8 SSE= 2.9521
where
R 2 = (rxy )2
Hence, for the previous example,
R 2 = (rxy )2
= (0.7522)2
= 0.566
Remarks: Note that the relationship R 2 = (rxy )2 only holds for the
simple linear regression model.
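The identity R² = (rxy)² can be verified on the classroom practice data. The sketch below is plain Python with illustrative names; it computes the correlation coefficient and, separately, R² from the fitted line, and the two routes agree.

```python
import math

x = [4, 5, 3, 6, 10]
y = [4, 6, 5, 7, 7]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)   # this is SST

# Route 1: Pearson correlation coefficient.
r = s_xy / math.sqrt(s_xx * s_yy)

# Route 2: fit the line, then R^2 = 1 - SSE / SST.
beta1_hat = s_xy / s_xx
beta0_hat = y_bar - beta1_hat * x_bar
sse = sum((yi - beta0_hat - beta1_hat * xi) ** 2 for xi, yi in zip(x, y))
r2 = 1 - sse / s_yy

print(f"r = {r:.4f}, r^2 = {r ** 2:.3f}, R^2 = {r2:.3f}")
```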
Remarks: Note that equation (7) is true only for the simple regression
model.
An R² of 1 indicates that the regression predictions perfectly fit the
data.
Adjusted R 2
\[ \bar{R}^2 = 1 - (1 - R^2)\,\frac{n-1}{n-p-1} \]
where p is the total number of explanatory variables in the model
(not including the constant term), and n is the sample size.
It can also be written as:
\[ \bar{R}^2 = 1 - \frac{SSE/df_e}{SST/df_t} \]
with df_e = n − p − 1 and df_t = n − 1.
Adjusted R 2
\[ \bar{R}^2 = 1 - (1 - 0.566)\left(\frac{5-1}{5-1-1}\right) = 0.4212 \]
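The adjustment is a one-line function. The sketch below is plain Python; adjusted_r2 is a hypothetical helper name, and the call uses the rounded value R² = 0.566 with n = 5 and p = 1 from the classroom example, which gives approximately 0.421.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 for n observations and p explanatory variables
    (the intercept is not counted in p)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

value = adjusted_r2(0.566, n=5, p=1)
print(round(value, 3))
```

Because the penalty factor (n − 1)/(n − p − 1) exceeds 1 whenever p ≥ 1, the adjusted value is always below R² unless R² = 1.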
Chapter 2: Regression Analysis: Simple Linear Regression Interval Estimation and Hypothesis Testing
β̂1 is the estimated coefficient for x,
\[ se(\hat{\beta}_1) = \frac{\hat{\sigma}}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}} \]
where
\[ \hat{\sigma} = \sqrt{\frac{\sum_{i} (y_i - \hat{y}_i)^2}{n-2}} \]
is the estimated standard error of the residuals (or the square root of
the mean squared error, often obtained from the regression output).
H0 ∶ β1 = 0
H1 ∶ β1 ≠ 0
The null hypothesis H0 states that the independent variable(s) do not
have any effect on the dependent variable. The alternative hypothesis H1
states that the independent variable(s) do have a significant effect on
the dependent variable. The formula for the F-statistic in a simple linear
regression model is:
\[ F = \frac{SSR/1}{SSE/(n-2)} \]
where:
SSR = ∑i (ŷi − ȳ)² is the regression (explained) sum of squares,
SSE is the error (residual) sum of squares.
In a simple linear regression model, p = 1 because you only have one
independent variable (excluding the intercept).
Once you compute the F-statistic, you can compare it to the critical
value from the F-distribution at a chosen significance level (e.g.,
α = 0.05) to determine whether to reject the null hypothesis.
If the F-statistic is greater than the critical value, you reject the null
hypothesis and conclude that the model is significant. Otherwise, you
fail to reject the null hypothesis.
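The F-test can be illustrated on the restaurant sales data with the fitted line ŷ = 60 + 5x. The sketch below is plain Python with illustrative names. One assumption is made to avoid a statistics library: the 5% critical value of F with (1, 8) degrees of freedom is taken as the square of t(0.025; 8) = 2.306, the value quoted in these slides, using the fact that an F variable with one numerator degree of freedom is the square of a t variable.

```python
x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]
n = len(x)
y_bar = sum(y) / n
y_hat = [60 + 5 * xi for xi in x]   # fitted values

ssr = sum((fi - y_bar) ** 2 for fi in y_hat)             # explained sum of squares
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))    # residual sum of squares

f_stat = (ssr / 1) / (sse / (n - 2))

# Assumed critical value: F(0.05; 1, 8) = t(0.025; 8)^2 with t = 2.306.
f_crit = 2.306 ** 2
reject_h0 = f_stat > f_crit

print(f"F = {f_stat:.2f}, critical value = {f_crit:.2f}, reject H0: {reject_h0}")
```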
Classical Approach
p -value Approach
Hypothesis Test: H0 ∶ β1 = 0
Consider the simple linear regression model:
yi = β0 + β1 xi + ei ; i = 1, 2, . . . , n
H0 ∶ β1 = 0
Under H0 , the test statistic is given by:
\[ t = \frac{\hat{\beta}_1 - 0}{se(\hat{\beta}_1)} \]
where
\[ se(\hat{\beta}_1) = \frac{\hat{\sigma}}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}} \quad \text{and} \quad \hat{\sigma} = \sqrt{\frac{\sum_{i} (y_i - \hat{y}_i)^2}{n-2}} \]
We reject H0 if ∣t∣ exceeds the critical value t_critical = t_{α/2,(n−2)} from the
t-distribution with n − 2 degrees of freedom, where n is the sample size.
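A short numerical illustration of this test, using summary quantities already given for the restaurant sales data (β̂1 = 5, SSE = 1530, ∑(xi − x̄)² = 568, n = 10). The sketch is plain Python with illustrative names; the critical value t(0.025; 8) = 2.306 is the one quoted in these slides.

```python
import math

n = 10
beta1_hat = 5.0
sse = 1530.0        # residual sum of squares
s_xx = 568.0        # sum of (x_i - x_bar)^2

sigma_hat = math.sqrt(sse / (n - 2))
se_beta1 = sigma_hat / math.sqrt(s_xx)
t_stat = (beta1_hat - 0) / se_beta1

t_crit = 2.306      # t(0.025; 8), two-sided test at alpha = 0.05
reject_h0 = abs(t_stat) > t_crit

print(f"se = {se_beta1:.4f}, t = {t_stat:.3f}, reject H0: {reject_h0}")
```

In simple linear regression the square of this t statistic equals the F statistic SSR/(SSE/(n − 2)), so the two tests always lead to the same decision.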
Classical Approach
p -value Approach
It is computed as:
\[ \hat{y}^* \pm t_{\alpha/2,(n-2)} \cdot se(\hat{y}^*) \]
where ŷ* is the predicted value of y for a given x*, t_{α/2,(n−2)} is the critical
value of the t-distribution, n is the number of observations, and
\[ se(\hat{y}^*) = \hat{\sigma} \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}} \]
For Sultan's Dine restaurants Sales Dataset given in Example 3.1, we have
σ̂ = 13.829. With x* = 10, x̄ = 14, and ∑(xi − x̄)² = 568, we have
\[ se(\hat{y}^*) = 13.829 \sqrt{\frac{1}{10} + \frac{(10 - 14)^2}{568}} = 13.829 \sqrt{0.1282} = 4.95 \]
With ŷ* = 110 and a margin of error of t_{0.025,8} × se(ŷ*) = 2.306 × 4.95
= 11.4147, the 95% confidence interval for the average quarterly sales of
the Sultan's Dine restaurants located near campus for fixed x* = 10 is
110 ± 11.4147
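The interval for the mean response can be reproduced in a few lines. The sketch below is plain Python with illustrative names and uses only quantities stated in the example (σ̂ = 13.829, x̄ = 14, ∑(xi − x̄)² = 568, t(0.025; 8) = 2.306).

```python
import math

n, x_bar, s_xx = 10, 14.0, 568.0
sigma_hat = 13.829
t_crit = 2.306                      # t(0.025; 8)

x_star = 10
y_star = 60 + 5 * x_star            # predicted mean sales at x* = 10

# Standard error of the estimated mean response at x*.
se_mean = sigma_hat * math.sqrt(1 / n + (x_star - x_bar) ** 2 / s_xx)
margin = t_crit * se_mean
ci = (y_star - margin, y_star + margin)

print(f"se = {se_mean:.2f}, CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

The margin computed without intermediate rounding differs slightly in the third significant digit from the hand calculation, which rounds the standard error to 4.95 before multiplying.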
For Sultan's Dine restaurants Sales Dataset given in Example 3.1, the
estimated standard deviation corresponding to the prediction of quarterly
sales for a new restaurant located near a campus with 10,000
students is computed as follows:
\[ s_{pred} = 13.829 \sqrt{1 + \frac{1}{10} + \frac{(10 - 14)^2}{568}} = 13.829 \sqrt{1.1282} = 14.69 \]
The 95% prediction interval for quarterly sales for the Sultan's Dine
restaurants located near campus uses t_{α/2,(n−2)} = t_{0.025,8} = 2.306. Thus, with
ŷ* = 110 and a margin of error of t_{0.025,8} × s_pred = 2.306 × 14.69 = 33.875,
the 95% prediction interval is
110 ± 33.875
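The only difference between the prediction interval and the confidence interval for the mean is the extra 1 under the square root. The sketch below makes that explicit; half_width is a hypothetical helper, and its default arguments are the summary values of the restaurant example.

```python
import math

def half_width(x_star, new_observation, n=10, x_bar=14.0, s_xx=568.0,
               sigma_hat=13.829, t_crit=2.306):
    """Margin of error at x_star.

    new_observation=True  -> prediction interval for a single new y
    new_observation=False -> confidence interval for the mean of y
    """
    extra = 1.0 if new_observation else 0.0
    se = sigma_hat * math.sqrt(extra + 1 / n + (x_star - x_bar) ** 2 / s_xx)
    return t_crit * se

y_star = 60 + 5 * 10
pi_margin = half_width(10, new_observation=True)
ci_margin = half_width(10, new_observation=False)

print(f"prediction: {y_star} +/- {pi_margin:.2f}")
print(f"mean:       {y_star} +/- {ci_margin:.2f}")
```

The prediction interval is always wider because it accounts for the variability of an individual observation in addition to the uncertainty in the estimated line.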
Chapter 2: Regression Analysis: Simple Linear Regression Real Data Example: Obstetrics Dataset
Table 6: Sample data from the Greene-Touchstone study relating birthweight and
estriol level in pregnant women near term
Figure 6: Data from the Greene-Touchstone study relating birthweight and estriol
level in pregnant women near term
E(y ∣x) = β0 + β1 x
Let's assume e follows a normal distribution, with mean 0 and variance σ².
The full linear-regression model then takes the following form:
y = β0 + β1 x + e
\[ \sum_{i=1}^{31} x_i = 534, \quad \sum_{i=1}^{31} x_i^2 = 9876, \quad \sum_{i=1}^{31} y_i = 992, \quad \sum_{i=1}^{31} x_i y_i = 17500 \]
For computing the slope and intercept of the regression line, we consider
the least squares estimators of β0 and β1 . These are
\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = 0.608 \quad \text{and} \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 21.52 \]
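Since only the four summary sums are given, the estimates can be reproduced without the raw data. The following plain-Python sketch (variable names are illustrative) plugs the sums for the n = 31 observations into the least squares formulas.

```python
n = 31
sum_x, sum_x2 = 534, 9876
sum_y, sum_xy = 992, 17500

x_bar = sum_x / n
y_bar = sum_y / n

beta1_hat = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

print(f"x_bar = {x_bar:.2f}, y_bar = {y_bar:.0f}")
print(f"beta1_hat = {beta1_hat:.3f}, beta0_hat = {beta0_hat:.2f}")
```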
ŷ = 21.52 + 0.608x
This regression line is shown in Figure 6. The slope of 0.608 tells us that
the predicted y increases by about 0.61 units per 1 mg/24 hr. Since y is
birthweight in units of 100 g, the predicted birthweight increases by 61 g
for every 1 mg/24 hr increase in estriol.
25 = 21.52 + 0.608x
or
x = (25 − 21.52)/0.608 = 5.72
Thus, if a woman has an estriol level of 5.72 mg/24 hr, then the
predicted birthweight is 2500 g. Furthermore, the predicted infant
birthweight for all women with estriol levels of ≤5 mg/24 hr is
< 2500g (assuming estriol can only be measured in increments of 1
mg/24 hr). This level could serve as a critical value for identifying
high-risk women and trying to prolong their pregnancies.
▸ ∑_{i=1}^{31} xi = 534, x̄ = 17.23
▸ Standard error of β̂0 : se(β̂0 ) = 2.62
▸ 95% confidence interval for β0 : 21.52 ± t29,.975 (2.62) = (16.16, 26.88)
These intervals are rather wide due to the small sample size.
Assess whether estriol level has a significant effect on birthweight; that is, test the
hypothesis H0 ∶ β1 = 0.
Thus,
Since the confidence interval (0.308, 0.908) does not contain the
value 0, we can reject the null hypothesis H0 ∶ β1 = 0 at the 0.05
significance level. This means that there is evidence to suggest that
the slope coefficient β1 is not equal to zero in the regression model.
Chapter 2: Regression Analysis: Simple Linear Regression Residual analysis
Residual Analysis
Residuals are the differences between the observed values of the dependent
variable and the values predicted by the regression model. Residual
analysis is a critical component of regression analysis as it helps to
determine whether the assumptions made about the regression model
appear to be valid.
Key Aspects of Residual Analysis:
\[ h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2}; \quad i = 1, \ldots, n \]
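The leverage formula depends only on the x values. The sketch below, in plain Python with illustrative names, evaluates hi for the ten student-population values of the restaurant example; in simple linear regression the leverages sum to 2, the number of estimated coefficients, which serves as a quick sanity check.

```python
x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
n = len(x)
x_bar = sum(x) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)

# Leverage of each observation: 1/n plus its squared distance from x_bar,
# relative to the total spread of x.
h = [1 / n + (xi - x_bar) ** 2 / s_xx for xi in x]

for xi, hi in zip(x, h):
    print(f"x = {xi:2d}  h = {hi:.4f}")
print("sum of leverages:", round(sum(h), 6))
```

Points far from x̄ (here x = 2 and x = 26) receive the largest leverage, so they have the greatest potential to pull the fitted line.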
Influential Observations
Influential observations are data points that have a large impact on the
estimated coefficients of the regression model. They can significantly alter
the fit of the model if removed. Influential observations are identified using
Cook's distance, where observations with Cook's distance greater than 4/n
(where n is the number of observations) are considered influential.
Outliers
Outliers are data points that deviate significantly from the rest of the
data. They can affect the regression model's accuracy and should be
investigated to determine if they are genuine data points or errors.
Outliers are identified by selecting observations with Cook's distance
greater than a certain threshold (here, 4/n).
Cook's Distance
where ŷj is the jth fitted value, ŷj(i) is the jth fitted value with the ith
observation removed, p is the number of predictors in the model, and MSE
is the mean squared error of the model.
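The leave-one-out idea behind Cook's distance can be sketched directly: refit the line without observation i and measure how far all fitted values move. The code below is a plain-Python illustration on the restaurant sales data, not a reproduction of the slide's formula. It assumes the common convention Di = ∑j (ŷj − ŷj(i))² / (k · MSE), where k is the number of estimated coefficients (intercept plus slope, so k = 2 here; if p counts predictors only, k = p + 1). The helper fit_line and all variable names are hypothetical.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for a simple linear regression."""
    m = len(xs)
    xb, yb = sum(xs) / m, sum(ys) / m
    slope = (sum((a - xb) * (b - yb) for a, b in zip(xs, ys))
             / sum((a - xb) ** 2 for a in xs))
    return yb - slope * xb, slope

x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]
n = len(x)
k = 2  # assumption: number of estimated coefficients (intercept + slope)

b0, b1 = fit_line(x, y)
fitted = [b0 + b1 * xi for xi in x]
mse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (n - k)

cooks = []
for i in range(n):
    # Refit without observation i, then compare fitted values at all x_j.
    a0, a1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    shift = sum((fj - (a0 + a1 * xj)) ** 2 for fj, xj in zip(fitted, x))
    cooks.append(shift / (k * mse))

flagged = [i for i, d in enumerate(cooks) if d > 4 / n]  # 4/n rule of thumb
print([round(d, 3) for d in cooks])
print("flagged indices:", flagged)
```

In practice the refits are not needed, because the same quantity can be obtained from the residuals and leverages of the full fit; the explicit loop is shown only because it follows the definition.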
Chapter 2: Regression Analysis: Simple Linear Regression R Code: Linear Regression Model
# Residual plot
plot(model$fitted.values, residuals,
xlab = "Fitted values",
ylab = "Residuals",
main = "Residual Plot")
abline(h = 0, col = "red", lty = 2) # Add a horizontal line at y = 0
# Histogram of residuals
hist(residuals,
main = "Histogram of Residuals",
xlab = "Residuals",
ylab = "Frequency",
col = "lightblue")
# QQ plot of residuals
qqnorm(residuals)
qqline(residuals)
title("QQ Plot of Residuals")
# Residual plot
plot(model, which = 1)
attach(data)
# Leverage points
leverage <- hatvalues(model)
# Influential observations
cook_dist <- cooks.distance(model)
# Outliers
outliers <- which(cook_dist > 4 / length(y))
print("Influential observations:")
print(which(cook_dist > 4 / length(y)))
print("Outliers:")
print(outliers)
Chapter 2: Regression Analysis: Simple Linear Regression Python Code: Linear Regression Model
# Residual plot
plt.scatter(model.predict(), residuals)
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.axhline(y=0, color='r', linestyle='--') # Add a horizontal line at y=0
plt.show()
# Histogram of residuals
plt.hist(residuals, bins=10)
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Histogram of Residuals')
plt.show()
# QQ plot of residuals
sm.qqplot(residuals, line='s') # 's': reference line scaled to the sample mean and standard deviation
plt.title('QQ Plot of Residuals')
plt.show()
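import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence
# Influence measures for the fitted OLS model
influence = OLSInfluence(model)
# Leverage points
leverage = influence.hat_matrix_diag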
# Influential observations
cook_dist = influence.cooks_distance[0]
print(cook_dist)
# Outliers
outliers = np.where(cook_dist > 4 / len(y))[0]
print("Influential observations:")
print(np.where(cook_dist > 4 / len(y))[0])
print("Outliers:")
print(outliers)
# Residual plot
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.scatter(model.fittedvalues, model.resid)
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.axhline(y=0, color='red', linestyle='--')
plt.tight_layout()
plt.show()
Chapter 3: Multiple Linear Regression Model
4.6 Example
4.7 F-test in Multiple Regression
4.8 ANOVA Table in Regression Analysis
4.9 t -tests in Multiple Regression
4.10 Real Data Example: Hypertension Dataset
4.11 Python Code: Hypertension Dataset
Chapter 3: Multiple Linear Regression Model Problems & Motivation
Suppose age (days), birthweight (oz), and SBP are measured for 16
infants and the data are as shown in Table 7. What is the relationship
between infant systolic blood pressure (SBP) and their age and
birthweight? Can we predict SBP based on these factors?
Table 7: Sample data for infant blood pressure, age, and birthweight for 16 infants
  i   Age (days)   Birthweight (oz)   SBP
  1       3              135           89
  2       4              120           90
  3       3              100           83
  4       2              105           77
  5       4              130           92
  6       5              125           98
  7       2              125           82
  8       3              105           85
  9       5              120           96
 10       4               90           95
 11       2              120           80
 12       3               95           79
 13       3              120           86
 14       4              150           97
 15       3              160           92
 16       3              125           88
Research Problem 3
Let's delve into the exploration of the dataset. We'll utilize the Boston
House Prices Dataset, consisting of 506 rows and 13 attributes,
including a target column.
How can we apply multiple regression to predict the price based on various
attributes? Let's take a quick look at the dataset.
Crim: Per capita crime rate by town
Zn: Proportion of residential land zoned for lots over 25,000 sq. ft.
Indus: Proportion of non-retail business acres per town
Chas: Charles River dummy variable (= 1 if tract bounds river; 0,
otherwise)
Python Code
where:
x1i , x2i , . . . , xpi are the independent variables for the i th observation.
Model Assumptions
Chapter 3: Multiple Linear Regression Model Estimation Procedure
$$\vec{Y} = X\vec{\beta} + \vec{e}$$
To find the estimated coefficients $\hat{\vec{\beta}}$ of $\vec{\beta}$ using ordinary least squares
(OLS), we minimize the sum of squared residuals $\vec{e}$:
$$\vec{e} = \vec{Y} - X\hat{\vec{\beta}}$$
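$$SSE = \vec{e}^{T}\vec{e} = (\vec{Y} - X\hat{\vec{\beta}})^{T}(\vec{Y} - X\hat{\vec{\beta}})$$
To minimize SSE, we take the derivative with respect to $\hat{\vec{\beta}}$ and set it to
zero:
$$\frac{\partial\, SSE}{\partial \hat{\vec{\beta}}} = -2X^{T}(\vec{Y} - X\hat{\vec{\beta}}) = 0$$
Solving for $\hat{\vec{\beta}}$, we get:
$$\hat{\vec{\beta}} = (X^{T}X)^{-1}X^{T}\vec{Y}$$
This equation provides the estimated coefficients $\hat{\vec{\beta}}$ that minimize the sum
of squared residuals.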
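The closed-form solution above can be sketched numerically. The snippet below is a minimal illustration, not the course's own code: the data are synthetic (the names `beta_true`, `n`, `p`, and the random seed are assumptions made only for this example), and NumPy is assumed to be available. It solves the normal equations and confirms that the gradient of SSE vanishes at the solution.

```python
import numpy as np

# Synthetic data, for illustration only (not taken from the course datasets).
rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + p regressors
beta_true = np.array([1.0, 2.0, -0.5])                      # hypothetical coefficients
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Normal equations: (X^T X) beta_hat = X^T y.
# Solving the linear system avoids forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Residuals and the gradient of SSE evaluated at beta_hat.
residuals = y - X @ beta_hat
gradient = -2 * X.T @ residuals

print("beta_hat:", beta_hat)
print("max |gradient|:", np.abs(gradient).max())
```

Using `np.linalg.solve` rather than `np.linalg.inv` is a common numerical choice; both correspond to the same formula when $X^{T}X$ is invertible.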
Chapter 3: Multiple Linear Regression Model Estimation Procedure of Error Variance
$$\hat{\sigma}^{2} = \frac{1}{n-p-1}\,(\vec{Y} - X\hat{\vec{\beta}})^{T}(\vec{Y} - X\hat{\vec{\beta}})$$
where:
$\hat{\vec{\beta}}$ is the vector of estimated coefficients obtained using ordinary least
squares (OLS).
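Coefficient of Determination ($R^2$)
The coefficient of determination ($R^2$) measures the proportion of the
variance in the dependent variable ($y$) that is explained by the
independent variables ($x_1, x_2, \ldots, x_p$) in the regression model.
$$R^{2} = 1 - \frac{SSE}{SST} = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$
where:
$y_i$ is the observed value, $\hat{y}_i$ is the fitted value, and $\bar{y}$ is the mean of the observed values.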
Interpretation:
Adjusted $R^2$
$$\bar{R}^{2} = 1 - (1 - R^{2})\,\frac{n-1}{n-p-1}$$
where $p$ is the total number of explanatory variables in the model
(not including the constant term), and $n$ is the sample size.
It can also be written as:
$$\bar{R}^{2} = 1 - \frac{SSE/df_e}{SST/df_t}$$
Hence,
$$\bar{R}^{2} = 1 - \frac{SSE/(n-p-1)}{SST/(n-1)}$$
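The two ways of writing adjusted $R^2$ can be checked against each other. The sketch below is illustrative only: the function name `r2_and_adjusted` and the small `y` / `y_hat` lists are hypothetical and do not come from the course data. It computes $R^2$ and both forms of $\bar{R}^2$ so that their agreement can be seen directly.

```python
def r2_and_adjusted(y, y_hat, p):
    """Return R^2 and adjusted R^2 computed in two equivalent ways.

    y      observed values
    y_hat  fitted values from some regression with p explanatory variables
    p      number of explanatory variables (constant term not counted)
    """
    n = len(y)
    y_bar = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual sum of squares
    sst = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
    r2 = 1 - sse / sst
    adj_from_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    adj_from_ss = 1 - (sse / (n - p - 1)) / (sst / (n - 1))
    return r2, adj_from_r2, adj_from_ss


# Hypothetical observed and fitted values, for illustration only.
y = [3.0, 5.0, 7.0, 10.0, 11.0, 14.0]
y_hat = [3.5, 4.5, 7.5, 9.5, 11.5, 13.5]

r2, adj_a, adj_b = r2_and_adjusted(y, y_hat, p=1)
print(r2, adj_a, adj_b)
```

Because the penalty factor $(n-1)/(n-p-1)$ exceeds 1 whenever $p \geq 1$, the adjusted value is below $R^2$ whenever $R^2 < 1$.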
Interpretation:
Chapter 3: Multiple Linear Regression Model Example
For the given dataset, we have:
$$X = \begin{bmatrix}
1 & 9 & 16 \\
1 & 13 & 14 \\
1 & 11 & 10 \\
1 & 11 & 8 \\
1 & 14 & 11 \\
1 & 15 & 17 \\
1 & 16 & 9 \\
1 & 20 & 16 \\
1 & 15 & 12 \\
1 & 15 & 12
\end{bmatrix};
\qquad
\vec{Y} = \begin{bmatrix} 10 \\ 12 \\ \vdots \\ 28 \end{bmatrix}.$$
Now, let's calculate $X^{T}X$, $(X^{T}X)^{-1}$, $X^{T}\vec{Y}$, and $\hat{\vec{\beta}}$.
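Calculation of $X^{T}X$
$$X^{T}X =
\begin{bmatrix}
1 & 1 & 1 & \cdots & 1 \\
9 & 13 & 11 & \cdots & 15 \\
16 & 14 & 10 & \cdots & 12
\end{bmatrix}
\begin{bmatrix}
1 & 9 & 16 \\
1 & 13 & 14 \\
1 & 11 & 10 \\
\vdots & \vdots & \vdots \\
1 & 15 & 12
\end{bmatrix}
=
\begin{bmatrix}
10 & 139 & 125 \\
139 & 2019 & 1757 \\
125 & 1757 & 1651
\end{bmatrix}$$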
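The product $X^{T}X$ can be reproduced with a few lines of NumPy. This is a minimal sketch, assuming NumPy is available; the design matrix is the ten-row $X$ listed in the example, and the variable names are chosen here for illustration.

```python
import numpy as np

# Design matrix from the example: intercept column, x1, x2 (10 observations).
X = np.array([
    [1,  9, 16],
    [1, 13, 14],
    [1, 11, 10],
    [1, 11,  8],
    [1, 14, 11],
    [1, 15, 17],
    [1, 16,  9],
    [1, 20, 16],
    [1, 15, 12],
    [1, 15, 12],
])

# Entry (0, 0) is n, the first row/column holds the column sums,
# and the remaining entries are sums of squares and cross-products.
XtX = X.T @ X
print(XtX)
```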
Calculation of $\hat{\vec{\beta}}$
$$\hat{\vec{\beta}} = (X^{T}X)^{-1}X^{T}\vec{Y} =
\begin{bmatrix} 2.821 \\ 1.591 \\ -0.475 \end{bmatrix}$$
$$\hat{\sigma}^{2} = \frac{1}{n-p-1}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \frac{119.5229}{10-2-1} = 17.0747$$
$$R^{2} = 1 - \frac{SSE}{SST} = 1 - \frac{119.5229}{330} = 0.6378$$
$$\bar{R}^{2} = 1 - \frac{(1 - R^{2})(n-1)}{n-p-1} = 1 - \frac{(1 - 0.6378)(10-1)}{10-2-1} = 0.5343$$
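The arithmetic in this example can be re-checked in a few lines. The sketch below uses only the quantities quoted in the example (SSE = 119.5229, SST = 330, n = 10, p = 2); the variable names are chosen here for illustration.

```python
# Quantities quoted in the worked example.
sse = 119.5229   # residual sum of squares
sst = 330.0      # total sum of squares
n, p = 10, 2     # sample size and number of explanatory variables

sigma2_hat = sse / (n - p - 1)                   # error variance estimate
r2 = 1 - sse / sst                               # coefficient of determination
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)    # adjusted R^2

print(round(sigma2_hat, 4))  # 17.0747
print(round(r2, 4))          # 0.6378
print(round(adj_r2, 4))      # 0.5343
```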
Chapter 3: Multiple Linear Regression Model F-test in Multiple Regression
where $\hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_p$ are the coefficients of the independent variables.
Chapter 3: Multiple Linear Regression Model ANOVA Table in Regression Analysis
Source of Variation   SS    df       MS                   F
Regression            SSR   p        MSR = SSR/p          F = MSR/MSE
Residual (Error)      SSE   n−p−1    MSE = SSE/(n−p−1)
Total                 SST   n−1
Classical Approach
p-value Approach
             df   SS         MS         F        Significance F
Regression   2    210.4651   105.2326   6.1625   0.0286
Residual     7    119.5349   17.0764
Total        9    330
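The entries of this ANOVA table can be reproduced from its SS and df columns. The sketch below is a hedged illustration: the variable names are chosen here, and the p-value uses a closed-form expression for the upper tail of the F distribution that holds only when the numerator degrees of freedom equal 2, as they do in this table. For general degrees of freedom a library routine such as `scipy.stats.f.sf(f_stat, df_reg, df_res)` would be the usual choice.

```python
# Sums of squares and degrees of freedom taken from the ANOVA table.
ssr, sse = 210.4651, 119.5349
df_reg, df_res = 2, 7

msr = ssr / df_reg      # mean square for regression
mse = sse / df_res      # mean square error
f_stat = msr / mse      # F statistic

# Upper-tail probability P(F > f_stat) for an F(2, df_res) distribution.
# This closed form is valid only because the numerator df is 2.
p_value = (1 + df_reg * f_stat / df_res) ** (-df_res / 2)

print(f"MSR = {msr:.4f}, MSE = {mse:.4f}")
print(f"F = {f_stat:.4f}, p-value = {p_value:.4f}")
```

With a p-value of about 0.0286, the overall regression is significant at the 5% level but not at the 1% level.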
Chapter 3: Multiple Linear Regression Model The t -tests in Multiple Regression
$$t = \frac{\hat{\beta}_i}{SE(\hat{\beta}_i)}$$
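The t statistic can be sketched for the earlier worked example. This snippet is an illustration under stated assumptions, not the course's own computation: it assumes the standard OLS result that the standard errors are the square roots of the diagonal of $\hat{\sigma}^{2}(X^{T}X)^{-1}$, and it reuses the design matrix, the coefficient vector (2.821, 1.591, −0.475), and $\hat{\sigma}^{2} = 17.0747$ quoted in that example. Because the inputs are rounded, the outputs are approximate.

```python
import numpy as np

# Design matrix from the worked example: intercept, x1, x2.
X = np.array([
    [1,  9, 16], [1, 13, 14], [1, 11, 10], [1, 11,  8], [1, 14, 11],
    [1, 15, 17], [1, 16,  9], [1, 20, 16], [1, 15, 12], [1, 15, 12],
], dtype=float)

beta_hat = np.array([2.821, 1.591, -0.475])  # rounded estimates from the example
sigma2_hat = 17.0747                         # error variance estimate from the example

# Assumed standard result: Var(beta_hat) = sigma^2 (X^T X)^{-1}.
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)
se = np.sqrt(np.diag(cov_beta))              # standard errors SE(beta_hat_i)

t_stats = beta_hat / se                      # one t statistic per coefficient
for name, b, s, t in zip(["intercept", "x1", "x2"], beta_hat, se, t_stats):
    print(f"{name}: estimate = {b:.3f}, SE = {s:.3f}, t = {t:.3f}")
```

Each statistic would then be compared with a t distribution on $n - p - 1 = 7$ degrees of freedom.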
Classical Approach
p-value Approach
Chapter 3: Multiple Linear Regression Model Real Data Example: Hypertension Dataset
Example 4.1
$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + e_i; \quad i = 1, 2, \ldots, 16$$
where
Chapter 3: Multiple Linear Regression Model Python Code: Hypertension Dataset
Chapter 3: Multiple Linear Regression Model R Code: Hypertension Dataset
Example 4.2
Crim: Per capita crime rate by town
Zn: Proportion of residential land zoned for lots over 25,000 sq. ft.
Indus: Proportion of non-retail business acres per town
Chas: Charles River dummy variable (= 1 if tract bounds river; 0,
otherwise)
## Alternatively