
MATH6703: Applied Regression Analysis 1

Dr. Md. Rezaul Karim


PhD(KU Leuven & UHasselt), MS(Biostatistics, UHasselt), MS(Statistics, JU)
Professor, Department of Statistics and Data Science
Jahangirnagar University (JU), Savar, Dhaka - 1342, Bangladesh
Mobile: 01912605556, Email: [email protected]

MS in Mathematics - 2024

1 These course slides should not be reproduced or used by others without permission.
Prof. Dr. Md. Rezaul Karim, Department of Statistics and Data Science, JU. MS in Mathematics - 2024 1 / 248
Introduction
1 Introduction

1.1 Learning Outcomes of the Course

1.2 Text and Reference List


Learning Outcomes of the Course

1 Understand fundamental concepts of correlation and regression analysis.

2 Learn to build simple and multiple regression models.

3 Understand assumptions and diagnostics of correlation and regression.

4 Explore practical applications in various fields.

5 Learn advanced regression models such as logistic regression, Poisson regression, polynomial regression, the Generalized Linear Model (GLM), Generalized Additive Models (GAM), mixed-effect models, random-effect models, etc.

6 Develop critical thinking skills in result interpretation.

7 Gain proficiency in statistical software for analysis.

Text and Reference List


Text Book

1 Michael Kutner, Christopher Nachtsheim, John Neter, William Li (2004): Applied Linear Statistical Models. 5th edition. New York: McGraw-Hill/Irwin.

2 Gujarati, Damodar N. and Dawn C. Porter (2012): Basic Econometrics. 5th edition. McGraw-Hill.

Reference list

1 Michael H. Kutner, Chris Nachtsheim, John Neter (2004): Applied Linear Regression Models. 5th edition. New York: McGraw-Hill/Irwin.

2 Greene, W. H. (2003): Econometric Analysis. Pearson Education.

Lecture Outline I

1 Introduction

1.1 Learning Outcomes of the Course

1.2 Text and Reference List


2 Chapter 1: Correlation and Association Analysis

2.1 Variable

2.2 Level or Scales of Measurement

2.3 Summarizing bivariate data

2.4 Scatter Diagram

2.5 Covariance
Lecture Outline II

2.6 Correlation Analysis

2.7 Correlation Coefficient

2.8 R Code: Correlation Matrix

2.9 Python Code: Correlation Matrix

2.10 Rank Correlation

2.11 R Code: Rank Correlation

2.12 Python Code: Rank Correlation

2.13 Kendall Tau Correlation Coefficient

2.14 Point-Biserial Correlation Coefficient

2.15 Phi Coefficient

Lecture Outline III


2.16 Cramér's V

2.17 The Kappa Statistic

2.18 Python Code: The Kappa Statistic

2.19 Partial Correlation

2.20 Python Code: The Partial Correlation


2.21 Multiple Correlation

2.22 Python Code: The Multiple Correlation

2.23 Chapter Exercises

3 Chapter 2: Regression Analysis: Simple Linear Regression

Lecture Outline IV

3.1 Problem & Motivation

3.2 Functional vs. Statistical Relation

3.3 Regression Analysis

3.4 Historical Origin of the Term Regression

3.5 Types of parametric regression analysis

3.6 A Probabilistic View of Linear Regression

3.7 Parameter Estimation

3.8 The Estimated Error Variance or Standard Error

3.9 Coefficient of Determination

3.10 Interval Estimation and Hypothesis Testing

Lecture Outline V
3.11 Real Data Example: Obstetrics Dataset

3.12 Residual analysis

3.13 R Code: Linear Regression Model

3.14 Python Code: Linear Regression Model


4 Chapter 3: Multiple Linear Regression Model

4.1 Problems & Motivation

4.2 Estimation Procedure

4.3 Estimation Procedure of Error Variance

4.4 Coefficient of Determination

4.5 Adjusted R²
Lecture Outline VI
4.6 Example
4.7 F-test in Multiple Regression
4.8 ANOVA Table in Regression Analysis
4.9 t-tests in Multiple Regression
4.10 Real Data Example: Hypertension Dataset
4.11 Python Code: Hypertension Dataset
4.12 R Code: Hypertension Dataset

Chapter 1: Correlation and Association Analysis


2 Chapter 1: Correlation and Association Analysis

2.1 Variable

2.2 Level or Scales of Measurement

2.3 Summarizing bivariate data

2.4 Scatter Diagram

2.5 Covariance

2.6 Correlation Analysis

2.7 Correlation Coefficient

2.8 R Code: Correlation Matrix

2.9 Python Code: Correlation Matrix

2.10 Rank Correlation

2.11 R Code: Rank Correlation


Statistical methods

can be used to summarize or describe a collection of data; this is called descriptive statistics

can be used to make predictions (inferences) from that data; this is called inferential statistics

Parameter & Statistic

a measurable characteristic of a population ⇒ parameter

a measurable characteristic of a sample ⇒ statistic
Variable & its types

Variable
is an attribute or characteristic of interest that varies from respondent to respondent

it is also called a feature or factor. In the context of statistics and data analysis, variables can be divided into two main types:

▸ independent variables
▸ dependent variables

Example: Suppose we are conducting a study to investigate the effect of studying time on exam scores. In this study, "studying time" is the independent variable because it is manipulated by the students themselves, and "exam scores" is the dependent variable because it is measured in response to changes in studying time. Both studying time and exam scores are examples of variables.

Types of Variable
. qualitative variable (also known as categorical variable)
. quantitative variable (also known as numerical variable)
▸ discrete variable
▸ continuous variable

Level or Scales of Measurement

Summarizing bivariate data


Tabular method: crosstabulation

▸ A tabular summary of data for two variables. The classes for one variable are represented by the rows; the classes for the other variable are represented by the columns.

Graphical method: scatter diagram

Example: Test Performances of MAT 101


ID Gender Test Performance Study Hour Score of STAT
1 Male Good 10 71
2 Female Good 11 75
3 Male Excellent 14 85
4 Female Excellent 10 90
5 Male Poor 8 50
6 Female Excellent 13 88
7 Male Poor 6 45
8 Female Excellent 15 80
9 Male Good 14 65
10 Female Good 10 82
11 Male Excellent 14 92
12 Female Poor 8 55
13 Male Poor 5 40
14 Male Good 10 68
15 Female Good 9 62

Table 1: The Crosstabulation of Gender and Test Performance.

Gender Test Performance Total


Poor Good Excellent
Male 3 3 2 8
Female 1 3 3 7
Total 4 6 5 15
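Table 1 can be reproduced with pandas; the sketch below re-enters the Gender and Test Performance columns of the MAT 101 data by hand (the variable names are illustrative, not part of the original slide).

```python
import pandas as pd

# Gender and Test Performance for the 15 students in the MAT 101 example
gender = ["Male", "Female", "Male", "Female", "Male", "Female", "Male",
          "Female", "Male", "Female", "Male", "Female", "Male", "Male", "Female"]
performance = ["Good", "Good", "Excellent", "Excellent", "Poor", "Excellent",
               "Poor", "Excellent", "Good", "Good", "Excellent", "Poor",
               "Poor", "Good", "Good"]

# Crosstabulation with row and column totals (margins)
table = pd.crosstab(pd.Series(gender, name="Gender"),
                    pd.Series(performance, name="Test Performance"),
                    margins=True, margins_name="Total")
print(table)
```

The `margins=True` option adds the row and column totals shown in Table 1 (e.g., 8 males, 7 females, 15 in total).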

Scatter Diagram
a scatter diagram is a graphic tool used to portray the relationship between two variables

the dependent variable is scaled on the Y-axis and is the variable being estimated

the independent variable is scaled on the X-axis and is the variable used as the predictor

Covariance is a statistical measure that quantifies the degree to which two random variables vary together. In other words, it assesses the relationship between two variables, indicating whether they tend to move in the same direction (positive covariance) or opposite directions (negative covariance). A positive covariance suggests that when one variable increases, the other variable tends to increase as well, while a negative covariance indicates that when one variable increases, the other tends to decrease. Sample covariance is denoted by sxy and calculated by the following formula

sxy = ∑(xi − x̄)(yi − ȳ) / (n − 1)

Example: Sample Covariance


Using a dataset containing the number of commercial advertisements aired and the corresponding sales volume for a product or service over a specific time period:

Table 2: Sample Data for the San Francisco Electronics Store

Week Number of commercials (X) Sales volume ($100s) (Y)

1 2 50
2 5 57
3 1 41
4 3 54
5 4 54
6 1 38
7 5 63
8 3 48
9 4 59
10 2 46
(i). Draw a scatter diagram of the data points representing the relationship between the number of commercial advertisements and the sales volume.

(ii). Based on the scatter diagram, describe the observed trend or pattern in the data.

(iii). Calculate the sample covariance between the number of commercial advertisements and the sales volume to quantitatively assess the degree of association between these two variables.

(iv). Interpret the sample covariance value in terms of the strength and direction of the relationship between the number of commercial advertisements and the sales volume.
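For part (iii), the sample covariance formula can be applied directly; the sketch below re-enters the Table 2 data by hand (the variable names are illustrative).

```python
# Number of commercials (X) and sales volume in $100s (Y) from Table 2
x = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
y = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sample covariance: sum of cross-deviations divided by n - 1
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
print(s_xy)  # 11.0
```

The positive value (sxy = 11) answers part (iv): sales volume tends to rise as more commercials are aired.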

Sample Covariance
what is the nature of the relationship between x and y?


sample covariance:

sxy = ∑(xi − x̄)(yi − ȳ) / (n − 1)
Correlation Analysis
Correlation analysis is a group of techniques to measure the strength and direction of the relationship between two variables. There are different types of correlation coefficients that can be used depending on the nature of the variables being analyzed and the assumptions of the data. Some of the common types of correlation coefficients include:

1 Pearson correlation coefficient: Measures linear relationship between two continuous variables.

2 Spearman rank correlation coefficient: Measures association between ranked variables.

3 Kendall tau correlation coefficient: Measures similarity of orderings of data pairs.

4 Point-biserial correlation coefficient: Measures association between continuous and binary variables.

5 Phi coefficient: Measures association between two binary variables.

6 Cramér's V: Measures association between two nominal variables.
Pearson's Correlation Coefficient

is a measure of the strength of the linear relationship between two variables

is denoted by r and defined as

r = ∑(xi − x̄)(yi − ȳ) / ( √(∑(xi − x̄)²) · √(∑(yi − ȳ)²) ) = ∑(xi − x̄)(yi − ȳ) / ((n − 1) sx sy) = sxy / (sx sy)

where all sums run over i = 1, …, n, and sx = √( ∑(xi − x̄)² / (n − 1) ) (the sample standard deviation); and analogously for sy

Working Formula

this correlation coefficient is called Pearson's correlation coefficient, which is also defined as

rxy = ( ∑ xi yi − n x̄ ȳ ) / ( √(∑ xi² − n x̄²) · √(∑ yi² − n ȳ²) )

where all sums run over i = 1, …, n, and sx = √( ∑(xi − x̄)² / (n − 1) ) (the sample standard deviation); and analogously for sy

Correlation Coefficient

the following drawing summarizes the strength and direction of the correlation coefficient
Characteristics of correlation coefficient

1 the sample correlation coefficient is identified by the lowercase letter r

2 it shows the direction and strength of the linear relationship between two interval- or ratio-scale variables

3 it ranges from −1 up to and including +1

4 a value near 0 indicates there is little linear relationship between the variables

5 a value near 1 indicates a direct or positive linear relationship between the variables

6 a value near −1 indicates an inverse or negative linear relationship between the variables

Exercises
1 The following sample of observations were randomly selected.
x 4 5 3 6 10
y 4 6 5 7 7

(a). Draw a scatter diagram.

(b). Determine the correlation coefficient and interpret the relationship between x and y.

(c). Interpret these statistical measures.

Solution (b)
x y x² y² xy
4 4 16 16 16
5 6 25 36 30
3 5 9 25 15
6 7 36 49 42
10 7 100 49 70
∑xi = 28 ∑yi = 29 ∑xi² = 186 ∑yi² = 175 ∑xiyi = 173


x̄ = 5.6, ȳ = 5.8

r = ( ∑ xi yi − n x̄ ȳ ) / ( √(∑ xi² − n x̄²) · √(∑ yi² − n ȳ²) ) = (173 − 5 × 5.6 × 5.8) / ( √(186 − 5 × (5.6)²) · √(175 − 5 × (5.8)²) ) = 0.7522
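As a check, the working formula can be evaluated programmatically; this sketch reproduces the hand computation for the exercise data.

```python
from math import sqrt

# Exercise data
x = [4, 5, 3, 6, 10]
y = [4, 6, 5, 7, 7]

n = len(x)
x_bar = sum(x) / n  # 5.6
y_bar = sum(y) / n  # 5.8

# Working formula: r = (Σxy − n·x̄·ȳ) / (√(Σx² − n·x̄²) · √(Σy² − n·ȳ²))
num = sum(a * b for a, b in zip(x, y)) - n * x_bar * y_bar
den = sqrt(sum(a * a for a in x) - n * x_bar ** 2) * \
      sqrt(sum(b * b for b in y) - n * y_bar ** 2)
r = num / den
print(round(r, 4))  # 0.7522
```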

Exercises

2 The owner of Maumee Ford-Volvo wants to study the relationship between the age of a car and its selling price. Listed below is a random sample of 12 used cars sold at the dealership during the last year.

Car Age (years) Selling Price ($000) Car Age (years) Selling Price ($000)
1 9 8.1 7 8 7.6
2 7 6.0 8 11 8.0
3 11 3.6 9 10 8.0
4 12 4.0 10 12 6.0
5 8 5.0 11 6 8.6
6 7 10.0 12 6 8.0

(a). Draw a scatter diagram.
(b). Determine the correlation coefficient.
(c). Interpret the correlation coefficient. Does it surprise you that the correlation coefficient is negative?

Testing the Significance of the Correlation Coefficient

1 hypothesis

H0 ∶ ρ = 0 (the correlation in the population is 0)

H1 ∶ ρ ≠ 0 (the correlation in the population is not 0)

2 level of significance α

3 reject H0 if T > tα/2,n−2 or T < −tα/2,n−2

4 test statistic

T = r √(n − 2) / √(1 − r²) ∼ t distribution with n − 2 degrees of freedom

5 decision
in general, if the null hypothesis is

H0 ∶ ρ = 0

and the null hypothesis is true, the test statistic T follows the Student's t distribution with (n − 2) degrees of freedom, i.e., T ∼ t(n − 2)

Table 3: decision rule for the test of hypothesis H0 ∶ ρ = 0

Alternative hypothesis    Reject H0 if
H1 ∶ ρ < 0                T < −tα,n−2
H1 ∶ ρ > 0                T > tα,n−2
H1 ∶ ρ ≠ 0                T > tα/2,n−2 or T < −tα/2,n−2
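The steps above can be sketched with scipy; here they are applied to the earlier exercise data, where r ≈ 0.7522 and n = 5 (a sketch, not part of the original slides).

```python
from math import sqrt
from scipy import stats

# From the earlier exercise: r = 0.7522 with n = 5 observations
r, n = 0.7522, 5

# Test statistic: T = r·√(n−2) / √(1−r²), with n−2 degrees of freedom
T = r * sqrt(n - 2) / sqrt(1 - r ** 2)
p_value = 2 * stats.t.sf(abs(T), df=n - 2)  # two-sided p-value

t_crit = stats.t.ppf(0.975, df=n - 2)       # t_{α/2, n−2} for α = 0.05
print(T, t_crit, p_value)
```

Here T ≈ 1.98 is smaller than the critical value t₀.₀₂₅,₃ ≈ 3.18, so with only five observations we fail to reject H0 at the 5% level despite the sizable sample correlation.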

Correlation Coefficient - Example

The computed t (3.297) is within the rejection region; therefore, we reject H0. This means the correlation in the population is not zero. From a practical standpoint, it indicates to the sales manager that there is correlation between the number of sales calls made and the number of copiers sold in the population of salespeople.

R Code: Correlation Matrix
# Example data (you should replace this with your actual data)
data <- data.frame(
x1 = c(1, 2, 3, 4, 5),
x2 = c(2, 3, 4, 5, 6),
x3 = c(3, 4, 5, 6, 7)
)
# Compute correlation matrix
correlation_matrix <- cor(data)
print(correlation_matrix)
# Compute correlation matrix with 2 decimal places
correlation_matrix <- round(cor(data), 2)

# View the correlation matrix
print(correlation_matrix)

Python Code: Correlation Matrix


import pandas as pd

# Example data (you should replace this with your actual data)
data = pd.DataFrame({
'x1': [1, 2, 3, 4, 5],
'x2': [2, 3, 4, 5, 6],
'x3': [3, 4, 5, 6, 7]
})

# Compute correlation matrix
correlation_matrix = data.corr()

print(correlation_matrix)
# Compute correlation matrix with 2 decimal places
correlation_matrix = data.corr().round(2)

# View the correlation matrix
print(correlation_matrix)

Spearman Rank Correlation

measures the relationship between rankings of different ordinal variables or different rankings of the same variable, where a "ranking" is the assignment of the ordering labels "first", "second", "third", etc. to different observations of a particular variable

If ri, si are the ranks of the i-th member according to the x- and y-quality respectively, then the rank correlation coefficient is

rR = 1 − 6 ∑ di² / (n(n² − 1))

where n is the number of data points of the two variables and di is the difference in the ranks of the i-th element of each random variable considered, i.e. di = ri − si

the Spearman correlation coefficient, rR, can take values from +1 to −1
Question (Q1)

Based on the following data, find the rank correlation between marks of English and Mathematics courses.

English 56 75 45 71 62 64 58 80 76 61
Maths 66 70 40 60 65 56 59 77 67 63

Solution: The procedure for ranking these scores is as follows:

English (mark) Maths (mark) Rank (English) Rank (Maths) di di²

56 66 9 4 5 25
75 70 3 2 1 1
45 40 10 10 0 0
71 60 4 7 -3 9
62 65 6 5 1 1
64 56 5 9 -4 16
58 59 8 8 0 0
80 77 1 1 0 0
76 67 2 3 -1 1
61 63 7 6 1 1

. the realized value of the Spearman rank correlation is

rR = 1 − 6 ∑ di² / (n(n² − 1)) = 1 − (6 × 54) / (10(10² − 1)) = 0.6727

This indicates a strong positive relationship between the ranks individuals obtained in the Maths and English exams. That is, the higher you ranked in Maths, the higher you ranked in English also, and vice versa.
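The hand calculation above can be verified with scipy's spearmanr(); since there are no tied ranks in this data, the 1 − 6∑di²/(n(n² − 1)) formula is exact and the two results agree.

```python
from scipy.stats import spearmanr

# Marks from the Q1 example
english = [56, 75, 45, 71, 62, 64, 58, 80, 76, 61]
maths   = [66, 70, 40, 60, 65, 56, 59, 77, 67, 63]

# Spearman rank correlation (no ties, so this matches the formula exactly)
rho, p_value = spearmanr(english, maths)
print(round(rho, 4))  # 0.6727
```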

Equal Ranks or Ties in Ranks

for tied observations, the rank correlation can be computed by adding m(m² − 1)/12 to the value of ∑ di², where m stands for the number of items whose ranks are equal

if there is more than one such group of items with common rank, this value is added as many times as the number of such groups

then the formula for the rank correlation is

rR = 1 − 6{ ∑ di² + m1(m1² − 1)/12 + m2(m2² − 1)/12 + ⋯ } / (n(n² − 1))

Question (Q2)

Based on the following data, find the rank correlation between marks of English and Mathematics courses.

English 56 75 45 71 61 64 58 80 76 61
Maths 70 70 40 60 65 56 59 70 67 80

Solution: The procedure for ranking these scores is as follows:

English (mark) Maths (mark) Rank (English) Rank (Maths) di di²

56 70 9 3 6 36
75 70 3 3 0 0
45 40 10 10 0 0
71 60 4 7 -3 9
61 65 6.5 6 0.5 0.25
64 56 5 9 -4 16
58 59 8 8 0 0
80 70 1 3 -2 4
76 67 2 5 -3 9
61 80 6.5 1 5.5 30.25

the mark 61 is repeated 2 times in series X (in English) and hence m1 = 2

in series Y (in Maths), the mark 70 occurs 3 times and hence m2 = 3

so the rank correlation is

rR = 1 − 6{ 104.5 + 2(2² − 1)/12 + 3(3² − 1)/12 } / (10(10² − 1)) = 0.3515
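The tie-corrected formula can be checked numerically; this sketch plugs in the values derived above. (Note that scipy's spearmanr() handles ties differently, by computing Pearson's r on the midranks, so it may give a slightly different value than this textbook adjustment.)

```python
# Tie-corrected Spearman formula from the slide:
# rR = 1 − 6{ Σdi² + Σ m(m² − 1)/12 } / (n(n² − 1))
n = 10
sum_d2 = 104.5
tie_groups = [2, 3]  # m1 = 2 (mark 61 in English), m2 = 3 (mark 70 in Maths)

correction = sum(m * (m ** 2 - 1) / 12 for m in tie_groups)
r_R = 1 - 6 * (sum_d2 + correction) / (n * (n ** 2 - 1))
print(round(r_R, 4))  # 0.3515
```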

Correlation is not causation!

Correlation is not causation!

a study showed that ice cream sales are correlated with homicides in New York

▸ as the sales of ice cream rise and fall, so do the number of homicides. Does the consumption of ice cream cause the death of people?
▸ No: that two things are correlated doesn't mean one causes the other

Consider underlying factors before conclusion

Don't conclude too fast!

there is no causal relationship between ice cream and the rate of homicide; sunny weather brings both factors together

and yes, ice cream sales and homicide each have a causal relationship with weather

R Code: Rank Correlation


# Example data (you should replace this with your actual data)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 1, 5, 4)

# Calculate Spearman rank correlation coefficient
correlation <- cor(x, y, method = "spearman")
print(correlation)

# Alternatively, if you have data in a data frame, you can use
# the cor() function directly on the data frame:
# Example data frame (you should replace this with your actual data)
data <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2, 3, 1, 5, 4)
)

# Calculate Spearman rank correlation coefficient
correlation <- cor(data, method = "spearman")
print(correlation)

Python Code: Rank Correlation


In Python, you can find the rank correlation using the scipy.stats module, which provides a function called spearmanr() to calculate the Spearman rank correlation coefficient. Here's how you can do it:

from scipy.stats import spearmanr

# Example data (you should replace this with your actual data)
x = [1, 2, 3, 4, 5]
y = [2, 3, 1, 5, 4]

# Calculate Spearman rank correlation coefficient
rho, p_value = spearmanr(x, y)
print("Spearman rank correlation coefficient:", rho)
print("p-value:", p_value)

Python Code: Rank Correlation


If you have two columns of data in a pandas DataFrame, you can calculate the rank correlation directly from the DataFrame. Here's an example:

import pandas as pd
from scipy.stats import spearmanr

# Example DataFrame (you should replace this with your actual data)
data = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'y': [2, 3, 1, 5, 4]
})

# DataFrame.corr() returns only the coefficient matrix (no p-value)
rho = data.corr(method='spearman').iloc[0, 1]
print("Spearman rank correlation coefficient:", rho)

# For a p-value, pass the two columns to scipy's spearmanr()
rho, p_value = spearmanr(data['x'], data['y'])
print("Spearman rank correlation coefficient:", rho)
print("p-value:", p_value)

Kendall Tau Correlation Coefficient

Definition

The Kendall tau correlation coefficient, denoted by τ, measures the similarity of the orderings of data pairs.

Formula:

τ = (nc − nd) / ( ½ n(n − 1) )

where nc is the number of concordant pairs and nd is the number of discordant pairs.

Suitable for ordinal data.

Useful when dealing with tied ranks.

Example: Tau correlation coefficient

Suppose we have the following data on two variables, X and Y, with their
corresponding ranks:

Observation Rank
X Y
10 15
15 10
20 20
25 25
30 40

Using this data, we can calculate the Kendall Tau correlation coefficient as follows:

Concordant Pairs: In the context of correlation coefficients such as Kendall Tau, concordant pairs refer to pairs of observations where the ranks for both variables follow the same order. In other words, if (Xi, Yi) and (Xj, Yj) are two pairs of observations, they are considered concordant if both Xi < Xj and Yi < Yj or if both Xi > Xj and Yi > Yj.

Discordant Pairs: Discordant pairs, on the other hand, refer to pairs of observations where the ranks for the variables have opposite orders. In other words, if (Xi, Yi) and (Xj, Yj) are two pairs of observations, they are considered discordant if Xi < Xj and Yi > Yj or if Xi > Xj and Yi < Yj.

In the context of calculating correlation coefficients like Kendall Tau, understanding concordant and discordant pairs is crucial, as they form the basis for determining the strength and direction of association between two variables based on their ranks.


Calculation of Kendall Tau

To calculate the Kendall Tau correlation coefficient:

1 Compare each pair of observations in terms of their ranks.

2 Determine whether they have the same order (concordant) or
opposite order (discordant) for both variables.

3 Count the total number of concordant pairs (nc ) and discordant
pairs (nd ).

4 Calculate the Kendall Tau coefficient using the formula:

    τ = (nc − nd) / (½ n(n − 1))


Example Calculation

Suppose we have the following data on two variables, X and Y, with their
corresponding ranks:

    Observation     X     Y
        1          10    15
        2          15    10
        3          20    20
        4          25    25
        5          30    40

Let's calculate the Kendall Tau correlation coefficient.


Concordant and Discordant Pairs

To find the number of concordant and discordant pairs in the given data,
we need to compare each pair of observations in terms of their ranks and
determine whether they are concordant or discordant.
Let's denote the observations as (Xi , Yi ) and (Xj , Yj ), where i < j.
A pair (Xi , Yi ) and (Xj , Yj ) is:
Concordant if Xi < Xj and Yi < Yj , or if Xi > Xj and Yi > Yj .
Discordant if Xi < Xj and Yi > Yj , or if Xi > Xj and Yi < Yj .
Let's analyze each pair:
1 (10, 15) and (15, 10): Discordant
2 (10, 15) and (20, 20): Concordant
3 (10, 15) and (25, 25): Concordant
4 (10, 15) and (30, 40): Concordant
5 (15, 10) and (20, 20): Concordant
6 (15, 10) and (25, 25): Concordant
7 (15, 10) and (30, 40): Concordant
8 (20, 20) and (25, 25): Concordant

9 (20, 20) and (30, 40): Concordant

10 (25, 25) and (30, 40): Concordant

So, out of the 10 pairs of observations, there are 9 concordant pairs and 1
discordant pair.

Number of concordant pairs (nc ): 9

Number of discordant pairs (nd ): 1

Total number of pairs: ½ n(n − 1) = ½ (5)(4) = 10

Kendall Tau coefficient (τ ):

    τ = (nc − nd) / (½ n(n − 1)) = (9 − 1) / (½ (5)(4)) = 8/10 = 0.8

The Kendall Tau correlation coefficient for the given data is τ = 0.8,
indicating a strong positive association between the two rankings.
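As a check, the coefficient can be computed directly from the raw data; a short sketch assuming SciPy is available:

```python
from scipy.stats import kendalltau

# Raw data from the example
x = [10, 15, 20, 25, 30]
y = [15, 10, 20, 25, 40]

# kendalltau returns the coefficient and a p-value;
# here there are 9 concordant and 1 discordant pair out of 10
tau, p_value = kendalltau(x, y)
print("Kendall tau:", tau)
```

With no tied ranks in either variable, this matches the hand calculation exactly.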

Chapter 1: Correlation and Association Analysis Point-Biserial Correlation Coefficient

Point-Biserial Correlation Coefficient


Definition

The point-biserial correlation coefficient, denoted by rpb , measures the
strength and direction of association between a continuous variable and a
binary variable. It is calculated by using

    rpb = ((X̄1 − X̄0) / sX ) √(p(1 − p))

where:

X̄1 is the mean exam score for one group (e.g., males).
X̄0 is the mean exam score for the other group (e.g., females).
sX is the standard deviation of all the exam scores.
p is the proportion of one group (e.g., proportion of males).
n is the total sample size.
Applicable when one variable is dichotomous.

Question
Suppose we are interested in examining the relationship between students'
exam scores and their gender (male/female). We have the following data:

Student Exam Score Gender


1 85 Male
2 70 Female
3 90 Male
MRKARIM
4 75 Female
5 80 Male
6 65 Female
7 95 Male
8 85 Female
9 88 Male
10 82 Female

Calculate the point-biserial correlation coefficient to assess the relationship
between exam scores and gender.

Solution

Using the given data:

X̄1 = 87.6 (mean exam score for males)

X̄0 = 75.4 (mean exam score for females)

sX = 8.78 (standard deviation of all ten exam scores)

p = 0.5 (proportion of males in the sample)

n = 10 (total sample size)

Substituting these values into the formula, we find:

    rpb = ((87.6 − 75.4) / 8.78) √(0.5 × (1 − 0.5)) ≈ 0.69

Therefore, the point-biserial correlation coefficient is approximately 0.69,
indicating a fairly strong positive association between exam scores and
gender, with males scoring higher on average.
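The same coefficient can be computed from the raw data; a short sketch assuming SciPy is available (pointbiserialr needs the binary variable coded numerically, here 1 = male, 0 = female):

```python
from scipy.stats import pointbiserialr

# Raw data from the table: gender coded 1 = male, 0 = female
gender = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
scores = [85, 70, 90, 75, 80, 65, 95, 85, 88, 82]

# Point-biserial correlation (equivalent to the Pearson correlation
# between the dummy-coded gender and the scores) and its p-value
r_pb, p_value = pointbiserialr(gender, scores)
print("Point-biserial correlation:", r_pb)
```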

Chapter 1: Correlation and Association Analysis Phi Coefficient

Phi Coefficient

Definition

The phi coefficient, denoted by φ, measures the association between
two dichotomous variables. It is essentially a special case of the
Pearson correlation coefficient, specifically applicable when both variables
are binary (i.e., they have only two possible values).

We have the following contingency table for two categorical variables:

                      Variable 2 = 0   Variable 2 = 1   Total
    Variable 1 = 0         n00              n01          n0+
    Variable 1 = 1         n10              n11          n1+
    Total                  n+0              n+1          n++

Here, n00 represents the frequency of observations with Variable 1 being in
category 0 and Variable 2 being in category 0, and so on.

Phi Coefficient with an example

We can then calculate the phi coefficient using the formula:

    φ = (n00 n11 − n10 n01) / √(n1+ n0+ n+1 n+0)

Similar to Pearson correlation.

Both variables are dichotomous.

Suppose we have data on two binary variables, Variable 1 and Variable 2,
and we want to calculate the phi coefficient to measure the association
between them. We have the following contingency table:

                      Variable 2 = 0   Variable 2 = 1
    Variable 1 = 0         20               30
    Variable 1 = 1         40               10

1 Calculate the marginal frequencies:

    n1+ = 40 + 10 = 50
    n0+ = 20 + 30 = 50
    n+1 = 30 + 10 = 40
    n+0 = 20 + 40 = 60

2 Calculate the cell frequencies:

    n00 = 20
    n01 = 30
    n10 = 40
    n11 = 10

Using this contingency table, we can calculate the phi coefficient as
follows:

1 Substitute the frequencies into the formula:

    φ = (n00 n11 − n10 n01) / √(n1+ n0+ n+1 n+0)

2 Calculate the phi coefficient:

    φ = ((20)(10) − (40)(30)) / √((50)(50)(40)(60))
      = (200 − 1200) / √6,000,000
      = −1000 / 2449.49 ≈ −0.408

So, the phi coefficient for this example is approximately −0.408, indicating
a moderate negative association between Variable 1 and Variable 2.
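The calculation is easy to verify in plain Python using only the formula above:

```python
import math

# Cell frequencies from the example table
n00, n01, n10, n11 = 20, 30, 40, 10

# Marginal (row and column) totals
n0p, n1p = n00 + n01, n10 + n11   # row totals
np0, np1 = n00 + n10, n01 + n11   # column totals

# Phi coefficient
phi = (n00 * n11 - n10 * n01) / math.sqrt(n1p * n0p * np1 * np0)
print("Phi coefficient:", phi)
```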

Chapter 1: Correlation and Association Analysis Cramér's V

Cramér's V
Definition

Cramér's V is a measure of the strength of association between two
nominal variables. It is defined as

    V = √( χ² / (n (min(r, c) − 1)) ) = √( χ² / (n × min(r − 1, c − 1)) )

where

    χ² = Σᵢ Σⱼ (Oij − Eij)² / Eij

is the chi-square statistic, Oij is the observed frequency in cell (i, j) of the
contingency table, Eij is the expected frequency in cell (i, j) of the
contingency table, n is the total number of observations, and r and c are
the number of rows and columns in the contingency table respectively.

Ranges from 0 to 1.
0 indicates no association, 1 indicates perfect association.
Computation of Cramér's V
Suppose we have the following contingency table:

                   Variable 1
    Variable 2      A     B   Total
    C              20    30     50
    D              10    40     50
    Total          30    70    100

Step 1: Calculate the chi-square statistic

First, compute the expected frequencies for each cell:

    E11 = (50 × 30)/100 = 15
    E12 = (50 × 70)/100 = 35
    E21 = (50 × 30)/100 = 15
    E22 = (50 × 70)/100 = 35
Then, compute the chi-square statistic:

    χ² = (20 − 15)²/15 + (30 − 35)²/35 + (10 − 15)²/15 + (40 − 35)²/35
       = 25/15 + 25/35 + 25/15 + 25/35
       = 5/3 + 5/7 + 5/3 + 5/7
       = 100/21 ≈ 4.762

Step 2: Calculate Cramér's V

    V = √( 4.762 / (100 × (min(2, 2) − 1)) ) = √(4.762/100) ≈ 0.218

So, in this example, Cramér's V is approximately 0.218. This value
indicates a relatively weak association between the two categorical
variables. Therefore, we can interpret that there is only a modest
relationship between Variable 1 and Variable 2 in the given contingency
table.
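A quick cross-check with SciPy (assumed available); correction=False disables the Yates continuity correction so the chi-square statistic matches the hand calculation:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed contingency table from the example
table = np.array([[20, 30],
                  [10, 40]])

# Chi-square without the Yates continuity correction
chi2, p, dof, expected = chi2_contingency(table, correction=False)

# Cramér's V
n = table.sum()
r, c = table.shape
V = np.sqrt(chi2 / (n * (min(r, c) - 1)))

print("Chi-square:", chi2)
print("Cramér's V:", V)
```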
Chapter 1: Correlation and Association Analysis The Kappa Statistic

The Kappa Statistic

The Kappa Statistic or Cohen's Kappa statistic is a statistical measure
used to quantify the level of agreement between two raters (or judges,
observers, surveys, etc.) who each classify items into categories.
It assesses the level of agreement between two raters beyond what
would be expected by chance.
Kappa takes into account both observed agreement and agreement
expected by chance.
It is widely used in various fields to assess the reliability of ratings or
classifications made by multiple raters.
The formula for Cohen's Kappa is:

    κ = (Po − Pe) / (1 − Pe)

Where:
Po is the proportion of observed agreement.
Pe is the proportion of agreement expected by chance.

                   Judgment 2
    Judgment 1     Yes     No     Total
    Yes             a       b     a + b
    No              c       d     c + d
    Total         a + c   b + d     n

where

    Po = (a + d) / (a + b + c + d)

    PYes = ((a + b)/(a + b + c + d)) ⋅ ((a + c)/(a + b + c + d))

Similarly:

    PNo = ((c + d)/(a + b + c + d)) ⋅ ((b + d)/(a + b + c + d))

Finally,

    Pe = PYes + PNo

Interpretation of Cohen's Kappa

κ = 1: Perfect agreement between raters.

κ = 0: Agreement no better than chance.

κ < 0: Agreement worse than chance.

0 < κ < 0.2: Slight agreement.

0.2 ≤ κ < 0.4: Fair agreement.

0.4 ≤ κ < 0.6: Moderate agreement.

0.6 ≤ κ < 0.8: Substantial agreement.

0.8 ≤ κ: Almost perfect agreement.


Example of Cohen's Kappa

Suppose that you were analyzing data related to a group of 50 people
applying for a grant. Each grant proposal was read by two readers and
each reader either said "Yes" or "No" to the proposal. Suppose the
count data were as follows:

                  Reader B
    Reader A      Yes    No    Total
    Yes            20     5      25
    No             10    15      25
    Total          30    20      50

Calculate Cohen's Kappa to assess the agreement between the two readers.


The observed proportionate agreement is:

    Po = (a + d)/(a + b + c + d) = (20 + 15)/50 = 0.7

To calculate Pe (the probability of random agreement) we note that:
Reader A said "Yes" to 25 applicants and "No" to 25 applicants. Thus
reader A said "Yes" 50% of the time. Reader B said "Yes" to 30 applicants
and "No" to 20 applicants. Thus reader B said "Yes" 60% of the time. So
the expected probability that both would say Yes at random is:

    PYes = ((a + b)/(a + b + c + d)) ⋅ ((a + c)/(a + b + c + d)) = 0.5 × 0.6 = 0.3

Similarly:

    PNo = ((c + d)/(a + b + c + d)) ⋅ ((b + d)/(a + b + c + d)) = 0.5 × 0.4 = 0.2


Overall random agreement probability is the probability that they agreed
on either Yes or No, i.e.:

    Pe = PYes + PNo = 0.3 + 0.2 = 0.5

So now applying our formula for Cohen's Kappa we get:

    κ = (Po − Pe)/(1 − Pe) = (0.7 − 0.5)/(1 − 0.5) = 0.4

So, with a kappa value of 0.4, there is a moderate level of agreement
beyond what would be expected by chance between the two readers. While
it's not perfect agreement, it still suggests a meaningful level of consensus
between them.

Chapter 1: Correlation and Association Analysis Python Code: The Kappa Statistic

Python Code: The Kappa Statistic

# Observed agreement
num_agreements = 20 + 15
total_obs = 50
P_o = num_agreements / total_obs

# Expected agreement by chance
P_yes_A = 25 / total_obs   # Reader A said "Yes" 50% of the time
P_no_A = 25 / total_obs
P_yes_B = 30 / total_obs   # Reader B said "Yes" 60% of the time
P_no_B = 20 / total_obs

P_e = P_yes_A * P_yes_B + P_no_A * P_no_B

# Cohen's Kappa
kappa = (P_o - P_e) / (1 - P_e)
print("Cohen's Kappa:", kappa)  # 0.4


Another Example

from sklearn.metrics import cohen_kappa_score

# Example data
doctor1_ratings = [3, 2, 1, 5, 4, 2, 3, 4, 1, 5]
doctor2_ratings = [4, 2, 1, 5, 3, 2, 3, 4, 2, 5]

# Calculate Kappa Statistic
kappa = cohen_kappa_score(doctor1_ratings, doctor2_ratings)
print("Kappa Statistic:", kappa)

# Alternatively, statsmodels' cohens_kappa expects a contingency
# (confusion) table rather than two lists of ratings:
from sklearn.metrics import confusion_matrix
from statsmodels.stats.inter_rater import cohens_kappa

table = confusion_matrix(doctor1_ratings, doctor2_ratings)
result = cohens_kappa(table)
print("Kappa Statistic:", result.kappa)

Chapter 1: Correlation and Association Analysis Partial Correlation

Partial Correlation

Partial correlation is a statistical technique used to measure the strength
and direction of the relationship between two variables while controlling for
the influence of one or more additional variables. The formula for
calculating the partial correlation coefficient (denoted as rxy.z ) between
variables X and Y while controlling for variable Z is given by:

    rxy.z = (rxy − rxz rzy) / √((1 − rxz²)(1 − rzy²))

Where:

rxy : Correlation coefficient between variables X and Y
rxz : Correlation coefficient between variables X and Z
rzy : Correlation coefficient between variables Z and Y

Partial correlation helps in understanding the unique association between
variables after accounting for the effects of other variables.

Example: Partial Correlation

Suppose we have three variables: X, Y, and Z. We want to calculate the
partial correlation coefficient rxy.z between X and Y while controlling for
Z.
Given correlation coefficients:

    rxy = 0.6,  rxz = 0.4,  rzy = 0.3

We can use the formula:

    rxy.z = (rxy − rxz rzy) / √((1 − rxz²)(1 − rzy²))

Substituting the given values:

    rxy.z = (0.6 − 0.4 × 0.3) / √((1 − 0.4²)(1 − 0.3²))
          = 0.48 / √(0.84 × 0.91) ≈ 0.549

So, the partial correlation coefficient between X and Y while controlling
for Z is approximately 0.549.
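This arithmetic can be verified in a few lines of plain Python:

```python
import math

# Given correlation coefficients from the example
r_xy, r_xz, r_zy = 0.6, 0.4, 0.3

# Partial correlation of X and Y controlling for Z
r_xy_z = (r_xy - r_xz * r_zy) / math.sqrt((1 - r_xz**2) * (1 - r_zy**2))
print("Partial correlation r_xy.z:", round(r_xy_z, 3))
```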

Suppose we are interested in the association between two variables X and
Y but want to control for other covariates Z1 , Z2 , . . . , Zk . The partial
correlation is defined to be the Pearson correlation between two derived
variables x and y , where

x = the residual from the linear regression of X on Z1 , Z2 , . . . , Zk

y = the residual from the linear regression of Y on Z1 , Z2 , . . . , Zk .

However, we are also often interested in the association between Y and all
the predictors when considered as a group. This measure of association is
given by the multiple correlation.
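The residual-based definition above can be sketched with NumPy; the small dataset below is hypothetical, chosen only for illustration:

```python
import numpy as np

# Illustrative data (hypothetical values)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
Z = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Design matrix for the control variable Z (with intercept)
D = np.column_stack([np.ones_like(Z), Z])

def residuals(v, D):
    """Residuals from the least-squares regression of v on D."""
    beta, *_ = np.linalg.lstsq(D, v, rcond=None)
    return v - D @ beta

# Partial correlation = Pearson correlation of the two residual series
x_res = residuals(X, D)
y_res = residuals(Y, D)
r_xy_z = np.corrcoef(x_res, y_res)[0, 1]
print("Partial correlation of X and Y given Z:", r_xy_z)
```

For a single control variable this agrees exactly with the correlation-based formula given earlier.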

Chapter 1: Correlation and Association Analysis Python Code: The Partial Correlation

Python Code: The Partial Correlation

# pip install pingouin
import pandas as pd
import pingouin as pg

# Sample data (Z chosen so it is not perfectly collinear with X,
# otherwise the partial correlation would be undefined)
data = {
    'X': [1, 2, 3, 4, 5],
    'Y': [2, 3, 5, 4, 6],
    'Z': [2, 1, 4, 3, 5]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Calculate partial correlation between X and Y controlling for Z
partial_corr_result = pg.partial_corr(data=df, x='X', y='Y', covar='Z')

# Print the partial correlation coefficient and p-value
print("Partial Correlation Coefficient:", partial_corr_result['r'].values[0])
print("p-value:", partial_corr_result['p-val'].values[0])

Chapter 1: Correlation and Association Analysis Multiple Correlation

Multiple Correlation

Suppose we have an outcome variable y and a set of predictors x1 , . . . , xk .
The maximum possible correlation between y and a linear combination of
the predictors c1 x1 + . . . + ck xk is given by the correlation between y and
the regression function β1 x1 + . . . + βk xk and is called the multiple
correlation between y and {x1 , . . . , xk }. It is estimated by the Pearson
correlation between y and b1 x1 + . . . + bk xk , where b1 , . . . , bk are the
least-squares estimates of β1 , . . . , βk . The multiple correlation can also be
shown to be equivalent to

    √(Reg SS / Total SS) = √R²

So the multiple correlation is defined as

    R = √R²

where R² is the coefficient of determination for the number of independent
variables in the model.

R ranges from 0 to 1. A higher R value suggests a stronger linear
relationship between the variables and R = 0 indicates no linear
relationship.
Chapter 1: Correlation and Association Analysis Python Code: The Multiple Correlation

Python Code: The Multiple Correlation

import numpy as np

# Sample data: response Y and two predictors X1, X2
Y  = np.array([2, 3, 5, 4, 6], dtype=float)
X1 = np.array([1, 2, 3, 4, 5], dtype=float)
X2 = np.array([2, 1, 4, 3, 5], dtype=float)

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(X1), X1, X2])

# Least-squares fit and fitted values b0 + b1*x1 + b2*x2
b, *_ = np.linalg.lstsq(X, Y, rcond=None)
Y_hat = X @ b

# Multiple correlation R = Pearson correlation between Y and the
# fitted values (equivalently, the square root of R-squared)
R = np.corrcoef(Y, Y_hat)[0, 1]
print("Multiple correlation coefficient:", R)

Chapter 1: Correlation and Association Analysis Chapter Exercises

Chapter Exercises

1 Consider the following dataset representing the scores of four students
  in two quizzes:

    Student   Quiz 1   Quiz 2
    A           80       75
    B           90       85
    C           70       65
    D           85       80

  Calculate the Spearman's rank correlation coefficient between Quiz 1
  and Quiz 2 scores.


2 Suppose you have collected data on the height and weight of ten
  individuals:

    Height (cm)   Weight (kg)
       160            55
       165            60
       170            65
       175            70
       180            75
       185            80
       190            85
       195            90
       200            95
       205           100

  Compute the Pearson correlation coefficient between height and
  weight and interpret your result.


3 Given the following contingency table representing the relationship
  between two categorical variables:

                   Variable 1
    Variable 2      A     B   Total
    C              20    30     50
    D              10    40     50
    Total          30    70    100

  Calculate the phi correlation coefficient and interpret your result.


4 Consider the following contingency table representing the relationship
  between two categorical variables:

                       Myocardial Infarction (MI)
    Smoking Status      Yes    No    Total
    Yes                  40    30      70
    No                   10    40      50
    Total                50    70     120

  Calculate Cramér's V for the given contingency table and interpret
  your result.


5 Two raters independently evaluate the performance of 10 students in
  a quiz competition. Each rater scores the students as either "Pass" or
  "Fail." The ratings are compared to assess the agreement between
  the raters using Cohen's Kappa statistic.

    Student   Rater 1   Rater 2
       1       Pass      Pass
       2       Pass      Fail
       3       Fail      Fail
       4       Pass      Pass
       5       Pass      Pass
       6       Fail      Pass
       7       Fail      Fail
       8       Pass      Pass
       9       Pass      Pass
      10       Fail      Fail


(i) Convert the ratings into binary labels. For example, you might assign
1 for "Pass" and 0 for "Fail."

(ii) Calculate the observed agreement (Po ) between Rater 1 and Rater 2's
ratings.

(iii) Calculate the expected agreement (Pe ) by chance.

(iv) Use the formula for Cohen's Kappa (κ) to calculate the statistic.

(v) Interpret the value of Cohen's Kappa in terms of the level of
agreement between the two raters.


Python Code: The Kappa Statistic

from sklearn.metrics import cohen_kappa_score

# Data representing the ratings by two raters
rater1 = ["Pass", "Pass", "Fail", "Pass", "Pass", "Fail",
          "Fail", "Pass", "Pass", "Fail"]
rater2 = ["Pass", "Fail", "Fail", "Pass", "Pass", "Pass",
          "Fail", "Pass", "Pass", "Fail"]

# Define the possible categories
categories = ["Pass", "Fail"]

# Calculate Cohen's Kappa
kappa = cohen_kappa_score(rater1, rater2, labels=categories)

print("Cohen's Kappa:", kappa)

Chapter 2: Regression Analysis: Simple Linear Regression

Chapter 2: Regression Analysis: Simple Linear Regression


3 Chapter 2: Regression Analysis: Simple Linear Regression

3.1 Problem & Motivation

3.2 Functional vs. Statistical Relation

3.3 Regression Analysis

3.4 Historical Origin of the Term Regression

3.5 MRKARIM
Types of parametric regression analysis

3.6 A Probabilistic View of Linear Regression

3.7 Parameter Estimation

3.8 The Estimated Error Variance or Standard Error

3.9 Coefficient of Determination

3.10 Interval Estimation and Hypothesis Testing

3.11 Real Data Example: Obstetrics Dataset


Chapter 2: Regression Analysis: Simple Linear Regression Problem & Motivation

Problem & Motivation

Research Problem 1

Obstetricians sometimes order tests to measure estriol levels from 24-hour
urine specimens taken from pregnant women who are near term, because the
level of estriol has been found to be related to infant birthweight. The test
can provide indirect evidence of an abnormally small fetus. Greene and
Touchstone conducted a study to relate birthweight and estriol level in
pregnant women.ᵃ The sample data are presented in the following Table 4.
They want to find any relationship between the estriol level and birthweight.
How can this relationship be quantified? What is the estimated average
birthweight if a pregnant woman has an estriol level of 15 mg/24 hr?

ᵃ Greene, J., & Touchstone, J. (1963). Urinary tract estriol: An index of placental
function. American Journal of Obstetrics and Gynecology, 85(1), 1-9.


Table 4: Sample data from the Greene-Touchstone study relating birthweight and
estriol level in pregnant women near term



Motivation for Regression Analysis

The drawback of correlation analysis is that it only measures the
strength and direction of the linear relationship between two variables.
It does not provide information about causality or the nature of the
relationship beyond linearity. Additionally, correlation coefficients can
be affected by outliers and may not capture complex relationships
that exist between variables.
The motivation for regression analysis stems from the need to
understand and model the relationship between variables more
comprehensively. Regression analysis allows us to not only measure
the strength and direction of the relationship but also to make
predictions and infer causality, provided certain assumptions are met.
By fitting a regression model, we can examine how changes in one
variable are associated with changes in another variable while
controlling for potential confounding factors. This enables deeper
insights into the underlying mechanisms driving the relationship
between variables.
Chapter 2: Regression Analysis: Simple Linear Regression Functional vs. Statistical Relation

Functional Relation

Suppose g is a known function.

    Y = g (X )

Whenever X is known, Y is completely known.
Examples:

    (1) y = ½ x     (2) y = ½ x − 1     (3) y = 5 + x²

Chapter 2: Regression Analysis: Simple Linear Regression Regression Analysis

Regression Analysis

Correlation analysis does not tell us the why and how behind a
relationship; it just says that the relationship exists.

Regression is to build a function of independent variables (also known
as predictors) to predict a dependent variable (also called response).
For example,

    y = g (x1 , x2 , . . . , xp ) + e

where y is a dependent variable, x1 , x2 , . . . , xp are independent
variables, and e is an error term.

For example, banks assess the risk of home-loan applicants based on
their
▸ age
▸ income
▸ expenses
▸ occupation
▸ number of dependents,
▸ total credit, etc.

Regression analysis: to find out the relationship and measure the
dependency between a response variable Y and a set of covariates X.

Standard regression methods have focused on the estimation of the
conditional mean function E(Y ∣ X) = g(X) of a conditional response
Y given a set of p covariates X = (X1 , ..., Xp ).
Y given a set of p covariates X = (X1 , ..., Xp )

Chapter 2: Regression Analysis: Simple Linear Regression Historical Origin of the Term Regression

Historical Origin of the Term Regression

Figure 2: Hypothetical distribution of sons' heights corresponding to given heights
of fathers

Figure 3: Hypothetical distribution of heights corresponding to given age


Main Use of Regression Analysis

The main use of regression analysis is to:

Understand and quantify relationships between variables.

Identify significant factors influencing a dependent variable.

Assess the strength and direction of associations between variables.

Examine how changes in one variable are associated with changes in
another variable while controlling for potential confounding factors.

Predict future outcomes or values based on historical data.

Make informed decisions and recommendations based on statistical
evidence.


Comparison of correlation analysis and regression analysis

Aspect          Correlation Analysis           Regression Analysis

Purpose         Measure the strength and       Model and quantify the
                direction of association       relationship between a
                between two (continuous)       dependent variable and one
                variables.                     or more independent
                                               variables.

Direction of    Examines relationship          Explores relationship
Analysis        between variables without      between variables with
                assuming causation.            assumption of causation.

Output          Produces correlation           Provides regression
                coefficient (e.g.,             coefficients (slope and
                Pearson's r, Spearman's ρ).    intercept), along with
                                               measures of model fit
                                               (e.g., R-squared).

Table 5: Comparison between Correlation Analysis and Regression Analysis

Aspect          Correlation Analysis           Regression Analysis

Model           Relatively simple analysis;    Can range from simple
Complexity      provides a single summary      linear regression to complex
                statistic.                     models with multiple
                                               predictors.

Inference       Descriptive analysis; does     Allows for inference and
                not imply causation.           hypothesis testing; enables
                                               predictions and causal
                                               conclusions.
Chapter 2: Regression Analysis: Simple Linear Regression Types of parametric regression analysis

Types of parametric regression analysis

for continuous response variable
▸ Simple Linear Regression
▸ Multiple Regression
▸ Polynomial Regression
▸ Multivariate Regression

for categorical response variable
▸ Logistic Regression
∎ binomial (also called ordinary) logistic regression
∎ multinomial logistic regression
∎ ordinal logistic regression
∎ alternating logistic regressions

for discrete response variable
▸ Poisson Regression

for survival-time (time-to-event) outcomes
▸ Cox Regression (or proportional hazards regression)


Regression Models

Other types of regression analysis:

Ridge Regression

Lasso Regression

Bayesian regression

Nonparametric regression

Types of regression models also used in machine learning:

Generalized linear model (GLM)

Additive model

Generalized additive model (GAM)

Random effect model

Linear mixed model (LMM)

Generalized linear mixed model (GLMM)

Generalized estimating equations (GEE)


Simple Regression Model

To explore the relationship between independent variable x and dependent
variable y we consider the following population regression equation:

E(y ∣x) = β0 + β1 x

where E(y ∣x) = expected or average value of y for the given value of x.

Simple regression model:

yi = β0 + β1 xi + ei ;  i = 1, 2, . . . , n     (1)

where we assume ei is normally distributed with mean 0 and constant
variance. That is, E(ei ) = 0 and Var(ei ) = σ 2 (say).

Remarks: Normality of the errors (or residuals) is not strictly required.
However, the normality assumption in Equation (1) is necessary to
perform hypothesis tests concerning the regression parameters, as
discussed later.

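As a quick illustration of model (1), the following numpy sketch simulates data from the model and checks that a fitted line recovers the true parameters. The parameter values (β0 = 60, β1 = 5, σ = 10) and sample size are illustrative assumptions, not taken from the notes:

```python
import numpy as np

# Simulate y_i = beta0 + beta1*x_i + e_i with e_i ~ N(0, sigma^2).
# The true parameter values below are illustrative only.
rng = np.random.default_rng(42)
beta0, beta1, sigma = 60.0, 5.0, 10.0
x = rng.uniform(0, 30, size=500)
y = beta0 + beta1 * x + rng.normal(0, sigma, size=500)

# With enough data the fitted line should lie close to the true line.
b1_hat, b0_hat = np.polyfit(x, y, deg=1)
print(round(b0_hat, 1), round(b1_hat, 2))
```

With a seeded generator the fitted intercept and slope land within sampling error of the true values.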

Graphical Presentation



For x = xi , we have one (sample) observation, y = yi . In terms of the
population regression function (PRF), the observed yi can be expressed as

yi = β0 + β1 xi + ei
   = E(y ∣xi ) + ei

In terms of the sample regression function (SRF), the observed yi can
be expressed as

yi = ŷi + êi

The estimated model (or estimated line) of (1) can be written as

ŷi = β̂0 + β̂1 xi

where
▸ ŷi = estimator of E(y ∣xi ) under the assumption ei ∼iid N (0, σ 2 )
▸ β̂0 = estimator of β0
▸ β̂1 = estimator of β1



Figure 4: Conditional distribution of the disturbances ei

Chapter 2: Regression Analysis: Simple Linear Regression A Probabilistic View of Linear Regression

A Probabilistic View of Regression Analysis

regression analysis: to find out the relationship between a response
variable Y and a set of covariates X.
response variable: Y and a set of covariates: X = (X1 , X2 , . . . , Xp )T
the conditional density fY ∣X (y ∣x) is characterized by its mean µ(x)
(parametric) and its variance σ 2 (x) (parametric/nonparametric)
conditional mean of Y given X = x :
E(Y ∣X = x) = µ(x)
aims:
▸ estimation of the unknown function E(Y ∣X) = µ(X)



Distributional Assumptions Underlying Classical Regression

if
▸ fy ∣x (y ∣x) is a normal density (Remarks: y must be a normal random
variate)
▸ µ(x) is a linear function (i.e., µ(x) = β0 + β1 x ) and
▸ σ 2 (x) is a constant, i.e., σ 2 (x) = σ 2 (say)
then we use a classical linear regression model to estimate E(y ∣x)

in this case, writing y = β0 + β1 x + e, we assume

E(y ∣x) = β0 + β1 x + E(e∣x)  with  E(e∣x) = 0

where β1 measures the marginal change in the mean of y due to a
marginal change in x.

We now reformulate the assumptions of the classical regression model
in the next step


Assumption of Classical Linear Model

The classical (simple) linear regression model is

yi = β0 + β1 xi + ei ;  i = 1, 2, . . . , n     (2)

Assumptions
1 Linearity: There exists a linear relationship between the independent
variable X and the dependent variable Y.
2 Independence: The observations are independent of each other.
3 Homoscedasticity: The variance of the errors (residuals) is constant
across all levels of the independent variable X. That is, Var(ei ) = σ 2 .
4 Normality: The errors (residuals) follow a normal distribution with
mean 0. That is, E(ei ) = 0.
5 No perfect multicollinearity: There is no perfect linear relationship
between the independent variable X and any other variable.

That is, ei ∼iid N (0, σ 2 ), so that yi is independently distributed as
N (β0 + β1 xi , σ 2 ). Therefore, it is also called normal regression.
Chapter 2: Regression Analysis: Simple Linear Regression Parameter Estimation

How do you estimate parameters of the simple linear
regression model?

Suppose our simple regression model is

yi = β0 + β1 xi + ei ;  i = 1, 2, . . . , n

where we assume ei is normally distributed with mean 0 and variance σ 2 .

usually, β0 , β1 and σ 2 are unknown
we can estimate β0 , β1 and σ 2 by the following methods
▸ ordinary least squares (OLS) method
▸ maximum likelihood estimation
▸ method-of-moments (or GMM)
▸ ...


Interpretation of Slope Coefficient

the estimated regression model (or equation) is

ŷi = β̂0 + β̂1 xi

where β̂0 and β̂1 are, respectively, the estimators of β0 and β1
β̂1 is the slope of the fitted (estimated) line
▸ it shows the amount of change in ŷ for a change of one unit in x
▸ a positive value for β̂1 indicates a direct relationship between the two
variables and a negative value indicates an inverse relationship
▸ the sign of β̂1 and the sign of r , the correlation coefficient, are always
the same


Least Square Estimation


A mathematical procedure that uses the data to position a line with the
objective of minimizing the sum of the squares of the vertical distances
between the actual y values and the estimated (or predicted) values of y



Ordinary Least Square (OLS) Estimation

1 minimizing function:

Q = min over β0 , β1 of ∑i ei2 = ∑i (yi − ŷi )2 = ∑i (yi − β0 − β1 xi )2 ,  i = 1, . . . , n

2 the estimators for β0 and β1 can be found by solving the following
normal equations

∂Q/∂β0 = ∂/∂β0 [ ∑i (yi − β0 − β1 xi )2 ] = 0     (3)

and

∂Q/∂β1 = ∂/∂β1 [ ∑i (yi − β0 − β1 xi )2 ] = 0     (4)


OLS estimator

from (3), we have

∑i (yi − β0 − β1 xi )(−1) = 0
⇒ ∑i yi − nβ0 − β1 ∑i xi = 0

or
β0 = ȳ − β1 x̄

from (4), we have

∑i (yi − β0 − β1 xi )(−xi ) = 0
⇒ ∑i xi yi − β0 ∑i xi − β1 ∑i xi2 = 0     (5)

putting the value of β0 in Equation (5), we have

∑i xi yi − (ȳ − β1 x̄) ∑i xi − β1 ∑i xi2 = 0
⇒ ∑i xi yi − ȳ ∑i xi + β1 x̄ ∑i xi − β1 ∑i xi2 = 0
⇒ ∑i xi yi − nȳ x̄ + nβ1 x̄ 2 − β1 ∑i xi2 = 0
⇒ ∑i xi yi − nx̄ ȳ − β1 ( ∑i xi2 − nx̄ 2 ) = 0
⇒ β1 = ( ∑i xi yi − nx̄ ȳ ) / ( ∑i xi2 − nx̄ 2 )

hence, the ordinary least square (OLS) estimators of β0 and β1 are

β̂0 = ȳ − β̂1 x̄

β̂1 = ( ∑i xi yi − nx̄ ȳ ) / ( ∑i xi2 − nx̄ 2 )

the expression of β̂1 can also be written as

β̂1 = ∑i (xi − x̄)(yi − ȳ ) / ∑i (xi − x̄)2 = r (sy /sx )

because

∑i (xi − x̄)2 = ∑i xi2 − nx̄ 2   and   ∑i (xi − x̄)(yi − ȳ ) = ∑i xi yi − nx̄ ȳ

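The closed-form OLS estimators derived above can be sketched in a few lines of Python. The helper name `ols_fit` and the toy data are illustrative; the result is cross-checked against `numpy.polyfit`:

```python
import numpy as np

def ols_fit(x, y):
    """Closed-form OLS estimates for y_i = beta0 + beta1*x_i + e_i,
    using the formulas derived above."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

# Illustrative toy data.
x = [1, 2, 3, 4, 5, 6]
y = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1]
b0, b1 = ols_fit(x, y)

# Cross-check against numpy's degree-1 least-squares fit.
slope, intercept = np.polyfit(x, y, deg=1)
print(round(b0, 4), round(b1, 4))
```

Both routes minimize the same sum of squared residuals, so the estimates agree to floating-point precision.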

Remarks: Note that the method of least squares is appropriate whenever
the average residual for each given value of x is 0, that is, when
E(e∣X = x) = 0 in Equation (2). Normality of the errors (or residuals) is
not strictly required. However, the normality assumption in Equation (2)
is necessary to perform hypothesis tests concerning the regression
parameters, as discussed later.

Chapter 2: Regression Analysis: Simple Linear Regression The Estimated Error Variance or Standard Error

The Estimated Error Variance or Standard Error

The error variance or standard error of estimate measures the scatter, or
dispersion, of the observed values around the line of regression. The
formulas that are used to compute the error variance or the standard error:

σ̂ 2 = ∑i (yi − ŷi )2 / (n − 2) = ( ∑i yi2 − β̂0 ∑i yi − β̂1 ∑i xi yi ) / (n − 2)

where
ŷi = β̂0 + β̂1 xi

Hence, the standard error of estimate is

σ̂ = √( ∑i (yi − ŷi )2 / (n − 2) ) = √( ( ∑i yi2 − β̂0 ∑i yi − β̂1 ∑i xi yi ) / (n − 2) )

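A small numeric check, on illustrative toy data, that the residual form and the computational form of σ̂ 2 above agree:

```python
import numpy as np

# Toy data (illustrative values only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

# Closed-form OLS estimates from the preceding slides.
b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

# Residual form and computational form of the error variance.
var_resid = np.sum((y - y_hat) ** 2) / (n - 2)
var_comp = (np.sum(y**2) - b0 * np.sum(y) - b1 * np.sum(x * y)) / (n - 2)
print(var_resid, var_comp)
```

The two expressions are algebraically identical, so they match up to rounding error.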

Example 3.1 (Sultan's Dine restaurants Sales Dataset)

To illustrate the least squares method, suppose data were collected from a
sample of 10 Sultan's Dine restaurants located near university campuses.
For the i th observation or restaurant in the sample, xi is the size of the
student population (in thousands) and yi is the quarterly sales (in
thousands of dollars).


(i). Show the relationship between the size of student population and the
quarterly sales. Make a comment on the diagram.

(ii). Write down the regression model for this example, and mention the
assumptions of the model.

(iii). Write the estimated regression equation and find the least square
estimates. Interpret the results.

(iv). Draw the regression line.

(v). Predict quarterly sales for a restaurant to be located near a campus
with 16,000 students.

(vi). Find the value of the standard error of the estimates.


Student Population and Quarterly Sales Data: 10 paired observations

Student Population xi (1000s):   2    6    8    8   12   16   20   20   22   26
Quarterly Sales yi ($1000s):    58  105   88  118  117  137  157  169  149  202

Simple regression model:

yi = β0 + β1 xi + ei ;  i = 1, 2, . . . , 10     (6)

where
▸ yi = Quarterly Sales ($1000s)
▸ xi = Student Population (1000s)
▸ β0 is the intercept
▸ β1 is the slope coefficient
▸ ei is the error term, and we assume ei is normally distributed with
mean 0 and variance σ 2 .

The estimated regression model (or equation) of (6) is

ŷi = β̂0 + β̂1 xi

where β̂0 and β̂1 are, respectively, the estimators of β0 and β1 .
The ordinary least square (OLS) estimators of β0 and β1 are

β̂0 = ȳ − β̂1 x̄ ;   β̂1 = ( ∑i xi yi − nx̄ ȳ ) / ( ∑i xi2 − nx̄ 2 ).

The Least Squares Estimation

   xi      yi     xi2     xi yi
    2      58       4      116
    6     105      36      630
    8      88      64      704
    8     118      64      944
   12     117     144     1404
   16     137     256     2192
   20     157     400     3140
   20     169     400     3380
   22     149     484     3278
   26     202     676     5252
Total = 140    1300    2528    21040


The ordinary least square (OLS) estimates of β0 and β1 are

β̂1 = ( ∑i xi yi − nx̄ ȳ ) / ( ∑i xi2 − nx̄ 2 ) = 2840 / 568 = 5

β̂0 = ȳ − β̂1 x̄ = 130 − 5(14) = 60

Thus, the estimated regression equation is

ŷi = 60 + 5xi

The slope of the estimated regression equation (β̂1 = 5) is positive,
implying that as student population increases, sales increase. In fact, we
can conclude that an increase in the student population of 1000 is
associated with an increase of $5000 in expected sales; that is, quarterly
sales are expected to increase by $5 per student.

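The hand computation for Example 3.1 can be verified in a few lines of numpy:

```python
import numpy as np

# Student-population / quarterly-sales data from Example 3.1.
x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26], dtype=float)
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202], dtype=float)
n = len(x)

# Closed-form OLS estimates.
b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
b0 = y.mean() - b1 * x.mean()
print(b0, b1)        # 60.0 5.0

# Predicted quarterly sales near a campus with 16,000 students.
print(b0 + b1 * 16)  # 140.0
```

The estimates match the hand-computed values β̂0 = 60, β̂1 = 5, and the predicted sales of $140,000.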

To predict quarterly sales for a restaurant to be located near a campus
with 16,000 students, we would compute

ŷ = 60 + 5 × 16 = 140

Hence, we would predict quarterly sales of $140,000 for this restaurant.


The Standard Error of the Estimates

   xi      yi     ŷi = 60 + 5xi    (yi − ŷi )2
    2      58          70              144
    6     105          90              225
    8      88         100              144
    8     118         100              324
   12     117         120                9
   16     137         140                9
   20     157         160                9
   20     169         160               81
   22     149         170              441
   26     202         190              144
  140    1300        1300             1530

σ̂ 2 = ∑i (yi − ŷi )2 / (n − 2) = 1530 / (10 − 2) = 191.25  and  σ̂ = √191.25 = 13.83

Hence the standard error of the estimates is σ̂ = 13.83.
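A numpy check of the standard error for Example 3.1:

```python
import numpy as np

# Data and fitted line from Example 3.1.
x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26], dtype=float)
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202], dtype=float)
y_hat = 60 + 5 * x

# Error variance and standard error of estimate.
sse = np.sum((y - y_hat) ** 2)
sigma2_hat = sse / (len(x) - 2)
sigma_hat = np.sqrt(sigma2_hat)
print(sse, sigma2_hat, round(sigma_hat, 3))  # 1530.0 191.25 13.829
```

The value σ̂ ≈ 13.829 is the same one used later for the confidence-interval calculations.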
Chapter 2: Regression Analysis: Simple Linear Regression Coefficient of Determination

Coefficient of Determination

the coefficient of determination is the proportion of the total variation
in the dependent variable Y that is explained by the independent
variable(s) in a regression model. It is denoted by R 2 and defined by

R 2 = 1 − SSE /SST = 1 − ∑i (yi − ŷi )2 / ∑i (yi − ȳ )2

where
▸ the total sum of squares (proportional to the variance of the data):
SST = ∑i (yi − ȳ )2
▸ the sum of squares of residuals, also called the Residual or Error Sum
of Squares (SSE): SSE = ∑i (yi − ŷi )2 = ∑i ei2

For example, suppose R 2 = 0.9027. This implies that 90.27% of the
variability of the dependent variable is explained and the remaining
9.73% of the variability is still unexplained by the regression model.
For the Sultan's Dine restaurants Sales Dataset given in 3.1, we have

   xi      yi    (yi − ȳ )2    ŷi = 60 + 5xi    (yi − ŷi )2
    2      58      5184             70              144
    6     105       625             90              225
    8      88      1764            100              144
    8     118       144            100              324
   12     117       169            120                9
   16     137        49            140                9
   20     157       729            160                9
   20     169      1521            160               81
   22     149       361            170              441
   26     202      5184            190              144
  140    1300     15730           1300             1530

hence, R 2 = 1 − ∑i (yi − ŷi )2 / ∑i (yi − ȳ )2 = 1 − 1530/15730 = 0.9027

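The R 2 computation above can be reproduced directly:

```python
import numpy as np

# R^2 for Example 3.1, matching the table above.
x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26], dtype=float)
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202], dtype=float)
y_hat = 60 + 5 * x

sse = np.sum((y - y_hat) ** 2)       # residual sum of squares (1530)
sst = np.sum((y - y.mean()) ** 2)    # total sum of squares (15730)
r2 = 1 - sse / sst
print(round(r2, 4))  # 0.9027
```

About 90.27% of the variation in quarterly sales is explained by student population.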

Classroom Practice

Example 3.2 (Classroom Practice)

The following sample of observations were randomly selected.

x   4   5   3   6   10
y   4   6   5   7    7

(a). Determine the regression equation.

(b). Write the estimated regression equation (or line).

(c). Determine the estimated value of y when x is 7.

(d). Find the value of the coefficient of determination and interpret your
results.


Solution of the Exercises

Solution of the Exercise given in 3.2

   xi   yi   xi2   yi2   xi yi
    4    4    16    16     16
    5    6    25    36     30
    3    5     9    25     15
    6    7    36    49     42
   10    7   100    49     70

∑ xi = 28   ∑ yi = 29   ∑ xi2 = 186   ∑ yi2 = 175   ∑ xi yi = 173
x̄ = 5.6    ȳ = 5.8


(a). The least square estimates of β0 and β1 are

β̂1 = ( ∑i xi yi − nx̄ ȳ ) / ( ∑i xi2 − nx̄ 2 ) = 0.3630  and  β̂0 = ȳ − β̂1 x̄ = 3.7671

(b). The estimated equation (or line) is

ŷi = β̂0 + β̂1 xi = 3.7671 + 0.3630xi

(c). The estimated value of y is

ŷi = 3.7671 + 0.3630 × 7 = 6.3082

when xi is 7.


Example: Coefficient of Determination

   xi   yi   (yi − ȳ )2      ŷi       (yi − ŷi )2
    4    4      3.24       5.2191     1.486205
    5    6      0.04       5.5821     0.174640
    3    5      0.64       4.8561     0.020707
    6    7      1.44       5.9451     1.112814
   10    7      1.44       7.3971     0.157688
∑i xi = 28   ∑i yi = 29   SST = 6.8   SSE = 2.9521

where

ŷi = β̂0 + β̂1 xi = 3.7671 + 0.3630xi

Hence

R 2 = 1 − ∑i (yi − ŷi )2 / ∑i (yi − ȳ )2 = 1 − 2.9521/6.8 = 0.566


Relationship between R 2 and rxy

The coefficient of correlation measures the strength and direction of a
linear relationship between two variables. The coefficient of
correlation is denoted by rxy .

The relationship between R 2 and rxy is that the square of the
coefficient of correlation (rxy ) is equal to the coefficient of
determination (R 2 ) for the simple regression model. Mathematically,

R 2 = (rxy )2

Hence, for the previous example,

R 2 = (rxy )2 = (0.7522)2 = 0.566

Remarks: Note that the relationship R 2 = (rxy )2 only holds for the
simple linear regression model.
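A quick numeric check of R 2 = (rxy )2 on the classroom-practice data (Example 3.2):

```python
import numpy as np

# Classroom-practice data.
x = np.array([4, 5, 3, 6, 10], dtype=float)
y = np.array([4, 6, 5, 7, 7], dtype=float)
n = len(x)

# Fit the simple regression and compute R^2.
b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

# Sample correlation coefficient.
r = np.corrcoef(x, y)[0, 1]
print(round(r2, 4), round(r**2, 4))  # 0.5659 0.5659
```

As expected for simple linear regression, the two quantities coincide.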

the estimated regression equation for the simple linear regression model
provides
ŷi = β̂0 + β̂1 xi

the sample correlation coefficient is

rxy = (sign of β̂1 ) √(coefficient of determination) = (sign of β̂1 ) √R 2     (7)

where R 2 is the coefficient of determination for the simple regression
model yi = β0 + β1 xi + ei ; i = 1, 2, . . . , n

Remarks: Note that equation (7) is true only for the simple regression
model.


Advantages and Disadvantages of R 2

R 2 is a statistic that will give some information about the
goodness-of-fit of a model

in regression, the R 2 coefficient of determination is a statistical
measure of how well the regression predictions approximate the real
data points

an R 2 of 1 indicates that the regression predictions perfectly fit the
data

R 2 increases as we increase the number of variables in the model (R 2
is monotone increasing with the number of variables included, i.e., it
will never decrease)

an adjusted R 2 is a modification of R 2 that adjusts for the number of
explanatory terms in a model (p ) relative to the number of data
points (n)


Adjusted R 2

The adjusted R 2 (denoted by R̄ 2 ) is defined as

R̄ 2 = 1 − (1 − R 2 ) (n − 1)/(n − p − 1)

where p is the total number of explanatory variables in the model
(not including the constant term), and n is the sample size

it can also be written as:

R̄ 2 = 1 − (SSE /dfe ) / (SST /dft )

where dft is the degrees of freedom n − 1 of the estimate of the
population variance of the dependent variable, and dfe is the degrees
of freedom n − p − 1 of the estimate of the underlying population error
variance


Adjusted R 2

the explanation of this statistic is almost the same as R 2 but it
penalizes the statistic as extra variables are included in the model

the term (n − 1)/(n − p − 1) is called the penalty for using more
regressors in a model

when the number of regressors p increases, (1 − R 2 ) will decrease,
but (n − 1)/(n − p − 1) will increase

whether more regressors improve the explanatory power of a model
depends on the trade-off between R 2 and the penalty

For the previous example, p = 1 and R 2 = 0.566 and hence,

R̄ 2 = 1 − (1 − 0.566) × (5 − 1)/(5 − 1 − 1) = 0.4212

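The adjusted R 2 for the classroom-practice example can be reproduced as follows (using the unrounded R 2 from the data, which yields the 0.4212 quoted above):

```python
import numpy as np

# Adjusted R^2 for the classroom-practice data (n = 5, p = 1).
x = np.array([4, 5, 3, 6, 10], dtype=float)
y = np.array([4, 6, 5, 7, 7], dtype=float)
n, p = len(x), 1

b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
b0 = y.mean() - b1 * x.mean()
r2 = 1 - np.sum((y - (b0 + b1 * x)) ** 2) / np.sum((y - y.mean()) ** 2)

# Penalized version of R^2.
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(r2, 4), round(adj_r2, 4))  # 0.5659 0.4212
```

With only one regressor the penalty factor is (n − 1)/(n − 2) = 4/3, so R̄ 2 is noticeably below R 2 at this small sample size.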

Figure 5: Excel output for the classroom practice example.


Chapter 2: Regression Analysis: Simple Linear Regression Interval Estimation and Hypothesis Testing

Confidence Interval for β0

The confidence interval for β0 can be computed using the standard
formula for linear regression parameter estimates. The formula for the
confidence interval for β0 is:

β̂0 ± t α/2,(n−2) ⋅ se(β̂0 )

Where:

β̂0 is the estimated intercept,

t α/2,(n−2) is the critical value of the t-distribution with n − 2 degrees of
freedom at a significance level of α/2 (where α is typically 0.05 for a
95% confidence interval),

the standard error of the estimator β̂0 is

se(β̂0 ) = √( σ̂ 2 ( 1/n + x̄ 2 / ∑i (xi − x̄)2 ) ).


Confidence Interval for β1

The confidence interval for β1 can be computed using the standard
formula for linear regression parameter estimates. The formula for the
confidence interval for β1 is:

β̂1 ± t α/2,(n−2) ⋅ se(β̂1 )

Where:

β̂1 is the estimated coefficient for x ,

t α/2,(n−2) is the critical value of the t-distribution with n − 2 degrees of
freedom at a significance level of α/2 (where α is typically 0.05 for a
95% confidence interval),

the standard error of the estimator β̂1 is

se(β̂1 ) = σ̂ / √( ∑i (xi − x̄)2 )
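Using the critical value t0.025,8 = 2.306 quoted in these notes, the 95% interval for β1 in Example 3.1 can be computed as:

```python
import numpy as np

# 95% confidence interval for beta1 in Example 3.1.
x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26], dtype=float)
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202], dtype=float)
n = len(x)
b0, b1 = 60.0, 5.0  # OLS estimates computed earlier

sigma_hat = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
se_b1 = sigma_hat / np.sqrt(np.sum((x - x.mean()) ** 2))

t_crit = 2.306  # t_{0.025, 8}, as given in the slides
lower, upper = b1 - t_crit * se_b1, b1 + t_crit * se_b1
print(round(se_b1, 4), round(lower, 3), round(upper, 3))  # 0.5803 3.662 6.338
```

The interval excludes 0, which anticipates the significant t-test for β1 later in the chapter.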

Where:

σ̂ = √( ∑i (yi − ŷi )2 / (n − 2) )

is the estimated standard error of the residuals (or the square root of
the mean squared error, often obtained from the regression output),

n is the number of observations,

x̄ is the mean of the independent variable x.

Once you have these values, you can compute the confidence interval.


Test overall significance of the regression model

In a simple linear regression model, the F-test is used to assess the overall
significance of the regression model. The hypotheses are:

H0 ∶ β1 = 0
H1 ∶ β1 ≠ 0

The null hypothesis H0 implies that the independent variable(s) do not
have any effect on the dependent variable. The alternative hypothesis H1
indicates that the independent variable(s) do have a significant effect on
the dependent variable. The formula for the F-statistic in a simple linear
regression model is:

F = (SSR/1) / (SSE /(n − 2))

Where:
SSR = ∑i (ŷi − ȳ )2 is the regression (explained) sum of squares,
SSE is the sum of squared error (residual) terms.

Under the null hypothesis, the F -statistic follows an F -distribution
with p and n − p − 1 degrees of freedom, where p is the number of
regressors (excluding the intercept) and n is the number of
observations.

In a simple linear regression model, p = 1 because you only have one
independent variable (excluding the intercept).

So, the degrees of freedom for the F-distribution in a simple linear
regression model are 1 and n − 2.

Once you compute the F-statistic, you can compare it to the critical
value from the F-distribution at a chosen significance level (e.g.,
α = 0.05) to determine whether to reject the null hypothesis.

If the F-statistic is greater than the critical value, you reject the null
hypothesis and conclude that the model is significant. Otherwise, you
fail to reject the null hypothesis.

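The F -statistic for Example 3.1 can be computed as follows, with SSR and SSE obtained exactly as defined above:

```python
import numpy as np

# F-statistic for overall significance of the Example 3.1 model.
x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26], dtype=float)
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202], dtype=float)
n = len(x)
y_hat = 60 + 5 * x  # fitted line from earlier

ssr = np.sum((y_hat - y.mean()) ** 2)  # explained sum of squares
sse = np.sum((y - y_hat) ** 2)         # residual sum of squares
F = (ssr / 1) / (sse / (n - 2))        # 1 and n-2 degrees of freedom
print(round(ssr, 1), round(sse, 1), round(F, 2))  # 14200.0 1530.0 74.25
```

With 1 and 8 degrees of freedom this F value is far beyond any conventional critical value, so the model is highly significant.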

Decision Rule for ANOVA F -test

Classical Approach

Set the significance level α.

Calculate the critical value Fcritical = Fα (p, n − p − 1) from the
F -distribution with appropriate degrees of freedom.

Decision Rule:
▸ If calculated F > Fcritical , reject H0 and conclude that the regression
model is statistically significant.
▸ If calculated F ≤ Fcritical , fail to reject H0 and conclude that the
regression model is not statistically significant.


Decision Rule for ANOVA F -test

p -value Approach

Calculate the p -value associated with the calculated F -statistic.

Decision Rule:
▸ If p -value < α, reject H0 and conclude that the regression model is
statistically significant.
▸ If p -value ≥ α, fail to reject H0 and conclude that the regression model
is not statistically significant.


Hypothesis Test: H0 ∶ β1 = 0

Consider the simple linear regression model:

yi = β0 + β1 xi + ei ;  i = 1, 2, . . . , n

We want to test the null hypothesis:

H0 ∶ β1 = 0

Under H0 , the test statistic is given by:

t = (β̂1 − 0) / se(β̂1 )

where  se(β̂1 ) = σ̂ / √( ∑i (xi − x̄)2 )  and  σ̂ = √( ∑i (yi − ŷi )2 / (n − 2) )

We reject H0 if ∣t∣ exceeds the critical value tcritical = t α/2,(n−2) from the
t-distribution with n − 2 degrees of freedom, where n is the sample size.

Decision Rule for t -test

Classical Approach

Set the significance level α.

Calculate the critical value tcritical from the t -distribution with
appropriate degrees of freedom.

Decision Rule for each β̂1 :
▸ If ∣t∣ > tcritical = t α/2,(n−2) , reject H0 and conclude that the corresponding
coefficient β̂1 is statistically significant.
▸ If ∣t∣ ≤ tcritical , fail to reject H0 and conclude that the corresponding
coefficient β̂1 is not statistically significant.


p -value Approach

Calculate the p -value associated with each calculated t -statistic.

Decision Rule for each β̂1 :
▸ If p -value < α, reject H0 and conclude that the corresponding
coefficient β̂1 is statistically significant.
▸ If p -value ≥ α, fail to reject H0 and conclude that the corresponding
coefficient β̂1 is not statistically significant.

Note that the p -value can be obtained by using the formula:

p -value = 2 × min(P(T < −∣t∣), P(T > ∣t∣))

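The corresponding t -test for β1 in Example 3.1 can be sketched as follows; note that in simple regression t 2 reproduces the F -statistic:

```python
import numpy as np

# t-test for H0: beta1 = 0 in Example 3.1, using the critical value
# t_{0.025, 8} = 2.306 quoted in the slides.
x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26], dtype=float)
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202], dtype=float)
n = len(x)
b0, b1 = 60.0, 5.0

sigma_hat = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
se_b1 = sigma_hat / np.sqrt(np.sum((x - x.mean()) ** 2))
t_stat = (b1 - 0) / se_b1
print(round(t_stat, 2), abs(t_stat) > 2.306)  # 8.62 True
print(round(t_stat**2, 2))                    # 74.25, the earlier F-statistic
```

Since ∣t∣ = 8.62 > 2.306, H0 ∶ β1 = 0 is rejected at the 5% level.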

Confidence Interval for E (Y ∣X = x)

The Confidence Interval for E (Y ∣X = x ∗ ) provides a range of values
where we expect the mean response y to lie for a given value of x ∗
with a certain level of confidence.

It is computed as:

ŷ ∗ ± tα/2,(n−2) ⋅ se(ŷ ∗ )

where ŷ ∗ is the predicted value of y for a given x ∗ , tα/2,(n−2) is the
critical value of the t -distribution, n is the number of observations, and

se(ŷ ∗ ) = σ̂ √( 1/n + (x ∗ − x̄)² / ∑ni=1 (xi − x̄)² )

is the standard error of the predicted value.

x ∗ is the specific value of the independent variable for which you're
predicting the response.

The confidence level is typically chosen to be 95% (α = 0.05).


Chapter 2: Regression Analysis: Simple Linear Regression Interval Estimation and Hypothesis Testing

For the Sultan's Dine restaurants sales dataset given in Example 3.1, we have
σ̂ = 13.829. With x ∗ = 10, x̄ = 14, and ∑(xi − x̄)² = 568,

se(ŷ ∗ ) = 13.829 √(1/10 + (10 − 14)²/568)
         = 13.829 √0.1282
         = 4.95

With ŷ ∗ = 110 and a margin of error of t0.025,8 × se(ŷ ∗ ) = 2.306 × 4.95
= 11.4147, the 95% confidence interval for the average quarterly sales of
Sultan's Dine restaurants located near campus for fixed x ∗ = 10 is

110 ± 11.4147

where tα/2,(n−2) = t0.025,8 = 2.306.
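This worked example can be reproduced numerically; the sketch below recomputes se(ŷ∗) and the interval from the raw Example 3.1 data used in this chapter's code examples:

```python
import numpy as np
from scipy import stats

# Example 3.1 data (student population x, quarterly sales y)
x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26], dtype=float)
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202], dtype=float)
n = len(x)

# Fitted line and residual standard deviation
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
sigma_hat = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))

# Confidence interval for the mean response at x* = 10
x_star = 10
y_hat = b0 + b1 * x_star
se_fit = sigma_hat * np.sqrt(1 / n + (x_star - x.mean()) ** 2
                             / np.sum((x - x.mean()) ** 2))
t_crit = stats.t.ppf(0.975, n - 2)
ci = (y_hat - t_crit * se_fit, y_hat + t_crit * se_fit)
```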

Chapter 2: Regression Analysis: Simple Linear Regression Interval Estimation and Hypothesis Testing

Confidence and prediction intervals for sales Y at given
values of student population X

Chapter 2: Regression Analysis: Simple Linear Regression Interval Estimation and Hypothesis Testing

Prediction Interval for an Individual Value of Y


The Prediction Interval for an Individual Value of Y provides a range
of values where we expect a new observation of Y to lie with a
certain level of confidence.
It is wider than the Confidence Interval for the mean response
E (Y ∣X = x ∗ ) because it accounts for the variability of individual
observations around the regression line.
It is computed as:

ŷ ∗ ± tα/2,(n−2) ⋅ spred

where

s²pred = σ̂ ² + se(ŷ ∗ )²   and hence   spred = σ̂ √( 1 + 1/n + (x ∗ − x̄)² / ∑ni=1 (xi − x̄)² )

is the standard error of the prediction, and ŷ ∗ is the predicted
value of Y for a given x ∗ .
The confidence level is typically chosen to be 95% (α = 0.05).
Chapter 2: Regression Analysis: Simple Linear Regression Interval Estimation and Hypothesis Testing

For the Sultan's Dine restaurants sales dataset given in Example 3.1, the
estimated standard deviation corresponding to the prediction of quarterly
sales for a new restaurant located near the campus with 10,000 students is
computed as follows:

spred = 13.829 √(1 + 1/10 + (10 − 14)²/568)
      = 13.829 √1.1282
      = 14.69

The 95% prediction interval for quarterly sales for Sultan's Dine
restaurants located near campus uses tα/2,(n−2) = t0.025,8 = 2.306. Thus, with
ŷ ∗ = 110 and a margin of error of t0.025,8 × spred = 2.306 × 14.69 = 33.875,
the 95% prediction interval is

110 ± 33.875
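A corresponding numerical check of the prediction interval, again from the Example 3.1 data:

```python
import numpy as np
from scipy import stats

# Example 3.1 data (student population x, quarterly sales y)
x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26], dtype=float)
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202], dtype=float)
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
sigma_hat = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))

# Prediction interval for a new observation at x* = 10
x_star = 10
y_hat = b0 + b1 * x_star
s_pred = sigma_hat * np.sqrt(1 + 1 / n + (x_star - x.mean()) ** 2
                             / np.sum((x - x.mean()) ** 2))
t_crit = stats.t.ppf(0.975, n - 2)
pi = (y_hat - t_crit * s_pred, y_hat + t_crit * s_pred)
```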

Chapter 2: Regression Analysis: Simple Linear Regression Interval Estimation and Hypothesis Testing

Chapter 2: Regression Analysis: Simple Linear Regression Real Data Example: Obstetrics Dataset

Example 3.3 (Obstetrics)

As discussed earlier in the Problem & Motivation section (Research


Problem 1), obstetricians sometimes order tests to measure estriol levels
from 24-hour urine specimens taken from pregnant women who are near
term because level of estriol has been found to be related to infant
birthweight. The test can provide indirect evidence of an abnormally small
fetus. Greene and Touchstone conducted a study to relate birthweight and
estriol level in pregnant women. The sample data are presented in Table 6.
They want to find whether there is any relationship between the estriol level
and birthweight. How can this relationship be quantified? What is the estimated
average birthweight if a pregnant woman has an estriol level of 15 mg/24 hr?

For the Obstetrics dataset in Example 3.3, we consider `birthweight' as
the dependent variable and `estriol' as the independent variable for the
problem because estriol levels are being used to try to predict birthweight.
The relationship between estriol level and birthweight can be quantified by
fitting a regression line that relates the two variables.

Chapter 2: Regression Analysis: Simple Linear Regression Real Data Example: Obstetrics Dataset

Table 6: Sample data from the Greene-Touchstone study relating birthweight and
estriol level in pregnant women near term


Chapter 2: Regression Analysis: Simple Linear Regression Real Data Example: Obstetrics Dataset


Figure 6: Data from the Greene-Touchstone study relating birthweight and estriol
level in pregnant women near term

Chapter 2: Regression Analysis: Simple Linear Regression Real Data Example: Obstetrics Dataset

To explore the relationship between estriol levels and birthweight, we


created a scatter plot, as shown in Figure 6. If x= estriol level and y=
birthweight, then we can postulate a linear relationship between y and x
that is of the following form:

E (y ∣x) = β0 + β1 x

where E (y ∣x) = expected or average birthweight (y ) among women with a


given estriol level (x ). That is, for a given estriol-level x, the average
birthweight E (y ∣x) = β0 + β1 x .

Chapter 2: Regression Analysis: Simple Linear Regression Real Data Example: Obstetrics Dataset

Let's assume e follows a normal distribution with mean 0 and variance σ ².
The full linear-regression model then takes the following form:

y = β0 + β1 x + e

For the data given in Table 6, we have

∑31i=1 xi = 534,   ∑31i=1 xi² = 9876,   ∑31i=1 yi = 992,   ∑31i=1 xi yi = 17500

For computing the slope and intercept of the regression line, we consider
the least squares estimators of β0 and β1 . These are

β̂1 = ( ∑ni=1 xi yi − nx̄ ȳ ) / ( ∑ni=1 xi² − nx̄ ² ) = 0.608   and   β̂0 = ȳ − β̂1 x̄ = 21.52
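A quick check of these estimates from the quoted summary statistics, in plain Python:

```python
# β̂1 and β̂0 recomputed from the summary statistics quoted above
n = 31
sum_x, sum_x2, sum_y, sum_xy = 534, 9876, 992, 17500
xbar, ybar = sum_x / n, sum_y / n

b1 = (sum_xy - n * xbar * ybar) / (sum_x2 - n * xbar ** 2)
b0 = ybar - b1 * xbar
```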

Chapter 2: Regression Analysis: Simple Linear Regression Real Data Example: Obstetrics Dataset

Thus, the regression line is given by

ŷ = 21.52 + 0.608x

This regression line is shown in Figure 6. The slope of 0.608 tells us that
the predicted y increases by about 0.61 units per 1 mg/24 hr. Thus, the
predicted birthweight increases by 61 g for every 1 mg/24 hr increase in
estriol.

Chapter 2: Regression Analysis: Simple Linear Regression Real Data Example: Obstetrics Dataset

What is the estimated average birthweight if a pregnant woman has


an estriol level of 15 mg/24 hr?
For x ∗ = 15, the estimated average birthweight (in units of 100 g) is
ŷ = 21.52 + 0.608(15) = 30.64, i.e., about 3064 g.

Conversely, if ŷ = 2500/100 = 25, then x can be obtained from the equation

25 = 21.52 + 0.608x

or
x = (25 − 21.52)/0.608 = 5.72
Thus, if a woman has an estriol level of 5.72 mg/24 hr, then the
predicted birthweight is 2500 g. Furthermore, the predicted infant
birthweight for all women with estriol levels of ≤5 mg/24 hr is
< 2500g (assuming estriol can only be measured in increments of 1
mg/24 hr). This level could serve as a critical value for identifying
high-risk women and trying to prolong their pregnancies.

Chapter 2: Regression Analysis: Simple Linear Regression Real Data Example: Obstetrics Dataset

Regression Parameter Confidence Intervals

Standard errors and 95% confidence intervals for the regression
parameters of the birthweight-estriol data given in Example 3.3:

▸ From the data, we have

se(β̂1 ) = √( σ̂ ² / ∑ni=1 (xi − x̄)² ) = √( 14.60 / 677.42 ) = 0.147

▸ a 95% confidence interval for β1 is obtained from

0.608 ± t29,0.025 × 0.147 = 0.608 ± 2.045(0.147)
= 0.608 ± 0.300 = (0.308, 0.908)

▸ a 95% confidence interval for β0 is obtained from
▸ ∑31i=1 xi = 534, x̄ = 17.23
▸ Standard error of β̂0 : se(β̂0 ) = 2.62
▸ 95% confidence interval for β0 : 21.52 ± t29,0.025 (2.62) = (16.16, 26.88)

These intervals are rather wide due to the small sample size.
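A sketch of this interval computation from the quoted quantities; SciPy's t quantile function supplies the critical value:

```python
import math
from scipy import stats

# Quantities quoted above for the birthweight-estriol data
sigma2_hat, sxx, b1, df = 14.60, 677.42, 0.608, 29

se_b1 = math.sqrt(sigma2_hat / sxx)   # standard error of the slope
t_crit = stats.t.ppf(0.975, df)       # two-sided 95% critical value
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
```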

Chapter 2: Regression Analysis: Simple Linear Regression Real Data Example: Obstetrics Dataset
Assess the significant effect of estriol level on birthweight: that is, test the
hypothesis H0 ∶ β1 = 0.
Thus,

tcal = β̂1 /se(β̂1 ) = 0.608/0.147 = 4.14 ∼ t29 under H0

We find t29,0.025 = 2.045. Since ∣tcal ∣ > t29,0.025 , we reject H0 .

We already have a 95% confidence interval for β1 : (0.308, 0.908).
What is your conclusion?

To determine whether to reject the null hypothesis H0 ∶ β1 = 0 based
on the confidence interval (0.308, 0.908), we need to check if the
interval contains the value 0.

Since the confidence interval (0.308, 0.908) does not contain the
value 0, we can reject the null hypothesis H0 ∶ β1 = 0 at the 0.05
significance level. This means that there is evidence to suggest that
the slope coefficient β1 is not equal to zero in the regression model.

Chapter 2: Regression Analysis: Simple Linear Regression Residual analysis

Residual Analysis

Residuals are the differences between the observed values of the dependent
variable and the values predicted by the regression model. Residual
analysis is a critical component of regression analysis as it helps to
determine whether the assumptions made about the regression model
appear to be valid. Key Aspects of Residual Analysis:

1 Linearity: Plot the residuals against the predicted values. A random


scatter around zero suggests a linear relationship between the
dependent and independent variables. Any systematic patterns (e.g.,
curves or clusters) indicate potential issues with linearity.

2 Constant Variance (Homoscedasticity): Plot the residuals against


the predicted values. The spread of residuals should remain roughly
constant across all levels of the independent variable. Unequal spread
or patterns in the residuals suggest heteroscedasticity, violating the
assumption of constant variance.

Chapter 2: Regression Analysis: Simple Linear Regression Residual analysis

3 Normality: Examine the distribution of residuals using a histogram or


a Q-Q plot. Ideally, residuals should follow a normal distribution,
allowing for reliable inference. Departures from normality may
indicate potential issues, such as skewness or heavy tails, which could
affect the validity of statistical tests.

4 Influential Points: residual analysis is also used to identify outliers
and influential observations

high leverage points:

hi = 1/n + (xi − x̄)² / ∑i (xi − x̄)² ;   i = 1, ..., n

These points can significantly affect parameter estimates and may
warrant further investigation.
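The leverage formula can be evaluated directly; a minimal sketch with illustrative x values from this chapter's code examples:

```python
import numpy as np

# Leverage values h_i for a simple linear regression
x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26], dtype=float)
n = len(x)

h = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)
# the leverages always sum to the number of estimated parameters (2 here)
```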

Chapter 2: Regression Analysis: Simple Linear Regression Residual analysis

5 Addressing Issues: If residual analysis reveals any discrepancies from


the assumptions of the regression model, corrective actions may be
necessary. This could involve transforming variables, removing
outliers, using robust regression techniques, or exploring alternative
model specifications.

Chapter 2: Regression Analysis: Simple Linear Regression Residual analysis

Influential Observations, Outliers, and Cook's Distance

Influential Observations

Influential observations are data points that have a large impact on the
estimated coefficients of the regression model. They can significantly alter
the fit of the model if removed. Influential observations are identified using
Cook's distance, where observations with Cook's distance greater than 4/n
(where n is the number of observations) are considered influential.

Outliers

Outliers are data points that deviate significantly from the rest of the
data. They can affect the regression model's accuracy and should be
investigated to determine if they are genuine data points or errors.
Outliers are identified by selecting observations with Cook's distance
greater than a certain threshold (here, 4/n).

Chapter 2: Regression Analysis: Simple Linear Regression Residual analysis

Cook's Distance

Cook's Distance

Cook's distance measures the influence of each observation on the fitted
values of the model. It combines the effect of leverage and residual to
determine the influence of each data point.

Di = ∑nj=1 (ŷj − ŷj(i) )² / (p ⋅ MSE)

where ŷj is the j th fitted value, ŷj(i) is the j th fitted value with the i th
observation removed, p is the number of estimated parameters in the model
(including the intercept), and MSE is the mean squared error of the model.
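A direct (brute-force) implementation of this definition, refitting the model with each observation deleted in turn; the data are the simple-regression example used elsewhere in this chapter, and p is taken as the number of estimated parameters:

```python
import numpy as np

x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26], dtype=float)
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202], dtype=float)
n, p = len(x), 2                    # p = number of estimated parameters

X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
fitted = X @ beta
mse = np.sum((y - fitted) ** 2) / (n - p)

# Delete each observation in turn and measure the shift in all fitted values
D = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    D[i] = np.sum((fitted - X @ beta_i) ** 2) / (p * mse)
```

In practice, statsmodels' `OLSInfluence.cooks_distance` (used later in this chapter) computes the same quantity without refitting, via a leverage-based identity.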

Chapter 2: Regression Analysis: Simple Linear Regression Residual analysis

Python Code: OLS for Linear Regression Model

# Simple Linear Regression Model
# (assumes the Boston dataset is already loaded in a DataFrame named `boston`)
import statsmodels.api as sm

model = sm.OLS(boston.MEDV, sm.add_constant(boston.LSTAT))
result = model.fit()
print(result.summary())

Chapter 2: Regression Analysis: Simple Linear Regression R Code: Linear Regression Model

R Code: Linear Regression Model


## Linear Regression Model
#data<-read.table(file.choose(),sep ="", header=T)
# Create the data frame
data <- data.frame(
x = c(2, 6, 8, 8, 12, 16, 20, 20, 22, 26),
y = c(58, 105, 88, 118, 117, 137, 157, 169, 149, 202)
)

# Perform simple linear regression


model <- lm(y ~ x, data = data)

# Print summary of the regression model


summary(model)

Chapter 2: Regression Analysis: Simple Linear Regression R Code: Linear Regression Model

R Code: Residual Analysis


# Get residuals
residuals <- resid(model)

# Residual plot
plot(model$fitted.values, residuals,
xlab = "Fitted values",
ylab = "Residuals",
main = "Residual Plot")
abline(h = 0, col = "red", lty = 2)  # Add a horizontal line at y = 0

# Histogram of residuals
hist(residuals,
main = "Histogram of Residuals",
xlab = "Residuals",
ylab = "Frequency",
col = "lightblue")

# QQ plot of residuals
qqnorm(residuals)
qqline(residuals)
title("QQ Plot of Residuals")

Chapter 2: Regression Analysis: Simple Linear Regression R Code: Linear Regression Model

# Residual plot
plot(model, which = 1)

# Get influence measures
infl <- influence(model)

# Leverage points
leverage <- infl$hat

# Influential observations (Cook's distance must be computed before plotting)
cook_dist <- cooks.distance(model)

# Outliers
outliers <- which(cook_dist > 4 / length(data$y))

# Cook's distance plot
plot(cook_dist, pch = 20, main = "Cook's Distance Plot",
     xlab = "Observation", ylab = "Cook's Distance")
abline(h = 4 / length(data$y), col = "red", lty = 2)  # threshold line

Chapter 2: Regression Analysis: Simple Linear Regression R Code: Linear Regression Model

# Print leverage points, influential observations, and outliers


print("Leverage points:")
print(which(leverage > mean(leverage) + 2 * sd(leverage)))

print("Influential observations:")
print(which(cook_dist > 4 / length(data$y)))

print("Outliers:")
print(outliers)

Chapter 2: Regression Analysis: Simple Linear Regression Python Code: Linear Regression Model

Python Code: Linear Regression Model


import numpy as np
import statsmodels.api as sm

# Define the data


x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26])
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202])

# Add constant for intercept


x_with_const = sm.add_constant(x)

# Fit the model


model = sm.OLS(y, x_with_const).fit()

# Print summary of the regression model


print(model.summary())

Chapter 2: Regression Analysis: Simple Linear Regression Python Code: Linear Regression Model

Python Code: Residual Analysis


import matplotlib.pyplot as plt

# Get residuals
residuals = model.resid

# Residual plot
plt.scatter(model.predict(), residuals)
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.axhline(y=0, color='r', linestyle='--')  # Add a horizontal line at y = 0
plt.show()

# Histogram of residuals
plt.hist(residuals, bins=10)
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Histogram of Residuals')
plt.show()

# QQ plot of residuals
sm.qqplot(residuals, line ='45')
plt.title('QQ Plot of Residuals')
plt.show()

Chapter 2: Regression Analysis: Simple Linear Regression Python Code: Linear Regression Model

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import OLSInfluence

# Define the data


x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26])
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202])

# Fit the linear regression model


X_with_const = sm.add_constant(x)
model = sm.OLS(y, X_with_const).fit()

# Get influence measures


influence = OLSInfluence(model)

# Leverage points
leverage = influence.hat_matrix_diag

Chapter 2: Regression Analysis: Simple Linear Regression Python Code: Linear Regression Model
# Influential observations
cook_dist = influence.cooks_distance[0]
print(cook_dist)
# Outliers
outliers = np.where(cook_dist > 4 / len(y))[0]

# Print leverage points, influential observations, and outliers


print("Leverage points:")
print(np.where(leverage > np.mean(leverage) + 2 *
np.std(leverage))[0])

print("Influential observations:")
print(np.where(cook_dist > 4 / len(y))[0])

print("Outliers:")
print(outliers)

# Get influence measures


influence = OLSInfluence(model)

# Residual plot
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.scatter(model.fittedvalues, model.resid)
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.axhline(y=0, color='red', linestyle='--')

Chapter 2: Regression Analysis: Simple Linear Regression Python Code: Linear Regression Model

# Cook's distance plot


plt.subplot(1, 2, 2)
plt.scatter(np.arange(len(cook_dist)), cook_dist, marker='o',
color='blue')
plt.axhline(y=4 / len(y), color='red', linestyle='--')
plt.xlabel('Observation')
plt.ylabel("Cook's Distance")
plt.title("Cook's Distance Plot")

plt.tight_layout()
plt.show()

Chapter 3: Multiple Linear Regression Model

Chapter 3: Multiple Linear Regression Model



Chapter 3: Multiple Linear Regression Model

4 Chapter 3: Multiple Linear Regression Model

4.1 Problems & Motivation


4.2 Estimation Procedure
4.3 Estimation Procedure of Error Variance
4.4 Coefficient of Determination
4.5 Adjusted R 2

4.6 Example
4.7 F-test in Multiple Regression
4.8 ANOVA Table in Regression Analysis
4.9 t -tests in Multiple Regression
4.10 Real Data Example: Hypertension Dataset
4.11 Python Code: Hypertension Dataset
Chapter 3: Multiple Linear Regression Model Problems & Motivation

Problem & Motivation


Research Problem 2

Suppose age (days), birthweight (oz), and SBP are measured for 16
infants and the data are as shown in Table 7. What is the relationship
between infant systolic blood pressure (SBP) and their age and
birthweight? Can we predict SBP based on these factors?

Table 7: Sample data for infant blood pressure, age, and birthweight for 16 infants

i   Age (days) (x1 )   Birthweight (oz) (x2 )   SBP (mm Hg) (y )

1 3 135 89
2 4 120 90
3 3 100 83
4 2 105 77
5 4 130 92

Chapter 3: Multiple Linear Regression Model Problems & Motivation

Table 7 (continue)

i   Age (days) (x1 )   Birthweight (oz) (x2 )   SBP (mm Hg) (y )

6 5 125 98
7 2 125 82
8 3 105 85
9 5 120 96
10 4 90 95
11 2 120 80
12 3 95 79
13 3 120 86
14 4 150 97
15 3 160 92
16 3 125 88

Chapter 3: Multiple Linear Regression Model Problems & Motivation

Problem & Motivation

Research Problem 3

Let's delve into the exploration of the dataset. We'll utilize the Boston
House Prices Dataset, consisting of 506 rows and 13 attributes,
including a target column.

Our objective is to forecast the median price value of owner-occupied


homes. How do you predict the price?

How can we apply multiple regression to predict the price based on various
attributes? Let's take a quick look at the dataset.

Chapter 3: Multiple Linear Regression Model Problems & Motivation
Crim: Per capita crime rate by town
Zn: Proportion of residential land zoned for lots over 25,000 sq. ft.
Indus: Proportion of non-retail business acres per town
Chas: Charles River dummy variable (= 1 if tract bounds river; 0,
otherwise)

Nox: Nitrogen oxides concentration (parts per 10 million)


Rm: Average number of rooms per dwelling
Age: Proportion of owner-occupied units built before 1940

Dis: Weighted mean of distances to five Boston employment centers
Rad: Index of accessibility to radial highways
Tax: Full-value property tax rate per $10,000
Ptratio: Pupil-teacher ratio by town
B: 1000(Bk − 0.63)², where Bk is the proportion of Blacks by town
Lstat: Lower status of the population (percent)
Medv: Median Price value of owner-occupied homes in $1000s
Chapter 3: Multiple Linear Regression Model Problems & Motivation

Python Code

from google.colab import files
import io
import pandas as pd

uploaded = files.upload()
boston = pd.read_csv(io.BytesIO(uploaded['boston.csv']))
# Dataset is now stored in a Pandas DataFrame
boston.head()
import matplotlib.pyplot as plt
boston.plot(x='LSTAT',y='MEDV',style='o')
plt.xlabel('LSTAT')
plt.ylabel('MEDV')
plt.show()

Chapter 3: Multiple Linear Regression Model Problems & Motivation

Multiple Regression Model

The multiple regression model is given by:

yi = β0 + β1 x1i + β2 x2i + . . . + βp xpi + ei

where:

yi is the dependent variable (response variable) for the i th observation.

x1i , x2i , . . . , xpi are the independent variables for the i th observation.

β0 is the intercept term.

β1 , β2 , . . . , βp are the coecients corresponding to the independent


variables x1i , x2i , . . . , xpi .

ei is the error term or residual for the i th observation.

Chapter 3: Multiple Linear Regression Model Problems & Motivation

Model Assumptions

Under the multiple regression model

yi = β0 + β1 x1i + β2 x2i + . . . + βp xpi + ei

the following assumptions are made:

1 Linearity: The relationship between the dependent variable yi and
the independent variables x1i , x2i , . . . , xpi is linear.
2 Independence of Errors: The error terms ei are independent of each
other.

3 Homoscedasticity: The variance of the error terms ei is constant for


all values of the independent variables.

4 Normality of Errors: The error terms ei are normally distributed


with mean zero.

Chapter 3: Multiple Linear Regression Model Problems & Motivation

5 No Perfect Multicollinearity: There is no perfect linear relationship


among the independent variables.

6 No Autocorrelation: The errors (e ) are not correlated with each


other over time or across observations.
These assumptions are essential for valid estimation and interpretation of
the classical regression model.

Chapter 3: Multiple Linear Regression Model Estimation Procedure

Estimation Procedure by OLS in Matrix Notation


To estimate the parameters β0 , β1 , . . . , βp in the multiple regression model

yi = β0 + β1 x1i + β2 x2i + . . . + βp xpi + ei

using ordinary least squares (OLS) in matrix notation, we define

Y⃗ = (y1 , y2 , . . . , yn )ᵀ ,   β⃗ = (β0 , β1 , . . . , βp )ᵀ ,   e⃗ = (e1 , e2 , . . . , en )ᵀ ,

and X as the n × (p + 1) design matrix whose i th row is (1, x1i , x2i , . . . , xpi ).
In this case, the model can be written as

Y⃗ = X β⃗ + e⃗

To find the estimated coefficients β̂ of β⃗ using ordinary least squares
(OLS), we minimize the sum of squared residuals, where the residual vector is

e⃗ = Y⃗ − X β̂
Chapter 3: Multiple Linear Regression Model Estimation Procedure

Calculation of Estimated Coefficients

The sum of squared residuals is given by:

SSE = e⃗ᵀ e⃗ = (Y⃗ − X β̂)ᵀ (Y⃗ − X β̂)

To minimize SSE, we take the derivative with respect to β̂ and set it to
zero:

∂ SSE/∂ β̂ = −2X ᵀ (Y⃗ − X β̂) = 0

Solving for β̂, we get:

β̂ = (X ᵀ X )⁻¹ X ᵀ Y⃗

This equation provides the estimated coefficients β̂ that minimize the sum
of squared residuals.
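A minimal NumPy sketch of this closed-form solution, using the small 10-observation example dataset from this chapter:

```python
import numpy as np

# Design matrix with intercept column, and response vector
X = np.array([[1, 9, 16], [1, 13, 14], [1, 11, 10], [1, 11, 8],
              [1, 14, 11], [1, 15, 17], [1, 16, 9], [1, 20, 16],
              [1, 15, 12], [1, 15, 12]], dtype=float)
Y = np.array([10, 12, 14, 16, 18, 20, 22, 24, 26, 28], dtype=float)

# Closed-form OLS estimate: beta_hat = (X'X)^{-1} X'Y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y
# in practice np.linalg.lstsq(X, Y, rcond=None) is numerically preferable
```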

Chapter 3: Multiple Linear Regression Model Estimation Procedure of Error Variance

Estimation Procedure of Error Variance


To estimate the error variance σ̂ ², we use the following formula:

σ̂ ² = ( 1/(n − p − 1) ) (Y⃗ − X β̂)ᵀ (Y⃗ − X β̂)

where:

β̂ is the vector of estimated coefficients obtained using ordinary least
squares (OLS).

n is the number of observations.

p is the number of independent variables (excluding the intercept).

The term (Y⃗ − X β̂)ᵀ (Y⃗ − X β̂) represents the sum of squared residuals,
which measures the unexplained variability in the dependent variable after
accounting for the effects of the independent variables. Dividing by
n − p − 1, the degrees of freedom for the error term, provides an unbiased
estimate of the error variance σ̂ ².
Chapter 3: Multiple Linear Regression Model Coecient of Determination

Coefficient of Determination (R ²)

The coefficient of determination (R ²) measures the proportion of the
variance in the dependent variable (y ) that is explained by the
independent variables (x1 , x2 , . . . , xp ) in the regression model.

R ² = 1 − SSE/SST = 1 − ∑ni=1 (yi − ŷi )² / ∑ni=1 (yi − ȳ )²

where:

SSE is the sum of squared errors (residuals), representing the
unexplained variability in the dependent variable.

SST is the total sum of squares, representing the total variability in
the dependent variable.
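A short numerical illustration with the simple-regression data from this chapter's code examples:

```python
import numpy as np

# R² for a fitted simple regression
x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26], dtype=float)
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202], dtype=float)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sse = np.sum((y - y_hat) ** 2)   # unexplained variability
sst = np.sum((y - y.mean()) ** 2)  # total variability
r2 = 1 - sse / sst
```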

Chapter 3: Multiple Linear Regression Model Coecient of Determination

Interpretation:

R2 ranges from 0 to 1, where a higher value indicates a better t of


the regression model to the data.

R2 represents the proportion of the variance in the dependent variable


that is explained by the independent variables.

For example, if R ² = 0.75, it means that 75% of the variance in the dependent variable is explained by the independent variables.

Chapter 3: Multiple Linear Regression Model Adjusted R 2

Adjusted R ²

The adjusted R ² is denoted by R̄ ² and is defined as

R̄ ² = 1 − (1 − R ²) (n − 1)/(n − p − 1)

where p is the total number of explanatory variables in the model
(not including the constant term), and n is the sample size.

It can also be written as:

R̄ ² = 1 − ( SSE /dfe ) / ( SST /dft )

where dft is the degrees of freedom, n − 1, of the estimate of the
population variance of the dependent variable, and dfe is the degrees
of freedom, n − p − 1, of the estimate of the underlying population error
variance.

Chapter 3: Multiple Linear Regression Model Adjusted R 2

Hence,

R̄ ² = 1 − ( SSE /(n − p − 1) ) / ( SST /(n − 1) )

The explanation of this statistic is almost the same as for R ², but it
penalizes the statistic as extra variables are included in the model.

The term (n − 1)/(n − p − 1) is called the penalty for using more
regressors in a model.

When the number of regressors p increases, (1 − R ²) will decrease,
but (n − 1)/(n − p − 1) will increase.

Whether more regressors improve the explanatory power of a model
depends on the trade-off between R ² and the penalty.
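A short illustration of the penalty in plain Python; the values are chosen to match a simple regression with n = 10 and p = 1, and the R² value is illustrative:

```python
# Adjusted R² from R², sample size n, and number of predictors p
n, p, r2 = 10, 1, 0.9027

r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
# the penalty factor (n - 1)/(n - p - 1) grows as more regressors are added
```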

Chapter 3: Multiple Linear Regression Model Adjusted R 2

Interpretation:

R̄² penalizes the addition of unnecessary predictors to the model, unlike R².

It is always less than or equal to R², and it increases only if the new term improves the model more than would be expected by chance.

Therefore, R̄² is often preferred for comparing the goodness of fit of models with different numbers of predictors.

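As a quick numeric sketch of the adjusted-R² formula (using R² = 0.6378, n = 10, p = 2, the values from the worked example later in this chapter):

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R^2: penalizes R^2 as regressors are added (p predictors, n observations)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(round(adjusted_r2(0.6378, n=10, p=2), 4))  # 0.5343
```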
Chapter 3: Multiple Linear Regression Model Example

Example Dataset and Regression Calculations

Consider a dataset with 10 observations and two independent variables (x1 and x2).
 i   x1i  x2i   yi
 1    9   16   10
 2   13   14   12
 3   11   10   14
 4   11    8   16
 5   14   11   18
 6   15   17   20
 7   16    9   22
 8   20   16   24
 9   15   12   26
10   15   12   28

To fit a multiple regression model, we use the least squares method to estimate the coefficients β0, β1, and β2. The model equation is:

    yi = β0 + β1 x1i + β2 x2i + ei ;  i = 1, 2, . . . , 10
Calculation of β̂

For the given dataset, we have:

        ⎡ 1   9  16 ⎤        ⎡ 10 ⎤
        ⎢ 1  13  14 ⎥        ⎢ 12 ⎥
        ⎢ 1  11  10 ⎥        ⎢ 14 ⎥
        ⎢ 1  11   8 ⎥        ⎢ 16 ⎥
        ⎢ 1  14  11 ⎥        ⎢ 18 ⎥
    X = ⎢ 1  15  17 ⎥ ; Y⃗ = ⎢ 20 ⎥ .
        ⎢ 1  16   9 ⎥        ⎢ 22 ⎥
        ⎢ 1  20  16 ⎥        ⎢ 24 ⎥
        ⎢ 1  15  12 ⎥        ⎢ 26 ⎥
        ⎣ 1  15  12 ⎦        ⎣ 28 ⎦

Now, let's calculate XᵀX, (XᵀX)⁻¹, XᵀY⃗, and β̂.

Calculation of XᵀX

           ⎡ 1   1   1  ⋯   1 ⎤   ⎡ 1   9  16 ⎤
    XᵀX =  ⎢ 9  13  11  ⋯  15 ⎥ · ⎢ ⋮   ⋮   ⋮ ⎥
           ⎣16  14  10  ⋯  12 ⎦   ⎣ 1  15  12 ⎦

           ⎡  10   139   125 ⎤
        =  ⎢ 139  2019  1757 ⎥
           ⎣ 125  1757  1651 ⎦

Calculation of β̂

Using matrix inversion, we find:

                      1                    ⎡  3.369  −0.135  −0.112 ⎤
    (XᵀX)⁻¹ = ─────────── · adj(XᵀX)  =    ⎢ −0.135   0.012  −0.003 ⎥
               det(XᵀX)                    ⎣ −0.112  −0.003   0.012 ⎦

Now, we calculate

                          ⎡  2.821 ⎤
    β̂ = (XᵀX)⁻¹ XᵀY⃗  =    ⎢  1.591 ⎥
                          ⎣ −0.475 ⎦

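The same normal-equations calculation can be reproduced with NumPy (a sketch; `np.linalg.solve` avoids forming the inverse explicitly):

```python
import numpy as np

x1 = [9, 13, 11, 11, 14, 15, 16, 20, 15, 15]
x2 = [16, 14, 10, 8, 11, 17, 9, 16, 12, 12]
y = np.array([10, 12, 14, 16, 18, 20, 22, 24, 26, 28])

# Design matrix: a column of ones (intercept) followed by x1 and x2
X = np.column_stack([np.ones(len(x1)), x1, x2])

XtX = X.T @ X            # matches the 3x3 matrix computed above
beta_hat = np.linalg.solve(XtX, X.T @ y)
print(beta_hat)          # close to (2.821, 1.591, -0.475)
```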
Calculation of Error Variance Estimation

  i    x1i  x2i   yi     ŷi      (yi − ŷi)²
  1     9   16   10     9.542     0.209764
  2    13   14   12    16.856    23.58074
  3    11   10   14    15.573     2.474329
  4    11    8   16    16.523     0.273529
  5    14   11   18    19.871     3.500641
  6    15   17   20    18.613     1.923769
  7    16    9   22    24.003     4.012009
  8    20   16   24    27.043     9.259849
  9    15   12   26    20.988    25.12014
 10    15   12   28    20.988    49.16814
Total  139  125  190   190      119.5229

The error variance estimation, σ̂², is calculated as:

    σ̂² = (1 / (n − p − 1)) · Σⁿᵢ₌₁ (yi − ŷi)² = 119.5229 / (10 − 2 − 1) = 17.0747

where n is the number of observations, p is the number of predictors, yi is the observed value, and ŷi is the predicted value.

Coefficient of determination:

    R² = 1 − SSE/SST = 1 − 119.5229/330 = 0.6378

    R̄² = 1 − (1 − R²)(n − 1) / (n − p − 1) = 1 − (1 − 0.6378)(10 − 1) / (10 − 2 − 1) = 0.5343

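These fit statistics can be computed directly from the residuals with NumPy (a sketch using the example data):

```python
import numpy as np

x1 = [9, 13, 11, 11, 14, 15, 16, 20, 15, 15]
x2 = [16, 14, 10, 8, 11, 17, 9, 16, 12, 12]
y = np.array([10, 12, 14, 16, 18, 20, 22, 24, 26, 28])
X = np.column_stack([np.ones(len(x1)), x1, x2])

n, p = len(y), 2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat                       # residuals y_i - yhat_i

sse = resid @ resid                            # sum of squared errors, ~119.5
sst = np.sum((y - y.mean()) ** 2)              # total sum of squares, 330
sigma2_hat = sse / (n - p - 1)                 # error variance estimate, ~17.07
r2 = 1 - sse / sst                             # ~0.638
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)  # ~0.534
```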
Goodness-of-fit: R² and Adjusted R²

The coefficient of determination R² = 0.6378 suggests that approximately 63.78% of the variance in the dependent variable (y) can be explained by the independent variables x1 and x2 included in the model.

The Adjusted R² value takes into account the number of predictors and the sample size, providing a more conservative estimate of the model's goodness-of-fit. In this case, R̄² = 0.5343, indicating that approximately 53.43% of the variance in the dependent variable (y) is explained by the independent variables x1 and x2 after adjusting for the number of predictors and the sample size.

Chapter 3: Multiple Linear Regression Model F-test in Multiple Regression

F-test in Multiple Regression

The F-test in multiple regression assesses the overall significance of the regression model.

It tests whether at least one of the independent variables has a non-zero coefficient.

The null hypothesis H0 for the F-test is:

    H0 : β1 = β2 = ⋯ = βp = 0

where β1, β2, . . . , βp are the coefficients of the independent variables.

Chapter 3: Multiple Linear Regression Model ANOVA Table in Regression Analysis

ANOVA Table in Multiple Regression

The ANOVA table in multiple regression assesses the overall significance of the regression model.

It partitions the total variance in the dependent variable into explained variance and unexplained variance.

The table includes sums of squares (SS), degrees of freedom (df), mean squares (MS), and the F-test statistic.

Source of Variation    SS    df          MS                      F
Regression             SSR   p           MSR = SSR / p           F = MSR / MSE
Residual (Error)       SSE   n − p − 1   MSE = SSE / (n − p − 1)
Total                  SST   n − 1

Reject the null hypothesis H0 if the calculated F-statistic is greater than the critical value from the F-distribution.

Decision Rule for ANOVA F-test

Classical Approach

Set the significance level α.

Calculate the critical value Fcritical from the F-distribution with appropriate degrees of freedom.

Decision Rule:
▸ If calculated F > Fcritical, reject H0 and conclude that the regression model is statistically significant.
▸ If calculated F ≤ Fcritical, fail to reject H0 and conclude that the regression model is not statistically significant.

Decision Rule for ANOVA F-test (Cont'd)

p-value Approach

Calculate the p-value associated with the calculated F-statistic.

Decision Rule:
▸ If p-value < α, reject H0 and conclude that the regression model is statistically significant.
▸ If p-value ≥ α, fail to reject H0 and conclude that the regression model is not statistically significant.

Example: ANOVA Table

              df   SS         MS         F        Significance F
Regression    2    210.4651   105.2326   6.1625   0.0286
Residual      7    119.5349   17.0764
Total         9    330

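The F-statistic and its p-value in this table can be verified with SciPy (a sketch using the SS values above):

```python
from scipy.stats import f

n, p = 10, 2
ssr, sse = 210.4651, 119.5349

msr = ssr / p                          # mean square for regression
mse = sse / (n - p - 1)                # mean square error
F = msr / mse
p_value = f.sf(F, p, n - p - 1)        # upper-tail area of F(2, 7)

print(round(F, 4), round(p_value, 4))  # close to 6.1625 and 0.0286
```

Since the p-value is below 0.05, the overall regression is statistically significant at the 5% level.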
Chapter 3: Multiple Linear Regression Model The t-tests in Multiple Regression

The t-tests in Multiple Regression

The t-tests in multiple regression assess the significance of individual coefficients (parameters) in the model.

Each t-test tests the null hypothesis that the corresponding coefficient is zero.

The t-test statistic for the coefficient β̂i is calculated as:

    t = β̂i / SE(β̂i)

where β̂i is the estimated coefficient, and SE(β̂i) is its standard error.

Decision Rule for t-tests in Multiple Regression

Classical Approach

Set the significance level α.

Calculate the critical value tcritical from the t-distribution with appropriate degrees of freedom.

Decision Rule for each β̂i:
▸ If |t| > tcritical, reject H0 and conclude that the corresponding coefficient β̂i is statistically significant.
▸ If |t| ≤ tcritical, fail to reject H0 and conclude that the corresponding coefficient β̂i is not statistically significant.

Decision Rule for t-tests in Multiple Regression (Cont'd)

p-value Approach

Calculate the p-value associated with each calculated t-statistic.

Decision Rule for each β̂i:
▸ If p-value < α, reject H0 and conclude that the corresponding coefficient β̂i is statistically significant.
▸ If p-value ≥ α, fail to reject H0 and conclude that the corresponding coefficient β̂i is not statistically significant.

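For the worked example dataset, the per-coefficient t-statistics and two-sided p-values can be sketched as follows (SE(β̂i) is the square root of the i-th diagonal entry of σ̂²(XᵀX)⁻¹):

```python
import numpy as np
from scipy.stats import t as t_dist

x1 = [9, 13, 11, 11, 14, 15, 16, 20, 15, 15]
x2 = [16, 14, 10, 8, 11, 17, 9, 16, 12, 12]
y = np.array([10, 12, 14, 16, 18, 20, 22, 24, 26, 28])
X = np.column_stack([np.ones(len(x1)), x1, x2])
n, p = len(y), 2

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p - 1)

# Standard errors from the diagonal of sigma^2 * (X'X)^-1
se = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(X.T @ X)))
t_stats = beta_hat / se
p_values = 2 * t_dist.sf(np.abs(t_stats), n - p - 1)
```

With these data the slope on x1 comes out clearly significant at α = 0.05, while the coefficient on x2 does not, consistent with the relative magnitudes of β̂ above.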
Chapter 3: Multiple Linear Regression Model Real Data Example: Hypertension Dataset

Example 4.1

As discussed earlier in the Problem & Motivation section (Research Problem 2), age (days), birthweight (oz), and SBP are measured for 16 infants and the data are as shown in Table 7. What is the relationship between infant systolic blood pressure (SBP) and their age and birthweight? Can we predict SBP based on these factors?

Solution of the Research Problem 2

The multiple regression model for this problem is

    yi = β0 + β1 x1i + β2 x2i + ei ;  i = 1, 2, . . . , 16

where

y is the SBP of the infants
x1 is the age of the infants
x2 is the birthweight (oz) of the infants
e is the error term

According to the parameter-estimate column in the Python output, the fitted regression equation is

    ŷi = 53.45 + 5.89 x1i + 0.126 x2i ;  i = 1, 2, . . . , 16

The regression equation tells us that for a newborn, the average blood pressure increases by an estimated 5.89 mm Hg per day of age and 0.126 mm Hg per ounce of birthweight.
Chapter 3: Multiple Linear Regression Model Python Code: Hypertension Dataset

Python Code: Hypertension Dataset


import pandas as pd
import statsmodels.api as sm

# Creating the dataframe from the provided data
data = {
    'Age': [3, 4, 3, 2, 4, 5, 2, 3, 5, 4, 2, 3, 3, 4, 3, 3],
    'Birthweight': [135, 120, 100, 105, 130, 125, 125, 105, 120,
                    90, 120, 95, 120, 150, 160, 125],
    'SBP': [89, 90, 83, 77, 92, 98, 82, 85, 96, 95, 80, 79, 86,
            97, 92, 88]
}
df = pd.DataFrame(data)

# Adding a constant term for the intercept
X = sm.add_constant(df[['Age', 'Birthweight']])
y = df['SBP']

# Fitting the multiple regression model
model = sm.OLS(y, X).fit()

# Printing the summary of the regression model
print(model.summary())

Chapter 3: Multiple Linear Regression Model R Code: Hypertension Dataset

R Code: Hypertension Dataset


# Creating the dataframe from the provided data
data <- data.frame(
  Age = c(3, 4, 3, 2, 4, 5, 2, 3, 5, 4, 2, 3, 3, 4, 3, 3),
  Birthweight = c(135, 120, 100, 105, 130, 125, 125, 105, 120, 90,
                  120, 95, 120, 150, 160, 125),
  SBP = c(89, 90, 83, 77, 92, 98, 82, 85, 96, 95, 80, 79, 86, 97,
          92, 88)
)

# Fitting the multiple regression model
model <- lm(SBP ~ Age + Birthweight, data = data)

# Printing the summary of the regression model
summary(model)

Example 4.2

As discussed earlier in the Problem & Motivation section (Research Problem 3), we want to forecast the median price value of owner-occupied homes. How do you predict the price? How can we apply multiple regression to predict the price based on various attributes? Let's take a quick look at the dataset.

Crim: Per capita crime rate by town
Zn: Proportion of residential land zoned for lots over 25,000 sq. ft.
Indus: Proportion of non-retail business acres per town
Chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
Nox: Nitrogen oxides concentration (parts per 10 million)
Rm: Average number of rooms per dwelling
Age: Proportion of owner-occupied units built before 1940
Dis: Weighted mean of distances to five Boston employment centers
Rad: Index of accessibility to radial highways
Tax: Full-value property tax rate per $10,000
Ptratio: Pupil-teacher ratio by town
B: 1000(Bk − 0.63)², where Bk is the proportion of Blacks by town
Lstat: Lower status of the population (percent)
Medv: Median price value of owner-occupied homes in $1000s
Python Code: OLS for Multiple Linear Regression Model


# Multiple Linear Regression Model
import statsmodels.api as sm

X = boston[['LSTAT', 'CRIM']]
model = sm.OLS(boston.MEDV, sm.add_constant(X))
result = model.fit()
print(result.summary())

## Alternatively
import statsmodels.formula.api as smf

# formula: response ~ predictor + predictor
est = smf.ols(formula='MEDV ~ LSTAT + CRIM', data=boston).fit()
print(est.summary())

# In GLM framework
Gaussian_model = sm.GLM(boston.MEDV, sm.add_constant(X),
                        family=sm.families.Gaussian()).fit()
print(Gaussian_model.summary())