Chapter 4

In chapter 4, the document will discuss describing relationships between two quantitative variables using correlation coefficients when the relationship is assumed to be linear. It will define the correlation coefficient r as a measure of the strength of the linear relationship between two variables. The document provides an example of calculating r from a sample data set and interpreting the results. It also introduces the concept of regression analysis for modeling relationships where one variable is dependent on the other.


Summary of Chapter 4

In chapter 3, we looked at ways of summarizing the distribution of a single quantitative variable, either
through tables and pictures (e.g., histograms, stem-and-leaf displays, polygons, ogives, box plots) or
through numerical measures (e.g., mean, median, range, variance).

In chapter 4, we will extend this summarization of quantitative data by looking at ways of describing the
relationship between two quantitative variables when we believe that the relationship between the
two variables is linear, but not necessarily perfectly linear. This summarization of linear relationships
will begin with a measure of the strength of a linear relationship: the correlation coefficient.

The correlation coefficient is a measure of the strength of the linear relationship between two
quantitative random variables where causation is not assumed (that is, the values of one variable are
not assumed to cause or affect the values of the other). Once we determine how to calculate this
strength, we will also see that its value tells us whether the relationship between these two variables is
positive, negative, or non-existent. As always, we attach a symbol to any numerical measure which
describes a specific aspect of our data set. The sample correlation coefficient is given the symbol r,
while the population correlation coefficient is given the symbol ρ (the Greek letter ‘rho’). We will focus
on the sample correlation coefficient, r, although the calculation of ρ and its interpretation would be
the same but would apply to the population and not the sample. The sample correlation coefficient
can be calculated in a number of ways, each resulting in the same value of r. Three of the formulas for
the calculation of r are:

r = ∑(z_xi · z_yi) / (n − 1),   where z_xi = (x_i − x̄) / s_x  and  z_yi = (y_i − ȳ) / s_y

r = ∑[(x_i − x̄)(y_i − ȳ)] / √( ∑(x_i − x̄)² · ∑(y_i − ȳ)² )   or   r = ∑[(x_i − x̄)(y_i − ȳ)] / ( (n − 1) s_x s_y )

To calculate r, we first need data. This data will consist of pairs of values of x and y obtained for each
element in our sample. We can symbolize these pairs by (xi, yi). As an example, using a sample of size
n = 5, we obtained the following data:

element xi yi
1 6 5
2 10 3
3 14 7
4 19 8
5 21 12
But, before we calculate r, we should create a scatterplot to see if it is reasonable to assume that the
relationship between x and y is linear. In this example, the scatterplot looks like:

[Scatterplot: y vs x, with y (ranging from about 2 to 12) on the vertical axis and x (ranging from 6 to 22) on the horizontal axis]

From observing this scatterplot, we see that there is not sufficient evidence to indicate that this
relationship is not linear. So, we will assume that it is linear and proceed to calculate r. Based on the
results of this plot, we should expect the value of r would tell us the relationship is positive (i.e., x and y
tend to move in the same direction) and, although the relationship is not perfect, it is somewhat strong.

Of the above three formulas for r, we should use either of the last two to reduce the number of
calculations if our work is to be done by ‘hand’. Using either of these formulas, the following table
would allow us to calculate r in a systematic way.

        x_i   y_i   x_i − x̄   y_i − ȳ   (x_i − x̄)²   (y_i − ȳ)²   (x_i − x̄)(y_i − ȳ)
         6     5      −8        −2          64            4              16
        10     3      −4        −4          16           16              16
        14     7       0         0           0            0               0
        19     8       5         1          25            1               5
        21    12       7         5          49           25              35
totals  70    35       0         0         154           46              72

x̄ = ∑x_i / n = 70/5 = 14

ȳ = ∑y_i / n = 35/5 = 7

s_x = √( ∑(x_i − x̄)² / (n − 1) ) = √(154/4) ≈ 6.205

s_y = √( ∑(y_i − ȳ)² / (n − 1) ) = √(46/4) ≈ 3.391

Using r = ∑[(x_i − x̄)(y_i − ȳ)] / √( ∑(x_i − x̄)² · ∑(y_i − ȳ)² ) = 72 / √((154)(46)) ≈ 0.855

Using r = ∑[(x_i − x̄)(y_i − ȳ)] / ( (n − 1) s_x s_y ) = 72 / ((5 − 1)(6.205)(3.391)) ≈ 0.856   (the small difference is due to rounding)
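As a numerical check, the three formulas can be compared directly; the sketch below (plain Python, standard library only; the variable names are mine, added for illustration) computes r all three ways for this data set and shows they agree when no hand-rounding occurs in between:

```python
# Checking that the three formulas for r agree, using the five (x, y)
# pairs from the example (standard library only).
from math import sqrt

x = [6, 10, 14, 19, 21]
y = [5, 3, 7, 8, 12]
n = len(x)

xbar = sum(x) / n            # 70 / 5 = 14
ybar = sum(y) / n            # 35 / 5 = 7
sx = sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))   # sqrt(154/4)
sy = sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))   # sqrt(46/4)

sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))   # 72

# Formula 1: average product of z-scores
zx = [(xi - xbar) / sx for xi in x]
zy = [(yi - ybar) / sy for yi in y]
r1 = sum(a * b for a, b in zip(zx, zy)) / (n - 1)

# Formula 2: sums of squares form
r2 = sxy / sqrt(sum((xi - xbar) ** 2 for xi in x)
                * sum((yi - ybar) ** 2 for yi in y))

# Formula 3: (n-1) * sx * sy form
r3 = sxy / ((n - 1) * sx * sy)

print(r1, r2, r3)   # all three ≈ 0.855
```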

What does r tell us about the relationship between x and y in our sample?

For any sample data set, r can vary between −1 and +1, with a value of −1 indicating a perfect negative
linear relationship, a value of 0 indicating no linear relationship, and a value of +1 indicating a perfect
positive linear relationship. And, the closer its value is to −1 or +1, the stronger the linear relationship
between x and y in our sample. (Note: if this data set were the population data set, the value of ρ would
be identical to the above and we would be able to make the same statements about ρ as we did about r,
except that the statements would refer to the population.)

When interpreting r, we assume that the variables are random and quantitative and that there isn’t a
cause-effect relationship between the two. And, if outliers appear to be present in our data set, we may
want to examine what effect these outliers may have on r. (Our scatterplot does not indicate the
presence of outliers and there is no reason to conclude that the relationship is not linear. And, if our
elements in our sample were randomly selected, we assume that these two variables are random.)

Based on these observations, we can state that, since r ≈ 0.855 or 0.856, there is a strong positive linear
relationship between x and y in our sample.

As previously mentioned, we use correlation coefficients when we assume no cause-effect linear
relationship between the two quantitative variables and we assume that both these variables are
random variables. But what if there is a cause-effect relationship where one of these variables, y,
depends on the other variable x? And, what if y is a random variable and x is not? If this is the situation,
we analyze the linear relationship between x and y using regression analysis. Although regression
analysis will be more thoroughly discussed in chapter 14 of our text, chapter 4 introduces us to it now,
when we are only interested in examining the relationship between x and y in our sample. (The material
in chapter 14, to be covered in our second statistics course, will allow us to use what we learned in this
chapter to make inferences about the true linear relationship between these two variables.)

Because we believe that y depends on x, we will call y the dependent variable and x the independent
variable. And, because we believe that the relationship is linear but not perfect, we can express the
relationship between x and y, in our sample, as:

y_i = b0 + b1·x_i + e_i

where,  b0 = y-intercept of the linear relationship
        b1 = slope of the linear relationship
        e_i = residual, or, deviation of y_i from the linear relationship

If we set e aside, the perfect linear relationship can be expressed as:

ŷ_i = b0 + b1·x_i

making

e_i = y_i − ŷ_i

Another example

Many companies try to improve their sales by sending out the same emails to individuals’ email
addresses, some sending out several per day, day after day. But does this strategy actually
increase sales? A study was undertaken in which 8 email addresses were randomly selected from a list
of previous customers’ emails, and each address was sent a specific number of emails per day over a
one-month period. The dollar sales which resulted are recorded below:

Customer Daily emails $ Sales

1 2 70
2 4 30
3 6 80
4 8 20
5 10 110
6 12 100
7 14 54
8 16 120

(Note: In this example we believe that sales, y, depends on the no. of daily emails, x; that sales is
random (although dependent on the no. of emails); but that the no. of daily emails is not random, as its
values were specifically chosen for this study.)

Before calculating b0 and b1, we should create a scatterplot to see if our assumption of a linear
relationship seems reasonable.

[Scatterplot: Sales vs. Number of Emails, with Sales ($) on the vertical axis (0 to 140) and No. of emails on the horizontal axis (0 to 18)]

The above scatterplot does not rule out our assumption of linearity, although it does indicate that the
relationship is somewhat weak. It also indicates that the relationship is somewhat positive.

With our assumptions validated, we can then proceed to determine the values of b0 and b1, or, find the
best line through our data points.
It would seem reasonable that the best line through these points should take the residuals, e_i’s, into
consideration, and that the best line should be the line that minimizes some function of these e_i’s.
Based on our previous assumptions and the additional assumptions that:

 the residuals are independent of one another
 the variability of y is the same no matter what the value of x
 and, there are no problems with outliers

the best line is the line which minimizes

∑ e_i²   (the sum of squared residuals)
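To see concretely what “minimizes ∑ e_i²” means, here is a brief Python sketch (an added illustration, using the email-study data above): the least-squares coefficients produce a smaller sum of squared residuals than lines with a perturbed slope or intercept.

```python
# An added illustration: for the email-study data, the sum of squared
# residuals is smallest at the least-squares coefficients.
x = [2, 4, 6, 8, 10, 12, 14, 16]         # daily emails
y = [70, 30, 80, 20, 110, 100, 54, 120]  # $ sales

def sse(b0, b1):
    """Sum of squared residuals sum(e_i^2) for the line yhat = b0 + b1*x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

best = sse(39.7855, 3.6905)   # the least-squares line found later in these notes
print(best < sse(39.7855, 4.0))   # True: changing the slope raises sum(e_i^2)
print(best < sse(45.0, 3.6905))   # True: changing the intercept raises it too
```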
Continued next class
If we wish to calculate the slope and intercept ‘by hand’, formulas mentioned in chapter 6 are:

b1 = r (s_y / s_x)   and   b0 = ȳ − b1·x̄
Using the same mathematical manipulations of our data as with our first example,

Customer   x_i   y_i   x_i − x̄   y_i − ȳ   (x_i − x̄)²   (y_i − ȳ)²   (x_i − x̄)(y_i − ȳ)
1           2    70      −7        −3          49            9              21
2           4    30      −5       −43          25         1849             215
3           6    80      −3         7           9           49             −21
4           8    20      −1       −53           1         2809              53
5          10   110       1        37           1         1369              37
6          12   100       3        27           9          729              81
7          14    54       5       −19          25          361             −95
8          16   120       7        47          49         2209             329
_______________________________________________________________
totals     72   584       0         0         168         9384             620

The following calculations allow us to determine the best line through our data points:

x̄ = ∑x_i / n = 72/8 = 9

ȳ = ∑y_i / n = 584/8 = 73

s_x = √( ∑(x_i − x̄)² / (n − 1) ) = √(168/7) ≈ 4.899

s_y = √( ∑(y_i − ȳ)² / (n − 1) ) = √(9384/7) ≈ 36.614

Using r = ∑[(x_i − x̄)(y_i − ȳ)] / √( ∑(x_i − x̄)² · ∑(y_i − ȳ)² ) = 620 / √((168)(9384)) ≈ 0.4938

Or, using r = ∑[(x_i − x̄)(y_i − ȳ)] / ( (n − 1) s_x s_y ) = 620 / ((8 − 1)(4.899)(36.614)) ≈ 0.4938

b1 and b0 are calculated to be:

b1 = r (s_y / s_x) = 0.4938 (36.614 / 4.899) ≈ 3.6905   and   b0 = ȳ − b1·x̄ = 73 − 3.6905(9) ≈ 39.7855

and,

expected sales = 39.7855 + 3.6905 (no. of daily emails)
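As a check on the hand calculations, a short Python sketch (standard library only; variable names are mine) reproduces b1 and b0 from the ‘by hand’ formulas:

```python
# Reproducing b1 = r*(sy/sx) and b0 = ybar - b1*xbar for the email data.
from math import sqrt

x = [2, 4, 6, 8, 10, 12, 14, 16]         # daily emails
y = [70, 30, 80, 20, 110, 100, 54, 120]  # $ sales
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n      # 9 and 73
sx = sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))   # ≈ 4.899
sy = sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))   # ≈ 36.614
r = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ((n - 1) * sx * sy)

b1 = r * sy / sx        # slope
b0 = ybar - b1 * xbar   # intercept
print(round(b1, 4), round(b0, 4))   # 3.6905 39.7857
```

(The intercept comes out as 39.7857 here; the hand calculation’s 39.7855 differs slightly because r, s_x, and s_y were rounded before being combined.)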

Interpreting this expression literally:

 $39.7855 would be the expected sales if no emails were sent out to a customer
 Each additional daily email sent out is expected to increase sales by $3.6905

But, as will be mentioned below, the literal interpretation of the regression line is usually not the correct
interpretation.

The strength of the linear relationship

Using correlation analysis, we used r as a measure of the strength of the linear relationship between two
random quantitative variables. An equivalent measure of the strength of the linear relationship in
regression analysis is R², the coefficient of determination. R² takes the total variation of y in our sample
and breaks that total variation into two components: the variation of y due to its relationship with x,
and the variation that is not explained by x. The equation expressing the relationship among these
variations is:

∑(y_i − ȳ)²  =  ∑(ŷ_i − ȳ)²  +  ∑(y_i − ŷ_i)²

total variation in y (SST)  =  variation in y due to x (SSM)  +  unexplained variation (SSE)

and, using this equation,

R² = SSM / SST = 1 − SSE / SST

Why does this statistic measure the strength of the linear relationship?

 If the relationship is perfect, SSE = 0 and, thus, R² = 1
 If there is no linear relationship, SSM = 0 and, thus, SSE = SST and R² = 0 (Note: SSM would
equal zero not only because the regression line would be horizontal but also because the
regression line always passes through the point (x̄, ȳ), whether the line is horizontal or not)
 The closer SSM is to SST, or the closer SSE is to zero, the larger the R² and the stronger the
linear relationship

While there are several formulas to calculate R2 (all giving the same value), if we have only one
independent variable in the regression model,

R² = r²
In our example,

R² = 0.4938² ≈ 0.2438

We can interpret R² in this example to read,

“24.38% of the total variation in daily sales in our sample can be explained
by its linear relationship to the number of daily emails received.”
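The variation breakdown can be verified numerically; the sketch below (plain Python, an added illustration using the email-study data) computes SST, SSM, and SSE directly and confirms both that SST = SSM + SSE and that R² ≈ 0.2438:

```python
# An added check of the breakdown SST = SSM + SSE and of R^2
# for the single-predictor case, using the email-study data.
x = [2, 4, 6, 8, 10, 12, 14, 16]
y = [70, 30, 80, 20, 110, 100, 54, 120]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
     / sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]        # fitted values on the regression line

sst = sum((yi - ybar) ** 2 for yi in y)               # total variation
ssm = sum((yh - ybar) ** 2 for yh in yhat)            # explained by x
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # unexplained

print(abs(sst - (ssm + sse)) < 1e-6)   # True: the decomposition holds
print(round(ssm / sst, 4))             # 0.2438, matching r squared
```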

Using Excel’s scatterplot and its additional functions, the following scatter plot summarizes
our previously calculated statistics.

Sales vs No. of Emails


120

100
f(x) = 3.69047619047619 x + 39.7857142857143
80 R² = 0.243829415824301
Sales ($)

60

40

20

0
0 2 4 6 8 10 12 14 16 18
Daily emails

Note: You will notice that the regression line using Excel (also indicated in the scatterplot if requested)
does not go as far as the y axis. It starts when x = 2 and stops when x = 16, which just happen to be the
lowest and highest values of x in our sample. There is a logical reason for this.

We drew a scatterplot to see whether or not a linear relationship seemed reasonable. We can only
judge this assumption over the range of x’s in our sample. So, in this case, we observe that the
relationship seems reasonable when x is somewhere between a value of 2 and a value of 16 daily emails.
Outside this range, we have no information to either support or not support this linearity assumption.
Therefore, outside of the range of x’s in our sample, we cannot assume that the same relationship will
apply. Thus, our estimated intercept of $39.7855 may not apply, nor would our estimated increase in
sales of $3.6905 per additional email necessarily apply.
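This caution can be built into a prediction step; the sketch below (a hypothetical helper I have added, not from the text) applies the fitted line only within the observed range of x and refuses to extrapolate:

```python
# Hypothetical helper (not from the text): apply the fitted line only
# within the observed range of x in the sample, refusing to extrapolate.
def expected_sales(n_emails, lo=2, hi=16):
    """Predicted sales for a daily-email count inside [lo, hi]."""
    if not lo <= n_emails <= hi:
        raise ValueError("outside the observed range of x; the linearity "
                         "assumption has not been checked there")
    return 39.7855 + 3.6905 * n_emails

print(round(expected_sales(10), 2))   # 76.69
```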
We have now completed summarizing sample data and we will now discuss probabilities, the last topic
of this introductory course.
