Chapter 4

In chapter 4, the document will discuss describing relationships between two quantitative variables using correlation coefficients when the relationship is assumed to be linear. It will define the correlation coefficient r as a measure of the strength of the linear relationship between two variables. The document provides an example of calculating r from a sample data set and interpreting the results. It also introduces the concept of regression analysis for modeling relationships where one variable is dependent on the other.


Summary of Chapter 4

In chapter 3, we looked at ways of summarizing the distribution of a single quantitative variable, either
through tables and pictures (e.g., histograms, stem-and-leaf displays, polygons, ogives, box plots) or
through numerical measures (e.g., mean, median, range, variance).

In chapter 4, we will extend this summarization of quantitative data by looking at ways of describing the
relationship between two quantitative variables when we believe that the relationship between the
two variables is linear, but not necessarily perfectly linear. This summarization of linear relationships
will begin with a measure of the strength of a linear relationship: the correlation coefficient.

The correlation coefficient is a measure of the strength of the linear relationship between two
quantitative random variables where causation is not assumed (that is, the values of one variable are
not assumed to cause or affect the values of the other). Once we determine how to calculate this
strength, we will also see that its value tells us whether the relationship between these two variables is
positive, negative, or non-existent. As always, we attach a symbol to any numerical measure which
describes a specific aspect of our data set. The sample correlation coefficient is given the symbol r,
while the population correlation coefficient is given the symbol ρ (the Greek letter ‘rho’). We will focus
on the sample correlation coefficient, r, although the calculation of ρ and its interpretation would be
the same but would apply to the population and not the sample. The sample correlation coefficient
can be calculated in a number of ways, each resulting in the same value of r. Three of the formulas for
the calculation of r are:

r = ∑(z_xi · z_yi) / (n − 1),   where z_xi = (x_i − x̄) / s_x  and  z_yi = (y_i − ȳ) / s_y

r = ∑[(x_i − x̄)(y_i − ȳ)] / √( ∑(x_i − x̄)² · ∑(y_i − ȳ)² )   or   r = ∑[(x_i − x̄)(y_i − ȳ)] / ( (n − 1) s_x s_y )

To calculate r, we first need data. This data will consist of pairs of values of x and y obtained for each
element in our sample. We can symbolize these pairs by (xi, yi). As an example, using a sample of size
n = 5, we obtained the following data:

element xi yi
1 6 5
2 10 3
3 14 7
4 19 8
5 21 12
But, before we calculate r, we should create a scatterplot to see if it is reasonable to assume that the
relationship between x and y is linear. In this example, the scatterplot looks like:

[Scatterplot: y vs x, with y (ranging from about 2 to 12) on the vertical axis and x (ranging from 6 to 22) on the horizontal axis]

From observing this scatterplot, we see that there is not sufficient evidence to indicate that this
relationship is not linear. So, we will assume that it is linear and proceed to calculate r. Based on the
results of this plot, we should expect the value of r would tell us the relationship is positive (i.e., x and y
tend to move in the same direction) and, although the relationship is not perfect, it is somewhat strong.

Of the above three formulas for r, we should use either of the last two to reduce the number of
calculations if our work is to be done by ‘hand’. Using either of these formulas, the following table
would allow us to calculate r in a systematic way.

        x_i   y_i   x_i − x̄   y_i − ȳ   (x_i − x̄)²   (y_i − ȳ)²   (x_i − x̄)(y_i − ȳ)
         6     5      −8        −2          64            4              16
        10     3      −4        −4          16           16              16
        14     7       0         0           0            0               0
        19     8       5         1          25            1               5
        21    12       7         5          49           25              35
totals  70    35       0         0         154           46              72

x̄ = ∑x_i / n = 70/5 = 14

ȳ = ∑y_i / n = 35/5 = 7

s_x = √( ∑(x_i − x̄)² / (n − 1) ) = √(154/4) ≈ 6.205

s_y = √( ∑(y_i − ȳ)² / (n − 1) ) = √(46/4) ≈ 3.391

Using r = ∑[(x_i − x̄)(y_i − ȳ)] / √( ∑(x_i − x̄)² · ∑(y_i − ȳ)² ) = 72 / √((154)(46)) ≈ 0.855

Using r = ∑[(x_i − x̄)(y_i − ȳ)] / ( (n − 1) s_x s_y ) = 72 / ((5 − 1)(6.205)(3.391)) ≈ 0.856   (the small difference is due to rounding)
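As a numerical check, the three formulas can be compared directly; the sketch below (plain Python, standard library only; the variable names are mine, added for illustration) computes r all three ways for this data set and shows they agree when no hand-rounding occurs in between:

```python
# Checking that the three formulas for r agree, using the five (x, y)
# pairs from the example (standard library only).
from math import sqrt

x = [6, 10, 14, 19, 21]
y = [5, 3, 7, 8, 12]
n = len(x)

xbar = sum(x) / n            # 70 / 5 = 14
ybar = sum(y) / n            # 35 / 5 = 7
sx = sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))   # sqrt(154/4)
sy = sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))   # sqrt(46/4)

sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))   # 72

# Formula 1: average product of z-scores
zx = [(xi - xbar) / sx for xi in x]
zy = [(yi - ybar) / sy for yi in y]
r1 = sum(a * b for a, b in zip(zx, zy)) / (n - 1)

# Formula 2: sums of squares form
r2 = sxy / sqrt(sum((xi - xbar) ** 2 for xi in x)
                * sum((yi - ybar) ** 2 for yi in y))

# Formula 3: (n-1) * sx * sy form
r3 = sxy / ((n - 1) * sx * sy)

print(r1, r2, r3)   # all three ≈ 0.855
```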

What does r tell us about the relationship between x and y in our sample?

For any sample data set, r can vary between −1 and +1, with a value of −1 indicating a perfect negative
linear relationship, a value of 0 indicating no linear relationship, and a value of +1 indicating a perfect
positive linear relationship. And, the closer its value is to −1 or +1, the stronger the linear relationship
between x and y in our sample. (Note: if this data set were the population data set, the value of ρ would
be identical to the above and we would be able to make the same statements about ρ as we did about r,
except that the statements would refer to the population.)

When interpreting r, we assume that the variables are random and quantitative and that there isn’t a
cause-effect relationship between the two. And, if outliers appear to be present in our data set, we may
want to examine what effect these outliers may have on r. (Our scatterplot does not indicate the
presence of outliers and there is no reason to conclude that the relationship is not linear. And, if our
elements in our sample were randomly selected, we assume that these two variables are random.)

Based on these observations, we can state that, since r ≈ 0.855 or 0.856, there is a strong positive linear
relationship between x and y in our sample.

As previously mentioned, we use correlation coefficients when we assume no cause-effect linear
relationship between the two quantitative variables and we assume that both these variables are
random variables. But what if there is a cause-effect relationship where one of these variables, y,
depends on the other variable x? And, what if y is a random variable and x is not? If this is the situation,
we analyze the linear relationship between x and y using regression analysis. Although regression
analysis will be more thoroughly discussed in chapter 14 of our text, chapter 4 introduces us to it now,
when we are only interested in examining the relationship between x and y in our sample. (The material
in chapter 14, to be covered in our second statistics course, will allow us to use what we learned in this
chapter to make inferences about the true linear relationship between these two variables.)

Because we believe that y depends on x, we will call y the dependent variable and x the independent
variable. And, because we believe that the relationship is linear but not perfect, we can express the
relationship between x and y, in our sample, as:

y_i = b0 + b1·x_i + e_i

where,  b0 = y-intercept of the linear relationship
        b1 = slope of the linear relationship
        e_i = residual, or, deviation of y_i from the linear relationship

If we set e aside, the perfect linear relationship can be expressed as:

ŷ_i = b0 + b1·x_i

making

e_i = y_i − ŷ_i

Another example

Many companies try to improve their sales by sending out the same emails to individuals’ email
addresses, some sending out several per day, day after day. But does this strategy actually
increase sales? A study was undertaken in which 8 email addresses were randomly selected from a list
of previous customers’ emails, and each address was sent a specific number of emails per day over a
one-month period. The dollar sales which resulted are recorded below:

Customer Daily emails $ Sales

1 2 70
2 4 30
3 6 80
4 8 20
5 10 110
6 12 100
7 14 54
8 16 120

(Note: In this example we believe that sales, y, depends on the no. of daily emails, x; that sales is
random (although dependent on the no. of emails); but that the no. of daily emails is not random, as its
values were specifically chosen for this study.)

Before calculating b0 and b1, we should create a scatterplot to see if our assumption of a linear
relationship seems reasonable.

[Scatterplot: Sales vs. Number of Emails, with Sales ($) on the vertical axis (0 to 140) and No. of emails on the horizontal axis (0 to 18)]

The above scatterplot does not rule out our assumption of linearity, although it does indicate that the
relationship is somewhat weak. It also indicates that the relationship is somewhat positive.

With our assumptions validated, we can then proceed to determine the values of b0 and b1, or, find the
best line through our data points.
It would seem reasonable that the best line through these points should take the residuals, e_i’s, into
consideration, and that the best line should be the line that minimizes some function of these e_i’s.
Based on our previous assumptions and the additional assumptions that:

 the residuals are independent of one another
 the variability of y is the same no matter what the value of x
 and, there are no problems with outliers

the best line is the line which minimizes

∑ e_i²   (the sum of squared residuals)
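To see concretely what “minimizes ∑ e_i²” means, here is a brief Python sketch (an added illustration, using the email-study data above): the least-squares coefficients produce a smaller sum of squared residuals than lines with a perturbed slope or intercept.

```python
# An added illustration: for the email-study data, the sum of squared
# residuals is smallest at the least-squares coefficients.
x = [2, 4, 6, 8, 10, 12, 14, 16]         # daily emails
y = [70, 30, 80, 20, 110, 100, 54, 120]  # $ sales

def sse(b0, b1):
    """Sum of squared residuals sum(e_i^2) for the line yhat = b0 + b1*x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

best = sse(39.7855, 3.6905)   # the least-squares line found later in these notes
print(best < sse(39.7855, 4.0))   # True: changing the slope raises sum(e_i^2)
print(best < sse(45.0, 3.6905))   # True: changing the intercept raises it too
```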
Continued next class
If we wish to calculate the slope and intercept ‘by hand’, formulas mentioned in chapter 6 are:

b1 = r (s_y / s_x)   and   b0 = ȳ − b1·x̄
Using the same mathematical manipulations of our data as with our first example,

Customer   x_i   y_i   x_i − x̄   y_i − ȳ   (x_i − x̄)²   (y_i − ȳ)²   (x_i − x̄)(y_i − ȳ)
1           2    70      −7        −3          49            9              21
2           4    30      −5       −43          25         1849             215
3           6    80      −3         7           9           49             −21
4           8    20      −1       −53           1         2809              53
5          10   110       1        37           1         1369              37
6          12   100       3        27           9          729              81
7          14    54       5       −19          25          361             −95
8          16   120       7        47          49         2209             329
_______________________________________________________________
totals     72   584       0         0         168         9384             620

The following calculations allow us to determine the best line through our data points:

x̄ = ∑x_i / n = 72/8 = 9

ȳ = ∑y_i / n = 584/8 = 73

s_x = √( ∑(x_i − x̄)² / (n − 1) ) = √(168/7) ≈ 4.899

s_y = √( ∑(y_i − ȳ)² / (n − 1) ) = √(9384/7) ≈ 36.614

Using r = ∑[(x_i − x̄)(y_i − ȳ)] / √( ∑(x_i − x̄)² · ∑(y_i − ȳ)² ) = 620 / √((168)(9384)) ≈ 0.4938

Or, using r = ∑[(x_i − x̄)(y_i − ȳ)] / ( (n − 1) s_x s_y ) = 620 / ((8 − 1)(4.899)(36.614)) ≈ 0.4938

b1 and b0 are calculated to be:

b1 = r (s_y / s_x) = 0.4938 (36.614 / 4.899) ≈ 3.6905   and   b0 = ȳ − b1·x̄ = 73 − 3.6905(9) ≈ 39.7855

and,

expected sales = 39.7855 + 3.6905 (no. of daily emails)
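As a check on the hand calculations, a short Python sketch (standard library only; variable names are mine) reproduces b1 and b0 from the ‘by hand’ formulas:

```python
# Reproducing b1 = r*(sy/sx) and b0 = ybar - b1*xbar for the email data.
from math import sqrt

x = [2, 4, 6, 8, 10, 12, 14, 16]         # daily emails
y = [70, 30, 80, 20, 110, 100, 54, 120]  # $ sales
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n      # 9 and 73
sx = sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))   # ≈ 4.899
sy = sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))   # ≈ 36.614
r = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ((n - 1) * sx * sy)

b1 = r * sy / sx        # slope
b0 = ybar - b1 * xbar   # intercept
print(round(b1, 4), round(b0, 4))   # 3.6905 39.7857
```

(The intercept comes out as 39.7857 here; the hand calculation’s 39.7855 differs slightly because r, s_x, and s_y were rounded before being combined.)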

Interpreting this expression literally:

 $39.7855 would be the expected sales if no emails were sent out to a customer
 Each additional daily email sent out is expected to increase sales by $3.6905

But, as will be mentioned below, the literal interpretation of the regression line is usually not the correct
interpretation.

The strength of the linear relationship

Using correlation analysis, we used r as a measure of the strength of the linear relationship between two
random quantitative variables. An equivalent measure of the strength of the linear relationship in
regression analysis is R², the coefficient of determination. R² takes the total variation of y in our sample
and breaks that total variation into two components: the variation of y due to its relationship with x,
and the variation that is not explained by x. The equation expressing the relationship among these
variations is:

∑(y_i − ȳ)²  =  ∑(ŷ_i − ȳ)²  +  ∑(y_i − ŷ_i)²

total variation in y (SST)  =  variation in y due to x (SSM)  +  unexplained variation (SSE)

and, using this equation,

R² = SSM / SST = 1 − SSE / SST

Why does this statistic measure the strength of the linear relationship?

 If the relationship is perfect, SSE = 0 and, thus, R² = 1
 If there is no linear relationship, SSM = 0 and, thus, SSE = SST and R² = 0 (Note: SSM would
equal zero not only because the regression line would be horizontal but also because the
regression line always passes through the point (x̄, ȳ), whether the line is horizontal or not)
 The closer SSM is to SST, or the closer SSE is to zero, the larger the R² and the stronger the
linear relationship

While there are several formulas to calculate R2 (all giving the same value), if we have only one
independent variable in the regression model,

R² = r²
In our example,

R² = 0.4938² ≈ 0.2438

We can interpret R² in this example to read,

“24.38% of the total variation in daily sales in our sample can be explained
by its linear relationship to the number of daily emails received.”
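The variation breakdown can be verified numerically; the sketch below (plain Python, an added illustration using the email-study data) computes SST, SSM, and SSE directly and confirms both that SST = SSM + SSE and that R² ≈ 0.2438:

```python
# An added check of the breakdown SST = SSM + SSE and of R^2
# for the single-predictor case, using the email-study data.
x = [2, 4, 6, 8, 10, 12, 14, 16]
y = [70, 30, 80, 20, 110, 100, 54, 120]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
     / sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]        # fitted values on the regression line

sst = sum((yi - ybar) ** 2 for yi in y)               # total variation
ssm = sum((yh - ybar) ** 2 for yh in yhat)            # explained by x
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # unexplained

print(abs(sst - (ssm + sse)) < 1e-6)   # True: the decomposition holds
print(round(ssm / sst, 4))             # 0.2438, matching r squared
```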

Using Excel’s scatterplot and its additional functions, the following scatter plot summarizes
our previously calculated statistics.

Sales vs No. of Emails


120

100
f(x) = 3.69047619047619 x + 39.7857142857143
80 R² = 0.243829415824301
Sales ($)

60

40

20

0
0 2 4 6 8 10 12 14 16 18
Daily emails

Note: You will notice that the regression line using Excel (also indicated in the scatterplot if requested)
does not go as far as the y axis. It starts when x = 2 and stops when x = 16, which just happen to be the
lowest and highest values of x in our sample. There is a logical reason for this.

We drew a scatterplot to see whether or not a linear relationship seemed reasonable. We can only
judge this assumption over the range of x’s in our sample. So, in this case, we observe that the
relationship seems reasonable when x is somewhere between a value of 2 and a value of 16 daily emails.
Outside this range, we have no information to either support or not support this linearity assumption.
Therefore, outside of the range of x’s in our sample, we cannot assume that the same relationship will
apply. Thus, our estimated intercept of $39.7855 may not apply, nor would our estimated increase in
sales of $3.6905 per additional email necessarily apply.
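This caution can be built into a prediction step; the sketch below (a hypothetical helper I have added, not from the text) applies the fitted line only within the observed range of x and refuses to extrapolate:

```python
# Hypothetical helper (not from the text): apply the fitted line only
# within the observed range of x in the sample, refusing to extrapolate.
def expected_sales(n_emails, lo=2, hi=16):
    """Predicted sales for a daily-email count inside [lo, hi]."""
    if not lo <= n_emails <= hi:
        raise ValueError("outside the observed range of x; the linearity "
                         "assumption has not been checked there")
    return 39.7855 + 3.6905 * n_emails

print(round(expected_sales(10), 2))   # 76.69
```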
We have now completed summarizing sample data and we will now discuss probabilities, the last topic
of this introductory course.
