Chapter 3

This document discusses correlation and regression analysis. It defines correlation as a measure of the strength and direction of the linear relationship between two continuous variables. A scatter plot is used to visualize the relationship, and Pearson's correlation coefficient r quantifies it. Regression finds the linear relationship that best predicts the dependent variable from the independent variable. It derives an equation for the regression line that can be used to make predictions from new independent variable values. Assumptions like independent and normally distributed errors must be met for valid regression results.

Uploaded by

aisyahazali
Copyright
© All Rights Reserved

CHAPTER 3

CORRELATION AND REGRESSION


3.1 Introduction

In this chapter we investigate the relationship between two continuous variables
(a dependent and an independent variable), such as height and weight, the
concentration of an injected drug and heart rate, or the consumption level of some
nutrient and weight gain. The tools used to explore this relationship are regression
and correlation analysis. They can be used to find out whether the outcome of one
variable depends on the value of the other variable, and to describe the nature and
strength of the relationship between two continuous variables.

Note:
Dependent variable: The variable that measures the effect of the independent
variable on the test units. It is also known as the response variable.

Independent variable: The variable that is manipulated in an experiment in order
to observe its effect on the dependent variable. It is also known as the treatment or
predictor variable.

3.2 Scatter Plot

The first step in determining whether a relationship exists between two variables is to
plot a graph for the data. This graph is called a scatter plot. The scatter plot is a visual
way to describe the nature of the relationship between the independent (X) and
dependent (Y) variable. The scales of the variables can be different, and the
coordinates of the axes are determined by the smallest and largest data values of the
variables. Examples of scatter plots are given in Figures 3.1 to 3.7.

Figure 3.1: Perfect positive correlation
Figure 3.2: Perfect negative correlation
Figure 3.3: Positive correlation
Figure 3.4: Negative correlation
Figure 3.5: No correlation
Figure 3.6: Curvilinear correlation
Figure 3.7: Curvilinear correlation


3.3 Correlation

Correlation analysis is used to measure the strength of the relationship between two
variables, expressed as a single number. The correlation coefficient measures how
closely related two data series are. In particular, it measures the direction and extent
of the linear association between two variables. There are several types of correlation
coefficients. The one explained in this section is the Pearson product moment
correlation coefficient, which is normally denoted by r. Pearson's correlation coefficient
tells us two things about the relationship between two quantitative variables: the sign
(- or +) of r identifies the direction of the relationship, and the magnitude of r describes
its strength. The value of r lies between -1.0 and 1.0.

a. The mathematical formula for Pearson's correlation coefficient r is:

\[
r = \frac{\sum x_i y_i - \dfrac{\sum x_i \sum y_i}{n}}
         {\sqrt{\left[\sum x_i^2 - \dfrac{\left(\sum x_i\right)^2}{n}\right]
                \left[\sum y_i^2 - \dfrac{\left(\sum y_i\right)^2}{n}\right]}}
\]

where
r = correlation coefficient
n = number of observations
x = independent variable
y = dependent variable
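
As a cross-check on the formula above, the same coefficient can be written in terms of deviations from the means. The identities below are standard algebra rather than part of the original notes; $\bar{x}$ and $\bar{y}$ denote the sample means of x and y.

```latex
\[
\sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \frac{\sum x_i \sum y_i}{n},
\qquad
\sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}
\]
\[
r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum (x_i - \bar{x})^2 \, \sum (y_i - \bar{y})^2}}
\]
```

The computational form is convenient for hand calculation from column totals, which is how the worked examples in this chapter proceed.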

b. Characteristics of the correlation coefficient

The value of r always satisfies -1 ≤ r ≤ 1.
A value of r greater than 0 indicates a positive linear association between the two
variables.
A value of r less than 0 indicates a negative linear association between the two
variables.
A value of r equal to 0 indicates no linear relation between the two variables.

c. Strength of the Correlation Coefficient

|r| = 1 : Perfect correlation
0.8 ≤ |r| < 1 : Strong correlation
0.5 ≤ |r| < 0.8 : Moderate correlation
0 < |r| < 0.5 : Weak correlation
r = 0 : No correlation

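
The strength bands above can be sketched as a small helper. This is an illustrative sketch only: the function name `describe_correlation` is hypothetical, and the treatment of the boundary values 0.5 and 0.8 is an assumption, since such cut-offs are conventions rather than fixed rules.

```python
def describe_correlation(r: float) -> str:
    """Label r with a strength and direction, using assumed cut-offs 0.5 and 0.8."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)
    if size == 0:
        return "no correlation"
    if size == 1:
        strength = "perfect"
    elif size >= 0.8:      # assumption: 0.8 itself counts as strong
        strength = "strong"
    elif size >= 0.5:      # assumption: 0.5 itself counts as moderate
        strength = "moderate"
    else:
        strength = "weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} correlation"


print(describe_correlation(-0.9442))
print(describe_correlation(0.35))
```

The sign is reported separately from the strength because the bands apply to the magnitude of r.
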
Example 3.1

Draw a scatter diagram and compute the value of the correlation coefficient for the data
obtained in the study of the number of absences and the final grade of the seven students
in the statistics class.

Student   Number of Absences (X)   Final Grade % (Y)     XY      X²      Y²
A                  6                      82             492      36    6724
B                  2                      86             172       4    7396
C                 15                      43             645     225    1849
D                  9                      74             666      81    5476
E                 12                      58             696     144    3364
F                  5                      90             450      25    8100
G                  8                      78             624      64    6084
Total          ΣX = 57               ΣY = 511     ΣXY = 3745   ΣX² = 579   ΣY² = 38993

Solution:

[Scatter plot: Final Grade (%) on the vertical axis against Number of Absences on the horizontal axis]

\[
r = \frac{3745 - \dfrac{(57)(511)}{7}}
         {\sqrt{\left[579 - \dfrac{(57)^2}{7}\right]\left[38993 - \dfrac{(511)^2}{7}\right]}}
  = -0.9442
\]

The value of r = -0.9442 suggests a strong negative relationship between a student's
final grade and the number of absences the student has. That is, the more absences
a student has, the lower his or her grade tends to be.

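
The calculation in Example 3.1 can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name `pearson_r` and the variable names are chosen here for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson's r from column totals, following the computational formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = math.sqrt((sum_x2 - sum_x**2 / n) * (sum_y2 - sum_y**2 / n))
    return numerator / denominator


# Data from Example 3.1: absences (X) and final grade in % (Y) for students A to G
absences = [6, 2, 15, 9, 12, 5, 8]
grades = [82, 86, 43, 74, 58, 90, 78]

r = pearson_r(absences, grades)
print(round(r, 4))  # -0.9442
```
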
3.4 Regression

The simple linear regression model is a basic regression model with only one
independent variable and one dependent variable. In studying the relationship between
two variables, first collect the data and then construct a scatter plot. The purpose of the
scatter plot, as indicated previously, is to determine the nature of the relationship. The
possibilities include a positive linear relationship, a negative linear relationship, a
curvilinear relationship, or no discernible relationship. After the scatter plot is drawn,
the next steps are to compute the value of the correlation coefficient and to test the
significance of the relationship. If the value of the correlation coefficient is significant,
the next step is to determine the equation of the regression line, which is the data's line
of best fit. (Note: Determining the regression line when r is not significant and then
making predictions using the regression line is meaningless.) The purpose of the
regression line is to enable the researcher to see the trend and make predictions on
the basis of the data. The simple linear model can be stated as follows:

\[
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i
\]
where
$Y_i$ is the value of the response variable in the i-th trial,
$\beta_0$ and $\beta_1$ are regression coefficients or parameters,
$X_i$ is a known constant (the value of the independent variable in the i-th trial),
$\varepsilon_i$ is a random error with mean $E(\varepsilon_i) = 0$ and variance $V(\varepsilon_i) = \sigma^2$.
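
To make the roles of the parameters and the error term concrete, the model can be simulated. The sketch below is illustrative only: the function name `simulate_responses`, the parameter values and the X values are hypothetical, and the errors are drawn from a normal distribution, which is an additional assumption beyond the mean and variance stated above.

```python
import random


def simulate_responses(x_values, beta0, beta1, sigma, seed=0):
    """Generate Y_i = beta0 + beta1 * X_i + e_i with e_i ~ Normal(0, sigma^2)."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in x_values]


# Hypothetical settings chosen only for illustration
x_values = [10, 20, 30, 40, 50]
y_values = simulate_responses(x_values, beta0=5.0, beta1=2.0, sigma=3.0)

for x, y in zip(x_values, y_values):
    print(x, round(y, 2))
```

With `sigma=0` every point falls exactly on the line, so the scatter around the line comes entirely from the error term.
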

In the regression analysis, the assumptions of the model and error terms must be
considered in order to ensure that the result or estimated regression model is correct.

Assumptions of the Model

1. The response variables are independent.
2. The response variables are normally distributed.
3. The response variables have the same variance $\sigma^2$.
4. The true relationship between the mean of the response variable and
the explanatory variable is a straight line or linear.

Assumptions of Error Terms

1. The error terms are normally distributed.
2. The error terms have constant variance.
3. The error terms are independent.

3.4 Fitting a Straight Line

Figure 3.8 shows a scatter plot for the data of two variables. It shows that several lines
can be drawn on the graph near the points. Given a scatter plot, you must be able to
draw the line of best fit. Best fit means that the sum of the squares of the vertical
distances from each point to the line is at a minimum (Figure 3.9). The reason you need
a line of best fit is that the values of y will be predicted from the values of x; hence, the
closer the points are to the line, the better the fit and the prediction will be.

Figure 3.8: The regression line Figure 3.9: Best fit of regression line

The prediction regression line is expressed as $\hat{Y} = b_0 + b_1 X$, where $b_0$ and $b_1$ are
estimates of $\beta_0$ and $\beta_1$ respectively. $\beta_1$ is the slope of the regression line; it
indicates the change in the mean of Y per unit increase in X. The parameter $\beta_0$ is the
Y intercept of the regression line, that is, the mean of Y when X is equal to zero. To find
"good" estimators of the regression parameters $\beta_0$ and $\beta_1$, we employ the method of
ordinary least squares. The formulas for the ordinary least squares estimates are:

\[
b_1 = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}},
\qquad
b_0 = \bar{y} - b_1 \bar{x}
\]
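
The least squares formulas can be motivated as follows. This is a brief sketch of the standard argument, added for illustration; $S$ is a name introduced here for the sum of the squared vertical distances from the points to the line.

```latex
\[
S(b_0, b_1) = \sum (y_i - b_0 - b_1 x_i)^2
\]
\[
\frac{\partial S}{\partial b_0} = -2\sum (y_i - b_0 - b_1 x_i) = 0
\;\Rightarrow\; \sum y_i = n b_0 + b_1 \sum x_i
\]
\[
\frac{\partial S}{\partial b_1} = -2\sum x_i (y_i - b_0 - b_1 x_i) = 0
\;\Rightarrow\; \sum x_i y_i = b_0 \sum x_i + b_1 \sum x_i^2
\]
```

Dividing the first equation by n gives $b_0 = \bar{y} - b_1 \bar{x}$. Substituting this into the second equation and solving for $b_1$ gives the ratio of $\sum xy - \sum x \sum y / n$ to $\sum x^2 - (\sum x)^2 / n$.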

3.5 Coefficient of Determination

The coefficient of determination is the ratio of the explained variation to the total
variation. It is normally denoted by R². In other words, R² measures how much of the
variability in Y can be explained by its relationship with X. For the simple linear
regression of y on x, the coefficient of determination is the square of the correlation
coefficient r. Thus, we can state that:

\[
\text{Coefficient of Determination} = R^2 = \frac{\text{Explained Variation}}{\text{Total Variation}}
\]

For example, if the correlation coefficient is r = 0.91, then the coefficient of determination
is R² = r² = (0.91)² = 0.8281. Therefore, R² = 0.8281 means that 82.81% of the
variability of Y can be explained by the variability in X. The remaining 17.19% is
unexplained variability in Y.
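
The explained and total variation can be written out explicitly. The decomposition below is the usual one for a least squares line with an intercept; the labels SST, SSR and SSE are conventional names that are not used elsewhere in these notes, and $\hat{y}_i$ denotes the fitted value for observation i.

```latex
\[
\underbrace{\sum (y_i - \bar{y})^2}_{\text{SST (total)}}
= \underbrace{\sum (\hat{y}_i - \bar{y})^2}_{\text{SSR (explained)}}
+ \underbrace{\sum (y_i - \hat{y}_i)^2}_{\text{SSE (unexplained)}}
\]
\[
R^2 = \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}}
\]
```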

Example 3.2

The following table shows the data on the post test and final exam scores of ten people.

Person        1     2     3     4     5     6     7     8     9     10
Post test     100   96    88    100   100   96    80    68    92    96
Final exam    98    97    88    100   100   78    68    47    90    94

a. Calculate the Pearson product moment correlation coefficient and interpret its
meaning.
b. Find the regression equation line using least squares method.
c. Calculate coefficient of determination and explain its meaning.
d. Estimate the final exam score if the post test score is 55.

Solution:

Post Test (X)   Final Exam (Y)      X²        Y²        XY
    100              98           10000      9604      9800
     96              97            9216      9409      9312
     88              88            7744      7744      7744
    100             100           10000     10000     10000
    100             100           10000     10000     10000
     96              78            9216      6084      7488
     80              68            6400      4624      5440
     68              47            4624      2209      3196
     92              90            8464      8100      8280
     96              94            9216      8836      9024
ΣX = 916   ΣY = 860   ΣX² = 84880   ΣY² = 76610   ΣXY = 80284

a. Pearson Product Moment, r

\[
r = \frac{80284 - \dfrac{(916)(860)}{10}}
         {\sqrt{\left[84880 - \dfrac{(916)^2}{10}\right]\left[76610 - \dfrac{(860)^2}{10}\right]}}
  = 0.9384
\]

The value of r = 0.9384 suggests a strong positive relationship between the post test
and final exam scores.

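b. Regression equation line

\[
b_1 = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}}
    = \frac{80284 - \dfrac{(916)(860)}{10}}{84880 - \dfrac{(916)^2}{10}}
    = 1.5476
\]

\[
b_0 = \bar{y} - b_1 \bar{x} = 86 - (1.5476)(91.6) = -55.7619
\]

\[
\hat{y} = -55.7619 + 1.5476x
\]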

c. Coefficient of Determination, R²

\[
R^2 = r^2 = (0.9384)^2 = 0.8806
\]

88.06% of the variability of Final Exam score can be explained by the variability in
Post Test score. The remaining 11.94% is unexplained variability in Final Exam
score.

d. Prediction

\[
\hat{y} = -55.7619 + 1.5476(55) = 29.36
\]

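
The four parts of Example 3.2 can be reproduced with the short script below. It is a sketch using only the Python standard library, and the variable names are chosen here for illustration. Note that a post test score of 55 lies below the smallest observed score (68), so the prediction in part (d) is an extrapolation and should be read with caution.

```python
import math

# Data from Example 3.2
post_test = [100, 96, 88, 100, 100, 96, 80, 68, 92, 96]
final_exam = [98, 97, 88, 100, 100, 78, 68, 47, 90, 94]

n = len(post_test)
sum_x, sum_y = sum(post_test), sum(final_exam)

# Corrected sums of products and squares (the bracketed terms in the formulas)
s_xy = sum(x * y for x, y in zip(post_test, final_exam)) - sum_x * sum_y / n
s_xx = sum(x * x for x in post_test) - sum_x**2 / n
s_yy = sum(y * y for y in final_exam) - sum_y**2 / n

r = s_xy / math.sqrt(s_xx * s_yy)   # part (a)
b1 = s_xy / s_xx                    # part (b): slope
b0 = sum_y / n - b1 * sum_x / n     # part (b): intercept
r_squared = r**2                    # part (c)
predicted = b0 + b1 * 55            # part (d)

print(round(r, 4), round(b1, 4), round(b0, 4), round(predicted, 2))
print(round(r_squared, 3))
```

Squaring the unrounded r gives about 0.8807 rather than 0.8806; the small difference arises because the hand calculation squares r after rounding it to four decimal places.
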
3.6 Correlation and Simple Linear Regression Using MINITAB Software

This section illustrates how to run correlation and simple linear regression using
Minitab software. The steps of the analysis are shown below.

3.6.1 Correlation

Step 1: From the menu at the top of the screen, click on Stat, then Basic Statistics,
then Correlation.

Step 2: Enter the variable Y into the Variables column by clicking Y, then clicking Select.
Step 3: Enter the variable X into the Variables column by clicking X, then clicking Select,
then clicking OK.

Output of correlation analysis: r = 0.986


3.6.2 Simple Linear Regression

Step 1: From the menu at the top of the screen, click on Stat, then select Regression,
then Regression, then Fit Regression Model.

Step 2: Enter the variable Y into the Response column by clicking Y, then clicking Select.
Step 3: Enter the variable X into the Continuous predictors column by clicking X, then
clicking Select, then clicking OK.

Output of simple linear regression analysis: the regression equation is

\[
\hat{y} = 168.60 + 2.034x
\]

Example 3.3

The data, along with the MINITAB output, are reproduced below.

Y   199   205   218   220   237   234   250   248
X    16    16    24    24    32    32    40    40

Model Summary

S         R-sq     R-sq(adj)   R-sq(pred)
2.35407   98.68%   98.46%      97.27%

Coefficients

Term       Coef     SE Coef   T-Value   P-Value   VIF
Constant   171.25   2.74      62.61     0.000
X          1.9688   0.0931    21.16     0.000     1.00

a. State the estimated regression equation.

\[
\hat{y} = 171.25 + 1.9688x
\]

b. Determine the coefficient of determination and interpret the value obtained.

R² = 98.68% means that 98.68% of the variability of Y can be explained by the
variability of X. The remaining 1.32% is explained by other factors.

c. Interpret the value of the slope.

$b_1 = 1.9688$ means that for every 1 unit increase in X, the mean of Y
increases by 1.9688.

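
Several entries of the MINITAB output in Example 3.3 can be reproduced directly from the data. The sketch below uses the standard formulas for simple linear regression (residual standard error with n - 2 degrees of freedom, and the standard error of the slope); these formulas and the variable names are additions for illustration and are not derived in this chapter.

```python
import math

# Data from Example 3.3
y = [199, 205, 218, 220, 237, 234, 250, 248]
x = [16, 16, 24, 24, 32, 32, 40, 40]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((a - x_bar) ** 2 for a in x)
s_yy = sum((b - y_bar) ** 2 for b in y)
s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))

b1 = s_xy / s_xx                  # slope ("X" row, Coef)
b0 = y_bar - b1 * x_bar           # intercept ("Constant" row, Coef)
sse = s_yy - b1 * s_xy            # unexplained (residual) sum of squares
s = math.sqrt(sse / (n - 2))      # "S" in the Model Summary
r_sq = 1 - sse / s_yy             # "R-sq"
r_sq_adj = 1 - (1 - r_sq) * (n - 1) / (n - 2)  # "R-sq(adj)"
se_b1 = s / math.sqrt(s_xx)       # "SE Coef" for X
t_b1 = b1 / se_b1                 # "T-Value" for X

print(b0, b1)
print(round(s, 5), round(100 * r_sq, 2), round(100 * r_sq_adj, 2))
print(round(se_b1, 4), round(t_b1, 2))
```

The slope computed from the data is 1.96875, which MINITAB reports to four decimal places.
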
EXERCISES
1. The following table shows the data on the experience (in years) and monthly salaries (RM
’00) of 9 secretaries selected randomly.

Experience   14   3    5    6    4    9    18   5    16
Salary       22   12   15   17   15   19   24   13   27

a. Plot the scatter diagram for the given data.
b. Compute the Pearson product moment correlation coefficient.
c. Explain the meaning of the value obtained in (b).

2. The following table shows the data on the tuition class period (hours) and the number of
students who failed the examination.

Tuition Class Period (Hours)    10   12   20   22   12   7    6    7
Number of Students (Failed)     19   11   6    5    9    20   23   21

a. Calculate the Pearson product moment correlation coefficient and explain its
meaning.
b. Find the regression equation line using the least squares method.
c. Estimate the number of students who failed if the tuition class period is 12 hours.

3. The following are the MINITAB results on CGPA and starting salaries (in RM ’00) of seven
graduates. Based on the output, answer the following questions.

Regression Analysis: Salary versus CGPA

Term       Coef     SE Coef   T-Value   P-Value
Constant   7.712    3.001     2.57      0.050
CGPA       5.5429   0.9941    5.58      0.003

S = 1.82349   R-sq = 86.1%   R-sq(adj) = 83.4%

a. Identify the independent and dependent variables.
b. Explain the value of the coefficient of determination.
c. State the estimated regression line.
d. Interpret the value of the slope of the regression line obtained in (c).
e. Estimate the salary obtained when the CGPA is 3.15.

4. The following are the Minitab results on weight (in kg) and systolic blood pressure of 10
randomly selected students. Assume that the weight and blood pressure are both normally
distributed and are linearly related. Based on the output below, answer the following
questions.

Regression Analysis: Systolic blood pressure versus Weight

Term       Coef    SE Coef   T-Value   P-Value
Constant   48.61   5.49      8.85      0.000
Weight     0.361   0.133     2.70      0.027

S = 5.83008   R-sq = 47.75%   R-sq(adj) = 41.22%

a. Identify the independent and dependent variables.
b. State the coefficient of determination and interpret the value obtained.
c. State the value of the slope and interpret its meaning.

5. A biologist assumes that there is a linear relationship between the amount of fertilizer
supplied to tomato plants and the subsequent yield of tomatoes obtained. Eight tomato
plants of the same variety were selected at random and treated weekly with a solution in
which x grams of fertilizer was dissolved in a fixed quantity of water. The yield y kilograms
of the tomato was recorded.

Plant A B C D E F G H
X 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5
Y 3.9 4.4 5.8 6.6 7.0 7.1 7.3 7.7

a. Plot a scatter diagram of yield against amount of fertilizer.
b. Calculate the regression equation line and explain the value obtained.
c. Estimate the yield of a plant treated weekly with 3.2 grams of fertilizer.

6. An automobile company produced a lubricant known as SaveMile. It is believed that
SaveMile can be used to reduce fuel usage. The amount (in milliliters) of SaveMile and the
percentage of fuel saving for nine similar cars are as follows:

Fuel saving (%)   Amount of SaveMile (milliliter)
41                265
35                220
25                165
52                410
55                435
46                350
30                180
22                135
38                250

a. Compute the Pearson product moment correlation coefficient and interpret the value
obtained.
b. Using the least squares method, find the linear regression equation of the fuel saving
against the amount of SaveMile.
c. Explain the slope obtained in (b).
d. Determine the coefficient of determination and interpret its meaning.
