Chapter 3
Note:
Dependent Variable: The variable that measures the effect of the independent
variable on the test units. It is also known as the response variable.
The first step in determining whether a relationship exists between two variables is to
plot a graph for the data. This graph is called a scatter plot. The scatter plot is a visual
way to describe the nature of the relationship between the independent (X) and
dependent (Y) variable. The scales of the variables can be different, and the
coordinates of the axes are determined by the smallest and largest data values of the
variables. Examples of scatter plots are given in Figures 3.1 to 3.7.
Figure 3.1: Perfect positive correlation Figure 3.2: Perfect negative correlation
Figure 3.3: Positive correlation Figure 3.4: Negative correlation
Correlation analysis is used to measure the strength of the relationship between two
variables. It is represented as a number. The correlation coefficient is a measure of
how closely related two data series are. In particular, the correlation coefficient
measures the direction and extent of linear association between two variables. There
are several types of correlation coefficients. The one explained in this section is called
the Pearson product moment correlation coefficient, which is normally denoted by r.
Pearson’s correlation coefficient tells us two aspects of the relationship between two
variables. The sign (- or +) of r identifies the kind of relationship between the two
quantitative variables, and the magnitude of r describes the strength of the relationship.
The value of r lies between -1.0 and 1.0.
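$$
r = \frac{\sum x_i y_i - \dfrac{\left(\sum x_i\right)\left(\sum y_i\right)}{n}}
         {\sqrt{\left[\sum x_i^2 - \dfrac{\left(\sum x_i\right)^2}{n}\right]
                \left[\sum y_i^2 - \dfrac{\left(\sum y_i\right)^2}{n}\right]}}
$$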
where
$r$ = correlation coefficient
$n$ = number of observations
$x$ = independent variable
$y$ = dependent variable

$|r| = 1$ : Perfect correlation
$|r| \ge 0.8$ : Strong correlation
$0.5 \le |r| < 0.8$ : Moderate correlation
$|r| < 0.5$ : Weak correlation
$r = 0$ : No correlation
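As a minimal sketch, the computational formula for r can be written in Python as follows. The function name `pearson_r` and the two small data sets are illustrative assumptions, not part of the text. The data are chosen so that the points fall exactly on a rising line and on a falling line.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product moment correlation from the computational formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return numerator / denominator

# Hypothetical data: points lying exactly on a line give r = +1 or r = -1.
r_up = pearson_r([1, 2, 3], [2, 4, 6])
r_down = pearson_r([1, 2, 3], [6, 4, 2])
print(r_up, r_down)
```

The sign of the result gives the direction of the relationship and its absolute value gives the strength.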
Example 3.1
Draw a scatter diagram and compute the value of the correlation coefficient for the data
obtained in the study of the number of absences and the final grade of the seven students
in the statistics class.
Student   Number of      Final Grade %   xy      x^2    y^2
          Absences (x)   (y)
A         6              82              492     36     6724
B         2              86              172     4      7396
C         15             43              645     225    1849
D         9              74              666     81     5476
E         12             58              696     144    3364
F         5              90              450     25     8100
G         8              78              624     64     6084
Total     57             511             3745    579    38993
Solution:
[Scatter diagram: Final Grade (%) plotted against Number of Absences]
$$
r = \frac{3745 - \dfrac{(57)(511)}{7}}
         {\sqrt{\left[579 - \dfrac{(57)^2}{7}\right]
                \left[38993 - \dfrac{(511)^2}{7}\right]}}
  = -0.9442
$$
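The arithmetic in Example 3.1 can be checked with a short Python sketch. It rebuilds the column totals from the raw data and substitutes them into the formula for r. The variable names are illustrative choices.

```python
from math import sqrt

absences = [6, 2, 15, 9, 12, 5, 8]       # x
grades = [82, 86, 43, 74, 58, 90, 78]    # y
n = len(absences)

# Column totals of the working table.
sum_x = sum(absences)
sum_y = sum(grades)
sum_xy = sum(x * y for x, y in zip(absences, grades))
sum_x2 = sum(x * x for x in absences)
sum_y2 = sum(y * y for y in grades)

r = (sum_xy - sum_x * sum_y / n) / sqrt(
    (sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n)
)
print(sum_x, sum_y, sum_xy, sum_x2, sum_y2)   # 57 511 3745 579 38993
print(round(r, 4))                            # -0.9442
```

The negative sign indicates that final grades tend to fall as the number of absences rises.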
The simple linear regression model is a basic regression model in which there is only one
independent variable and one dependent variable. In studying relationships between
two variables, collect the data and then construct a scatter plot. The purpose of the
scatter plot, as indicated previously, is to determine the nature of the relationship. The
possibilities include a positive linear relationship, a negative linear relationship, a
curvilinear relationship, or no discernible relationship. After the scatter plot is drawn,
the next steps are to compute the value of the correlation coefficient and to test the
significance of the relationship. If the value of the correlation coefficient is significant,
the next step is to determine the equation of the regression line, which is the data’s line
of best fit. (Note: Determining the regression line when r is not significant and then
making predictions using the regression line are meaningless.) The purpose of the
regression line is to enable the researcher to see the trend and make predictions on
the basis of the data. The simple linear model can be stated as follows:
$$
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i
$$
where
$Y_i$ is the value of the response variable in the $i$th trial
$\beta_0$ and $\beta_1$ are regression coefficients or parameters
$X_i$ is a known constant (the value of the independent variable in the $i$th trial)
$\varepsilon_i$ is a random error with mean $E(\varepsilon_i) = 0$ and variance $V(\varepsilon_i) = \sigma^2$
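To make the roles of the terms concrete, the following sketch generates responses from the model. It is an illustration only. The parameter values, the X values, and the choice of normally distributed errors are assumptions made here, since the model itself only specifies the mean and variance of the error term.

```python
import random

def simulate_responses(x_values, beta0, beta1, sigma, rng):
    """Draw Y_i = beta0 + beta1 * X_i + e_i, assuming e_i ~ Normal(0, sigma^2)."""
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in x_values]

rng = random.Random(42)          # fixed seed so the run is repeatable
x_values = [1, 2, 3, 4, 5]       # hypothetical known constants X_i
y_values = simulate_responses(x_values, beta0=2.0, beta1=0.5, sigma=1.0, rng=rng)
print(y_values)

# With sigma = 0 the error term vanishes and every point lies on the line.
on_line = simulate_responses(x_values, beta0=2.0, beta1=0.5, sigma=0.0, rng=rng)
print(on_line)
```

The scatter of the simulated points around the line grows with `sigma`, which is what the error term in the model represents.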
In the regression analysis, the assumptions of the model and error terms must be
considered in order to ensure that the result or estimated regression model is correct.
Figure 3.8 shows a scatter plot for the data of two variables. It shows that several lines
can be drawn on the graph near the points. Given a scatter plot, you must be able to
draw the line of best fit. Best fit means that the sum of the squares of the vertical
distances from each point to the line is at a minimum (Figure 3.9). The reason you need
a line of best fit is that the values of y will be predicted from the values of x; hence, the
closer the points are to the line, the better the fit and the prediction will be.
Figure 3.8: The regression line Figure 3.9: Best fit of regression line
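The best-fit criterion can be illustrated by computing the sum of squared vertical distances for two candidate lines. This is a sketch using the absences and grades data from Example 3.1. The two sets of coefficients are hypothetical lines chosen for illustration, not fitted values.

```python
def sum_squared_errors(x, y, b0, b1):
    """Sum of squared vertical distances from the points to y = b0 + b1*x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

absences = [6, 2, 15, 9, 12, 5, 8]
grades = [82, 86, 43, 74, 58, 90, 78]

# Two hypothetical candidate lines drawn near the points.
sse_a = sum_squared_errors(absences, grades, b0=100.0, b1=-3.5)
sse_b = sum_squared_errors(absences, grades, b0=95.0, b1=-2.0)
print(sse_a, sse_b)   # 200.75 714.0
```

Of these two candidates, the first lies closer to the points. The line of best fit is the one for which this quantity cannot be made any smaller.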
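$$
b_1 = \frac{\sum xy - \dfrac{\left(\sum x\right)\left(\sum y\right)}{n}}
           {\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}},
\qquad
b_0 = \bar{y} - b_1 \bar{x}
$$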
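The least squares formulas for the slope and intercept can be sketched in Python as follows. The function name `least_squares_line` is an illustrative choice. The data are the absences and grades from Example 3.1, used here only as a demonstration, since the text does not fit a line to them.

```python
def least_squares_line(x, y):
    """Return (b0, b1) for the fitted line y-hat = b0 + b1*x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    b0 = sum_y / n - b1 * sum_x / n     # y-bar minus b1 times x-bar
    return b0, b1

absences = [6, 2, 15, 9, 12, 5, 8]
grades = [82, 86, 43, 74, 58, 90, 78]
b0, b1 = least_squares_line(absences, grades)
print(round(b0, 4), round(b1, 4))
```

The slope is computed first because the intercept depends on it.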
The coefficient of determination is the ratio of the explained variation to the total
variation. It is normally denoted by $R^2$. In other words, $R^2$ states how much of the
variability in Y can be explained by its linear relationship with X. For the simple linear
regression of y on x, the coefficient of determination is the square of the correlation
coefficient, r. Thus, we can state that:
$$
\text{Coefficient of Determination} = R^2 = \frac{\text{Explained Variation}}{\text{Total Variation}}
$$
For example, if the correlation coefficient is $r = 0.91$, then the coefficient of determination
is $R^2 = (r)^2 = (0.91)^2 \approx 0.828$. Therefore, $R^2 = 0.828$ means that 82.8% of the
variability of Y can be explained by the variability in X. The remaining 17.2% is
unexplained variability in Y.
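The claim that the ratio of explained to total variation equals the square of r can be checked numerically. The sketch below uses the absences and grades data from Example 3.1 purely as an illustration, and it works with deviations from the means, which is algebraically equivalent to the computational formulas.

```python
from math import sqrt

x = [6, 2, 15, 9, 12, 5, 8]
y = [82, 86, 43, 74, 58, 90, 78]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
s_xx = sum((a - x_bar) ** 2 for a in x)
s_yy = sum((b - y_bar) ** 2 for b in y)      # total variation

# Least squares line and its fitted values.
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * a for a in x]

explained = sum((f - y_bar) ** 2 for f in fitted)   # explained variation
r_squared = explained / s_yy
r = s_xy / sqrt(s_xx * s_yy)
print(round(r_squared, 4), round(r ** 2, 4))
```

Both printed values agree, which is the identity stated above for simple linear regression.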
Example 3.2
The following table shows the data on the post test and final exam of ten people.
Person 1 2 3 4 5 6 7 8 9 10
Post test 100 96 88 100 100 96 80 68 92 96
a. Calculate the Pearson product moment correlation coefficient and interpret its
meaning.
b. Find the equation of the regression line using the least squares method.
c. Calculate the coefficient of determination and explain its meaning.
d. Estimate the final exam score if the post test score is 55.
Solution:
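a. Pearson product moment correlation coefficient
$$
r = \frac{80284 - \dfrac{(916)(860)}{10}}
         {\sqrt{\left[84880 - \dfrac{(916)^2}{10}\right]
                \left[76610 - \dfrac{(860)^2}{10}\right]}}
  = 0.9384
$$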
The value of $r = 0.9384$ suggests a strong positive relationship between the post test
and final exam scores.
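b. Regression equation line
$$
b_1 = \frac{\sum xy - \dfrac{\left(\sum x\right)\left(\sum y\right)}{n}}
           {\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}}
    = \frac{80284 - \dfrac{(916)(860)}{10}}{84880 - \dfrac{(916)^2}{10}}
    = 1.5476
$$
$$
b_0 = \bar{y} - b_1 \bar{x} = -55.7619
$$
$$
\hat{y} = -55.7619 + 1.5476x
$$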
c. Coefficient of Determination, $R^2$
$$
R^2 = (r)^2 = (0.9384)^2 = 0.8806
$$
88.06% of the variability of Final Exam score can be explained by the variability in
Post Test score. The remaining 11.94% is unexplained variability in Final Exam
score.
d. Prediction
$$
\hat{y} = -55.7619 + 1.5476(55) = 29.36
$$
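The four parts of Example 3.2 can be reproduced from the sums that appear in the solution. This is a sketch. It starts from the totals rather than the raw scores, and the variable names are illustrative.

```python
from math import sqrt

# Sums taken from the Example 3.2 solution (x = post test, y = final exam).
n = 10
sum_x, sum_y = 916, 860
sum_xy, sum_x2, sum_y2 = 80284, 84880, 76610

s_xy = sum_xy - sum_x * sum_y / n
s_xx = sum_x2 - sum_x ** 2 / n
s_yy = sum_y2 - sum_y ** 2 / n

r = s_xy / sqrt(s_xx * s_yy)          # part (a)
b1 = s_xy / s_xx                      # part (b): slope
b0 = sum_y / n - b1 * sum_x / n       # part (b): intercept
r_squared = r ** 2                    # part (c)
prediction = b0 + b1 * 55             # part (d)

print(round(r, 4), round(b1, 4), round(b0, 4))   # 0.9384 1.5476 -55.7619
print(round(r_squared, 4), round(prediction, 2)) # 0.8807 29.36
```

Squaring the unrounded r gives 0.8807, while squaring the rounded value 0.9384 gives the 0.8806 reported in part (c). The difference is rounding only. Note also that a post test score of 55 lies below the smallest observed score of 68, so the prediction in part (d) is an extrapolation and should be read with caution.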
3.6 Correlation and Simple Linear Regression Using MINITAB Software
This section will illustrate how to run correlation and simple linear regression using
Minitab software. The steps of the analysis are shown below.
3.6.1 Correlation
Step 1: From the menu at the top of the screen, click on Stat, then Basic Statistics,
then Correlation.
Step 2: Enter the variable Y into the Variables column by clicking Y, then click Select.
Step 3: Enter the variable X into the Variables column by clicking X, then click Select, then
click OK.
3.6.2 Simple Linear Regression
Step 1: From the menu at the top of the screen, click on Stat, then select Regression,
then Regression, then Fit Regression Model.
Step 2: Enter the variable Y into the Response column by clicking Y, then click Select.
Step 3: Enter the variable X into the Continuous predictors column by clicking X, then click
Select, then click OK.
The data, along with the MINITAB output, are presented below.
Model Summary
Coefficients
yˆ 171.25 1.9688 x
Experience 14 3 5 6 4 9 18 5 16
Salary 22 12 15 17 15 19 24 13 27
2. The following table shows the data on the tuition class period (hours) and the number of
students who failed in the examination.
a. Calculate the Pearson product moment correlation coefficient and explain its
meaning.
b. Find the regression equation line using the least squares method.
c. Estimate the number of students who failed if the tuition class period is 12 hours.
3. The following are the MINITAB results on CGPA and starting salaries (in RM ’00) of seven
graduates. Based on the output, answer the following questions.
4. The following are the MINITAB results on weight (in kg) and systolic blood pressure of 10
randomly selected students. Assume that the weight and blood pressure are both normally
distributed and are linearly related. Based on the output below, answer the following
questions.
Regression Analysis: Systolic blood pressure versus Weight
5. A biologist assumes that there is a linear relationship between the amount of fertilizer
supplied to tomato plants and the subsequent yield of tomatoes obtained. Eight tomato
plants of the same variety were selected at random and treated weekly with a solution in
which x grams of fertilizer was dissolved in a fixed quantity of water. The yield, y kilograms
of tomatoes, was recorded.
Plant A B C D E F G H
X 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5
Y 3.9 4.4 5.8 6.6 7.0 7.1 7.3 7.7
a. Compute the Pearson product moment correlation coefficient and interpret the value
obtained.
b. Using the least squares method, find the linear regression equation of the fuel saving against
the amount of SaveMile.
c. Explain the slope obtained in (b).
d. Determine the coefficient of determination and interpret its meaning.