Week08 Correlation and Regression
Covariance:
Population Parameter: $\sigma_{X,Y} = \dfrac{\sum_{i=1}^{N} (Y_i - \mu_Y)(X_i - \mu_X)}{N}$
The population parameter describes linear association between
X and Y for the population.
Estimator/Sample Statistic: $s_{X,Y} = \dfrac{\sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X})}{n - 1}$
The sample statistic or estimator is used with sample data to
estimate the linear association between X and Y for the population.
Covariance
1. No causal effect is implied
2. Create deviations for Y and deviations for X for each
observation.
3. Form the products of these deviations.
4. The graph that follows illustrates these deviations.
   a. In Quadrant I, the products of deviations are positive.
   b. In Quadrant II, the products of deviations are negative.
   c. Covariance: on average, what are the products of
      deviations? Are they positive or negative?
5. Covariance is not widely used, because the units are
often confusing.
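Steps 2 to 4 can be sketched in Python. This is a minimal illustration, not part of the original slides: the function name `deviation_products` is hypothetical, and the five (x, y) pairs are the small data set worked later in these notes.

```python
def deviation_products(xs, ys):
    """Return (x_i - x_bar) * (y_i - y_bar) for each observation."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    return [(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)]

xs = [0, 2, 3, 4, 6]
ys = [3, 2, 4, 0, 6]
products = deviation_products(xs, ys)

# Points with a positive product pull the covariance up;
# points with a negative product pull it down.
positive = sum(1 for p in products if p > 0)
negative = sum(1 for p in products if p < 0)

# Sample covariance: the "average" product, using n - 1 as the divisor.
sample_cov = sum(products) / (len(xs) - 1)
print(positive, negative, sample_cov)
```

Here two products are positive and one is negative (the other two are zero), and the positive ones dominate, so the covariance comes out positive.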
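[Figure: scatterplot of Price (vertical axis, 10000 to 17000) against Mileage (horizontal axis, 10000 to 45000), divided into four quadrants by the lines $X = \bar{X}$ and $Y = \bar{Y}$]
Quadrant I: $(X_i - \bar{X}) > 0$, $(Y_i - \bar{Y}) > 0$
Quadrant II: $(X_i - \bar{X}) < 0$, $(Y_i - \bar{Y}) > 0$
Quadrant III: $(X_i - \bar{X}) < 0$, $(Y_i - \bar{Y}) < 0$
Quadrant IV: $(X_i - \bar{X}) > 0$, $(Y_i - \bar{Y}) < 0$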
SIS 1037Y 14 2020/2021
Interpreting Covariance
Covariance between two variables:
cov(X,Y) > 0 X and Y tend to move in the same direction
cov(X,Y) < 0 X and Y tend to move in opposite directions
Smoking and Lung Capacity
Suppose, for example, we wanted to investigate the
relationship between cigarette smoking and lung
capacity.
We might ask a group of people about their smoking
habits and measure their lung capacities.
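[Figure: scatterplot with Lung Capacity on the vertical axis]
Cigarettes (X)   X - X̄    (X - X̄)(Y - Ȳ)   Y - Ȳ   Lung Capacity (Y)
0                -10      -90              +9      45
5                -5       -30              +6      42
10               0        0                -3      33
15               +5       -25              -5      31
20               +10      -70              -7      29
$S_{xy} = \dfrac{1}{4}(-215) = -53.75$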
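The smoking calculation can be reproduced with a short Python sketch. The function name `covariance` and its `sample` flag are hypothetical; the data are the five cigarette and lung capacity values from the table.

```python
def covariance(xs, ys, sample=True):
    """Covariance of paired data.

    sample=True divides by n - 1 (sample statistic);
    sample=False divides by N (population parameter).
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    total = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)

cigarettes = [0, 5, 10, 15, 20]
lung_capacity = [45, 42, 33, 31, 29]

s_xy = covariance(cigarettes, lung_capacity)
print(s_xy)  # -53.75
```

The negative sign says that higher cigarette counts go with lower lung capacity in this sample; the size of the number is hard to interpret because of its units.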
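[Figure: scatterplot of the five (x, y) points listed below]
x    y    $x_i - \bar{x}$    $y_i - \bar{y}$    $(x_i - \bar{x})(y_i - \bar{y})$
0    3    -3                 0                  0
2    2    -1                 -1                 1
3    4    0                  1                  0
4    0    1                  -3                 -3
6    6    3                  3                  9
$\bar{x} = 3$, $\bar{y} = 3$, $\sum (x_i - \bar{x})(y_i - \bar{y}) = 7$
$\mathrm{cov}(x, y) = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \dfrac{7}{4} = 1.75$
What does this number tell us?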
Problem with Covariance:
The value of the covariance depends on the size of the
data’s standard deviations: when they are large, the covariance
is larger in magnitude than when they are small, even if the
relationship between x and y is exactly the same in both
datasets. Pearson’s correlation coefficient corrects for this by
dividing the covariance by both standard deviations.
$r_{xy} = \dfrac{\mathrm{cov}(x, y)}{s_x s_y}$
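The scale problem can be seen numerically. The sketch below is an illustration, not part of the slides: it takes the five-point data set used earlier, multiplies every x-value by 10 (as if x were recorded in smaller units), and compares covariance and correlation. The function names are hypothetical.

```python
from statistics import mean, stdev

def sample_covariance(xs, ys):
    """Sample covariance: sum of deviation products divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)

def pearson_r(xs, ys):
    """Covariance rescaled by both sample standard deviations."""
    return sample_covariance(xs, ys) / (stdev(xs) * stdev(ys))

xs = [0, 2, 3, 4, 6]
ys = [3, 2, 4, 0, 6]
xs_rescaled = [10 * x for x in xs]  # same data, x in units ten times smaller

cov_original = sample_covariance(xs, ys)
cov_rescaled = sample_covariance(xs_rescaled, ys)
r_original = pearson_r(xs, ys)
r_rescaled = pearson_r(xs_rescaled, ys)

print(cov_original, cov_rescaled)                   # 1.75 17.5
print(round(r_original, 2), round(r_rescaled, 2))   # 0.35 0.35
```

The covariance grows tenfold although the relationship is unchanged, while r stays the same.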
Pearson’s R continued
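$\mathrm{cov}(x, y) = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
$r_{xy} = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\, s_x s_y}$
$r_{xy} = \dfrac{\sum_{i=1}^{n} Z_{x_i} Z_{y_i}}{n - 1}$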
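The z-score form can be checked with a few lines of Python. This is a sketch under the assumption that z-scores are computed with the sample standard deviation (divisor n - 1); the function name is hypothetical and the data are the smoking values from these notes.

```python
from statistics import mean, stdev

def pearson_r_from_z_scores(xs, ys):
    """r as the sum of z-score products divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    z_x = [(x - x_bar) / s_x for x in xs]
    z_y = [(y - y_bar) / s_y for y in ys]
    return sum(zx * zy for zx, zy in zip(z_x, z_y)) / (n - 1)

cigarettes = [0, 5, 10, 15, 20]
lung_capacity = [45, 42, 33, 31, 29]

r = pearson_r_from_z_scores(cigarettes, lung_capacity)
print(round(r, 4))  # -0.9615
```

The result matches the covariance of -53.75 divided by the product of the two sample standard deviations.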
Coefficient of Correlation: r
Measures the relative strength of the linear relationship
between two numerical variables
The correlation coefficient is also known as the product-
moment coefficient of correlation or Pearson's
correlation.
$(\sum x)^2$ indicates that the x-values should be added and the total then squared.
$\sum xy$ indicates that each x-value is multiplied by its corresponding y-value, and the products are then summed.
r = linear correlation coefficient for sample data
ρ = linear correlation coefficient for a population of paired data
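$r = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left[\sum (x - \bar{x})^2\right]\left[\sum (y - \bar{y})^2\right]}}$
$r = \dfrac{n \sum xy - \sum x \sum y}{\sqrt{\left[n\left(\sum x^2\right) - \left(\sum x\right)^2\right]\left[n\left(\sum y^2\right) - \left(\sum y\right)^2\right]}}$
$r = \dfrac{\overline{xy} - \bar{x}\,\bar{y}}{\sqrt{\left(\overline{x^2} - \bar{x}^2\right)\left(\overline{y^2} - \bar{y}^2\right)}}$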
Correlation:
Correlation measures the degree of linear association between
two variables, say X and Y. There are no units – dividing
covariance by the standard deviations eliminates units.
Correlation is a pure number. The range is from -1 to +1. If the
correlation coefficient is -1, it means perfect negative linear
association; +1 means perfect positive linear association.
Population Parameter: $\rho_{XY} = \dfrac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$
Estimator/Sample Statistic: $r_{X,Y} = \dfrac{s_{X,Y}}{s_X s_Y}$
The sample statistic or estimator is used with sample data to
estimate the linear association between X and Y for the population.
Computing a correlation
Cigarettes (X)   X²     XY     Y²     Lung Capacity (Y)
0                0      0      2025   45
5                25     210    1764   42
10               100    330    1089   33
15               225    465    961    31
20               400    580    841    29
Σ = 50           750    1585   6680   180
Computing a Correlation
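$r_{xy} = \dfrac{(5)(1585) - (50)(180)}{\sqrt{\left[(5)(750) - 50^2\right]\left[(5)(6680) - 180^2\right]}}$
$= \dfrac{7925 - 9000}{\sqrt{(3750 - 2500)(33400 - 32400)}}$
$= \dfrac{-1075}{\sqrt{(1250)(1000)}} = -0.9615$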
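The computational formula works directly from the column totals. The sketch below is an illustration with a hypothetical function name; the first call uses the smoking totals from the table, and the second uses the sums quoted in the tree height example in these notes.

```python
from math import sqrt

def pearson_r_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Pearson's r from the six summary quantities of paired data."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Smoking data: n = 5, sums taken from the table of totals.
r_smoking = pearson_r_from_sums(5, 50, 180, 1585, 750, 6680)
print(round(r_smoking, 4))  # -0.9615

# Tree data: n = 8, sums as quoted in the tree height example.
r_trees = pearson_r_from_sums(8, 73, 321, 3142, 713, 14111)
print(round(r_trees, 3))  # 0.886
```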
Conclusion
$r_{xy} = -0.96$
$r_{xy} = -0.96$ indicates a very strong negative linear association:
a heavier smoker will almost certainly have diminished lung capacity.
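[Figure: scatterplot of Tree Height, y (vertical axis, 0 to 70) against Trunk Diameter, x (horizontal axis, 0 to 14)]
$r = \dfrac{n \sum xy - \sum x \sum y}{\sqrt{\left[n\left(\sum x^2\right) - \left(\sum x\right)^2\right]\left[n\left(\sum y^2\right) - \left(\sum y\right)^2\right]}}$
$= \dfrac{8(3142) - (73)(321)}{\sqrt{\left[8(713) - (73)^2\right]\left[8(14111) - (321)^2\right]}}$
$= 0.886$
r = 0.886 → relatively strong positive linear association between x and y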
The paired shoe / height data from five males are listed
below. Use a computer or a calculator to find the value of
the correlation coefficient r.
Test Statistic: r
Critical Values: Refer to Table.
Using the test statistic r = 0.591 from the earlier example, the
critical values r = ±0.878 are found in the table with n = 5
and α = 0.05.
$t = \dfrac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \dfrac{0.591}{\sqrt{\dfrac{1 - 0.591^2}{5 - 2}}} = 1.269$
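Both versions of the test can be sketched in Python. The function name is hypothetical; the inputs r = 0.591, n = 5 and the critical value 0.878 are the ones quoted above.

```python
from math import sqrt

def t_statistic_for_r(r, n):
    """t statistic for testing a linear correlation, with n - 2 degrees of freedom."""
    return r / sqrt((1 - r ** 2) / (n - 2))

r, n = 0.591, 5
t = t_statistic_for_r(r, n)
print(round(t, 3))  # 1.269

# Critical-value approach: compare |r| with the table value for n = 5, alpha = 0.05.
critical_r = 0.878
exceeds_critical_value = abs(r) > critical_r
print(exceeds_critical_value)  # False
```

Since |r| does not exceed the critical value, the sample does not support a claim of linear correlation.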
Regression Equation:
                                     Population Parameter   Sample Statistic
y-intercept of regression equation   β0                     b0
Slope of regression equation         β1                     b1
Equation of the regression line      y = β0 + β1x           ŷ = b0 + b1x
Slope: $b_1 = r\,\dfrac{s_y}{s_x}$
y-intercept: $b_0 = \bar{y} - b_1 \bar{x}$
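$b_1 = \dfrac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}$
$b_0 = \dfrac{\sum y_i - b_1 \sum x_i}{n} = \bar{y} - b_1 \bar{x}$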
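The sum-based formulas translate directly into code. This is a sketch with a hypothetical function name; to keep the numbers traceable it reuses the smoking totals from the correlation table (n = 5, Σx = 50, Σy = 180, Σxy = 1585, Σx² = 750), which the slides themselves do not fit a line to.

```python
def least_squares_line(n, sum_x, sum_y, sum_xy, sum_x2):
    """Return (b0, b1) for the line y_hat = b0 + b1 * x from summary sums."""
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b0 = (sum_y - b1 * sum_x) / n  # same as y_bar - b1 * x_bar
    return b0, b1

b0, b1 = least_squares_line(5, 50, 180, 1585, 750)
print(round(b0, 2), round(b1, 2))
```

Under these inputs the slope is about -0.86 and the intercept about 44.6, so each additional cigarette lowers the predicted lung capacity by roughly 0.86 units.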
$\hat{y} = 125 + 1.73x$
Now we use the formulas to determine the regression
equation.
Technology can be used to find the values of the sample means and
sample standard deviations used below.
$b_1 = r\,\dfrac{s_y}{s_x} = 0.591269 \cdot \dfrac{4.87391}{1.66823} = 1.72745$
The regression line does not fit the points well. The
correlation r = 0.591 (P-value 0.294) does not provide sufficient
evidence of a linear correlation, so the best predicted height is
the sample mean: $\bar{y} = 177.3$ cm
Example
Use the 40 pairs of shoe print lengths from the given Data
Set to predict the height of a person with a shoe print
length of 29 cm.
Now, the regression line does fit the points well, and the
correlation of r = 0.813 suggests that there is a linear
correlation (the P-value is 0.000).
$\hat{y} = 80.9 + 3.22x$
$= 80.9 + 3.22(29)$
$= 174.3$ cm
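The prediction is a single substitution into the fitted line. The sketch below uses a hypothetical function name, with the coefficients 80.9 and 3.22 taken from the regression equation above.

```python
def predict_height_cm(shoe_print_cm, b0=80.9, b1=3.22):
    """Predicted height from shoe print length using y_hat = b0 + b1 * x."""
    return b0 + b1 * shoe_print_cm

predicted = predict_height_cm(29)
print(round(predicted, 1))  # 174.3

# Raising the shoe print length by 1 cm raises the prediction by the slope.
difference = predict_height_cm(30) - predict_height_cm(29)
print(round(difference, 2))  # 3.22
```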
$\hat{y} = 80.9 + 3.22x$
The slope of 3.22 tells us that if we increase shoe print
length by 1 cm, the predicted height of a person increases by
3.22 cm.
That is:
The residual plot should not have any obvious patterns (not
even a straight line pattern). This confirms that the
scatterplot of the sample data is a straight-line pattern.
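$\hat{y} - E < y < \hat{y} + E$
$E = t_{\alpha/2}\, s_e \sqrt{1 + \dfrac{1}{n} + \dfrac{n(x_0 - \bar{x})^2}{n\left(\sum x^2\right) - \left(\sum x\right)^2}}$
$s_e = \sqrt{\dfrac{\sum (y - \hat{y})^2}{n - 2}}$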
Recall :
$\hat{y} = 80.9 + 3.22x$
$x_0 = 29.0$
$s_e = 5.943761$
$\hat{y} = 174.3$
$t_{\alpha/2} = 2.024$ (Table A-3, df = 38, 0.05 area in two tails)
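The margin of error E can be sketched as a function of the x-values, the point of interest $x_0$, the standard error $s_e$ and the critical t value. The 40 shoe print lengths are not listed in these notes, so the example below uses a small set of made-up lengths together with made-up values of $s_e$ and t; only the formula comes from the slides, and the function name is hypothetical.

```python
from math import sqrt

def prediction_margin(xs, x0, s_e, t_crit):
    """Margin of error E for a prediction interval at x0."""
    n = len(xs)
    sum_x = sum(xs)
    sum_x2 = sum(x * x for x in xs)
    x_bar = sum_x / n
    spread = n * sum_x2 - sum_x ** 2
    return t_crit * s_e * sqrt(1 + 1 / n + n * (x0 - x_bar) ** 2 / spread)

# Hypothetical illustration data (NOT the data set from the slides).
hypothetical_lengths = [26.0, 27.5, 29.0, 30.5, 32.0]
hypothetical_s_e = 2.0
hypothetical_t = 3.182

margin_at_mean = prediction_margin(hypothetical_lengths, 29.0, hypothetical_s_e, hypothetical_t)
margin_far_from_mean = prediction_margin(hypothetical_lengths, 32.0, hypothetical_s_e, hypothetical_t)
print(margin_at_mean < margin_far_from_mean)  # True
```

The interval is narrowest when $x_0$ equals the mean of the x-values and widens as $x_0$ moves away from it. With the real data, the slide values $s_e = 5.943761$ and $t_{\alpha/2} = 2.024$ would be passed in together with the 40 observed lengths.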
This is a large range of values, so the single shoe print doesn’t give us
very good information about someone’s height.
The figure shows (5,13) lies on the regression line, but (5,19) does not.
We arrive at:
$(y - \bar{y}) = (\hat{y} - \bar{y}) + (y - \hat{y})$
$\sum (y - \bar{y})^2 = \sum (\hat{y} - \bar{y})^2 + \sum (y - \hat{y})^2$
$r^2 = \dfrac{\text{explained variation}}{\text{total variation}}$
$r^2 = 0.813^2 = 0.661$
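The decomposition of variation can be verified numerically. The shoe print data are not listed in these notes, so this sketch uses the smoking data instead (r = -0.9615 there, so $r^2$ should be about 0.9245). The function name is hypothetical, and the least-squares line is fitted inside it.

```python
def variation_breakdown(xs, ys):
    """Return (total, explained, unexplained) variation for the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * x for x in xs]
    total = sum((y - y_bar) ** 2 for y in ys)
    explained = sum((yh - y_bar) ** 2 for yh in y_hat)
    unexplained = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))
    return total, explained, unexplained

cigarettes = [0, 5, 10, 15, 20]
lung_capacity = [45, 42, 33, 31, 29]

total, explained, unexplained = variation_breakdown(cigarettes, lung_capacity)
r_squared = explained / total
print(round(r_squared, 4))  # 0.9245
```

About 92% of the variation in lung capacity in this sample is explained by the linear relationship with cigarette count, and the explained and unexplained parts add up to the total.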
The general form of the multiple regression equation
obtained from sample data is
$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$
n = sample size
k = number of predictor variables
ŷ = predicted value of y
$x_1, x_2, x_3, \dots, x_k$ are the predictor variables
$[n - (k + 1)]$
The value of 0.000 results from a test of the null hypothesis that
β1 = β2 = 0, and rejection of this hypothesis indicates the equation
is effective in predicting the heights of daughters.
Finding the Best Multiple
Regression Equation
Using those sample data, find the regression equation that is the best for
predicting height.
The table on the next slide includes key results from the combinations of
the five predictor variables.
Quadratic: $y = ax^2 + bx + c$
Logarithmic: $y = a + b \ln x$
Exponential: $y = ab^x$
Power: $y = ax^b$