Corr and Regress
Variance:
• Gives information on variability of a single variable.

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$
Covariance:
• Gives information on the degree to which two variables vary together.
• Note how similar the covariance is to variance: the equation simply multiplies x's error scores by y's error scores, as opposed to squaring x's error scores.

$$\mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}$$
Covariance

$$\mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}$$

◼ When X and Y increase together: cov(x,y) = positive.
◼ When X increases as Y decreases: cov(x,y) = negative.
◼ When there is no constant relationship: cov(x,y) = 0.
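The sign behaviour above can be checked with a short sketch (pure Python, no libraries; the data values and the helper name `covariance` are made up for illustration):

```python
def covariance(x, y):
    # Sample covariance: sum of products of deviations, divided by n - 1.
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

x = [1, 2, 3]
print(covariance(x, [2, 4, 6]))  # y rises with x -> positive (2.0)
print(covariance(x, [6, 4, 2]))  # y falls as x rises -> negative (-2.0)
```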
Example Covariance
6 x y xi − x yi − y ( xi − x )( yi − y )
5
0 3 -3 0 0
4
3
2 2 -1 -1 1
2
3 4 0 1 0
1
4 0 1 -3 -3
0 6 6 3 3 9
0 1 2 3 4 5 6 7
x=3 y=3 = 7
( x − x)( y − y))
i i
7
What does this
cov(x, y ) = i =1
= = 1.75 number tell us?
n −1 4
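The table's arithmetic can be reproduced in a few lines (pure Python; the helper name `covariance` is just for illustration):

```python
def covariance(x, y):
    # Sample covariance: sum of products of deviations, divided by n - 1.
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

x = [0, 2, 3, 4, 6]
y = [3, 2, 4, 0, 6]
print(covariance(x, y))  # 1.75, matching the worked example (7 / 4)
```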
Problem with Covariance:
• The size of the covariance depends on the measurement scales of x and y, so the number itself is hard to interpret. Standardizing by the two standard deviations removes the units and gives the Pearson correlation coefficient:

$$r_{xy} = \frac{\mathrm{cov}(x, y)}{s_x s_y}$$
Pearson correlation coefficient
• Computational formula:

$$r_{xy} = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left(n\sum x_i^2 - (\sum x_i)^2\right)\left(n\sum y_i^2 - (\sum y_i)^2\right)}}$$

• CORRELATION ≠ CAUSATION
• In order to infer causality: manipulate the independent variable and observe the effect on the dependent variable.
When to use the Pearson correlation
coefficient
The Pearson correlation coefficient (rxy) is one of several correlation
coefficients that you need to choose between when you want to measure a
correlation. The Pearson correlation coefficient is a good choice when all of the
following are true:
• Both variables are quantitative: You will need to use a different method if either of the variables is qualitative.
• The variables are normally distributed: You can create a histogram of each variable to verify whether the distributions are approximately normal. It's not a problem if the variables are a little non-normal.
• The data have no outliers: Outliers are observations that don't follow the same patterns as the rest of the data. A scatterplot is one way to check for outliers: look for points that are far away from the others.
• The relationship is linear: "Linear" means that the relationship between the two variables can be described reasonably well by a straight line. You can use a scatterplot to check whether the relationship between two variables is linear.
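Once those checks pass, r_xy is straightforward to compute from the definition cov(x, y) / (s_x · s_y). A minimal sketch in pure Python (the function name `pearson_r` is just for illustration; the data are the covariance example from earlier):

```python
import math

def pearson_r(x, y):
    # r = cov(x, y) / (s_x * s_y), using sample (n - 1) denominators throughout.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    return cov / (sx * sy)

# The covariance example: cov = 1.75, s_x = s_y = sqrt(5), so r = 1.75 / 5
print(pearson_r([0, 2, 3, 4, 6], [3, 2, 4, 0, 6]))  # ≈ 0.35
```

Note that r is unitless: rescaling x or y (say, centimetres to metres) leaves it unchanged, which is exactly the property the raw covariance lacks.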
Limitations of r
• When r = 1 or r = −1:
  • We can predict y from x with certainty
  • All data points lie on a straight line: y = ax + b
• r is actually an estimate:
  • r = true r of the whole population
  • r̂ = estimate of r based on the data
• r is very sensitive to extreme values:

[Scatterplot: a single extreme point can drastically change r.]
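The outlier sensitivity is easy to demonstrate numerically. A sketch with made-up data (pure Python; `pearson_r` is a hypothetical helper, not from the slides):

```python
import math

def pearson_r(x, y):
    # r = cov(x, y) / (s_x * s_y), sample (n - 1) denominators.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]
y = [2, 1, 3, 2, 3]
r_clean = pearson_r(x, y)                  # moderate correlation
r_outlier = pearson_r(x + [20], y + [20])  # one extreme point pulls r near 1
print(r_clean, r_outlier)
```

A single added point at (20, 20) dominates both deviation sums, so r jumps from roughly 0.57 to above 0.98, even though the other five points are unchanged.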
Example
• Example: Dataset
• Imagine that you’re studying the relationship between newborns’ weight and
length. You have the weights and lengths of the 10 babies born last month at
your local hospital. You enter the data in a table:
• The aim of linear regression is to fit a straight line, ŷ = ax + b, to the data that gives the best prediction of y for any value of x.

[Scatterplot with fitted line: ŷ = predicted value, y_i = true value, ε = residual error.]
Least Squares Regression
Residual (ε) = y − ŷ
Sum of squares of residuals = Σ(y − ŷ)²
• First we find the value of b that gives the minimum sum of squares.
• Then we find the value of a that gives the minimum sum of squares.

[Plots: the sum of squared residuals as a function of b, and as a function of a; each is a curve with a single minimum.]
Values of a and b
• So the min sum of squares is at
the bottom of the curve, where the
gradient is zero.
The maths bit

$$y = ax + b \quad\Rightarrow\quad b = \bar{y} - a\bar{x}$$

◼ We can put our equation for "a" into this, giving:

$$b = \bar{y} - r\,\frac{s_y}{s_x}\,\bar{x}$$

where r = correlation coefficient of x and y, s_y = standard deviation of y, and s_x = standard deviation of x.
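The two formulas above translate directly into code. A sketch in pure Python (the helper name `fit_line` is just for illustration; note that r·s_y/s_x simplifies to cov(x, y)/s_x²):

```python
def fit_line(x, y):
    # Least-squares slope and intercept: a = r * s_y / s_x = cov(x, y) / s_x^2,
    # b = ybar - a * xbar (the line passes through the point of means).
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y)) / (n - 1)
    var_x = sum((p - mx) ** 2 for p in x) / (n - 1)
    a = cov / var_x
    b = my - a * mx
    return a, b

# Data lying exactly on y = 2x + 1 should recover a = 2, b = 1
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)
```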
• We can calculate the regression line for any data, but the important
question is how well does this line fit the data, or how good is it at
predicting y from x
How good is our model?
• Total variance of y:

$$s_y^2 = \frac{\sum (y_i - \bar{y})^2}{n-1} = \frac{SS_y}{df_y}$$

• The proportion of that variance explained by the regression:

$$r^2 = \frac{s_{\hat{y}}^2}{s_y^2}$$

• Recall that if the relation between y and x is exactly a straight line, then the variables are connected by the formula y = ax + b.
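The ratio s_ŷ²/s_y² can be computed directly by fitting the line, generating the predictions ŷ, and comparing variances. A sketch in pure Python (`r_squared` is a hypothetical helper; the data are the earlier covariance example, for which r = 0.35 and so r² should come out as 0.1225):

```python
def r_squared(x, y):
    # r^2 = variance of the predicted values yhat / variance of y.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y)) / (n - 1)
    var_x = sum((p - mx) ** 2 for p in x) / (n - 1)
    var_y = sum((q - my) ** 2 for q in y) / (n - 1)
    a = cov / var_x              # least-squares slope
    b = my - a * mx              # least-squares intercept
    y_hat = [a * p + b for p in x]
    mh = sum(y_hat) / n
    var_hat = sum((h - mh) ** 2 for h in y_hat) / (n - 1)
    return var_hat / var_y

print(r_squared([0, 2, 3, 4, 6], [3, 2, 4, 0, 6]))  # ≈ 0.1225 = 0.35²
```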
Simple Linear Regression - Summary