
Correlation and Regression

ST1013 – Introduction to Statistics


Correlated variables

• Two variables are correlated if changes in one variable correspond to changes in the other.
• Correlation in general refers to the statistical association between two variables.
• This statistical association allows us to make estimates for one variable based on the value of the other.
• While we usually examine linear relationships between variables, non-linear relationships may exist too.
Linear correlation

• Two variables x and y are said to be linearly correlated if their relationship can be described by the following equation:

  y = ax + b

• This relationship will not be exact: there will be an error associated with it, and hence, in reality:

  y = ax + b + ε

• ε is the error term for the relationship. The size of the error determines the strength of the relationship:

[Scatterplots illustrating:]
• Perfect linear relationship – no error
• Strong linear relationship – small error
• Weak linear relationship – large error
• No linear relationship
Variance vs Covariance

• Do two variables change together?

Variance:
• Gives information on the variability of a single variable.

  s² = Σᵢ(xᵢ – x̄)² / (n – 1)

Covariance:
• Gives information on the degree to which two variables vary together.
• Note how similar the covariance is to variance: the equation simply multiplies x's error scores by y's error scores, as opposed to squaring x's error scores.

  cov(x, y) = Σᵢ(xᵢ – x̄)(yᵢ – ȳ) / (n – 1)
Covariance

  cov(x, y) = Σᵢ(xᵢ – x̄)(yᵢ – ȳ) / (n – 1)

◼ When X and Y increase together: cov(x, y) is positive.
◼ When X increases as Y decreases: cov(x, y) is negative.
◼ When there is no consistent relationship: cov(x, y) ≈ 0.
Example Covariance

  x       y       xᵢ – x̄   yᵢ – ȳ   (xᵢ – x̄)(yᵢ – ȳ)
  0       3       -3       0        0
  2       2       -1       -1       1
  3       4       0        1        0
  4       0       1        -3       -3
  6       6       3        3        9
  x̄ = 3   ȳ = 3                     Σ = 7

  cov(x, y) = Σᵢ(xᵢ – x̄)(yᵢ – ȳ) / (n – 1) = 7 / 4 = 1.75

What does this number tell us?
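As a quick check of this arithmetic, here is a minimal Python sketch (numpy assumed; not part of the original slides):

```python
import numpy as np

# The five (x, y) pairs from the table above
x = np.array([0, 2, 3, 4, 6])
y = np.array([3, 2, 4, 0, 6])

# Sample covariance: sum of cross-products of deviations, divided by n - 1
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)
print(cov_xy)              # 1.75

# np.cov returns a 2x2 covariance matrix; the off-diagonal entry matches
print(np.cov(x, y)[0, 1])  # 1.75
```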
Problem with Covariance:

• The value obtained by covariance depends on the size of the data's standard deviations: if they are large, the value will be greater than if they are small, even if the relationship between x and y is exactly the same in the large versus small standard deviation datasets.
Example of how covariance value
relies on variance
High variance data                        Low variance data

Subject   x     y     x error × y error   x     y     x error × y error
1         101   100   2500                54    53    9
2         81    80    900                 53    52    4
3         61    60    100                 52    51    1
4         51    50    0                   51    50    0
5         41    40    100                 50    49    1
6         21    20    900                 49    48    4
7         1     0     2500                48    47    9
Mean      51    50                        51    50

Sum of x error × y error: 7000            Sum of x error × y error: 28
Covariance: 1166.67                       Covariance: 4.67
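The following sketch (Python/numpy, our own illustration) reproduces both covariances from the table and shows that the fix introduced on the next slide, the correlation coefficient, is identical for both datasets:

```python
import numpy as np

# The two datasets from the table: the same perfect linear pattern (y = x - 1),
# just with very different spreads
x_hi = np.array([101, 81, 61, 51, 41, 21, 1])
x_lo = np.array([54, 53, 52, 51, 50, 49, 48])
y_hi, y_lo = x_hi - 1, x_lo - 1

print(np.cov(x_hi, y_hi)[0, 1])  # 1166.67 - inflated by the large spread
print(np.cov(x_lo, y_lo)[0, 1])  # 4.67    - same relationship, small spread

# The correlation, by contrast, is 1.0 in both cases
print(np.corrcoef(x_hi, y_hi)[0, 1], np.corrcoef(x_lo, y_lo)[0, 1])
```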


Solution: Pearson’s r

• On its own, the covariance value does not really tell us anything about the strength of the relationship.

• Solution: standardise this measure.

• Pearson's r standardises the covariance value: it divides the covariance by the product of the standard deviations of x and y:

  rxy = cov(x, y) / (sx · sy)
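A minimal sketch of this standardisation, reusing the toy data from the covariance example (numpy assumed):

```python
import numpy as np

x = np.array([0, 2, 3, 4, 6])
y = np.array([3, 2, 4, 0, 6])

# r standardises the covariance by the two sample standard deviations
r = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
print(r)                        # 0.35 for this toy dataset
print(np.corrcoef(x, y)[0, 1])  # same value from numpy directly
```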
Pearson correlation coefficient

• In order to test the linear association between two variables x and y we can use the Pearson correlation coefficient rxy.

• Computational formula:

  rxy = [n·Σxy – (Σx)(Σy)] / √( [n·Σx² – (Σx)²] · [n·Σy² – (Σy)²] )

• The correlation coefficient takes values between -1 and 1:
  • 1: perfect/strong positive linear correlation
  • -1: perfect/strong negative linear correlation
  • 0: no linear correlation
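As a sketch, the computational formula above implemented directly (the helper name pearson_r is ours; numpy assumed):

```python
import numpy as np

def pearson_r(x, y):
    """Computational formula: built from raw sums, no means required."""
    n = len(x)
    num = n * np.sum(x * y) - np.sum(x) * np.sum(y)
    den = np.sqrt((n * np.sum(x**2) - np.sum(x)**2)
                  * (n * np.sum(y**2) - np.sum(y)**2))
    return num / den

x = np.array([0, 2, 3, 4, 6])
y = np.array([3, 2, 4, 0, 6])
print(pearson_r(x, y))  # 0.35 - agrees with cov(x, y) / (sx * sy)
```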
Pearson correlation coefficient

• It is crucial to understand that this correlation coefficient can only examine linear associations between two variables.

[Scatterplots illustrating r = 1, 0.8, 0.4, 0, -0.4, -0.8, -1]

What is the correlation here?


The relationship between x and y

• Correlation: is there a relationship between two variables?

• Regression: how well does an independent variable predict the dependent variable?

• CORRELATION ≠ CAUSATION
  • In order to infer causality: manipulate the independent variable and observe the effect on the dependent variable.
When to use the Pearson correlation coefficient

The Pearson correlation coefficient (rxy) is one of several correlation coefficients that you need to choose between when you want to measure a correlation. The Pearson correlation coefficient is a good choice when all of the following are true:

• Both variables are quantitative: You will need to use a different method if either of the variables is qualitative.
• The variables are normally distributed: You can create a histogram of each variable to verify whether the distributions are approximately normal. It's not a problem if the variables are a little non-normal.
• The data have no outliers: Outliers are observations that don't follow the same patterns as the rest of the data. A scatterplot is one way to check for outliers: look for points that are far away from the others.
• The relationship is linear: "Linear" means that the relationship between the two variables can be described reasonably well by a straight line. You can use a scatterplot to check whether the relationship between two variables is linear. (A sketch of these visual checks follows below.)
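A rough sketch of those visual checks (matplotlib and numpy assumed; the data here are made up purely for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(50, 5, size=100)           # made-up quantitative variable
y = 0.8 * x + rng.normal(0, 3, size=100)  # roughly linear relationship

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(x, bins=15)       # check: is x approximately normal?
axes[1].hist(y, bins=15)       # check: is y approximately normal?
axes[2].scatter(x, y, s=10)    # check: linear? any outliers far from the rest?
plt.show()
```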
Limitations of r

• When r = 1 or r = -1:
  • We can predict y from x with certainty
  • All data points are on a straight line: y = ax + b
• r is actually r̂:
  • r = true r of the whole population
  • r̂ = estimate of r based on the data
• r is very sensitive to extreme values:

[Scatterplot: a single extreme point can dominate the value of r]
Example

• Example: Dataset
• Imagine that you're studying the relationship between newborns' weight and length. You have the weights and lengths of the 10 babies born last month at your local hospital. You enter the data in a table:

Weight (kg)   Length (cm)
3.63          53.1
3.02          49.7
3.82          48.4
3.42          54.2
3.59          54.9
2.87          43.7
3.03          47.2
3.46          45.2
3.36          54.4
3.30          50.4
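To compute the correlation for this dataset, a one-line sketch with numpy (not part of the original slides) would be:

```python
import numpy as np

weight = np.array([3.63, 3.02, 3.82, 3.42, 3.59, 2.87, 3.03, 3.46, 3.36, 3.30])
length = np.array([53.1, 49.7, 48.4, 54.2, 54.9, 43.7, 47.2, 45.2, 54.4, 50.4])

print(np.corrcoef(weight, length)[0, 1])  # Pearson's r for weight vs length
```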
Regression

• Correlation tells you if there is an association between x and y, but it doesn't describe the relationship or allow you to predict one variable from the other.

• To do this we need REGRESSION!
Best-fit Line

• The aim of linear regression is to fit a straight line, ŷ = ax + b, to the data that gives the best prediction of y for any value of x.

• This will be the line that minimises the distance between the data and the fitted line, i.e. the residuals.

  ŷ = ax + b     (a = slope, b = intercept)

[Scatterplot with fitted line: ŷ = predicted value, yᵢ = true value, ε = residual error]
Least Squares Regression

• To find the best line we must minimise the sum of the squares of the residuals (the vertical distances from the data points to our line).

  Model line: ŷ = ax + b     (a = slope, b = intercept)

  Residual (ε) = y – ŷ
  Sum of squares of residuals = Σ(y – ŷ)²

◼ We must find the values of a and b that minimise Σ(y – ŷ)².
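A minimal sketch of such a fit (Python/numpy; least_squares_fit is our own helper name, and the closed form cov(x, y)/var(x) used here is one standard way to obtain the minimising slope):

```python
import numpy as np

def least_squares_fit(x, y):
    """Return slope a and intercept b minimising sum((y - (a*x + b))**2)."""
    a = np.cov(x, y)[0, 1] / np.var(x, ddof=1)  # closed-form slope
    b = y.mean() - a * x.mean()                  # line passes through the means
    return a, b

x = np.array([0.0, 2.0, 3.0, 4.0, 6.0])
y = np.array([3.0, 2.0, 4.0, 0.0, 6.0])
a, b = least_squares_fit(x, y)
print(a, b)                 # 0.35 1.95
print(np.polyfit(x, y, 1))  # [0.35 1.95] - numpy's least squares agrees
```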
Finding b

• First we find the value of b that gives the minimum sum of squares.

[Three fitted lines with the same slope but different intercepts b, each with its residuals ε]

◼ Trying different values of b is equivalent to shifting the line up and down the scatter plot.
Finding a

• Now we find the value of a that gives the minimum sum of squares.

[Three fitted lines with the same intercept b but different slopes a]

◼ Trying out different values of a is equivalent to changing the slope of the line, while b stays constant.
Minimising sums of squares

• Need to minimise Σ(y – ŷ)²
• ŷ = ax + b
• So we need to minimise the sum of squares (S): Σ(y – ax – b)²

• If we plot the sums of squares for all different values of a and b we get a parabola, because it is a squared term.

[Plot: S against values of a and b, a parabola with gradient = 0 at min S]

• So the minimum sum of squares is at the bottom of the curve, where the gradient is zero.
The maths bit

• The minimum sum of squares is at the bottom of the curve, where the gradient = 0.

• So we can find the a and b that give the minimum sum of squares by taking partial derivatives of Σ(y – ax – b)² with respect to a and b separately.

• Then we set these derivatives to 0 and solve, to give us the values of a and b that give the minimum sum of squares.
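This derivation can be checked symbolically. A sketch using sympy (our choice of tool, not from the slides), on the toy data used earlier:

```python
import sympy as sp

a, b = sp.symbols('a b')
xs = [0, 2, 3, 4, 6]
ys = [3, 2, 4, 0, 6]

# Sum of squared residuals as a symbolic function of a and b
S = sum((y - a * x - b) ** 2 for x, y in zip(xs, ys))

# Set both partial derivatives to zero and solve the normal equations
solution = sp.solve([sp.diff(S, a), sp.diff(S, b)], [a, b])
print(solution)  # {a: 7/20, b: 39/20}, i.e. a = 0.35, b = 1.95
```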
The solution

• Doing this gives the following equation for a:

  a = r · sy / sx

  where r = correlation coefficient of x and y, sy = standard deviation of y, sx = standard deviation of x.

◼ From this you can see that:
  ▪ A low correlation coefficient gives a flatter slope (small value of a).
  ▪ A large spread of y, i.e. a high standard deviation, results in a steeper slope (high value of a).
  ▪ A large spread of x, i.e. a high standard deviation, results in a flatter slope (low value of a).
The solution cont.

• Our model equation is ŷ = ax + b.

• This line must pass through the mean point (x̄, ȳ), so:

  ȳ = a·x̄ + b   →   b = ȳ – a·x̄

◼ We can put our equation for a into this, giving:

  b = ȳ – (r · sy / sx) · x̄

  where r = correlation coefficient of x and y, sy = standard deviation of y, sx = standard deviation of x.

◼ The smaller the correlation, the closer the intercept is to the mean of y.
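A sketch verifying that these two formulas reproduce the least-squares line found earlier (numpy assumed):

```python
import numpy as np

x = np.array([0.0, 2.0, 3.0, 4.0, 6.0])
y = np.array([3.0, 2.0, 4.0, 0.0, 6.0])

r = np.corrcoef(x, y)[0, 1]
a = r * np.std(y, ddof=1) / np.std(x, ddof=1)  # a = r * sy / sx
b = y.mean() - a * x.mean()                     # b = ȳ - a·x̄
print(a, b)  # 0.35 1.95 - the same line as the direct least-squares fit
```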
Back to the model
  ŷ = ax + b = (r·sy/sx)·x + ȳ – (r·sy/sx)·x̄

  Rearranges to:  ŷ = (r·sy/sx)·(x – x̄) + ȳ

• If the correlation is zero, we will simply predict the mean of y for every value of x, and our regression line is just a flat straight line.

• But this isn't very useful.

• We can calculate the regression line for any data, but the important question is how well this line fits the data, i.e. how good it is at predicting y from x.
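A small sketch of this prediction rule (the helper name predict is ours; numpy assumed):

```python
import numpy as np

def predict(x_new, x, y):
    """ŷ = (r*sy/sx)*(x_new - x̄) + ȳ; reduces to ȳ when r = 0."""
    r = np.corrcoef(x, y)[0, 1]
    a = r * np.std(y, ddof=1) / np.std(x, ddof=1)
    return a * (x_new - x.mean()) + y.mean()

x = np.array([0.0, 2.0, 3.0, 4.0, 6.0])
y = np.array([3.0, 2.0, 4.0, 0.0, 6.0])
print(predict(3.0, x, y))  # at x = x̄ we always predict ȳ = 3.0
```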
How good is our model?

• Total variance of y:

  sy² = Σ(y – ȳ)² / (n – 1) = SSy / dfy

◼ Variance of the predicted y values (ŷ):

  sŷ² = Σ(ŷ – ȳ)² / (n – 1) = SSpred / dfŷ

  This is the variance explained by our regression model.

◼ Error variance:

  serror² = Σ(y – ŷ)² / (n – 2) = SSer / dfer

  This is the variance of the error between our predicted y values and the actual y values, and thus is the variance in y that is NOT explained by the regression model.
How good is our model cont.
• Total variance = predicted variance + error variance:

  sy² = sŷ² + ser²

• Conveniently, via some complicated rearranging:

  sŷ² = r²·sy²

  r² = sŷ² / sy²

• So r² is the proportion of the variance in y that is explained by our regression model.
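A sketch confirming this decomposition numerically on the toy data used throughout (numpy assumed):

```python
import numpy as np

x = np.array([0.0, 2.0, 3.0, 4.0, 6.0])
y = np.array([3.0, 2.0, 4.0, 0.0, 6.0])

a, b = np.polyfit(x, y, 1)
y_hat = a * x + b

ss_total = np.sum((y - y.mean()) ** 2)     # SSy
ss_pred = np.sum((y_hat - y.mean()) ** 2)  # SSpred: explained by the model
ss_error = np.sum((y - y_hat) ** 2)        # SSer: left unexplained

r = np.corrcoef(x, y)[0, 1]
print(ss_pred / ss_total, r ** 2)                # both 0.1225: r² is the share explained
print(np.isclose(ss_total, ss_pred + ss_error))  # True: the decomposition holds
```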
How good is our model cont.

• Insert r²·sy² into sy² = sŷ² + ser² and rearrange to get:

  ser² = sy² – r²·sy² = sy²(1 – r²)

• From this we can see that the greater the correlation, the smaller the error variance, so the better our prediction.
Simple Linear Regression - Summary

• A regression problem involving a single predictor (also called simple regression) arises when we wish to study the relation between two variables x and y and use it to predict y.
Simple Linear Regression - Summary

• Correlation varies between -1 and +1.
• Values closer to +1 or -1 indicate a strong linear correlation, while 0 indicates no linear correlation.
• A high sample correlation coefficient does not necessarily signify a causal relation between two variables.
Simple Linear Regression - Summary

• Recall that if the relation between y and x is exactly a straight line, then the variables are connected by the formula y = ax + b.
Simple Linear Regression - Summary

• The Method of Least Squares


Simple Linear Regression - Summary

• Strength of the Linear Regression

