
4.2 Correlation & Regression

The document provides an overview of correlation and regression in the context of bivariate data, explaining concepts such as scatter diagrams, correlation types, and the distinction between correlation and causation. It details how to calculate and interpret Pearson's product-moment correlation coefficient (PMCC) and how to derive regression lines for predicting values. Additionally, it includes worked examples to illustrate these concepts in practice.

Uploaded by

Allxn

Head to www.savemyexams.com for more awesome resources

DP IB Maths: AA HL

4.2 Correlation & Regression


Contents
4.2.1 Bivariate Data
4.2.2 Correlation & Regression

© 2015−2024 Save My Exams, Ltd. · Revision Notes, Topic Questions, Past Papers

4.2.1 Bivariate Data


Scatter Diagrams
What does bivariate data mean?
Bivariate data is data collected on two variables, usually to look at how one of the variables affects the other
Each data value from one variable will be paired with a data value from the other variable
The two variables are often related, but do not have to be
What is a scatter diagram?
A scatter diagram is a way of graphing bivariate data
One variable will be on the x-axis and the other will be on the y-axis
The variable that can be controlled in the data collection is known as the independent or
explanatory variable and is plotted on the x-axis
The variable that is measured or discovered in the data collection is known as the dependent or
response variable and is plotted on the y-axis
Scatter diagrams can contain outliers that do not follow the trend of the data

Examiner Tip
If you use scatter diagrams in your Internal Assessment, be aware that finding outliers for bivariate data is different from finding outliers for univariate data
(x, y) could be an outlier for the bivariate data even if x and y are not outliers for their separate univariate data


Correlation
What is correlation?
Correlation is how the two variables change in relation to each other
Correlation could be the result of a causal relationship but this is not always the case
Linear correlation is when the changes are proportional to each other
Perfect linear correlation means that the bivariate data will all lie on a straight line on a scatter diagram
When describing correlation mention
The type of the correlation
Positive correlation is when one variable tends to increase as the other variable increases
Negative correlation is when one variable tends to decrease as the other variable increases
No linear correlation is when the data points don’t appear to follow a trend
The strength of the correlation
Strong linear correlation is when the data points lie close to a straight line
Weak linear correlation is when the data points are not close to a straight line
If there is strong linear correlation you can draw a line of best fit (by eye)
The line of best fit will pass through the mean point (x̄, ȳ)

If you are asked to draw a line of best fit:
Plot the mean point
Draw a line going through it that follows the trend of the data


What is the difference between correlation and causation?

It is important to be aware that just because correlation exists, it does not mean that the change in one of the variables is causing the change in the other variable
Correlation does not imply causation!
If a change in one variable causes a change in the other then the two variables are said to have a causal
relationship
Observing correlation between two variables does not always mean that there is a causal
relationship
There could be underlying factors which are causing the correlation
Look at the two variables in question and consider the context of the question to decide if there
could be a causal relationship
If the two variables are temperature and number of ice creams sold at a park then it is likely to
be a causal relationship
Correlation may exist between global temperatures and the number of monkeys kept as pets
in the UK but they are unlikely to have a causal relationship


Worked example
A teacher is interested in the relationship between the number of hours her students spend on a phone
per day and the number of hours they spend on a computer. She takes a sample of nine students and
records the results in the table below.

Hours spent on a phone per day      7.6  7.0  8.9  3.0  3.0  7.5  2.1  1.3  5.8
Hours spent on a computer per day   1.7  1.1  0.7  5.8  5.2  1.7  6.9  7.1  3.3

a) Draw a scatter diagram for the data.

b) Describe the correlation.

c) Draw a line of best fit.
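
The scatter diagram and line of best fit still need to be drawn by hand or on a GDC, but the mean point and the direction of the trend can be checked numerically. The following is a minimal Python sketch, not part of the original notes, using the data from the table above. The variable names are illustrative.

```python
# Data from the worked example table (hours per day for nine students)
phone = [7.6, 7.0, 8.9, 3.0, 3.0, 7.5, 2.1, 1.3, 5.8]
computer = [1.7, 1.1, 0.7, 5.8, 5.2, 1.7, 6.9, 7.1, 3.3]

n = len(phone)
mean_phone = sum(phone) / n        # x-coordinate of the mean point
mean_computer = sum(computer) / n  # y-coordinate of the mean point

# Sum of products of deviations from the means: its sign gives the
# direction of the linear trend (negative means negative correlation)
s_xy = sum((x - mean_phone) * (y - mean_computer)
           for x, y in zip(phone, computer))

print(f"Mean point: ({mean_phone:.2f}, {mean_computer:.2f})")
print("Trend:", "positive" if s_xy > 0 else "negative")
```

This gives a mean point of roughly (5.13, 3.72) and a negative trend, so a line of best fit should pass through that point and slope downwards.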


4.2.2 Correlation & Regression


Linear Regression
What is linear regression?
If strong linear correlation exists on a scatter diagram then the data can be modelled by a linear model
Drawing lines of best fit by eye is not the best method as it can be difficult to judge the best position for the line
The least squares regression line is the line of best fit that minimises the sum of the squares of the gaps between the line and each data value
It can be calculated by either looking at:
vertical distances between the line and the data values
This is the regression line of y on x
horizontal distances between the line and the data values
This is the regression line of x on y
How do I find the regression line of y on x?
The regression line of y on x is written in the form y = ax + b
a is the gradient of the line
It represents the change in y for each individual unit change in x
If a is positive this means y increases by a for a unit increase in x
If a is negative this means y decreases by |a| for a unit increase in x
b is the y-intercept
It shows the value of y when x is zero
You are expected to use your GDC to find the equation of the regression line
Enter the bivariate data and choose the model “ax + b”
Remember the mean point (x̄, ȳ) will lie on the regression line
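
Although a GDC is expected in the exam, it can help to see what the "ax + b" model is doing. The sketch below is an illustration only: it uses the standard least squares results (gradient equal to the sum of products of deviations divided by the sum of squared x deviations, and a line forced through the mean point), which are not stated in these notes. The data and function names are made up.

```python
def regression_y_on_x(xs, ys):
    """Return (a, b) for the least squares line y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    a = s_xy / s_xx           # gradient
    b = mean_y - a * mean_x   # makes the line pass through the mean point
    return a, b


def sum_squared_vertical_gaps(xs, ys, a, b):
    """Sum of squared vertical distances between the data and y = a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, made up for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = regression_y_on_x(xs, ys)
best = sum_squared_vertical_gaps(xs, ys, a, b)
# A slightly steeper line through the same intercept gives a larger sum
worse = sum_squared_vertical_gaps(xs, ys, a + 0.1, b)
print(a, b, best < worse)
```

Changing the gradient or intercept away from the least squares values always increases the sum of squared vertical gaps, which is what "least squares" means for the y on x line.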

How do I find the regression line of x on y?


The regression line of x on y is written in the form x = cy + d
c is the gradient of the line
It represents the change in x for each individual unit change in y
If c is positive this means x increases by c for a unit increase in y
If c is negative this means x decreases by |c| for a unit increase in y
d is the x-intercept
It shows the value of x when y is zero
You are expected to use your GDC to find the equation of the regression line
It is found the same way as the regression line of y on x but with the two data sets switched around
Remember the mean point (x̄, ȳ) will lie on the regression line
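
Switching the two data sets can be illustrated with a short Python sketch. This is a hypothetical example rather than the GDC procedure: the helper name and the data are made up, and the helper uses the standard least squares formulas. It also shows that the x on y line is generally not just the y on x line rearranged.

```python
def least_squares_line(explanatory, response):
    """Return (gradient, intercept) of the least squares line
    response = gradient * explanatory + intercept."""
    n = len(explanatory)
    mean_e = sum(explanatory) / n
    mean_r = sum(response) / n
    s_er = sum((e - mean_e) * (r - mean_r)
               for e, r in zip(explanatory, response))
    s_ee = sum((e - mean_e) ** 2 for e in explanatory)
    gradient = s_er / s_ee
    return gradient, mean_r - gradient * mean_e


# Hypothetical data, made up for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares_line(xs, ys)   # y on x: y = a*x + b
c, d = least_squares_line(ys, xs)   # x on y: x = c*y + d (data sets switched)

# Rearranging the y on x line for x would give a gradient of 1/a, which
# differs from c unless the data are perfectly linearly correlated
print(c, d, 1 / a)
```

For this data the x on y gradient c is 1, whereas rearranging the y on x line would give a gradient of about 1.67, so the two lines are different.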

How do I use a regression line?


The regression line can be used to decide what type of correlation there is if there is no scatter diagram
If the gradient is positive then the data set has positive correlation
If the gradient is negative then the data set has negative correlation
The regression line can also be used to predict the value of a dependent variable from an independent
variable
The equation for the y on x line should only be used to make predictions for y
Using a y on x line to predict x is not always reliable
The equation for the x on y line should only be used to make predictions for x
Using an x on y line to predict y is not always reliable
Making a prediction within the range of the given data is called interpolation
This is usually reliable
The stronger the correlation the more reliable the prediction
Making a prediction outside of the range of the given data is called extrapolation
This is much less reliable
The prediction will be more reliable if the number of data values in the original sample set is bigger
The y on x and x on y regression lines intersect at the mean point (x̄, ȳ)

Examiner Tip
Once you calculate the values of a and b, store them in your GDC
This means you can use the full display values rather than the rounded values when using the
linear regression equation to predict values
This avoids rounding errors
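
The distinction between interpolation and extrapolation can be sketched in a few lines of Python. This is an illustration under assumed values: the function name, the line y = 0.6x + 2.2 and the x data are all hypothetical.

```python
def predict_y(x_value, a, b, xs):
    """Predict y from a y on x line y = a*x + b and report whether the
    prediction is an interpolation or an extrapolation."""
    inside = min(xs) <= x_value <= max(xs)
    kind = "interpolation" if inside else "extrapolation"
    return a * x_value + b, kind


# Hypothetical x data and regression coefficients
xs = [1, 2, 3, 4, 5]
a, b = 0.6, 2.2

print(predict_y(3.5, a, b, xs))   # 3.5 lies inside the range 1 to 5
print(predict_y(12, a, b, xs))    # 12 lies outside the range 1 to 5
```

Keeping a and b at full precision, as the tip suggests, matters most when the prediction is then rounded for a final answer.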


Worked example
The table below shows the scores of eight students for a maths test and an English test.

Maths (x)     7  18  37  52  61  68  75  82
English (y)   5   3   9  12  17  41  49  97

a) Write down the value of Pearson’s product-moment correlation coefficient, r.

b) Write down the equation of the regression line of y on x, giving your answer in the form y = ax + b, where a and b are constants to be found.

c) Write down the equation of the regression line of x on y, giving your answer in the form x = cy + d, where c and d are constants to be found.

d) Use the appropriate regression line to predict the score on the maths test of a student who got
a score of 63 on the English test.
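
In the exam these values come straight from a GDC. As a cross-check, the following Python sketch works through parts a) to d) with the standard least squares formulas, which are an assumption here since the notes rely on the GDC. The variable names are illustrative.

```python
from math import sqrt

# Data from the table above
maths = [7, 18, 37, 52, 61, 68, 75, 82]    # x
english = [5, 3, 9, 12, 17, 41, 49, 97]    # y

n = len(maths)
mean_x = sum(maths) / n
mean_y = sum(english) / n
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(maths, english))
s_xx = sum((x - mean_x) ** 2 for x in maths)
s_yy = sum((y - mean_y) ** 2 for y in english)

# a) Pearson's product-moment correlation coefficient
r = s_xy / sqrt(s_xx * s_yy)

# b) Regression line of y on x: y = a*x + b
a = s_xy / s_xx
b = mean_y - a * mean_x

# c) Regression line of x on y: x = c*y + d
c = s_xy / s_yy
d = mean_x - c * mean_y

# d) Predicting a maths score (x) from an English score (y) uses x on y
predicted_maths = c * 63 + d

print(f"r = {r:.3g}")
print(f"a = {a:.3g}, b = {b:.3g}")
print(f"c = {c:.3g}, d = {d:.3g}")
print(f"predicted maths score = {predicted_maths:.3g}")
```

To 3 significant figures this gives r = 0.794, y = 0.944x − 18.1 and x = 0.669y + 30.5, with a predicted maths score of about 72.7. An English score of 63 lies within the range of the data (3 to 97), so this prediction is an interpolation.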


PMCC
What is Pearson’s product-moment correlation coefficient?
Pearson’s product-moment correlation coefficient (PMCC) is a way of giving a numerical value to a linear relationship of bivariate data
The PMCC of a sample is denoted by the letter r
r can take any value such that −1 ≤ r ≤ 1
A positive value of r describes positive correlation
A negative value of r describes negative correlation
r = 0 means there is no linear correlation
r = 1 means perfect positive linear correlation
r = −1 means perfect negative linear correlation
The closer r is to 1 or −1, the stronger the linear correlation

How do I calculate Pearson’s product-moment correlation coefficient (PMCC)?


You will be expected to use the statistics mode on your GDC to calculate the PMCC
The formula can be useful to deepen your understanding


$$r = \frac{S_{xy}}{S_x S_y}$$

$S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)$ is linked to the covariance

$S_x = \sqrt{\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2}$ and $S_y = \sqrt{\sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2}$ are linked to the variances
You do not need to learn this formula, as you will be expected to use your GDC
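
To see how the formula behaves, here is a minimal Python sketch that evaluates it directly from the sums. It takes S_x and S_y to be the square roots of the corrected sums of squares, which is what keeps r between −1 and 1. The function name and the data are hypothetical.

```python
from math import sqrt


def pmcc(xs, ys):
    """Pearson's product-moment correlation coefficient from raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_x = sqrt(sum(x * x for x in xs) - sum_x ** 2 / n)
    s_y = sqrt(sum(y * y for y in ys) - sum_y ** 2 / n)
    return s_xy / (s_x * s_y)


# Hypothetical data sets, made up for illustration only
r_up = pmcc([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])     # points on y = 2x
r_down = pmcc([1, 2, 3, 4, 5], [10, 8, 6, 4, 2])   # points on y = 12 - 2x
r_mixed = pmcc([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])   # scattered points

print(r_up, r_down, r_mixed)
```

Points lying exactly on a rising line give r equal to 1 (up to floating-point rounding), points on a falling line give −1, and scattered data gives a value in between.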
When does the PMCC suggest there is a linear relationship?
Critical values of r indicate when the PMCC would suggest there is a linear relationship
In your exam you will be given critical values where appropriate
Critical values will depend on the size of the sample
If the absolute value of the PMCC is bigger than the critical value then this suggests a linear model is
appropriate
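
The comparison with a critical value can be written as a one-line check. This is only a sketch: the function name is made up and the critical value of 0.7 is hypothetical, since in an exam the critical value for the sample size would be given.

```python
def suggests_linear_model(r, critical_value):
    """Return True if the absolute value of r exceeds the critical value."""
    return abs(r) > critical_value


# Hypothetical critical value; a real one depends on the sample size
critical_value = 0.7

print(suggests_linear_model(-0.85, critical_value))  # strong negative r
print(suggests_linear_model(0.45, critical_value))   # weak positive r
```

Using the absolute value means a strongly negative r is treated in the same way as a strongly positive one.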
