Chapter 23 Correlation and Linear Regression Tutorial Solutions With Comments
H2 Mathematics (9758)
Chapter 23 Correlation and Linear Regression
Tutorial Questions
(i) Draw a scatter diagram of these data, and explain how you know from your diagram
that the relationship between m and d should not be modelled by an equation of the
form y = ax + b . [2]
(iii) Use the formula you chose from part (ii) to estimate the mass of a giant pumpkin
with
(a) over the top length 6 m,
(b) over the top length 12 m.
Explain which of your two estimates is more reliable. [3]
Q1 Solution
(i) [Scatter diagram of m against d, with d from 2.31 to 9.17 and m from 11 to 449.]
When plotting a scatter diagram, you must:
• Label the axes,
• Indicate min and max values on each of the axes,
• Show the relative positions of the points clearly,
• Check that all points are drawn.
From the diagram, as d increases, m increases at an increasing rate. Therefore, a linear
equation of the form y = ax + b is not appropriate.
Chapter 23 Correlation and Linear Regression TMJC 2022
Input with
Xlist: L3 (which is $d^2$)
Ylist: L2 (which is m)
For (b), d = 12 is outside of the data range of d and thus the linear relationship might
no longer hold. Since d = 6 lies within the data range of d and r = 0.99951 is close to
1, the estimate for (a) is more reliable.
An estimate is reliable when
• r is close to 1
• Interpolation (the value that we substitute in is within the data range)
2 N2008/II/8
A certain metal discolours when exposed to air. To protect the metal against discolouring,
it is treated with a chemical. In an experiment, different quantities, x ml, of the chemical
were applied to standard samples of the metal, and the times, t hours, for the metal to
discolour were measured. The results are given in the table.
x 1.2 2.0 2.7 3.8 4.8 5.6 6.9
t 2.2 4.5 5.8 7.3 7.6 9.0 9.9
(i) Calculate the product moment correlation coefficient between x and t, and explain
whether your answer suggests that a linear model is appropriate. [3]
(ii) Draw a scatter diagram for the data. [1]
One of the values of t appears to be incorrect.
(iii) Indicate the corresponding point on your diagram by labeling it P, and explain why
the scatter diagram for the remaining points may be consistent with a model of the
form t = a + b ln x . [2]
(iv) Omitting P, calculate least squares estimates of a and b for the model t = a + b ln x .
[2]
(v) Assume that the value of x at P is correct. Estimate the value of t for this value of
x. [1]
(vi) Comment on the use of the model in part (iv) in predicting the value of t when
x = 8.0 . [1]
2 Solution
(i) Using the GC, the product moment correlation coefficient is r ≈ 0.970 (3 s.f.). Since
the value of r is close to 1, which suggests a strong positive linear correlation, a linear
model seems appropriate.
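The GC value in part (i) can be cross-checked with a short Python sketch. The helper name pearson_r is our own illustrative choice and not part of the tutorial; it simply applies the usual definition of the product moment correlation coefficient to the table in the question.

```python
import math

def pearson_r(xs, ys):
    """Product moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Data from the table in Question 2
x = [1.2, 2.0, 2.7, 3.8, 4.8, 5.6, 6.9]
t = [2.2, 4.5, 5.8, 7.3, 7.6, 9.0, 9.9]

r = pearson_r(x, t)
print(f"r = {r:.3f}")
```

A value of r close to 1 only measures linear association over all seven points; it does not reveal that one point looks out of line, which is why part (ii) still asks for a scatter diagram.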
(vi) Since x = 8.0 is outside the data range of x, the linear relation between t and ln x may no
longer hold. Thus the predicted value of t is unreliable.
3 N2009/II/6
The table gives the world record time, in seconds above 3 minutes 30 seconds, for running
1 mile as at 1st January in various years.
Year, x 1930 1940 1950 1960 1970 1980 1990 2000
Time, t 40.4 36.4 31.3 24.5 21.1 19.0 16.3 13.1
(i) Draw a scatter diagram to illustrate the data. [2]
(ii) Comment on whether a linear model would be appropriate, referring both to the
scatter diagram and the context of the question. [2]
(iii) Explain why in this context a quadratic model would probably not be appropriate
for long-term predictions. [1]
(iv) Fit a model of the form ln t = a + bx to the data, and use it to predict the world
record time as at 1st January 2010. Comment on the reliability of your prediction.
[3]
3 Solution
(i) [Scatter diagram of t against x, with x from 1930 to 2000 and t from 13.1 to 40.4.]
When plotting a scatter diagram, you must:
• Label the axes,
• Indicate min and max values on each of the axes,
• Show the relative positions of the points clearly,
• Check that all points are drawn.
(ii) Based on the scatter diagram for the data from 1930 to 2000, a linear model appears
appropriate, since most of the data points lie close to a straight line.
However, in the context of the question, a linear model may not be appropriate in the
long run, since human capacity will eventually reach a plateau.
(iii) A quadratic model is not suitable for long-term predictions: a quadratic model with
a minimum turning point implies that, beyond some point in time, the value of t
(record time) increases as x (year) increases. But t (record time) can only decrease or
remain the same as the years go by.
Since x = 2010 falls outside the data range of x, the linear relation between ln t and x may
no longer hold. Thus the prediction is not reliable.
An estimate is reliable when
• r is close to 1
• Interpolation (the value that we substitute in is within the data range)
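The fit asked for in part (iv) can be sketched in Python as follows. This is only an illustration of the method (least squares regression of ln t on x, then exponentiating the prediction); the helper name least_squares is our own, and the GC output should be taken as the reference answer.

```python
import math

def least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Data from the table in Question 3 (t is seconds above 3 min 30 s)
years = [1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000]
times = [40.4, 36.4, 31.3, 24.5, 21.1, 19.0, 16.3, 13.1]

# Fit ln t = a + b x
a, b = least_squares(years, [math.log(t) for t in times])

# Predict for 2010 and undo the logarithm
predicted_2010 = math.exp(a + b * 2010)
print(f"ln t = {a:.4f} + ({b:.6f}) x")
print(f"predicted t in 2010: {predicted_2010:.1f} s above 3 min 30 s")
```

Because 2010 lies outside 1930 to 2000, the printed value is an extrapolation and carries the reliability caveat stated above.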
4 N2015/II/10
In an experiment the following information was gathered about air pressure P, measured
in inches of mercury, at different heights above sea-level h, measured in feet.
h 2000 5000 10 000 15 000 20 000 25 000 30 000 35 000 40 000 45 000
P 27.8 24.9 20.6 16.9 13.8 11.1 8.89 7.04 5.52 4.28
(i) Draw a scatter diagram for these values, labelling the axes. [1]
(ii) Find, correct to 4 decimal places, the product moment correlation coefficient
between
(a) h and P,
(b) ln h and P,
(c) $\sqrt{h}$ and P. [3]
(iii) Using the most appropriate case from part (ii), find the equation which best models
air pressure at different heights. [3]
(iv) Given that 1 metre = 3.28 feet, re-write your equation from part (iii) so that it can
be used to estimate the air pressure when the height is given in metres. [2]
4 Solution
(i) [Scatter diagram of P against h, with h from 2000 to 45 000 and P from 4.28 to 27.8.]
When plotting a scatter diagram, you must:
• Label the axes,
• Indicate min and max values on each of the axes,
• Show the relative positions of the points clearly,
• Check that all points are drawn.
(ii)(a) r = −0.980731 ≈ −0.9807 (correct to 4 d.p.)
(ii)(b) r = −0.974800 ≈ −0.9748 (correct to 4 d.p.)
(iii) Since for part (c), r = −0.9986 is closest to −1 , and as h increases, P decreases at a
decreasing rate, part (c) is the most appropriate model.
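The comparison in parts (ii) and (iii) can be reproduced with the Python sketch below. The third transformation is taken to be the square root of h; this is an assumption on our part, chosen because it reproduces the value r = −0.9986 quoted for part (c). The helper pearson_r is illustrative and not part of the tutorial.

```python
import math

def pearson_r(xs, ys):
    """Product moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Data from the table in Question 4
h = [2000, 5000, 10000, 15000, 20000, 25000, 30000, 35000, 40000, 45000]
P = [27.8, 24.9, 20.6, 16.9, 13.8, 11.1, 8.89, 7.04, 5.52, 4.28]

# r between each transformed h and P (sqrt h is our assumed case (c))
r_values = {
    "h": pearson_r(h, P),
    "ln h": pearson_r([math.log(v) for v in h], P),
    "sqrt h": pearson_r([math.sqrt(v) for v in h], P),
}

# All three are negative, so "closest to -1" means largest |r|
best = max(r_values, key=lambda name: abs(r_values[name]))

for name, r in r_values.items():
    print(f"r({name}, P) = {r:.4f}")
print("most appropriate:", best)
```

Choosing by |r| alone is reasonable here only because the scatter diagram already shows P decreasing at a decreasing rate, which all three candidate transformations are attempting to straighten.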
5 N2012/II/8
Amy is revising for a mathematics examination and takes a different practice paper each
week. Her marks, y% in week x, are as follows.
Week x 1 2 3 4 5 6
Percentage mark y 38 63 67 75 71 82
(i) Draw a scatter diagram showing these marks. [1]
(ii) Suggest a possible reason why one of the marks does not seem to follow the trend.
[1]
(iii) It is desired to predict Amy’s marks on future papers. Explain why, in this context,
neither a linear nor a quadratic model is likely to be appropriate. [2]
5 Solution
(i) [Scatter diagram of y against x, with x from 1 to 6 and y from 38 to 82.]
When plotting a scatter diagram, you must:
• Label the axes,
• Indicate min and max values on each of the axes,
• Show the relative positions of the points clearly,
• Check that all points are drawn.
(ii) The irregularity occurred in Week 1. That practice paper may have been more difficult than
the other papers.
Note: Another possible reason could be that Amy was not prepared academically
for the practice paper in Week 1.
(iii) The marks cannot exceed 100%, so a linear model, which implies an unbounded
upward (or downward) trend, is not appropriate.
The marks are likely to plateau or stay constant as the weeks go by, whereas
a quadratic model fits data that increase and then decrease (or the other way
round). Thus, a quadratic model is also not appropriate.
(iv) For L = 91, r = −0.929744 (6 decimal places)
(v) Since r = −0.929944 is closest to −1 for L = 92, this is the most appropriate value for
L.
Concept: r measures the strength of linear relationship.
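The two coefficients quoted in parts (iv) and (v) can be checked with the Python sketch below. It assumes the transformed variable is ln(L − y), as in the model ln(L − y) = a + bx used in part (vii), and only evaluates the two values of L that appear in this solution. The helper pearson_r is illustrative and not part of the tutorial.

```python
import math

def pearson_r(xs, ys):
    """Product moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Data from the table in Question 5
weeks = [1, 2, 3, 4, 5, 6]
marks = [38, 63, 67, 75, 71, 82]

# r between x and ln(L - y) for each candidate L quoted in the solution
r_by_L = {
    L: pearson_r(weeks, [math.log(L - y) for y in marks])
    for L in (91, 92)
}

for L, r in r_by_L.items():
    print(f"L = {L}: r = {r:.6f}")
```

Since ln(L − y) decreases as x increases, both coefficients are negative, so the comparison is made on how close r is to −1.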
Remember to answer the question. Note that since the number of marks
increases over the week, we should round up to 13 weeks.
(vii) L is the percentage mark she gets if she continues practising indefinitely.
$\ln(L - y) = a + bx$
$y = L - e^{a + bx}$
Since $b < 0$, $e^{a + bx} \to 0$ as $x \to \infty$, so $y \to L$.
6 N2013/II/10
(i) Sketch a scatter diagram that might be expected when x and y are related
approximately as given in each of the cases (A), (B) and (C) below. In each case
your diagram should include 6 points, approximately equally spaced with respect
to x , and with all x- and y- values positive. The letters a, b, c, d, e and f represent
constants.
(A) $y = a + bx^2$, where a is positive and b is negative,
(B) $y = c + d \ln x$, where c is positive and d is negative,
(C) $y = e + \frac{f}{x}$, where e is positive and f is negative. [3]
A motoring website gives the following information about the distance travelled, y km,
by a certain type of car at different speeds, x km h$^{-1}$, on a fixed amount of fuel.
Speed, x 88 96 104 112 120 128
Distance, y 148 147 144 138 126 107
(ii) Draw the scatter diagram for these values, labelling the axes. [1]
(iii) Explain which of the three cases in part (i) is the most appropriate for modelling
these values, and calculate the product moment correlation coefficient for this case.
[2]
(iv) It is required to estimate the distance travelled at a speed of 110 km h$^{-1}$. Use the
case that you identified in part (iii) to find the equation of a suitable regression line,
and use your equation to find the required estimate. [3]
6 Solution
(i)
Ensure that there are:
• 6 points
• Equally spaced with respect to x
• Positive x and y values
(C) $y = e + \frac{f}{x}$, where e is positive and f is negative
When unsure, use the GC to try and sketch a graph that has the required
shape. One example is $y = 5 - \frac{2}{x}$.
(ii) [Scatter diagram of y against x, with x from 88 to 128 and y from 107 to 148.]
When plotting a scatter diagram, you must:
• Label the axes,
• Indicate min and max values on each of the axes,
• Show the relative positions of the points clearly,
• Check that all points are drawn.
(iii) From the scatter diagram, as x increases, y decreases at an increasing rate, thus
model (A) $y = a + bx^2$ is the most appropriate model.
r = −0.939
Describe the trend of the data points from the scatter diagram. Since all three
models have different general shapes, choose the one that best matches the scatter
diagram; there is no need to compare their correlation coefficients.
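As a cross-check of r for model (A), and a sketch of how part (iv) could proceed, the Python below regresses y on x². The helper name fit_line is our own; the estimate it prints for 110 km/h is computed by this sketch and is not a value taken from the tutorial.

```python
import math

def fit_line(us, ys):
    """Return (intercept, slope, r) for the least squares line of ys on us."""
    n = len(us)
    mean_u, mean_y = sum(us) / n, sum(ys) / n
    suy = sum((u - mean_u) * (y - mean_y) for u, y in zip(us, ys))
    suu = sum((u - mean_u) ** 2 for u in us)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = suy / suu
    return mean_y - slope * mean_u, slope, suy / math.sqrt(suu * syy)

# Data from the table in Question 6
speed = [88, 96, 104, 112, 120, 128]
distance = [148, 147, 144, 138, 126, 107]

# Model (A): y = a + b x^2, so regress y on x^2
a, b, r = fit_line([x ** 2 for x in speed], distance)

# Estimate at x = 110, which lies inside the data range (interpolation)
estimate_110 = a + b * 110 ** 2

print(f"y = {a:.2f} + ({b:.6f}) x^2, r = {r:.3f}")
print(f"estimated distance at 110 km/h: {estimate_110:.1f} km")
```

The signs of the fitted constants (a positive, b negative) agree with the description of case (A) in part (i).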
7 N2016/II/8
A website about electric motors gives information about the percentage efficiency of
motors depending on their power, measured in horsepower. Xian has copied the following
table for a particular type of electric motor, but he has copied one of the efficiency values
wrongly.
(i) Plot a scatter diagram on graph paper for these values, labelling the axes, using a
scale of 2 cm to represent 10% efficiency on the y-axis and an appropriate scale for
the x-axis. On your diagram, circle the point that Xian has copied wrongly. [2]
For parts (ii), (iii) and (iv) of this question you should exclude the point for which Xian
has copied the efficiency value wrongly.
(ii) Explain from your scatter diagram why the relationship between x and y should not
be modelled by an equation of the form y = ax + b . [1]
(iii) Suppose that the relationship between x and y is modelled by an equation of the
form $y = \frac{c}{x} + d$, where c and d are constants. State with a reason whether each of c
and d is positive or negative. [2]
(iv) Find the product moment correlation coefficient and the constants c and d for the
model in part (iii). [3]
(v) Use the model $y = \frac{c}{x} + d$, with the values of c and d found in part (iv), to estimate
the efficiency value (y) that Xian has copied wrongly. Give two reasons why you
would expect this estimate to be reliable. [3]
7 Solution
(i) [Scatter diagram of y against x, with x from 1 to 50 and y from 72.5 to 92.4. One point is marked with the annotation: This point does not follow the curvilinear trend.]
When plotting a scatter diagram, you must:
• Label the axes,
• Indicate min and max values on each of the axes,
• Show the relative positions of the points clearly,
• Check that all points are drawn.
(ii) As x increases, y increases at a decreasing rate. Therefore y = ax + b is not a suitable
model.
(iii) Since y increases as x increases, c is negative.
Since efficiency is non-negative, d is positive.
Tip: You can try keying in positive and negative values of c and d to check.
(iv)
8 N2017/II/8
(a) Draw separate scatter diagrams, each with 8 points, all in the first quadrant, which
represent the situation where the product moment correlation coefficient between
variables x and y is
(i) −1 ,
(ii) 0 ,
(iii) between 0.5 and 0.9. [3]
(b) An investigation into the effect of a fertiliser on yields of corn found that the amount
of fertiliser applied, x, resulted in the average yields of corn, y, given below, where
x and y are measured in suitable units.
x 0 40 80 120 160 200
y 70 104 118 119 126 129
(i) Draw a scatter diagram for these values. State which of the following
equations, where a and b are positive constants, provides the most accurate
model of the relationship between x and y.
(A) $y = ax^2 + b$ (B) $y = \frac{a}{x^2} + b$
(C) $y = a \ln 2x + b$ (D) $y = a\sqrt{x} + b$ [2]
(ii) Using the model you chose in part (i), write down the equation for the
relationship between x and y, giving the numerical values of the coefficients.
State the product moment correlation coefficient for this model. [3]
(iii) Give two reasons why it would be reasonable to use your model to estimate
the value of y when x = 189. [2]
8 Solution
(a)(i)
Do not draw a regression line on your scatter diagram. The points
should look obviously collinear.
(a)(ii)
(a)(iii)
(b)(i)
Model (D) provides the most accurate model of the relationship between x and y.
(b)(ii)
$y = 4.18211387\sqrt{x} + 74.04787$
$y = 4.18\sqrt{x} + 74.0$ (3 s.f.)
r = 0.981 (3 s.f.)
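The coefficients above can be reproduced by regressing y on the square root of x, as in the Python sketch below. Taking model (D) to be y = a√x + b is an assumption on our part, made because it reproduces the quoted coefficients; the helper fit_line is illustrative and not part of the tutorial.

```python
import math

def fit_line(us, ys):
    """Return (intercept, slope, r) for the least squares line of ys on us."""
    n = len(us)
    mean_u, mean_y = sum(us) / n, sum(ys) / n
    suy = sum((u - mean_u) * (y - mean_y) for u, y in zip(us, ys))
    suu = sum((u - mean_u) ** 2 for u in us)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = suy / suu
    return mean_y - slope * mean_u, slope, suy / math.sqrt(suu * syy)

# Data from the table in Question 8(b)
fertiliser = [0, 40, 80, 120, 160, 200]
corn_yield = [70, 104, 118, 119, 126, 129]

# Assumed model (D): y = a*sqrt(x) + b, so regress y on sqrt(x)
b, a, r = fit_line([math.sqrt(x) for x in fertiliser], corn_yield)

# x = 189 lies inside the data range 0 to 200, so this is interpolation
estimate_189 = a * math.sqrt(189) + b

print(f"y = {a:.8f} sqrt(x) + {b:.5f}, r = {r:.3f}")
print(f"estimated y at x = 189: {estimate_189:.1f}")
```

Note that x = 0 appears in the data, so a model involving ln 2x could not be fitted to this table, which is one practical reason to rule out option (C).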
(b)(iii) Since x = 189 is within the data range of x and the value of r = 0.981 is close to 1,
implying a strong positive linear correlation, the linear relation between $\sqrt{x}$ and y
holds. Thus it is reasonable to use model (D) to estimate the value of y when x = 189.
An estimate is reliable when
• r is close to 1
• Interpolation (the value that we substitute in is within the data range)
9 2011 MJC/II/11
A random sample of nine pairs of values of x and y are given in the table.
10 A car is travelling along a stretch of road with speed v km/h when the brakes are applied.
The car comes to rest after travelling a further distance of s metres. The values of s for 8
different values of v are given in the table, correct to 2 decimal places.
v 25 30 35 40 45 50 55 60
s 2.83 4.63 4.84 5.29 9.73 10.30 14.82 15.21
(i) Calculate the product moment correlation coefficient between v and $\sqrt{s}$. What
does this indicate about the scatter diagram of the points $(v, \sqrt{s})$?
(ii) It is given that the product moment correlation coefficient between v and s is 0.965,
correct to 3 decimal places. State why the regression line of $\sqrt{s}$ on v is more
suitable than the regression line of s on v, and find the equation of the regression
line of $\sqrt{s}$ on v.
(iii) Consider the equation of the regression line of $\sqrt{s}$ on v. In the context of the
question,
(a) comment on the value of the constant term,
(iv) Would you be willing to use this model to predict the further distance travelled if
the speed is 70 km/h? Explain your answer with reason(s).
10 Solution
(i) r = 0.97496 = 0.975 (3 s.f.)
This r = 0.975 indicates that most of the points $(v, \sqrt{s})$ lie close to a straight line.
(ii) Since 0.975 is closer to 1 than 0.965, the regression line of $\sqrt{s}$ on v is more
suitable than the regression line of s on v.
The regression line is $\sqrt{s} = 0.0663v - 0.0177$ (3 s.f.).
(iii)(a) When v = 0, $\sqrt{s} = -0.0177 \neq 0$, which suggests that there is an error in the data. It is unrealistic
to use the model for v = 0 (or close to 0), since $\sqrt{s}$ cannot be negative.
In fact, we should not use the regression line to predict beyond the data range of v.
OR
The constant term represents the square root of the distance the car travels if its speed is 0, so
the constant term should be 0. The value of −0.0177 is close to 0, which shows that the
model is quite accurate for the range of v given. (Again, we should not use the model for
values of v outside the range, let alone for v close to 0.)
(iii)(b) For each 1 km/h increase in the speed, the square root of the distance travelled
increases by 0.0663 m$^{1/2}$.
(iv) No. 70 km/h lies outside the given data range of v, so the estimate of s is
unreliable, as the linear relation between $\sqrt{s}$ and v may no longer hold.
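The figures used throughout this solution (0.975, 0.965, 0.0663 and −0.0177) can be checked with the Python sketch below. It assumes the transformed variable is the square root of s, which is consistent with the wording of part (iii)(b); the helper fit_line is illustrative and not part of the tutorial.

```python
import math

def fit_line(us, ys):
    """Return (intercept, slope, r) for the least squares line of ys on us."""
    n = len(us)
    mean_u, mean_y = sum(us) / n, sum(ys) / n
    suy = sum((u - mean_u) * (y - mean_y) for u, y in zip(us, ys))
    suu = sum((u - mean_u) ** 2 for u in us)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = suy / suu
    return mean_y - slope * mean_u, slope, suy / math.sqrt(suu * syy)

# Data from the table in Question 10
v = [25, 30, 35, 40, 45, 50, 55, 60]
s = [2.83, 4.63, 4.84, 5.29, 9.73, 10.30, 14.82, 15.21]

# Regression of s on v (untransformed) and of sqrt(s) on v
_, _, r_plain = fit_line(v, s)
intercept, slope, r_sqrt = fit_line(v, [math.sqrt(d) for d in s])

print(f"r(v, s)       = {r_plain:.3f}")
print(f"r(v, sqrt s)  = {r_sqrt:.5f}")
print(f"sqrt(s) = {slope:.4f} v + ({intercept:.4f})")
```

The small negative intercept is what prompts the discussion in part (iii)(a): it is close to 0 but not exactly 0.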
11 The table shows the number y (in millions) of cell-phone subscribers in a country from
2001 to 2010, where t represents number of years from 2000.
t 1 2 3 4 5 6 7 8 9 10
y 1.6 2.7 4.4 6.4 8.9 13.1 19.3 28.2 38.2 48.7
The relationship between y and t is given by the formula $y = ab^t$, where a and b are
constants.
(i) Using the substitution I = ln y , show that the relation between I and t is linear.
(ii) Find the equation of the estimated regression line of I on t and hence give estimates
for a and b.
(iii) Find the product moment correlation coefficient between I and t.
(iv) Predict the number of cell-phone subscribers in the year 2015. Comment on the
reliability of your prediction.
(v) It is required to estimate the value of t for which I = 1.5. Explain which of the
regression lines I on t or t on I, should be used. Use the equation of your choice to
find the value of t when I = 1.5.
11 Solution
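(i) Taking ln of both sides linearises the relation:
$\ln y = \ln(ab^t)$
$\ln y = \ln a + \ln(b^t)$
$\ln y = \ln a + t \ln b$
With $I = \ln y$, this is $I = \ln a + (\ln b)t$. Since ln a and ln b are constants, I is linear in t.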
(ii) From GC, I = 0.377423t + 0.26295183
Thus, comparing with $\ln y = t \ln b + \ln a$:
$\ln a = 0.26295183 \Rightarrow a = e^{0.26295183} = 1.300764 = 1.30$ (to 3 s.f.)
$\ln b = 0.377423 \Rightarrow b = e^{0.377423} = 1.4585 = 1.46$ (to 3 s.f.)
(iii) r = 0.996741 = 0.997 (to 3 s.f.)
(iv) When t = 15,
I = 0.377423(15) + 0.26295183
I = 5.92429683
$y = e^{5.92429683} = 374.0153$
The predicted number of subscribers is approximately 374 million.
Since t = 15 falls outside the data range of t, the prediction of y is unreliable, as the linear
relation between ln y and t may no longer hold.
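Parts (ii) to (iv) can be reproduced end to end with the Python sketch below, which regresses I = ln y on t, recovers a and b by exponentiating, and evaluates the model at t = 15. The helper fit_line is our own illustrative name; the GC output remains the reference.

```python
import math

def fit_line(us, ys):
    """Return (intercept, slope, r) for the least squares line of ys on us."""
    n = len(us)
    mean_u, mean_y = sum(us) / n, sum(ys) / n
    suy = sum((u - mean_u) * (y - mean_y) for u, y in zip(us, ys))
    suu = sum((u - mean_u) ** 2 for u in us)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = suy / suu
    return mean_y - slope * mean_u, slope, suy / math.sqrt(suu * syy)

# Data from the table in Question 11
t = list(range(1, 11))
y = [1.6, 2.7, 4.4, 6.4, 8.9, 13.1, 19.3, 28.2, 38.2, 48.7]

# Linearised model: I = ln y = ln a + (ln b) t
ln_a, ln_b, r = fit_line(t, [math.log(v) for v in y])
a, b = math.exp(ln_a), math.exp(ln_b)

# Prediction for 2015 (t = 15), an extrapolation beyond t = 10
y_2015 = a * b ** 15

print(f"I = {ln_b:.6f} t + {ln_a:.8f}, r = {r:.6f}")
print(f"a = {a:.3f}, b = {b:.3f}")
print(f"predicted subscribers in 2015: {y_2015:.0f} million")
```

Evaluating a·b^15 directly is equivalent to computing I at t = 15 and then taking the exponential, which is the route followed in part (iv).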