Chapter 23 Correlation and Linear Regression Tutorial Solutions With Comments

This document discusses linear regression and correlation. It provides examples of using scatter plots and calculating correlation coefficients to determine if a linear model is appropriate for relationships between two variables. It also demonstrates finding constants and using linear models to estimate values.
Chapter 23 Correlation and Linear Regression TMJC 2022

H2 Mathematics (9758)
Chapter 23 Correlation and Linear Regression
Tutorial Questions

1 9758 Specimen Paper/II/6


Giant pumpkins are often irregular in shape. In order to account for the different shapes
of pumpkins, growers of giant pumpkins measure the size of a pumpkin by a combination
of three measurements, called the ‘over the top’ length. Pumpkin growers keep records
so that they can estimate the mass of giant pumpkins while they are still growing. The
over the top lengths (d m) and the masses (m kg) of a random sample of 7 giant pumpkins
are as follows.

d   2.31   2.9   4.05   5.5   6.7   7.92   9.17
m   11     14    47     104   170   282    449

(i) Draw a scatter diagram of these data, and explain how you know from your diagram
that the relationship between m and d should not be modelled by an equation of the
form y = ax + b . [2]

(ii) Which of the formulae $m = ed^2 + f$ and $m = gd^3 + h$, where e, f, g and h are
constants, is the better model for the relationship between m and d? Explain fully
how you decided, and find the constants for the better formula. [5]

(iii) Use the formula you chose from part (ii) to estimate the mass of a giant pumpkin with
(a) over the top length 6 m,
(b) over the top length 12 m.
Explain which of your two estimates is more reliable. [3]

Q1 Solution
(i) [Scatter diagram of m against d: d-axis marked from 2.31 to 9.17, m-axis marked from 11 to 449.]

When plotting a scatter diagram, you must:
• Label the axes,
• Indicate min and max values on each of the axes,
• Show the relative positions of the points clearly,
• Check that all points are drawn.

From the diagram, as d increases, m increases at an increasing rate. Therefore, a linear
equation of the form y = ax + b is not appropriate.

From a scatter diagram, a linear model is appropriate when
• the points lie close to a straight line, and
• the trend shows that as d increases, m increases at a constant rate.


(ii) For $m = ed^2 + f$, r = 0.98889.
For $m = gd^3 + h$, r = 0.99951.

GC steps: with d in L1 and m in L2, input into the lists L3 = (L1)^2 and L4 = (L1)^3.
• For $m = ed^2 + f$, input with Xlist: L3 (which is $d^2$) and Ylist: L2 (which is m).
• For $m = gd^3 + h$, input with Xlist: L4 (which is $d^3$) and Ylist: L2 (which is m).

$m = gd^3 + h$ is the better model since |r| = 0.99951 is closer to 1.

g = 0.57165 ≈ 0.572 (3 s.f.)
h = 3.7431 ≈ 3.74 (3 s.f.)
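
As a cross-check on the GC values above, the comparison of the two models can be sketched in Python. This is only an illustrative sketch, not part of the original solution: `fit_line` is a hypothetical helper written for this example, and the variable names are assumptions.

```python
import math

def fit_line(xs, ys):
    """Least-squares line y = a + b*x; returns (a, b, r)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b, sxy / math.sqrt(sxx * syy)

d = [2.31, 2.9, 4.05, 5.5, 6.7, 7.92, 9.17]
m = [11, 14, 47, 104, 170, 282, 449]

# Linearise each candidate model by transforming d (the role of L3 and L4 on the GC).
f, e, r_square = fit_line([x ** 2 for x in d], m)  # m = e*d^2 + f
h, g, r_cube = fit_line([x ** 3 for x in d], m)    # m = g*d^3 + h

print(f"d^2 model: r = {r_square:.5f}")
print(f"d^3 model: r = {r_cube:.5f}, g = {g:.5f}, h = {h:.4f}")
```

The model whose transformed data give |r| closer to 1 is preferred, which is the same decision rule used in the solution.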

(iii) From GC,
(a) mass = 127 kg (3 s.f.)
(b) mass = 992 kg (3 s.f.)

For (b), d = 12 is outside the data range of d, so the linear relationship between m and $d^3$
might no longer hold. Since d = 6 lies within the data range of d and r = 0.99951 is close to
1, the estimate for (a) is more reliable.

An estimate is reliable when
• |r| is close to 1, and
• it is an interpolation (the value that we substitute in is within the data range).

An estimate is not reliable when
• it is an extrapolation (the value that we substitute in is outside of the data range, so the
linear relation between the 2 variables may no longer hold).
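
The interpolation check can also be written out explicitly. The sketch below is an illustration only: it takes the constants g and h quoted in part (ii) as given, and the function name `estimate_mass` is an assumption made for this example.

```python
g, h = 0.57165, 3.7431      # constants quoted in part (ii)
d_min, d_max = 2.31, 9.17   # smallest and largest d in the sample

def estimate_mass(d):
    """Return (estimated mass in kg, True if d lies within the data range)."""
    return g * d ** 3 + h, d_min <= d <= d_max

mass_a, interpolated_a = estimate_mass(6)
mass_b, interpolated_b = estimate_mass(12)

for d, mass, ok in [(6, mass_a, interpolated_a), (12, mass_b, interpolated_b)]:
    kind = "interpolation" if ok else "extrapolation"
    print(f"d = {d}: mass = {mass:.1f} kg ({kind})")
```

Only the estimate flagged as an interpolation is backed by the data, which matches the conclusion that (a) is the more reliable estimate.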


2 N2008/II/8
A certain metal discolours when exposed to air. To protect the metal against discolouring,
it is treated with a chemical. In an experiment, different quantities, x ml, of the chemical
were applied to standard samples of the metal, and the times, t hours, for the metal to
discolour were measured. The results are given in the table.
x 1.2 2.0 2.7 3.8 4.8 5.6 6.9
t 2.2 4.5 5.8 7.3 7.6 9.0 9.9
(i) Calculate the product moment correlation coefficient between x and t, and explain
whether your answer suggests that a linear model is appropriate. [3]
(ii) Draw a scatter diagram for the data. [1]
One of the values of t appears to be incorrect.
(iii) Indicate the corresponding point on your diagram by labeling it P, and explain why
the scatter diagram for the remaining points may be consistent with a model of the
form t = a + b ln x . [2]
(iv) Omitting P, calculate least squares estimates of a and b for the model t = a + b ln x .
[2]
(v) Assume that the value of x at P is correct. Estimate the value of t for this value of
x. [1]
(vi) Comment on the use of the model in part (iv) in predicting the value of t when
x = 8.0 . [1]

2 Solution
(i) Using the GC, the product moment correlation coefficient is r ≈ 0.970 (3 s.f.). Since
the value of r is close to 1, which suggests a strong positive linear correlation, a linear
model seems appropriate.

(ii) [Scatter diagram of t against x: x-axis marked from 1.2 to 6.9, t-axis marked from 2.2 to 9.9, with the point P(4.8, 7.6) labelled.]

When plotting a scatter diagram, you must:
• Label the axes,
• Indicate min and max values on each of the axes,
• Show the relative positions of the points clearly,
• Check that all points are drawn.

(iii) With the point P removed, as x increases, t increases at a decreasing
rate. Hence the remaining points are consistent with a model of the form t = a + b ln x.


(iv) Delete P(4.8, 7.6) from the lists first, then input L3 = ln(L1) and find the regression line of t on ln x.

t = 1.4247 + 4.3966 ln x
Therefore a = 1.42 and b = 4.40 (3 s.f.)

(v) When x = 4.8 , t = 1.4247 + 4.3966 ( ln 4.8 ) = 8.32 (3 s.f.)
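
Parts (iv) and (v) can be reproduced with a short Python sketch. It is illustrative only: `fit_line` is a hypothetical helper defined here, not a GC function, and the variable names are assumptions.

```python
import math

def fit_line(xs, ys):
    """Least-squares line y = a + b*x; returns (a, b, r)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b, sxy / math.sqrt(sxx * syy)

x = [1.2, 2.0, 2.7, 3.8, 4.8, 5.6, 6.9]
t = [2.2, 4.5, 5.8, 7.3, 7.6, 9.0, 9.9]

# Omit the suspect point P(4.8, 7.6) before fitting, as the solution does.
kept = [(xi, ti) for xi, ti in zip(x, t) if (xi, ti) != (4.8, 7.6)]

# Regress t on ln x (the role of L3 = ln(L1) on the GC).
a, b, r = fit_line([math.log(xi) for xi, _ in kept], [ti for _, ti in kept])

t_at_P = a + b * math.log(4.8)  # part (v): estimate of t at the x-value of P
print(f"a = {a:.4f}, b = {b:.4f}, estimated t at x = 4.8: {t_at_P:.2f}")
```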

(vi) Since x = 8.0 is outside the data range of x, the linear relation between t and ln x may no
longer hold. Thus the predicted value of t is unreliable.

An estimate is reliable when
• |r| is close to 1, and
• it is an interpolation (the value that we substitute in is within the data range).

An estimate is not reliable when
• it is an extrapolation (the value that we substitute in is outside of the data range, so the
linear relation between the 2 variables may no longer hold).


3 N2009/II/6
The table gives the world record time, in seconds above 3 minutes 30 seconds, for running
1 mile as at 1st January in various years.
Year, x 1930 1940 1950 1960 1970 1980 1990 2000
Time, t 40.4 36.4 31.3 24.5 21.1 19.0 16.3 13.1
(i) Draw a scatter diagram to illustrate the data. [2]
(ii) Comment on whether a linear model would be appropriate, referring both to the
scatter diagram and the context of the question. [2]
(iii) Explain why in this context a quadratic model would probably not be appropriate
for long-term predictions. [1]
(iv) Fit a model of the form ln t = a + bx to the data, and use it to predict the world
record time as at 1st January 2010. Comment on the reliability of your prediction.
[3]
3 Solution
(i) [Scatter diagram of t against x: x-axis marked from 1930 to 2000, t-axis marked from 13.1 to 40.4.]

When plotting a scatter diagram, you must:
• Label the axes,
• Indicate min and max values on each of the axes,
• Show the relative positions of the points clearly,
• Check that all points are drawn.

(ii) From the data gathered from 1930 to 2000, a linear model is appropriate from the
scatter diagram since most of the data points lie close to a straight line.
However, in the context of the question, a linear model may not be appropriate in the
long run since human capacity will at some point in time reach a plateau.

(iii) A quadratic model is not suitable for long-term predictions, as a quadratic model with
a minimum turning point means that there will be a point in time after which the value of t
(record time) increases as x (year) increases. But t (record time) can only decrease or
stay the same as the years go by.

Note: A quadratic curve based on the current shape of the scatter diagram would mean
that we are expecting a minimum turning point along the way, and the data points
would exhibit an increasing trend after the turning point. However, in this context
we are recording the world record time, so it is impossible to record a time that
is longer than in previous years.


(iv) Using GC, ln t = 34.853 − 0.016128x, i.e. ln t = 34.9 − 0.0161x (3 s.f.)

When x = 2010,
ln t = 34.853 − 0.016128(2010)
ln t = 2.4359 (using the unrounded GC values)
t = 11.4 (3 s.f.)
Thus, the predicted world record time is 3 minutes 41.4 seconds.

Since x = 2010 falls outside the data range of x, the linear relation between ln t and x may
no longer hold. Thus the prediction is not reliable.

An estimate is reliable when
• |r| is close to 1, and
• it is an interpolation (the value that we substitute in is within the data range).

An estimate is not reliable when
• it is an extrapolation (the value that we substitute in is outside of the data range, so the
linear relation between the 2 variables may no longer hold).
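
The fit of ln t on x and the 2010 prediction can be checked with the sketch below. It is illustrative only: `fit_line` is a hypothetical helper written for this example, and the variable names are assumptions.

```python
import math

def fit_line(xs, ys):
    """Least-squares line y = a + b*x; returns (a, b, r)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b, sxy / math.sqrt(sxx * syy)

years = [1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000]
times = [40.4, 36.4, 31.3, 24.5, 21.1, 19.0, 16.3, 13.1]  # seconds above 3 min 30 s

# Fit ln t = a + b*x, then undo the logarithm to predict t for 2010.
a, b, r = fit_line(years, [math.log(t) for t in times])
t_2010 = math.exp(a + b * 2010)

is_extrapolation = not (min(years) <= 2010 <= max(years))
print(f"ln t = {a:.3f} + ({b:.6f})x, predicted t in 2010 = {t_2010:.1f} s")
print("extrapolation" if is_extrapolation else "interpolation")
```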


4 N2015/II/10
In an experiment the following information was gathered about air pressure P, measured
in inches of mercury, at different heights above sea-level h, measured in feet.
h 2000 5000 10 000 15 000 20 000 25 000 30 000 35 000 40 000 45 000
P 27.8 24.9 20.6 16.9 13.8 11.1 8.89 7.04 5.52 4.28
(i) Draw a scatter diagram for these values, labelling the axes. [1]
(ii) Find, correct to 4 decimal places, the product moment correlation coefficient between
(a) h and P,
(b) ln h and P,
(c) $\sqrt{h}$ and P. [3]
(iii) Using the most appropriate case from part (ii), find the equation which best models
air pressure at different heights. [3]
(iv) Given that 1 metre = 3.28 feet, re-write your equation from part (iii) so that it can
be used to estimate the air pressure when the height is given in metres. [2]
4 Solution
(i) [Scatter diagram of P against h: h-axis marked from 2000 to 45000, P-axis marked from 4.28 to 27.8.]

When plotting a scatter diagram, you must:
• Label the axes,
• Indicate min and max values on each of the axes,
• Show the relative positions of the points clearly,
• Check that all points are drawn.

(ii)(a) r = −0.980731 ≈ −0.9807 (correct to 4 d.p.)
(ii)(b) r = −0.974800 ≈ −0.9748 (correct to 4 d.p.)


(ii)(c) r = −0.998637 ≈ −0.9986 (correct to 4 d.p.)

(iii) Since for part (c), r = −0.9986 is closest to −1, and as h increases, P decreases at a
decreasing rate, part (c) is the most appropriate model.

Give 2 reasons when choosing the most appropriate model:
• Describe the trend of the data points based on the scatter diagram.
• The |r| value is closest to 1 (or the r value is closest to −1 in this case, since all are negative).

The height above sea level h is the independent variable, hence we should find the regression line
of P on $\sqrt{h}$. Hence, in GC,
Xlist (independent): L4 ($\sqrt{h}$)
Ylist (dependent): L2 (P)

$P = 34.789 - 0.14687\sqrt{h}$
$P = 34.8 - 0.147\sqrt{h}$ (correct to 3 s.f.)

(iv) $P = 34.789 - 0.14687\sqrt{h}$ (h is in feet)

Identify the relationship between the units. When the height is given as x metres, we have to multiply
x by 3.28 to change the units to feet. Hence, we replace h by 3.28x so that we are working with the same units
as the equation we have found in (iii).

Since 1 metre = 3.28 feet, x metres = 3.28x feet.
$P = 34.789 - 0.14687\sqrt{3.28x}$
$P = 34.789 - 0.26599\sqrt{x}$

Change the variable from x to h, since the height is denoted by h in the question.
$P = 34.789 - 0.26599\sqrt{h}$ (h is in metres)
$P = 34.8 - 0.266\sqrt{h}$ (correct to 3 s.f.) (h is in metres)

Alternative method: Convert all data values to metres and recompute. You will get
the same answer.
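
The three correlation coefficients and the unit conversion can be checked with the sketch below. It is illustrative only: `fit_line` is a hypothetical helper written for this example, and the variable names are assumptions.

```python
import math

def fit_line(xs, ys):
    """Least-squares line y = a + b*x; returns (a, b, r)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b, sxy / math.sqrt(sxx * syy)

h_ft = [2000, 5000, 10000, 15000, 20000, 25000, 30000, 35000, 40000, 45000]
P = [27.8, 24.9, 20.6, 16.9, 13.8, 11.1, 8.89, 7.04, 5.52, 4.28]

_, _, r_h = fit_line(h_ft, P)                                  # (a) h and P
_, _, r_ln = fit_line([math.log(h) for h in h_ft], P)          # (b) ln h and P
a, b, r_sqrt = fit_line([math.sqrt(h) for h in h_ft], P)       # (c) sqrt(h) and P

# Part (iv): sqrt(3.28 x) = sqrt(3.28) * sqrt(x), so only the slope is rescaled.
b_metres = b * math.sqrt(3.28)

# Alternative method: convert the heights to metres first and refit.
a_alt, b_alt, _ = fit_line([math.sqrt(h / 3.28) for h in h_ft], P)

print(f"r: h {r_h:.4f}, ln h {r_ln:.4f}, sqrt(h) {r_sqrt:.4f}")
print(f"feet:   P = {a:.3f} + ({b:.5f}) sqrt(h)")
print(f"metres: P = {a:.3f} + ({b_metres:.5f}) sqrt(h)")
```

Both routes to the metres equation agree because the intercept is unaffected by rescaling the independent variable.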


5 N2012/II/8
Amy is revising for a mathematics examination and takes a different practice paper each
week. Her marks, y% in week x, are as follows.
Week x 1 2 3 4 5 6
Percentage mark y 38 63 67 75 71 82
(i) Draw a scatter diagram showing these marks. [1]
(ii) Suggest a possible reason why one of the marks does not seem to follow the trend.
[1]
(iii) It is desired to predict Amy’s marks on future papers. Explain why, in this context,
neither a linear nor a quadratic model is likely to be appropriate. [2]

It is decided to fit a model of the form ln ( L − y ) = a + bx , where L is a suitable constant.


The product moment correlation coefficient between x and ln ( L − y ) is denoted by r.
The following table gives the values of r for some possible values of L.
L    91    92           93
r          −0.929944    −0.929918
(iv) Calculate the value of r for L = 91, giving your answer correct to 6 decimal places.
[1]
(v) Use the table and your answer to part (iv) to suggest with a reason which of 91, 92
or 93 is the most appropriate value for L. [1]
(vi) Using this value of L, calculate the values of a and b, and use them to predict the
week in which Amy will obtain her first mark of at least 90%. [4]
(vii) Give an interpretation, in context, of the value of L. [1]

5 Solution
(i) [Scatter diagram of y against x: x-axis marked from 1 to 6, y-axis marked from 38 to 82.]

When plotting a scatter diagram, you must:
• Label the axes,
• Indicate min and max values on each of the axes,
• Show the relative positions of the points clearly,
• Check that all points are drawn.

(ii) The irregularity occurred in Week 1. That practice paper may be more difficult than
the other papers.
Note: Another possible reason could be that Amy was not prepared academically
for the practice paper in Week 1.
(iii) The marks cannot exceed 100%, and so a linear model, which predicts marks that
increase (or decrease) without bound, is not appropriate.
The marks are likely to plateau as the weeks go by, whereas a quadratic model
fits data that increase and then decrease (or the other way round). Thus, a
quadratic model is also not appropriate.
(iv) For L = 91, r = −0.929744 (6 decimal places)


(v) Since |r| = 0.929944 is closest to 1 for L = 92 (that is, r = −0.929944 is closest to −1),
this is the most appropriate value for L.
Concept: r measures the strength of the linear relationship.

(vi) ln(92 − y) = a + bx
In GC, use Xlist for the variable x and Ylist for the variable ln(92 − y).

From GC, a = 4.10, b = −0.280 (3 s.f. for final answer)

ln(92 − y) = 4.1045 − 0.27960x (5 s.f. for intermediate working)
Thus, ln(92 − 90) = 4.1045 − 0.27960x
⇒ x = 12.2
Amy will first get at least 90% in Week 13.

Remember to answer the question. Since Amy's mark increases as the weeks go by,
we round up to Week 13.

(vii) L is the percentage mark that Amy's marks approach if she continues practising indefinitely.
ln(L − y) = a + bx
$y = L - e^{a+bx}$
Since b < 0, $e^{a+bx} \to 0$ as $x \to \infty$, so
$y \to L$.
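
Parts (iv) to (vi) can be reproduced with the sketch below. It is illustrative only: `fit_line` and `fit_for` are hypothetical helpers written for this example, and the variable names are assumptions.

```python
import math

def fit_line(xs, ys):
    """Least-squares line y = a + b*x; returns (a, b, r)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b, sxy / math.sqrt(sxx * syy)

weeks = [1, 2, 3, 4, 5, 6]
marks = [38, 63, 67, 75, 71, 82]

def fit_for(L):
    """Fit ln(L - y) = a + b*x for a trial value of L."""
    return fit_line(weeks, [math.log(L - y) for y in marks])

# Parts (iv) and (v): pick the L whose |r| is closest to 1.
r_by_L = {L: fit_for(L)[2] for L in (91, 92, 93)}
best_L = max(r_by_L, key=lambda L: abs(r_by_L[L]))

# Part (vi): solve ln(L - 90) = a + b*x and round up to a whole week.
a, b, _ = fit_for(best_L)
x_90 = (math.log(best_L - 90) - a) / b
first_week = math.ceil(x_90)

print({L: round(r, 6) for L, r in r_by_L.items()})
print(f"L = {best_L}, a = {a:.4f}, b = {b:.5f}, first week with 90%: {first_week}")
```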


6 N2013/II/10
(i) Sketch a scatter diagram that might be expected when x and y are related
approximately as given in each of the cases (A), (B) and (C) below. In each case
your diagram should include 6 points, approximately equally spaced with respect
to x , and with all x- and y- values positive. The letters a, b, c, d, e and f represent
constants.
(A) $y = a + bx^2$, where a is positive and b is negative,
(B) $y = c + d \ln x$, where c is positive and d is negative,
(C) $y = e + \dfrac{f}{x}$, where e is positive and f is negative. [3]
A motoring website gives the following information about the distance travelled, y km,
by a certain type of car at different speeds, x km h$^{-1}$, on a fixed amount of fuel.
Speed, x 88 96 104 112 120 128
Distance, y 148 147 144 138 126 107
(ii) Draw the scatter diagram for these values, labelling the axes. [1]
(iii) Explain which of the three cases in part (i) is the most appropriate for modelling
these values, and calculate the product moment correlation coefficient for this case.
[2]
(iv) It is required to estimate the distance travelled at a speed of 110 km h$^{-1}$. Use the
case that you identified in part (iii) to find the equation of a suitable regression line,
and use your equation to find the required estimate. [3]
6 Solution
(i) Ensure that there are:
• 6 points,
• equally spaced with respect to x,
• with positive x and y values.

(A) $y = a + bx^2$, where a is positive and b is negative

[Sketch of the scatter diagram for case (A) not reproduced.]

When unsure, use the GC to sketch a graph that has the required shape. One example is $y = 1 - x^2$.
We are only interested in the first quadrant, for positive x and y values.


(B) $y = c + d \ln x$, where c is positive and d is negative

[Sketch of the scatter diagram for case (B) not reproduced.]

When unsure, use the GC to sketch a graph that has the required shape. One example is $y = 5 - 2\ln x$.
We are only interested in the first quadrant, for positive x and y values.

(C) $y = e + \dfrac{f}{x}$, where e is positive and f is negative

[Sketch of the scatter diagram for case (C) not reproduced.]

When unsure, use the GC to sketch a graph that has the required shape. One example is $y = 5 - \dfrac{2}{x}$.
We are only interested in the first quadrant, for positive x and y values.

(ii) [Scatter diagram of y against x: x-axis marked from 88 to 128, y-axis marked from 107 to 148.]

When plotting a scatter diagram, you must:
• Label the axes,
• Indicate min and max values on each of the axes,
• Show the relative positions of the points clearly,
• Check that all points are drawn.

(iii) From the scatter diagram, as x increases, y decreases at an increasing rate, thus
model (A) $y = a + bx^2$ is the most appropriate model.
r = −0.939 (3 s.f.)
Describe the trend of the data points from the scatter diagram. Since all three
models have different general shapes, choose the one that best matches the scatter
diagram, without the need to compare their correlation coefficients.


(iv) There is no controlled variable in this context. Since we are given the value of x, we
should use the regression line of y on $x^2$.

Using GC,
$y = 189.75 - 0.0046198x^2$
⇒ $y = 190 - 0.00462x^2$ (to 3 s.f.)

When x = 110,
$y = 189.75 - 0.0046198(110)^2 = 134$ (to 3 s.f.)
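
The regression of y on $x^2$ and the estimate at x = 110 can be checked with the sketch below. It is illustrative only: `fit_line` is a hypothetical helper written for this example, and the variable names are assumptions.

```python
import math

def fit_line(xs, ys):
    """Least-squares line y = a + b*x; returns (a, b, r)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b, sxy / math.sqrt(sxx * syy)

speed = [88, 96, 104, 112, 120, 128]         # x, in km/h
distance = [148, 147, 144, 138, 126, 107]    # y, in km

# Model (A): y = a + b*x^2, so regress y on x^2.
a, b, r = fit_line([x ** 2 for x in speed], distance)
y_110 = a + b * 110 ** 2

print(f"y = {a:.2f} + ({b:.7f}) x^2, r = {r:.3f}")
print(f"estimated distance at 110 km/h: {y_110:.0f} km")
```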


7 N2016/II/8
A website about electric motors gives information about the percentage efficiency of
motors depending on their power, measured in horsepower. Xian has copied the following
table for a particular type of electric motor, but he has copied one of the efficiency values
wrongly.

Power, x         1     1.5   2     3     5     7.5   10    20    30    40    50
Efficiency, y%   72.5  82.5  84.0  87.4  87.5  88.5  89.5  90.2  91.0  91.7  92.4

(i) Plot a scatter diagram on graph paper for these values, labelling the axes, using a
scale of 2 cm to represent 10% efficiency on the y-axis and an appropriate scale for
the x-axis. On your diagram, circle the point that Xian has copied wrongly. [2]
For parts (ii), (iii) and (iv) of this question you should exclude the point for which Xian
has copied the efficiency value wrongly.
(ii) Explain from your scatter diagram why the relationship between x and y should not
be modelled by an equation of the form y = ax + b . [1]
(iii) Suppose that the relationship between x and y is modelled by an equation of the
form $y = \dfrac{c}{x} + d$, where c and d are constants. State with a reason whether each of c
and d is positive or negative. [2]
(iv) Find the product moment correlation coefficient and the constants c and d for the
model in part (iii). [3]
(v) Use the model $y = \dfrac{c}{x} + d$, with the values of c and d found in part (iv), to estimate
the efficiency value (y) that Xian has copied wrongly. Give two reasons why you
would expect this estimate to be reliable. [3]

7 Solution
(i) [Scatter diagram of y against x: x-axis marked from 1 to 50, y-axis marked from 72.5 to 92.4, with the wrongly copied point circled and annotated "This point does not follow the curvilinear trend".]

When plotting a scatter diagram, you must:
• Label the axes,
• Indicate min and max values on each of the axes,
• Show the relative positions of the points clearly,
• Check that all points are drawn.

(ii) As x increases, y increases at a decreasing rate. Therefore y = ax + b is not a suitable
model.
(iii) Since y increases as x increases, c is negative.
Since efficiency is non-negative, d is positive.
(You can try keying in positive and negative values of c and d to check.)


(iv) From GC,
r = −0.980 (3 s.f.)
c = −17.5 (3 s.f.)
d = 91.8 (3 s.f.)

GC steps:
• Remove the outlier before keying in x into L1 and y into L2.
• Key in L3 = 1/L1, then find the regression line of y (L2) on 1/x (L3).

(v) From GC, the estimated value = 85.9 (3 s.f.)
Since r = −0.980 is close to −1, which indicates a strong negative linear correlation
between y and $\dfrac{1}{x}$, and x = 3 is within the data range, the estimate is reliable.

An estimate is reliable when
• |r| is close to 1, and
• it is an interpolation (the value that we substitute in is within the data range).

An estimate is not reliable when
• it is an extrapolation (the value that we substitute in is outside of the data range, so the
linear relation between the 2 variables may no longer hold).
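
Parts (iv) and (v) can be checked with the sketch below, which regresses y on 1/x after dropping the point at x = 3. It is illustrative only: `fit_line` is a hypothetical helper written for this example, and the variable names are assumptions.

```python
import math

def fit_line(xs, ys):
    """Least-squares line y = a + b*x; returns (a, b, r)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b, sxy / math.sqrt(sxx * syy)

power = [1, 1.5, 2, 3, 5, 7.5, 10, 20, 30, 40, 50]
efficiency = [72.5, 82.5, 84.0, 87.4, 87.5, 88.5, 89.5, 90.2, 91.0, 91.7, 92.4]

# Exclude the wrongly copied point (the one at x = 3) before fitting.
kept = [(x, y) for x, y in zip(power, efficiency) if x != 3]

# Model y = c/x + d: regress y on 1/x, so the intercept is d and the slope is c.
d, c, r = fit_line([1 / x for x, _ in kept], [y for _, y in kept])
y_3 = c / 3 + d

print(f"c = {c:.1f}, d = {d:.1f}, r = {r:.3f}, estimate at x = 3: {y_3:.1f}")
```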


8 N2017/II/8
(a) Draw separate scatter diagrams, each with 8 points, all in the first quadrant, which
represent the situation where the product moment correlation coefficient between
variables x and y is
(i) −1 ,
(ii) 0 ,
(iii) between 0.5 and 0.9. [3]

(b) An investigation into the effect of a fertiliser on yields of corn found that the amount
of fertiliser applied, x, resulted in the average yields of corn, y, given below, where
x and y are measured in suitable units.
x 0 40 80 120 160 200
y 70 104 118 119 126 129
(i) Draw a scatter diagram for these values. State which of the following
equations, where a and b are positive constants, provides the most accurate
model of the relationship between x and y.
(A) $y = ax^2 + b$    (B) $y = \dfrac{a}{x^2} + b$
(C) $y = a \ln 2x + b$    (D) $y = a\sqrt{x} + b$    [2]
(ii) Using the model you chose in part (i), write down the equation for the
relationship between x and y, giving the numerical values of the coefficients.
State the product moment correlation coefficient for this model. [3]
(iii) Give two reasons why it would be reasonable to use your model to estimate
the value of y when x = 189. [2]

8 Solution
(a)(i) [Scatter diagram for r = −1 not reproduced.]
Note: Do not draw a regression line on your scatter diagram. The points
should look obviously collinear.

(a)(ii) [Scatter diagram for r = 0 not reproduced.]


(a)(iii) [Scatter diagram for r between 0.5 and 0.9 not reproduced.]

(b)(i) [Scatter diagram of y against x not reproduced.]
Keeping in mind that a and b are positive, model (D) provides the most accurate model of
the relationship between x and y.

(b)(ii) $y = 4.18211387\sqrt{x} + 74.04787$
i.e. $y = 4.18\sqrt{x} + 74.0$ (3 s.f.)
r = 0.981 (3 s.f.)

(b)(iii) Since x = 189 is within the data range of x and the value of r = 0.981 is close to 1,
implying a strong positive linear correlation between $\sqrt{x}$ and y, it is reasonable to use
model (D) to estimate the value of y when x = 189.

An estimate is reliable when
• |r| is close to 1, and
• it is an interpolation (the value that we substitute in is within the data range).

An estimate is not reliable when
• it is an extrapolation (the value that we substitute in is outside of the data range, so the
linear relation between the 2 variables may no longer hold).

9 2011 MJC/II/11
A random sample of nine pairs of values of x and y are given in the table.

x   2.5    2      3     3.5    5      4      5.3    7.5   6
y   3.20   3.40   3.0   2.86   2.61   2.75   2.57   k     2.55
(i) The equation of the regression line of y on x is y = −0.175 x + 3.57.
Show that k = 2.4. [3]
(ii) Draw a scatter diagram for this set of data and obtain the product moment
correlation coefficient. Comment on the suitability of the linear model. [4]
(iii) Determine which of the following models is more appropriate:
A. $\ln y = a + bx$
B. $y = a + \dfrac{b}{x}$
where a and b are constants. [2]
(iv) It is required to estimate the value of y for which x = 8. Find the equation of a
suitable regression line, and use it to find the required estimate. Comment on the
reliability of your estimation. [3]
9 Solution
(i) $\bar{x} = \dfrac{2.5 + 2 + 3 + 3.5 + 5 + 4 + 5.3 + 7.5 + 6}{9} = \dfrac{38.8}{9}$
$\bar{y} = \dfrac{3.2 + 3.4 + 3 + 2.86 + 2.61 + 2.75 + 2.57 + k + 2.55}{9} = \dfrac{22.94 + k}{9}$

The only point you can be certain is on the regression line is $(\bar{x}, \bar{y})$, so
$\bar{y} = -0.175\bar{x} + 3.57$
$\dfrac{22.94 + k}{9} = -0.175\left(\dfrac{38.8}{9}\right) + 3.57$
$22.94 + k = \dfrac{1267}{50}$
⇒ k = 2.4 (shown)
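
The same argument can be followed numerically. The sketch below is illustrative only, with variable names chosen for this example; it uses the fact that the mean point lies on the regression line of y on x.

```python
x = [2.5, 2, 3, 3.5, 5, 4, 5.3, 7.5, 6]
y_known = [3.20, 3.40, 3.0, 2.86, 2.61, 2.75, 2.57, 2.55]  # every y value except k

n = len(x)
mean_x = sum(x) / n

# The point (mean_x, mean_y) lies on y = -0.175x + 3.57.
mean_y = -0.175 * mean_x + 3.57

# mean_y = (sum of known y values + k) / n, so solve for k.
k = n * mean_y - sum(y_known)
print(f"k = {k:.2f}")
```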
(ii) [Scatter diagram of y against x: x-axis marked from 2.0 to 7.5, y-axis marked from 2.4 to 3.40.]

When plotting a scatter diagram, you must:
• Label the axes,
• Indicate min and max values on each of the axes,
• Show the relative positions of the points clearly,
• Check that all points are drawn.

Using GC: r = −0.943 (3 s.f.)
Even though r is close to −1, the scatter diagram shows that as x increases, y decreases at
a decreasing rate. Therefore, a linear model may not be suitable.
(iii) For Model A, r = −0.958 ( 3 s.f.)


For Model B, r = 0.997 (3 s.f.)

Since |r| is closer to 1 for Model B, it is more appropriate.

(iv) Equation of regression line for Model B is
$y = \dfrac{2.75}{x} + 2.07$ (3 s.f.)

When x = 8,
$y = \dfrac{2.7480}{8} + 2.0651 = 2.41$ (3 s.f.)

Although r = 0.997 suggests a strong positive linear correlation between y and $\dfrac{1}{x}$,
x = 8 falls outside the data range of x. Therefore, the estimate of y is unreliable, as the
linear relation between y and $\dfrac{1}{x}$ may no longer hold.
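
The Model B fit and the estimate at x = 8 can be checked with the sketch below. It is illustrative only: `fit_line` is a hypothetical helper written for this example, k = 2.4 is taken from part (i), and the variable names are assumptions.

```python
import math

def fit_line(xs, ys):
    """Least-squares line y = a + b*x; returns (a, b, r)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b, sxy / math.sqrt(sxx * syy)

x = [2.5, 2, 3, 3.5, 5, 4, 5.3, 7.5, 6]
y = [3.20, 3.40, 3.0, 2.86, 2.61, 2.75, 2.57, 2.4, 2.55]  # with k = 2.4

# Model B: y = a + b/x, so regress y on 1/x.
a, b, r = fit_line([1 / xi for xi in x], y)
y_8 = a + b / 8

within_range = min(x) <= 8 <= max(x)
print(f"y = {a:.4f} + {b:.4f}/x, r = {r:.3f}, estimate at x = 8: {y_8:.2f}")
print("interpolation" if within_range else "extrapolation")
```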


10 A car is travelling along a stretch of road with speed v km/h when the brakes are applied.
The car comes to rest after travelling a further distance of s metres. The values of s for 8
different values of v are given in the table, correct to 2 decimal places.

v 25 30 35 40 45 50 55 60
s 2.83 4.63 4.84 5.29 9.73 10.30 14.82 15.21

(i) Calculate the product moment correlation coefficient between v and $\sqrt{s}$. What
does this indicate about the scatter diagram of the points $(v, \sqrt{s})$?
(ii) It is given that the product moment correlation coefficient between v and s is 0.965,
correct to 3 decimal places. State why the regression line of $\sqrt{s}$ on v is more
suitable than the regression line of s on v, and find the equation of the regression
line of $\sqrt{s}$ on v.

(iii) Consider the equation of the regression line of $\sqrt{s}$ on v. In the context of the
question,
(a) comment on the value of the constant term,

(b) interpret the slope of the regression line.

(iv) Would you be willing to use this model to predict the further distance travelled if
the speed is 70 km/h? Explain your answer with reason(s).

10 Solution
(i) r = 0.97496 = 0.975 (3 s.f.)

This r = 0.975 indicates that most of the points $(v, \sqrt{s})$ lie close to a straight line.
(ii) Since 0.975 is closer to 1 than 0.965, the regression line of $\sqrt{s}$ on v is more
suitable than the regression line of s on v.

Equation of regression line of $\sqrt{s}$ on v is
$\sqrt{s} = 0.066336v - 0.017748$
i.e. $\sqrt{s} = 0.0663v - 0.0177$ (3 s.f.)
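
The comparison between regressing s on v and $\sqrt{s}$ on v can be checked with the sketch below. It is illustrative only: `fit_line` is a hypothetical helper written for this example, and the variable names are assumptions.

```python
import math

def fit_line(xs, ys):
    """Least-squares line y = a + b*x; returns (a, b, r)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b, sxy / math.sqrt(sxx * syy)

v = [25, 30, 35, 40, 45, 50, 55, 60]
s = [2.83, 4.63, 4.84, 5.29, 9.73, 10.30, 14.82, 15.21]

_, _, r_plain = fit_line(v, s)                           # s on v
a, b, r_sqrt = fit_line(v, [math.sqrt(si) for si in s])  # sqrt(s) on v

print(f"r for s on v:       {r_plain:.3f}")
print(f"r for sqrt(s) on v: {r_sqrt:.3f}")
print(f"sqrt(s) = {b:.6f} v + ({a:.6f})")
```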

(iii)(a) When v = 0, $\sqrt{s} = -0.0177 \neq 0$. A stationary car should need no stopping distance, and
$\sqrt{s}$ cannot be negative, so it is unrealistic to use the model for v = 0 (or v close to 0).
In fact, we should not use the regression line to model beyond the range of v.
OR
The constant term represents the value of $\sqrt{s}$ when the speed is 0, so
the constant term should be 0. The value of −0.0177 is close to 0, which shows the
model is quite accurate for the range of v given. (Again, we should not use the model for
values of v outside the range, let alone for v close to 0.)
(iii)(b) For each 1 km/h increase in the speed, the square root of the distance travelled
increases by 0.0663 m^(1/2).
(iv) No. 70 km/h lies outside the given data range of v and therefore the estimate of s is
unreliable, as the linear relation between $\sqrt{s}$ and v may no longer hold.


11 The table shows the number y (in millions) of cell-phone subscribers in a country from
2001 to 2010, where t represents number of years from 2000.

t 1 2 3 4 5 6 7 8 9 10
y 1.6 2.7 4.4 6.4 8.9 13.1 19.3 28.2 38.2 48.7

The relationship between y and t is given by the formula $y = ab^t$, where a and b are
constants.
(i) Using the substitution I = ln y, show that the relation between I and t is linear.
(ii) Find the equation of the estimated regression line of I on t and hence give estimates
for a and b.
(iii) Find the product moment correlation coefficient between I and t.
(iv) Predict the number of cell-phone subscribers in the year 2015. Comment on the
reliability of your prediction.
(v) It is required to estimate the value of t for which I = 1.5. Explain which of the
regression lines I on t or t on I, should be used. Use the equation of your choice to
find the value of t when I = 1.5.
11 Solution
(i) Take ln of both sides to linearise:
$\ln y = \ln(ab^t)$
$\ln y = \ln a + \ln(b^t)$
$\ln y = \ln a + t \ln b$
With I = ln y, this is I = ln a + (ln b)t, which is linear in t.
(ii) From GC, I = 0.377423t + 0.26295183
Thus, comparing with ln y = t ln b + ln a,
ln a = 0.26295183 ⇒ $a = e^{0.26295183} = 1.300764 = 1.30$ (to 3 s.f.)
ln b = 0.377423 ⇒ $b = e^{0.377423} = 1.4585 = 1.46$ (to 3 s.f.)
(iii) r = 0.996741 = 0.997 (to 3 s.f.)
(iv) When t = 15,
I = 0.377423(15) + 0.26295183
I = 5.92429683
$y = e^{5.92429683} = 374.0153$
≈ 374 million
Since t = 15 falls outside the data range of t, the prediction of y is unreliable, as the linear
relation between ln y and t may no longer hold.

(v) Since t is the independent variable, use the regression line of I on t.
(When I = 1.5, $y = e^{1.5} = 4.48$.)
1.5 = 0.377423t + 0.26295183
t = 3.2776 = 3.28 (to 3 s.f.)
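
The whole of this solution can be followed numerically with the sketch below. It is illustrative only: `fit_line` is a hypothetical helper written for this example, and the variable names are assumptions.

```python
import math

def fit_line(xs, ys):
    """Least-squares line y = a + b*x; returns (a, b, r)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b, sxy / math.sqrt(sxx * syy)

t = list(range(1, 11))
y = [1.6, 2.7, 4.4, 6.4, 8.9, 13.1, 19.3, 28.2, 38.2, 48.7]  # millions

# Regress I = ln y on t: the intercept is ln a and the slope is ln b.
ln_a, ln_b, r = fit_line(t, [math.log(yi) for yi in y])
a, b = math.exp(ln_a), math.exp(ln_b)

y_2015 = a * b ** 15          # part (iv): t = 15 is an extrapolation
t_at_I = (1.5 - ln_a) / ln_b  # part (v): solve 1.5 = ln a + t ln b

print(f"I = {ln_b:.6f} t + {ln_a:.6f}, r = {r:.6f}")
print(f"a = {a:.4f}, b = {b:.4f}")
print(f"predicted subscribers in 2015: {y_2015:.0f} million, t when I = 1.5: {t_at_I:.2f}")
```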
