ST104a Commentary 2022

This document provides examiners' commentary on the ST104a Statistics 1 examination from 2022. It outlines the learning outcomes students should achieve, advice on how to plan time during the exam, what examiners are looking for in answers, and key steps to improving performance. The commentary emphasizes covering all topics in the syllabus rather than focusing only on past paper questions, as examiners can test any part of the syllabus.

Examiners’ commentaries 2022

ST104a Statistics 1

Important note

This commentary reflects the examination and assessment arrangements for this course in the
academic year 2021–22. The format and structure of the examination may change in future years,
and any such changes will be publicised on the virtual learning environment (VLE).

Information about the subject guide and the Essential reading references

Unless otherwise stated, all cross-references will be to the latest version of the subject guide (2019).
You should always attempt to use the most recent edition of any Essential reading textbook, even if
the commentary and/or online reading list and/or subject guide refer to an earlier edition. If
different editions of Essential reading are listed, please check the VLE for reading supplements – if
none are available, please use the contents list and index of the new edition to find the relevant
section.

General remarks

Learning outcomes

At the end of the half course and having completed the Essential reading and activities you should:

• be familiar with the key ideas of statistics that are accessible to a candidate with a
moderate mathematical competence
• be able to routinely apply a variety of methods for explaining, summarising and presenting
data and interpreting results clearly using appropriate diagrams, titles and labels when
required
• be able to summarise the ideas of randomness and variability, and the way in which these
link to probability theory to allow the systematic and logical collection of statistical
techniques of great practical importance in many applied areas
• have a grounding in probability theory and some grasp of the most common statistical
methods
• be able to perform inference to test the significance of common measures such as means and
proportions and conduct chi-squared tests of contingency tables
• be able to use simple linear regression and correlation analysis and know when it is
appropriate to do so.

Planning your time in the examination

You have two hours to complete this paper, which is in two parts. The first part, Section A, is
compulsory; it comprises several subquestions and accounts for 50 per cent of the total marks.
Section B contains three questions, each worth 25 per cent, from which you are asked to choose two.
Remember that each of the Section B questions is likely to cover more than one topic. In 2022, for
example, Question 2 involved drawing diagrams (such as histograms), hypothesis testing (in
particular for differences between proportions) and confidence intervals. The first
part of Question 3 was on regression and involved drawing a diagram, while the second part was a
hypothesis test comparing population means using the sample data given. The first part of Question
4 asked for a chi-squared test and survey design problems appeared in the second part. This means
that it is really important that you make sure you have a reasonable idea of what topics are covered
before you start work on the paper! We suggest you divide your time as follows during the
examination.

• Spend the first 10 minutes annotating the paper. Note the topics covered in each question
and subquestion.
• Allow yourself 45 minutes for Section A. Do not allow yourself to get stuck on any one
question, but do not just give up after two minutes!
• Once you have chosen your two Section B questions, give them about 25 minutes each.
• This leaves you with 15 minutes. Do not leave the examination hall at this point! Check
over any questions you may not have completely finished. Make sure you have labelled and
given a title to any tables or diagrams which were required and, if you did more than the
two questions required in Section B, decide which one to delete. Remember that only two of
your answers will be given credit in Section B and that you must choose which these are!

What are the examiners looking for?

The examiners are looking for very simple demonstrations from you. They want to be sure that you:

• have covered the syllabus as described and explained in the subject guide
• know the basic formulae given there and when and how to use them
• understand and answer the questions set.

You are not expected to write long essays where explanations or descriptions of sampling design
are required, and note-form answers are acceptable. However, clear and accurate language, both
mathematical and written, is expected and marked. The explanations below and in the specific
commentaries for the papers for each zone should make these requirements clear.

Key steps to improvement

The most important thing you can do is answer the question set! This may sound very simple, but
these are some of the things that candidates did not do, though asked, in the 2022 examinations!
Remember the following.

• If you are asked to label a diagram (which is almost always the case!), please do so. Writing
‘Histogram’, ‘Stem-and-leaf diagram’, ‘Boxplot’ or ‘Scatter diagram’ in itself is insufficient.
What do the data describe? What are the units? What are the x-axis and y-axis?
• If you are specifically asked to perform a hypothesis test, or calculate a confidence interval,
do so. It is not acceptable to do one rather than the other! If you are asked to use a 5%
significance level, this is what will be marked.
• Do not waste time calculating things which are not required by the examiners. If you are
asked to find the line of best fit, you will get no additional marks for calculating the
correlation coefficient as well. If you are asked to use the confidence interval you have just
calculated to comment on the results, carrying out an additional hypothesis test will not
gain you marks.
• When performing calculations, try to use as many decimal places as possible in intermediate
steps to reach the most accurate solution. It is advised to have at least two decimal places
in general and at least three decimal places when calculating probabilities.

How should you use the specific comments on each question given in the
Examiners’ commentaries?

We hope that you find these useful. For each question and subquestion, they give:

• further guidance for each question on the points made in the last section
• the answers, or keys to the answers, which the examiners were looking for
• the relevant detailed reference to Newbold, P., W.L. Carlson and B.M. Thorne Statistics for
Business and Economics. (London: Prentice–Hall, 2012) eighth edition [ISBN
9780273767060] and the subject guide
• where appropriate, suggested activities from the subject guide which should help you to
prepare, and similar questions from Newbold et al. (2012).

Any further references you might need are given in the part of the subject guide to which you are
referred for each answer.

Memorising from the Examiners’ commentaries

It was noted recently that a small number of candidates appeared to be memorising answers from
previous years’ Examiners’ commentaries, for example plots, and produced the exact same image of
them without looking at the current year’s examination paper questions! Note that this is very easy
to spot. The Examiners’ commentaries should be used as a guide to practise on sample examination
questions and it is pointless to attempt to memorise them.

Examination revision strategy

Many candidates are disappointed to find that their examination performance is poorer than they
expected. This may be due to a number of reasons, but one particular failing is ‘question
spotting’, that is, confining your examination preparation to a few questions and/or topics which
have come up in past papers for the course. This can have serious consequences.

We recognise that candidates might not cover all topics in the syllabus in the same depth, but you
need to be aware that examiners are free to set questions on any aspect of the syllabus. This
means that you need to study enough of the syllabus to enable you to answer the required number of
examination questions.

The syllabus can be found in the Course information sheet available on the VLE. You should read
the syllabus carefully and ensure that you cover sufficient material in preparation for the
examination. Examiners will vary the topics and questions from year to year and may well set
questions that have not appeared in past papers. Examination papers may legitimately include
questions on any topic in the syllabus. So, although past papers can be helpful during your revision,
you cannot assume that topics or specific questions that have come up in past examinations will
occur again.

If you rely on a question-spotting strategy, it is likely you will find yourself in difficulties
when you sit the examination. We strongly advise you not to adopt this strategy.

Comments on specific questions – Zone A

Candidates should answer THREE questions: all parts of Section A (50 marks in total) and TWO
questions from Section B (25 marks each). Candidates are strongly advised to divide their
time accordingly.

Section A

Answer all parts of question 1 (50 marks in total).

Question 1

(a) Suppose that $x_1 = -2$, $x_2 = 1$, $x_3 = 27$, and $y_1 = 2$, $y_2 = 3$, $y_3 = 9$. Calculate
the following quantities:

i. $\sum_{i=1}^{2} x_i^3$    ii. $\sum_{i=1}^{3} \frac{y_i^2}{x_i}$    iii. $|\sqrt{y_3}| + \sum_{i=1}^{2} x_i^{y_i}$.

(6 marks)

Reading for this question


This question refers to the basic bookwork which can be found in Section 2.9 of the subject
guide and in particular Activities 2.2 and 2.3.

Approaching the question


Be careful to leave the $x_i$s and $y_i$s in the order given and only cover the values of i asked for.
This question was generally well done; the answers are as follows.

i. We have:
$$\sum_{i=1}^{2} x_i^3 = (-2)^3 + 1^3 = -8 + 1 = -7.$$

ii. We have:
$$\sum_{i=1}^{3} \frac{y_i^2}{x_i} = \frac{2^2}{-2} + \frac{3^2}{1} + \frac{9^2}{27} = -2 + 9 + 3 = 10.$$

iii. We have:
$$|\sqrt{y_3}| + \sum_{i=1}^{2} x_i^{y_i} = |\sqrt{9}| + (-2)^2 + 1^3 = 3 + 4 + 1 = 8.$$

Note: The calculation of $|\sqrt{y_3}| = 3$ was allocated a mark on its own.
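
As an optional check (not part of the examiners' solution), the three sums can be reproduced with a short Python sketch. The list names `x` and `y` and the result names are illustrative assumptions; the only point is that the slices restrict each sum to the values of i asked for.

```python
import math

# Data from the question, kept in the order given.
x = [-2, 1, 27]
y = [2, 3, 9]

# i. Sum of x_i cubed for i = 1, 2 only (the slice x[:2] drops x_3).
sum_i = sum(xi ** 3 for xi in x[:2])

# ii. Sum of y_i squared divided by x_i for i = 1, 2, 3.
sum_ii = sum(yi ** 2 / xi for xi, yi in zip(x, y))

# iii. |sqrt(y_3)| plus the sum of x_i to the power y_i for i = 1, 2.
sum_iii = abs(math.sqrt(y[2])) + sum(xi ** yi for xi, yi in zip(x[:2], y[:2]))

print(sum_i, sum_ii, sum_iii)
```

Changing a slice (for example, summing over all three values in part i.) changes the answer, which is exactly the slip the commentary warns about.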

(b) Classify each one of the following variables as either measurable (continuous) or
categorical. If a variable is categorical, further classify it as either nominal or
ordinal. Justify your answer. (No marks will be awarded without a justification.)
i. Labour force participation rates in 2022.
ii. Football team positions in the English Premier League (EPL) table.
iii. Participant number in an anonymous clinical trial.
(6 marks)

Reading for this question


This question requires identifying types of variable so reading the relevant section in the
subject guide (Section 4.6) is essential. Candidates should gain familiarity with the notion of
a variable and be able to distinguish between discrete and continuous (measurable) data. In
addition to identifying whether a variable is categorical or measurable, further distinctions
between ordinal and nominal categorical variables should be made by candidates.
Approaching the question
A general tip for identifying continuous and categorical variables is to think of the possible
values they can take. If these are finite and represent specific entities the variable is
categorical. Otherwise, if these consist of numbers corresponding to measurements, the data
are continuous and the variable is measurable. Such variables may also have measurement
units or can be measured to various decimal places.
i. Measurable – labour force participation rates can be measured in percentages to several
decimal places.
ii. Categorical, ordinal. Positions are in ranked order despite being based on quantitative
points (and goal differences).
iii. Categorical, nominal. Participant numbers are for identification only.

Weak candidates did not provide justifications for their choices, classified a measurable
variable as nominal or categorical, and sometimes answered ordinal when their justification
pointed to a nominal variable. There were also phrases like ‘It is measurable because it
can be measured’ that were not awarded any marks.

(c) State whether the following are true or false and give a brief explanation. (No
marks will be awarded for a simple true/false answer.)
i. A histogram is suitable for displaying a categorical variable.
ii. Impossible events can occur.
iii. If $X \sim N(a, a^2)$, then $P(X \le a) > P(X \ge a)$.
iv. A fair six-sided die was rolled n = 40 times. The expected number of ‘2’s is
an integer.
v. The correlation between x and y is the same as the correlation between y
and x.
(10 marks)

Reading for this question


This question contains material from various parts of the subject guide. Here, it is more
important to have a good intuitive understanding of the relevant concepts than the technical
level in computations. Part i. refers to Chapter 4 and in particular Section 4.7.3, whereas
parts ii. and iv. require knowledge of basic probability properties that can be found in
Sections 5.7–5.9. Part iii. is about probability properties of the normal distribution, see
Section 6.8. Finally, part v. focuses on material of Chapter 12 and more specifically Section
12.8.
Approaching the question
Candidates always find this type of question tricky. It requires a brief explanation of why
the statement is true or false, not just a choice between the two. Some candidates also
lost marks for long, rambling explanations without a declaration as to whether the
statement was true or false.
i. False. A histogram is suitable for displaying a measurable variable. Or, a bar chart is
suitable for displaying a categorical variable.
ii. False. Impossible events have zero probability of occurrence – for example, if A is
impossible then P (A) = 0. Or, only events with P (A) > 0 can occur.
iii. False. X is symmetric about a, hence P (X ≤ a) = P (X ≥ a). (Could also show that
both equal 0.50.)
iv. False. The expected number of ‘2’s is 40 × 1/6 = 6.6̇, which is not an integer.
v. True. Correlation is symmetric.

(d) Briefly explain the characteristics of an observational study and provide an
example.
(4 marks)

Reading for this question


This question contains material on observational studies covered in Section 11.6 of the
subject guide.
Approaching the question
The following can be extracted from the relevant text in the subject guide.
In an observational study data are collected on units (not necessarily people) without any
intervention. Researchers do their best not to influence the observations in any way.
A sample survey is a good example of such a study, where data are collected in the form of
questionnaire responses. Other reasonable examples were also accepted.

(e) The probability distribution of a random variable X is given below.

X = x        1      3      5      7
P(X = x)   0.20   0.30   0.25   0.25

i. Calculate the expected value of X.
(2 marks)
ii. Calculate the standard deviation of X to four decimal places.
(3 marks)
iii. Calculate $P(X > 1 \mid X < 7)$ to four decimal places.
(3 marks)
iv. Does X have a normal distribution? Briefly justify your answer.
(2 marks)

Reading for this question


This is a question on probability, exploring the concepts of relative frequency, conditional
probability and probability distribution. Reading from Chapter 5 is suggested with focus on
the sections on these topics.

Approaching the question


i. We have:
$$E(X) = \sum_x x\,p(x) = 1 \times 0.20 + 3 \times 0.30 + 5 \times 0.25 + 7 \times 0.25 = 4.1.$$

ii. We have:
$$E(X^2) = \sum_x x^2\,p(x) = 1^2 \times 0.20 + 3^2 \times 0.30 + 5^2 \times 0.25 + 7^2 \times 0.25 = 21.4$$
hence:
$$\mathrm{Var}(X) = E(X^2) - (E(X))^2 = 21.4 - (4.1)^2 = 4.59$$
and so, to four decimal places:
$$\text{Std. dev.}(X) = \sqrt{4.59} = 2.1424.$$

Make sure to distinguish between $E(X^2)$, $\mathrm{Var}(X)$, and standard deviation.

iii. We have:
$$P(X > 1 \mid X < 7) = \frac{P(\{X > 1\} \cap \{X < 7\})}{P(X < 7)} = \frac{P(1 < X < 7)}{P(X < 7)} = \frac{0.30 + 0.25}{0.75} = 0.7333.$$

Knowledge of the conditional probability formula was essential for this exercise.

iv. X is a discrete random variable, while the normal distribution is continuous. Hence X
does not have a normal distribution.
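
For candidates who like to verify their arithmetic, the calculations in parts i. to iii. can be sketched in Python as follows. This is an illustrative check only; the dictionary `pmf` and the other names are assumptions introduced here, not notation from the subject guide.

```python
import math

# Probability distribution from the question: value -> probability.
pmf = {1: 0.20, 3: 0.30, 5: 0.25, 7: 0.25}

# i. Expected value: sum of x * p(x).
expected = sum(x * p for x, p in pmf.items())

# ii. Variance via E(X^2) - (E(X))^2, then the standard deviation.
expected_sq = sum(x ** 2 * p for x, p in pmf.items())
variance = expected_sq - expected ** 2
std_dev = math.sqrt(variance)

# iii. Conditional probability P(X > 1 | X < 7) = P(1 < X < 7) / P(X < 7).
p_joint = sum(p for x, p in pmf.items() if 1 < x < 7)
p_condition = sum(p for x, p in pmf.items() if x < 7)
p_conditional = p_joint / p_condition

print(round(expected, 4), round(std_dev, 4), round(p_conditional, 4))
```

Keeping `expected_sq`, `variance` and `std_dev` as separate names mirrors the advice above to distinguish between the three quantities.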

(f) Two homebuyers ranked eight properties listed on a website in their preferred
order as follows:

Property      A   B   C   D   E   F   G   H
Homebuyer 1   3   1   4   8   2   5   7   6
Homebuyer 2   2   4   1   3   6   7   5   8

Calculate the Spearman rank correlation to four decimal places, and interpret
its value.
(6 marks)

Reading for this question


This question contains material on correlation and in particular the Spearman rank
correlation. Check Section 12.8.1, and practise Example 12.6.

Approaching the question


The home preferences are already ranked so we compute the differences immediately:

Property      A    B    C    D    E    F    G    H
Homebuyer 1   3    1    4    8    2    5    7    6
Homebuyer 2   2    4    1    3    6    7    5    8
d_i           1   −3    3    5   −4   −2    2   −2
d_i^2         1    9    9   25   16    4    4    4

The Spearman rank correlation is:
$$r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 72}{8(64 - 1)} = 0.1429.$$

Interpretation: There is a weak, positive relationship between the homebuyers’ preferences.
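
The same calculation can be sketched in Python as a check on the table of differences. The list names are illustrative assumptions, and the formula used is the one above, which applies here because there are no tied ranks.

```python
# Ranks given by each homebuyer for properties A to H.
buyer1 = [3, 1, 4, 8, 2, 5, 7, 6]
buyer2 = [2, 4, 1, 3, 6, 7, 5, 8]

n = len(buyer1)

# Differences in ranks and the sum of their squares.
differences = [r1 - r2 for r1, r2 in zip(buyer1, buyer2)]
sum_d_squared = sum(d ** 2 for d in differences)

# Spearman rank correlation: 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
r_s = 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))

print(differences, sum_d_squared, round(r_s, 4))
```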

(g) An employers’ group thinks that 80% of workers would prefer hybrid working
practices (i.e. a mixture of working in the office and from home) following the
pandemic.
i. Using a suitable approximation, calculate the probability that in a random
sample of n = 125 office workers at least 70% would indicate a preference for
hybrid working practices. State any assumption(s) you make.
(5 marks)
ii. A random sample of n = 125 office workers were asked about their own
preference for hybrid working practices following the pandemic. Of these, 85
indicated they would prefer this type of working arrangement. Based on your
result to part i., what would you conclude?
(3 marks)

Reading for this question


This question contains material on sample size determination in relation to the normal
distribution and the distribution of the sample proportion. Sample size determination is
covered in Section 7.11, whereas information on the normal distribution and the sample
proportion can be found in Sections 6.8–6.10. For confidence intervals, check Section 7.6 for
the principle and Section 7.10 for the case of a single proportion. The second part of the
question looks at p-values and the relevant reference in the subject guide is Section 8.11.
Approaching the question
i. The approximate sampling distribution of the sample proportion is:
$$P \sim N\left(\pi, \frac{\pi(1 - \pi)}{n}\right) = N\left(0.80, \frac{0.80 \times 0.20}{125}\right) = N(0.80, 0.00128).$$
Hence:
$$P(P \ge 0.70) \approx P\left(Z \ge \frac{0.70 - 0.80}{\sqrt{0.00128}}\right) \approx P(Z \ge -2.80) = 0.99744.$$
The sample size n = 125 is assumed to be sufficiently large so that the normal
approximation is justified according to the central limit theorem.
ii. 85 out of 125 is 85/125 = 68%. Based on the result above, if π = 0.80 it is very unlikely
to observe less than 70% support in a random sample of 125, specifically the
(approximate) probability is 1 − 0.99744 = 0.00256. So since only 68% in the sample
supported hybrid working practices, this suggests the employers’ group’s belief that
π = 0.80 is wrong and it is actually lower.
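
A minimal Python sketch of the normal approximation is given below as a check. It assumes Python 3.8 or later for `statistics.NormalDist`; the variable names are illustrative. The z-value is rounded to two decimal places only to mimic the use of statistical tables, so the unrounded probability differs slightly.

```python
import math
from statistics import NormalDist

pi_0 = 0.80       # proportion claimed by the employers' group
n = 125           # sample size
threshold = 0.70  # "at least 70%"

# Standard error of the sample proportion under the claimed value.
std_error = math.sqrt(pi_0 * (1 - pi_0) / n)

# Standardise the threshold.
z = (threshold - pi_0) / std_error

standard_normal = NormalDist()  # mean 0, standard deviation 1

# P(P >= 0.70), first with z rounded as when reading tables, then unrounded.
p_tables = 1 - standard_normal.cdf(round(z, 2))
p_unrounded = 1 - standard_normal.cdf(z)

# Part ii.: the (approximate) probability of seeing less than 70% support.
p_below = 1 - p_tables

print(round(z, 2), round(p_tables, 5), round(p_below, 5))
```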

Section B

Answer two out of the three questions from this section (25 marks each).

Question 2

(a) A bank manager wanted to have an idea of how many people are visiting a
particular bank branch on a typical day. For this reason, the number of
customers visiting that branch was recorded on 30 different days spread
throughout the year and is shown below:

95 96 96 98 99
99 101 101 102 102
103 104 104 107 107
111 112 113 113 114
115 117 121 123 124
127 129 131 135 143

i. Carefully construct, draw and label a histogram of these data. You can draw
the histogram on ordinary paper, instead of graph paper, but take care to
ensure reasonable accuracy.
ii. Find the mean and the modal group. You are given that the sum of the data
is 3,342.
iii. Find the median and the upper quartile.
iv. Comment on the data, given the shape of the histogram and the measures
which you have calculated.
(12 marks)

Reading for this question


Chapter 4 provides all the relevant material for this question. More specifically, reading on
histograms can be found in Section 4.7.3, but the entire Section 4.7 is highly relevant. For
measures of location (mean, median, modal group) see Section 4.8.
Approaching the question
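i. An acceptable histogram is displayed below:

[Figure: histogram of the number of customers per day, with ‘Number of customers’ on the x-axis and frequency density on the y-axis]

More specifically, marks were awarded for the following: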
• Title.
• Sensible choice of classes. Note that at least 4 classes are required otherwise the
histogram will be too simple. Too many classes is also undesirable.
• Label ‘Number of customers’ on the x-axis.
• Label on the y-axis (frequency density).
• Calculation of frequency densities.
• Plot accuracy.
It is generally helpful to produce a table of frequencies such as the one below and provide
it in the submitted solution:

Class interval   Interval width   Frequency   Frequency density
[90, 100)              10              6             0.6
[100, 110)             10              9             0.9
[110, 120)             10              7             0.7
[120, 130)             10              5             0.5
[130, 140)             10              2             0.2
[140, 150)             10              1             0.1

ii. The requested results are below:


• Mean = 3,342/30 = 111.4 customers.
• Modal group: [100, 110) customers (note that this depends on the chosen histogram
class intervals).
Make sure to use measurement units. Also, avoid the use of grouped data formulae as
they are approximate.
iii. The requested results are below:
• Median: 109 customers.
• Correct position of Q3 (between 22nd and 23rd inclusive).
• Q3 ≈ 119 customers. Note that any reasonable quartile method was accepted, i.e.
anything between 117 and 121 inclusive.

iv. The two main things to note here, are positive/right skewness and the fact that
mean > median.
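
The frequency table and the summary measures can be checked with the following Python sketch. It is illustrative only: the class width of 10 matches the table above but is one choice among several acceptable ones, and taking the midpoint of the 22nd and 23rd ordered values is just one of the quartile methods that were accepted.

```python
from statistics import mean, median

# Number of customers on the 30 recorded days.
customers = [
    95, 96, 96, 98, 99,
    99, 101, 101, 102, 102,
    103, 104, 104, 107, 107,
    111, 112, 113, 113, 114,
    115, 117, 121, 123, 124,
    127, 129, 131, 135, 143,
]

ordered = sorted(customers)

sample_mean = mean(customers)
sample_median = median(customers)

# Upper quartile as the midpoint of the 22nd and 23rd ordered values
# (list indices 21 and 22, because Python counts from zero).
upper_quartile = (ordered[21] + ordered[22]) / 2

# Frequencies and frequency densities for classes [90, 100), ..., [140, 150).
width = 10
frequencies = {
    lower: sum(lower <= value < lower + width for value in customers)
    for lower in range(90, 150, width)
}
densities = {lower: count / width for lower, count in frequencies.items()}

# Lower endpoint of the modal group for this choice of classes.
modal_lower = max(frequencies, key=frequencies.get)

print(sample_mean, sample_median, upper_quartile, modal_lower)
print(frequencies)
```

With these classes the mean exceeds the median, which is consistent with the positive skewness noted in part iv.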

(b) A large company gathered a random sample of 500 of its employees to
determine whether they like the design of their new web page. The table below
summarises the responses of the employees.

Gender    Sample size   Positive view on the new web page
Males         225                     110
Females       275                     165
i. Do the employees’ responses indicate a difference between males and females
in whether they like the design of the latest web page? Conduct a suitable
hypothesis test at two appropriate significance levels and comment on your
results. State any assumptions that you make.
ii. Compute a 98% confidence interval for the difference of proportions with a
positive view between males and females in the population.
(13 marks)

Reading for this question


Look up the sections about hypothesis testing and confidence intervals for differences
between proportions; more specifically Sections 7.12 and 8.15.

Approaching the question


The working of the exercise is given below:
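i. • $H_0: \pi_1 = \pi_2$ vs. $H_1: \pi_1 \neq \pi_2$.
• Calculation of pooled sample proportion: p = (110 + 165)/(225 + 275) = 0.55.
• Calculation of the standard error:
$$\text{S.E.}(p_1 - p_2) = \sqrt{0.55 \times 0.45 \times \left(\frac{1}{225} + \frac{1}{275}\right)} = 0.045.$$
• Test statistic value:
$$\frac{165/275 - 110/225}{0.045} = 2.485.$$
(The value 2.485 uses the unrounded standard error, 0.0447; the rounded value 0.045 would
give 2.469.)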
• For α = 0.05, critical values are ±1.96.
• Reject H0 at the 5% significance level since 1.96 < 2.485.
• Choose second (smaller) α, say 1% gives critical values of ±2.576, hence do not reject
H0 since 2.485 < 2.576.
• Moderate evidence of a difference between males and females with a positive view on
the new web page.
• Use of the standard normal distribution is justified by the large sample sizes.
ii. The working and marking was determined based on the elements below. Note that it is
essential to use the following formula for the standard error here:
$$\sqrt{\frac{0.4889 \times 0.5111}{225} + \frac{0.6 \times 0.4}{275}} = 0.045$$
• Correct endpoints 0.1111 ± 2.326 × 0.045 and hence (0.007, 0.215) or else (0.01, 0.22).
• Correct z-value: 2.326.
• Knowledge of method (this includes presenting as an interval and using the correct
standard error, i.e. the non-pooled case).
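
To see the difference between the pooled standard error (used for the test) and the non-pooled standard error (used for the confidence interval), the working can be sketched in Python as below. This is a check under stated assumptions rather than a model answer: it needs Python 3.8 or later for `statistics.NormalDist`, the names are illustrative, and the critical values 1.96 and 2.576 are those quoted in the solution.

```python
import math
from statistics import NormalDist

# Sample data: number with a positive view and sample size, by gender.
x_m, n_m = 110, 225   # males
x_f, n_f = 165, 275   # females

p_m = x_m / n_m
p_f = x_f / n_f
difference = p_f - p_m

# Hypothesis test: pooled proportion and pooled standard error.
pooled = (x_m + x_f) / (n_m + n_f)
se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_m + 1 / n_f))
z_stat = difference / se_pooled

reject_at_5 = abs(z_stat) > 1.96
reject_at_1 = abs(z_stat) > 2.576

# 98% confidence interval: non-pooled standard error.
se_unpooled = math.sqrt(p_m * (1 - p_m) / n_m + p_f * (1 - p_f) / n_f)
z_98 = NormalDist().inv_cdf(0.99)   # leaves 1% in each tail
ci_low = difference - z_98 * se_unpooled
ci_high = difference + z_98 * se_unpooled

print(round(z_stat, 3), reject_at_5, reject_at_1)
print(round(ci_low, 3), round(ci_high, 3))
```

Because the sketch carries full precision throughout, the interval endpoints may differ from the quoted ones in the third decimal place; both of the rounded intervals given above were accepted.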

Question 3

(a) A company would like to predict how its trainees in sales will perform based on
the results of an aptitude test that is given to them at the beginning of the
training. The table below contains the test scores (x) and the values of the sales
(y, in hundreds of dollars) for 9 randomly selected trainees during the first
month of working at the company.

Trainee A B C D E F G H I
x 1.8 2.6 2.8 3.4 3.6 4.2 4.8 5.2 5.4
y 5.4 6.4 6.0 6.2 6.8 7.0 7.6 7.3 7.6

The summary statistics for these data are:

Sum of the x data: 33.8     Sum of the squares of x data: 139.24
Sum of the y data: 60.3     Sum of the squares of y data: 408.61
Sum of the products of x and y data: 233.6

i. Draw a scatter diagram of these data. Carefully label the diagram. You can
draw the scatter diagram on ordinary paper, instead of graph paper, but take
care to ensure reasonable accuracy.
ii. Calculate the sample correlation coefficient. Interpret its value.
iii. Calculate and report the least squares line of y on x. Draw the line on the
scatter diagram.

iv. Based on the regression model above, what amount of sales during the first
month would you expect from someone who scored 4 in this aptitude test?
Would you trust this value? Justify your answer.
(13 marks)

Reading for this question


This is a standard regression question and the reading is to be found in Chapter 12. Section
12.6 provides details for scatter diagrams and is suitable for part i. whereas the remaining
parts are on correlation and regression that are covered in Sections 12.8–12.10 of the subject
guide. Section 12.7 is also relevant. Sample examination question 2 of this chapter is also
recommended for practice on questions of this type.
Approaching the question
i. Candidates are reminded that they are asked to draw and label the scatter diagram
which should include a title (‘Scatter diagram’ alone will not suffice) and labelled axes
which give their units in addition. Far too many candidates threw away marks by
neglecting these points and consequently were only given one mark out of the possible
four allocated for this part of the question.
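A reasonable scatter diagram is shown below:

[Figure: scatter diagram of sales (y, in hundreds of dollars) against aptitude test score (x), with the least squares line drawn]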

Note that the following elements are essential to get full marks:
• Informative title.
• Axis labels. (Units in either title or axis labels.)
• Plot accuracy.

ii. The summary statistics can be substituted into the formula for the correlation (make
sure you know which one it is!) to obtain the value 0.9491. An interpretation of this
value is the following: a higher aptitude test score is associated with more sales during the
first month. The fact that the value is close to 1 suggests that this is a strong, linear,
positive relationship.
Many candidates did not mention all three words (strong, linear, positive). Note that all
of these words provide useful information on interpreting the relationship.
iii. The regression line can be written by the equation $\hat{y} = a + bx$ or $y = a + bx + \varepsilon$. The
formula for b is:
$$b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2}$$
and by substituting the summary statistics we get b = 0.5804.
The formula for a is $a = \bar{y} - b\bar{x}$, so we get a = 4.5203.
Hence the regression line can be written as:

$$\hat{y} = 4.5203 + 0.5804x \quad \text{or} \quad y = 4.5203 + 0.5804x + \varepsilon.$$

It should also be plotted on the scatter diagram as shown above.


Many candidates incorrectly reported the regression line as y = 4.5203 + 0.5804x. This
expression is incorrect because it has neither the hat on y nor the error term; one of the
two forms above is required. Also, many candidates did not draw
this line on the scatter diagram; instead they drew an approximate line trying to go
around the points but without reference to the above equation. No marks were awarded
in such cases.
iv. The expected amount of sales is 4.5203 + 0.5804 × 4 = 6.8419 hundreds of dollars, i.e.
approximately $684. This prediction is reasonable, since x = 4 is within the range of the
data and the model fits well (due to the strong linear relationship).
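
The values quoted in parts ii. to iv. can be reproduced from the summary statistics with the Python sketch below. It is offered as a check only; the names `s_xy`, `s_xx` and `s_yy` for the corrected sums are assumptions introduced here, and the formulae are the standard ones for the sample correlation coefficient and the least squares line.

```python
import math

# Summary statistics given in the question.
n = 9
sum_x, sum_y = 33.8, 60.3
sum_x_sq, sum_y_sq = 139.24, 408.61
sum_xy = 233.6

x_bar = sum_x / n
y_bar = sum_y / n

# Corrected sums of squares and products.
s_xy = sum_xy - n * x_bar * y_bar
s_xx = sum_x_sq - n * x_bar ** 2
s_yy = sum_y_sq - n * y_bar ** 2

# ii. Sample correlation coefficient.
r = s_xy / math.sqrt(s_xx * s_yy)

# iii. Least squares estimates of the slope b and intercept a.
b = s_xy / s_xx
a = y_bar - b * x_bar

# iv. Predicted sales (in hundreds of dollars) for a test score of 4.
predicted_sales = a + b * 4

print(round(r, 4), round(b, 4), round(a, 4), round(predicted_sales * 100))
```

Note that the score of 4 lies between the smallest (1.8) and largest (5.4) observed scores, which is why the prediction is an interpolation rather than an extrapolation.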

(b) Two separate filling units, Unit 1 and Unit 2, are used to fill jars with coffee.
An experiment was then carried out, in which the quantities of coffee in some
randomly selected jars filled by each of these units, were measured. The
measurements are summarised in the table below.
                     Sample size   Sample mean   Sample standard deviation
Jar filling unit 1       40           25.15                2.30
Jar filling unit 2       32           23.90                2.00

i. Use an appropriate hypothesis test to determine whether there is a difference
between the mean quantities of coffee in jars from these two units. State
clearly the hypotheses, the test statistic and its distribution under the null
hypothesis, and carry out the test at two appropriate significance levels.
Comment on your findings.
ii. State clearly any assumptions you made in (b) part i.
iii. Repeat the procedure in (b) part i. to determine whether the mean quantity
of coffee in jars of Unit 1 is higher than that of Unit 2.
(12 marks)

Reading for this question


The first two parts of the question refer to a two-tailed hypothesis test comparing two
population means. While the entire chapter on hypothesis testing is relevant, one can focus
on Section 8.16. In terms of exercises see Example 8.7. The third part of the question refers
to one-tailed hypothesis tests.
Approaching the question
i. The working of the exercise is shown below, where $\mu_1$ denotes the mean quantity of
coffee in jars of Unit 1 and $\mu_2$ denotes the mean quantity of coffee in jars of Unit 2.
• $H_0: \mu_1 = \mu_2$ vs. $H_1: \mu_1 \neq \mu_2$.
• Test statistic value: 2.426. (If equal variances are not assumed the test statistic value
is 2.465.) For reference the test statistic formula is:
$$\frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_p^2 (1/n_1 + 1/n_2)}} \quad \text{or} \quad \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}.$$

• For α = 0.05, critical values are ±1.96 (±2.00 if the $t_{60}$ distribution is used).
• Decision: reject H0 since 1.96 < 2.426.
• Choose smaller α, say α = 0.01, hence the critical values become ±2.576 (or ±2.660),
hence do not reject H0 since 2.426 < 2.576.
• Moderate evidence of a difference between the mean quantities of coffee in jars from
the two units.
ii. The assumptions for i. were:
• about equal variances
• about whether n1 + n2 is ‘large’ so that the normality assumption is satisfied
• about independent samples.
Some candidates stated assumptions in this part that were not made in part i. Marks
were not awarded in such cases. Also, some other candidates just copied the phrase
‘assumption about equal variances’ and naturally were not awarded any marks. One
should state whether the calculations were based on the assumption that unknown
variances are equal or unequal.
iii. The working for the hypothesis test in this case is shown below.
• H0 : µ1 = µ2 vs. H1 : µ1 > µ2.
• The correct z-values are: 1.645 for the 5% significance level and 2.326 for the 1%
significance level.
• Reject H0 for α = 0.05 since 1.645 < 2.426 and for α = 0.01 since 2.326 < 2.426,
hence there is strong evidence that the mean coffee quantity in jars of Unit 1 is higher
than that of Unit 2.
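
The test statistic values quoted in part i. can be checked numerically. The following is a minimal sketch in Python, not part of the marking scheme; the variable names are illustrative, and the pooled variance is computed with the usual formula ((n1 − 1)s1² + (n2 − 1)s2²)/(n1 + n2 − 2), which the commentary does not write out.

```python
import math

# Summary statistics from the table in part (b).
n1, mean1, sd1 = 40, 25.15, 2.30   # Jar filling unit 1
n2, mean2, sd2 = 32, 23.90, 2.00   # Jar filling unit 2

# Equal-variance case: pooled variance, then the standard error.
pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
se_pooled = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_pooled = (mean1 - mean2) / se_pooled

# Unequal-variance case: each sample variance divided by its own n.
se_unpooled = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
t_unpooled = (mean1 - mean2) / se_unpooled

print(round(t_pooled, 3), round(t_unpooled, 3))
```

The two printed values should agree with the 2.426 and 2.465 reported above. The same value is compared with ±1.96 and ±2.576 in part i., and with 1.645 and 2.326 in part iii.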

Question 4

(a) The director of a university is exploring ways to improve student experience,


looking at student satisfaction evaluations. Before devising a plan, a student
satisfaction survey is conducted by taking a random sample of 100 students,
where student satisfaction is recorded separately for UK/EU and overseas
students. The results are summarised in the table below:
Satisfied Indifferent Dissatisfied
UK/EU 10 26 15
Overseas 20 14 15

i. Based on the data in the table, and without conducting any significance test,
would you say there is an association between the student’s origin and
satisfaction with university life?
ii. Calculate the χ² statistic and use it to test for independence of student’s
origin and satisfaction with university life at two appropriate significance
levels. What do you conclude?
(13 marks)

Reading for this question


This part targets Chapter 9 on contingency tables and chi-squared tests. Note that part i. of
the question does not require any calculations, just understanding and interpreting
contingency tables. Part ii. is a straightforward chi-squared test and the reading is also
given in Chapter 9. Look also at Example 9.5.


Approaching the question


i. An example of a ‘good’ answer is given below:
There are some differences in rates of satisfaction between UK/EU and overseas
students. More specifically, two thirds of the satisfied students were overseas students,
whereas half of those dissatisfied were overseas students. Hence there seems to be an
association between a student’s origin and satisfaction with university life, although this
needs to be investigated further.
ii. We test H0 : No association between a student’s origin and satisfaction with university
life vs. H1 : Association between a student’s origin and satisfaction with university life.
Be careful to get these the correct way round!
It is essential to calculate the expected values, which are shown below:
Satisfied Indifferent Dissatisfied
UK/EU 15.3 20.4 15.3
Overseas 14.7 19.6 14.7
The test statistic formula is:
\[ \sum_{i,j} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}} \]
which gives a value of 6.896. This is a 2 × 3 contingency table so the degrees of freedom
are (2 − 1) × (3 − 1) = 2.
For α = 0.05 ⇒ the critical value is 5.991, hence reject H0 since 5.991 < 6.896.
For α = 0.01 ⇒ the critical value is 9.210, hence do not reject H0 since 6.896 < 9.210.
There is moderate evidence of an association between a student’s origin and satisfaction
with university life.
Many candidates looked up the tables incorrectly and so failed to follow through their
earlier accurate work.
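
As a cross-check on the working for part ii., the expected frequencies and the test statistic can be reproduced with a short Python sketch. This is only an illustration with hypothetical variable names; each expected frequency is taken as (row total × column total) / n, which is how the values in the table above arise.

```python
# Observed frequencies: rows are UK/EU and Overseas,
# columns are Satisfied, Indifferent, Dissatisfied.
observed = [[10, 26, 15],
            [20, 14, 15]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency under independence: row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-squared statistic: sum of (O - E)^2 / E over all six cells.
chi_sq = sum((o - e) ** 2 / e
             for obs_row, exp_row in zip(observed, expected)
             for o, e in zip(obs_row, exp_row))

# Degrees of freedom: (rows - 1) * (columns - 1).
df = (len(observed) - 1) * (len(observed[0]) - 1)

print(expected, round(chi_sq, 3), df)
```

The critical values 5.991 and 9.210 still have to be read from the tables for 2 degrees of freedom.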

(b) You have been asked to design a stratified random sample survey on private
sector workers to examine whether life satisfaction of employees varies between
different job types.
i. Discuss in no more than three sentences how you will choose your sampling
frame, while mentioning potential limitation(s) of your choice.
ii. Propose two relevant stratification factors. Justify your choices.
iii. Discuss in no more than two sentences, how response bias could affect the
results in the context of this survey.
iv. Briefly outline the statistical methodology you would use to analyse the
collected data.
(12 marks)

Reading for this question


This was a question on basic material on survey designs. Background reading is given in
Chapter 10 of the subject guide which, along with the recommended reading should be
looked at carefully. Stratified random sampling is covered in Section 10.7.2, whereas
response bias in particular is covered in Section 10.10. Candidates were expected to have
studied and understood the main important constituents of design in random sampling.
Approaching the question
One of the main things to avoid in this part is writing essays without any structure. This
exercise asks for specific things and each one of them requires 1 or 2 lines at most. If you are
unsure of what these things are, do not write lengthy essays: they earn no marks and waste
valuable examination time. If you can identify what is being asked, keep in mind that the
answer should not be too long.
Note also that in some cases there is no unique answer to the question. Some suggested
answers are provided below.


i. Tax registers may be accessible but they may be outdated.


ii. Potential stratification factors include:
• income level
• gender
• age group.

iii. Those with poor life satisfaction may be working longer hours and therefore be tired
when responding, giving inaccurate responses. (Other reasonable suggestions were also
accepted.)
iv. Marks were awarded for discussing the use of graphs, confidence intervals and hypothesis
tests.

Examiners’ commentaries 2022


ST104a Statistics 1

Important note

This commentary reflects the examination and assessment arrangements for this course in the
academic year 2021–22. The format and structure of the examination may change in future years,
and any such changes will be publicised on the virtual learning environment (VLE).

Information about the subject guide and the Essential reading references

Unless otherwise stated, all cross-references will be to the latest version of the subject guide (2019).
You should always attempt to use the most recent edition of any Essential reading textbook, even if
the commentary and/or online reading list and/or subject guide refer to an earlier edition. If
different editions of Essential reading are listed, please check the VLE for reading supplements – if
none are available, please use the contents list and index of the new edition to find the relevant
section.

Comments on specific questions – Zone B

Candidates should answer THREE questions: all parts of Section A (50 marks in total) and TWO
questions from Section B (25 marks each). Candidates are strongly advised to divide their
time accordingly.

Section A

Answer all parts of question 1 (50 marks in total).

Question 1

(a) Suppose that x1 = 16, x2 = 4, x3 = 2, and y1 = 64, y2 = 1, y3 = −2. Calculate
the following quantities:
i. \( \sum_{i=2}^{3} y_i^3 \)   ii. \( \sum_{i=1}^{3} \frac{x_i^2}{y_i} \)   iii. \( |\sqrt{x_1}| + \sum_{i=2}^{3} y_i^{x_i} \).

(6 marks)

Reading for this question


This question refers to the basic bookwork which can be found in Section 2.9 of the subject
guide and in particular Activities 2.2 and 2.3.

Approaching the question


Be careful to leave the x_i's and y_i's in the order given and only cover the values of i asked for.
This question was generally well done; the answers are as follows.


i. We have:
\[ \sum_{i=2}^{3} y_i^3 = 1^3 + (-2)^3 = 1 - 8 = -7. \]

ii. We have:
\[ \sum_{i=1}^{3} \frac{x_i^2}{y_i} = \frac{16^2}{64} + \frac{4^2}{1} + \frac{2^2}{-2} = 4 + 16 - 2 = 18. \]

iii. We have:
\[ |\sqrt{x_1}| + \sum_{i=2}^{3} y_i^{x_i} = |\sqrt{16}| + 1^4 + (-2)^2 = 4 + 1 + 4 = 9. \]

Note: The calculation of \( |\sqrt{x_1}| = 4 \) was allocated a mark on its own.
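
The three sums can be checked with a few lines of Python. This is a sketch only, with illustrative names; the main point is that the index ranges must match the question, and that Python lists start at index 0, so i = 2, 3 corresponds to positions 1 and 2.

```python
import math

x = [16, 4, 2]    # x1, x2, x3
y = [64, 1, -2]   # y1, y2, y3

# i. Sum of y_i cubed for i = 2, 3 (list positions 1 and 2).
part_i = sum(y[i] ** 3 for i in range(1, 3))

# ii. Sum of x_i squared over y_i for i = 1, 2, 3.
part_ii = sum(x[i] ** 2 / y[i] for i in range(3))

# iii. |sqrt(x1)| plus the sum of y_i to the power x_i for i = 2, 3.
part_iii = abs(math.sqrt(x[0])) + sum(y[i] ** x[i] for i in range(1, 3))

print(part_i, part_ii, part_iii)
```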

(b) Classify each one of the following variables as either measurable (continuous) or
categorical. If a variable is categorical, further classify it as either nominal or
ordinal. Justify your answer. (No marks will be awarded without a justification.)
i. A person’s social security number.
ii. Orbiting speed of satellites.
iii. Final position in the medals table of an Olympic Games.
(6 marks)

Reading for this question


This question requires identifying types of variable so reading the relevant section in the
subject guide (Section 4.6) is essential. Candidates should gain familiarity with the notion of
a variable and be able to distinguish between discrete and continuous (measurable) data. In
addition to identifying whether a variable is categorical or measurable, further distinctions
between ordinal and nominal categorical variables should be made by candidates.
Approaching the question
A general tip for identifying continuous and categorical variables is to think of the possible
values they can take. If these are finite and represent specific entities the variable is
categorical. Otherwise, if these consist of numbers corresponding to measurements, the data
are continuous and the variable is measurable. Such variables may also have measurement
units or can be measured to various decimal places.
i. Categorical, nominal. Social security numbers are for identification only.
ii. Measurable – orbiting speed of satellites can be measured in units (such as km/h) to
several decimal places.
iii. Categorical, ordinal. Positions are in ranked order, even though they are based on the
number of gold medals, then silver, then bronze.
Weak candidates did not provide justifications for their choices, classified a measurable
variable as nominal or categorical, and sometimes answered ordinal when their justification
pointed to a nominal variable. There were also phrases like ‘It is measurable because it
can be measured’ that were not awarded any marks.

(c) State whether the following are true or false and give a brief explanation. (No
marks will be awarded for a simple true/false answer.)
i. Skewness of a distribution can be inferred from a boxplot.
ii. If P (A) < P (B), then if the event B does not occur, then A cannot occur.
iii. If X ∼ N(4, σ²), then X is as likely to be positive as it is likely to be
negative.


iv. A fair six-sided die was rolled n = 120 times. The expected number of ‘5’s is
an integer.
v. The slope coefficient when y is regressed on x is always the same as when x
is regressed on y.
(10 marks)

Reading for this question


This question contains material from various parts of the subject guide. Here, it is more
important to have a good intuitive understanding of the relevant concepts than the technical
level in computations. Part i. refers to Chapter 4 and in particular Section 4.9.2, whereas
parts ii. and iv. require knowledge of basic probability properties that can be found in
Sections 5.7–5.9. Part iii. is about probability properties of the normal distribution, see
Section 6.8. Finally, part v. focuses on material of Chapter 12 and more specifically Section
12.9.
Approaching the question
Candidates always find this type of question tricky. It requires a brief explanation of the
reason for why the statement is true/false and not just a choice between the two. Some
candidates lost marks too for long rambling explanations without a declaration as to
whether the statement was true or false.
i. True. Symmetry/asymmetry can be inferred from a boxplot, for example from the position
of the median within the box and the relative lengths of the whiskers.
ii. False. Although A is less likely to occur, since P (A) < P (B), it is possible for A to occur
and B not to occur.
iii. False. Only if µ = 0 is X as likely to be positive as it is negative, and here µ = 4.
iv. True. The expected number of ‘5’s is 120 × 1/6 = 20 which is an integer.
v. False. The formula for b is not symmetric in terms of x and y.

(d) Briefly explain the characteristics of an experimental study and provide an


example.
(4 marks)

Reading for this question


This question contains material on experimental studies covered in Section 11.7 of the
subject guide.
Approaching the question
The following can be extracted from the relevant text in the subject guide.
In an experiment, an intervention or treatment is administered to some or all of the
experimental units (usually people). Allocation of the treatment (or perhaps a combination
of treatments) is determined by using a form of randomisation.
A clinical trial of a new drug is a good example of such a study, where participants are
assigned to either a control group or a treatment group. Any other reasonable example
was accepted.

(e) The probability distribution of a random variable X is given below.


X=x 2 4 6 8
P (X = x) 0.25 0.25 0.40 0.10

i. Calculate the expected value of X.


(2 marks)


ii. Calculate the standard deviation of X to four decimal places.


(3 marks)
iii. Calculate P (X < 8 | X > 2) to four decimal places.
(3 marks)
iv. Does X have a uniform distribution? Briefly justify your answer.
(2 marks)

Reading for this question


This is a question on probability, exploring the concepts of relative frequency, conditional
probability and probability distribution. Reading from Chapter 5 is suggested with focus on
the sections on these topics.
Approaching the question
i. We have:
\[ E(X) = \sum_x x\,p(x) = 2 \times 0.25 + 4 \times 0.25 + 6 \times 0.40 + 8 \times 0.10 = 4.7. \]

ii. We have:
\[ E(X^2) = \sum_x x^2 p(x) = 2^2 \times 0.25 + 4^2 \times 0.25 + 6^2 \times 0.40 + 8^2 \times 0.10 = 25.8 \]
hence:
\[ \text{Var}(X) = E(X^2) - (E(X))^2 = 25.8 - (4.7)^2 = 3.71 \]
and so, to four decimal places:
\[ \text{Std. dev.}(X) = \sqrt{3.71} = 1.9261. \]
Make sure to distinguish between E(X²), Var(X), and standard deviation.


iii. We have:
\[ P(X < 8 \mid X > 2) = \frac{P(\{X < 8\} \cap \{X > 2\})}{P(X > 2)} = \frac{P(2 < X < 8)}{P(X > 2)} = \frac{0.25 + 0.40}{0.75} = 0.8667. \]

Knowledge of the conditional probability formula was essential for this exercise.
iv. The values of X occur with unequal probabilities, unlike a uniform distribution, hence X
does not have a uniform distribution.
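
Parts i. to iii. can be verified from the probability table directly. The sketch below, in Python with illustrative names, stores the distribution as a dictionary and applies the definitions of E(X), Var(X) and conditional probability used above.

```python
import math

# Probability distribution of X from the question: value -> probability.
pmf = {2: 0.25, 4: 0.25, 6: 0.40, 8: 0.10}

# i. Expected value.
mean = sum(x * p for x, p in pmf.items())

# ii. Variance via E(X^2) - (E(X))^2, then the standard deviation.
mean_sq = sum(x**2 * p for x, p in pmf.items())
variance = mean_sq - mean**2
std_dev = math.sqrt(variance)

# iii. P(X < 8 | X > 2) = P(2 < X < 8) / P(X > 2).
p_both = sum(p for x, p in pmf.items() if 2 < x < 8)
p_given = sum(p for x, p in pmf.items() if x > 2)
cond_prob = p_both / p_given

print(round(mean, 4), round(std_dev, 4), round(cond_prob, 4))
```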

(f ) Two shoppers ranked eight products listed on a retailer’s website in their


preferred order as follows:
Product A B C D E F G H
Shopper 1 6 5 8 2 3 7 1 4
Shopper 2 7 8 6 1 2 5 4 3

Calculate the Spearman rank correlation to four decimal places, and interpret
its value.
(6 marks)

Reading for this question


This question contains material on correlation and in particular the Spearman rank
correlation. Check Section 12.8.1, and practise Example 12.6.


Approaching the question


The product preferences are already ranked so we can compute the differences immediately:
Product A B C D E F G H
Shopper 1 6 5 8 2 3 7 1 4
Shopper 2 7 8 6 1 2 5 4 3
d_i −1 −3 2 1 1 2 −3 1
d_i² 1 9 4 1 1 4 9 1
The Spearman rank correlation is:
\[ r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 30}{8(64 - 1)} = 0.6429. \]

Interpretation: There is a moderate, positive relationship between the shoppers’ preferences.
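
The rank differences and the resulting coefficient can be reproduced as follows. This is a minimal Python sketch with illustrative names; it uses the formula above, which assumes there are no tied ranks (true for these data).

```python
# Ranks given by each shopper to products A to H.
shopper1 = [6, 5, 8, 2, 3, 7, 1, 4]
shopper2 = [7, 8, 6, 1, 2, 5, 4, 3]

n = len(shopper1)

# Squared rank differences d_i^2 for each product.
d_squared = [(a - b) ** 2 for a, b in zip(shopper1, shopper2)]

# Spearman rank correlation: 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)).
r_s = 1 - 6 * sum(d_squared) / (n * (n**2 - 1))

print(sum(d_squared), round(r_s, 4))
```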

(g) An office workers’ union thinks that 30% of its members would prefer to return
to the office full-time following the pandemic.
i. Using a suitable approximation, calculate the probability that in a random
sample of n = 240 members at most 38% would indicate a preference for
returning to the office full-time. State any assumption(s) you make.
(5 marks)
ii. A random sample of n = 240 union members were asked about their own
preference for returning to the office full-time following the pandemic. Of
these, 96 indicated they would prefer this type of working arrangement.
Based on your result to part i., what would you conclude?
(3 marks)

Reading for this question


This question contains material on sample size determination in relation to the normal
distribution and the distribution of the sample proportion. Sample size determination is
covered in Section 7.11, whereas information on the normal distribution and the sample
proportion can be found in Sections 6.8–6.10. For confidence intervals, check Section 7.6 for
the principle and Section 7.10 for the case of a single proportion. The second part of the
question looks at p-values and the relevant reference in the subject guide is Section 8.11.
Approaching the question
i. The approximate sampling distribution of the sample proportion is:
\[ P \sim N\left(\pi, \frac{\pi(1 - \pi)}{n}\right) = N\left(0.30, \frac{0.30 \times 0.70}{240}\right) = N(0.30, 0.000875). \]
Hence:
\[ P(P \le 0.38) \approx P\left(Z \le \frac{0.38 - 0.30}{\sqrt{0.000875}}\right) \approx P(Z \le 2.70) = 0.99653. \]
The sample size n = 240 is assumed to be sufficiently large so that the normal
approximation is justified according to the central limit theorem.
ii. 96 out of 240 is 96/240 = 40%. Based on the result above, if π = 0.30 it is very unlikely
to observe more than 38% support in a random sample of 240, specifically the
(approximate) probability is 1 − 0.99653 = 0.00347. So since 40% in the sample
supported returning to the office full-time, this suggests the union’s belief that π = 0.30
is wrong and it is actually higher.
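
The normal approximation in part i. can be sketched in Python as below. The helper `normal_cdf` is a hypothetical name introduced here; it evaluates the standard normal distribution function through the error function in the standard library, in place of the statistical tables a candidate would use.

```python
import math

def normal_cdf(z):
    """Standard normal distribution function, computed via math.erf."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

pi, n = 0.30, 240

# Variance of the sample proportion under the union's claim.
variance = pi * (1 - pi) / n

# Standardise the 38% threshold.
z = (0.38 - pi) / math.sqrt(variance)

# Rounding z to two decimal places mimics a table look-up.
prob_table = normal_cdf(round(z, 2))
prob_unrounded = normal_cdf(z)

print(round(z, 2), round(prob_table, 5), round(1 - prob_table, 5))
```

Using the unrounded z-value changes the probability only in the later decimal places, so the conclusion in part ii. is unaffected.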


Section B

Answer two out of the three questions from this section (25 marks each).

Question 2

(a) The data below represent the weights (in kg) of 30 participants in a study about
a drug.

57 59 61 63 64
65 73 74 74 74
75 77 77 81 82
82 82 83 83 85
87 89 91 93 96
96 98 99 99 101

i. Carefully construct, draw and label a histogram of these data. You can draw
the histogram on ordinary paper, instead of graph paper, but take care to
ensure reasonable accuracy.
ii. Find the mean and the modal group. You are given that the sum of the data
is 2,420.
iii. Find the median and the lower quartile.
iv. Comment on the data, given the shape of the histogram and the measures
which you have calculated.
(12 marks)

Reading for this question


Chapter 4 provides all the relevant material for this question. More specifically, reading on
histograms can be found in Section 4.7.3, but the entire Section 4.7 is highly relevant. For
measures of location (mean, median, modal group) see Section 4.8.
Approaching the question
i. An acceptable histogram is displayed below:


More specifically, marks were awarded for the following:


• Title.
• Sensible choice of classes. Note that at least 4 classes are required otherwise the
histogram will be too simple. Too many classes is also undesirable.
• Label ‘Weights in kg’ on the x-axis.
• Label on the y-axis (frequency density).
• Calculation of frequency densities.
• Plot accuracy.
It is generally helpful to produce a table of frequencies such as the one below and provide
it in the submitted solution:
Class interval   Interval width   Frequency   Frequency density
[50, 60)         10               2           0.2
[60, 70)         10               4           0.4
[70, 80)         10               7           0.7
[80, 90)         10               9           0.9
[90, 100)        10               7           0.7
[100, 110)       10               1           0.1

ii. The requested results are below:


• Mean = 2,420/30 = 80.67 kg.
• Modal group: [80, 90) kg (note that this depends on the chosen histogram class
intervals).
Make sure to use measurement units. Also, avoid the use of grouped data formulae as
they are approximate.
iii. The requested results are below:
• Median: 82 kg.
• Correct position of Q1 (between 7th and 8th inclusive).
• Q1 ≈ 73.5 kg. Note that any reasonable quartile method was accepted, i.e. anything
between 73 and 74 inclusive.

iv. The two main things to note here are the negative (left) skewness and the fact that
mean < median.
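
The frequency table and the summary measures in parts ii. and iii. can be reproduced with the Python sketch below. The class width of 10 starting at 50 is the choice made in the table above, not the only acceptable one, and the quartile is computed with the default method of `statistics.quantiles`, which is just one of the reasonable quartile methods mentioned in part iii.

```python
import statistics

weights = [57, 59, 61, 63, 64, 65, 73, 74, 74, 74,
           75, 77, 77, 81, 82, 82, 82, 83, 83, 85,
           87, 89, 91, 93, 96, 96, 98, 99, 99, 101]

# Frequency table: (lower bound, upper bound, frequency, frequency density).
width = 10
table = []
for lower in range(50, 110, width):
    freq = sum(lower <= w < lower + width for w in weights)
    table.append((lower, lower + width, freq, freq / width))

mean = sum(weights) / len(weights)
median = statistics.median(weights)
q1 = statistics.quantiles(weights, n=4)[0]   # lower quartile

# Modal group: the class with the highest frequency density.
modal_group = max(table, key=lambda row: row[3])

for row in table:
    print(row)
print(round(mean, 2), median, q1, modal_group[:2])
```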

(b) A football club gathered a random sample of 525 of their supporters to


determine whether they are in favour of a new pricing system on season tickets.
The table below summarises the responses of the supporters.

In favour of
Age group Sample size new pricing system
40 years old or above 325 221
Below 40 years old 200 120
i. Do the responses of the supporters indicate a difference between supporters
aged below 40 and older on whether they are in favour of the new pricing
system? Conduct a suitable hypothesis test at two appropriate significance
levels and comment on your results. State any assumptions that you make.
ii. Compute a 97% confidence interval for the difference of proportions in favour
of the new pricing system between supporters aged below 40 and older.
(13 marks)

Reading for this question


Look up the sections about hypothesis testing and confidence intervals for differences
between proportions; more specifically Sections 7.12 and 8.15.


Approaching the question


The working of the exercise is given below:
i. • H0 : π1 = π2 vs. H1 : π1 ≠ π2.
• Calculation of pooled sample proportion: p = (221 + 120)/(325 + 200) ≈ 0.65.
• Calculation of the standard error:
\[ \text{S.E.}(p_1 - p_2) = \sqrt{0.65 \times 0.35 \times \left(\frac{1}{325} + \frac{1}{200}\right)} = 0.043. \]
• Test statistic value:
\[ \frac{221/325 - 120/200}{0.043} = 1.866. \]
• For α = 0.05, critical values are ±1.96.
• Do not reject H0 at the 5% significance level since 1.866 < 1.96.
• Choose second (larger) α, say 10%, which gives critical values of ±1.645, hence reject H0
since 1.645 < 1.866.
• Weak evidence of a difference between supporters aged above or below 40, regarding
the new pricing system on season tickets.
• Use of the standard normal distribution is justified by the large sample sizes.
ii. The working and marking was determined based on the elements below. Note that it is
essential to use the following formula for the standard error here:
\[ \sqrt{\frac{0.68 \times 0.32}{325} + \frac{0.6 \times 0.4}{200}} = 0.043. \]
• Correct endpoints 0.08 ± 2.17 × 0.043 and hence (−0.013, 0.173) or else (−0.01, 0.17).
• Correct z-value: 2.17.
• Knowledge of method (this includes presenting as an interval and using the correct
standard error, i.e. the non-pooled case).
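
The difference between the pooled standard error (for the test) and the non-pooled standard error (for the confidence interval) can be seen in the following Python sketch. Variable names are illustrative, and the value 2.17 is taken from the commentary rather than computed.

```python
import math

x1, n1 = 221, 325   # aged 40 or above: number in favour, sample size
x2, n2 = 120, 200   # aged below 40
p1, p2 = x1 / n1, x2 / n2

# Part i.: pooled proportion and pooled standard error under H0.
pooled = (x1 + x2) / (n1 + n2)
se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se_pooled

# Part ii.: non-pooled standard error for the confidence interval.
se_unpooled = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z_crit = 2.17   # 97% confidence, from tables as quoted above
ci = (p1 - p2 - z_crit * se_unpooled, p1 - p2 + z_crit * se_unpooled)

print(round(z, 3), round(se_unpooled, 3), ci)
```

Because the sketch does not round the standard error to 0.043 before multiplying, the interval endpoints may differ from (−0.013, 0.173) in the third decimal place.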

Question 3

(a) A study was conducted to assess the effectiveness of a diet programme over
time. More specifically, the number of weeks on that diet (x) and the weight
loss (y, in kg) were recorded on 9 randomly-selected participants and the data
are summarised in the table below.
Participant A B C D E F G H I
x 6.0 9.7 8.0 11.4 8.7 5.7 10.3 7.3 12.4
y 2.0 3.7 2.7 3.7 2.9 2.6 3.5 2.7 3.8

The summary statistics for these data are:

Sum of the x data: 79.5          Sum of the squares of x data: 745.37
Sum of the y data: 27.6          Sum of the squares of y data: 87.82
Sum of the products of x and y data: 254.6

i. Draw a scatter diagram of these data. Carefully label the diagram. You can
draw the scatter diagram on ordinary paper, instead of graph paper, but take
care to ensure reasonable accuracy.
ii. Calculate the sample correlation coefficient. Interpret its value.
iii. Calculate and report the least squares line of y on x. Draw the line on the
scatter diagram.


iv. Based on the regression model above, what weight loss would you expect for
someone who followed this diet programme for 8 weeks? Would you trust
this value? Justify your answer.

(13 marks)

Reading for this question


This is a standard regression question and the reading is to be found in Chapter 12. Section
12.6 provides details for scatter diagrams and is suitable for part i. whereas the remaining
parts are on correlation and regression that are covered in Sections 12.8–12.10 of the subject
guide. Section 12.7 is also relevant. Sample examination question 2 of this chapter is also
recommended for practice on questions of this type.

Approaching the question

i. Candidates are reminded that they are asked to draw and label the scatter diagram
which should include a title (‘Scatter diagram’ alone will not suffice) and labelled axes
which give their units in addition. Far too many candidates threw away marks by
neglecting these points and consequently were only given one mark out of the possible
four allocated for this part of the question.
A reasonable scatter diagram is shown below:

Note that the following elements are essential to get full marks:
• Informative title.
• Axis labels. (Units in either title or axis labels.)
• Plot accuracy.


ii. The summary statistics can be substituted into the formula for the correlation (make
sure you know which one it is!) to obtain the value 0.9223. An interpretation of this
value is the following: The data suggest that the longer one follows the programme, the
higher the weight loss. The fact that the value is close to 1, suggests that this is a strong,
linear, positive relationship.
Many candidates did not mention all three words (strong, linear, positive). Note that all
of these words provide useful information on interpreting the relationship.
iii. The regression line can be written by the equation \( \hat{y} = a + bx \) or \( y = a + bx + \varepsilon \). The
formula for b is:
\[ b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2} \]
and by substituting the summary statistics we get b = 0.2505.
The formula for a is \( a = \bar{y} - b\bar{x} \), so we get a = 0.8542.
Hence the regression line can be written as:

\[ \hat{y} = 0.8542 + 0.2505x \quad \text{or} \quad y = 0.8542 + 0.2505x + \varepsilon. \]

It should also be plotted on the scatter diagram as shown above.


Many candidates incorrectly reported the regression line as y = 0.8542 + 0.2505x. This
expression is incorrect because it has neither the hat on y nor the error term; one of the
two forms above is required. Also, many candidates did not draw this line on the scatter
diagram; instead they drew an approximate line trying to go around the points but
without reference to the above equation. No marks were awarded in such cases.
iv. The expected weight loss is 0.8542 + 0.2505 × 8 ≈ 2.858 kg. This prediction is
reasonable, since x = 8 weeks is within the range of the data and the model seems quite
reasonable (due to the strong linear relationship).
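
Parts ii. to iv. use only the summary statistics, so they can be checked with the Python sketch below. The names are illustrative; the corrected sums of squares and products are written as Σxy − (Σx)(Σy)/n and so on, which is algebraically the same as the Σxy − n x̄ ȳ form used in the formula for b above.

```python
import math

n = 9
sum_x, sum_y = 79.5, 27.6
sum_xx, sum_yy, sum_xy = 745.37, 87.82, 254.6

# Corrected sums of squares and products.
s_xy = sum_xy - sum_x * sum_y / n
s_xx = sum_xx - sum_x**2 / n
s_yy = sum_yy - sum_y**2 / n

# ii. Sample correlation coefficient.
r = s_xy / math.sqrt(s_xx * s_yy)

# iii. Least squares slope and intercept for y regressed on x.
b = s_xy / s_xx
a = sum_y / n - b * sum_x / n

# iv. Predicted weight loss (kg) after 8 weeks on the programme.
y_hat_8 = a + b * 8

print(round(r, 4), round(b, 4), round(a, 4), round(y_hat_8, 3))
```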

(b) Two separate filling units, Unit 1 and Unit 2, are used to fill jars with coffee.
An experiment was then carried out, in which the quantities of coffee in some
randomly selected jars filled by each of these units, were measured. The
measurements are summarised in the table below.
Sample size Sample mean Sample standard deviation
Jar filling unit 1 37 22.15 2.00
Jar filling unit 2 36 23.15 2.10

i. Use an appropriate hypothesis test to determine whether there is a difference


between the mean quantities of coffee in jars from these two units. State
clearly the hypotheses, the test statistic and its distribution under the null
hypothesis, and carry out the test at two appropriate significance levels.
Comment on your findings.
ii. State clearly any assumptions you made in (b) part i.
iii. Repeat the procedure in (b) part i. to determine whether the mean quantity
of coffee in jars of Unit 1 is lower than that of Unit 2.
(12 marks)

Reading for this question


The first two parts of the question refer to a two-tailed hypothesis test comparing two
population means. While the entire chapter on hypothesis testing is relevant, one can focus
on Section 8.16. In terms of exercises see Example 8.7. The third part of the question refers
to one-tailed hypothesis tests.
Approaching the question
i. The working of the exercise is shown below, where µ1 denotes the mean quantity of
coffee in jars of Unit 1 and µ2 denotes the mean quantity of coffee in jars of Unit 2.


• H0 : µ1 = µ2 vs. H1 : µ1 ≠ µ2.
• Test statistic value: −2.084. (If equal variances are not assumed the test statistic
value is −2.082.) For reference the test statistic formula is:
\[ \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_p^2 (1/n_1 + 1/n_2)}} \quad \text{or} \quad \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}. \]

• For α = 0.05, critical values are ±1.96 (±2.00 if the t60 distribution is used).
• Decision: reject H0 since −2.084 < −1.96.
• Choose smaller α, say α = 0.01, hence the critical values become ±2.576 (or ±2.660),
hence do not reject H0 since −2.576 < −2.084.
• Moderate evidence of a difference between the mean quantities of coffee in jars from
the two units.
ii. The assumptions for i. were:
• about equal variances
• about whether n1 + n2 is ‘large’ so that the normality assumption is satisfied
• about independent samples.
Some candidates stated assumptions in this part that were not made in part i. Marks
were not awarded in such cases. Also, some other candidates just copied the phrase
‘assumption about equal variances’ and naturally were not awarded any marks. One
should state whether the calculations were based on the assumption that unknown
variances are equal or unequal.
iii. The working for the hypothesis test in this case is shown below.
• H0 : µ1 = µ2 vs. H1 : µ1 < µ2 .
• The correct z-values are: −1.645 for the 5% significance level and −2.326 for the 1%
significance level.
• Reject H0 for α = 0.05 since −2.084 < −1.645 but do not reject H0 for α = 0.01 since
−2.326 < −2.084, hence there is moderate evidence that the mean coffee quantity in
jars of Unit 1 is lower than that of Unit 2.
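
For reference, the arithmetic behind the value −2.084 can be laid out as follows. This is a worked sketch under the equal-variance assumption, using the usual pooled variance formula, which the commentary does not state explicitly:

\[ s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{36 \times 2.00^2 + 35 \times 2.10^2}{71} = \frac{298.35}{71} \approx 4.2021 \]

\[ \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_p^2 (1/n_1 + 1/n_2)}} = \frac{22.15 - 23.15}{\sqrt{4.2021 \times (1/37 + 1/36)}} \approx \frac{-1}{0.4799} \approx -2.084. \]

Without the equal-variance assumption the denominator becomes \( \sqrt{2.00^2/37 + 2.10^2/36} \approx 0.4802 \), giving approximately −2.082.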

Question 4

(a) The ministry of education is considering funding pre-school education. Before


making their recommendations, administrators take a random sample of 100
students from various areas to compare the performance of students in algebra
between those who attended pre-school and those who did not. The results are
summarised in the table below:
Below Grade Level At Grade Level Advanced
Pre-school 12 29 16
No pre-school 18 11 14
i. Based on the data in the table, and without conducting any significance test,
would you say there is an association between attending pre-school and
performance in algebra?
ii. Calculate the χ² statistic and use it to test for independence between
attending pre-school and performance in algebra at two appropriate
significance levels. What do you conclude?
(13 marks)

Reading for this question


This part targets Chapter 9 on contingency tables and chi-squared tests. Note that part i. of
the question does not require any calculations, just understanding and interpreting
contingency tables. Part ii. is a straightforward chi-squared test and the reading is also
given in Chapter 9. Look also at Example 9.5.


Approaching the question


i. An example of a ‘good’ answer is given below:
There are some differences in the performance on algebra between students that did and
did not attend pre-school. More specifically, 60% of those in below grade level did not
attend, whereas more than 50% of those who attained advanced level did attend
pre-school. Hence there seems to be an association between attending pre-school and
performance in algebra, although this needs to be investigated further.
ii. We test H0 : No association between attending pre-school and performance in algebra vs.
H1 : Association between attending pre-school and performance in algebra. Be careful to
get these the correct way round!
It is essential to calculate the expected values, which are shown below:
Below Grade Level At Grade Level Advanced
Pre-school 17.1 22.8 17.1
No pre-school 12.9 17.2 12.9
The test statistic formula is:
\[ \sum_{i,j} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}} \]
which gives a value of 7.623. This is a 2 × 3 contingency table so the degrees of freedom
are (2 − 1) × (3 − 1) = 2.
For α = 0.05 ⇒ the critical value is 5.991, hence reject H0 since 5.991 < 7.623.
For α = 0.01 ⇒ the critical value is 9.210, hence do not reject H0 since 7.623 < 9.210.
There is moderate evidence of an association between attending pre-school and
performance in algebra.
Many candidates looked up the tables incorrectly and so failed to follow through their
earlier accurate work.
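
As an illustration of how the value 7.623 arises, the six contributions can be written out term by term. Each expected frequency is (row total × column total) / n, for example 57 × 40/100 = 22.8 for pre-school students at grade level. The individual terms below are rounded to three decimal places:

\[ \chi^2 = \frac{(12 - 17.1)^2}{17.1} + \frac{(29 - 22.8)^2}{22.8} + \frac{(16 - 17.1)^2}{17.1} + \frac{(18 - 12.9)^2}{12.9} + \frac{(11 - 17.2)^2}{17.2} + \frac{(14 - 12.9)^2}{12.9} \]

\[ \approx 1.521 + 1.686 + 0.071 + 2.016 + 2.235 + 0.094 = 7.623. \]

Most of the statistic comes from the ‘Below Grade Level’ and ‘At Grade Level’ columns, which is consistent with the descriptive comparison in part i.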

(b) You have been asked to design a cluster random sample survey on public sector
workers to examine whether satisfaction on working conditions of employees
varies between different job types.
i. Discuss in no more than three sentences how you will choose your sampling
frame, while mentioning potential limitation(s) of your choice.
ii. Propose two relevant clusters. Justify your choices.
iii. Discuss in no more than two sentences, how selection bias could affect the
results in the context of this survey.
iv. Briefly discuss the statistical methodology you would use to analyse the
collected data.
(12 marks)

Reading for this question


This was a question on basic material on survey designs. Background reading is given in
Chapter 10 of the subject guide which, along with the recommended reading should be
looked at carefully. Cluster random sampling is covered in Section 10.7.2, whereas selection
bias in particular is covered in Section 10.8. Candidates were expected to have studied and
understood the main important constituents of design in random sampling.
Approaching the question
One of the main things to avoid in this part is writing essays without any structure. This
exercise asks for specific things and each one of them requires 1 or 2 lines at most. If you are
unsure of what these things are, do not write lengthy essays: they earn no marks and waste
valuable examination time. If you can identify what is being asked, keep in mind that the
answer should not be too long.
Note also that in some cases there is no unique answer to the question. Some suggested
answers are provided below.


i. Lists of companies may be accessible but they may have restrictions on access to
information on stratification factors like income level.
ii. Potential clusters include:
• area of a city
• company type.

iii. If the list of clusters contains only urban areas, certain job types (agriculture) could be
under-represented. If the list of clusters consists of companies of a specific type, the
demographics in these companies may differ from those in the population. (Other
reasonable suggestions were also accepted.)
iv. Marks were awarded for discussing the use of graphs, confidence intervals and hypothesis
tests.
