Chapter 2: Statistical Tests, Confidence Intervals and Comparative Studies
2.0 Introduction
The principal focus will be on sample means. However, the ideas underlying the
statistical tests and confidence intervals introduced in this chapter are quite
general: they underpin the rest of the material discussed in the course.
Statistical significance tests are widely used (and often abused!) in the research
literature across most disciplines where empirical data are assessed. The
concepts involved in all such tests are the same irrespective of the specific
purpose of the test. The ideas will be introduced here in the context of tests on
sample means. Our first example asks if a measurement system is unbiased,
i.e., does it get the right answer, on average, when some property of a material is
measured and the correct answer is known.
Prefatory remarks
Statistical significance tests are concerned with identifying when observed values
may be considered unusually large or small when compared to expected or
hypothesised values. Before engaging with these tests it may be worthwhile to
recall a similar situation encountered in Chapter 1.
There, a woman's height x was standardised against the population mean µ and standard deviation σ:

z = (x − µ)/σ = 2.5
which clearly points to her being unusually tall. The standardisation simply
expresses how far her height deviates from the population mean in terms of
standard deviation units.
Pendl et al. [1] reported a new gas chromatographic method for determining total
fat in foods and feeds. Several standard reference materials (SRMs) were
measured by two laboratories, laboratory (A), where the method was developed,
and a peer laboratory (B). One of these was a baby food powder with a certified
value of µo=27.1% fat. Each laboratory analysed the SRM on ten different days
and reported the results shown below.
% Fat Content
A: 26.75 26.73 26.40 26.68 26.90 27.11 27.18 27.21 27.20 27.27
B: 26.33 26.68 26.31 26.59 25.61 25.29 25.88 26.08 25.85 25.78

Table 2.1.1: Test results and summary statistics for the baby food study
For laboratory A, the average result is 26.9% fat. While this is lower than the
certified value of µo=27.1% the deviation might simply be a consequence of the
chance day-to-day variation that is evidently present in both measurement
systems. To assess whether or not the deviation, x̄ − µo, is large compared to
the likely chance variation in the average result, it is compared to the estimated
standard error of the sample mean, s/√n. The standard error is a measure of the
variability of x̄ values, just as the standard deviation, s, is a measure of the
variability in individual x values. The ratio:

t = (x̄ − µo) / (s/√n)          (2.1)

measures by how many (estimated) standard errors the sample mean differs
from the certified value.
Assume that the laboratory is unbiased, i.e., that the measurements are indeed
varying at random around the certified value of µo=27.1%. In such a case, if
many thousands of samples of size n were obtained under the same conditions
and x̄ and s calculated for each sample and then used to calculate the t-statistic
using equation (2.1), the histogram of the resulting t-values would approximate
the smooth curve shown in Figure 2.1.2. This curve, which is called the
‘sampling distribution’ curve and can be derived mathematically, is known as
Student’s t-distribution; the total area under the curve is 1. The concept of a
sampling distribution is important: it describes the distribution of values of the test
statistic (here the t-ratio, equation (2.1)) that would be obtained if very many
samples were obtained under the same conditions. Accordingly, it describes the
expected distribution of summary measures (such as x̄, s or t) and allows us to
identify unusual values.
The t-statistic calculated from the Laboratory A data (approximately −1.7) lies close to
the centre of the t-distribution, so we do not reject the
hypothesis that the analytical system in Laboratory A is unbiased1.
n       Degrees of freedom (n–1)    95% critical values (±tc)
3                2                          4.30
5                4                          2.78
10               9                          2.26
20              19                          2.09
30              29                          2.05
120            119                          1.98
Note that, as the sample size increases, the t-distribution becomes more and
more like a standard Normal distribution and the 95% cut-off values get closer
and closer to ±1.96. If the degrees of freedom are more than about 30 many
statisticians will use ±2 as the cut-off values.
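For readers who like to check such values in software, the short Python/SciPy sketch below (an illustration added here, not part of the original Minitab-based material) reproduces the two-sided 95% critical values in the table above.

# Two-sided 95% critical values of Student's t-distribution for several sample sizes.
# Illustrative Python/SciPy sketch.
from scipy import stats

for n in [3, 5, 10, 20, 30, 120]:
    t_crit = stats.t.ppf(0.975, df=n - 1)   # upper 2.5% point; the lower cut-off is its negative
    print(f"n = {n:3d}, df = {n - 1:3d}, critical values = ±{t_crit:.2f}")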
• Specify the hypothesis to be tested and the alternative that will be decided
upon if this is rejected
Ho: µ = 27.1%
H1: µ ≠ 27.1%
where µ is the long-run mean for a very large number (in principle, an
infinitely large number) of measurements on the reference material.
• Specify a statistic which measures departure from the null hypothesis µ=µo
(here µo=27.1). The test statistic

t = (x̄ − µo) / (s/√n)

measures by how many (estimated) standard errors x̄ deviates from µo, i.e., it
scales the deviation of x̄ from µo in terms of standard error units. The t-
distribution with n–1 degrees of freedom, see Figure 2.1.1, describes the
frequency distribution of t-values that might be expected to occur if, in fact,
the null hypothesis is true.
Since the t-distribution describes what t-values might be expected when the
null hypothesis is true, this implies that when we decide to reject the null
hypothesis if the test statistic is in the rejection region, we automatically run a
risk of α=0.05 of rejecting the null hypothesis even though it is correct. This
is a price we must pay when using the statistical significance testing
procedure.
2 The implications of the choice of significance level are discussed more fully in Chapter 4; see
Section 4.2.
Note that the entire procedure and the criterion for rejecting the null hypothesis
are defined before the calculations are carried out, and often before any
measurements are made. This ensures that the decision criterion is not
influenced by the data generated.
For Laboratory B the difference between the observed mean of 26.04 and the
certified value of 27.10 is said to be ‘statistically significant’. The difference
between the mean for Laboratory A and the certified value would be declared to
be ‘not statistically significant’. ‘Statistically significant’ simply means ‘unlikely to
be due to chance variation’ – it is evidence that a reproducible effect exists
(remember that the root meaning of the verb ‘to signify’ is ‘to signal or indicate’
something). Thus, ‘statistically significant’ does not (in itself) mean ‘important’.
Whether or not the observed difference is of any practical importance is an
entirely different question. The answer will be context specific and will require an
informed judgment on the part of someone with an understanding of the context.
An analysis of the Laboratory B data is shown in Table 2.1.3: this was generated
using Minitab, but much the same output would be obtained from any statistical
package.
One-Sample T: Fat-%
The t-value is –7.61 (note that the standard deviation is really 0.4406, hence the
rounding difference when compared to our value of –7.60, which is based on
0.441) and this has an associated p-value of 0.000, i.e., <0.0005. The p-value is
the area in the tails of the t-distribution (with 9 degrees of freedom), i.e., the sum
of the areas to the right of 7.61 and to the left of –7.61; there is, in fact, an area of
0.000016 on each side, but Minitab reports only three decimal places. The t-
distribution is the sampling distribution of the test statistic, when the null
hypothesis is true. Hence, the tail areas give us the probability of seeing a
more extreme t-value (in either direction) than the one obtained in the study, if
the null hypothesis were true. The very small p-value indicates that our observed
t-value is highly improbable under the hypothesis that the average around which
the measurement system is varying is indeed 27.1; this implies that the null
hypothesis µo=27.1 is implausible and should be rejected.
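To make the calculation concrete, the following Python sketch (an illustration added here; the original analysis was produced with Minitab) applies the same one-sample t-test to the Laboratory B data.

# One-sample t-test of Ho: mu = 27.1 for the Laboratory B fat results.
# Illustrative Python/SciPy sketch; Table 2.1.3 reports the same quantities from Minitab.
import numpy as np
from scipy import stats

lab_b = np.array([26.33, 26.68, 26.31, 26.59, 25.61,
                  25.29, 25.88, 26.08, 25.85, 25.78])
mu_0 = 27.1

t_stat, p_value = stats.ttest_1samp(lab_b, popmean=mu_0)
se = lab_b.std(ddof=1) / np.sqrt(len(lab_b))   # estimated standard error of the mean

print(f"mean = {lab_b.mean():.3f}, s = {lab_b.std(ddof=1):.4f}, SE = {se:.4f}")
print(f"t = {t_stat:.2f}, two-sided p = {p_value:.6f}")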
The use of the p-value as a basis for deciding whether or not to reject the null
hypothesis is entirely consistent with our earlier approach of choosing a
significance level, which in turn determines the critical values. If the p-value were
exactly 0.05 it would mean that the observed test statistic fell exactly on one of
the critical values. If the p-value is smaller than 0.05 then it means that the
observed test statistic is further out into the tails than one of the critical values –
and thus the null hypothesis should be rejected. For those who choose a
significance level of α=0.05, a p-value less than 0.05 means that the result is
statistically significant and the null hypothesis should be rejected. P-values
greater than 0.05 then lead us to accept (at least provisionally) the null
hypothesis. Thus, large p-values support the null hypothesis, while small p-
values lead us to reject it. Our choice of significance level indicates what we
consider large or small.
The use of p-values is, also, a simple and convenient way of reporting the results
of statistical tests, without requiring the availability of statistical tables. Thus, in
Chapter 1, Figure 1.3.2 shows a Normal plot of sample data used to explain the
logic underlying such plots. The Figure shows, also, the results of an Anderson-
Darling test of the hypothesis that the data come from a Normal distribution – it
reports a p-value of 0.971, which supports the null hypothesis. Similarly, the
Anderson-Darling test of Figure 1.3.6 gives a p-value of p <0.005, which strongly
rejects the assumption that the data (β-OHCS values) come from a Normal
distribution. We do not need to resort to tables to understand what the test result
indicates – indeed we do not even need to know how to calculate the test
statistic! Simply knowing how to interpret the p-value allows us to interpret the
result of the test reported by the computer analysis.
Exercise
2.1.1 A filling line has been in operation for some time and is considered stable. Whenever a
product changeover takes place it is standard operating procedure to weigh the contents of
10 containers, as a process control check on the target setting. The most recent
changeover was to a target value of µ=21g. Table 2.1.4 shows summary results for the ten
containers weighed at each of the two filling heads.

                       Head 1    Head 2
Standard Deviation      0.118     0.360
Model Validation
The statistical model that underlies the t-test requires that the data values are
independent of each other and that they behave as if they come from a single
Normal distribution. Since the material was measured on ten different days, the
independence assumption is likely to hold. If we knew the time order in which
the measurements were made then we could draw a time-series plot, to check
that there were no trends, which would call into question the assumption of a
single distribution. For example, a once-off shift or trend upwards or downwards
in the plot would suggest changes in the mean level of the measurement system.
A Normal probability plot assesses the Normality assumption, though with as few
as ten observations only strong departures from Normality would be likely to be
detected.
Figure 2.1.3 shows a Normal plot for the fat data: the data vary around a straight
line and the p-value for the Anderson-Darling test is large: there is no reason to
question the Normality of the data. Accordingly, we can have confidence in the
conclusions drawn from our statistical test.
Exercise
2.1.2 In Chapters 1 and 2 of Mullins, 2003 [2], several examples from a Public Analyst’s
laboratory of the analysis of sugars in orange juice were presented. Exercise 2.2 (page 49)
gave 50 values of the recoveries of glucose from control samples used in monitoring the
stability of the analytical system. The controls were either soft drinks or clarified orange
juice, spiked with glucose. The summary statistics for the 50 results are: mean=98.443 and
SD=2.451. Carry out a t-test, using a significance level of 0.05, of the hypothesis that the
analytical system is unbiased, i.e., that the long-run average recovery rate is 100%.
One-sided tests
The tests described above are known as two-sided tests and the null hypothesis
is rejected if either an unusually large or an unusually small value of the test
statistic is obtained (they are also referred to as two-tailed tests – the alternative
hypothesis is two-sided, leading to critical or rejection regions in two tails). In
special circumstances, we would want to reject the hypothesis only if the
observed average were large; for example, in a public health laboratory if we
were measuring the level of trace metals in drinking water or pesticide residues
in food to check if they exceeded regulatory limits. In other circumstances we
would reject the hypothesis if the observed average were small; for example, if
we were assaying a raw material for purity - our next example is based on the
latter case.
Test portions are sampled from each of five randomly selected drums from a
consignment containing a large number of drums of raw material; the test
portions are then assayed for purity. The contract specified an average purity of
at least 90%. If the average of the five results is x̄ = 87.9% and the standard
deviation is s=2.4%, should the material be accepted?
Ho: µ ≥ 90
H1: µ < 90
Here the hypotheses are directional. A result in the centre or in the right-hand
tail of the sampling distribution curve would support the null hypothesis while one
in the left-hand tail would support the alternative hypothesis. We specify the
significance level to be α=0.05. The t-statistic is

t = (x̄ − µo) / (s/√n) = (87.9 − 90) / (2.4/√5) ≈ −1.96

and the one-tail critical value, tc = −2.13, is shown in Figure 2.1.4.

Figure 2.1.4: One-tail critical region for the t-distribution with 4 df.

Since −1.96 does not lie in the critical region, we do not reject the null
hypothesis that the average purity is at least 90%. Given the large variability
involved in sampling and measurement error (s=2.4), a sample average value of
87.9, based on five measurements, is not sufficiently far below 90 to lead us to
conclude that the average batch purity is below specification. Such a deficit
could have arisen by chance in the sampling and measurement processes.
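The same calculation can be reproduced from the summary statistics alone; the Python sketch below is an added illustration (the example itself gives only n, the sample mean and the standard deviation).

# One-sided t-test from summary statistics: Ho: mu >= 90 vs H1: mu < 90.
# Illustrative Python/SciPy sketch using the purity example's summary values.
import math
from scipy import stats

n, xbar, s, mu_0 = 5, 87.9, 2.4, 90.0
t = (xbar - mu_0) / (s / math.sqrt(n))     # test statistic
t_crit = stats.t.ppf(0.05, df=n - 1)       # lower-tail critical value for alpha = 0.05
p_lower = stats.t.cdf(t, df=n - 1)         # lower-tail p-value

print(f"t = {t:.2f}, critical value = {t_crit:.2f}, p = {p_lower:.3f}")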
The question addressed by the significance test is whether the results
are varying at random around the lower specification bound of 90%. This might
not be the case for two reasons, viz., the material could be out of specification or
the analytical system could be biased. In carrying out the test as we did above,
we implicitly assumed that the analytical system is unbiased. In practice, there is
not much point in making measurements (or rather it is dangerous to do so!)
without first validating the measurement system – this applies equally to social
science measurements as it does to engineering or scientific measurements.
Note that if we had carried out a two-sided test using the same significance level
of α=0.05 the critical values would have been ±2.78, which means that values
between –2.13 and –2.78 would be statistically significant on a one-sided test but
‘non-significant’ on a two-sided test. In an academic context where α=0.05 is the
conventionally used significance level there can be a temptation to switch from a
two-sided to a one-sided test in order to get a ‘significant’ result – especially if
this could make the difference between the work being published or not. Many
statisticians would be strongly of the view that in scientific research a two-sided
test would be natural; see quotation from Altman [3] below3 (Altman is a medical
statistician) and the footnote on page 71.
There are several issues involved in this question. Suppose we obtain a test
statistic corresponding to a p-value of 0.08 on a two-sided test: this would be
considered ‘non-significant’, but if we switched to a one-sided test the corresponding
p-value would be 0.04 and the result would be declared ‘significant’.
“In rare cases it is reasonable to consider that a real difference can occur in only one direction, so
that an observed difference in the opposite direction must be due to chance. ….One sided tests
are rarely appropriate. Even when we have strong prior expectations, for example that a new
treatment cannot be worse than an old one, we cannot be sure that we are right. If we could be
sure we would not need to do an experiment! If it is felt that a one-sided test really is appropriate,
then this decision must be made before the data are analysed; it must not depend on what the
results were. The small number of one-sided tests that I have seen reported in published papers
have usually yielded p-values between 0.025 and 0.05, so that the result would have been non-
significant with a two-sided test. I doubt that most of these were pre-planned one-sided tests”.
In many situations data are collected in such a way that two measured values
have a special relationship to each other, and it is the difference between the two
values that is of interest. A group of students is assessed on a competency test,
given training, and then re-tested; interest then centres on the changes or
improvements in the scores. A patient is measured on some health indicator
(e.g., cholesterol level), put on a treatment, and then re-tested later; again, the
change in the test result (a reduction in cholesterol level) is the quantity of
interest. Twins have been extensively studied by psychologists to determine
differences, e.g., twins separated at birth and raised in different environments
have been compared to assess the contributions of ‘nature’ and ‘nurture’, i.e.,
genetic versus ‘environmental’ effects.
Industrial studies, also, are often designed in such a way that the resulting data
may be considered matched or paired. Example 4 below is based on a
comparison of two measurement systems, but the ideas are applicable to
comparisons of production methods also. In all such comparisons, we focus on
differences, thus reducing two sets of numbers to one: this brings us back to the
same methods of analysis (one-sample t-test and the corresponding confidence
interval, which will be discussed in Section 2.3) used for our validation study,
Example 1. Although the test is identical to the one-sample t-test, it is generally
referred to as a ‘paired t-test’, because the raw data come in the form of pairs.
Statistical Model
It is clear from Table 2.2.1 that there is considerable variation both between
animals (the mean RBP levels vary from 5697 to 9767) and within animals (the
within-animal differences (ipsi – contra) vary from –1051 to 2726). The statistical
task is to identify and measure a possible systematic ipsi-contra difference in the
presence of this chance variation.
A simple statistical model for the within-animal differences provides a basis for
our analysis. We will assume that a population of such animals would result in a
Normal curve for the ipsi-contra differences, with some mean µ and some
standard deviation σ, as shown in Figure 2.2.1. We assume also that the
differences for the sixteen animals are independent of each other.
If the long-run mean of this curve, µ, is zero, then there is no systematic ipsi-
contra difference in RBP levels – what we see is simply chance variation from
cow to cow. If, on the other hand, µ is non-zero, then there exists a systematic
ipsi-contra RBP level difference. Of course, on top of this possible systematic
difference there is an additional chance component which varies from cow to
cow.
In this section we will carry out a t-test on the differences to address the
statistical question “is µ equal to zero?”, following exactly the same procedure
used for our earlier analytical validation example (Example 1). Later (see
Section 2.3), we will address the issue of estimating the size of µ.
Model validation
Our statistical model assumes that our sample differences are independent of
each other and come from a single stable Normal distribution. Are these
assumptions valid?
One possibility that would invalidate our model would be a relationship between
the magnitude of the differences and the average RBP levels for the cows. If this
were the case it would suggest that we do not have a single underlying Normal
curve, but several curves with different standard deviations, which depend on the
average RBP levels. Figure 2.2.2 shows the differences plotted against the
means of the RBP measurements for the 16 cows.
Significance Test
To carry out the test we first specify the competing hypotheses as:
Ho : µ = 0
H1 : µ ≠ 0
and choose a significance level of α=0.05. For a two-sided test with degrees of
freedom n–1=15, we use critical values of tc=±2.13, as shown in Figure 2.2.4.
We now calculate the test statistic and compare it to the critical values.
The test-statistic5 is

t = (d̄ − 0) / (s/√n) = 3.65

where d̄ and s are the mean and standard deviation of the sixteen within-animal differences.
Since the test statistic t=3.65 greatly exceeds the critical value of 2.13, we reject
the null hypothesis: the long-run mean µ does not appear to be zero. Since the
sample difference is positive, we conclude that, on average, the ipsi side of the
uterus secretes higher concentrations of RBP than does the contra side.
Computer Analysis
The computer output shown in Table 2.2.2 gives summaries of the data. Note
that the line for “differences” gives the mean of the differences, which is
arithmetically the same as the difference between the means for ipsi and contra.
It also gives the standard deviation of differences. Note, however, that this
cannot be calculated directly from the standard deviations for ipsi and contra (a
covariance term would be required). The standard error is just the standard
deviation divided by the square root of the sample size, n=16.
5 Zero has been inserted explicitly in the formula to emphasise that what is being examined is the
distance from d̄ to the hypothesised mean of zero.
The p-value of 0.002 shown in the table is the probability of obtaining, by chance,
a more extreme test statistic than that calculated for our study (t=3.65), if the
null hypothesis were true. Figure 2.2.5 shows the sampling distribution of the
test statistic when the null hypothesis is true – this is Student’s t-distribution with
15 degrees of freedom. The areas to the right of 3.65 and to the left of –3.65 are
0.001, respectively. This means that the probability of obtaining a t-value further
from zero than these two values is 0.002, the p-value shown in the table.
Figure 2.2.5: Sampling distribution of the test statistic (t-distribution with 15 df) when the null hypothesis is true, with the tail areas beyond ±3.65 marked
The small p-value implies that the null hypothesis is implausible and leads us to
reject it. While simply inspecting Figure 2.2.2 strongly suggests that the
observed differences do not vary at random around zero (as required by the null
hypothesis) the t-test provides an objective measure (via the p-value) of how
unlikely our observed values would be if the null hypothesis were true.
Calculating tail areas (p-values) for t-distributions is, conceptually, the same as
calculating tail areas for the standard Normal curve, as was done in Chapter 1.
Our analysis of the RBP data leads us to conclude that the ipsi side of the uterus
secretes more RBP than does the contra side – this has scientific implications as
outlined in the introduction. In Section 2.3 we will return to this dataset and
obtain an estimate, together with error bounds, for the magnitude of this
difference in the population of cows.
Exercise
2.2.1 Rice [4] presents data from a study by Levine on the effect of cigarette smoking on platelet
aggregation. Blood was taken from 11 subjects before and after they smoked a cigarette.
The background to the study is that it is known that smokers suffer from disorders involving
blood clots more often than non-smokers, and platelets are involved in the formation of
blood clots. Table 2.2.3 gives a measure of platelet aggregation (where larger values
mean more aggregation; units are the percentages of platelets that aggregated) for each
subject before and after exposure to the stimulus.
Is there evidence that smoking even one cigarette affects the ability of platelets to
aggregate? Note: the mean of the differences is 10.27 and the standard deviation is 7.98.
Before   After   Difference
  25       27         2
  25       29         4
  27       37        10
  44       56        12
  30       46        16
  67       82        15
  53       57         4
  53       80        27
  52       61         9
  60       59        –1
  28       43        15

Table 2.2.3: Platelet aggregation data before and after smoking (percentages)
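As an added illustration (not part of the exercise itself), the following Python sketch carries out the paired analysis of Table 2.2.3; it is equivalent to a one-sample t-test on the column of differences.

# Paired t-test for the platelet aggregation data of Table 2.2.3.
# Illustrative Python/SciPy sketch.
import numpy as np
from scipy import stats

before = np.array([25, 25, 27, 44, 30, 67, 53, 53, 52, 60, 28])
after  = np.array([27, 29, 37, 56, 46, 82, 57, 80, 61, 59, 43])

diff = after - before
t_stat, p_value = stats.ttest_rel(after, before)   # same as stats.ttest_1samp(diff, 0)

print(f"mean difference = {diff.mean():.2f}, SD = {diff.std(ddof=1):.2f}")
print(f"t = {t_stat:.2f}, two-sided p = {p_value:.4f}")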
Batch    A      B      Batch    A      B      Batch    A      B      Batch    A      B
  1     41.7   40.6      11    40.9   41.4      21    34.1   36.2      31    38.0   39.6
  2     42.1   43.6      12    41.3   44.9      22    39.3   40.6      32    42.9   43.2
  3     37.0   39.0      13    40.5   41.9      23    37.3   40.9      33    37.3   38.7
  4     37.6   39.0      14    37.8   37.3      24    38.2   38.8      34    42.8   42.6
  5     35.2   38.3      15    39.6   39.6      25    37.7   39.8      35    40.2   38.2
  6     43.9   48.9      16    39.9   44.4      26    42.0   44.7      36    49.2   51.3
  7     39.0   40.5      17    39.2   41.2      27    36.3   39.4      37    40.3   43.0
  8     39.2   40.8      18    39.4   40.1      28    39.8   42.7      38    41.7   41.8
  9     37.8   40.0      19    40.2   41.6      29    45.8   44.3      39    40.7   41.0
 10     46.3   46.1      20    37.8   39.3      30    41.6   43.7      40    40.0   44.8
Before carrying out the formal statistical test we will, as before, investigate the
assumptions that underlie the test. Plotting the data in time order, in the form of
a control chart (see Chapter 1, Section 1.6 for a discussion of control charts)
does not show any time-related variation in the differences which might call the
assumption of data independence into question – the plotted points appear to
vary at random around the centre line (which is the average of the 40
observations). Figure 2.2.6 suggests we have a stable distribution of differences.
It also suggests that the differences do not vary about a long-run mean of zero.
We will, nevertheless, carry out a significance test of this hypothesis.
Figure 2.2.7 in which the laboratory differences are plotted against the average of
the two measurements for each batch shows no tendency for batches with higher
dissolution rates to have either greater or smaller variation in differences than
batches with lower dissolution rates. Any systematic relationship between the
variability of the differences and the batch means would suggest that the
assumption of a single stable distribution (constant mean and standard deviation)
of differences was false. This assumption appears to be reasonable based on
the evidence of Figures 2.2.6 and 2.2.7.
Figure 2.2.8 shows a Normal plot of the differences. The scatterplot of the 40
differences versus their corresponding Normal scores produces a fairly straight
line, as would be expected if the data came from a Normal distribution. The
Anderson-Darling test statistic has a corresponding p-value of 0.558 which
supports the assumption of Normality. Figures 2.2.6-2.2.8 taken together provide
reasonable assurance that the assumptions underlying the t-test (as shown
graphically in Figure 2.2.1) are valid in this case.
The summary statistics for the 40 batches are given below in Table 2.2.5
To carry out the test we first specify the competing hypotheses as:
Ho : µ = 0
H1 : µ ≠ 0
and choose a significance level of α=0.05. Here, µ is the relative bias, i.e., the
long-run mean difference between the results that would be produced by the two
laboratories if a very large number of batches (in principle, an infinite number)
were analysed by both laboratories. For a two-sided test with degrees of
freedom n–1=39, we use critical values of tc=±2.02. The test-statistic is

t = (d̄ − 0) / (s/√n) = 6.12

where d̄ and s are the mean and standard deviation of the 40 between-laboratory differences.
Since the test statistic t=6.12 is much greater than the critical value of 2.02, we
reject the null hypothesis: the long-run mean µ does not appear to be zero – we
conclude that there is a relative bias between the two laboratories. Laboratory B
gives higher results, on average. In the next section we will put error bounds
around the sample mean difference of 1.56 in order to have a more trustworthy
estimate of the long-run difference (the relative bias) between the two
laboratories.
2.2.2 Disagreements arose regarding the purity of material being supplied by one plant to a
sister plant in a multinational corporation [2]. The quality control (QC) laboratories at the
two plants routinely measured the purity of each batch of the material. The results (units
are percentage purity) from the six most recent batches as measured by each laboratory
are presented in Table 2.2.6 below. Carry out a paired t-test to determine if there is a
relative bias between the laboratories.
2.2.3. The QC manager who carried out the analysis of the laboratory comparison data in the
preceding example noticed that the difference between the results for the last batch was by
far the largest difference in the dataset. Since the result for this batch from laboratory 1
was the only value in the dataset which was less than 89.0, she wondered if it was correct
and whether the large difference for batch 6 was responsible for the strong statistical
significance of the difference between the laboratory means. Before returning to the
laboratory notebooks to investigate this result, she decided to exclude this batch from the
dataset and re-analyse the remaining data. The data with summaries are shown in Table
2.2.7. Carry out the calculations and draw the appropriate conclusions. Compare your
results to those obtained from the full dataset.
The statistical significance testing procedure discussed in Sections 2.1 and 2.2
is quite straightforward. It may be summarised in the following steps:
• Specify the null hypothesis to be tested and the alternative hypothesis that will
be decided upon if the null hypothesis is rejected.
• Specify a test statistic which measures departure from the null hypothesis, and
whose sampling distribution is known when the null hypothesis is true.
• Exceptional values of a test statistic are those that have only a small chance
of being observed. The small probability that defines ‘exceptional’ is called
the significance level of the test.
• Calculate the test statistic from the study data; if it is an exceptional value,
reject the null hypothesis, otherwise do not.
These steps underlie all the significance tests presented in this text (and
hundreds of other tests also). Anyone who understands this simple logical
procedure, and who assesses the statistical assumptions required for the
relevant test, should be able to apply effectively any statistical test without
knowledge of the underlying mathematical details (especially if validated
statistical software is available).
Statistical significance tests address the question “is the sample difference likely
to reflect an underlying long-run systematic difference?”. A statistically significant
test result means that the answer to this question is “yes”. A natural follow-up
question is “how big is the difference?”.
Confidence intervals provide the statistical answer to this question – they put
error bounds around the observed sample result in such a way as to provide not
just a ‘point estimate’ (the observed sample value) but an interval estimate of the
long-run value. Providing a measure of the size of the systematic quantity being
studied is better, clearly, than simply saying that we can, or cannot, rule out a
particular long-run value.
Consider again the Laboratory B baby food data: the quantity of interest is the long-run
mean result that the laboratory would produce for this material. Our sample mean of
26.04 is an estimate of that value. It is clear, though, that if
another ten measurements were made, the result almost certainly would be other
than 26.04, since our sample average is affected by chance measurement error.
It makes sense, therefore, to attach error bounds to our sample average to reflect
this uncertainty. The statistical approach to obtaining error bounds is to calculate
a confidence interval.
A confidence interval simply re-arranges the elements used in carrying out the
statistical test to produce error bounds around the sample mean. Thus, a 95%
confidence interval for the long-run mean result that would be obtained if very
many measurements were made on the baby food in Laboratory B is given by:
x̄ ± tc (s/√n)

26.04 ± 2.26 (0.441/√10)

26.04 ± 0.32.
We estimate that the long-run mean is somewhere between 25.72 and 26.36
units (percentage fat). Note that this interval does not include the certified value
of 27.1. It will always be the case that where a confidence interval does not
contain the null hypothesis value for the corresponding t-test, the test result will
be statistically significant, i.e., the null hypothesis will be rejected. Conversely, if
the test fails to reject the null hypothesis, then the hypothesised mean will be
included in the confidence interval. Thus, the confidence interval is the set of
possible long-run values which would not be rejected by a t-test.
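The interval can be reproduced directly in software; the Python sketch below is an added illustration of the calculation just described.

# 95% confidence interval for the long-run mean of the Laboratory B results.
# Illustrative Python/SciPy sketch mirroring the hand calculation above.
import numpy as np
from scipy import stats

lab_b = np.array([26.33, 26.68, 26.31, 26.59, 25.61,
                  25.29, 25.88, 26.08, 25.85, 25.78])
n = len(lab_b)
xbar, s = lab_b.mean(), lab_b.std(ddof=1)
t_c = stats.t.ppf(0.975, df=n - 1)          # 2.26 for 9 degrees of freedom
half_width = t_c * s / np.sqrt(n)

print(f"{xbar:.2f} ± {half_width:.2f}  ->  ({xbar - half_width:.2f}, {xbar + half_width:.2f})")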
The logic behind this interval is most easily seen by focusing on the Normal
curve. The t-distribution coincides with the Normal curve when σ is known and it
is simpler to focus on the Normal, which we have studied extensively in Chapter
1. We have seen that if a quantity X is Normally distributed, then 95% of the X
values lie in the interval µ±1.96σ, as shown in Figure 2.3.1(a). Similarly, as
shown in Figure 2.3.1(b), 95% of means based on samples of size n=4 sampled
from this distribution will lie in the interval µ±1.96σ/√n. Thus, if we take a single
sample of size n and calculate x̄, there is a probability of 0.95 that it will be
within a distance of 1.96σ/√n of µ.
So what status does our interval 25.72 to 26.36 units have? From Figure
2.3.1(b) we deduce that if we calculate intervals in this way, then 95% of the
intervals so calculated will cover the unknown long-run mean. This gives us
confidence in the particular interval we have just calculated, and we say we are
95% confident that the interval covers µ. Note that the 95% confidence is a
property of the method used in calculating the interval, rather than a property of
the pair of numbers we calculate.
By changing the multiplier 1.96, we can get different confidence levels; thus,
using 2.58 gives 99% confidence, as this multiplier corresponds to the cut-off
values that enclose the central 99% of the standard Normal distribution.
A simulation exercise
The properties of this method of attaching error bounds to sample means, based
as it is on the idea of repeated sampling, can be illustrated usefully by computer
simulation of repeated sampling. The simulation described below is designed to
help the reader to understand the implications of the confidence bounds and of
the associated confidence level.
Fifty samples of size n=4 were generated randomly, as if from a process with
known mean, µ, and standard deviation, σ. From each sample of four test results
the sample mean, x̄, was calculated, together with the corresponding interval6:

x̄ ± 1.96 σ/√n

The intervals are shown as horizontal lines in Figure 2.3.2.
The centre of each horizontal line is the observed mean, x̄; its endpoints are the
bounds given by adding or subtracting 1.96σ/√n from x̄. The vertical line
represents the known mean, µ. The flat curve represents the frequency
distribution of individual values; the narrow curve the sampling distribution of
averages of 4 values. Only two of the fifty intervals do not cover the true mean,
µ. Some intervals have the true mean almost in the centre of the interval (i.e., x̄
is very close to µ), while others only just cover the true mean, because x̄ is
relatively far from µ. Theory suggests that 5% of all such intervals would fail to
cover the true mean; the results of the simulation exercise are consistent with
this.
6 Note the use of a small x to identify an observed mean, where a capital X was used to label the
‘random variable’; this difference in notation is a refinement we will not worry about in general in
the text. Note, also, that where σ is known the t value will be the same as the Standard Normal
value ±1.96. This was used here to simplify the drawing of the figure: it meant all the bars were
the same length. If σ is not known (as is usually the case) then the sample estimates, s, will vary
randomly from sample to sample and the widths of the intervals will vary accordingly.
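A small simulation of this kind is easy to run; the Python sketch below is an added illustration, and the particular values of µ, σ and the random seed are arbitrary choices made purely for the demonstration.

# Coverage simulation for 95% confidence intervals with sigma known.
# Illustrative Python sketch; mu, sigma and the seed are arbitrary choices.
import numpy as np

rng = np.random.default_rng(1)              # arbitrary seed, for reproducibility
mu, sigma, n, n_samples = 50.0, 4.0, 4, 50

covered = 0
for _ in range(n_samples):
    sample = rng.normal(mu, sigma, size=n)
    half_width = 1.96 * sigma / np.sqrt(n)  # sigma treated as known, as in Figure 2.3.2
    if abs(sample.mean() - mu) <= half_width:
        covered += 1

print(f"{covered} of {n_samples} intervals covered the true mean")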
Exercises
2.3.1. For the baby food data of Table 2.1.1 (page 3) calculate a 95% confidence interval for the
long-run average around which the results in Laboratory A were varying. Since the t-test
did not reject the null hypothesis that the long-run mean was 27.1, we would expect this
interval to contain 27.1. Does it?
2.3.2. For the fill head data of Table 2.1.4 calculate 95% confidence intervals for the long-run
average fill levels for the two heads. Are your intervals consistent with the results of
Exercise 2.1.1 (pages 9, 10) where t-tests of the hypothesis that the target value of µo=21g
was being achieved were carried out?
Our earlier analysis showed that the ipsi side of the uterus secretes more RBP
than does the contra side. In order to measure the size of this difference we
calculate a 95% confidence interval for the long-run mean difference (i.e., the
value that would be obtained in a population of cows). This merely requires us to
apply the same formula as above, but our calculations are now based on a
sample of differences, rather than on individual raw measurements.
d̄ ± tc (s/√n)

851 ± 2.13 (s/√16)

851 ± 497

Thus, we are 95% confident that the long-run mean ipsi−contra difference in RBP levels
lies somewhere between about 354 and 1348 units.
Exercise
2.3.3 Return to Exercise 2.2.1 (Platelet aggregation study of cigarette smoking; page 21) and
calculate a 95% confidence interval for the long-run mean shift in the measure of platelet
aggregation after smoking one cigarette.
Table 2.3.1: Paired t-test and confidence interval for the laboratory data
Exercise
2.3.4 Return to Exercises 2.2.2 and 2.2.3 (page 26) and calculate 95% confidence intervals for
the relative bias between the two laboratories, based on the two sample sizes (n=6 for
Exercise 2.2.2 and n=5 for exercise 2.2.3). Comment on their relative widths. What is the
relation between the widths of these intervals and the sizes of the t-values calculated for
the significance tests in the earlier exercises?
Introduction
Two-sample data may arise from an observational study; for example, we might
want to compare the average junior certificate points achieved by two randomly
selected samples of boys and girls. The study design may, alternatively, be
experimental; for example, a group of students might be randomly split into two,
one group might be asked to learn a body of material using a textbook as a
learning tool, while the second group uses an interactive teaching package: the
resulting scores on a common test will then be compared. In both cases, we will
be concerned with differences that might arise in suitably defined populations
rather than with the observed differences between the groups of subjects
selected for study.
The analysis of the data resulting from such studies typically involves formal
comparison of the means of the two sets of results, using statistical significance
tests and confidence intervals; these will be discussed first. These inferential
methods pre-suppose a particular underlying statistical model. Graphical
methods for validating that model are then discussed, as is a formal test to
compare standard deviations. Study design questions, such as randomisation of
the study sequence for experimental studies and use of control groups, will be
discussed in Chapter 4. A key question that arises before the study is
undertaken is “what sample size is required?”. This, also, will be addressed in
Chapter 4.
Speed-B Speed-A
76.2 73.5
81.3 77.0
77.0 74.8
79.9 72.7
76.4 75.4
76.2 77.1
77.6 76.1
80.5 74.4
81.5 78.1
77.3 76.5
78.2 75.0
It is clear from the dotplots and the summary statistics that in this study Speed-B
gave a higher average percentage yield. Several questions arise:
• given that there is quite a lot of chance variability present in the data
(thus, the yields for Speed-A vary from 72.7 to 78.1%, while those for
Speed-B range from 76.2 to 81.5%, a range of approximately 5 units in
each case), could the difference of 2.9% between the two averages be
a consequence of the chance variation?
• if the observed difference does point to a systematic effect, how large is the
long-run difference in average yield between the two speeds likely to be?
• is the chance (run-to-run) variability the same for the two speeds?
The first of these questions will be addressed by carrying out a t-test of the
difference between the means. The second by obtaining a confidence interval
for the long-run mean difference between results produced by the two process
configurations, and the third by using a different significance test (the F-test) for
comparing the standard deviations. Initially, the analyses will assume that the
chance variability, as measured by the standard deviation, is the same for the
two spheroniser speeds.
Statistical Model
A simple statistical model for the data is that all the observations may be
regarded as being independently generated from Normal distributions with
standard deviation σ, common to both process configurations. The values
generated using Speed-A have a long-run mean µ1, while those from Speed-B
have a long-run mean µ2.
The underlying model is illustrated schematically by Figure 2.4.2. The two long-
run means, µ1 and µ2, could be the same or different, but (in the simplest case)
the standard deviations are assumed to be the same. Because the standard
deviations are assumed the same, the shapes of the two Normal curves are
identical. These curves describe the properties of the populations from which the
two samples are drawn. The question as to whether there is a long-run
difference between the yields may then be posed in terms of the difference (if
any) between µ1 and µ2. The question is addressed directly, using a statistical
significance test.
Figure 2.4.2: Two Normal curves: (possibly) different means but the same standard deviation, σ
Because the standard deviations are assumed equal, the two sample standard
deviations are pooled (averaged on the variance scale) to give a single estimate, s, of
the common standard deviation σ. This estimate has 20 degrees of freedom, since
each of the sample standard deviations has 11–1=10 degrees of freedom.
Ho : µ2 − µ1 = 0
H1 : µ2 − µ1 ≠ 0
7 Note that the variances add, even though the means are subtracted. If this seems odd, ask
yourself if you would expect the combination of two uncertain quantities to be more or less
uncertain than the two quantities being combined. Refer back to Chapter 1, Section 1.4 where
we discussed combining random quantities.
t = (x̄2 − x̄1) / (s√(1/n1 + 1/n2))
The test statistic can fall into the tails for one of two reasons: just by chance
when there is no long-run difference, or because the two long-run means are not
the same. In using statistical significance tests, the latter is always assumed to
be the reason for an exceptional test-statistic. Accordingly, an unusually large or
small test-statistic will result in the null hypothesis being rejected.
The calculated value is t = 3.62 and, as this falls outside the critical values (−2.09, +2.09), Ho is rejected.
The two sample means are said to be statistically significantly different. Speed-B
is considered to give a higher yield, on average.
If you compare the steps involved in carrying out the two-sample t-test described
above with those previously described for a single-sample (or paired) t-test you
will see that the procedures are virtually identical. The differences are purely
technical: they simply allow for the fact that in the two-sample case there are two
sample means each subject to chance variation, whereas there is only one for
the one-sample or paired t-test.
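For completeness, the Python sketch below (an added illustration) runs the pooled two-sample t-test on the pellet study data; it reproduces the quantities discussed above.

# Pooled two-sample t-test comparing percentage yields for the two spheroniser speeds.
# Illustrative Python/SciPy sketch.
import numpy as np
from scipy import stats

speed_b = np.array([76.2, 81.3, 77.0, 79.9, 76.4, 76.2, 77.6, 80.5, 81.5, 77.3, 78.2])
speed_a = np.array([73.5, 77.0, 74.8, 72.7, 75.4, 77.1, 76.1, 74.4, 78.1, 76.5, 75.0])

t_stat, p_value = stats.ttest_ind(speed_b, speed_a, equal_var=True)  # pooled standard deviation
print(f"mean difference = {speed_b.mean() - speed_a.mean():.2f}")
print(f"t = {t_stat:.2f}, two-sided p = {p_value:.4f}")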
Exercises
2.4.1. Samuels [5] reports a study of a pain-killing drug for treating uterine cramping pain after
childbirth (she describes the data as fictitious, but realistic, which presumably means the
numbers are based on experience of similar studies). Fifty women were randomly
assigned into two groups of 25, one of which received the drug and the other a placebo. A
pain-relief score, based on hourly interviews throughout the day, which varied from 0 (no
relief) to 56 (complete relief for 8 hours), was assigned to each study participant. The data
summaries are shown in Table 2.4.3.
Does the study suggest that the drug works? Carry out a test to determine if there is a
statistically significant difference between the sample means.
2.4.2 Samuels [5] presents data from an experiment designed to see if wounding a tomato plant
would induce changes that improve its defence against insect attack. Larvae of the
tobacco hornworm (Manduca sexta) were grown on 17 wounded and 17 control plants8.
Summary statistics for the weights of the larvae (mg) after 7 days of growth are shown in
Table 2.4.4. Analyse the study results.
8 Note: the actual numbers were 16 and 18, but to avoid the complication of taking a weighted
average of the standard deviations, I have made the sample sizes the same – it does not affect
the results in any substantive way.
For the pellet process development study, the difference between the two sample
means, x̄2 − x̄1, is a point estimate of the difference, µ2 − µ1, between the long-run
means for the two process speeds. This observed difference is obviously subject
to chance variation, so we might ask what it tells us about the long-run
difference. Following the previously discussed one-sample case, where long-run
means were estimated from sample means (in Example 1 we estimated the long-
run average value around which the laboratory was varying in measuring the
percentage fat in baby food and in Example 4 we estimated the long-run mean
difference between two laboratories), a natural approach to answering this
question is to calculate a confidence interval for µ2 − µ1.
In the one-sample case, a 95% confidence interval for a single mean, µ, was
given by:

x̄ ± tc (s/√n)

or equivalently:

x̄ ± tc √(s²/n)

The corresponding 95% confidence interval for the difference between the two
long-run means, µ2 − µ1, is:

(x̄2 − x̄1) ± tc √(s²/n1 + s²/n2)

or equivalently:

(x̄2 − x̄1) ± tc s√(1/n1 + 1/n2)

where s is the pooled estimate of the common standard deviation. For the pellet
study this gives:

(78.37 − 75.51) ± 2.09 × 1.86√(1/11 + 1/11)

2.86 ± 1.65
Although the study showed an average difference of 2.86 units, the confidence
interval estimates that the difference in long-run yields is somewhere between
1.2 and 4.5 percentage points, with 95% confidence.
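The interval can also be computed directly from the data; the Python sketch below is an added illustration of the pooled-variance calculation.

# 95% confidence interval for the long-run difference in mean yields (pooled variance).
# Illustrative Python/SciPy sketch for the pellet study data.
import numpy as np
from scipy import stats

speed_b = np.array([76.2, 81.3, 77.0, 79.9, 76.4, 76.2, 77.6, 80.5, 81.5, 77.3, 78.2])
speed_a = np.array([73.5, 77.0, 74.8, 72.7, 75.4, 77.1, 76.1, 74.4, 78.1, 76.5, 75.0])

n1, n2 = len(speed_b), len(speed_a)
s_pooled = np.sqrt(((n1 - 1) * speed_b.var(ddof=1) +
                    (n2 - 1) * speed_a.var(ddof=1)) / (n1 + n2 - 2))
diff = speed_b.mean() - speed_a.mean()
t_c = stats.t.ppf(0.975, df=n1 + n2 - 2)            # 2.09 for 20 degrees of freedom
half_width = t_c * s_pooled * np.sqrt(1/n1 + 1/n2)

print(f"{diff:.2f} ± {half_width:.2f}  ->  ({diff - half_width:.2f}, {diff + half_width:.2f})")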
Suppose that the confidence interval had turned out as 2.86±5.00, i.e., ranging
from –2.14 to +7.86. How would this result be interpreted?
Such a result could mean that Speed-B gives a long-run yield that is greater by
as much as 7.86 units; it could, alternatively, mean that Speed-A gives results
higher by 2.14 units, on average. In other words, the data cannot tell
unambiguously which speed gives higher results, on average. In such a case the
sample means are said to be not statistically significantly different from each
other, i.e. the observed difference could have resulted from chance variation.
The relationship between confidence intervals and significance tests is the same
for two-sample studies as it is for single means (discussed earlier). If the
confidence interval covers zero the null hypothesis that (µ2 − µ1)=0 will not be
rejected by the significance test. If the interval does not contain zero then the
null hypothesis will be rejected and the two sample means will be declared to be
‘statistically significantly different’.
Exercises
2.4.3. Refer to Exercises 2.4.1 and 2.4.2 and calculate 95% confidence intervals for the long-run
mean difference between the two treatments in each case. Interpret your intervals.
The analysis shown in Table 2.4.5 was specified under the assumption of equal
long-run standard deviations for both methods.
The numerical results are the same as those presented above for the test
statistic and confidence interval (apart from some small rounding differences),
but, in addition, the output gives us a p-value associated with the test statistic.
Interpretation of p-value
Both tails are taken into account since a t-value of –3.62 would be considered as
indicating a statistically significant difference, also, i.e., we have a two-sided
alternative hypothesis. The test is often called 'two-tailed' for this reason - one-
sided hypotheses with corresponding one-tailed critical or rejection regions were
encountered for acceptance sampling decisions, earlier.
If a significance level of 0.05 is chosen for the significance test, then a p-value
less than 0.05 is taken as indicating that the observed difference between the
means is statistically significant. If the p-value is greater than 0.05 then the result
is not statistically significant, i.e., the observed difference is considered
consistent with only chance variation away from zero. The advantage of quoting
p-values is that it is immediately obvious how extreme the t-value is, i.e., how
unlikely such a value is under the hypothesis of no long-run difference. It is
worthwhile pointing out again that ‘significant’ means ‘likely to be due to other
than chance variation’ – it may or may not be the case that the observed
difference is ‘important’ – this would depend on considerations that have nothing
to do with statistics and everything to do with the study domain.
As discussed earlier, and illustrated by Figure 2.4.2, the model underlying our
statistical analysis assumes equal long-run standard deviations and data
Normality within the two populations from which the data are considered to be
randomly selected; it also assumes independence. The validity of the
independence assumption follows from the randomisation of the order in which
the runs were carried out.
It is obvious from Figure 2.4.1 that the variability is about the same for the two
samples (suggesting that the constant long-run standard deviation assumption
holds). Sometimes, it is helpful to subtract the respective group means from the
sample data before plotting, as shown in Figure 2.4.5; subtracting the group
means results in both samples varying around zero and this can make it (a little)
easier to compare the scatter, as the data will now be lined up opposite each
other. When the means are subtracted, the resulting deviations are called
‘residuals’, as they are what remain after the systematic components in the
responses (the group means here, but we will see other possibilities, later) are
removed.
To assess whether the pellet process development data are Normal, separate
Normal plots might be drawn for the eleven results from each process speed.
However, the sample size would be small in both cases. If the long-run standard
deviations can be assumed equal for the two groups (as we have seen, this
appears a reasonable assumption here), the two sets of residuals may be
combined into one, since they have the same mean of zero and a common, but
unknown, standard deviation. We can then draw a Normal plot to determine if
the residual variation is consistent with a single Normal distribution.
Figure 2.4.6 shows a Normal plot of these residuals. The scatter of the plotted
points is close to a straight line, as would be expected if they come from a
Normal distribution. The p-value indicates that 22 observations selected
randomly from a truly Normal distribution would have a probability of p=0.324 of
showing stronger departure from a straight-line relationship than that observed in
this study. Accordingly, there is no reason to reject the hypothesis that the data
come from a Normal distribution.
In carrying out the t-test and in calculating the confidence interval it was assumed
that the long-run standard deviations were the same for the two spheroniser
speeds. Clearly, the sample standard deviations are close in this case, as can
be seen both from Figure 2.4.5 and the summary statistics of Table 2.4.2, but a
formal statistical test of the equality of the long-run values may be desired in
other cases. Such a test (the F-test9) will be described below. However, the
value of this widely used test is open to question on two grounds. Moore [6]
argues against it on the grounds that it is highly sensitive to the assumption of
data Normality. It is also known not to be powerful (see e.g., Mullins [2], pp. 166-
168): this means that large sample sizes are required in order to be reasonably
confident of detecting even moderately large differences between the standard
deviations of the populations being compared. For the modest sample sizes
often encountered in research studies the test will not be powerful. This means
9 The test is named F for Fisher, in honour of Sir Ronald Fisher, who made major contributions to
the theory and practice of statistics (and genetics) in the first half of the 20th century.
The test is commonly described as a test for equal variances, but since the square of the
standard deviation is the variance, the two descriptions are equivalent. Standard deviations are
measured in the same units as the original measurements; for this reason they appear to me to
be a more natural way to describe the variability of the data.
To carry out the test, the null hypothesis of equal standard deviations is
specified; the alternative hypothesis denies their equality:
Ho : σ1 = σ2
H1 : σ1 ≠ σ2
The test statistic is the ratio of the standard deviations squared, or equivalently,
the ratio of the sample variances:

F = s1²/s2²

If the two long-run standard deviations are equal, i.e., σ1=σ2, this test statistic
follows an F-distribution with (n1–1, n2–1) = (10, 10) degrees of freedom. Here
F = 4.219/2.673 = 1.58.
Since the calculated F statistic lies in the body of the distribution the test provides
no basis for rejecting the assumption of equal long-run standard deviations, i.e.,
the variability appears to be about the same for the two process speeds.
Note that the F-distribution is not symmetrical and, therefore, for a two-sided test
critical values are required which are different in magnitude for the two tails.
Statistical tables usually give the right-tail critical values only.10 To avoid having
to calculate the left-hand critical value, it is conventional to use the larger sample
standard deviation as the numerator of the F-ratio. If this is done, then the test
asks if the sample ratio is too large to be consistent with random fluctuation away
from F=1.0, which is what the result would be if the long-run standard deviations
were known and equal. In doing this, it is important that a table which gives the
critical value for 0.025 in the right-tail should be consulted, even though a
significance level of α=0.05 is chosen for the test, since a left-tail critical value is,
in principle, also applied; the left-tail critical value is not explicitly specified
because we have arranged for the sample ratio to be bigger than one and,
consequently, what we wish to determine is if it is unusually large.
Sample    N    StDev    Variance
  1       11   2.054     4.219
  2       11   1.635     2.673

Test
Method            DF1   DF2   Statistic   P-Value
F Test (normal)    10    10     1.58        0.483
10 The left-tail value F0.025,a,b (the value that leaves 0.025 in the left-hand tail where the degrees
of freedom are 'a' for the numerator and 'b' for the denominator) is given by the reciprocal of
F0.975,b,a (the value that leaves 0.025 in the right-tail; note the reversal of the degrees of freedom).
Thus, for example, Table ST-3 gives F0.975,3,6 = 6.6, so F0.025,6,3 = 1/6.6 = 0.15. The p-value
shown in the table is twice the area to the right of F=1.578, since the area to the left of 1/1.578 is
equal to that to the right of 1.578 for an F10,10 distribution.
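The F-test output above can be reproduced as follows; the Python sketch is an added illustration, with the larger sample variance placed in the numerator as recommended in the text.

# F-test comparing the sample variances for the two spheroniser speeds.
# Illustrative Python/SciPy sketch.
import numpy as np
from scipy import stats

speed_b = np.array([76.2, 81.3, 77.0, 79.9, 76.4, 76.2, 77.6, 80.5, 81.5, 77.3, 78.2])
speed_a = np.array([73.5, 77.0, 74.8, 72.7, 75.4, 77.1, 76.1, 74.4, 78.1, 76.5, 75.0])

var_b, var_a = speed_b.var(ddof=1), speed_a.var(ddof=1)
f_ratio = max(var_b, var_a) / min(var_b, var_a)      # larger variance in the numerator
df = len(speed_b) - 1                                # 10 and 10 degrees of freedom
p_value = 2 * stats.f.sf(f_ratio, df, df)            # two-sided p-value

print(f"F = {f_ratio:.2f}, p = {p_value:.3f}")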
Exercise
2.4.4. Carry out F-tests to check the equality of the long-run standard deviations for the data in
Exercises 2.4.1 and 2.4.2
2.4.5. Pendl et al. [1] describe a fast, easy and reliable gas chromatographic method for
determining total fat in foods and animal feeds. The data below (% Fat) represent the
results of replicate measurements on a margarine, which were made on ten different days
by two laboratories A, B. Verify that an F-test, with a significance level of 0.05, will reject
the hypothesis of equal precision (equal standard deviations for repeated measurements)
in the two laboratories.
A: 79.63 79.64 78.86 78.63 78.92 79.19 79.66 79.37 79.42 79.60 SA=0.374
B: 74.96 74.81 76.91 78.41 77.95 79.17 79.82 79.31 77.65 78.36 SB=1.72
In some cases the long-run or population standard deviations will not be equal,
σ1 ≠ σ2. This is more likely to arise in observational studies, in which pre-existing
groups (rather than randomly allocated experimental groups) are compared.
Where the group sizes are reasonably large (say greater than 30 in each case),
the inequality of the standard deviations is unlikely to cause a problem. The
standard error of the difference between the sample means can be estimated
using the formula:

√(s1²/n1 + s2²/n2)
For small sample sizes, a modified version of the t-test is available – the test
statistic is calculated using the standard error formula given above, but the
degrees of freedom are obtained using a rather complicated formula (this is given
in Mullins [2] or Moore and McCabe [7]); statistical packages will automatically
take care of the messy calculations. For small sample sizes it is virtually
impossible to compare, usefully, two sample standard deviations – the power of
the test will be very low (see preceding discussion). Consequently, as indicated
above, it will very often be the case that professional judgement (of the study
domain, not statistical) will be required to decide whether or not the assumption
of equal population standard deviations is, or is not, appropriate. See Chapter 7
of Moore and McCabe [7] for further discussion.
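As an added illustration of this modified (unequal-variance) procedure, the Python sketch below applies it to summary statistics; the sample sizes, means and standard deviations used here are invented purely for the demonstration.

# Welch's (unequal-variance) two-sample t-test computed from summary statistics.
# Illustrative Python/SciPy sketch; the numbers below are made up for the demonstration.
from scipy import stats

n1, mean1, sd1 = 12, 10.3, 2.1
n2, mean2, sd2 = 15,  8.9, 3.4

result = stats.ttest_ind_from_stats(mean1, sd1, n1, mean2, sd2, n2,
                                    equal_var=False)   # Welch's test: unequal variances
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3f}")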
2.5 Review
The main purpose of this chapter was to introduce the ideas of significance
testing, confidence intervals and comparative studies. These ideas will now be
reviewed in a final example.
11 I am grateful to Dr. Jane Sanders, a former student in the TCD Postgraduate Diploma in
Statistics, for permission to use her data.
                              P1     P2     Individual Differences
ADHD children      Mean      485    507            22
                   St. Dev.  112    120            73
Control children   Mean      502    493            –9
                   St. Dev.  113    106            61

Table 2.5.1: Summary statistics for the SART response times (ms)
It is of interest to know whether or not differences exist between the two groups
of children. We might guess that for the first part of the test any difference might
be small or even zero, but that as the test progresses, requiring more sustained
attention, the difference, if any exists, might grow. This would arise if the
performance within one or both groups changed over the test duration. Our
analysis will attempt to throw light on these questions.
Data Collection
The structure of the SART dataset is a little more complex than the comparative
studies discussed earlier in the chapter. Our first example of a comparative
study involved one group being measured twice, with the objective of detecting a
shift in the long-run mean using a paired t-test (for example, RBP measurements
on two sides of the uterus for a sample of cows, or batches of tablets being
measured in two laboratories). Subsequently, we had two independent groups of
results (e.g., process yields for 11 runs of a spheroniser under two different
speeds); here the objective was to make inferences about possible differences
between the long-run mean yields, using a two-sample t-test. The SART dataset
combines both types of comparison; it involves two independent groups of 65
subjects each measured twice. In the psychology literature this would be
referred to as a ‘repeated measures design’; obviously, if there is only one group,
the subjects of which are each measured twice, the repeated measures design
reduces to the paired design discussed in Sections 2.2 and 2.3.
Graphical Analysis
Figure 2.5.1 shows dotplots of the raw data on which Table 2.5.1 is based.
Figure 2.5.1: Dotplots for the response times (ms) of the two groups on the two parts of SART
Two aspects of the figure are noteworthy: the within-group variability for all four
datasets is about the same and is considerably greater than any between-group
mean differences. However, because the sample sizes are relatively large, the
standard errors of the sample means will be much reduced, so it makes
statistical sense to carry out significance tests for possibly small between-dataset
mean differences. It is clear, though, from Figure 2.5.1 that any differences
between the ADHD and control groups are small. Possible differences within-
subject, between the first and second parts of the test, cannot be discerned from
Figure 2.5.1, which displays between-child variation.
The two sample t-test discussed in Section 2.4 assumes constant standard
deviation (this is supported by Figure 2.5.1) and Normality. Figures 2.5.2-2.5.5
show Normal plots for the response times for the four datasets.
Figure 2.5.2: Normal plot of response times (ms) for the ADHD children for P1
Figure 2.5.3: Normal plot of response times (ms) for the ADHD children for P2
Figure 2.5.4: Normal plot of response times (ms) for the Control children for P1
Figure 2.5.5: Normal plot of response times (ms) for the Control children for P2
Although the scatterplots approximate reasonably straight lines, in both cases the
Anderson-Darling test statistics have associated p-values which are on the
borderline of being statistically significant, thus calling the Normality assumption
into question.
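The Normality checks described above are easily reproduced in software. The
minimal Python sketch below (assuming scipy and matplotlib are installed; the raw
SART response times are not reproduced here, so simulated data are used in their
place) draws a Normal plot and computes the Anderson-Darling statistic.

# Minimal sketch of a Normal plot and an Anderson-Darling check, using
# simulated response times in place of the (unavailable) raw SART data.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1)
times = rng.normal(loc=500, scale=110, size=65)   # simulated response times (ms)

# Normal plot: points close to a straight line support the Normality assumption
stats.probplot(times, dist="norm", plot=plt)
plt.title("Normal plot of simulated response times")
plt.show()

# Anderson-Darling test: compare the statistic with the tabulated critical values
result = stats.anderson(times, dist="norm")
print("A-D statistic:", round(result.statistic, 3))
print("critical values:", result.critical_values)
print("significance levels (%):", result.significance_level)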
Between-group differences
First, we will carry out statistical tests to ask if the sample mean differences
between the two groups of children suggest long-run mean differences; we will
do this separately for the two parts of the SART. As discussed in the
introduction, we might expect that while there might or might not be a difference
on the first part, as the test progresses (requiring more sustained attention) a
difference might develop or increase with time.
For the first part of the SART (P1) we ask if a difference exists, on average,
between the two populations of children. The hypotheses are specified as:
Ho : µ1 − µ2 = 0
H1 : µ1 − µ2 ≠ 0
where µ1 refers to the control children and µ2 refers to the ADHD children. We
choose a significance level of α=0.05.
The pooled estimate of the standard deviation has 128 degrees of freedom, since
the standard deviation for each group has n–1=64 degrees of freedom. For a
two-sided test with 2n–2=128 degrees of freedom, the critical values are
tc=±1.98; with such high degrees of freedom many statisticians would simply use
±2 as the cut-off values, corresponding to the standard Normal cut-off values of
±1.96. The test statistic is:
t = (x̄1 − x̄2) / (s√(1/n1 + 1/n2))
which, using the rounded summary statistics of Table 2.5.1, is approximately
17/19.7 = 0.86. This lies well inside the critical values, so the P1 data do not
suggest a difference between the long-run means for the two groups of children.
Table 2.5.2 shows the corresponding Minitab analysis to compare the sample
means of the two groups on the second part of the SART.

              N   Mean   StDev   SE Mean
ADHD-P2      65    506     120        15
Control-P2   65    493     106        13

Table 2.5.2: Minitab analysis of the data for the second part of the test (P2)
The test statistic is 0.69 which has a corresponding p-value of 0.49: the sample
difference of 13.8 msec between the mean response times for the two groups is
not statistically significant. Note that the confidence interval for µ1 − µ2 covers
zero, as must be the case when the corresponding test does not reject the null
hypothesis of equal long-run means.
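As a check on the Minitab output, the two-sample test can be reproduced directly
from the summary statistics in Table 2.5.2. The minimal Python sketch below does
this with scipy; because it uses the rounded means and standard deviations, the
results differ slightly from Minitab's t = 0.69 and p = 0.49.

# Minimal sketch of a pooled two-sample t-test computed from summary statistics.
from scipy import stats

t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=506, std1=120, nobs1=65,    # ADHD-P2
    mean2=493, std2=106, nobs2=65,    # Control-P2
    equal_var=True)                   # pooled standard deviation, 128 df
print(f"t = {t_stat:.2f}, p = {p_value:.2f}")   # about t = 0.65, p = 0.51 here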
Within-group differences
We now turn to within-group comparisons of the average response times for the
two parts of the SART. For the ADHD children, first, we ask if the mean
difference over time (P2-P1) is statistically significantly different from zero:
Ho : µ = 0
H1 : µ ≠ 0
A 95% confidence interval for the population mean difference is given by:
d̄ ± tc·sd/√n
which is 3.7 to 39.9. We are 95% confident that a population of such ADHD
children would show a mean difference of between 4 and 40 msec in response
times on the two parts of the SART. This represents a slowing of their responses
to the stimuli over the course of the test. As always, the confidence level refers
to the method used rather than the particular numbers quoted above.
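The paired analysis for the ADHD group can be reproduced from the rounded
summary statistics of Table 2.5.1 (mean difference 22 ms, standard deviation of the
differences 73 ms, n=65). The minimal Python sketch below gives an interval close
to, but not identical with, the 3.7 to 39.9 quoted above, which was based on
unrounded means.

# Minimal sketch of the paired t-test and confidence interval from summary values.
import numpy as np
from scipy import stats

d_bar, s_d, n = 22.0, 73.0, 65
se = s_d / np.sqrt(n)                             # standard error of the mean difference
t_stat = d_bar / se                               # paired t statistic, about 2.4
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)   # about 0.02
t_crit = stats.t.ppf(0.975, df=n - 1)             # about 2.00 for 64 degrees of freedom
ci = (d_bar - t_crit * se, d_bar + t_crit * se)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}, 95% CI = ({ci[0]:.1f}, {ci[1]:.1f})")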
The corresponding analyses for the control group were carried out in Minitab;
Table 2.5.3 displays the relevant output.
Table 2.5.3: Minitab analysis of the P2-P1 differences for the control group
The two-sample t-tests suggested no mean differences between the two groups
of children for the two parts of the test, when they were compared separately.
The paired t-test on the control children detected no difference in response times
over the two parts of the study, while that on the ADHD children pointed to an
increased response time in the second part of the SART. Why the apparent
contradiction?
Essentially, the paired tests are more sensitive (powerful) – they have a greater
capacity to detect small differences. This is so because the measure of chance
variation (sd) that appears in the denominator of the paired test statistic is based
on the variation between children of the within-child difference (which for ADHD
is 73), while that in the two-sample t-tests measures between-child variation of
individual response times. Figure 2.5.1 shows the between-child variation in
response times to be large (thus, the standard deviation for ADHD children on P1
of the test was 112). Consequently, although the between-group difference for
P1 (502–485=17) is not markedly different from that between the two parts of the
SART for the ADHD children (507–485=22), the two-sample t-test between
groups is not able to separate the sample means statistically, whereas the paired
t-test on the mean difference for the ADHD children is statistically significant
(p=0.02).
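The difference in sensitivity can be seen directly by comparing the two standard
errors, as in the short Python sketch below, which uses the rounded summary
statistics of Table 2.5.1.

# Minimal sketch contrasting the paired and two-sample standard errors.
import numpy as np

n = 65
se_paired = 73 / np.sqrt(n)                        # about 9 ms (within-child differences)
se_two_sample = np.sqrt(112**2 / n + 113**2 / n)   # about 20 ms (between-child, P1)
print(f"paired SE = {se_paired:.1f} ms, two-sample SE = {se_two_sample:.1f} ms")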
Concluding Remarks
As mentioned in the introduction, the usual starting point for the analysis of data
obtained from a repeated measures study design would be ANOVA. This
method provides a single statistical test (called a test for ‘interaction’) which
shows that the difference in mean response times between P1 and P2 is different
for the two groups of children. We carried out four tests in arriving at this
conclusion. The undesirability of multiplying the number of tests carried out in
any analysis will be discussed in Chapter 5. Here, though, our rather crude
approach provided us with an opportunity to review the two different variants of
the t-test which formed the substance of this chapter.
References
[1] Pendl, R., Bauer, M., Caviezel, R., and Schulthess, P., J. AOAC Int., 1998,
81, 907.
[2] Mullins, E., Statistics for the Quality Control Chemistry Laboratory, Royal
Society of Chemistry, Cambridge, 2003.
[3] Altman, D.G., Practical Statistics for Medical Research, Chapman and Hall,
1991.
[4] Rice, J.A., Mathematical Statistics and Data Analysis, Duxbury Press, 1995.
[5] Samuels, M.L., Statistics for the Life Sciences, Dellen Publishing Company,
1989.
[6] Moore, D.S., in Statistics for the Twenty-First Century, F. Gordon and S.
Gordon (eds.), Mathematical Association of America, 1992.
[7] Moore, D.S. and McCabe, G.P., Introduction to the Practice of Statistics, 5th
edition, Freeman, 2006.
[8] Robertson, I.H. et al., ‘Oops!’: Performance correlates of everyday attentional
failures in traumatic brain injured and normal subjects, Neuropsychologia, 1997,
35(6).
Note that some portions of the text are based on Mullins (2003); the copyright owner is the Royal Society of Chemistry.
Solutions to Exercises
2.1.1
We wish to test the hypothesis that each of the fill heads delivers µo= 21g into the
containers, on average. We carry out a t-test for each head. We specify the null
and alternative hypotheses as:
Ho: µ = 21.00
H1: µ ≠ 21.00
For a significance level of α=0.05 we have critical values of ±2.26, since the
standard deviations each have (n–1)=9 degrees of freedom and this means the
appropriate reference distribution is a Student’s t-distribution with 9 degrees of
freedom; this is shown in Figure 2.1.1.1
The test statistic for head 1 falls between the critical values and so we do not
reject the null hypothesis; the sample data are consistent with fill head 1 being
on-target. The test statistic for head 2 is outside the right-hand critical value and
leads us to reject the null hypothesis; we do not accept that head 2 is on-target –
it appears to be overfilling, on average.
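The critical values of ±2.26 quoted above can be obtained from any statistics
package; a short Python check using scipy is shown below.

# Minimal sketch: the two-sided 5% cut-offs of a t-distribution with 9 degrees
# of freedom (sample size 10).
from scipy import stats

t_crit = stats.t.ppf(0.975, df=9)
print(round(t_crit, 2))   # 2.26, so the cut-offs are -2.26 and +2.26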
2.1.2
Ho: µ = 100%
H1: µ ≠ 100%.
The critical values for a significance level of 0.05 and 49 degrees of freedom are
±2.01. The calculated test statistic falls below the lower critical value, so we reject
Ho and conclude that the system gives a recovery rate lower than 100%.
The experimental design for this study involved self-pairing of the subjects; the
natural approach to analysing the resulting data is, consequently, to carry out a
paired t-test and calculate the corresponding confidence interval.
If we had the time order in which the measurements were made then a time
series plot would be a good idea, as it could indicate a drift with time of the
measurements. This could, for example, be the result of a drift in the
temperature of a water bath in which the blood samples were stored, or some
other time related measurement problem. Any such drift would affect the
independence assumption built into the statistical test.
Figure 2.2.1.1: Differences plotted against before-after means for the 11 subjects
The Normal plot of differences, shown in Figure 2.2.1.3, and the associated
Anderson-Darling test (p-value=0.62) do not raise doubts about the Normality
assumption.
Table 2.2.1.1 shows a Minitab paired t-test analysis of the platelet data. The null
hypothesis for the test is that the mean of the population of differences is zero:
there are 10 degrees of freedom, hence, the critical value for a two-tailed test
with a significance level of α=0.05 is 2.228 – the observed test statistic of t=4.27
(which is 10.27/2.40) is highly statistically significant; it has a corresponding p-
value of 0.002. The interpretation of the test result is that the observed mean
difference of 10.3 units is unlikely to be due to the inevitable chance biological
variation and measurement error in the study. It suggests an underlying
systematic effect. The magnitude of this effect is estimated by the 95%
confidence interval shown in Table 2.2.1.1.
Note the correspondence between the statistical test and the confidence interval
– the test rejects the hypothesis that the long-run mean is zero, the interval does
not include zero as a possible value for the long-run mean. This agreement
between the two methods is necessary, as they embody the same information.
However, the confidence interval is more useful, in that it not only tells us that the
long-run mean is unlikely to be zero, but also provides a range within which we
are confident that the mean difference lies.
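For completeness, the test statistic, p-value and confidence interval can be
reproduced from the summary figures quoted above (mean difference 10.27,
standard error 2.40, 10 degrees of freedom); the short Python sketch below does
this, and small rounding differences from the Minitab output are to be expected.

# Minimal sketch of the paired t-test for the platelet data from summary figures.
from scipy import stats

d_bar, se, df = 10.27, 2.40, 10
t_stat = d_bar / se                               # about 4.28 (4.27 in the text)
p_value = 2 * stats.t.sf(t_stat, df)              # roughly 0.002
t_crit = stats.t.ppf(0.975, df)                   # 2.228
ci = (d_bar - t_crit * se, d_bar + t_crit * se)   # 95% CI for the mean difference
print(f"t = {t_stat:.2f}, p = {p_value:.3f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")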
In summary, based on this small study, it appears that smoking one cigarette
increases the aggregation ability of platelets. Of course, the study raises many
more questions: is this only a short-term effect or does it persist? If many
cigarettes are smoked, does the effect increase? If so, is the increase additive or
multiplicative, or does it tail off as the number of cigarettes increases? Is there a
cumulative effect of chronic smoking? And so on.
In this small study the response increased for 10 of the 11 subjects after smoking
a cigarette, which in itself is strong evidence that smoking even one cigarette has
an effect. The analysis based on the
measured differences would be more informative (it indicates the size of the
effect) if we could have confidence in it. However, the two plots (Figures
2.2.1.1/2) raise doubts about the data which could only be resolved in discussion
with the researchers and this is not an option for us. An important lesson to be
drawn from this example, then, is that simple graphical analyses may be more
informative than apparently more sophisticated statistical tests!
A simple statistical model for the data asserts that the differences are
independent and, if we had a very large sample (in principle, an infinite sample)
the values would be Normally distributed about a mean, µ, with a standard
deviation σ. The question of the existence of a relative bias can then be
addressed by asking if the long-run mean difference between the results
produced by the two laboratories, µ, is zero. If the significance test rejects this
hypothesis, it points to the existence of a relative bias between the laboratories.
Significance Test
If the proposed statistical model is appropriate for the data, then a paired t-test of
the hypothesis that µ=0 may be carried out. To carry out the test a significance
level, say α=0.05, is chosen, the null and alternative hypotheses are specified as:
Ho : µ = 0
H1 : µ ≠ 0
The reference distribution is a Student's t-distribution with n–1=5 degrees of
freedom; this distribution has 95% of its area between +2.57 and –2.57. Since the
calculated t-value lies far out in the right-hand tail, the data strongly suggest that
the laboratories do not, on average, obtain the same purity results when
measuring the same material.
For the six batches of product included in the study the average difference was
0.637. However, if a different set of six batches was measured (or if these same
batches were measured again) we would almost certainly obtain a different
result, because a different set of sampling and measurement errors would be
included in our results. Accordingly, rather than report 0.637 as the long-run
average difference (relative bias) between the laboratories, it makes sense to
attach error bounds to this point estimate; we obtain these by calculating a
confidence interval.
d̄ ± tc·sd/√n
0.637 ± 2.57 × sd/√6
0.637 ± 0.249.
The estimate of the relative bias between the laboratories indicates that it is
somewhere between 0.39 and 0.89 units.
Computer Analysis
The output gives a confidence interval and a test of the hypothesis that the long-
run mean difference is zero; the test statistic (slightly different from our hand-
calculated value of 6.59, because of rounding) leads to the same conclusion.
We have identified a relative bias between the laboratories. How important this
is would depend on the economics of the product – it is not a statistical question.
The confidence interval estimates the size of the bias – this would allow the
scientists to evaluate the importance of the difference.
Similarly, the confidence interval narrows with the reduced standard deviation;
the full width of the interval is 0.497 when 6 differences are used in the analysis,
but this reduces to 0.291 when only 5 differences are used. These changes are
complementary.
This example shows the marked impact even a single unusually large or small
value can have on a statistical analysis. This underlines the need to examine
data carefully rather than simply loading them into a computer and requesting a
standard statistical test. Data processing is not the same as data analysis!
2.3.1
A 95% confidence interval for the long-run mean result that would be obtained if
very many measurements were made on the baby food in Laboratory A is given
by:
x̄ ± tc·s/√n
26.943 ± 2.26 × s/√10
26.943 ± 0.210.
We estimate that the long-run mean is somewhere between 26.733 and 27.153
units (percentage fat). This interval covers the certified value of 27.100. This
was expected as the statistical test did not reject the certified value. The critical
value of 2.26 used in the calculation is based on 9 degrees of freedom, since the
sample size was 10.
2.3.2
Test of mu = 21 vs not = 21
Note that for head 1 the confidence interval contains 21.00, corresponding to the
null hypothesis value of 21.00 not being rejected by the corresponding t-test,
while the interval for head 2 does not contain 21.00, corresponding to the
rejection of such an hypothesised value.
Note the confidence intervals for 2.3.3 and 2.3.4 are discussed together with the
t-tests in solutions 2.2.2 and 2.2.3, respectively.
2.4.1
Pain-relief Study
Since we do not have the raw data we will have to assume that the data are
sufficiently close to Normality to permit us to carry out a t-test. Samuels uses the
data in an exercise on t-tests so, presumably, this is a reasonable assumption.
In any case, the sample sizes are reasonably large so that the distribution of the
test-statistic would not be badly affected by minor departures from this
assumption.
A two-sample t-test addresses the hypothesis that the population means are the
same. In a case like this, obviously, we do not have a random sample from a
fixed population. If we can regard the subjects in the study as being
representative of mothers who have just delivered a baby, or at least some well-
specified subset of mothers, then we can hope that our inferences generalise
beyond the set of women being studied. The means in question are then the
long-run average pain-relief scores that would be obtained if all such mothers
received the drug or the placebo, respectively.
The Minitab analysis shown in Table 2.4.1.1 gives a p-value of 0.076; this applies
for a two-tail test. If a one-tail test were carried out (as suggested by Samuels12)
then the p-value would be half of this (0.038) and the results would be
considered statistically significant, when compared to the conventional
significance level of 0.05. The one-tailed critical value for a test with a
significance level of α=0.05 is 1.68, and our observed t-statistic exceeds this –
which is exactly equivalent to the p-value being less than 0.05.
12. In our discussion of one versus two-sided tests (pages 13-14) it was pointed out that many
statisticians would discourage one-sided tests in a research context. So why might Samuels look
for a one-sided test here? In this study a drug is being compared to a placebo (rather than to an
established treatment). In this case any placebo effect will apply equally to the drug and the
placebo, so it is reasonable to look for an increased pain relief score in the drug group.
Table 2.4.1.1: Minitab two-sample analysis of the pain relief study data
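The one-sided and two-sided quantities discussed above are easily checked in
software. In the minimal Python sketch below, the group size of n=25 is an
assumption, inferred from the 24 degrees of freedom per group used in the F-test
of Solution 2.4.4.

# Minimal sketch of one-sided versus two-sided critical values and p-values;
# n = 25 per group is an assumption inferred from the F-test degrees of freedom.
from scipy import stats

n = 25
df = 2 * n - 2                             # 48 degrees of freedom
print(round(stats.t.ppf(0.95, df), 2))     # one-sided 5% cut-off, about 1.68
print(round(stats.t.ppf(0.975, df), 2))    # two-sided 5% cut-offs, about +/-2.01

# For a test statistic in the hypothesised direction, the one-sided p-value is
# half the two-sided one, e.g. 0.076 / 2 = 0.038.
print(0.076 / 2)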
It is important to review results like this one and ask why the difference (on a two-
sided test) is ‘not statistically significant’. The implicit assumption for those who
simply declare the result ‘not statistically significant’ is that there really is no long-
run difference between the population means, i.e., the treatment has no effect.
However, if we examine the test statistic, t = (x̄1 − x̄2)/(s√(2/n)), it becomes clear
that it may be small because the sample mean difference, x̄1 − x̄2, is small
(perhaps because there is no treatment effect), or because the denominator is
large, i.e., because the sampling
variability as measured by the common standard deviation, s, is large, or
because the sample size in each group, n, is small. If the sampling variability is
large then in designing the study a sample size, n, which reflects the large
sampling variability (here the woman-to-woman variation in pain relief scores),
should have been chosen. If this was not done then the reason that the result
was not statistically significant could simply be that the sample size was too small
to counteract the large sampling variability. Technically, this is described by
saying that the power of the test to detect the treatment effect was too low. This
immediately raises the question as to how many observations are needed in
studies of this type – sample size determination will be discussed in Chapter 4.
2.4.2
This question requires exactly the same calculations as the previous one, so we
will take these as given and examine the Minitab analysis as shown in Table
2.4.2.1.
Two-Sample T-Test and CI
2.4.3
The confidence interval for the pain-relief study is given in Solution 2.4.1, while
that for the tobacco plant study is given in Solution 2.4.2.
2.4.4
For the pain relief data of Exercise 2.4.1, the F ratio is F=(13.78/12.05)² = 1.31.
The critical value is Fc=F0.975,24,24=2.27 for a significance level of α=0.05 – there
is no reason to reject the assumption of underlying standard deviations which are
the same.
For the tobacco plant data of Exercise 2.4.2, the F ratio is F=(11.14/9.02)² = 1.53.
The critical value is Fc=F0.975,16,16=2.76 for a significance level of α=0.05 – again,
there is no reason to reject the assumption of equal underlying standard
deviations.
2.4.5
The standard deviations are 0.374 and 1.723 for Laboratories A and B,
respectively. These give an F-ratio of 21.18, which is highly statistically
significant when compared to the critical value of Fc=F0.975,9,9=4.03, for a
significance level of α=0.05.
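This F-test is easily reproduced in software; the minimal Python sketch below uses
the rounded standard deviations quoted above, so the ratio comes out as 21.2
rather than the 21.18 obtained from the raw data.

# Minimal sketch of the F-test for equal precision in the two laboratories.
from scipy import stats

s_a, s_b, df_a, df_b = 0.374, 1.723, 9, 9
f_ratio = (s_b / s_a) ** 2                 # larger variance on top, about 21.2
f_crit = stats.f.ppf(0.975, df_b, df_a)    # 4.03, for a significance level of 0.05
print(round(f_ratio, 2), round(f_crit, 2))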
Minitab gives the following output for a t-test on the data – note that equal
standard deviations are not assumed by this test. Note, also, that the degrees of
freedom of the t-test are shown as 9 rather than 18, as would be expected if the
population standard deviations were assumed equal and the sample values were
combined for the test (see comments on page 49).
Two-sample T for A vs B
Difference = mu A - mu B
Estimate for difference: 1.557
95% CI for difference: (0.296, 2.818)
T-Test of difference = 0 (vs not =): T-Value = 2.79 P-Value = 0.021 DF = 9
Table 2.4.5.1: Comparing groups where the population standard deviations are different.
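Finally, the Welch analysis of Table 2.4.5.1 can be reproduced from the raw data of
Exercise 2.4.5. The minimal Python sketch below does this with scipy; note that
scipy retains the fractional Welch degrees of freedom (about 9.8) where Minitab
truncates to 9, so the p-value may differ marginally from 0.021.

# Minimal sketch reproducing the unequal-variance (Welch) t-test for the
# margarine data of Exercise 2.4.5.
import numpy as np
from scipy import stats

a = np.array([79.63, 79.64, 78.86, 78.63, 78.92, 79.19, 79.66, 79.37, 79.42, 79.60])
b = np.array([74.96, 74.81, 76.91, 78.41, 77.95, 79.17, 79.82, 79.31, 77.65, 78.36])

diff = a.mean() - b.mean()                              # about 1.557
t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)
print(f"difference = {diff:.3f}, t = {t_stat:.2f}, p = {p_value:.3f}")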