Test Statistics
Methods
Volume 16 | Issue 1 Article 9
5-1-2017
Bethan Russ
Office for National Statistics, [email protected]
Deirdre Toher
University of the West of England, [email protected]
Paul White
University of the West of England, [email protected]
Recommended Citation
Derrick, B., Russ, B., Toher, D., & White, P. (2017). Test statistics for the comparison of means for two samples that include both
paired and independent observations. Journal of Modern Applied Statistical Methods, 16(1), 137-157. doi:
10.22237/jmasm/1493597280
This Regular Article is brought to you for free and open access by the Open Access Journals at DigitalCommons@WayneState. It has been accepted for
inclusion in Journal of Modern Applied Statistical Methods by an authorized editor of DigitalCommons@WayneState.
Journal of Modern Applied Statistical Methods Copyright © 2017 JMASM, Inc.
May 2017, Vol. 16, No. 1, 137-157. ISSN 1538 − 9472
doi: 10.22237/jmasm/1493597280
Standard approaches for analyzing the difference in two means, where partially
overlapping samples are present, are less than desirable. Two test statistics that
make reference to the t-distribution are introduced here. It is shown that these test
statistics are Type I error robust, and more powerful than standard tests.
Keywords: partially overlapping samples, test for equality of means, corrected z-test,
partially correlated data, partially matched pairs
Introduction
Hypothesis tests for the comparison of two population means, μ1 and μ2, with two
samples of either independent observations or paired observations are well
established. When the assumptions of the test are met, the independent samples
t-test is the most powerful test for comparing means between two independent
samples (Sawilowsky and Blair, 1992). Similarly, when the assumptions of the
test are met, the paired samples t-test is the most powerful test for the comparison
of means between two dependent samples (Zimmerman, 1997). If a paired design
can avoid extraneous systematic bias, then paired designs are generally
considered to be advantageous when contrasted with independent designs.
There are scenarios where, in a paired design, some observations may be
missing. In the literature, this scenario is referred to as paired samples that are
either “incomplete” (Ekbohm, 1976) or with “missing observations” (Bhoj, 1978).
There are designs that do not have completely balanced pairings. Occasions where
there may be two samples with both paired observations and independent
observations include:
i) Two groups with some common element between both groups. For
example, in education when comparing the average exam marks for
two optional subjects, where some students take one of the two
subjects and some students take both.
iii) When some natural pairing occurs. For example, in a survey taken
comparing views of males and females, there will be some matched
pairs (couples) and some independent individuals (single).
The examples given above can be seen as part of the wider missing data
framework. There is much literature on methods for dealing with missing data and
the proposals in this paper do not detract from extensive research into the area.
The simulations and discussion in this paper are done in the context of data
missing completely at random (MCAR).
Two samples that include both paired and independent observations are
referred to using varied terminology in the literature. The example scenarios
outlined can be referred to as “partially paired data” (Samawi and Vogel, 2011).
However, this terminology has connotations suggesting that the pairs themselves
are not directly matched. Derrick et al. (2015) suggest that appropriate
terminology for the scenarios outlined gives reference to “partially overlapping
samples.” For work that has previously been done on a comparison of means
when partially overlapping samples are present, “the partially overlapping
Notation
Notation used in the definition of the test statistics is given in Table 1.
All variances above are calculated using Bessel’s correction, i.e. the sample
variance with ni − 1 degrees of freedom (see Kenney and Keeping, 1951, p.161).
As standard notation, random variables are shown in upper case, and derived
sample values are shown in lower case.
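\[
T_1 = \frac{\bar{X}_{1c} - \bar{X}_{2c}}{\sqrt{\dfrac{S_{1c}^2}{n_c} + \dfrac{S_{2c}^2}{n_c} - 2r\,\dfrac{S_{1c} S_{2c}}{n_c}}}
\]

\[
T_2 = \frac{\bar{X}_a - \bar{X}_b}{S_p \sqrt{\dfrac{1}{n_a} + \dfrac{1}{n_b}}}
\quad \text{where} \quad
S_p = \sqrt{\frac{(n_a - 1) S_a^2 + (n_b - 1) S_b^2}{(n_a - 1) + (n_b - 1)}}
\]

\[
T_3 = \frac{\bar{X}_a - \bar{X}_b}{\sqrt{\dfrac{S_a^2}{n_a} + \dfrac{S_b^2}{n_b}}}
\]

\[
v_3 = \frac{\left(\dfrac{S_a^2}{n_a} + \dfrac{S_b^2}{n_b}\right)^2}{\left(\dfrac{S_a^2}{n_a}\right)^2 \Big/ (n_a - 1) + \left(\dfrac{S_b^2}{n_b}\right)^2 \Big/ (n_b - 1)}
\]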
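As a rough illustration, the three standard statistics can be computed with the Python standard library alone. The function names and the choice of Python are assumptions made for this sketch (the simulations in this paper are run in R). The degrees of freedom v1 = nc − 1 and v2 = na + nb − 2 are the usual ones for the paired and independent samples t-tests and agree with the values reported in the worked example. The data are the paired and unpaired marks from Table 2.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    """Sample variance with Bessel's correction (n - 1 in the denominator)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def corr(xs, ys):
    """Pearson correlation coefficient r of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return sxy / math.sqrt(var(xs) * var(ys))

def paired_t(x1c, x2c):
    """T1 and v1 = nc - 1, using only the paired observations."""
    n_c = len(x1c)
    s1, s2, r = math.sqrt(var(x1c)), math.sqrt(var(x2c)), corr(x1c, x2c)
    se = math.sqrt(s1 ** 2 / n_c + s2 ** 2 / n_c - 2 * r * s1 * s2 / n_c)
    return (mean(x1c) - mean(x2c)) / se, n_c - 1

def pooled_t(xa, xb):
    """T2 and v2 = na + nb - 2, using only the independent observations."""
    n_a, n_b = len(xa), len(xb)
    s_p = math.sqrt(((n_a - 1) * var(xa) + (n_b - 1) * var(xb))
                    / ((n_a - 1) + (n_b - 1)))
    return (mean(xa) - mean(xb)) / (s_p * math.sqrt(1 / n_a + 1 / n_b)), n_a + n_b - 2

def welch_t(xa, xb):
    """T3 and v3, using only the independent observations."""
    a, b = var(xa) / len(xa), var(xb) / len(xb)
    v3 = (a + b) ** 2 / (a ** 2 / (len(xa) - 1) + b ** 2 / (len(xb) - 1))
    return (mean(xa) - mean(xb)) / math.sqrt(a + b), v3

# Observations from Table 2: students who took both modules, then one module only.
paired_ms = [73, 74, 59, 49, 42, 71, 39, 59]
paired_or = [72, 89, 78, 64, 42, 76, 89, 63]
only_ms = [82, 85]
only_or = [83, 79, 67, 82, 85, 92]

t1, v1 = paired_t(paired_ms, paired_or)
t2, v2 = pooled_t(only_ms, only_or)
t3, v3 = welch_t(only_ms, only_or)
print(v1, v2, round(v3, 2))
```

Each of these tests discards part of the data: the paired test ignores the ten unpaired marks, and the two independent samples tests ignore the eight pairs.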
For large sample sizes, the test statistic for partially overlapping samples
proposed by Looney and Jones (2003) is
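\[
Z_{\text{corrected}} = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{S_1^2}{n_a + n_c} + \dfrac{S_2^2}{n_b + n_c} - \dfrac{2 n_c S_{12}}{(n_a + n_c)(n_b + n_c)}}}
\]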
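\[
T_{\text{new1}} = \frac{\bar{X}_1 - \bar{X}_2}{S_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2} - 2r\left(\dfrac{n_c}{n_1 n_2}\right)}}
\quad \text{where} \quad
S_p = \sqrt{\frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{(n_1 - 1) + (n_2 - 1)}}
\]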
The test statistic Tnew1 is referenced against the t-distribution with degrees of
freedom derived by linear interpolation between v1 and v2 so that
\[
v_{\text{new1}} = (n_c - 1) + \left(\frac{n_a + n_b + n_c - 1}{n_a + n_b + 2 n_c}\right)(n_a + n_b).
\]
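A minimal sketch of Tnew1 and vnew1 computed from summary statistics is given below. The function name and argument names are hypothetical; n1 = na + nc and n2 = nb + nc as in the worked example, and r is the Pearson correlation of the nc paired observations. The inputs in the call are the rounded summary values reported in the worked example.

```python
import math

def t_new1(mean1, mean2, var1, var2, n_a, n_b, n_c, r):
    """Return (T_new1, v_new1) for partially overlapping samples.

    mean1 and var1 use all n1 = n_a + n_c observations in Sample 1;
    mean2 and var2 use all n2 = n_b + n_c observations in Sample 2.
    """
    n1, n2 = n_a + n_c, n_b + n_c
    # Pooled standard deviation, each variance weighted by its degrees of freedom.
    s_p = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / ((n1 - 1) + (n2 - 1)))
    se = s_p * math.sqrt(1 / n1 + 1 / n2 - 2 * r * n_c / (n1 * n2))
    # Linear interpolation between v1 = n_c - 1 and v2 = n_a + n_b - 2.
    df = (n_c - 1) + (n_a + n_b + n_c - 1) / (n_a + n_b + 2 * n_c) * (n_a + n_b)
    return (mean1 - mean2) / se, df

t_stat, df = t_new1(63.300, 75.786, 263.789, 179.874, n_a=2, n_b=6, n_c=8, r=0.366)
print(df)  # 12.0
```

Setting n_c = 0 reduces the degrees of freedom to n_a + n_b − 2, and setting n_a = n_b = 0 reduces them to n_c − 1, which is the sense in which vnew1 interpolates between v2 and v1.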
\[
T_{\text{new2}} = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2} - 2r\left(\dfrac{S_1 S_2 n_c}{n_1 n_2}\right)}}
\]
The test statistic Tnew2 is referenced against the t-distribution with degrees of
freedom derived as a linear interpolation between v1 and v3 so that
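\[
v_{\text{new2}} = (n_c - 1) + \left(\frac{\gamma - n_c + 1}{n_a + n_b + 2 n_c}\right)(n_a + n_b)
\]

\[
\text{where} \quad
\gamma = \frac{\left(\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}\right)^2}{\left(\dfrac{S_1^2}{n_1}\right)^2 \Big/ (n_1 - 1) + \left(\dfrac{S_2^2}{n_2}\right)^2 \Big/ (n_2 - 1)}
\]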
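The corresponding calculation for Tnew2 can be sketched in the same way. Again the function names are hypothetical, and the inputs are the rounded summary values from the worked example, so results agree with the reported γ and vnew2 only to about three decimal places.

```python
import math

def welch_gamma(var1, var2, n1, n2):
    """Welch-Satterthwaite type degrees of freedom based on all observations."""
    a, b = var1 / n1, var2 / n2
    return (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))

def t_new2(mean1, mean2, var1, var2, n_a, n_b, n_c, r):
    """Return (T_new2, v_new2, gamma) without assuming equal variances."""
    n1, n2 = n_a + n_c, n_b + n_c
    se = math.sqrt(var1 / n1 + var2 / n2
                   - 2 * r * math.sqrt(var1 * var2) * n_c / (n1 * n2))
    gamma = welch_gamma(var1, var2, n1, n2)
    # Linear interpolation between v1 = n_c - 1 and the Welch-type value gamma.
    df = (n_c - 1) + (gamma - n_c + 1) / (n_a + n_b + 2 * n_c) * (n_a + n_b)
    return (mean1 - mean2) / se, df, gamma

t_stat, df, gamma = t_new2(63.300, 75.786, 263.789, 179.874,
                           n_a=2, n_b=6, n_c=8, r=0.366)
print(round(gamma, 3), round(df, 3))
```

Unlike vnew1, the degrees of freedom here depend on the sample variances, so they are generally not an integer.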
Worked Example
An applied example is given to demonstrate the calculation of each of the test
statistics defined. In education, for credit towards an undergraduate Statistics
course, students may take optional modules in either Mathematical Statistics, or
Operational Research, or both. The program leader is interested whether the exam
marks for the two optional modules differ. The exam marks attained for a single
semester are given in Table 2.
Student                    1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
Mathematical Statistics   73  82  74  59  49   -  42  71   -  39   -   -   -   -  59  85
Operational Research      72   -  89  78  64  83  42  76  79  89  67  82  85  92  63   -
As per standard notation, the derived sample values are given in lower case. In
the calculation of the test statistics, x̄1 = 63.300, x̄2 = 75.786, s1² = 263.789,
s2² = 179.874, na = 2, nb = 6, nc = 8, n1 = 10, n2 = 14, v1 = 7, v2 = 6, v3 = 6,
γ = 17.095, vnew1 = 12, vnew2 = 10.365, r = 0.366, and the sample covariance of
the paired observations is s12 = 78.679.
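The summary values above can be reproduced directly from the marks in Table 2. The following is a sketch using only the Python standard library; the variable names are illustrative, and None stands for a module the student did not take.

```python
import math

# Exam marks from Table 2, students 1 to 16; None marks a module not taken.
maths = [73, 82, 74, 59, 49, None, 42, 71, None, 39, None, None, None, None, 59, 85]
oper = [72, None, 89, 78, 64, 83, 42, 76, 79, 89, 67, 82, 85, 92, 63, None]

def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    """Sample covariance with Bessel's correction; cov(xs, xs) is the variance."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

# All available observations per module, and the students observed on both.
x1 = [m for m in maths if m is not None]
x2 = [o for o in oper if o is not None]
pairs = [(m, o) for m, o in zip(maths, oper) if m is not None and o is not None]

n_c = len(pairs)
n_a = len(x1) - n_c
n_b = len(x2) - n_c

p1, p2 = zip(*pairs)
s12 = cov(p1, p2)                                  # covariance of the paired marks
r = s12 / math.sqrt(cov(p1, p1) * cov(p2, p2))     # Pearson correlation of the pairs

print(n_a, n_b, n_c)
print(round(mean(x1), 3), round(mean(x2), 3))
print(round(cov(x1, x1), 3), round(cov(x2, x2), 3))
print(round(r, 3), round(s12, 3))
```

Note that the means and variances use every available mark in each module, whereas r and s12 use only the eight students who took both.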
For the REML analysis, a mixed model is performed with “Module” as a
repeated measures fixed effect and “Student” as a random effect. Table 3 gives
the calculated test statistics, degrees of freedom and corresponding p-values.
With the exception of REML, the estimates of the mean difference are
simply the difference in the means of the two samples, based on the observations
used in the calculation. It can quickly be seen that the conclusions differ
depending on the test used. It is of note that only the tests using all of the
available data result in the rejection of the null hypothesis at αnominal = 0.05. Also
note that the results of the paired samples t-test and the independent samples t-test
have sample effects in different directions. This is only one specific example
given for illustrative purposes; investigation is required into the power of the test
statistics over a wide range of scenarios. Conclusions based on the proposed tests
cannot be made without a thorough investigation into their Type I error robustness.
Simulation Design
Under normality, Monte-Carlo methods are used to investigate the Type I error
robustness of the defined test statistics and REML. Power should only be used to
compare tests when their Type I error rates are equal (Zimmerman and Zumbo,
1993). Monte-Carlo methods are used to explore the power for the tests that are
Type I error robust under normality.
Unbalanced designs are frequent in psychology (Sawilowsky and Hillman,
1992), thus a comprehensive range of values for na, nb and nc is simulated. These
values offer an extension to the work done by Looney and Jones (2003). Given
the identification of separate test statistics for equal and unequal variances,
multiple population variance parameters {σ1², σ2²} are considered. Correlation has
an impact on Type I error and power for the paired samples t-test (Fradette et al.,
2003), hence a range of correlations {ρ} between two normal populations are
considered. Correlated normal variates are obtained as per Kenney and Keeping
(1951). A total of 10,000 replicates of each of the scenarios in Table 4 are
performed in a factorial design.
All simulations are performed in R version 3.1.2. For the mixed model
approach utilizing REML, the R package lme4 is used. Corresponding p-values
are calculated using the R package lmerTest, which uses the Satterthwaite
approximation adopted by SAS (Goodnight, 1976).
For each set of 10,000 p-values, the proportion of times the null hypothesis
is rejected, for a two-sided test with αnominal = 0.05, is calculated.
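One replicate of the simulation design can be sketched as follows. This is an illustration rather than the authors' R code: the function name is hypothetical, and the correlated pair is built with the standard construction z1 and ρ·z1 + √(1 − ρ²)·z2 from two independent standard normal deviates, which is assumed here to be equivalent to the method cited from Kenney and Keeping (1951).

```python
import math
import random

def partially_overlapping_sample(n_a, n_b, n_c, var1, var2, rho,
                                 mu1=0.0, mu2=0.0, rng=random):
    """Simulate one replicate: unpaired Sample 1, unpaired Sample 2, and pairs."""
    sd1, sd2 = math.sqrt(var1), math.sqrt(var2)
    only1 = [mu1 + sd1 * rng.gauss(0, 1) for _ in range(n_a)]
    only2 = [mu2 + sd2 * rng.gauss(0, 1) for _ in range(n_b)]
    pairs = []
    for _ in range(n_c):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        # z1 and rho*z1 + sqrt(1 - rho^2)*z2 are standard normal with correlation rho.
        w = rho * z1 + math.sqrt(1 - rho ** 2) * z2
        pairs.append((mu1 + sd1 * z1, mu2 + sd2 * w))
    return only1, only2, pairs

rng = random.Random(2017)  # fixed seed so the replicate is reproducible
only1, only2, pairs = partially_overlapping_sample(5, 10, 5, 8, 1, 0.5, rng=rng)
print(len(only1), len(only2), len(pairs))  # 5 10 5
```

Repeating this 10,000 times per scenario, applying each test, and recording the proportion of p-values below 0.05 gives the rejection rates summarized in the figures.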
Parameter   Values
μ1          0
μ2          0 (under H0); 0.5 (under H1)
σ1²         1, 2, 4, 8
σ2²         1, 2, 4, 8
na          5, 10, 30, 50, 100, 500
nb          5, 10, 30, 50, 100, 500
nc          5, 10, 30, 50, 100, 500
ρ           -0.75, -0.50, -0.25, 0.00, 0.25, 0.50, 0.75
Figure 1. Type I error rates where σ1² = σ2², reference lines show Bradley’s (1978) liberal
criteria.
Figure 1 indicates that when variances are equal, the statistics T1, T2, T3,
Tnew1 and Tnew2 remain within Bradley’s liberal Type I error robustness criteria
throughout the entire simulation design. The statistic Zcorrected is not Type I error
robust, thus confirming the smaller simulation findings of Mehrotra (2004).
Figure 1 also shows that REML is not Type I error robust throughout the entire
simulation design. A review of our results shows that for REML the scenarios that
are outside the range of liberal Type I error robustness are predominantly those
that have negative correlation, and some where zero correlation is specified.
Given that negative correlation is rare in a practical environment, the REML
procedure is not necessarily unjustified.
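The robustness check behind the reference lines can be sketched as below. Bradley's (1978) liberal criterion is commonly stated as requiring the observed Type I error rate to lie between 0.5α and 1.5α, that is between 0.025 and 0.075 when α = 0.05. These bounds are not spelled out in the text above, so they are an assumption of this sketch, as are the function names and the strict inequality used for rejection.

```python
def rejection_rate(p_values, alpha=0.05):
    """Proportion of replicates in which H0 is rejected (p < alpha assumed)."""
    return sum(p < alpha for p in p_values) / len(p_values)

def within_bradley_liberal(rate, alpha=0.05):
    """True when 0.5 * alpha <= rate <= 1.5 * alpha (assumed liberal bounds)."""
    return 0.5 * alpha <= rate <= 1.5 * alpha

# Toy p-values for illustration only; a real check would use 10,000 replicates.
toy_p_values = [0.01, 0.20, 0.04, 0.60]
print(rejection_rate(toy_p_values))   # 0.5
print(within_bradley_liberal(0.05))   # True
print(within_bradley_liberal(0.09))   # False
```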
Type I error robustness is assessed under the condition of unequal variances.
Under the null hypothesis, 10,000 replicates were obtained for the
4 × 3 × 6 × 6 × 6 × 7 = 18,144 scenarios where σ1² ≠ σ2². For assessment against
Bradley’s (1978) liberal criteria, Figure 2 shows the Type I error rates for unequal
variances for normally distributed data.
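The scenario count can be checked by enumerating the factorial design in Table 4. The short sketch below is illustrative only and uses hypothetical variable names; it splits the full grid into the equal-variance and unequal-variance scenarios.

```python
from itertools import product

variances = [1, 2, 4, 8]                        # values for each of the two variances
sizes = [5, 10, 30, 50, 100, 500]               # values for each of na, nb, nc
rhos = [-0.75, -0.50, -0.25, 0.00, 0.25, 0.50, 0.75]

# Each scenario is (var1, var2, n_a, n_b, n_c, rho).
grid = list(product(variances, variances, sizes, sizes, sizes, rhos))
unequal = [s for s in grid if s[0] != s[1]]
equal = [s for s in grid if s[0] == s[1]]

print(len(unequal))  # 18144
print(len(equal))    # 6048
```

The factor 4 × 3 arises because each of the four values of the first variance can be combined with three different values of the second.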
Figure 2. Type I error rates when σ1² ≠ σ2², reference lines show Bradley’s (1978) liberal
criteria.
Figure 2 illustrates that the statistics defined using a pooled standard
deviation, T2 and Tnew1, do not provide Type I error robust solutions when equal
variances cannot be assumed. The statistics T1, T3 and Tnew2 retain their Type I
error robustness under unequal variances throughout all conditions simulated.
The statistic Zcorrected maintains similar Type I error rates under equal and
unequal variances. The statistic Zcorrected was designed to be used only in the case
of equal variances. For unequal variances, we observe that the statistic Zcorrected
results in an unacceptable number of false positives when ρ ≤ 0.25 or
max{na, nb, nc} − min{na, nb, nc} is large. In addition, the statistic Zcorrected is
conservative when ρ is large and positive. The largest observed deviations from
Type I error robustness for REML are when ρ ≤ 0 or
max{na, nb, nc} − min{na, nb, nc} is large. Further insight into the Type I error rates
for REML can be seen in Figure 3 showing observed p-values against expected p-
values from a uniform distribution.
Figure 3. P-P plots for simulated p-values using REML procedure. Selected parameter
combinations (na, nb, nc, σ1², σ2², ρ) are as follows: A = (5,5,5,1,1,-0.75),
B = (5,10,5,8,1,0), C = (5,10,5,8,1,0.5), D = (10,5,5,8,1,0.5).
If the null hypothesis is true, for any given set of parameters the p-values
should be uniformly distributed. Figure 3 gives indicative parameter combinations
where the p-values are not uniformly distributed when applying a mixed model
assessed using REML. It can be seen that REML is not Type I error robust when
the correlation is negative. In addition, caution should be exercised if using
REML when the larger variance is associated with the smaller sample size.
REML maintains Type I error robustness for positive correlation and equal
variances or when the larger sample size is associated with the larger variance.
Table 5. Power of Type I error robust test statistics where σ1² = σ2² = 1, α = 0.05, μ2 − μ1 = 0.5.
Table 5 shows that REML and the test statistics proposed in this paper, Tnew1
and Tnew2, are more powerful than standard approaches, T1, T2 and T3, when
variances are equal. Consistent with the paired samples t-test, T1, the power of
Tnew1 and Tnew2 is relatively lower when there is zero or negative correlation
between the two populations. Similar to contrasts of the independent samples
t-test, T2, with Welch’s test, T3, for equal variances but unequal sample sizes, Tnew1
is marginally more powerful than Tnew2, but not to any practical extent. For each
of the test statistics making use of paired data, as the correlation between the
paired samples increases, the power increases.
As the correlation between the paired samples increases, the power
advantage of the proposed test statistics relative to the paired samples t-test
becomes smaller. Therefore the proposed statistics Tnew1 and Tnew2 may be
especially useful when the correlation between the two populations is small.
To show the relative increase in power for varying sample sizes, Figure 4
shows the power for selected test statistics for small-medium sample sizes,
averaged across the simulation design for equal variances.
Figure 4. Power for Type I error robust test statistics, averaged across all values of ρ
where σ1² = σ2² and μ2 − μ1 = 0.5. The sample sizes (na, nb, nc) are as follows:
A = (10,10,10), B = (10,30,10), C = (10,10,30), D = (10,30,30), E = (30,30,30).
From Figure 4 it can be seen that for small-medium sample sizes, the power
of the proposed test statistics Tnew1 and Tnew2 is superior to standard test statistics.
Table 6. Power of Type I error robust test statistics where σ1² > 1, σ2² = 1, α = 0.05,
μ2 − μ1 = 0.5. Within this table, na > nb represents the larger variance associated with the
larger sample size, and na < nb represents the larger variance associated with the smaller
sample size.
            ρ      T1     T3     Tnew2  REML
na = nb     0.75   0.555  0.393  0.692  0.645
            0.50   0.481  0.393  0.665  0.588
            0.25   0.429  0.393  0.640  0.545
            0      0.391  0.393  0.619  0.515
            <0     0.341  0.393  0.582  -
na > nb     0.75   0.555  0.351  0.715  0.589
            0.50   0.481  0.351  0.688  0.508
            0.25   0.429  0.351  0.665  0.459
            0      0.391  0.351  0.642  0.422
            <0     0.341  0.351  0.604  -
na < nb     0.75   0.555  0.213  0.559  0.693
            0.50   0.481  0.213  0.539  0.649
            0.25   0.429  0.213  0.522  0.620
            0      0.391  0.213  0.507  0.603
            <0     0.341  0.213  0.480  -
The apparent power gain for REML when the larger variance is associated
with the smaller sample size can be explained by the pattern in the Type I error
rates. REML follows a similar pattern to the independent samples t-test, which is
liberal when the larger variance is associated with the smaller sample size, thus
giving the perception of higher power.
To show the relative increase in power for varying sample sizes, Figure 5
shows the power for selected test statistics for small-medium sample sizes,
averaged across the simulation design for unequal variances.
Figure 5. Power for Type I error robust test statistics where σ1² > σ2² and μ2 − μ1 = 0.5. The
sample sizes (na, nb, nc) are as follows: A = (10,10,10), B1 = (10,30,10), B2 = (30,10,10),
C = (10,10,30), D1 = (10,30,30), D2 = (30,10,30), E = (30,30,30).
Discussion
The statistic Tnew2 is Type I error robust across all conditions simulated under
normality. The greater power observed for Tnew1, compared to Tnew2, under equal
variances, is likely to be of negligible consequence in a practical environment.
This is in line with empirical evidence for the performance of Welch’s test, when
only independent samples are present, which leads to many observers
recommending the routine use of Welch’s test under normality (e.g. Ruxton,
2006).
The Type I error rates and power of Tnew2 follow the properties of its
counterparts, T1 and T3. Thus Tnew2 can be seen as a trade-off between the paired
samples t-test and Welch’s test, with the advantage of increased power across all
conditions, due to using all available data.
Conclusion
A commonly occurring scenario when comparing two means is a combination of
paired observations and independent observations in both samples; this scenario is
referred to as partially overlapping samples. Standard procedures for analyzing
partially overlapping samples involve discarding observations and performing
either the paired samples t-test, or the independent samples t-test, or Welch’s test.
These approaches are less than desirable. In this paper, two new test statistics
making reference to the t-distribution are introduced and explored under a
comprehensive set of parameters, for normally distributed data. Under equal
variances, Tnew1 and Tnew2 are Type I error robust. In addition they are more
powerful than standard Type I error robust approaches considered in this paper.
When variances are equal, there is a slight power advantage of using Tnew1 over
Tnew2, particularly when sample sizes are not equal. Under unequal variances,
Tnew2 is the most powerful Type I error robust statistic considered in this paper.
We recommend that, when faced with a research problem involving partially
overlapping samples where MCAR can reasonably be assumed, the statistic Tnew1
could be used when it is known that variances are equal. Under the same
conditions, when equal variances cannot be assumed, the statistic Tnew2 could
be used.
A mixed model procedure using REML is not fully Type I error robust. In
those scenarios in which this procedure is Type I error robust, the power is similar
to that of Tnew1 and Tnew2.
The proposed test statistics for partially overlapping samples provide a real
alternative method for analysis for normally distributed data, which could also be
used for the formation of confidence intervals for the true difference in two means.
References
Bedeian, A. G., & Feild, H. S. (2002). Assessing group change under
conditions of anonymity and overlapping samples. Nursing Research, 51(1), 63-65.
doi: 10.1097/00006199-200201000-00010
Bhoj, D. (1978). Testing equality of means of correlated variates with
missing observations on both responses. Biometrika, 65(1), 225-228. doi:
10.1093/biomet/65.1.225
Bradley, J. V. (1978). Robustness? British Journal of Mathematical and
Statistical Psychology, 31(2), 144-152. doi: 10.1111/j.2044-8317.1978.tb00581.x
Derrick, B., Dobson-McKittrick, A., Toher, D., & White, P. (2015). Test
statistics for comparing two proportions with partially overlapping samples.
Journal of Applied Quantitative Methods, 10(3).
Derrick, B., Toher, D., & White, P. (2016). Why Welch’s test is Type I error
robust. The Quantitative Methods for Psychology, 12(1), 30-38. doi:
10.20982/tqmp.12.1.p030
Ekbohm, G. (1976). On comparing means in the paired case with
incomplete data on both responses. Biometrika, 63(2), 299-304. doi:
10.1093/biomet/63.2.299
Fay, M. P., & Proschan, M. A. (2010). Wilcoxon-Mann-Whitney or t-test?
On assumptions for hypothesis tests and multiple interpretations of decision rules.
Statistics Surveys, 4(1). doi: 10.1214/09-SS051
Fisher, R. A. (1925). Statistical methods for research workers. New Delhi,
India: Genesis Publishing Pvt. Ltd.
Fradette, K., Keselman, H. J., Lix, L., Algina, J., & Wilcox, R. (2003).
Conventional and robust paired and independent samples t-tests: Type I error and
power rates. Journal of Modern Applied Statistical Methods, 2(2), 481-496. doi:
10.22237/jmasm/1067646120
Kenney, J. F., & Keeping, E. S. (1951). Mathematics of statistics, Pt. 2 (2nd
ed.). Princeton, NJ: Van Nostrand.
Kim, B. S., Kim, I., Lee, S., Kim, S., Rha, S. Y., & Chung, H. C. (2005).
Statistical methods of translating microarray data into clinically relevant
diagnostic information in colorectal cancer. Bioinformatics, 21(4), 517-528. doi:
10.1093/bioinformatics/bti029
Social Psychology in World War II, Vol. I). Princeton, NJ: Princeton University
Press.
Zimmerman, D. W. (1997). A note on the interpretation of the paired
samples t-test. Journal of Educational and Behavioral Statistics, 22(3), 349-360.
doi: 10.3102/10769986022003349
Zimmerman, D. W., & Zumbo, B. D. (1993). Significance testing of
correlation using scores, ranks, and modified ranks. Educational and
Psychological Measurement, 53(4), 897-904. doi:
10.1177/0013164493053004003
Zimmerman, D. W., & Zumbo, B. D. (2009). Hazards in choosing between
pooled and separate-variances t tests. Psicológica: Revista de Metodología y
Psicología Experimental, 30(2), 371-390.
Zumbo, B. D. (2002). An adaptive inference strategy: The case of auditory
data. Journal of Modern Applied Statistical Methods, 1(1), 60-68. doi:
10.22237/jmasm/1020255000