Points of significance - Nonparametric tests
Nonparametric tests robustly compare skewed or ranked data.

We have seen that the t-test is robust with respect to assumptions about normality and equivariance¹ and thus is widely applicable. There is another class of methods, nonparametric tests, more suitable for data that come from skewed distributions or have a discrete or ordinal scale. Nonparametric tests such as the sign and Wilcoxon rank-sum tests relax distribution assumptions and are therefore easier to justify, but they come at the cost of lower sensitivity owing to less information inherent in their assumptions. For small samples, the performance of these tests is also constrained because their P values are only coarsely sampled and may have a large minimum. Both issues are mitigated by using larger samples.

These tests work analogously to their parametric counterparts: a test statistic and its distribution under the null are used to assign significance to observations. We compare in Figure 1 the one-sample t-test² to a nonparametric equivalent, the sign test (though more sensitive and sophisticated variants exist), using a putative sample X whose source distribution we cannot readily identify (Fig. 1a). The null hypothesis of the sign test is that the sample median mX is equal to the proposed median, M = 0.4. The test uses the number of sample values larger than M as its test statistic, W; under the null we expect to see as many values below the median as above, with the exact probability given by the binomial distribution (Fig. 1c). The median is a more useful descriptor than the mean for asymmetric and otherwise irregular distributions. The sign test makes no assumptions about the distribution, only that sample values be independent. If we propose that the population median is M = 0.4 and we observe X, we find W = 5 (Fig. 1b). The chance of observing a value of W under the null that is at least as extreme (W ≤ 1 or W ≥ 5) is P = 0.22, using both tails of the binomial distribution (Fig. 1c). To limit the test to whether the median of X was biased towards values larger than M, we would consider only the area for W ≥ 5 in the right tail to find P = 0.11.

The P value of 0.22 from the sign test is much higher than that from the t-test (P = 0.04), reflecting that the sign test is less sensitive. This is because it is not influenced by the actual distance between the sample values and M: it measures only ‘how many’ instead of ‘how much’. Consequently, it needs larger sample sizes or more supporting evidence than the t-test. For the example of X, to obtain P < 0.05 we would need to have all values larger than M (W = 6). Its large P values and straightforward application make the sign test a useful diagnostic. Take, for example, a hypothetical situation slightly different from that in Figure 1, where P > 0.05 is reported for the case where a treatment has lowered blood pressure in 6 out of 6 subjects. You may think this P seems implausibly large, and you’d be right because the equivalent scenario for the sign test (W = 6, n = 6) gives a two-tailed P = 0.03.

To compare two samples, the Wilcoxon rank-sum test is widely used and is sometimes referred to as the Mann-Whitney or Mann-Whitney-Wilcoxon test. It tests whether the samples come from distributions with the same median. It doesn’t assume normality, but as a test of equality of medians, it requires both samples to come from distributions with the same shape. The Wilcoxon test is one of many methods that reduce the dynamic range of values by converting them to their ranks in the list of ordered values pooled from both samples (Fig. 2a). The test statistic, W, is the degree to which the sum of ranks is larger than the lowest possible in the sample with the lower ranks (Fig. 2b). We expect that a sample from a population with a smaller median will be converted to a set of smaller ranks.

Figure 2 | Many nonparametric tests are based on ranks. (a) Sample comparisons of X vs. Y and X vs. Z start with ranking pooled values and identifying the ranks in the smaller-sized sample (e.g., 1, 3, 4, 5 for Y; 1, 2, 3, 6 for Z). Error bars show sample mean and s.d., and sample medians are shown by vertical dotted lines. (b) The Wilcoxon rank-sum test statistic W is the difference between the sum of ranks and the smallest possible observed sum. (c) For small sample sizes the exact distribution of W can be calculated. For samples of size (6, 4), there are only 210 different rank combinations corresponding to 25 distinct values of W.
[Figure 2 panel annotations: X vs. Y, W = 13 – 10 = 3; X vs. Z, R = 1 + 2 + 3 + 6, W = 12 – 10 = 2, PXZ = 0.04. Axes: sample values from 0 to 1.0; W from 0 to 24.]

Because there is a finite number (210) of combinations of rank-ordering for X (nX = 6) and Y (nY = 4), we can enumerate all outcomes of the test and explicitly construct the distribution of W (Fig. 2c) to assign a P value to W. The smallest value of W = 0 occurs when all values in one sample are smaller than those in the other. When they are all larger, the statistic reaches a maximum, W = nXnY = 24. For X versus Y, W = 3, and there are 14 of 210 test outcomes with W ≤ 3 or W ≥ 21. Thus, PXY = 14/210 = 0.067. For X versus Z, W = 2, and PXZ = 8/210 = 0.038. For cases in which both samples are larger than 10, W is approximately normal, and we can obtain the P value from a z-test of (W – μW)/σW, where μW = n1(n1 + n2 + 1)/2 and σW = √(μW n2/6).
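
The exact P values quoted for the sign test and the Wilcoxon rank-sum test both come from counting equally likely outcomes under the null. The sketch below reproduces that counting with the Python standard library. The function names (`sign_test_p`, `wilcoxon_w_counts`, `wilcoxon_p`) are illustrative and do not come from the column, and the rank-sum part assumes there are no tied values.

```python
from itertools import combinations
from math import comb


def sign_test_p(w, n, two_tailed=True):
    """Exact sign-test P value for w of n sample values above the proposed median."""
    # Under the null, the count of values above the median is binomial(n, 0.5).
    if two_tailed:
        # Outcomes at least as far from n/2 as the observed count, in both tails.
        distance = abs(w - n / 2)
        ks = [k for k in range(n + 1) if abs(k - n / 2) >= distance]
    else:
        # Right tail only: counts at least as large as the one observed.
        ks = range(w, n + 1)
    return sum(comb(n, k) for k in ks) / 2 ** n


def wilcoxon_w_counts(n_small, n_large):
    """Count the rank arrangements giving each W (rank sum minus its minimum)."""
    total = n_small + n_large
    min_sum = n_small * (n_small + 1) // 2
    counts = {}
    # Every choice of ranks for the smaller sample is equally likely under the null.
    for ranks in combinations(range(1, total + 1), n_small):
        w = sum(ranks) - min_sum
        counts[w] = counts.get(w, 0) + 1
    return counts


def wilcoxon_p(w, n_small, n_large):
    """Exact two-tailed P value for an observed W (assumes no ties)."""
    counts = wilcoxon_w_counts(n_small, n_large)
    w_max = n_small * n_large
    w_low = min(w, w_max - w)
    extreme = sum(c for v, c in counts.items() if v <= w_low or v >= w_max - w_low)
    return extreme / sum(counts.values())


print(round(sign_test_p(5, 6), 2))                    # 0.22
print(round(sign_test_p(5, 6, two_tailed=False), 2))  # 0.11
print(round(sign_test_p(6, 6), 2))                    # 0.03
print(round(wilcoxon_p(3, 4, 6), 3))                  # 0.067 (X vs. Y)
print(round(wilcoxon_p(2, 4, 6), 3))                  # 0.038 (X vs. Z)
```

Full enumeration is practical only for small samples because the number of arrangements grows quickly, which is where the normal approximation becomes the more convenient route.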
Figure 1 | A sample can be easily tested against a reference value using the sign test without any assumptions about the population distribution. (a) Sample X (n = 6) is tested against a reference M = 0.4. Sample mean is shown with s.d. (sX) and s.e.m. (sX̄) error bars. mX is sample median. (b) The t-statistic compares the sample mean to M in units of s.e.m. The sign test’s W is the number of sample values larger than M. (c) Under the null, t follows Student’s t-distribution with five degrees of freedom, whereas W is described by the binomial with 6 trials and P = 0.5. Two-tailed P values are shown.
[Figure 1 panel annotations: (a) Sample; (b) Calculate test statistic: one-sample t-test, t = (X̄ – M)/sX̄ = (0.72 – 0.40)/0.11 = 2.84; sign test, W = count(Xi > M) = 5; (c) Determine P value: Student’s t, P = 0.04; binomial, P = 0.22. Axes: sample values from 0 to 1.0; t from –4 to 4; W from 0 to 6.]

The ability to enumerate all outcomes of the test statistic makes calculating the P value straightforward (Figs. 1c and 2c), but there is an important consequence: there will be a minimum P value, Pmin. Depending on the size of samples, Pmin can be relatively large. For comparisons of samples of size nX = 6 and nY = 4 (Fig. 2a), Pmin = 1/210 = 0.005 for a one-tailed test, or 0.01 for a two-tailed test, corresponding to W = 0. Moreover, because there are only 25 distinct values of W (Fig. 2c), only two other two-tailed P values are <0.05: P = 0.02 (W = 1) and P = 0.038 (W = 2). The next-largest P value (W = 3) is P = 0.07. Because there is no P with value 0.05, the test cannot be set to reject the null at a type I rate of 5%. Even if we test at α = 0.05, we will be rejecting the null at the next lower P value, for an effective type I error rate of 3.8%.
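
The coarse set of attainable P values for samples of size (6, 4) can be listed directly. This is a minimal sketch rather than a reference implementation: the helper name `attainable_two_tailed_p` is hypothetical, ties are assumed absent, and the enumeration is only feasible for small samples.

```python
from itertools import combinations
from math import comb


def attainable_two_tailed_p(n1, n2):
    """Sorted distinct two-tailed P values of the exact rank-sum test (no ties)."""
    total = comb(n1 + n2, n1)          # number of equally likely rank arrangements
    min_sum = n1 * (n1 + 1) // 2       # smallest possible rank sum
    w_max = n1 * n2                    # largest possible W
    counts = [0] * (w_max + 1)
    for ranks in combinations(range(1, n1 + n2 + 1), n1):
        counts[sum(ranks) - min_sum] += 1
    p_values = set()
    cumulative = 0
    for w in range(w_max // 2 + 1):
        cumulative += counts[w]
        # W is symmetric about w_max / 2, so both tails hold the same count;
        # cap at 1 where the two tails meet in the middle.
        p_values.add(min(1.0, 2 * cumulative / total))
    return sorted(p_values)


p_values = attainable_two_tailed_p(6, 4)
below_alpha = [p for p in p_values if p < 0.05]
print([round(p, 3) for p in below_alpha])  # [0.01, 0.019, 0.038]
print(round(max(below_alpha), 3))          # effective type I error rate: 0.038
```

The smallest entry equals 2 divided by the number of rank arrangements, so the floor on P can be checked from the sample sizes alone before any data are collected.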
We will discuss how this issue affects test performance for small samples further on. In fact, it may even be impossible to reach significance at α = 0.05 because there is a limited number of ways in which small samples can vary in the context of ranks, and no outcome of the test happens less than 5% of the time. For example, samples of size 4 and 3 offer only 35 arrangements of ranks and a two-tailed Pmin = 2/35 = 0.057. Contrast this to the t-test, which can produce any P value because the test statistic can take on an infinite number of values.

This has serious implications in multiple-testing scenarios discussed in the previous column³. Recall that when N tests are performed, multiple-testing corrections will scale the smallest P value to NP. In the same way as a test may never yield a significant result (Pmin > α), applying multiple-testing correction may also preclude it (NPmin > α). For example, making N = 6 comparisons on samples such as X and Y shown in Figure 2a (nX = 6, nY = 4) will never yield an adjusted P value lower than α = 0.05 because Pmin = 0.01 > α/N. To achieve two-tailed significance at α = 0.05 across N = 10, 100 or 1,000 tests, we require sample sizes that produce at least 400, 4,000 or 40,000 distinct rank combinations. This is achieved for sample pairs of size (5, 6), (7, 8) and (9, 9), respectively.

The P values from the Wilcoxon test (PXY = 0.07, PXZ = 0.04) in Figure 2a appear to be in conflict with those obtained from the t-test (PXY = 0.04, PXZ = 0.06). The two methods tell us contradictory information, or do they? As mentioned, the Wilcoxon test concerns the median, whereas the t-test concerns the mean. For asymmetric distributions, these values can be quite different, and it is conceivable that the medians are the same but the means are different. The t-test does not identify the difference in means of X and Z as significant because the standard deviation, sZ, is relatively large owing to the influence of the sample’s largest value (0.81). Because the t-test reacts to any change in any sample value, the presence of outliers can easily influence its outcome when samples are small. For example, simply increasing the largest value in X (1.00) by 0.3 will increase sX from 0.28 to 0.35 and result in a PXY value that is no longer significant at α = 0.05. This change does not alter the Wilcoxon P value because the rank scheme remains unaltered. This insensitivity to changes in the data, outliers and typical effects alike, reduces the sensitivity of rank methods.

… (n = 5 and 25) in the presence and absence of an effect (Fig. 3). At n = 5, Wilcoxon FPR = 0.032 < α because this is the largest P value it can produce smaller than α, not because the test inherently performs better. We can always reach this FPR with the t-test by setting α = 0.032, where we’ll find that it will still have slightly higher power than a Wilcoxon test that rejects at this rate. At n = 5, Wilcoxon performs better for discrete sampling: the power (0.43) is essentially the same as the t-test’s (0.46), but the FDR is lower. When both tests are applied at α = 0.032, Wilcoxon power (0.43) is slightly higher than t-test power (0.39). The differences between the tests for n = 25 diminish because the number of arrangements of ranks is extremely large and the normal approximation to sample means is more accurate. However, one case stands out: in the presence of skew (e.g., exponential distribution), Wilcoxon power is much higher than that of the t-test, particularly for continuous sampling. This is because the majority of values are tightly spaced and ranks are more sensitive to small shifts. Skew affects t-test FPR and power in a complex way, depending on whether one- or two-tailed tests are performed and the direction of the skew relative to the direction of the population shift that is being studied⁴.

Nonparametric methods represent a more cautious approach and remove the burden of assumptions about the distribution. They apply naturally to data that are already in the form of ranks or degree of preference, for which numerical differences cannot be interpreted. Their power is generally lower, especially in multiple-testing scenarios. However, when data are very skewed, rank methods reach higher power and are a better choice than the t-test.

Corrected after print 23 May 2014.

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.

Martin Krzywinski & Naomi Altman

1. Krzywinski, M. & Altman, N. Nat. Methods 11, 215–216 (2014).
2. Krzywinski, M. & Altman, N. Nat. Methods 10, 1041–1042 (2013).
3. Krzywinski, M. & Altman, N. Nat. Methods 11, 355–356 (2014).
4. Reineke, D.M., Baggett, J. & Elfessi, A. J. Stat. Educ. 11 (2003).

Martin Krzywinski is a staff scientist at Canada’s Michael Smith Genome Sciences Centre. Naomi Altman is a Professor of Statistics at The Pennsylvania State University.
In the version of this article initially published, the expression X (nX = 6) was incorrectly written as X (nY = 6). The error has been corrected
in the HTML and PDF versions of the article.
© 2014 Nature America, Inc. All rights reserved.
npg
nature methods