Points of significance - Nonparametric tests

Nonparametric tests robustly compare skewed or ranked data.


this month

We have seen that the t-test is robust with respect to assumptions about normality and equivariance¹ and thus is widely applicable. There is another class of methods, nonparametric tests, that is more suitable for data that come from skewed distributions or have a discrete or ordinal scale. Nonparametric tests such as the sign and Wilcoxon rank-sum tests relax distribution assumptions and are therefore easier to justify, but they come at the cost of lower sensitivity owing to less information inherent in their assumptions. For small samples, the performance of these tests is also constrained because their P values are only coarsely sampled and may have a large minimum. Both issues are mitigated by using larger samples.

These tests work analogously to their parametric counterparts: a test statistic and its distribution under the null are used to assign significance to observations. We compare in Figure 1 the one-sample t-test² to a nonparametric equivalent, the sign test (though more sensitive and sophisticated variants exist), using a putative sample X whose source distribution we cannot readily identify (Fig. 1a). The null hypothesis of the sign test is that the sample median mX is equal to the proposed median, M = 0.4. The test uses the number of sample values larger than M as its test statistic, W. Under the null we expect to see as many values below the median as above, with the exact probability given by the binomial distribution (Fig. 1c). The median is a more useful descriptor than the mean for asymmetric and otherwise irregular distributions. The sign test makes no assumptions about the distribution, only that sample values be independent. If we propose that the population median is M = 0.4 and we observe X, we find W = 5 (Fig. 1b). The chance of observing a value of W under the null that is at least as extreme (W ≤ 1 or W ≥ 5) is P = 0.22, using both tails of the binomial distribution (Fig. 1c). To limit the test to whether the median of X was biased towards values larger than M, we would consider only the area for W ≥ 5 in the right tail to find P = 0.11.

The P value of 0.22 from the sign test is much higher than that from the t-test (P = 0.04), reflecting that the sign test is less sensitive. This is because it is not influenced by the actual distance between the sample values and M: it measures only ‘how many’ instead of ‘how much’. Consequently, it needs larger sample sizes or more supporting evidence than the t-test. For the example of X, to obtain P < 0.05 we would need to have all values larger than M (W = 6). Its large P values and straightforward application make the sign test a useful diagnostic. Take, for example, a hypothetical situation slightly different from that in Figure 1, where P > 0.05 is reported for the case where a treatment has lowered blood pressure in 6 out of 6 subjects. You may think this P seems implausibly large, and you’d be right because the equivalent scenario for the sign test (W = 6, n = 6) gives a two-tailed P = 0.03.

To compare two samples, the Wilcoxon rank-sum test is widely used and is sometimes referred to as the Mann-Whitney or Mann-Whitney-Wilcoxon test. It tests whether the samples come from distributions with the same median. It doesn’t assume normality, but as a test of equality of medians, it requires both samples to come from distributions with the same shape. The Wilcoxon test is one of many methods that reduce the dynamic range of values by converting them to their ranks in the list of ordered values pooled from both samples (Fig. 2a). The test statistic, W, is the degree to which the sum of ranks is larger than the lowest possible in the sample with the lower ranks (Fig. 2b). We expect that a sample from a population with a smaller median will be converted to a set of smaller ranks.

[Figure 2 panels: (a) Assign ranks. (b) Calculate test statistic: for X vs. Y, R = 1 + 3 + 4 + 5 = 13 and W = R − nY(nY + 1)/2 = 13 − 10 = 3; for X vs. Z, R = 1 + 2 + 3 + 6 = 12 and W = 12 − 10 = 2. (c) Determine P value: PXY = 0.07, PXZ = 0.04; the W axis runs from 0 to 24.]

Figure 2 | Many nonparametric tests are based on ranks. (a) Sample comparisons of X vs. Y and X vs. Z start with ranking pooled values and identifying the ranks in the smaller-sized sample (e.g., 1, 3, 4, 5 for Y; 1, 2, 3, 6 for Z). Error bars show sample mean and s.d., and sample medians are shown by vertical dotted lines. (b) The Wilcoxon rank-sum test statistic W is the difference between the sum of ranks and the smallest possible observed sum. (c) For small sample sizes the exact distribution of W can be calculated. For samples of size (6, 4), there are only 210 different rank combinations corresponding to 25 distinct values of W.

Because there is a finite number (210) of combinations of rank-ordering for X (nX = 6) and Y (nY = 4), we can enumerate all outcomes of the test and explicitly construct the distribution of W (Fig. 2c) to assign a P value to W. The smallest value of W = 0 occurs when all values in one sample are smaller than those in the other. When they are all larger, the statistic reaches a maximum, W = nXnY = 24. For X versus Y, W = 3, and there are 14 of 210 test outcomes with W ≤ 3 or W ≥ 21. Thus, PXY = 14/210 = 0.067. For X versus Z, W = 2, and PXZ = 8/210 = 0.038. For cases in which both samples are larger than 10, W is approximately normal, and we can obtain the P value from a z-test of (W − μW)/σW, where μW = n1(n1 + n2 + 1)/2 and σW = √(μWn2/6). (This μW is the expected rank sum of the sample of size n1; once the smallest possible sum is subtracted, as in Figure 2b, the mean becomes n1n2/2 and σW is unchanged.)
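The exact P values quoted above for the sign test and the rank-sum test can be reproduced by brute-force enumeration. The following Python sketch is illustrative only and is not the authors' code. The function names `sign_test_p`, `ranksum_null` and `ranksum_p` are made up for this example, only the standard library is used, and tied values are assumed to be absent. The two-tailed P value is taken here as the total null probability of all outcomes at least as far from the center of the distribution as the observed W, which matches the tail definitions used in the text (for example, W ≤ 3 or W ≥ 21).

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations
from math import comb


def sign_test_p(w, n, two_tailed=True):
    """Exact sign-test P value for w of n values above the proposed median."""
    center = Fraction(n, 2)
    if two_tailed:
        # Outcomes at least as far from n/2 as the observed count.
        ks = [k for k in range(n + 1) if abs(k - center) >= abs(w - center)]
    else:
        # Right tail only: at least w values above the proposed median.
        ks = range(w, n + 1)
    return Fraction(sum(comb(n, k) for k in ks), 2 ** n)


def ranksum_null(n_small, n_large):
    """Count each value of W over all equally likely rank assignments."""
    total = n_small + n_large
    r_min = n_small * (n_small + 1) // 2  # smallest possible rank sum
    return Counter(
        sum(ranks) - r_min
        for ranks in combinations(range(1, total + 1), n_small)
    )


def ranksum_p(w, n_small, n_large):
    """Exact two-tailed P value for an observed W (assumes no ties)."""
    counts = ranksum_null(n_small, n_large)
    center = Fraction(n_small * n_large, 2)
    extreme = sum(c for v, c in counts.items()
                  if abs(v - center) >= abs(w - center))
    return Fraction(extreme, sum(counts.values()))


if __name__ == "__main__":
    print(float(sign_test_p(5, 6)))                    # two-tailed, W = 5
    print(float(sign_test_p(5, 6, two_tailed=False)))  # right tail only
    null = ranksum_null(4, 6)
    print(sum(null.values()), len(null))               # outcomes, distinct W
    print(float(ranksum_p(3, 4, 6)), float(ranksum_p(2, 4, 6)))
```

With the values from the text, the sign-test calls evaluate to 14/64 and 7/64 (0.22 and 0.11 after rounding), and the rank-sum calls to 14/210 and 8/210 for X versus Y and X versus Z. Exact fractions are used so that the counts can be compared directly with the figures quoted in the text.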
[Figure 1 panels: one-sample t-test, t = (X̄ − M)/sX̄ = (0.72 − 0.40)/0.11 = 2.84, compared with Student’s t, P = 0.04; sign test, W = count(Xi > M) = 5, compared with the binomial, P = 0.22.]

Figure 1 | A sample can be easily tested against a reference value using the sign test without any assumptions about the population distribution. (a) Sample X (n = 6) is tested against a reference M = 0.4. Sample mean is shown with s.d. (sX) and s.e.m. (sX̄) error bars. mX is sample median. (b) The t-statistic compares X̄ to M in units of s.e.m. The sign test’s W is the number of sample values larger than M. (c) Under the null, t follows Student’s t-distribution with five degrees of freedom, whereas W is described by the binomial with 6 trials and P = 0.5. Two-tailed P values are shown.

The ability to enumerate all outcomes of the test statistic makes calculating the P value straightforward (Figs. 1c and 2c), but there is an important consequence: there will be a minimum P value, Pmin. Depending on the size of samples, Pmin can be relatively large. For comparisons of samples of size nX = 6 and nY = 4 (Fig. 2a), Pmin = 1/210 = 0.005 for a one-tailed test, or 0.01 for a two-tailed test, corresponding to W = 0. Moreover, because there are only 25 distinct values of W (Fig. 2c), only two other two-tailed P values are <0.05: P = 0.02 (W = 1) and P = 0.038 (W = 2). The next-largest P value (W = 3) is P = 0.07. Because there is no P with value 0.05, the test cannot be set to reject the null at a type I rate of 5%. Even if we test at α = 0.05, we will be rejecting the null at the

next lower P, for an effective type I error of 3.8%. We will see how this affects test performance for small samples further on. In fact, it may even be impossible to reach significance at α = 0.05 because there is a limited number of ways in which small samples can vary in the context of ranks, and no outcome of the test happens less than 5% of the time. For example, samples of size 4 and 3 offer only 35 arrangements of ranks and a two-tailed Pmin = 2/35 = 0.057. Contrast this to the t-test, which can produce any P value because the test statistic can take on an infinite number of values.

This has serious implications in multiple-testing scenarios discussed in the previous column³. Recall that when N tests are performed, multiple-testing corrections will scale the smallest P value to NP. In the same way as a test may never yield a significant result (Pmin > α), applying multiple-testing correction may also preclude it (NPmin > α). For example, making N = 6 comparisons on samples such as X and Y shown in Figure 2a (nX = 6, nY = 4) will never yield an adjusted P value lower than α = 0.05 because Pmin = 0.01 > α/N. To achieve two-tailed significance at α = 0.05 across N = 10, 100 or 1,000 tests, we require sample sizes that produce at least 400, 4,000 or 40,000 distinct rank combinations. This is achieved for sample pairs of size (5, 6), (7, 8) and (9, 9), respectively.

The P values from the Wilcoxon test (PXY = 0.07, PXZ = 0.04) in Figure 2a appear to be in conflict with those obtained from the t-test (PXY = 0.04, PXZ = 0.06). The two methods tell us contradictory information, or do they? As mentioned, the Wilcoxon test concerns the median, whereas the t-test concerns the mean. For asymmetric distributions, these values can be quite different, and it is conceivable that the medians are the same but the means are different. The t-test does not identify the difference in means of X and Z as significant because the standard deviation, sZ, is relatively large owing to the influence of the sample’s largest value (0.81). Because the t-test reacts to any change in any sample value, the presence of outliers can easily influence its outcome when samples are small. For example, simply increasing the largest value in X (1.00) by 0.3 will increase sX from 0.28 to 0.35 and result in a PXY value that is no longer significant at α = 0.05. This change does not alter the Wilcoxon P value because the rank scheme remains unaltered. This insensitivity to changes in the data, outliers and typical effects alike, reduces the sensitivity of rank methods.

The fact that the output of a rank test is driven by the probability that a value drawn from distribution A will be smaller (or larger) than one drawn from B without regard to their absolute difference has an interesting consequence: we cannot use this probability (pairwise preferences, in general) to impose an order on distributions. Consider a case of three equally prevalent diseases for which treatment A has cure times of 2, 2 and 5 days for the three diseases, and treatment B has 1, 4 and 4. Without treatment, each disease requires 3 days to cure; let’s call this control C. Treatment A is better than C for the first two diseases but not the third, and treatment B is better only for the first. Can we determine which of the three options (A, B, C) is better? If we try to answer this using the probability of observing a shorter time to cure, we find P(A < C) = 67% and P(C < B) = 67% but also that P(B < A) = 56%: a rock-paper-scissors scenario.

The question about which test to use does not have an unqualified answer: both have limitations. To illustrate how the t- and Wilcoxon tests might perform in a practical setting, we compared their false positive rate (FPR), false discovery rate (FDR) and power at α = 0.05 for different sampling distributions and sample sizes (n = 5 and 25) in the presence and absence of an effect (Fig. 3). At n = 5, Wilcoxon FPR = 0.032 < α because this is the largest P value it can produce smaller than α, not because the test inherently performs better. We can always reach this FPR with the t-test by setting α = 0.032, where we’ll find that it will still have slightly higher power than a Wilcoxon test that rejects at this rate. At n = 5, Wilcoxon performs better for discrete sampling: the power (0.43) is essentially the same as the t-test’s (0.46), but the FDR is lower. When both tests are applied at α = 0.032, Wilcoxon power (0.43) is slightly higher than t-test power (0.39). The differences between the tests for n = 25 diminish because the number of arrangements of ranks is extremely large and the normal approximation to sample means is more accurate. However, one case stands out: in the presence of skew (e.g., exponential distribution), Wilcoxon power is much higher than that of the t-test, particularly for continuous sampling. This is because the majority of values are tightly spaced and ranks are more sensitive to small shifts. Skew affects t-test FPR and power in a complex way, depending on whether one- or two-tailed tests are performed and the direction of the skew relative to the direction of the population shift that is being studied⁴.

[Figure 3 panels: rows for the normal, exponential and uniform distributions; columns for the test (t, W) and the effect, grouped by sample size and sampling method (continuous and discrete at n = 5 and n = 25); each panel reports FPR, FDR and power.]

Figure 3 | The Wilcoxon rank-sum test can outperform the t-test in the presence of discrete sampling or skew. Data were sampled from three common analytical distributions with μ = 1 (dotted lines) and σ = 1 (gray bars, μ ± σ). Discrete sampling was simulated by rounding values to the nearest integer. The FPR, FDR and power of Wilcoxon tests (black lines) and t-tests (colored bars) are shown for 100,000 sample pairs for each combination of sample size (n = 5 and 25), effect chance (0 and 10%) and sampling method. In the absence of an effect, both sample values were drawn from a given distribution type with μ = 1. With effect, the distribution for the second sample was shifted by d (d = 1.4 for n = 5; d = 0.57 for n = 25). The effect size was chosen to yield 50% power for the t-test in the normal noise scenario. Two-tailed P at α = 0.05.

Nonparametric methods represent a more cautious approach and remove the burden of assumptions about the distribution. They apply naturally to data that are already in the form of ranks or degree of preference, for which numerical differences cannot be interpreted. Their power is generally lower, especially in multiple-testing scenarios. However, when data are very skewed, rank methods reach higher power and are a better choice than the t-test.

Corrected after print 23 May 2014.

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.

Martin Krzywinski & Naomi Altman

1. Krzywinski, M. & Altman, N. Nat. Methods 11, 215–216 (2014).
2. Krzywinski, M. & Altman, N. Nat. Methods 10, 1041–1042 (2013).
3. Krzywinski, M. & Altman, N. Nat. Methods 11, 355–356 (2014).
4. Reineke, D.M., Baggett, J. & Elfessi, A. J. Stat. Educ. 11 (2003).

Martin Krzywinski is a staff scientist at Canada’s Michael Smith Genome Sciences Centre. Naomi Altman is a Professor of Statistics at The Pennsylvania State University.
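The minimum P value argument from the multiple-testing discussion in the text can be checked with a few lines of arithmetic. The Python sketch below is an illustration under stated assumptions, not the authors' code: the helper names `two_tailed_p_min` and `can_reach_significance` are invented here, ties are assumed to be absent, and the multiple-testing correction is taken to be the simple scaling of P to NP that the text describes.

```python
from fractions import Fraction
from math import comb


def two_tailed_p_min(n1, n2):
    """Smallest two-tailed P value of the exact rank-sum test (no ties)."""
    # Only the two most extreme rank arrangements out of C(n1 + n2, n1).
    return Fraction(2, comb(n1 + n2, n1))


def can_reach_significance(n1, n2, n_tests=1, alpha=Fraction(5, 100)):
    """True if the best-case outcome satisfies N * Pmin <= alpha."""
    return n_tests * two_tailed_p_min(n1, n2) <= alpha


if __name__ == "__main__":
    print(float(two_tailed_p_min(4, 3)))            # sizes 4 and 3
    print(can_reach_significance(6, 4, n_tests=6))  # N = 6 comparisons
    for n_tests, sizes in [(10, (5, 6)), (100, (7, 8)), (1000, (9, 9))]:
        n_arrangements = comb(sum(sizes), sizes[0])
        print(n_tests, sizes, n_arrangements,
              can_reach_significance(*sizes, n_tests=n_tests))
```

Under these assumptions the requirement N × Pmin ≤ α is the same as needing at least 2N/α rank arrangements, which is where the thresholds of 400, 4,000 and 40,000 in the text come from.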

468 | VOL.11 NO.5 | MAY 2014 | nature methods
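The rock-paper-scissors example from the discussion of pairwise preferences can be verified by counting pairs of cure times. This is a minimal Python sketch, not the authors' calculation: it assumes that each disease is equally likely and that the two cure times being compared are drawn independently, which is one reading of the text that reproduces its quoted percentages. The name `prob_shorter` is made up for this illustration.

```python
from fractions import Fraction
from itertools import product

# Cure times in days for the three equally prevalent diseases (from the text).
A = [2, 2, 5]
B = [1, 4, 4]
C = [3, 3, 3]  # no treatment


def prob_shorter(x, y):
    """Probability that a time drawn from x is strictly shorter than one from y."""
    pairs = list(product(x, y))
    return Fraction(sum(a < b for a, b in pairs), len(pairs))


if __name__ == "__main__":
    comparisons = {"P(A < C)": (A, C), "P(C < B)": (C, B), "P(B < A)": (B, A)}
    for label, (x, y) in comparisons.items():
        print(label, f"{float(prob_shorter(x, y)):.0%}")
```

The three probabilities come out as 6/9, 6/9 and 5/9, so each option is preferred to the next in a cycle and no consistent ordering of A, B and C exists.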


corrigenda

Corrigendum: Nonparametric tests


Martin Krzywinski & Naomi Altman
Nat. Methods 11, 467–468 (2014); published online 29 April 2014; corrected after print 23 May 2014

In the version of this article initially published, the expression X (nX = 6) was incorrectly written as X (nY = 6). The error has been corrected
in the HTML and PDF versions of the article.
© 2014 Nature America, Inc. All rights reserved.

nature methods
