Statistics Guide
Statistical analyses for laboratory and clinical researchers
Harvey Motulsky
© 1999-2005 GraphPad Software, Inc. All rights reserved.
Third printing February 2005
Tip: Heed the first rule of computers: Garbage in, garbage out.
While this volume provides far more background information than most program
manuals, it cannot entirely replace the need for statistics books and consultants.
GraphPad provides free technical support when you encounter problems with the
program. Email us at [email protected]. However, we can only provide limited free
help with choosing statistical tests or interpreting the results (consulting and on-site
teaching can sometimes be arranged for a fee).
Confidence intervals
The best way to use data from a sample to make inferences about the population is to
compute a confidence interval (CI).
Let's consider the simplest example. You measure something (say weight or
concentration) in a small sample, and compute the mean. That mean is very unlikely to
equal the population mean. The size of the likely discrepancy depends on the size and
variability of the sample. If your sample is small and variable, the sample mean is likely to
be quite far from the population mean. If your sample is large with little scatter, the
sample mean will probably be very close to the population mean. Statistical calculations
combine sample size and variability (standard deviation) to generate a CI for the
population mean. As its name suggests, the confidence interval is a range of values.
The interpretation of a 95% CI is quite straightforward. If you accept certain assumptions
(discussed later in this book for each kind of analysis), there is a 95% chance that the 95%
CI of the mean you calculated contains the true population mean. In other words, if you
generate many 95% CIs from many samples, you'll expect the 95% CI to include the true
population mean in 95% of the cases and not to include the population mean value in the
other 5%. Since you don't know the population mean (unless you work with simulated
data), you won't know whether a particular confidence interval contains the true
population mean or not. All you know is that there is a 95% chance that the population
mean lies within the 95% CI.
The concept is general. You can calculate the 95% CI for almost any value you compute
when you analyze data, including the difference between the group means, a proportion,
the ratio of two proportions, the best-fit slope of linear regression, and a best-fit value of
an EC50 determined by nonlinear regression.
Limitations of statistics
The statistical model is simple: Extrapolate from the sample you collected to a more
general situation, assuming that each value in your sample was randomly and
independently selected from a large population. The problem is that the statistical
inferences can only apply to the population from which your samples were obtained, but
you often want to make conclusions that extrapolate even beyond that large population.
For example, you perform an experiment in the lab three times. All the experiments used
the same cell preparation, the same buffers, and the same equipment. Statistical
inferences let you make conclusions about what would happen if you repeated the
experiment many more times with that same cell preparation, those same buffers, and the
same equipment. You probably want to extrapolate further to what would happen if
someone else repeated the experiment with a different source of cells, freshly made buffer,
and different instruments. Unfortunately, statistical calculations can't help with this
further extrapolation. You must use scientific judgment and common sense to make
inferences that go beyond the limitations of statistics. Thus, statistical logic is only part of
data interpretation.
[Figure: Histogram of 1,000 simulated weights from a single pipetting step. X axis: weight in milligrams (9.5 to 10.5); Y axis: number of experiments (0 to 75).]
The average weight is 10 milligrams, the weight of 10 µL of water (at least on earth). The
distribution is flat, with no hint of a Gaussian distribution.
[Figure: Histogram of weights summed over two pipetting steps. X axis: weight in milligrams (19.0 to 21.0); Y axis: number of experiments (0 to 100).]
Each pipetting step has a flat random error. Add them up, and the distribution is not flat.
For example, you’ll get weights near 21 mg only if both pipetting steps err substantially in
the same direction, and that is rare.
Now let’s extend this to ten pipetting steps, and look at the distribution of the sums.
[Figure: Histogram of weights summed over ten pipetting steps, 1,000 experiments. X axis: weight in milligrams (97 to 103); Y axis: number of experiments (0 to 100).]
The distribution looks a lot like an ideal Gaussian distribution. Repeat the experiment
15,000 times rather than 1,000 and you get even closer to a Gaussian distribution.
[Figure: The same simulation repeated 15,000 times. X axis: weight in milligrams (97 to 103); Y axis: number of experiments (0 to 1000).]
This simulation demonstrates a principle that can also be mathematically proven. Scatter
will approximate a Gaussian distribution if your experimental scatter has numerous
sources that are additive and of nearly equal weight, and the sample size is large.
The Gaussian distribution is a mathematical ideal. Few biological distributions, if any,
really follow the Gaussian distribution. The Gaussian distribution extends from negative
infinity to positive infinity. If the weights in the example above really were to follow a
Gaussian distribution, there would be some chance (albeit very small) that the weight is
negative. Since weights can’t be negative, the distribution cannot be exactly Gaussian. But
it is close enough to Gaussian to make it OK to use statistical methods (like t tests and
regression) that assume a Gaussian distribution.
What is a P value?
Suppose that you've collected data from two samples of animals treated with different
drugs. You've measured an enzyme in each animal's plasma, and the means are different.
You want to know whether that difference is due to an effect of the drug – whether the two
populations have different means. Observing different sample means is not enough to
persuade you to conclude that the populations have different means. It is possible that the
populations have the same mean (i.e., that the drugs have no effect on the enzyme you are
measuring) and that the difference you observed between sample means occurred only by
chance. There is no way you can ever be sure if the difference you observed reflects a true
difference or if it simply occurred in the course of random sampling. All you can do is
calculate probabilities.
Statistical calculations can answer this question: In an experiment of this size, if the
populations really have the same mean, what is the probability of observing at least as
large a difference between sample means as was, in fact, observed? The answer to this
question is called the P value.
The P value is a probability, with a value ranging from zero to one. If the P value is small
enough, you’ll conclude that the difference between sample means is unlikely to be due to
chance. Instead, you’ll conclude that the populations have different means.
What is a null hypothesis?
When statisticians discuss P values, they use the term null hypothesis. The null hypothesis
simply states that there is no difference between the groups. Using that term, you can
define the P value to be the probability of observing a difference as large as or larger than
you observed if the null hypothesis were true.
Common misinterpretation of a P value
Many people misunderstand P values. If the P value is 0.03, that means that there is a 3%
chance of observing a difference as large as you observed even if the two population
means are identical (the null hypothesis is true). It is tempting to conclude, therefore, that
there is a 97% chance that the difference you observed reflects a real difference between
populations and a 3% chance that the difference is due to chance. However, this would be
an incorrect conclusion. What you can say is that random sampling from identical
populations would lead to a difference smaller than you observed in 97% of experiments
and larger than you observed in 3% of experiments. This distinction may be clearer after
you read A Bayesian perspective on interpreting statistical significance on page 21.
The two-tail P value answers this question: Assuming the null hypothesis is true, what is
the chance that randomly selected samples would have means as far apart as (or further
than) you observed in this experiment with either group having the larger mean?
To interpret a one-tail P value, you must predict which group will have the larger mean
before collecting any data. The one-tail P value answers this question: Assuming the null
hypothesis is true, what is the chance that randomly selected samples would have means
as far apart as (or further than) observed in this experiment with the specified group
having the larger mean?
A one-tail P value is appropriate only when previous data, physical limitations or common
sense tell you that a difference, if any, can only go in one direction. The issue is not
whether you expect a difference to exist – that is what you are trying to find out with the
experiment. The issue is whether you should interpret increases and decreases in the
same manner.
You should only choose a one-tail P value when both of the following are true.
•	You predicted which group will have the larger mean (or proportion) before you collected any data.
•	If the other group ends up with the larger mean – even if it is quite a bit larger – you would have attributed that difference to chance.
It is usually best to use a two-tail P value for these reasons:
•	The relationship between P values and confidence intervals is easier to understand with two-tail P values.
•	Some tests compare three or more groups, which makes the concept of tails inappropriate (more precisely, the P values have many tails). A two-tail P value is more consistent with the P values reported by these tests.
Choosing a one-tail P value can pose a dilemma. What would you do if you chose to use a
one-tail P value, observed a large difference between means, but the “wrong” group had
the larger mean? In other words, the observed difference was in the opposite direction to
your experimental hypothesis. To be rigorous, you must conclude that the difference is
due to chance, even if the difference is huge. While tempting, it is not fair to switch to a
two-tail P value or to reverse the direction of the experimental hypothesis. You avoid this
situation by always using two-tail P values.
Statistical power
Type II errors and power
If you compare two treatments and your study concludes there is “no statistically
significant difference”, you should not necessarily conclude that the treatment was
ineffective. It is possible that the study missed a real effect because you used a small
sample or your data were quite variable. In this case you made a Type II error — obtaining
a “not significant” result when, in fact, there is a difference.
When interpreting the results of an experiment that found no significant difference, you
need to ask yourself how much power the study had to find various hypothetical
differences (had they existed). The power depends on the sample size and amount of
variation within the groups, where variation is quantified by the standard deviation (SD).
Here is a precise definition of power: Start with the assumption that two population
means differ by a certain amount, but have the same SD. Now assume that you perform
many experiments with the sample size you used, and calculate a P value for each
experiment. Power is the fraction of these experiments that would lead to statistically
significant results, i.e., would have a P value less than alpha (the largest P value you deem
“significant”, usually set to 0.05).
Example of power calculations
Motulsky et al. asked whether people with hypertension (high blood pressure) had altered
numbers of α2-adrenergic receptors on their platelets (Clinical Science 64:265-272, 1983).
There are many reasons to think that autonomic receptor numbers may be altered in
hypertensives. We studied platelets because they are easily accessible from a blood
sample. The results are shown here:
Variable                                     Hypertensives   Controls
Number of subjects                           18              17
Mean receptor number (receptors per cell)    257             263
Standard Deviation                           59.4            86.6
The two means were almost identical, and a t test gave a very high P value. We concluded
that the platelets of hypertensives do not have an altered number of α2 receptors.
What was the power of this study to find a difference (if there was one)? The answer
depends on how large the difference really is. Prism does not compute power, but the
companion program GraphPad StatMate does. Here are the results shown as a graph.
[Figure: Percent power as a function of the difference between means (receptors per platelet, 0 to 125).]
If the true difference between means was 50.58, then this study had only 50% power to
find a statistically significant difference. In other words, if hypertensives really averaged
51 more receptors per cell, you’d find a statistically significant difference in about half of
studies of this size, but you would not find a statistically significant difference in the other
half of the studies. This is about a 20% change (51/257), large enough that it could
possibly have a physiological impact.
If the true difference between means was 84 receptors/cell, then this study had 90%
power to find a statistically significant difference. If hypertensives really had such a large
difference, you’d find a statistically significant difference in 90% of studies this size and
you would not find a significant difference in the other 10% of studies.
All studies have low power to find small differences and high power to find large
differences. However, it is up to you to define “low” and “high” in the context of the
experiment and to decide whether the power was high enough for you to believe the
negative results. If the power is too low, you shouldn’t reach a firm conclusion until the
study has been repeated with more subjects. Most investigators aim for 80% or 90%
power to detect a difference.
Since this study had only a 50% power to detect a difference of 20% in receptor number
(50 sites per platelet, a large enough difference to possibly explain some aspects of
hypertension physiology), the negative conclusion is not solid.
A. Prior probability = 10%
                             Drug really works   Drug really doesn't work   Total
P<0.05, "significant"        80                  45                         125
P>0.05, "not significant"    20                  855                        875
Total                        100                 900                        1000

B. Prior probability = 80%
                             Drug really works   Drug really doesn't work   Total
P<0.05, "significant"        640                 10                         650
P>0.05, "not significant"    160                 190                        350
Total                        800                 200                        1000

C. Prior probability = 1%
                             Drug really works   Drug really doesn't work   Total
P<0.05, "significant"        8                   50                         58
P>0.05, "not significant"    2                   940                        942
Total                        10                  990                        1000
The totals at the bottom of each column are determined by the prior probability – the
context of your experiment. The prior probability equals the fraction of the experiments
that are in the leftmost column. To compute the number of experiments in each row, use
the definition of power and alpha. Of the drugs that really work, you won’t obtain a P
value less than 0.05 in every case. You chose a sample size to obtain a power of 80%, so
80% of the truly effective drugs yield “significant” P values and 20% yield “not significant”
P values. Of the drugs that really don’t work (middle column), you won’t get “not
significant” results in every case. Since you defined statistical significance to be “P<0.05”
(alpha=0.05), you will see a significant result in 5% of experiments performed with drugs
that are really inactive and a “not significant” result in the other 95%.
If the P value is less than 0.05, so the results are “statistically significant”, what is the
chance that the drug is, in fact, active? The answer is different for each experiment.
For experiment A, the chance that the drug is really active is 80/125 or 64%. If you
observe a statistically significant result, there is a 64% chance that the difference is real
and a 36% chance that the difference simply arose in the course of random sampling. For
experiment B, there is a 98.5% chance that the difference is real. In contrast, if you
observe a significant result in experiment C, there is only a 14% chance that the result is
real and an 86% chance that it is due to random sampling. For experiment C, the vast
majority of “significant” results are due to chance.
Your interpretation of a "statistically significant" result depends on the context of the
experiment. You can't interpret a P value in a vacuum. Interpreting results requires
common sense, intuition, and judgment.
What is an outlier?
When analyzing data, you'll sometimes find that one value is far from the others. Such a
value is called an outlier, a term that is usually not defined rigorously. When you
encounter an outlier, you may be tempted to delete it from the analyses. First, ask yourself
these questions:
•	Was the value entered into the computer correctly? If there was an error in data entry, fix it.
•	Were there any experimental problems with that value? For example, if you noted that one tube looked funny, you have justification to exclude the value resulting from that tube without needing to perform any calculations.
•	Could the outlier be caused by biological diversity? If each value comes from a different person or animal, the outlier may be a correct value. It is an outlier not because of an experimental mistake, but rather because that individual may be different from the others. This may be the most exciting finding in your data!
If you answered “no” to those three questions, you have to decide what to do with the
outlier. There are two possibilities.
One possibility is that the outlier was due to chance. In this case, you should keep the
value in your analyses. The value came from the same distribution as the other values, so
should be included.
The other possibility is that the outlier was due to a mistake: bad pipetting, a voltage
spike, holes in filters, etc. Since including an erroneous value in your analyses will give
invalid results, you should remove it. In other words, the value comes from a different
population than the others and is misleading.
The problem, of course, is that you are rarely sure which of these possibilities is correct.
No mathematical calculation can tell you for sure whether the outlier came from the same
or different population than the others. Statistical calculations, however, can answer this
question: If the values really were all sampled from a Gaussian distribution, what is the
chance that you'd find one value as far from the others as you observed? If this probability
is small, then you will conclude that the outlier is likely to be an erroneous value, and you
have justification to exclude it from your analyses.
Statisticians have devised several methods for detecting outliers. All the methods first
quantify how far the outlier is from the other values. This can be the difference between
the outlier and the mean of all points, the difference between the outlier and the mean of
the remaining values, or the difference between the outlier and the next closest value.
Next, standardize this value by dividing by some measure of scatter, such as the SD of all
values, the SD of the remaining values, or the range of the data. Finally, compute a P value
answering this question: If all the values were really sampled from a Gaussian population,
what is the chance of randomly obtaining an outlier so far from the other values? If the P
value is small, you conclude that the deviation of the outlier from the other values is
statistically significant, and most likely from a different population.
Prism does not perform any sort of outlier detection. If you want to perform an outlier test
by hand, you can calculate Grubb's test, described below.
Consult the references below for larger tables. You can also calculate an approximate P
value as follows.
1. Calculate a t ratio from N (number of values in the sample) and Z (calculated for
the suspected outlier as shown above).
t = \sqrt{ \frac{N(N-2)Z^2}{(N-1)^2 - NZ^2} }
2. Determine the P value corresponding with that value of t with N-2 degrees of
freedom. Use the Excel formula =TDIST(t,df,2), substituting values for t and df
(the third parameter is 2, because you want a two-tailed P value).
3. Multiply the P value you obtain in step 2 by N. The result is an approximate P
value for the outlier test. This P value is the chance of observing one point so far
from the others if the data were all sampled from a Gaussian distribution. If Z is
large, this P value will be very accurate. With smaller values of Z, the calculated
P value may be too large.
The most that Grubbs' test (or any outlier test) can do is tell you that a value is unlikely to
have come from the same Gaussian population as the other values in the group. You then
need to decide what to do with that value. I would recommend removing significant
outliers from your calculations in situations where experimental mistakes are common
and biological variability is not a possibility. When removing outliers, be sure to
document your decision. Others feel that you should never remove an outlier unless you
noticed an experimental problem. Beware of a natural inclination to remove outliers that
get in the way of the result you hope for, but to keep outliers that enhance the result you
hope for.
If you use nonparametric tests, outliers will affect the results very little, so they do not
need to be removed.
If you decide to remove the outlier, you then may be tempted to run Grubbs' test again to
see if there is a second outlier in your data. If you do this, you cannot use the table shown
above. Rosner has extended the method to detecting several outliers in one sample. See
the first reference below for details.
Here are two references:
•	B. Iglewicz and D.C. Hoaglin, How to Detect and Handle Outliers (ASQC Basic References in Quality Control, Vol. 16), American Society for Quality Control, 1993.
•	V. Barnett and T. Lewis, Outliers in Statistical Data (Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics), John Wiley & Sons, 1994.
Statistical tests that are robust to the presence of outliers
Some statistical tests are designed so that the results are not altered much by the presence
of one or a few outliers. Such tests are said to be robust.
Nonparametric tests are robust. Most nonparametric tests compare the distribution of
ranks. This makes the test robust because the largest value has a rank of 1, but it doesn’t
matter how large that value is.
Other tests are robust to outliers because rather than assuming a Gaussian distribution,
they assume a much wider distribution where outliers are more common (so have less
impact).
6. Descriptive Statistics
Choose the descriptive statistics you want to determine. Note: CI means confidence interval.
If you formatted the Y columns for entry of replicate values (for example, triplicates),
Prism first averages the replicates in each row. It then calculates the column statistics on
the means, without considering the SD or SEM of each row or the number of replicates
you entered. If you enter 10 rows of triplicate data, the column statistics are calculated
from the 10 row means, not from the 30 individual values. If you format the Y columns for
entry of mean and SD (or SEM) values, Prism calculates column statistics for the means
and ignores the SD or SEM values you entered.
Note that this dialog also allows you to perform normality tests (details later in this
chapter) and compare the mean or median with a hypothetical value using a one-sample t
test or a Wilcoxon signed-rank test (details in the next chapter).
Prism calculates the SD using the equation below. (Each y_i is a value, y_mean is the
average, and N is the sample size.)

SD = \sqrt{ \frac{\sum (y_i - y_{mean})^2}{N-1} }
The standard deviation computed this way (with a denominator of N-1) is called the
sample SD, in contrast to the population SD which would have a denominator of N. Why
is the denominator N-1 rather than N? In the numerator, you compute the difference
between each value and the mean of those values. You don’t know the true mean of the
population; all you know is the mean of your sample. Except for the rare cases where the
sample mean happens to equal the population mean, the data will be closer to the sample
mean than they will be to the population mean. This means that the numerator will be too
small. So the denominator is reduced as well. It is reduced to N-1 because that is the
number of degrees of freedom in your data.
Defining degrees of freedom rigorously is beyond the scope of this book. When computing
the SD of a list of values, once you know the sample mean you can calculate the last value
from the other N-1 values, so statisticians say there are only N-1 degrees of freedom.
Standard error of the mean (SEM)
The standard error of the mean (SEM) quantifies the precision of the mean. It is a
measure of how far your sample mean is likely to be from the true population mean. The
SEM is calculated by this equation:
SEM = \frac{SD}{\sqrt{N}}
With large samples, the SEM is always small. By itself, the SEM is difficult to interpret. It
is easier to interpret the 95% confidence interval, which is calculated from the SEM.
The difference between the SD and SEM
It is easy to be confused about the difference between the standard deviation (SD) and the
standard error of the mean (SEM).
The SD quantifies scatter — how much the values vary from one another.
The SEM quantifies how accurately you know the true mean of the population. The SEM gets
smaller as your samples get larger. This makes sense, because the mean of a large sample is
likely to be closer to the true population mean than is the mean of a small sample.
The SD does not change predictably as you acquire more data. The SD quantifies the
scatter of the data, and increasing the size of the sample does not change the scatter. The
SD might go up, or it might go down; you can't predict. On average, the SD will stay the
same as sample size gets larger.
If the scatter is caused by biological variability, you probably will want to show the variation. In
this case, report the SD rather than the SEM. If you are using an in vitro system with no
biological variability, the scatter can only result from experimental imprecision. In this case,
you may not want to show the scatter, but instead show how well you have assessed the mean.
Report the mean and SEM, or the mean with 95% confidence interval.
You should choose to show the SD or SEM based on the source of the variability and the
point of the experiment. In practice, however, many scientists choose the SEM simply
because it is smaller and so creates shorter error bars.
Tip: Because the various methods to compute the 25th and 75th
percentiles give different results with small data sets, we suggest that you
only report the 25th and 75th percentiles for large data sets (N>100 is a
reasonable cut off). For smaller data sets, we suggest showing a column
scatter graph that shows every value.
Geometric mean
The geometric mean is the antilog of the mean of the logarithms of the values. This is
the same as taking the Nth root (where N is the number of points) of the product of all N
values. It is less affected by outliers than the mean. Prism also reports the 95% confidence
interval of the geometric mean.
Skewness and kurtosis
Skewness quantifies the asymmetry of a distribution. A symmetrical distribution has a
skewness of zero. An asymmetrical distribution with a long tail to the right (higher values)
has a positive skew. An asymmetrical distribution with a long tail to the left (lower values)
has a negative skew.
Kurtosis quantifies how closely the shape of a distribution follows the usual Gaussian
shape. A Gaussian distribution, by definition, has a kurtosis of 0. A distribution with more
values in the center and fewer in the tails has a negative kurtosis. A distribution with
fewer values in the center and more in the tails has a positive kurtosis.
Skewness and kurtosis are computed by these equations.
skewness = \frac{\sum (Y_i - \mu)^3}{N\sigma^3}

kurtosis = \frac{\sum (Y_i - \mu)^4}{N\sigma^4} - 3
First, decide on the scope of the calculations. If you have entered more than one data set
in the table, you have two choices. Most often, you’ll calculate a row total/mean for each
data set. The results table will have the same number of data sets as the input table. The
other choice is to calculate one row total/mean for the entire table. The results table will
then have a single data set containing the grand totals or means.
Then decide what to calculate: row totals, row means with SD, or row means with SEM.
To review the difference between SD and SEM see Interpreting descriptive statistics on
page 29.
If you have already averaged your data, format the data table for mean, SD (or SEM), and
N. With this format, you can’t pick nonparametric or paired tests, which require raw data.
Enter data on only one row.
Indexed data
Many statistics programs expect you to enter data in an indexed format, as shown below.
One column contains all the data, and the other column designates the group. Prism
cannot analyze data entered in index format. If you have indexed data from another
program, choose to “unstack” your data when you import it (an option on the Filter tab of
the Import data dialog). This rearranges your data to a format Prism can analyze. Read
the chapter on importing data in the Prism User's Guide. You can also insert indexed data
into Prism using the Paste Special command.
Checklist: Is the paired t test the right test for these data?
Before accepting the results of any statistical test, first think carefully about whether you
chose an appropriate test. Before accepting results from a paired t test, ask yourself these
questions. Prism can help you answer the first two questions listed below. You’ll have to
answer the others based on experimental design.
Are the differences distributed according to a Gaussian distribution?
The paired t test assumes that you have sampled your pairs of values from a
population of pairs where the difference between pairs follows a Gaussian
distribution. While this assumption is not too important with large samples, it is
important with small sample sizes. Prism tests for violations of this assumption, but
normality tests have limited utility (see page 32). If your data do not come from
Gaussian distributions, you have three options. Your best option is to transform the
values (perhaps to logs or reciprocals) to make the distributions more Gaussian.
Another choice is to use the Wilcoxon matched pairs nonparametric test instead of
the t test.
\log\left(\frac{treated}{control}\right) = \log(treated) - \log(control)
To perform a ratio t test with Prism, follow these steps (see a detailed example below).
1. Transform both columns to logarithms.
2. Perform a paired t test on the transformed results.
3. Interpret the P value: If there really were no differences between control and
treated values, what is the chance of obtaining a ratio as far from 1.0 as was
observed?
4. Prism also reports the confidence interval of the difference between means.
Since the data being analyzed are logs of the actual values, the difference
between means is the same as the mean of the log(ratio). Take the antilog of
each end of the interval (with a calculator) to compute the 95% confidence
interval of the ratio.
Note: Ratio t tests (like paired and unpaired t tests) are used to compare
two groups when the outcome is a continuous variable like blood
pressure or enzyme level. Don't confuse them with the analysis of a
contingency table, which is appropriate when there are only two possible
outcomes (the outcome is a binary variable).
If you perform a conventional paired t test, the P value is 0.07. The difference between
control and treated is not substantial or consistent enough to be statistically significant.
This makes sense because the paired t test looks at differences, and the differences are not
very consistent. The 95% confidence interval for the difference between control and
treated Km value is -0.72 to 9.72, which includes zero.
The ratios are much more consistent. It is not appropriate to analyze the ratios directly.
Because ratios are inherently asymmetrical, you’ll get a different answer depending on
whether you analyze the ratio of treated/control or control/treated. You’ll get different P
values testing the null hypothesis that the ratio really equals 1.0.
P value
One-way ANOVA compares three or more unmatched groups, based on the assumption
that the populations are Gaussian. The P value answers this question: If all the
populations really have the same mean (the treatments are ineffective), what is the chance
that randomly selected samples would have means as far apart (or further) as observed in
this experiment?
The key idea is that ANOVA partitions the variability among the values into one
component that is due to variability among group means (due to the treatment) and
another component that is due to variability within the groups (also called residual
variation). Variability within groups (within the columns) is quantified as the sum of
squares of the differences between each value and its group mean. This is the residual
sum-of-squares. Variation among groups (due to treatment) is quantified as the sum of
the squares of the differences between the group means and the grand mean (the mean of
all values in all groups). Adjusted for the size of each group, this becomes the treatment
sum-of-squares. Each sum-of-squares is associated with a certain number of degrees of
freedom (df, computed from number of subjects and number of groups), and the mean
square (MS) is computed by dividing the sum-of-squares by the appropriate number of
degrees of freedom.
The F ratio is the ratio of two mean square values. If the null hypothesis is true, you expect
F to have a value close to 1.0 most of the time. A large F ratio means that the variation
among group means is more than you’d expect to see by chance. You’ll see a large F ratio
both when the null hypothesis is wrong (the data are not sampled from populations with
the same mean) and when random sampling happened to end up with large values in
some groups and small values in others.
The P value answers this question: If the populations all have the same mean, what is the
chance that randomly selected groups would lead to an F ratio as big (or bigger) as the one
obtained in your experiment?
ANOVA table
The P value is calculated from the ANOVA table. With repeated-measures ANOVA, there
are three sources of variability: between columns (treatments), between rows
(individuals), and random (residual). The ANOVA table partitions the total sum-of-
squares into those three components. It then adjusts for the number of groups and
number of subjects (expressed as degrees of freedom) to compute two F ratios. The main
F ratio tests the null hypothesis that the column means are identical. The other F ratio
tests the null hypothesis that the row means are identical (this is the test for effective
matching). In each case, the F ratio is expected to be near 1.0 if the null hypothesis is true.
If F is large, the P value will be small.
The ANOVA calculations ignore any X values. You may wish to format the X column as
text in order to label your rows. Or you may omit the X column altogether, or enter
numbers.
You may leave some replicates blank and still perform ordinary two-way ANOVA (so long
as you enter at least one value in each row for each data set). However, Prism cannot
perform repeated-measures two-way ANOVA if any values are missing for any subject.
Prism can perform repeated-measures two-way ANOVA with different numbers of
subjects in each group, so long as you have complete data (at each time point or dose) for
each subject.
[Figure: Grouped bar graph of Response (0 to 200) for Men and Women, with separate bars for Before, During, and After within each group.]
•	Prism can only perform post tests within a row, comparing columns. When the data are entered as shown above, Prism can compute post tests comparing Before vs. During, Before vs. After, and During vs. After within each row. But Prism cannot compare Men vs. Women, since that is comparing two rows.
You could choose to enter the same data in an alternative manner like this:
[Figure: Grouped bar graph of Response (0 to 200) for Before, During, and After, with separate bars for Men and Women within each group.]
When entered this way, the post tests would compare men vs. women in each of the three
time points. The rest of the two-way ANOVA results will be identical no matter how you
enter your data.
Variable names
Label the two factors to make the output more clear. If you don’t enter names, Prism will
use the generic names “Column factor” and “Row factor”.
The table above shows example data testing the effects of three doses of drugs in control
and treated animals. The decision to use repeated measures depends on the experimental
design.
Here is an experimental design that would require analysis using repeated measures by
row: The experiment was done with six animals, two for each dose. The control values
were measured first in all six animals. Then you applied a treatment to all the animals and
made the measurement again. In the table above, the value at row 1, column A, Y1 (23)
came from the same animal as the value at row 1, column B, Y1 (28). The matching is by
row.
Here is an experimental design that would require analysis using repeated measures by
column: The experiment was done with four animals. First each animal was exposed to a
treatment (or placebo). After measuring the baseline data (dose=zero), you inject the first
dose and make the measurement again. Then inject the second dose and measure again.
The values in the first Y1 column (23, 34, and 43) were repeated measurements from the
same animal. The other three columns came from three other animals. The matching was
by column.
The term repeated measures is appropriate for those examples, because you made
repeated measurements from each animal. Some experiments involve matching but no
repeated measurements. The term randomized-block describes these kinds of
experiments. For example, imagine that the three rows were three different cell lines. All
the Y1 data came from one experiment, and all the Y2 data came from another experiment
performed a month later. The value at row 1, column A, Y1 (23) and the value at row 1,
column B, Y1 (28) came from the same experiment (same cell passage, same reagents).
The matching is by row. Randomized-block data are analyzed identically to repeated-
measures data. Prism uses the term repeated measures for any analysis where subjects
were matched, regardless of whether measurements were actually repeated in those
subjects.
It is also possible to design experiments with repeated measures in both directions. Here
is an example: The experiment was done with two animals. First you measured the
baseline (control, zero dose). Then you injected dose 1 and made the next measurement,
then dose 2 and measured again. Then you gave the animal the experimental treatment,
waited an appropriate period of time, and made the three measurements again. Finally,
you repeated the experiment with another animal (Y2). So a single animal provided data
from both Y1 columns (23, 34, 43 and 28, 41, 56). Prism cannot perform two-way ANOVA
with repeated measures in both directions, and so cannot analyze this experiment.
Don’t confuse replicates with repeated measures. Here is an example: The experiment was
done with six animals. Each animal was given one of two treatments at one of three doses.
The measurement was then made in duplicate. The value at row 1, column A, Y1 (23) came
from the same animal as the value at row 1, column A, Y2 (24). Since the matching is
The numerator is the difference between the mean response in the two data sets (usually
control and treated) at a particular row (usually dose or time point). The denominator
combines the number of replicates in the two groups at that dose with the mean square of
the residuals (sometimes called the mean square of the error), which is a pooled measure
of variability at all doses.
Statistical significance is determined by comparing the t ratio with the t distribution for
the number of df shown in the ANOVA table for MSresidual, applying the Bonferroni
correction for multiple comparisons. The Bonferroni correction lowers the P value that
you consider to be significant to 0.05 divided by the number of comparisons. This means
that if you have five rows of data with two columns, the P value has to be less than 0.05/5,
or 0.01, for any particular row in order to be considered significant with P<0.05. This
correction ensures that the 5% probability applies to the entire family of comparisons, and
not separately to each individual comparison.
Confidence intervals at each row are computed using this equation:
Span = t^{*} \cdot \sqrt{ MS_{residual} \left( \frac{1}{N_1} + \frac{1}{N_2} \right) }

95% CI: [(mean_2 - mean_1) - Span] to [(mean_2 - mean_1) + Span]
The critical value of t is abbreviated t* in that equation (not a standard abbreviation). Its
value does not depend on your data, only on your experimental design. It depends on the
number of degrees of freedom and the number of rows (number of comparisons).
Post tests following repeated-measures two-way ANOVA use exactly the same equation if
the repeated measures are by row. If the repeated measures are by column, use
(SS_{subject} + SS_{residual})/(DF_{subject} + DF_{residual}) instead of MS_{residual} in the
equation above, and set the number of degrees of freedom to the sum of DF_{subject} and
DF_{residual}.
How to think about results from two-way ANOVA
Two-way ANOVA partitions the overall variance of the outcome variable into three
components, plus a residual (or error) term.
Interaction
The null hypothesis is that there is no interaction between columns (data sets) and rows.
More precisely, the null hypothesis states that any systematic differences between
columns are the same for each row and that any systematic differences between rows are the
same for each column. If columns represent drugs and rows represent gender, then the
null hypothesis is that the differences between the drugs are consistent for men and
women.
The P value answers this question: If the null hypothesis is true, what is the chance of
randomly sampling subjects and ending up with as much (or more) interaction than you
have observed? Often the test of interaction is the most important of the three tests.
If you graph the data, as shown below, there is no interaction when the curves are
“parallel”. In the graph below, the left panel shows interaction between treatment and
gender, while the right panel shows no interaction.
[Figure: Four panels illustrating interaction. Top pair: Outcome for Men and Women, with interaction (left panel) and no interaction (right panel). Bottom pair: Response vs. log(Dose) curves, with interaction (left panel) and no interaction (right panel).]
If you entered only a single value for each row/column pair, it is impossible to test for
interaction between rows and columns. Instead, Prism assumes that there is no
interaction, and continues with the other calculations. Depending on your experimental
design, this assumption may or may not make sense. The assumption cannot be tested
without replicate values.
Column factor
The null hypothesis is that the mean of each column (totally ignoring the rows) is the
same in the overall population, and that all differences we see between column means are
due to chance. If columns represent different drugs, the null hypothesis is that all the
drugs produced the same effect. The P value answers this question: If the null hypothesis
is true, what is the chance of randomly obtaining column means as different (or more so)
than you have observed?
Row factor
The null hypothesis is that the mean of each row (totally ignoring the columns) is the
same in the overall population, and that all differences we see between row means are due
to chance. If the rows represent gender, the null hypothesis is that the mean response is
the same for men and women. The P value answers this question: If the null hypothesis is
true, what is the chance of randomly obtaining row means as different (or more so) than
you have observed?
Subject (matching)
For repeated-measures ANOVA, Prism tests the null hypothesis that the matching was not
effective. You expect a low P value if the repeated-measures design was effective in
controlling for variability between subjects. If the P value was high, reconsider your
decision to use repeated-measures ANOVA.
How to think about post tests following two-way ANOVA
If you have two data sets (columns), Prism can perform post tests to compare the two
means from each row.
For each row, Prism reports the 95% confidence interval for the difference between the
two means. These confidence intervals adjust for multiple comparisons, so you can be
95% certain that all the intervals contain the true difference between means.
For each row, Prism also reports the P value testing the null hypothesis that the two
means are really identical. Again, the P value computations take into account multiple
comparisons. If there really are no differences, there is a 5% chance that any one (or
more) of the P values will be less than 0.05. The 5% probability applies to the entire family
of comparisons, not to each individual P value.
Lower confidence limit   Upper confidence limit   Conclusion
Trivial decrease         Important increase       You can't reach a strong conclusion. The data are consistent with the treatment causing a trivial decrease, no change, or a large increase. To reach a clear conclusion, you need to repeat the experiment with more subjects.
Important decrease       Trivial increase         You can't reach a strong conclusion. The data are consistent with a trivial increase, no change, or a decrease that may be large enough to be important. You can't make a clear conclusion without repeating the experiment with more subjects.
Important decrease       Large increase           You can't reach any conclusion. Repeat the experiment with a much larger sample size.
Prism will compare the two columns at each row. For this example, Prism's built-in post
tests compare the two columns at each row, thus asking:
•	Do the control responses differ between men and women?
•	Do the agonist-stimulated responses differ between men and women?
•	Do the responses in the presence of both agonist and antagonist differ between men and women?
If these questions match your experimental aims, Prism's built-in post tests will suffice.
Many biological experiments compare two responses at several time points or doses, and
Prism built-in post tests are just what you need for these experiments. But if you have
more than two columns, Prism won't perform any post tests. And even with two columns,
you may wish to perform different post tests. In this example, based on the experimental
design above, you might want to ask these questions:
•	For men, is the agonist-stimulated response different than control? (Did the agonist work?)
•	For women, is the agonist-stimulated response different than control?
(mean_1 - mean_2) - t^{*} \sqrt{ MS_{residual} \left( \frac{1}{N_1} + \frac{1}{N_2} \right) }
\quad \text{to} \quad
(mean_1 - mean_2) + t^{*} \sqrt{ MS_{residual} \left( \frac{1}{N_1} + \frac{1}{N_2} \right) }
Variable        Explanation
mean1, mean2    The mean of the two groups you want to compare.
N1, N2          The sample size of the two groups you want to compare.
t*              Critical value from the Student t distribution, defined using the
                Bonferroni correction for multiple comparisons. When making a
                single confidence interval, t* is the value of the t ratio that
                corresponds to a two-tailed P value of 0.05 (or whatever
                significance level you chose). If you are making six comparisons, t*
                is the t ratio that corresponds to a P value of 0.05/6, or 0.00833.
                Find the value using the Excel formula =TINV(0.00833,6), which
                equals 3.863. The first parameter is the significance level corrected
                for multiple comparisons; the second is the number of degrees of
                freedom for the ANOVA (residuals for regular two-way ANOVA,
                “subject” for repeated measures). The value of t* will be the same
                for each comparison. Its value depends on the degree of confidence
                you desire, the number of degrees of freedom in the ANOVA, and
                the number of comparisons you made.
The variables are the same as those used in the confidence interval calculations, but notice
the key difference. Here, you calculate a t ratio for each comparison, and then use it to
determine the significance level (as explained in the next paragraph). When computing a
confidence interval, you choose a confidence level (95% is standard) and use that to
determine a fixed value from the t distribution, which we call t*. Note that the numerator
is the absolute value of the difference between means, so the t ratio will always be positive.
To determine the significance level, compare the values of the t ratio computed for each
comparison against the standard values, which we abbreviate t*. For example, to
determine whether the comparison is significant at the 5% level (P<0.05), compare the t
ratios computed for each comparison to the t* value calculated for a confidence interval of
95% (equivalent to a significance level of 5%, or a P value of 0.05) corrected for the
number of comparisons and taking into account the number of degrees of freedom. As
shown above, this value is 3.863. If a t ratio is greater than t*, then that comparison is
significant at the 5% significance level. To determine whether a comparison is significant
at the stricter 1% level, calculate the t ratio corresponding to a confidence interval of 99%
(P value of 0.01) with six comparisons and six degrees of freedom. First divide 0.01 by 6
(number of comparisons), which is 0.001667. Then use the Excel formula
=TINV(0.001667,6) to find the critical t ratio of 5.398. Each comparison that has a t ratio
greater than 5.398 is significant at the 1% level.
Tip: All these calculations can be performed using a free QuickCalcs web
calculator at www.graphpad.com.
For this example, here are the values you need to do the calculations (or enter into the
web calculator).
Comparison Mean1 Mean2 N1 N2
1: Men. Agonist vs. control 176.0 98.5 2 2
2: Women. Agonist vs. control 206.5 100.0 2 2
3: Men. Agonist vs. Ag+Ant 176.0 116.0 2 2
4: Women. Agonist vs. Ag+Ant 206.5 121.0 2 2
5: Men Control vs. Ag+Ant 98.5 116.0 2 2
6: Women. Control vs. Ag+Ant 100.0 121.0 2 2
The calculations account for multiple comparisons. This means that the 95% confidence
level applies to all the confidence intervals. You can be 95% sure that all the intervals
include the true value. The 95% probability applies to the entire family of confidence
intervals, not to each individual interval. Similarly, if the null hypothesis were true (that
all groups really have the same mean, and all observed differences are due to chance)
there will be a 95% chance that all comparisons will be not significant, and a 5% chance
that any one or more of the comparisons will be deemed statistically significant with P<
0.05.
For the sample data, we conclude that the agonist increases the response in both men and
women. The combination of antagonist plus agonist decreases the response down to a
level that is indistinguishable from the control response.
Introduction to correlation
When two variables vary together, statisticians say that there is a lot of covariation or
correlation. The correlation coefficient, r, quantifies the direction and magnitude of
correlation.
Correlation is not the same as linear regression, but the two are related. Linear regression
finds the line that best predicts Y from X. Correlation quantifies how well X and Y vary
together. In some situations, you might want to perform both calculations.
Correlation only makes sense when both X and Y variables are outcomes you measure. If
you control X (often, you will have controlled variables such as time, dose, or
concentration), don’t use correlation, use linear regression.
Tip: Linear and nonlinear regression are explained in the companion book,
Fitting Biological Data to Models Using Linear and Nonlinear Regression.
Correlation calculations do not discriminate between X and Y, but rather quantify the
relationship between the two variables. Linear regression does discriminate between X
and Y. Linear regression finds the straight line that best predicts Y from X by minimizing
the sum of the square of the vertical distances of the points from the regression line. The
X and Y variables are not symmetrical in the regression calculations. Therefore only
choose regression, rather than correlation, if you can clearly define which variable is X
and which is Y.
Results of correlation
How correlation works
Correlation coefficient
The correlation coefficient, r, ranges from -1 to +1. The nonparametric Spearman
correlation coefficient, abbreviated rs, has the same range.
Value of r (or rs)   Interpretation
r = 0                The two variables do not vary together at all.
between 0 and 1      The two variables tend to increase or decrease together.
r = 1.0              Perfect correlation.
between 0 and -1     One variable increases as the other decreases.
r = -1.0             Perfect negative or inverse correlation.
If r or rs is far from zero, there are four possible explanations:
•	Changes in the X variable change the value of the Y variable.
•	Changes in the Y variable change the value of the X variable.
•	Changes in another variable influence both X and Y.
•	X and Y don’t really correlate at all, and you just happened to observe such a strong correlation by chance. The P value determines how often this could occur.
r2 (from correlation)
Perhaps the best way to interpret the value of r is to square it to calculate r2. Statisticians
call this quantity the coefficient of determination, but scientists call it r squared. It is
a value that ranges from zero to one, and is the fraction of the variance in the two
variables that is “shared”. For example, if r2=0.59, then 59% of the variance in X can be
explained by variation in Y. Likewise, 59% of the variance in Y can be explained by
variation in X. More simply, 59% of the variance is shared between X and Y.
Prism only calculates an r2 value from the Pearson correlation coefficient. It is not
appropriate to compute r2 from the nonparametric Spearman correlation coefficient.
Part C:
Categorical and Survival Data
p - 1.96\sqrt{ \frac{p(1-p)}{N} } \quad \text{to} \quad p + 1.96\sqrt{ \frac{p(1-p)}{N} }

where

p = \frac{\#\text{ of "successes"}}{\#\text{ of experiments}} = \frac{S}{N}
N is the number of experiments (or subjects), and S is the number of those experiments or
subjects with a particular outcome (termed “success”). This means the remaining N-S
subjects or experiments have the alternative outcome. Expressed as a fraction, success
The sensitivity, specificity and likelihood ratios are properties of the test. The positive and
negative predictive values are properties of both the test and the population you test. If
you use a test in two populations with different disease prevalence, the predictive values
will be different. A test that is very useful in a clinical setting (high predictive values) may
be almost worthless as a screening test. In a screening test, the prevalence of the disease is
much lower so the predictive value of a positive test will also be lower.
How to think about P values from a 2x2 contingency table
The P value answers this question: If there really is no association between the variable
defining the rows and the variable defining the columns in the overall population, what is
the chance that random sampling would result in an association as strong (or stronger) as
observed in this experiment? Equivalently, if there really is no association between rows
and columns overall, what is the chance that random sampling would lead to a relative
risk or odds ratio as far (or further) from 1.0 (or P1-P2 as far from 0.0) as observed in this
experiment?
Censored data
Creating a survival curve is not quite as easy as it sounds. The difficulty is that you rarely
know the survival time for each subject. Some subjects may still be alive at the end of the
study. You know how long they have survived so far, but don’t know how long they will
survive in the future. Others drop out of the study -- perhaps they moved to a different
city or wanted to take a medication disallowed on the protocol. You know they survived a
certain length of time on the protocol, but don’t know how long they survived after that
(or do know, but can’t use the information because they weren’t following the
experimental protocol). In both cases, information about these patients is said to be
censored.
You definitely don’t want to eliminate these censored observations from your analyses.
You just need to account for them properly. Prism uses the method of Kaplan and Meier
to create survival curves while accounting for censored data.
Note. The term “censored” seems to imply that the subject did something
inappropriate. But that isn’t the case. The term “censored” simply means
that you don’t know, or can’t use, survival beyond a certain point.
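For readers who like to see the arithmetic, here is a minimal Python sketch of the Kaplan-Meier product-limit calculation described above, using Prism's coding (1 = death, 0 = censored) and made-up data:

    import numpy as np

    time  = np.array([2., 3., 3., 5., 7., 8., 8., 9.])   # made-up survival times
    event = np.array([1,  1,  0,  1,  0,  1,  1,  0])    # 1 = died, 0 = censored

    survival = 1.0
    for t in np.unique(time):
        deaths  = np.sum((time == t) & (event == 1))
        at_risk = np.sum(time >= t)          # subjects still being followed at t
        if deaths:
            survival *= 1 - deaths / at_risk  # product-limit step
            print(f"t = {t:g}: fraction surviving = {survival:.3f}")

Censored subjects never trigger a step down, but they do shrink the number at risk for all later times, which is how the method "accounts for" them.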
Prism creates a data table formatted to enter X values as numbers, and Y values with no
subcolumns. Enter each subject on a separate row in the table, following these guidelines:
• Enter time until censoring or death (or whatever event you are tracking) in the X column. Use any convenient unit, such as days or months. Time zero does not have to be some specified calendar date; rather, it is defined as the date each subject entered the study, so it may be a different calendar date for different subjects. In some clinical studies, time zero spans several calendar years as patients are enrolled. You have to enter duration as a number, and cannot enter dates directly.
• Enter “1” into the Y column for rows where the subject died (or the event occurred) at the time shown in the X column. Enter “0” into the rows where the subject was censored at that time. Every subject in a survival study either dies or is censored.
• Enter subjects for each treatment group into a different Y column. Place the X values for the subjects of the first group at the top of the table with the Y codes in the first Y column. Place the X values for the second group of subjects beneath those for the first group (X values do not have to be sorted, and the X column may well contain the same value more than once). Place the corresponding Y codes in the second Y column, leaving the first column blank. In the example below, data for group A were entered in the first 14 rows, and data for group B started in row 15.
Note that the five control animals are each entered on a separate row, with the time entered as 28 (the number of days you observed the animals) and with Y entered as 0 to denote a censored observation. The observations on these animals are said to be censored because we only know that they lived for at least 28 days. We don’t know how much longer they will live because the study ended.
The five treated animals also are entered one per row, with Y=1 when they died and Y=0
for the two animals still alive at the end of the study.
Example of survival data from a clinical study
Here is a portion of the data collected in a clinical trial:
Enrolled Final date What happened Group
07-Feb-98 02-Mar-02 Died Treated
19-May-98 30-Nov-98 Died Treated
14-Nov-98 03-Apr-02 Died Treated
01-Dec-98 04-Mar-01 Died Control
04-Mar-99 04-May-01 Died Control
Prism does not allow you to enter beginning and ending dates. You must enter elapsed
time. You can calculate the elapsed time in Excel (by simply subtracting one date from the
other; Excel automatically presents the results as number of days).
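The same subtraction is straightforward in Python. Here is a minimal sketch using the first row of the example table above:

    from datetime import date

    enrolled = date(1998, 2, 7)    # 07-Feb-98
    final = date(2002, 3, 2)       # 02-Mar-02

    elapsed_days = (final - enrolled).days
    print(elapsed_days)            # this is the X value to enter into Prism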
Unlike many programs, you don’t enter a code for the treatment (control vs. treated, in this example) into a column in Prism. Instead you use a separate column for each treatment, and enter the codes for death or censoring into that column.
There are three different reasons for the censored observations in this study.
• Three of the censored observations are subjects still alive at the end of the study. We don’t know how long they will live.
• Subject 7 moved away from the area and thus left the study protocol. Even if we knew how much longer that subject lived, we couldn’t use the information since he was no longer following the study protocol. We know that subject 7 lived 733 days on the protocol and either don’t know, or know but can’t use the information, after that.
• Subject 10 died in a car crash. Different investigators handle this differently. Some define a death to be a death, no matter what the cause. Others would define a death from a clearly unrelated cause (such as a car crash) to be a censored observation. We know the subject lived 703 days on the treatment. We don’t know how much longer he would have lived on the treatment, since his life was cut short by a car accident.
Note: With automatic survival graphs, Prism assumes that deaths are
coded with “1” and censored observations with “0”. If you use a different
coding scheme, create survival curves manually.
You can usually leave the choices on this dialog set to their default value. It is rare to use a
code other than Y=0 for a censored subject and Y=1 for a death. The other choices on this
dialog determine the initial look of the survival curve, and you can change these later from
the graph (see the next section).
Note: Prism offers three places (Analysis parameters, Format Graph, and
Format Symbols & lines) to choose whether you want to tabulate and
graph fraction death, fraction survival, percent death, or percent
survival. If you make a change in any of these dialogs, it will also be
made in the other. You cannot choose one format for tabulating and
another for graphing (unless you repeat the analysis).
[Figure: a survival curve, plotted as percent survival (Y axis) versus months (X axis, 0 to 10).]
If survival exceeds 50% at the longest time points, then median survival cannot be
computed.
If the survival curve is horizontal at 50% survival, the median survival is ambiguous, and
different programs report median survival differently. Prism reports the average of the
first and last times at which survival is 50%.
[Figure: a survival curve with median survival = 3.5; percent survival (Y axis, 0 to 100) versus time (X axis, 0 to 7).]
When comparing two survival curves, Prism also reports the ratio of the median survival
times along with its 95% confidence interval. You can be 95% sure that the true ratio of
median survival times lies within that range.
Prism computes an approximate 95% confidence interval for the ratio of median survivals.
This calculation is based on an assumption that is not part of the rest of the survival
comparison. The calculation of the 95% CI of the ratio of median survivals assumes that the
survival curve follows an exponential decay. This means that the chance of dying in a
small time interval is the same early in the study and late in the study. In other words, it
assumes that the survival of patients or animals in your study follows the same model as
radioactive decay. If your survival data follow a very different pattern, then the values that
Prism reports for the 95% CI of the ratio of median survivals will not be correct.
Tip: Don’t focus on the 95% CI of the ratio of median survivals unless the
survival curves follow the same shape as an exponential decay.
Bland-Altman results
The first page shows the difference and average values used to create the plot.
The second results page shows the bias, or the average of the differences. The bias is
computed as the value determined by one method minus the value determined by the
other method. If one method is sometimes higher, and sometimes the other method is
higher, the average of the differences will be close to zero. If it is not close to zero, this
indicates that the two assay methods are producing different results.
The Bland-Altman plot graphs the average of the two values on each row on the X axis,
and the difference between the measurements (A-B) on the Y axis.
[Figure: a Bland-Altman plot; the difference (A−B) on the Y axis (−20 to 10) versus the average on the X axis.]
Prism automatically graphs these Bland-Altman results. We modified this graph a bit
using the Format Axes dialog box:
• Set the origin to be the lower left.
• Create a custom tick shown as a dotted line at the bias (Y=0.238, in this example).
• Offset the X and Y axes so that they do not touch.
Prism reports the bias in the results (table not shown here).
As expected (among controls) the two methods had very similar results on average, and
the bias (difference between the means) is only 0.24. In 95% of subjects the difference lies
between -13.9 and +13.4.
The authors of this study used these results simply as a control, and then went on to
investigate patients with mitral disease (not shown here).
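For reference, here is a minimal Python sketch of the Bland-Altman calculations described above (the bias and the range expected to cover roughly 95% of differences, assuming the differences are roughly Gaussian), using made-up paired measurements:

    import numpy as np

    a = np.array([10.2, 12.1, 9.8, 11.5, 10.9])   # method A, made-up values
    b = np.array([10.0, 12.5, 9.5, 11.9, 10.6])   # method B, made-up values

    diff = a - b                  # difference (A - B), plotted on the Y axis
    avg = (a + b) / 2             # average, plotted on the X axis
    bias = diff.mean()            # the "bias" Prism reports
    sd = diff.std(ddof=1)

    # roughly 95% of differences are expected within bias +/- 1.96 SD
    print(f"bias = {bias:.3f}, "
          f"limits: {bias - 1.96*sd:.3f} to {bias + 1.96*sd:.3f}")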
Remember: Sensitivity is the fraction of people with the disease that the
test correctly identifies as positive. Specificity is the fraction of people
without the disease that the test correctly identifies as negative.
Prism calculates the sensitivity and specificity using each value in the data table as the
cutoff value. This means that it calculates many pairs of sensitivity and specificity.
Prism tabulates sensitivity and 1−specificity, with 95% confidence intervals, for all possible cutoff values.
ROC graph
An ROC curve may help in analyzing this trade-off between sensitivity and specificity.
Prism creates the ROC graph automatically. You’ll only need to spend a few moments
polishing it.
Note: This graph is created from a data table, of which only the top
portion was shown earlier.
[Figure: the ROC graph created by Prism, plotting sensitivity versus 1−specificity for every possible cutoff value.]
When choosing which cutoff value you will use, don’t think only about the tradeoff of
sensitivity vs. specificity. Also consider the clinical consequences of false positive and false
negative results. The ROC curve can’t help with that.
Area under a ROC curve
The area under a ROC curve quantifies the overall ability of the test to discriminate
between those individuals with the disease and those without the disease. A truly useless
test (one no better at identifying true positives than flipping a coin) has an area of 0.5. A
perfect test (one that has zero false positives and zero false negatives) has an area of 1.00.
Your test will have an area between those two values.
While it is clear that the area under the curve is related to the overall ability of a test to
correctly identify normal versus abnormal, it is not so obvious how one interprets the area
itself. There is, however, a very intuitive interpretation. If patients have higher test values
than controls, then:
The area represents the probability that a randomly selected patient will have a higher
test result than a randomly selected control.
If patients tend to have lower test results than controls:
The area represents the probability that a randomly selected patient will have a lower
test result than a randomly selected control.
For example: If the area equals 0.80, on average, a patient will have a more abnormal test
result than 80% of the controls. If the test were perfect, every patient would have a more
abnormal test result than every control and the area would equal 1.00.
If the test were worthless, no better at identifying normal versus abnormal than chance,
then one would expect that half of the controls would have a higher test value than a
patient known to have the disease and half would have a lower test value. Therefore, the
area under the curve would be 0.5.
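This interpretation is easy to verify directly. The sketch below (plain Python with made-up values) computes the fraction of patient-control pairs in which the patient has the higher value, counting ties as half:

    import numpy as np

    patients = np.array([3.1, 4.5, 2.8, 5.0, 4.1])   # made-up test results
    controls = np.array([1.9, 2.5, 3.0, 2.2, 1.4])

    wins = (patients[:, None] > controls[None, :]).sum()
    ties = (patients[:, None] == controls[None, :]).sum()
    auc = (wins + 0.5 * ties) / (patients.size * controls.size)
    print(f"AUC = {auc:.3f}")   # P(random patient value > random control value)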
Note: The area under a ROC curve can never be less than 0.50. If the
area is first calculated as less than 0.50, Prism will reverse the definition
of abnormal from a higher test value to a lower test value. This
adjustment will result in an area under the curve that is greater than
0.50.
Prism also reports the standard error of the area under the ROC curve, as well as the 95%
confidence interval. These results are computed by a nonparametric method that does not
make any assumptions about the distributions of test results in the patient and control
groups. This method is described by Hanley, J.A., and McNeil, B. J. (1982). Radiology
143:29-36.
Interpreting the confidence interval is straightforward. If the patient and control groups
represent a random sampling of a larger population, you can be 95% sure that the
confidence interval contains the true area.
In the example above, the area is 0.946 with a 95% confidence interval extending from
0.8996 to 0.9938. This means that a randomly selected patient has a 94.6% chance of
having a higher test result than a randomly selected control.
Prism completes your ROC curve evaluation by reporting a P value and testing the null
hypothesis that the area under the curve really equals 0.50. In other words, the null
hypothesis is that the test diagnoses disease no better than flipping a coin. If your P value
is small, as it usually will be, you may conclude that your test actually does discriminate
between abnormal patients and normal controls. If the P value is large, your data provide no evidence that the diagnostic test discriminates any better than flipping a coin.
Comparing ROC curves
Prism does not compare ROC curves. It is, however, quite easy to compare two ROC
curves created with data from two different (unpaired) sets of patients and controls.
1. Do a separate ROC analysis of each data set.
2. For each curve, note the area and the standard error (SE) of the area that Prism reports.
3. Combine these values to calculate z:

$$z = \frac{\mathrm{Area}_1 - \mathrm{Area}_2}{\sqrt{SE_{\mathrm{Area}_1}^2 + SE_{\mathrm{Area}_2}^2}}$$
4. If you investigated many pairs of methods with indistinguishable ROC curves,
you would expect the distribution of z to be centered at zero with a standard
deviation of 1.0. To calculate a two-tailed P value, therefore, use the following
(Microsoft) Excel function:
=2*(1-NORMSDIST(z))
The method described above is appropriate when you compare two ROC curves with data
collected from different subjects. A different method is needed to compare ROC curves
when both laboratory tests were evaluated in the same group of patients and controls.
Prism does not compare paired-ROC curves. To account for the correlation between areas
under your two curves, use the method described by Hanley, J.A., and McNeil, B. J.
(1983). Radiology 148:839-843. Accounting for the correlation leads to a larger z value
and, thus, a smaller P value.
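For convenience, here is the same unpaired calculation as a minimal Python sketch (the areas and SEs are made-up examples); it mirrors the Excel formula above but takes the absolute value of z, so the two-tailed P value is correct whichever area is larger:

    from scipy.stats import norm

    area1, se1 = 0.946, 0.024   # made-up area and SE for the first curve
    area2, se2 = 0.862, 0.041   # made-up area and SE for the second curve

    z = (area1 - area2) / (se1**2 + se2**2) ** 0.5
    p_two_tailed = 2 * (1 - norm.cdf(abs(z)))
    print(f"z = {z:.2f}, P = {p_two_tailed:.4f}")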
Prism can only smooth data sets (and compute derivatives and integrals) where the X
values are equally spaced. The X values in the table may be formatted either as individual
numbers or as a sequence (you define the first value and the interval, and Prism fills in the
rest).
Smoothing a curve
If you import a curve from an instrument, you can smooth the data to improve the appearance of a graph. The purpose of smoothing is solely to improve the appearance of a graph. Since you lose data when you smooth a curve, you should not smooth a curve prior to nonlinear regression or other analyses.
There is no point to smoothing curves created by nonlinear regression, since they are
already smooth. It only makes sense to smooth curves collected from an instrument. It
can make sense to compute the derivative or integral of a perfect (smooth) when X is time.
Smoothing uses the method of Savitzky and Golay (Analytical Chemistry, 36:1627-1639, 1964) using a cubic equation. Each point in the curve is replaced by the weighted average of its nearest 5, 9, or 13 neighbors. The results table has a few fewer rows than the original data.
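Outside Prism, the same kind of smoothing is available in SciPy. A minimal sketch, assuming a simulated noisy signal and a 9-point cubic window:

    import numpy as np
    from scipy.signal import savgol_filter

    x = np.linspace(0, 10, 200)                        # equally spaced X values
    y = np.sin(x) + np.random.normal(0, 0.15, x.size)  # simulated noisy signal

    # cubic Savitzky-Golay smoothing over the nearest 9 points
    y_smooth = savgol_filter(y, window_length=9, polyorder=3)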
Note: If all values are above the baseline, then the preceding dialog
choices (except for definition of the baseline) are irrelevant. Prism will
find one peak, and report the area under the entire curve. Single-peak
assessment is very useful in analysis of pharmacokinetic data.
[Figure: two adjacent points on a curve (heights Y1 and Y2, separated by ∆X) define a trapezoid above the baseline; it has the same area as a rectangle of width ∆X and height (Y1+Y2)/2.]
Prism defines a curve as a series of connected X,Y points, with equally spaced X values.
The left portion of the figure shows two of these points and the baseline (dotted line). The
area under that portion of the curve, a trapezoid, is shaded.
The middle portion of the figure shows how Prism computes the area under the curve.
Since the two triangles in the middle panel have the same area, the area of the trapezoid
on the left (which we want to find out) is the same as the area of the rectangle on the right
(which is easier to calculate).
$$\text{Area} = \frac{\Delta X \,(Y_1 + Y_2)}{2}$$
Prism repeatedly uses this formula for each adjacent pair of points defining the curve.
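A minimal Python sketch of this repeated trapezoid calculation, with a made-up curve and a baseline of zero:

    import numpy as np

    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # equally spaced X values
    y = np.array([0.0, 2.0, 3.0, 2.5, 1.0])   # made-up Y values (baseline = 0)

    dx = np.diff(x)
    area = np.sum(dx * (y[:-1] + y[1:]) / 2)   # one trapezoid per adjacent pair
    print(f"area under the curve = {area:.2f}")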
Note: When Prism creates a curve for you, it generally defines the curve
as 150 line segments. You can increase or decrease this in the parameter
dialog for the analysis that created the curve.
[Figure: a frequency distribution histogram; frequency (Y axis, 0 to 15) versus height (X axis, 62 to 78).]
Prism offers other kinds of data manipulations in addition to transforms. See Subtracting
(or dividing by) baseline values on page 142, and Normalizing data on page 143.
To transform data with Prism:
1. Start from a data table, or a results table.
2. Click Analyze and choose Transform from the list of data manipulations.
Interchanging X and Y
When you choose a standard function, you can choose to interchange X and Y values and
also choose transforms of X or Y or both.
Some notes on interchanging X and Y values:
• Prism can interchange data on tables with more than one data set (more than one Y column), even though the results sheet has only a single X column. It does this by creating additional rows. The results will be staggered down the page with only one data set in any particular row.
• If you entered replicate Y values (or mean with SD or SEM), Prism interchanges X and Y by putting the mean Y value into the X column. Information about the scatter of Y is ignored.
• If you selected X or Y transforms (in addition to interchanging), Prism applies the transform to the data after interchanging X and Y. This means that the X transform is applied to data that were originally in the Y column, and the Y transform is applied to data originally in the X column.
Standard functions
Choose from one of these functions for transforming Y values (analogous functions are
available for X):
Rather than enter the value of K on this table, you can choose a value from a linked info table. If you have created a linked info table and it contains numbers, Prism will pop up a list of values you can choose from.
These transforms are very useful as a way to display data. They are much less useful as a
method to analyze data. You’ll get better results by using nonlinear regression on the
actual data. See the companion book on nonlinear curve fitting for details.
Here is the mathematical definition of each transform:
Function          X becomes    Y becomes
Eadie-Hofstee     Y/X          No change
Tip. Prism can also create Bland-Altman plots, which require a simple
transform of the data. However, this is not done via a transform, but
rather via a separate analysis. See Comparing Methods with a Bland-
Altman Plot on page 118.
User-defined transforms
At the top of the Parameters dialog for transforms, switch between built-in transforms
and user-defined transforms of X or Y. Select a user-defined transform you have used
before or enter a new one.
You can write equations so different data sets get different transforms. Put <B> in front of
a line in your transform that only applies to data set B. Put <~A> in front of a line that
applies to all data sets except data set A.
If you are transforming X values, you may use Y in the function. If the data table contains
several data sets (so has several Y values for a single X value), Prism will stagger the
results down the page, repeating X values as needed. The results for column A will appear
on top of the results table. Below that Prism will place the results for column B. For these
rows, column A will be empty.
You can precede a conditional expression with NOT, and can connect two conditional
expressions with AND or OR. Examples of conditional expressions:
Note: “<>” means not equal to, “<=” means less than or equal to, and “>=” means greater
than or equal to.
Here is an example:
Y = IF(Y<Y0, Y, Y*Y)
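For comparison, here is a minimal Python sketch of the same conditional transform (Y0 is an assumed example constant):

    import numpy as np

    y = np.array([-2.0, 0.5, 1.5, 3.0])
    y0 = 1.0                                      # assumed example constant

    y_transformed = np.where(y < y0, y, y * y)    # IF(Y<Y0, Y, Y*Y)
    print(y_transformed)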
Tip: Although the dialog is called Subtract or Divide a baseline, you can
also add or multiply two columns.
Click Analyze and choose built-in analyses. Then choose Remove Baseline from
the list of data manipulations to bring up this dialog. The choices are self-explanatory:
Normalizing data
Normalize the data to convert Y values from different data sets to a common scale. This is useful when you want to compare the shape or position (EC50) of two or more curves, and don't want to be distracted by different maximum and minimum values.
Investigators who analyze dose-response curves commonly normalize the data so all
curves begin at 0% and plateau at 100%. If you then fit a sigmoid dose-response curve to
the normalized data, be sure to set the top and bottom plateaus to constant values. If
you’ve defined the top and bottom of the curves by normalizing, you shouldn’t ask Prism
to fit those parameters.
To normalize, click Analyze and choose Built-in analyses. Then select Normalize
from the list of data manipulations to bring up this dialog.
To normalize between 0 and 100%, you must define these baselines. Define zero as the smallest value in each data set, the value in the first row in each data set, or a value you enter. Make a similar choice to define one hundred percent.
Tip: The Remove Baseline analysis lets you subtract (or divide) all values
by the mean of the first few rows. With some experimental designs, this
is the best way to normalize.
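A minimal Python sketch of this kind of normalization, using the smallest and largest values of a made-up data set as the 0% and 100% baselines:

    import numpy as np

    y = np.array([12.0, 15.0, 31.0, 58.0, 74.0, 80.0])   # made-up responses

    zero, hundred = y.min(), y.max()                     # the two baselines
    y_normalized = 100 * (y - zero) / (hundred - zero)   # now runs 0 to 100%
    print(y_normalized)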
Pruning rows
This analysis reduces the size of large data sets to speed curve fitting and graphing. Use it
to preprocess large data sets imported from an instrument. The pruning analysis starts
with a large data table and generates a shorter results table. Another way to deal with
large data sets is to decimate data while importing, so Prism only reads every tenth (or
some other number) row. See the chapter on importing data in the Prism User's Guide.
Note: After pruning, the project contains both the original data and the
pruned data. Therefore pruning increases the size of the project file. If you
don’t want the original data any more, you should go to that data table and
use the Delete Sheet command (on the Edit menu) to remove it.
To prune, click Analyze and choose Built-in analyses. Then choose Prune from the
list of data manipulations to bring up this dialog.
One choice is to remove all rows where X is too low or too high, and keep only rows where
X is between limits you enter.
The other choice is to average every K rows to produce one output row (you enter K). First Prism sorts the table by X (if not already sorted), then it averages every K rows to produce one output row. For example, if K=3, the first X value in the results table is the average of the X values of the first three rows. The first Y value of each data set is the average of the Y values in the first three rows. The second row in the results table is the average of rows 4 through 6, and so on.
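A minimal Python sketch of the averaging choice (the handling of leftover rows that don't fill a final group of K is an assumption here, not Prism's documented behavior):

    import numpy as np

    K = 3
    x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)      # made-up data
    y = np.array([2, 4, 6, 9, 11, 13, 15, 18, 20], dtype=float)

    order = np.argsort(x)                      # sort the table by X first
    x, y = x[order], y[order]

    n = (len(x) // K) * K                      # keep whole groups of K only
    x_out = x[:n].reshape(-1, K).mean(axis=1)  # one output row per K input rows
    y_out = y[:n].reshape(-1, K).mean(axis=1)
    print(x_out, y_out)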
Transposing rows and columns
Each row of Y values becomes one column (data set) in the results table. The first row
becomes the first data set, the second row becomes the second data set, etc. You may not
transpose a data table with more than 104 rows, because Prism cannot create a table with
more than 104 columns.
The column and row titles in the results table are determined by your choices in the
dialog.
Note: You can also transpose via copy and paste, without using the Transpose analysis. To do this, select a portion, or all, of a table, and copy it to the clipboard. Position the insertion point in a different part of the same table or in a new table. Open the Edit menu and choose Paste Special. Finally, choose Transpose on the Placement tab.
B
Bartlett's test for equal variances, 62
Baselines, subtracting or dividing, 142
Bayesian approach, 21
Bias, from Bland-Altman, 118
Bland-Altman plot, 118
Bonferroni posttest, 61

C
Case-control study, 99
Censored survival data, 107
Central limit theorem, 15
Chi-square test for trend, 102
Chi-square test, checklist, 105
Chi-square test, how it works, 102
Chi-square vs. Fisher's test, 101
Circularity, 69, 88
Coefficient of determination, defined, 93
Coefficient of variation, 31
Column math, 142
Compound symmetry, 69

F
F ratio, one-way ANOVA, 62
F ratio, two-way ANOVA, 80
F test to compare variance, from unpaired t test, 44
Fisher's test vs. Chi-square, 101
Fisher's test, checklist, 105
Frequency distributions, 132
Friedman test, checklist, 75
Friedman test, how it works, 73
Friedman test, posttests, 74
Friedman test, results, 74
Functions available for entering user-defined equations, 138

G
Gaussian distribution, origin of, 15
Gaussian distribution, testing for, 32
Geometric mean, 32
Grubbs' method to detect outliers, 26

R
Ratio t tests, 51
Receiver-operator characteristic curve, 122
Regression vs. correlation, 92
Relative risk, 103
Relative risk, equation for, 101
Remove baseline, 142
Repeated measures ANOVA, checklist, 70
Repeated measures ANOVA, results, 69, 70
Repeated measures test, choosing, 58
Repeated measures two-way ANOVA, choosing, 79
Resampling approach to statistics, 11
Retrospective case-control study, 99
Robust tests, 28
ROC curve, 122

S
Sampling from a population, 10
Scatchard transform, 137
SD of dataset (column), 29
SD of replicates, calculating, 34
SD, definition of, 29
SEM of dataset (column), 30
SEM of replicates, calculating, 34
SEM, definition of, 30
Sensitivity, defined, 103
Significance, defined, 18
Skewness, 32
Smooth a curve, 127
Spearman correlation, 93
Specificity, defined, 103
Sphericity, 69
Standard deviation, definition of, 29
Standard error of the mean, definition of, 30
Statistical hypothesis testing, 17
Statistical power, 20
Statistical significance, defined, 18
Statistics, limitations of, 12
Subtract baseline, 142
Subtracting (or dividing by) baseline values, 142
Survival analyses, choosing, 112
Survival curves, entering data, 108
Survival curves, interpreting, 113

T
t ratio, from one-sample t test, 35
t ratio, from unpaired t test, 44
t test, one sample, checklist, 38
t test, one sample, results of, 35
t test, paired, checklist, 51
t test, paired, how it works, 48
t test, paired, results, 49
t test, ratio, 51
t test, unpaired, how it works, 44
t test, unpaired, results of, 44
t test, Welch's, 43
t tests, choosing, 41
t tests, entering data for, 40, 57, 92
t tests, paired or unpaired?, 41
Test for linear trend, choosing, 61
Test for trend, chi-square, 102
Totals, by row, 34
Tukey vs. Newman-Keuls post test, 61
Two-tail P value, defined, 17
Two-way ANOVA, see ANOVA
Type I error, defined, 18
Type II error, defined, 20

U
Unpaired t test, how it works, 44
Unpaired t test, results of, 44
Unpaired tests, choosing, 41
User-defined functions, 138

W
Welch's t test, 43
Wilcoxon matched pairs test, checklist, 56
Wilcoxon matched pairs test, results of, 55
Wilcoxon signed rank test, checklist, 39
Wilcoxon signed rank test, how it works, 38
Wilcoxon signed rank test, results of, 39

Y
Yates' continuity correction, choosing, 101