
Lindley 1984


The Analysis of Experimental Data:

The Appreciation of Tea and Wine


Dennis V Lindley
University of Warwick, England.

KEYWORDS: Teaching; Significance level; Likelihood; Alternatives; Bayes’ formula.
Summary
A classical experiment on the tasting of tea is used to
show that many standard methods of analysis of the
resulting data are unsatisfactory. A similar experiment
with wine is used to show how a more sensible
method may be developed.

◆The experiment with tea◆


One afternoon in the 1920’s at Rothamsted Experimental Station, the statistician, R.A. Fisher,
made Muriel Bristol a cup of tea. She protested when he put the tea infusion into the cup before
adding the milk, claiming that she could discriminate whether the milk had been added first or
second, preferring the former. Fisher then devised a classic experiment that is beautifully
discussed in chapter 2 of his book, Fisher (1935). The principles developed there are today
widely used in the design and analysis of many types of experiment. Because the original
experiment leads to technical difficulties in its analysis, we shall here consider a modified
form that avoids them, yet retains all the essential qualities of the original.

In the modified form, the lady is presented with a pair of cups of tea and told truthfully that
one has had the milk put in first, whilst the other has had it added to the tea infusion. She is
required to identify which is which. The only possible results are right, denoted by R, and
wrong, denoted by W. The experiment is to be repeated with 6 pairs of cups in all. Suppose that
the result is RRRRRW with only the last pair wrong. Fisher’s analysis goes as follows.

First suppose the lady is completely unable to do what she claims so that she is effectively
guessing which cup of the pair is which. The hypothesis of her inability to perform the task is
called the null hypothesis. In Fisher’s view the purpose of this, and many other, experiments is
to provide an opportunity of discrediting the null hypothesis that she is guessing. Here the
null hypothesis means that each pair is R with probability 1/2 or W with probability 1/2,
independently of the others. The observed result has probability $(1/2)^6 = 1/64$.

Fisher then argued that either
(a) the null hypothesis is true and an event of small probability has occurred, or
(b) the null hypothesis is false and the lady has discriminatory powers.
In this case the small probability is 1/64. Since events of small probability only rarely occur,
we might favour (b) as the more reasonable explanation in getting all correct except one. The
result is said to be significant with probability 1/64 and the probability is the significance
level. The key idea is that if something which is unusual on the null hypothesis happens, then
the null hypothesis is discredited. Nowadays it is common to use a value of 1/20, or 5%, as a
benchmark and to say the result is significant at 5% if the small probability is less than this
value, as with our result.

Fisher immediately realized that this argument fails because every possible result with the 6
pairs has probability $(1/2)^6 = 1/64$, so every result is significant at 5%. Fisher avoided
this absurdity by saying that any outcome with just 1 W and 5 R’s, no matter where that W
occurred, is equally suggestive of discriminatory powers and so should be included. There are 6
such possibilities, including the actual outcome, so the relevant probability for (a) above is
$6(1/2)^6 = 6/64 = .094$, so now the result is not significant at 5%.

Fisher’s amended argument for a general situation replaces the probability of the outcome on the
null hypothesis by the probability of that and similar outcomes; here the probability of 1 error
in 6. Fisher realized that even this would not work. For what is the most probable result with
pure guessing? Clearly one half of the pairs right and one half wrong. For 128 pairs of cups
with 64 R and 64 W, the probability is $\binom{128}{64}(1/2)^{128}$, which is about .05. This is
for the most probable outcome; every other outcome has smaller probability.
So for 128 pairs we are back to the difficulty that every result is significant at 5%. To
overcome this, Fisher ingeniously argued that if 1 error in 6 is significant, so surely is no
error, or 6 R’s. In other words, cases that more strongly suggest discriminatory powers than in
the case observed should also be included when calculating the probability to be judged against
5%. Outcomes that suggest powers as, or more, strongly than the outcome observed are said to be
as, or more, extreme.

The upshot of this is that Fisher’s simple, either (a) or (b) above, has to be amended to read:
either
(a) the null hypothesis is true and the probability of events as, or more, extreme than that
observed is small, or
(b) the null hypothesis is false and the lady has discriminatory powers.

This form is accepted by most statisticians and the scientific literature is full of 5%
significances, where the 5% refers to the probability of all results as, or more, extreme than
that observed. It is the italicised words (those referring to events as, or more, extreme than
that observed) that distinguish the accepted form from that first given by Fisher. With the
outcome RRRRRW, of probability $(1/2)^6$, there are 5 others as extreme and 1, with no errors,
more extreme, giving 7 cases in all and a total probability of $7(1/2)^6 = .109$, not
significant at 5%.

◆A Criticism◆

For many years the argument went largely unchallenged and was supported by alternative, more
mathematical, approaches due to Neyman, Pearson and Wald. But recently doubts, originally
advanced by Jeffreys, have crept in and the argument is increasingly being attacked. Let us see
how the criticism works for the outcome RRRRRW. Fisher has to consider what results are as, or
more, extreme and to do this he takes other possibilities with 6 pairs of cups. But why fix 6?
The value 6 may have arisen by chance. Perhaps Dr. Muriel Bristol had a meeting to go to after
tea and had to leave after 6 pairs. Had the cups not been prepared so efficiently, she might
have done fewer. Another possible form of experiment, suggested and used by J. B. S. Haldane in
the context of cats rather than tea-tasting, is to go on until the first mistake is made. Dr.
Bristol’s result is compatible with this type of experimentation. So let us use Fisher’s
argument for Haldane’s experiment. The probability of the sequence RRRRRW is still $(1/2)^6$.
More extreme sequences are those in which the first mistake occurs after the sixth pair. Thus at
the seventh, probability $(1/2)^7$; eighth,
probability $(1/2)^8$; and so on. The probability of the observed result and more extreme ones
is therefore $(1/2)^6 + (1/2)^7 + (1/2)^8 + \cdots = (1/2)^6/(1 - 1/2) = (1/2)^5 = .031$. Before
we had .109, yet now we have significance at 5%. This is surprising.

Let us see where we stand. If the experiment consisted of 6 pairs of cups being tested and the
result was RRRRRW, the relevant probability is .109. If the experiment consisted of pairs being
tested until the first error, with the same result, the relevant probability is .031, less than
a third of the previous value. And lack of significance in the first case changes to
significance in the second. Is not this absurd? Here are 6 pairs of cups honestly being tested,
resulting in RRRRRW; what does it matter what might have happened (for example RRRWRR in one
case, RRRRRRRW in the other) but did not? What would be the probability if Dr. Bristol had
stopped because of the meeting?

Let us pinpoint the difficulty with Fisher’s either/or argument. It lies in deciding just what
results are as, or more, extreme than that observed. (We have seen that the extreme results must
be included since there are experiments in which every result is unusual.) In the case of a
fixed number, 6, of pairs of cups, the extreme values are different from those in the case where
one continues until a mistake is made. Let us call these two experiments the fixed and the
sequential respectively. It might be argued that the judgements should depend on whether the
fixed or sequential experiment was used. But, if you feel that, consider the following
experiment. A fair coin is tossed: if it comes down heads, the fixed experiment with 6 pairs of
cups is used; if tails, the sequential one is adopted. The result RRRRRW has probability
associated with it equal to the average of the two experiments, namely
$(.109 + .031)/2 = .070$, and the result is not significant at 5%. But if the coin came down
tails and the sequential form was used (with a natural probability of .031), should we really
quote .070 merely because the coin might have shown heads? The suggestion seems strange.
Attempts have been made to define exactly what is meant by more extreme but without success.

So we have to abandon the use of more extreme outcomes. This leaves us only with the probability
of what happened and we have seen that is unsatisfactory because in some experiments all
probabilities are small. So what are we to do?

◆An Alternative Analysis◆

Fisher’s approach only considers probabilities on the null hypothesis. It does not consider
probabilities were Muriel Bristol to have discriminatory powers. Of course,
if she had perfect power then R would have probability 1 and the sequential experiment would
never end. But even the most enthusiastic supporter of the thesis that the milk must go in first
would admit to occasional lapses. We saw that on the null hypothesis each R had probability P,
independent of the others, with P = 1/2. A reasonable indication of discriminatory power would
admit a value of P in excess of 1/2. The higher the value of P, the greater is the lady’s
ability. The values of P above 1/2 are called alternative hypotheses. The result RRRRRW has
probability $P^5(1-P)$, P = 1/2 giving $(1/2)^6$ as before. This is called the likelihood
function of P, the probability of correct classification, for the observed result.

In general, it describes the probability of the observed result as a function of P. Modern work
says that it is this function that is required, not any consideration of more extreme cases. The
probability of what actually happened is considered under various hypotheses, rather than the
probability of several outcomes solely under the null hypothesis.

What has to be done is to compare the probability on the null hypothesis with probabilities for
other values of P, the alternative hypotheses. But which value of P? To answer this consider
another lady.

◆The experiment with wine◆

This lady is a wine expert, testified by her being a Master (sic) of Wine, MW. Instead of
tasting tea, she tasted wine. She was given 6 pairs of glasses (not cups). One member of each
pair contained some French claret. The other had a Californian Cabernet Sauvignon, Merlot blend.
In other words, both wines were made from the same blend of grapes, one in France, the other in
California. She was asked to say which glass had which. That is, she did the same experiment as
Dr. Bristol but with the two wines instead of the two preparations of tea. Suppose she got the
same result RRRRRW and consequently the same likelihood function $P^5(1-P)$, P now referring to
the probability of classifying the pairs of wines correctly.

At this point I can only speak for myself though I hope that many will agree with me. You may
freely disagree and still be sensible. I believe that Masters of Wine can distinguish the
Californian imitation from the French original. Mathematically I think that P > 1/2. Yet I think
it doubtful that ladies can distinguish the two methods of tea-making. P = 1/2 seems quite
reasonable to me there though I admit
that P > 1/2 is possible. So what I want to do is to put something into the analysis that
incorporates my belief that tea is different from wine. Notice that the likelihood is the same
for both though the meaning of P is different.

The way this is done is to introduce probability distributions for P appropriate for tea and for
wine. Let me give you my distributions to illustrate the ideas. For wine, I chose the expression
$48(1-P)(P-1/2)$, for $1/2 < P < 1$, (1)
having the form illustrated in Figure 1 and labelled prior. This expresses the fact that I think
that she can discriminate but can make mistakes. The value 48 makes the total probability 1. For
tea I took 0.8 for the probability that P = 1/2 and $1.6(1-P)$ for P > 1/2, having the form
illustrated in Figure 2 and labelled prior. This expresses my personal probability of 0.8 that
she cannot discriminate. (Fisher may have had such a value since he expressed surprise at Dr.
Bristol’s claim, reportedly saying “Nonsense, surely it makes no difference”, Box (1978).) This
allows a probability of 0.2 that she can, thinking that having good discriminatory power (P near
1) is less likely than modest power (P near 1/2). These formulae reflect my own views. You may
freely insert your own. More details will be found in Lindley (1984).

◆Bayes’ Formula◆

It is next necessary to combine these personal opinions with the evidence of the data expressed
through the likelihood function. The calculus of probability tells us how this is to be done,
namely by multiplying the original probability by the likelihood function. For the lady tasting
wine we have
$48(1-P)(P-1/2)P^5(1-P)$, for $1/2 < P < 1$.
Apart from the fact that the total probability is not 1, this is a probability distribution.
Simple, but tedious, calculations enable us to find a constant K such that
$K(1-P)^2 P^5 (P-1/2)$, for $1/2 < P < 1$, (2)
is a probability distribution, having integral from 1/2 to 1 of 1. The first probability
distribution (1) is called the prior distribution (prior, that is, to the data). The one just
obtained, (2), is called the posterior distribution. The formula says
posterior = K x prior x likelihood,
where K is a number chosen to make the integral of the right-hand side 1. It is called Bayes’
formula and the method is termed Bayesian. The only complication in its calculation is the
determination of K.

Figure 1 shows for the wine-tasting
(i) the prior distribution (1),
(ii) the posterior distribution (2)
for the case of 6 pairs yielding 1 error, and
(iii) the same with no errors. Initially I thought P = 3/4 was the most probable value but there
was substantial uncertainty expressed by the large spread about that value. With 1 error, there
has hardly been any shift in the most probable value but I am slightly more confident that P is
near 3/4 as expressed by a smaller spread. To understand the spread, consider the area under
these curves between say .6 and .9, .15 either side of P = 3/4. The area, and hence the
probability, is a little larger for the posterior distribution than for the prior. With no
errors, the situation changes and the most probable value has risen to around .87 and the spread
is substantially lower. For example, the probability that P is less than .75 is about .2 whereas
originally it was .5.

The situation with tea is subtler because I had initially a probability that she could not
discriminate, P = 1/2, which was not entertained with wine. The similar graphs are shown in
Figure 2. The prior value of this probability was .8, which drops to .59 when 1 error is made in
6 pairs, and to .23 with no errors. It
is these values that can be contrasted with the significance levels, the probabilities of
results as, or more, extreme than the actual results on the null hypothesis. The latter are .109
and .016 respectively. Notice that in both cases the significance probability is substantially
lower than the posterior probability. A partial reason for this is the high value of the prior
probability at .8. But the statement is still true even if one thinks that the lady is just as
likely to have the power as not, expressed through a prior probability of .5. For example, with
1 error in 6 pairs, the posterior probability is .26 compared with a significance level of .109.
It is typically true that the posterior probability of the null hypothesis exceeds the
significance level, though there is no logical connection between the two values. The behaviour
of the curves for the distributions for P > 1/2 is similar to those for wine.

The analysis just presented depends heavily on my opinion of the two ladies’ abilities. Your
opinion may be different. This seems sensible to me. On the slender evidence of 12 cups or
glasses it is not surprising that our views might differ, just as scientists currently differ
over the greenhouse effect because the evidence is inadequate. But had we evidence on 1200 cups,
perhaps with 100 ladies, the different initial opinions would be swamped by the evidence of the
data and we would essentially agree. Technically, the likelihood dominates the prior with a
large sample. This happens in science. 20 years ago many of us were suspicious of the claims
made that lead affected intelligence. The evidence now overwhelms the original opinions. All
evidence does is to change opinions: it does not create them.

◆Conclusions◆

There are four lessons that can be learned from this analysis.
(a) Since the significance level is typically less than the posterior probability of the null
hypothesis and a small value of the former, like 5%, is going to cast doubt on the null
hypothesis, it follows that null hypotheses will be more easily discounted using Fisher’s method
rather than the Bayesian approach. When it is remembered that a typical null hypothesis is that
a drug is of no use, or that a treatment is ineffective, it will be seen that the plethora of
significance tests that are used today will encourage specious beliefs in the efficacy of drugs
or treatments. Whenever you read of some effect having been detected, remember that it probably
refers to significance, which too easily suggests an effect when none exists.
(b) The Bayesian analysis provides the scientist with what he requires. He is interested either
in whether or not the null hypothesis is true (as with tea) or in the
magnitude of the effect being investigated (as with wine) or both. He requires a measure of
belief in either of these and probability provides such a measure. It does so directly for the
null hypothesis; for the magnitude, in our example expressed through P, it does so by a
probability distribution illustrated by the curves in the figures. This is in marked contrast to
the significance level which provides a probability for something that did not happen on a
hypothesis that may not be true.
(c) The Bayesian analysis distinguishes between tea and wine. Fisher’s analysis used only
probabilities assuming guessing, and guessing is the same for both, as the word ‘guessing’
implies. The Bayesian view recognizes that one’s opinion of tasting the two liquids may be
different or that the ladies may have different skills.
(d) This is easily the most important point of the four. The Bayesian method is comparative. It
compares the probabilities of the observed event on the null hypothesis and on the alternatives
to it. In this respect it is quite different from Fisher’s approach which is absolute in the
sense that it involves only a single consideration, the null hypothesis. All our uncertainty
judgements should be comparative: there are no absolutes here. A striking illustration of this
arises in legal trials. When a piece of evidence E is produced in a court investigating the
guilt G or innocence I of the defendant, it is not enough merely to consider the probability of
E assuming G; one must also contemplate the probability of E supposing I. In fact, the relevant
quantity is the ratio of the two probabilities. Generally if evidence is produced to support
some thesis, one must also consider the reasonableness of the evidence were the thesis false.
Whenever courses of action are contemplated, it is not the merits or demerits of any course that
matter, but only the comparison of these qualities with those of other courses.

◆Summary◆

The main points are now summarized. Fisher argued in the form of a dichotomy; either (a) an
event of small probability on the null hypothesis has occurred, or (b) the null hypothesis is
false. This did not work and the probability had to include events that did not occur but were
as, or more, extreme. This too did not work because of the ambiguity over what is ‘more
extreme’. The way out of this difficulty is to compare the probabilities of what actually
occurred on the null hypothesis and on alternatives to it. Since there are ordinarily several
alternative hypotheses, they have to be weighted. This is done by expressing personal beliefs
about the situation before experimentation. These prior beliefs, also in the form of
probabilities, are then modified by the experimental data, using Bayes’ formula, to give
posterior beliefs. One compares the various possible explanations for what has happened, and
compares one’s posterior beliefs with those held initially. All the analysis is comparative.

Note
This paper is based on a Collingwood lecture presented by Professor Lindley at the University of
Durham.

References
Box, J.F. (1978). R.A. Fisher, the Life of a Scientist. New York: Wiley.
Fisher, R.A. (1935). The Design of Experiments. Edinburgh: Oliver and Boyd.
Lindley, D.V. (1984). A Bayesian lady tasting tea. In Statistics: an Appraisal. Ed. H.A. David
and H.T. David. Ames: Iowa State University Press, 455-485 (with discussion).
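
As a numerical check on some of the figures quoted for the result RRRRRW, the following Python sketch recomputes the significance levels for the fixed and sequential experiments, their coin-toss average, and the posterior probability that P is below .75 for the wine prior (1). It is only an illustration under stated assumptions: the function names (`fixed_level`, `sequential_level`, `simpson`, `wine_prior`, `wine_posterior_cdf`) are hypothetical and do not come from the paper, and the integrals are approximated by Simpson's rule rather than by the exact calculation for K.

```python
from fractions import Fraction
from math import comb


def fixed_level(n_pairs=6, errors=1):
    """Fixed design: P(at most `errors` wrong among n_pairs) when guessing (P = 1/2)."""
    return sum(Fraction(comb(n_pairs, k), 2 ** n_pairs) for k in range(errors + 1))


def sequential_level(first_error_at=6):
    """Sequential design: P(first error at this pair or later) when guessing.

    Geometric series (1/2)^n + (1/2)^(n+1) + ... = (1/2)^(n-1).
    """
    return Fraction(1, 2) ** (first_error_at - 1)


def simpson(f, a, b, n=2000):
    """Composite Simpson's rule with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def wine_prior(p):
    """Prior density (1) for wine, defined on 1/2 < P < 1."""
    return 48 * (1 - p) * (p - 0.5)


def wine_posterior_cdf(x, rights, wrongs):
    """P(P < x | data) for the wine prior and likelihood P^rights (1-P)^wrongs."""
    def unnormalised(p):
        return wine_prior(p) * p ** rights * (1 - p) ** wrongs

    # Dividing by the integral over (1/2, 1) plays the role of the constant K.
    return simpson(unnormalised, 0.5, x) / simpson(unnormalised, 0.5, 1.0)


if __name__ == "__main__":
    fixed = fixed_level()            # 7/64
    sequential = sequential_level()  # 1/32
    print("fixed design:          ", float(fixed))
    print("sequential design:     ", float(sequential))
    print("coin-toss average:     ", float((fixed + sequential) / 2))
    print("prior (1) integrates to", simpson(wine_prior, 0.5, 1.0))
    print("P(P < .75 | no errors):", wine_posterior_cdf(0.75, 6, 0))
```

The significance levels use exact fractions because they are simple counts under guessing. The posterior is left to numerical integration so that a reader can substitute a different prior density for `wine_prior` without redoing the algebra.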
