Design of Experiments
MAST90044 Thinking and Reasoning with Data Chapter 8
8.9 Randomisation
Randomisation is the use of chance to allocate treatments to
experimental units. It forms the basis of any valid statistical experiment.
We would not compare the effectiveness of two medications by giving one medication to healthy
individuals, and the other to sick individuals. Nor would we give one kind of food to large guinea
pigs and another to small guinea pigs in a nutrition experiment. The results of such experiments
would be limited in value, because we would not know how much of the observed difference is
due to the treatment, and how much is due to the differences in health or size of the subjects.
In statistical terminology, the two effects would be confounded, i.e. they would not be able to
be distinguished from one another. Observational studies have the inherent problem that the
observed associations may be due to confounding variables which have not been measured.
A truly random allocation of treatments to experimental units is defined as one in which all
units have an equal probability of receiving any of the treatments. The details of exactly how
the randomisation is performed will depend on the particular experimental design.
Random does not mean selecting units in a subjective or haphazard way; it must be based
on probability to ensure validity. Manual methods, such as rolling dice or tossing a coin,
are acceptable, but it is usually more convenient to use computer software to perform the
randomisation, or in simple cases, random digits.
Drug trials are quite often conducted in this way: if there is a pool of eligible subjects, the
subjects may be allocated at random to one of the arms, or treatment groups, of the trial.
There are a variety of ways that this is done in practice, usually using computer programs, but
they all have the feature that each subject has the same probability of being allocated to each
group.
Example: Suppose there are 18 experimental units, to which 3 treatments are to be randomised,
so that we have 6 units per treatment. The units should be labelled (e.g. from 1 to 18) before
the randomisation.
Using R
> treatment <- c(rep("A", 6), rep("B", 6), rep("C", 6))
> unit <- sample(18, 18)
> data.frame(treatment, unit)
treatment unit
1 A 10
2 A 13
3 A 7
4 A 17
5 A 4
6 A 8
7 B 1
8 B 5
9 B 18
10 B 12
11 B 9
12 B 16
13 C 2
14 C 15
15 C 3
16 C 6
17 C 14
18 C 11
The sample(x,n) function selects n numbers randomly, without replacement, from the numbers
1 to x. Therefore sample(n,n) randomly arranges n numbers. The sample(x,n) function can
also draw a sample with replacement if required.
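As a brief illustration of these variants (a sketch only; the seed value and the object names are arbitrary choices, not part of the original example):

```r
# Fix the seed so the randomisation can be reproduced (the value 1 is arbitrary).
set.seed(1)

# A random arrangement of the numbers 1 to 18.
perm <- sample(18, 18)

# Six numbers chosen from 1 to 18 without replacement (the default).
subset6 <- sample(18, 6)

# Ten draws from 1 to 3 with replacement, so values can repeat.
draws <- sample(3, 10, replace = TRUE)
```

Recording the seed alongside the design means the same allocation can be regenerated later if it needs to be checked.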
Using Excel
1. Put the numbers 1 to 18 in column A.
2. Enter the formula RAND() in cell B1 and fill down to B18.
3. Copy column B onto itself using Paste Special → Values. If you don't do this step, a new
set of random numbers is generated every time a cell in the worksheet is changed, and it
can get confusing.
4. Select a cell in either column A or B, and sort the worksheet by column B using
Data → Sort → Column B.
We now have (in column A) the numbers 1 to 18 in random order, so we can give the first six
to treatment A, and so on.
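The same idea of attaching a random number to each unit and then sorting can be sketched in R. This is only an illustration of the spreadsheet method; the seed and the object names are assumptions made for the example:

```r
set.seed(2)                       # arbitrary seed, for reproducibility only
key <- runif(18)                  # one random number per unit, like RAND()
unit <- order(key)                # units 1 to 18 rearranged by sorting on the key
treatment <- rep(c("A", "B", "C"), each = 6)
allocation <- data.frame(treatment, unit)
allocation
```

The first six units in the sorted order receive treatment A, the next six B, and the last six C, exactly as in the worksheet.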
8.10 Replication
Replication is having a treatment applied to more than one
experimental unit.
A single observation or a unique event seldom allows us to draw a conclusion. On repeating
any observation, we nearly always find that the results of the second round are not identical
with those obtained in the first. This variation has to be taken into account in the analysis of
experiments.
There are actually many sources of variation. Successive measurements on the same object will
usually yield different readings owing to either human variation in the act of taking observations,
instrument variation, or variation in the object itself. When measurements are taken not on the
same object, but on a group of seemingly uniform objects, an additional source of variation is
introduced: no two objects are really identical, and the variation between objects is generally
larger than the variation within objects.
The nature of the random variation between measurements can be observed and described only
by repeated observations under given conditions in which all the systematic variation is
controlled. The main purpose of experimentation is to separate the systematic and the random
variation by repeated observations, i.e. replication. Adequate replication ensures that the
random variation averages out sufficiently so that the systematic effects of the treatments can be
seen.
The larger the number of replicates, the greater the precision will be in comparing treatment
effects, i.e. standard errors will be smaller, confidence intervals narrower, and P-values smaller
when testing the null hypothesis of no treatment differences.
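To make the effect of replication concrete: for two treatments with n replicates each and a common error standard deviation sigma, the standard error of the difference between the two treatment means is sigma * sqrt(2 / n). The sketch below tabulates this for a few values of n; the value sigma = 10 is a hypothetical choice made purely for illustration.

```r
sigma <- 10                       # assumed error standard deviation (hypothetical)
n <- c(2, 4, 8, 16, 32)           # replicates per treatment
se_diff <- sigma * sqrt(2 / n)    # standard error of the difference of two means
data.frame(n, se_diff)
```

Because the standard error depends on the square root of n, the replication has to be quadrupled to halve the standard error.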
Replication is also needed to estimate the error variance, so as to obtain a measure of precision;
we can't estimate the variability of anything from one reading.
Measuring trees within a plot of a forest experiment and analysing the data as if each tree
was an experimental unit.
Applying milking treatments to only a few cows and then measuring the milk yield of each
cow several times to increase the number of replications.
Testing one new Holden and one new Ford for acceleration on 5 stretches of road, and
making conclusions about the acceleration of all new Holdens and Fords.
Taking ten samples from a mature block of cheese and ten from an immature block, and
performing an analysis of variance to assess the maturing process.
Note that there is nothing wrong per se with measuring experimental units more than once,
or taking sub-samples. It is often a good thing to do, and will enhance the quality of your
data, by giving a less variable value of the response variable for a particular experimental unit.
The problem occurs when these measurements are included in the analysis as if they were
separate experimental units. False replication is sometimes described in nicer terms such as
"pseudo-replication" or "quasi-replication". These sound more scientific, but don't be fooled;
they usually mean false replication.
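One way to use repeated measurements without treating them as separate units is to reduce the data to a single value per experimental unit before analysis. The sketch below uses simulated (not real) milk yields and hypothetical column names; it assumes 4 cows, 2 per treatment, each measured 5 times.

```r
set.seed(3)  # arbitrary seed; all numbers below are simulated
# Hypothetical layout: 4 cows (the experimental units), 2 per treatment,
# each measured 5 times.
milk <- data.frame(
  cow       = rep(1:4, each = 5),
  treatment = rep(c("A", "A", "B", "B"), each = 5),
  yield     = rnorm(20, mean = 25, sd = 2)
)
# Reduce to one value per experimental unit before analysis.
cow_means <- aggregate(yield ~ cow + treatment, data = milk, FUN = mean)
nrow(cow_means)  # 4 rows: one per cow, not 20
```

Any comparison of treatments would then be based on the 4 cow means, which reflects the true replication of 2 cows per treatment.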
8.11 Blocking
A block is a group of experimental units that are similar in some way that is expected to affect
the response to the treatments. Blocks should consist of units which are likely to be more
homogeneous than the entire collection of available units. Treatments are then randomly assigned
to experimental units within blocks, and so comparisons between treatments are essentially
performed within blocks. The purpose of blocking is to reduce the variance of the error, and hence
increase precision, by accounting for some of the variation between units.
The term blocking was first used by R. A. Fisher in agronomic experiments, where blocks were
literally blocks of land in a field, each comprising a number of experimental plots. This traditional
terminology has continued in experimental design, although other terms such as matching,
pairing or stratification have become common in some disciplines. Whatever they are called,
the important aspect is that such groups contain similar or homogeneous experimental units;
this is shown in the following diagrams.
[Diagram: non-homogeneous experimental units; blocking into homogeneous groups; randomisation of treatments 1 to 5 within blocks.]
[Diagram: cross section of a piece of land running from W to E, with a river, showing wrong blocking and correct blocking.]
Cross section of a piece of land. The section from W to E is to be used for experiments.
It is assumed that fertility and moisture increase from W (highland) to E (lowland).
Apart from blocks of contiguous plots in a field, other examples of blocks in experiments are:
Precision
There are essentially two ways of increasing the precision of an experiment. The first is by
increased replication; no matter how large the error variance may be, in principle the desired
precision can be achieved by increasing the number of replicates. However, most experiments
have limited resources, and so increasing the replication massively is not feasible. The second way
is to use blocking; if within-block variability is substantially less than between-block variability,
the gain in precision can be substantial. Improvement of precision by blocking is usually cheaper
than improvement by increasing the number of experimental units.
[Diagram: arrow showing the direction of the slope.]
Assuming that the tomato plants can be arranged in any way, how should the blocks be divided
into plots? (Remember the main principle of blocking: the experimental units should be as
similar as possible within the block).
Note that in the R code above, the repetition of sample(8,8) is rather clumsy; if there were
more than 5 blocks, we would probably use a loop.
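A loop version might look like the following sketch. It assumes, as the note suggests, 5 blocks with 8 plots each and treatments numbered 1 to 8; the object names are chosen for illustration only.

```r
n_blocks <- 5
n_treatments <- 8
# One column per block; each column is an independent random ordering of the
# treatments 1 to 8 across the plots in that block.
layout <- matrix(NA_integer_, nrow = n_treatments, ncol = n_blocks)
for (b in seq_len(n_blocks)) {
  layout[, b] <- sample(n_treatments, n_treatments)
}
colnames(layout) <- paste("Block", seq_len(n_blocks))
layout
```

A more compact alternative with the same effect is replicate(5, sample(8, 8)), which also returns one column per block.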
Now suppose that in addition to the slope, there is a gradient in soil texture which is
perpendicular to the slope, starting from clay at the top of the diagram to loam at the bottom. If soil
texture influences the effect of the treatments, we then have an additional blocking factor which
needs to be accounted for. If we ignore it, some treatments may be advantaged or disadvantaged;
for example, in the above design, treatment 4 occurs only in the bottom half of the diagram,
and so would not be used at all with clay soil. A type of experimental design which can be used
in situations like this is the Latin square.
A Latin square design incorporates two blocking factors, which are usually represented as rows
and columns. There must be as many levels of each blocking factor as there are treatments,
and each treatment must appear exactly once in each row and in each column.
The following Latin square enables both blocking factors (slope and soil texture) to be accounted
for in the potato wireworm experiment:
             slope
clay    3   1   2   4
        1   4   3   2
        4   2   1   3
loam    2   3   4   1
The rows and columns of a Latin square do not have to be physical rows and columns. They
can be periods of time, groups of people or animals, and so on; in fact, any blocking factor is
a potential row or column of a Latin square, as the following example illustrates.
          Period
Cow    I        II        III
1      A 608    B 885     C 940
2      B 715    C 1087    A 766
3      C 844    A 711     B 832
Randomly choose a standard generating Latin square of the correct size from a list;
Some statistical packages have capabilities for design of experiments, including Latin squares.
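Without a dedicated package, a randomised Latin square can be sketched in base R. The functions below are hypothetical helpers written for this illustration, not taken from any package. They start from the cyclic square (an assumed choice of starting square) and randomly permute its rows, columns and treatment labels. This always yields a valid Latin square, but it does not necessarily select from all possible Latin squares with equal probability.

```r
random_latin_square <- function(k) {
  # Standard cyclic square: entry (i, j) is ((i + j - 2) mod k) + 1.
  base <- outer(seq_len(k), seq_len(k), function(i, j) (i + j - 2) %% k + 1)
  # Randomly permute the rows, the columns, and the treatment labels.
  square <- base[sample(k), sample(k)]
  relabel <- sample(k)
  matrix(relabel[square], nrow = k)
}

is_latin_square <- function(m) {
  k <- nrow(m)
  # TRUE if x contains each of 1..k exactly once.
  each_once <- function(x) all(sort(x) == seq_len(k))
  ncol(m) == k && all(apply(m, 1, each_once)) && all(apply(m, 2, each_once))
}

square <- random_latin_square(4)
square
is_latin_square(square)
```

The check function can also be applied to a square written down by hand, to confirm that each treatment appears exactly once in every row and column.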
Multiple Latin squares
Latin squares can also be used if the number of levels of one of the blocking factors is an exact
multiple of the number of levels of the other blocking factor and the treatment. For example,
suppose that in the food supplement experiment, six cows were available instead of three. Two
Latin squares could be used, as shown below. Note that the two squares should have separate
randomisations of treatments to plots.
          Period
Cow    I    II   III
1      A    B    C
2      B    C    A
3      C    A    B
4      B    A    C
5      A    C    B
6      C    B    A
8.12 Balance
In all our discussion so far, it has been assumed that each treatment in an experiment is
randomised to the same number of experimental units. In other words, the replication is the same
for each treatment. This is known as balance.
2. It gives the most precise comparisons. For example, allocating 20 farms to each of feed
types A and B results in a more precise comparison than allocating 25 farms to A and 15
farms to B (or vice-versa). However, allocating 25 farms to A and 20 farms to B gives
greater precision than allocating 20 to each.
Balance is therefore desirable within the limitations already placed on an experiment, e.g. cost,
or available space. But if more experimental units become available within those limitations,
they should in general be used, even if it results in lack of balance. In other words, balance is
good, but increased replication is even better.
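The farm example can be checked numerically. With a common error variance, the standard error of the difference between two treatment means is proportional to sqrt(1/nA + 1/nB), where nA and nB are the numbers of units on each treatment. The helper name rel_se below is hypothetical, introduced only for this sketch.

```r
# Relative standard error of the difference between two treatment means,
# in units of the error standard deviation: sqrt(1/nA + 1/nB).
rel_se <- function(n_a, n_b) sqrt(1 / n_a + 1 / n_b)

rel_se(20, 20)  # balanced, 40 farms
rel_se(25, 15)  # unbalanced, 40 farms: larger than the balanced case
rel_se(25, 20)  # unbalanced, 45 farms: smaller than both
```

For a fixed total of 40 farms the balanced split gives the smallest value, while the extra five farms in the 25 and 20 split reduce it further despite the lack of balance.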
8.13 Controls
Gastric freezing was a treatment for ulcers in the upper intestine. The patient swallowed a
deflated balloon with tubes attached, then a refrigerated liquid was pumped through the balloon
for an hour. The idea was that cooling the stomach would reduce the production of acid, and
so relieve ulcers. An experiment reported in the Journal of the American Medical Association
indicated that gastric freezing did reduce acid production and relieve ulcer pain. The treatment
was safe and easy and was widely used for several years. However, the experiment was poorly
designed, with no controls.
A later experiment divided ulcer patients into two groups. One group was treated by gastric
freezing as before. The other group received a placebo treatment in which the liquid in the
balloon was at room temperature rather than freezing. The results: 34% of the 82 patients in
the treatment group improved, but so did 38% of the 78 patients in the placebo group. This and
other properly designed experiments showed that gastric freezing was no better than a placebo,
and its use was abandoned.
In the earlier experiment, the patients' responses may have been due to the placebo effect. A
placebo is a dummy treatment, and the response to a dummy treatment is the placebo effect.
Many patients respond favourably to any treatment, even a placebo, presumably because of trust
in the doctor or the procedure, and expectation of a cure. The placebo effect is well documented,
even for treatments which are quite invasive.
A placebo is a form of a control or control treatment. The idea behind a control is that when
treatments are to be compared, all other variables should be held as constant as possible, i.e.
controlled. The argument of causation is sustainable because the experiment compares what
happened with what would have happened without the intervention.
Experiments without controls have been shown to be biased in favour of the treatment being
tested. This is also true for historical controls, in which the effect of a control treatment is
estimated from past records etc. rather than being included in the experiment. Confounding
with variables that change over time is a major weakness of trials which use historical controls.
Blind studies
In a medical experiment, if people know they are receiving a placebo, then the placebo effect is
less likely. For this reason the subjects should not be told which treatment they are given, if at
all possible. Such studies are referred to as blind studies. Obviously the placebo should match
the treatment in as many ways as possible; for example, if it is a tablet, it should be the same
size, shape and colour.
Even better are experiments in which neither the subject, nor the experimenters working with
them, know which treatment they are on, until the study is completed. Such studies are referred
to as double-blind studies. If, for example, a doctor believes in the value of a particular
treatment, they may treat or evaluate a patient more favourably. It is therefore desirable for
them to be blind if possible.
For experiments involving animals, plants, or inanimate material, blindness of the subject is
not an issue. However, it is still desirable for the experimenter to be blind where possible. For
example, when injecting an animal, or scoring disease on a plant, or assessing the fatigue of a
metal panel, it is better if the treatment is unknown to the person performing the task.
6. A study from the Education faculty of the University of Melbourne found that children
who had been in child-care from an early age experienced more educational difficulties
than those who had not. It was also found that children cared for by a private nanny had
fewer difficulties. The report concluded that child-care caused educational difficulties. Do
you think this conclusion is justified? Can you identify any possible confounding variables?
7. How random are the following methods of allocating treatments to experimental units?
8. For the two studies below, assess whether the claimed replication is false, and how many
(true) experimental units there are.
(a) The sprays are randomly allocated to rows, and 8 strawberry plants randomly selected
from each row for assessment.
(b) Each row is divided into 3 plots of 8 plants each. The sprays are randomly allocated
to plots within each row.
(c) The sprays are randomly allocated to individual plants across the entire trial area.
(d) The trial area is divided into 4 quarters, each consisting of 3 rows × 6 plants. Within
each quarter, the sprays are randomly allocated to rows.
(e) At the conclusion of the trial it is found that one end of the trial area is substantially
wetter than the other, and moisture generally increases the risk of fungal rots. Which
design had the most appropriate blocking to account for this?
(f) For part (c), perform the randomisation
(i) using R.
(ii) using Excel.
10. A cheese manufacturer wants to test two additives for their effect on texture of cheese.
There are three treatments: additive A, additive B, and no additive. A different batch of
milk is delivered to the factory each morning. Because of the complexity of the machinery,
only 3 cheese-making runs are possible on each day. The experiment can run for one week,
from Monday to Friday.
11. In order to assess the effects of exercise on reducing cholesterol, a researcher sampled
50 people from a local gym who exercised regularly and 50 people from the surrounding
community who did not exercise regularly. They each reported to a clinic to have their
cholesterol measured. The subjects were unaware of the purpose of the study, and the
technician measuring the cholesterol was not aware of whether subjects exercised regularly
or not. This is
A. an observational study;
B. an experiment, but not a double-blind experiment;
C. a double-blind experiment.
12. A new headache remedy was given to a group of 25 patients who suffered severe headaches.
Of these, 20 reported that the remedy was very helpful in treating their headaches. From
this information you can conclude
A. the remedy is effective for the treatment of headaches;
B. very little, because the sample size was too small;
C. very little, because there was no control group for comparison;
D. very little, because there were no objective measurements taken.
13. To determine whether a particular hormone injection produces a change in iron level in
the blood of mice, 20 mice are measured for their iron level, before injection. Three days
after injection, the iron level is measured again, and for each mouse the difference (before
vs after) is calculated. The results are statistically analysed, and used to conclude that
the hormone injection has caused a change in iron level.
Is the conclusion justified?
14. An apple orchard has 32 trees set aside for an experiment which aims to examine the effect
of mulching on tree growth. There are 4 mulching treatments: 1. Control (no mulch);
2. Wood chips; 3. Garden compost; 4. Clippings from a local council collection. The trees
are in a 4 × 8 rectangle, as shown in the diagram below. The ground slopes down from
the left to the right of the diagram. The experimenter has resources to maintain 16 plots,
each consisting of 2 trees.