Design Notes-1
What is an experiment? It is a planned inquiry to obtain new facts or to confirm or deny the
results of previous experiments. Such an inquiry may aid in recommending a seed variety, an
animal breed, a ration, a medicine, a pesticide or a management procedure. Experiments have
to be planned because we are investigating effects. The effects under investigation vary from
experiment to experiment, i.e. keep on changing, which makes the drawing of inferences
difficult. The design of research (an experiment) is therefore aimed at identifying and
reducing this variation, so that we can reproduce the experiment with a reasonable degree of
repeatability. If an experiment is improperly designed or blindly carried out, the results are
very likely to be irrelevant or misleading.
Experiments can be classified into three categories
(a) Preliminary experiment. In this the investigator tries out a large number of treatments in
order to obtain leads for future work.
(b) Critical experiment. In this the investigator compares response to different treatments
using sufficient observations of the responses to give reasonable assurance of detecting
differences.
(c) Demonstrational experiments. These are performed when the investigator compares a new
treatment with a standard.
1.1 Terminologies
1.1.3 Treatment
These are the various conditions that distinguish the populations of interest, e.g. varieties of
crops, drugs in animal experiments, fertilizers in agronomy studies, etc.
1.1.4 Control
This is a standard treatment. It might also be a placebo (that is a tablet, cream or solution
looking exactly like the real medicine but lacking the active ingredient)
1.1.5 Factors
When several aspects are studied in a single experiment e.g. rainfall, fertilizer, each of these is
called a factor. They can also be defined as aspects under study. Factors can occur at different
levels.
1.1.6 Variation
These are differences exhibited by individuals in a population. The variation could arise from
different causes which could be known or unknown. However, there are two main causes of
variation: -
(1). Inherent (genetic) variability that exists in the experimental material e.g. between
animals, plants in a plot, leaves on a plant, each test tube or each machine.
(2) Lack of uniformity in the physical conduct of the experiment e.g. mixing feed, reading at
different times, where different people are involved etc.
If variation is measured in a population, we talk of total variation, which can further be sub-
divided into different sources causing the variation using analysis of variance (ANOVA).
1.1.7 ANOVA
This is the process of subdividing the total variability of experimental observation into
portions attributed to recognised causes. For example, gain in weight in humans is normally
attributed to amount of feed, genes, type of feed etc.
The standard error of the mean is calculated as

SE = s / √n = √(s² / n)

where s is the standard deviation and n is the number of experimental units per treatment.
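As a sketch, this formula can be computed directly in Python; the function name and the five observations below are invented for illustration, not data from these notes.

```python
import math

def standard_error(observations):
    """Standard error of the mean: s / sqrt(n), where s is the sample
    standard deviation and n the number of experimental units per treatment."""
    n = len(observations)
    mean = sum(observations) / n
    s2 = sum((y - mean) ** 2 for y in observations) / (n - 1)  # sample variance
    return math.sqrt(s2 / n)

# five hypothetical observations from one treatment
se = standard_error([12.0, 14.0, 11.0, 13.0, 15.0])
```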
1.1.12 Precision
Reflects the repeatability of measurement of any variable.
(5) Selection of experimental material
The type and amount of experimental material will depend upon:
(a) Objective of the experiment
(b) The factors under study
(c) The inference space
(d) The budget available
1.5 Hypothesis testing
Null hypothesis (H0): - this is set up for the purpose of being tested. Consequently, the
opposite of the conclusion that the researcher is seeking to reach becomes the statement of the
null hypothesis. If H0 is accepted, we say the data do not provide sufficient evidence to
reject H0. If H0 is rejected, we conclude that the data are not compatible with H0, but
are supportive of some other hypothesis.
Alternative hypothesis (H1).
The error committed when a true hypothesis is rejected is called a TYPE I ERROR. The
probability of committing a type I error is denoted α (the significance level). A TYPE II
ERROR is the error committed when a false hypothesis is accepted. The probability of
committing a type II error is denoted β.
Step 7: - Computing Test Statistic.
From the sample data we compute the test statistic and compare it with acceptance and
rejection regions as determined by level of significance α.
A statistical decision consists of rejecting or accepting H0 depending on whether the computed
test statistic falls in the acceptance or rejection region.
Sampling from a normal population with mean unknown and variance unknown.
This assumes a random sample from a normally distributed population with mean µ unknown
and variance σ² unknown.
To test either of the following hypothesis
H0 µ = µ0 H0 µ ≤ µ0 H0 µ ≥ µ0
H1 µ ≠ µ0 H1 µ > µ0 H1 µ < µ0
Note that in this test statistic s (the sample statistic) is used instead of σ (the population
statistic). The problem is that we do not know the population distribution. If n is small, then
we have a problem, but if n is large we use the central limit theorem in determining the test
statistic, which in this case is: -

Z = (Ȳ − µ0) / (s / √n) ~ N (0, 1) where s is the sample standard deviation.
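A sketch of this large-sample test in Python; the ten sample values and the null mean µ0 = 5.0 below are invented for illustration.

```python
import math

def z_statistic(sample, mu0):
    """Large-sample test statistic Z = (Ybar - mu0) / (s / sqrt(n)),
    approximately N(0, 1) by the central limit theorem."""
    n = len(sample)
    ybar = sum(sample) / n
    # sample standard deviation (n - 1 denominator)
    s = math.sqrt(sum((y - ybar) ** 2 for y in sample) / (n - 1))
    return (ybar - mu0) / (s / math.sqrt(n))

# two-sided test of H0: mu = 5.0; reject at the 5% level if |Z| > 1.96
z = z_statistic([5.1, 4.9, 5.3, 5.2, 4.8, 5.0, 5.4, 5.1, 4.7, 5.2], mu0=5.0)
```

Here |Z| is well below 1.96, so these made-up data would not reject H0.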
2 POPULATION MEAN AND VARIANCE
It is imperative to first define statistics. Statistics is the science of collecting, analysing and
interpreting quantitative data in such a way that the reliability of the conclusions can be
evaluated in an objective way. There are two types of statistics: -
(1) Descriptive statistics: - Deals with organizing and summarizing numerical data. This also
includes the presentation of data in graphical or pictorial form.
(2) Inferential statistics: - Deals with formulating theories and making conclusions from
empirical facts or data. The conclusions made can be useful knowledge applicable in life and
production.
2.1 Population
A population is a group of things whose numbers may not be known. However, a sample is a
part of a population whose size is known and is subject to investigation or research in order to
generate data. The purpose of statistical inference is to establish facts about populations.
Ordinarily, a statistical analysis involves only certain aspects of a population - only certain
attributes, activities or characteristics e.g. weight, height, motion etc.
The attributes, activities or characteristics in a population or sample are assumed to be
normally or randomly distributed according to the laws of probability.
We can measure a certain characteristic of an element of the population. Such measurement is
called an observation. Interest in statistics often lies in the set of measurements taken from the
elements of the population. Such a set often also will be called the universe or population of
observations. The number of elements in the population is N and is called the population size.
In practice populations are finite; in statistical theory populations are assumed to be infinite.
This is a good model if N is large.
The population variance is

σ² = Σ (Yi − µ)² / N, summing over i = 1, …, N.
A formula to calculate σ², if µ is not an integer, is

σ² = [ Σ Yi² − (Σ Yi)² / N ] / N, with both sums running over i = 1, …, N.
The variance is often replaced by another measure of how the distribution of observations is
spread: the square root of the variance, which is called the standard deviation. The standard
deviation is denoted by σ and is

σ = √[ Σ (Yi − µ)² / N ] = √( [ Σ Yi² − (Σ Yi)² / N ] / N ) = √σ²
The advantage of the standard deviation is that it has the same dimension as the observations.
Be aware that the two measures population mean and population variance generally do not
describe the population completely. To show the influence of how the distribution is spread,
measured by the variance, we describe a population of 10 observations.
Example 1.1
The population consists of (3, 4, 4, 5, 5, 5, 5, 6, 6, 7) i.e. N = 10. We can summarize these
observations in a so-called frequency table. For this we first look at the values in the
population, which are different. In our example we have five different values namely 3, 4, 5, 6
and 7. The number of different values of the population is denoted by M.
Yi fi fi/N
3 1 0.1
4 2 0.2
5 4 0.4
6 2 0.2
7 1 0.1
fi is called the frequency and counts the number of times a certain observation Yi occurs.
fi/N is called the relative frequency.
σ² = (262 − 50²/10)/10 = (262 − 250)/10 = 12/10 = 1.2
The calculation of µ and σ² is easier when we already have the frequency table:

µ = Σ fi·Yi / N

σ² = [ Σ fi·Yi² − (Σ fi·Yi)² / N ] / N

with the sums running over i = 1, …, M, where N = Σ fi and M = number of different
observations. (The values are also denoted by Yi, but now i runs from 1 to M and all the
M Yi-values must be different.)
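These frequency-table formulas can be sketched in Python and checked against Example 1.1 above; representing the table as a dict {value: frequency} is just one convenient choice.

```python
def freq_stats(table):
    """Population mean and variance from a frequency table {Y: f},
    using mu = sum(f*Y)/N and
    sigma^2 = (sum(f*Y^2) - (sum(f*Y))**2 / N) / N."""
    N = sum(table.values())
    sum_fy = sum(f * y for y, f in table.items())
    sum_fy2 = sum(f * y * y for y, f in table.items())
    mu = sum_fy / N
    var = (sum_fy2 - sum_fy ** 2 / N) / N
    return mu, var

# Example 1.1: the population (3, 4, 4, 5, 5, 5, 5, 6, 6, 7)
mu, var = freq_stats({3: 1, 4: 2, 5: 4, 6: 2, 7: 1})
```

For the Example 1.1 data this gives µ = 5 and σ² = 1.2, matching the hand calculation.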
Exercise 1
Calculate the population mean and the population variance for each of the following five distributions.
1.
Yi 3 4 5 6 7
fi 1 3 2 3 1
2.
Yi 3 4 5 6 7
fi 2 2 2 2 2
3.
Yi 3 4 5 6 7
fi 1 4 2 0 3
4.
Yi 3 4 5 6 7
fi 3 2 0 2 3
5.
Yi 3 4 5 6 7
fi 2 4 0 0 4
3 TYPES OF LINEAR MODELS
In any experiment our main aim is to study some variable, e.g. growth rate. The behaviour of
growth rate can be expressed in terms of an equation, and this is referred to as a linear model.
This equation explains the response of the variable in terms of its component parts. For all
the models of analysis, a linear model equation of the form

Yi = µ + ei

is defined. In this equation Yi denotes not only the trait but also an observation induced by the
model. This observation is the sum of the mean µ and an error term ei containing observational
errors and the variability between experimental units. The models of analysis of
differ by the number and nature of parameters under study. The observations in an analysis of
variance are allocated in at least two classes, which are determined by the levels of the
factors. Each factor occurring in the models has at least two levels.
For example
An experiment to measure the effect of feeding dairy meal to cows of the same breed, age and
in the same stages of lactation on milk yield will have the following linear model (assuming
presence of a control group)
Yij = µ + τi + eij
where
Yij = observation on the jth cow of the ith treatment
µ = overall population mean
τi = effect due to the ith treatment (deviation of each treatment mean from the overall mean)
eij = random error associated with Yij.
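To make the notation concrete, here is a minimal simulation of this model in Python; µ, the treatment effects and the error spread are invented numbers, not estimates from any real trial.

```python
import random

random.seed(42)                     # reproducible illustration

mu = 20.0                           # overall population mean (e.g. kg milk/day)
tau = {"control": 0.0, "dairy meal": 2.5}   # hypothetical treatment effects

def simulate(treatment, n_cows, error_sd=1.0):
    """Generate observations Yij = mu + tau_i + eij,
    where eij is random error drawn from N(0, error_sd^2)."""
    return [mu + tau[treatment] + random.gauss(0.0, error_sd)
            for _ in range(n_cows)]

control = simulate("control", 10)
supplemented = simulate("dairy meal", 10)
```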
(could have used more). The inference made will have something to do with how humidity
affects larva development.
Exercise 2
Mr. Nyasi designed an experiment to investigate the effect of Rhodes grass hay, maize silage,
sorghum silage and Napier silage on milk yield. His aim was to adopt and use the forage that
gives highest milk yield.
He used Friesian cows in 3 stages of lactation (First, mid and last third). Each group of cows
was kept in a zerograzing pen. Write down his linear model.
4 DATA MANAGEMENT
Examples of data:
Univariate data e.g. the height of some individual drawn randomly from a population of
graduate students.
Multivariate data e.g. post-graduate students' weights, heights and ages.
Time series e.g. height of plants at different ages, height, weight at different ages.
4.2 Metadata
This is data about data or descriptive information about data, which allows a potential user to
determine a dataset's fitness for use. Metadata has many applications. It can be used to:
Concisely describe datasets and other resources using elements such as the name of the
dataset, the quality, who is the custodian, how to access the data, what is its intended
purpose, and whom to contact for more information about the data.
Enable effective management of data resources.
Enable accurate search and data resource discovery.
Accompany a dataset when it is transferred to another computer so that the dataset can be
fully understood, and put to proper use and to duly acknowledge the custodian of the
dataset.
What was the name of the project that generated them? Perhaps include an introduction or
abstract to the project.
What is the format and structure of the data? Include here any naming conventions used for
the files.
Where were the data collected? Give details about the site and perhaps the sampling frame
used.
When were they collected? State the time period covered by the data.
How were the data collected? Describe the measuring instruments used.
Why were they collected? What was the purpose of the research?
Who collected the data, or who were the principal researchers involved? Who holds the data?
Who has property rights to the data? Include names and contact details.
These are just ideas but if you can answer these questions then you are well on the way to
having a well-documented dataset. Of course for this information to be useful for others it
must be stored somewhere that others can access - in other words not just your memory!
Below is an outline of what might be included in the metadata.
Title
The name of the dataset or project
Authors
Name of principal investigator and other major players in the research. Include mailing
address, phone number, fax numbers, email, web address, etc.
Data set overview
Introduction or abstract. Time period covered by the data. Physical location of the data.
Any references to the Internet.
Instrument description
Brief text describing the instrument, with references. Figures or links if applicable. Table
of specifications.
Data collection and processing
Description of the data collection. Description of any derived parameters. Description of
quality control procedures used.
Data format
Data file structure and naming conventions.
Data format and layout.
Data version number and date.
Description of codes in the data.
Data documentation should be sufficiently complete, so that persons unfamiliar with a
given project could read the documentation and be able to use and interpret the data.
Part of your data management strategy might be to develop a database of metadata for all your
research projects. Of course if your datasets are currently in disarray and undocumented, this
will be a mammoth task, but you could start with current and future projects. We don't expect
you to solve all your data management problems overnight.
Database (DBMS) packages, e.g. Access, dBase, EpiInfo, Paradox, DataEase;
Statistical packages, e.g. Genstat, MSTAT, SAS, SPSS, Statgraphics, Systat;
Spreadsheet packages, e.g. Excel, Lotus-123;
Word processors, e.g. Word, WordPerfect; or text editors, e.g. Edit.
Geographic information systems are also available for storing spatial data.
Some of the advantages and disadvantages of the three types of packages for data
management are given in the following table.
Spreadsheet
Advantages: user friendly, well known and widely used; good exploratory facilities.
Disadvantages: unsuitable for multi-level data structures; lacks the security of a relational
database management system; statistical analysis capabilities limited to simple descriptive
methods.
Each type of package has its own special use. Nevertheless, statistical, spreadsheet and
database management packages have overlapping facilities for data management, and all can
now 'talk' to each other. In other words, a data file stored in one package can be exported
(transferred) to another. The researcher needs to anticipate at the start of a project how data
entry, management and analysis will proceed, and plan accordingly.
Excel is popular among researchers and is suitable for storing data from simple studies. But
care is needed in handling spreadsheets as it is easy to make mistakes. Excel does not offer
the same data security as a data base management system and so it is easy to lose and spoil
data.
There is no reason nowadays why data entry, management and analysis cannot all be done in
the one statistical package, especially for the simpler studies in which data can be stored in
standard spreadsheets. The problems of data security remain but the advantage, as indicated in
the above table, is that all the stages from data entry to analysis and interpretation can be done
within the same package.
Once all the data handling facilities of a statistical package have been exhausted attention can
then be turned to Excel or another spreadsheet package to see what additional facilities are
available there. Finally, the use of more specialized data base management software for
handling the multi-level datasets on which some of the case studies are based can be taught.
The advantage of this approach is that the student can find out what data handling facilities
are available within the statistical package itself. This means that the package will be taught
not just as a statistical tool but as a data handling tool too. The student will not have to move
from one package to another with the risk of making mistakes. At the end of the course the
student will be able to grasp better the appropriate roles for each type of package in the data
handling and management process.
Generally, the following views can be made on software for data management. Transfer of
data between packages is now simple enough that the same package need not be used for the
different stages of the work. The data entry task should be conceptually separated from the
task of analysis. This will help when thinking about what software is needed for data keying,
for checking purposes, for managing the “data archive” and for analysis.
Database management software (DBMS) should be used far more than at present. Many
research projects involve data management tasks that are sufficiently complex to warrant the
use of a relational database package such as Access. Spreadsheet packages are ostensibly the
simplest type of package to use. They are often automatically chosen for data entry because
they are familiar, widespread and flexible – but their very flexibility means that they can
result in poor data entry and management. They should thus be used with great care. Users
should apply the same rigor and discipline that is obligatory with more structured data entry
software.
More consideration should be given to alternative software for data entry. Until recently the
alternatives have been harder to learn than spreadsheets. Some statistical packages, for
example SPSS, have special modules for data entry and are therefore candidates for use at the
entry and checking stages. If a package with no special facilities for data checking is used for
the data entry, a clear specification should be made of how the data checking will be done. A
statistics package – not a spreadsheet – should normally be used for the analysis.
4.4 Data storage and backup strategies for your research data.
Data disasters
There exists much literature on perceptions of risk. Most people overestimate the chance of
relatively rare disasters such as earthquakes and plane crashes, and underestimate the chance
of common problems. In particular, they underestimate the chance of their computer
crashing, a probability that approaches 1. Power fluctuations and dirt can damage the hard
disk, and the read/write mechanism can simply wear out. In addition, the disk can become too
fragmented and slow down; these and several other causes lead to corrupted files, lost files,
an inaccessible hard disk or worse. On top of that there are different categories of human error:
loss of a laptop, deleting files, deleting parts of files. It can also happen on purpose by
unscrupulous persons, jealous colleagues or disgruntled employees: theft of laptops, deleting
files or damaging computers on purpose, … And then of course there is always a chance of
fire, water damage, … Last but not least there is an ever growing group of “malware”:
computer viruses, Trojan horses, backdoors, spyware.
The goal of this section is not to protect you from a complete system failure. This is the quite
complex job of the better skilled system administrator. The goal of this note is to protect you
from loss of data from your current research activities. In a way, the goal of this note is to
protect you from yourself since it involves some discipline. The only way to avoid
unrecoverable data loss is to back up regularly in an organized way.
The following steps can be used to store and backup your research data.
Step 1: Organize your data.
In this step you want to organize your data in a structured logical way. Physically, you put
them in a structured set of subfolders within a folder. Give those folders a logical name that is
easy to understand. It will make life easier when trying to find a file. This data organization
can take some time, but you only have to do it once. The moment you have your data
structured, change the default location where some common software saves its files.
With your data logically ordered in subfolders you can easily locate the folder where a
specific data file is stored. But that folder might contain so many files that you still don’t
easily find what you’re looking for.
It is therefore important to add documentation to the data at the level of individual files. This
involves naming your files, but also creating File Properties, which are useful when
performing a search. Give your data files descriptive and meaningful names. For example, if
you wanted to name an Excel workbook containing yield data from an experiment with
tomatoes, you could have named it something like "tomyld.xls". With such a file name, any
person not specialized in tomato yields would need a lot of imagination to figure out the
contents of the file.
Instead use long, descriptive and meaningful file names. But don’t exaggerate the length of
the file names. It is good practice to name the file containing the tomato yield data as
something like “Tomato yield harvest year 1999.xls”. This file name is only 34 characters but
is long enough to understand what the file is about.
Your data are now ordered in a structured and logical way. And you have added sufficient
documentation to each file. The only thing you have to do now to make a backup is to copy
the folder containing your data (D:\Data or C:\Data) and paste it onto a storage medium.
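That copy step can be scripted; this sketch uses Python's standard library, and the folder names are placeholders for your own data and backup locations.

```python
import datetime
import shutil
from pathlib import Path

def backup(data_folder, destination):
    """Copy the whole data folder onto a backup medium, stamping the
    copy with today's date so successive backups do not overwrite
    each other."""
    stamp = datetime.date.today().isoformat()
    target = Path(destination) / f"Data-backup-{stamp}"
    shutil.copytree(data_folder, target)    # fails if target already exists
    return target

# e.g. backup("C:/Data", "E:/")  -- both paths are placeholders
```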
The following is a review of the advantages and disadvantages of some common storage
media. As you will learn it has been concluded that the CD-R (and CD-RW) is the best
backup medium for research data of individual researchers and small organizations.
5 EXPERIMENTAL DESIGNS
5.1.1 Replication
If I were to tell you that a farmer gave his sick cow a home-made medicine and that a week
later she had got better, you would be rightly sceptical about this 'evidence'. That was not an
experiment because there was only one treatment. Any experiment must have at least two
treatments (one of which may, if we choose, be a control involving no treatment). Suppose
that the farmer had two sick cows and he gave his medicine to one of them: she got better in a
week and the other, untreated, cow took a month to recover. This is now an experiment but it
is un-replicated and will not impress us greatly. However, if the farmer had six cows showing
similar symptoms and he gave his medicine to three of them, who recovered in 5-10 days,
while the other three cows, left untreated, took 30-50 days to get better, you may well think
that this farmer's medicine deserves to be taken seriously. This is now a replicated
experiment. The whole basis of science is that observations are repeatable, although when we
are using biological material we have to allow for the effects of natural variation which clouds
our observations. We know that cows, like people, sometimes recover from illness whether
they are treated or not and so we demand controls (animals not treated) and replication before
we are satisfied that the evidence indicates that the medicine is the cause of the early recovery
of treated cows.
The essence of a scientific experiment conducted with variable biological material is that we
judge differences between units treated differently in the light of observed variation in units
treated alike.
Definition
Replication is the appearance of a treatment several times in an experiment. A treatment may
be replicated 2, 3, 4 times or more. Replicating a treatment is essential due to the following
reasons: -
1. Replicates provide a means of estimating experimental error. Without replication there
is no true estimate of experimental error, because the error degrees of freedom t(r − 1)
would be zero.
2. Replication improves the precision of an experiment by reducing the standard
deviation and hence the variance of a treatment mean, i.e. the more replications there
are, the lower the variance. The mean becomes more precise and closer to the
population mean.
3. Replicates increase the scope of inference of the experiment through the selection and
use of more variable experimental units.
4. Replicates permit some control of experimental error variance through grouping of
experimental units.
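Reason 2 can be illustrated by simulation: the standard error of a treatment mean behaves like σ/√r, so more replicates give a more precise mean. The population mean and spread below are invented numbers.

```python
import math
import random

random.seed(1)

def sd_of_treatment_means(r, trials=2000, mu=50.0, sigma=4.0):
    """Empirical standard deviation of the mean of r replicates drawn
    from N(mu, sigma^2); theory says it should be near sigma / sqrt(r)."""
    means = [sum(random.gauss(mu, sigma) for _ in range(r)) / r
             for _ in range(trials)]
    grand = sum(means) / trials
    return math.sqrt(sum((m - grand) ** 2 for m in means) / (trials - 1))

se_2 = sd_of_treatment_means(2)    # roughly 4 / sqrt(2)
se_8 = sd_of_treatment_means(8)    # roughly 4 / sqrt(8), i.e. clearly smaller
```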
5.1.2 Randomization
If I were now to tell you that the farmer reported above looked at his six ailing cows and
chose the three that appeared least affected to receive his medicine, you would immediately
revise your opinion of his 'experiment'. You would rightly say that the question of which
cows were to be treated should have been determined at random. This, like replication, is such
a common sense precaution against biased results that it does not need much elaboration as a
principle. The question of how randomness is achieved in practice will be dealt with later in
this course.
In medical research, where patients are randomly allocated to different medicines or to a
medicine and a placebo (that is a tablet, cream or solution looking exactly like the real
medicine but lacking the active ingredient), it is common practice to conceal the information
about which patient is receiving which treatment from both the patients themselves and from
the doctors recording the results of the trial. This is called a 'double-blind' trial. This
procedure is not commonly applied in animal experiments, on the grounds that the
observations made are 'objective' (e.g. weights of animals or their products). But when, for
example, behavioural observations are made, there is a serious risk that subjective bias may
creep in and you should then aim to guard against this by specifying that, wherever possible,
the person making the observations does not know which animals are receiving which
treatment. There are, however, cases where the treatments applied are apparent even to a
casual observer (trials assessing grazing behaviour on different sown pastures, for example)
and then it becomes important to make the observations as objective as possible by, for
example, formally recording with a stop-watch the times that animals spend in defined
activities.
Definition
Randomization is a procedure by which the experimental units are randomly (by chance)
assigned to the treatments (or treatments combination). The importance of randomisation
includes: -
1. It is aimed at avoiding bias. It is somewhat analogous to insurance, in that it is a
precaution against disturbance that may occur.
2. It assures a valid estimate of experimental error and avoids bias in estimating the
mean.
3. Randomization tends to destroy the correlation among errors and make valid the usual
tests of significance.
farms in a study area (you cannot visit and study them all) but before drawing names and
addresses from a hat containing names of all the farmers in the designated area, you may
decide to classify farms according to the size of their dairy herds. You will then draw similar
proportions (at random) from each size class and thereby ensure that your survey fairly
samples the situation in small, medium and large herds, and thus gives a truer picture of
mastitis in the region. In such cases the number of farms in each stratum of the survey will
probably be different, and the number yielding reliable information will probably differ from
the number intended at the outset. Blocking is rather like stratification but, because it is
applied in a planned experiment, it is usually more regular in its features.
5.3 Basic statistics on experimental data
Experiments may entail the calculation of basic statistics (variance, standard deviation,
standard error of the mean, coefficient of variation, etc.) for the purpose of summarizing the
data. The following example shows how this can be done.
Example 4.1
Calculate the statistics for the following data on the number of eggs laid in 30 days by a flock
of chickens treated with a vitamin (H = hen numbers 1 to 12).
1. The grand total = Σ Yi = 14 + 18 + … + 18 = 221, summing over all n hens.
2. The corrected sum of squares = Σ Yi² − (Σ Yi)² / n
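These summary statistics can be sketched in Python; the twelve egg counts below are invented for illustration and are not the data of the example.

```python
import math

# hypothetical egg counts for 12 hens over 30 days (illustrative values only)
eggs = [14, 18, 20, 17, 19, 21, 16, 18, 22, 19, 17, 18]

n = len(eggs)
grand_total = sum(eggs)                        # step 1: the grand total
mean = grand_total / n
# step 2: corrected sum of squares = sum(Y^2) - (sum(Y))^2 / n
css = sum(y * y for y in eggs) - grand_total ** 2 / n
variance = css / (n - 1)                       # sample variance
sd = math.sqrt(variance)                       # standard deviation
sem = sd / math.sqrt(n)                        # standard error of the mean
cv = 100 * sd / mean                           # coefficient of variation (%)
```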
3. Any difference obtained in the experimental units will be due to experimental error.
4. It is a one-way classification of ANOVA. Classification is by treatment only.
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
17 18 19 20
3. Assign the treatments to the experimental units. Assign treatments by way of randomisation
Groups Ranks in the group
1 17 (A) 2 (A) 15 (A) 7 (A) 19 (A)
2 13 (B) 4 (B) 18 (B) 1 (B) 8 (B)
3 16 (C) 14 (C) 6 (C) 9 (C) 12 (C)
4 10 (D) 20 (D) 3 (D) 11 (D) 5 (D)
(e) Assign the t treatments to the n experimental plots (units) by using the group number as
the treatment number and the corresponding ranks in each group as the plot (unit) number in
which the corresponding treatment is to be assigned.
II Drawing lots
Rather than using random numbers, we can draw lots using pieces of paper that correspond to
the experimental units. For the draw to be fair, the pieces of paper have to be identical.
Steps involved include: -
1. Prepare n identical pieces of paper and divide them into t groups, each group with r pieces
of paper.
2. Label each paper of the same group with the same letter (or number) corresponding to a
treatment. Uniformly fold each of the n labelled pieces of paper; mix thoroughly and place
them in container.
3. For our example, there should be 20 pieces of papers, five each with treatments A, B, C, D
appearing on them.
4. Draw one piece of paper at a time without replacement and with a constant shaking of the
container after each draw to mix its contents.
5. For our example, the label and the corresponding sequence in which each paper is drawn
may be as follows: -
Treatment D B A B C A D C B D D A A B B C D C C A
label
Sequence 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
6. Assign the treatment to plots (units) based on the corresponding treatments label and
sequence drawn.
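The drawing-of-lots procedure is equivalent to shuffling 20 labelled papers and dealing them out to the plots in order; a sketch (the seed is arbitrary):

```python
import random

random.seed(7)                     # arbitrary; remove for a fresh draw

treatments = ["A", "B", "C", "D"]
r = 5                              # replicates per treatment
papers = [t for t in treatments for _ in range(r)]   # 20 labelled papers

random.shuffle(papers)             # mix the container thoroughly

# the k-th paper drawn assigns its treatment to plot (unit) k
layout = {plot: label for plot, label in enumerate(papers, start=1)}
```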
eij = random error associated with Yij (deviation of each observation from its own treatment
mean).
Assuming that the model is fixed and that the treatment is unstructured, the ANOVA is
tabulated as follows
2. CF = correction factor = Y..² / n
3. SST = sum of squares for the treatments = (Σ Yi.²) / r − CF, summing over the t treatments.
Let us assume that our aim is to look at the effect of feeding dairy meal to cattle, which are
homogenous i.e. are of the same breed, age and in the same stage of lactation on milk yield
and that there is a group that is un-supplemented (control). The experimental design is a CRD.
In such an experiment the hypotheses are
H0 (null hypothesis): Feeding of dairy meal does not lead to changes in milk yield, i.e. µ1
(mean of control) = µ2 (mean of supplemented).
Ha (alternative hypothesis): Feeding of dairy meal leads to changes in milk yield, i.e. µ1
(mean of control) ≠ µ2 (mean of supplemented).
To test which hypothesis is correct, we use F values, tables of which are found in most
statistics books. From the ANOVA table, F = MST/MSE. The tabulated value is obtained
by using the degrees of freedom of treatment (t − 1) as the numerator and those of error,
t(r − 1), as the denominator. If the tabulated F value is greater than the calculated F value,
the treatment effect is not significant and therefore H0 is accepted (Ha is rejected), i.e. feeding
of dairy meal does not lead to changes in milk yield. If the tabulated F value is less than the
calculated F value, the treatment effect is significant and therefore Ha is accepted (H0 is
rejected), i.e. the feed does lead to changes in milk yield.
If the ANOVA gives a significant effect, all it tells you is that at least one of the means is significantly different from the others. If this is the case, then more analysis is needed to see which treatments are significantly different. However, when there is no significant effect of treatment, then that is the end of the analysis.
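The F = MST/MSE test described above can be sketched in a few lines. The milk-yield figures below are invented for illustration; only the formulas come from the text:

```python
# One-way (CRD) ANOVA by hand, mirroring the formulas above.
# The milk-yield numbers below are made up for illustration.
control = [18.2, 19.1, 17.8, 18.9, 18.5]
supplemented = [20.1, 21.3, 19.8, 20.7, 21.0]

groups = [control, supplemented]
t = len(groups)                      # number of treatments
n = sum(len(g) for g in groups)      # total observations

grand_total = sum(sum(g) for g in groups)
CF = grand_total ** 2 / n                             # correction factor
SSY = sum(y ** 2 for g in groups for y in g) - CF     # total SS
SST = sum(sum(g) ** 2 / len(g) for g in groups) - CF  # treatment SS
SSE = SSY - SST                                       # error SS

MST = SST / (t - 1)
MSE = SSE / (n - t)
F = MST / MSE
print(F)
```

The calculated F is then compared with the tabulated F for (t - 1, n - t) degrees of freedom, exactly as described above.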
Example 4.2
An agronomy student has 2 varieties of soybeans. He wants to compare them with a control
variety to see if any of them is better than the control. The following is the data that was
obtained from the study
Variety      Total (Yi.)
Variety 1    Y1. = 64.5
Variety 2    Y2. = 54.8
Variety 3    Y3. = 67.0
TOTAL        Y.. = 186.3

(each total is the sum of r = 10 plot observations, so n = 30)
Then calculate the correction factor CF = Y..²/n = (186.3)²/30 = 1156.9
2. Calculate the total sum of squares (SSY) by squaring each observation, summing, and subtracting CF:
SSY = Σ Y²ij - CF = [(6.6)² + (6.4)² + ………. + (6.8)²] - CF
= 1167.5 - 1156.9
= 10.6
Note that the total uncorrected sum of squares is 1167.5.
3. Calculate the sum of squares of treatments (SST)
SST = (Σi Y²i.)/r - CF = [(64.5)² + (54.8)² + (67.0)²]/10 - CF
= 1165.2 - 1156.9
= 8.3
4. Calculate sum of squares of error (SSE) = SSY - SST
= 10.6 - 8.3 = 2.3
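Steps 1 to 4 can be checked from the treatment totals alone (the uncorrected total SS, 1167.5, is taken from the text since not all 30 plot yields are listed):

```python
# Verify the soybean CRD sums of squares from the totals given above.
totals = [64.5, 54.8, 67.0]   # treatment totals Y1., Y2., Y3.
r = 10                        # replicates per variety
n = 30                        # total observations
uncorrected_SSY = 1167.5      # sum of squared observations (from the text)

CF = sum(totals) ** 2 / n                    # (186.3)^2 / 30
SSY = uncorrected_SSY - CF                   # total SS
SST = sum(t ** 2 for t in totals) / r - CF   # treatment SS
SSE = SSY - SST                              # error SS
print(round(CF, 1), round(SSY, 1), round(SST, 1), round(SSE, 1))
```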
5. We are testing all the treatments that are of interest and therefore our conclusion will refer only to those treatments (fixed model). The ANOVA table therefore looks as follows:

Source               d.f.   SS     MS     F
Variety              2      8.3    4.15   46.1
Experimental error   27     2.3    0.09
Total                29     10.6
Note that each MS = (corrected sum of squares)/(degrees of freedom). The F test requires the comparison of the mean squares. For our case F = 46.1. To check for significance at a certain level, e.g. at 5% or 1%, we use the F tables and take the degrees of freedom of variety (i.e. 2) as the numerator and those of experimental error (i.e. 27) as the denominator. Under these degrees of freedom, the table gives a value of 3.35. Remember that if the calculated F value is less than 3.35, then the varieties are not statistically different at the 5% level. If the calculated F is 3.35 or greater, then the varieties are statistically different at the 5% level.
Since our calculated F is 46.1, which is greater than the tabulated F (3.35), the varieties are statistically different. ANOVA has simply told us that at least 2 varieties are significantly different from each other.
It is customary to make a table and list the difference between pair of means and their
standard errors.
Remember that the standard error of a mean (SEM) is calculated as √(σ²/n) where σ is the standard deviation and n is the sample size. If an ANOVA has been performed then σ² is the MSE and, in calculating the SEM for a treatment mean, n is the number of replicates contributing to that mean. For our case therefore let’s use the letter r instead of n. The standard error of the difference between two means (SED) is calculated as

SED = √(σ1²/r1 + σ2²/r2)
The above formula must be used if the two treatments being compared have unequal variances. However, if you have performed an ANOVA and obtained a pooled MSE which is presumed to be applicable to all treatments, and if the treatments all have the same number of replicates, then σ1² = σ2² = MSE and r1 = r2 = r, and therefore

SED = √(2σ²/r) = √(2MSE/r)

When the number of observations in each treatment is not equal, then

SED = √[MSE(1/r1 + 1/r2)]
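The equal- and unequal-replication cases collapse into one small helper; `sed` is just an illustrative name, not a library function:

```python
import math

# Standard error of the difference between two treatment means,
# given a pooled MSE from the ANOVA (illustrative helper).
def sed(mse, r1, r2):
    # SED = sqrt(MSE * (1/r1 + 1/r2));
    # with r1 == r2 == r this reduces to sqrt(2 * MSE / r).
    return math.sqrt(mse * (1.0 / r1 + 1.0 / r2))

print(round(sed(0.09, 10, 10), 3))   # soybean example: 0.134
print(round(sed(4.76, 6, 6), 2))     # dairy-diet example: 1.26
```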
The appropriate SED for the difference between two differences can also be calculated. Given four treatment means, μ1, μ2, μ3 and μ4, each with their own SEM, if we then wish to ask whether the difference (μ1 - μ2) differs significantly from the difference (μ3 - μ4), the appropriate SED (with a pooled MSE and equal replication r) is

SED = √(4MSE/r)
For our situation, the number of observations in each treatment is equal, therefore

SED = √(2MSE/r) = √((2 x 0.09)/10) = 0.13

i  j    Ȳi - Ȳj              SED
1  2    6.45 - 5.48 = 0.97    0.13
1  3    6.45 - 6.70 = -0.25   0.13
2  3    5.48 - 6.70 = -1.22   0.13
and

(i - j) - (k - l)    (Ȳi - Ȳj) - (Ȳk - Ȳl)    SED
(1 - 2) - (1 - 3)    0.97 + 0.25 = 1.22        0.19
(1 - 2) - (2 - 3)    0.97 + 1.22 = 2.19        0.19
(1 - 3) - (2 - 3)    -0.25 + 1.22 = 0.97       0.19
Exercise 3
The following are weekly first-lactation yields of Friesian cows on 4 different diets. Two weekly yields were incomplete due to the death of the cows. The researcher is interested in determining whether average weekly yields differ for the four diets, each diet being a treatment.
As already alluded to, if ANOVA gives a significant effect, then more analysis is needed to
see which treatments are significantly different. There is therefore the need to separate the
means using a suitable procedure. This is called Post-ANOVA analysis and uses one of the
mean separation procedures described in the next section.
5. Dunnett's procedure
- used where one mean is a control and all other means are compared with it.
Each of these procedures has advantages and disadvantages in guarding against type 1 and type 2 errors. Rejecting or accepting a hypothesis can involve committing an error.
Type one error
Occurs when you judge a pair of means significantly different when they are actually equal, i.e. falsely accepting Ha (μ1 ≠ μ2) or falsely rejecting H0 (μ1 = μ2).
Type two error
Occurs when a pair of means is actually different but this difference is not detected, i.e. falsely accepting H0 (μ1 = μ2) or falsely rejecting Ha (μ1 ≠ μ2). In other words, this error occurs when you judge a pair of means equal when they are significantly different.
(In other words, if we reject a hypothesis when it should be accepted, we say that a Type I error has been made. If, on the other hand, we accept a hypothesis when it should be rejected, we say that a Type II error has been made. In either case a wrong decision or error in judgment has occurred.)
Post-ANOVA analysis using the above procedures is done to strike a balance between these two types of error.
If the comparison of means is between each treatment mean and the control, then the number of comparisons to be made is k - 1 where k is the number of treatments. If k = 4, then the number of comparisons is 3, i.e. compare 1 with 2, 1 with 3 and 1 with 4. If the experiment has no control, then every pair of means may be compared, giving k(k - 1)/2 comparisons.
In cases where the treatments are structured, i.e. related, separation of means using these procedures is meaningless. Therefore, we do not compare the means but form contrasts. A contrast is a linear combination of the means (or totals) whose coefficients add up to zero. Note that contrasts can also be used in situations where the treatments are unstructured. How contrasts can be used in separating treatment means will be discussed later under the sub-section on orthogonal polynomials.
To illustrate the different types of the separation procedures, let us use the following arbitrary
example.
Example 4.3
Suppose we are testing the effect of 4 diets (treatments) on milk yield per day, the number of observations for each treatment is six, and the design is a CRD. The treatment means are 30.5, 30.3, 35.7 and 33.0. After carrying out an ANOVA, MSE = 4.76 while the df for MSE = 20 (i.e. n - t = 24 - 4 = 20). The standard error of the difference between two means (SED) is 1.26, i.e.
SED = √(2MSE/r) = √((2 x 4.76)/6) = 1.26
5.5.1 Fisher's least significant difference (LSD)
(a) t tables
LSD = t(0.025, v) x SED = 2.086 x 1.26 = 2.63. Any difference greater than 2.63 is judged significant (*):

                 Increasing magnitude →
i      Mean      2: 30.3    1: 30.5    4: 33.0
3      35.7      5.4*       5.2*       2.7*
4      33.0      2.7*       2.5
1      30.5      0.2
2      30.3      0
(b) Studentized range values tables
The formula used is

LSD = SED x q_α(k, v)/√2

where q_α(k, v) = the studentized range value of order k, α = 5%, v = df of the MSE and SED = 1.26. Please note that for the LSD, k is always 2.

LSD = 1.26 x 2.95/√2 = 2.63
                 Increasing magnitude →
i      Mean      2: 30.3    1: 30.5    4: 33.0
3      35.7      5.4*       5.2*       2.7*
4      33.0      2.7*       2.5
1      30.5      0.2
2      30.3      0
5.5.2 Tukey's range procedure (TK or HSD)
This test is similar to the LSD except that it guards more against Type I error and is therefore also referred to as the honestly significant difference (HSD).

TK or HSD = SED x q_α(k, v)/√2

where k = number of means being compared (for our example k = 4), therefore

TK or HSD = SED x q_α(4, v)/√2 = 1.26 x 3.96/√2 = 3.52
                 Increasing magnitude →
i      Mean      2: 30.3    1: 30.5    4: 33.0
3      35.7      5.4*       5.2*       2.7
4      33.0      2.7        2.5
1      30.5      0.2
2      30.3      0
5.5.3 Student-Newman-Keuls (SNK) procedure
A critical value is calculated for each order k:

k = 2: SNK(2) = 1.26 x 2.95/√2 = 2.63 (remember this was the value obtained for the LSD)
k = 3: SNK(3) = 1.26 x 3.58/√2 = 3.19
k = 4: SNK(4) = 1.26 x 3.96/√2 = 3.52

Then compare the first (lowest) diagonal with the first value, i.e. 2.63, the second with 3.19 and the third with 3.52, i.e.
                 Increasing magnitude →
i      Mean      2: 30.3    1: 30.5    4: 33.0
3      35.7      5.4*       5.2*       2.7*
4      33.0      2.7        2.5
1      30.5      0.2
2      30.3      0

(the diagonals, from lowest to highest, are tested against 2.63, 3.19 and 3.52 respectively)
Treatment 3 is significantly different from all the others, but the others are not significantly different from one another.
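The SNK comparison above can be automated. This sketch reuses the studentized-range values quoted in the text rather than computing them; note that crit[4] comes out as 3.53 rather than the text's 3.52 only because the text rounds an intermediate value:

```python
import math

# SNK critical differences from the studentized-range values quoted
# in the text (alpha = 5%, v = 20 df): q(2)=2.95, q(3)=3.58, q(4)=3.96.
SED = 1.26
q = {2: 2.95, 3: 3.58, 4: 3.96}
crit = {k: round(SED * qk / math.sqrt(2), 2) for k, qk in q.items()}
print(crit)

# Test each ordered difference against the critical value for its span.
means = {2: 30.3, 1: 30.5, 4: 33.0, 3: 35.7}   # diet means
order = [2, 1, 4, 3]                           # ascending by mean
for span in range(2, 5):
    for s in range(0, 5 - span):
        i, j = order[s + span - 1], order[s]
        diff = round(means[i] - means[j], 1)
        sig = "*" if diff > crit[span] else "ns"
        print(f"{i} vs {j}: {diff} ({sig})")
```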
5.5.4 Duncan's multiple range test
The formula used is

D(p) = SED x d_α(p, v)/√2

where d_α(p, v) is Duncan's significant range value for p ordered means. Several values, depending on the number of means spanned, are calculated:

p = 2: D(2) = 1.26 x 2.09 = 2.63
p = 3: D(3) = 1.26 x 2.19 = 2.76

(here 2.09 and 2.19 are the tabulated values already divided by √2)
                 Increasing magnitude →
i      Mean      2: 30.3    1: 30.5    4: 33.0
3      35.7      5.4*       5.2*       2.7*
4      33.0      2.7        2.5
1      30.5      0.2
2      30.3      0
5.5.5 Dunnett's procedure
D = SED x d_α(k, v)

where d_α(k, v) is read from Dunnett's table, k = the number of comparisons with the control and v = df of the MSE.
Let us go back to our soybean example where we had three treatments. Treatment 1 (i.e. variety 1) is the control. ANOVA showed that the treatment effect was significant and that MSE = 0.09, df of MSE = 27, therefore

SED = √((2 x 0.09)/10) = 0.134

D = 0.134 x 2.33 = 0.31
Remember that in that example the treatment means for the three treatments were 6.45, 5.48 and 6.70 respectively. If treatment one is the control, then the number of comparisons to be made is given by k - 1, i.e. 3 - 1 = 2.

Comparison    Ȳi - Ȳ1
3 - 1         6.70 - 6.45 = 0.25
2 - 1         5.48 - 6.45 = -0.97*

Variety 3 is not significantly different from the control but variety 2 is.
Exercise 4
In a completely randomized design, a researcher recorded the following information
(1) Each treatment had 11 observations
(2) Ȳ1 = 14, Ȳ2 = 20, Ȳ3 = 12, Ȳ4 = 5
(3) SS for treatments was 200, df = 3
    SS for experimental error was 440, df = 40
    SS for total was 640, df = 43
(a) Give the linear model for this experiment. Explain the terms used in the model and specify the ranges of the subscripts.
(b) Test the equality of the means.
(c) What difference of pairs of treatment means would be judged significantly different at the
5% level by:
(i) Fisher's LSD (use both tables)
(ii) Tukey's procedure
(iii) Student-Newman-Keuls method
(iv) Duncan's multiple range test
(d) Assuming that treatment 1 (Ȳ1 = 14) is the control, what treatment means would be judged significantly different from it at the 5% level?
liveweight, age, parity, litter size (single vs twins in growing lambs), previous yield of milk or
eggs, breed, growth rate prior to the start of an experiment etc.
If the experimental units are not homogeneous and one decides to use a CRD, then the experimental error will be so large that the sensitivity of the experiment will be poor. There is, however, a price to pay for blocking: the cost of blocking is a loss of degrees of freedom from the error term.
The purpose of blocking is to increase precision by reducing the error variance. An experiment arranged in blocks is called a randomised complete block (RCB) design: 'randomized' because treatments have been allocated randomly to positions within each block; 'complete' because each block contains every treatment. There is no limit to the number of blocks one can have, but the number of treatments appearing in each block should be the same.
The data layout is (rows = treatments, columns = blocks):

i              Yi1    Yi2   …   Yij   …   Yir    Yi.
…
t              Yt1    Yt2   …   Ytj   …   Ytr    Yt.
Block totals   Y.1    Y.2   …   Y.j   …   Y.r    Y..
Assuming that the model is fixed and that the treatments are unstructured, the ANOVA is tabulated as follows:

Source               d.f.             SS     MS                         F
Blocks               r - 1            SSB    SSB/(r - 1) = MSB          MSB/MSE
Treatment            t - 1            SST    SST/(t - 1) = MST          MST/MSE
Experimental error   (r - 1)(t - 1)   SSE    SSE/(r - 1)(t - 1) = MSE
Total                rt - 1           SSY

1. SSY = total sum of squares = Σ Y²ij - CF
2. CF = correction factor = Y..²/n (note n = rt)
3. SST = sum of squares for the treatments = (Σi Y²i.)/r - CF (note r = number of blocks)
4. SSB = sum of squares for the blocks = (Σj Y².j)/t - CF
5. SSE = sum of squares for error = SSY - SST - SSB
Example 4.4
The following data was taken from an experiment in which four dietary treatments were
compared with eight sheep allocated to each treatment in a randomised complete block
design. The block was based on live weight of the sheep at the start of the trial.
1. Calculate the correction factor CF = Y..²/n = (595.1)²/32 = 11067.0003
2. Calculate the total sum of squares (SSY)
SSY = Σ Y²ij - CF = 11125.9300 - 11067.0003 = 58.9297
3. Calculate the treatment sum of squares (SST)
SST = (Σi Y²i.)/r - CF = [(141.3)² + (153.2)² + (149.0)² + (151.6)²]/8 - CF
= 11077.4362 - 11067.0003
= 10.4359
4. Calculate the block sum of squares (SSB)
SSB = (Σj Y².j)/t - CF = [(72.6)² + (69.7)² + .... + (80.3)²]/4 - CF
= 11096.1325 - 11067.0003
= 29.1322
5. Calculate error sum of squares (SSE) = SSY - SST - SSB
= 58.9297 - 10.4359 - 29.1322 = 19.3616
6. Enter the sum of squares in an ANOVA table and calculate the means squares (MS =
SS/df).
7. Calculate the F values as the ratio of each MS to the MSE.
8. Check F-ratios against the values tabulated to obtain the corresponding probability
estimates.
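The RCB partition for Example 4.4 can be checked from the totals quoted above (the block SS and total SS are taken from the text, since the individual block totals are only partially listed):

```python
# Check the RCB partition for Example 4.4 from the totals quoted above.
treat_totals = [141.3, 153.2, 149.0, 151.6]   # 8 sheep per treatment
r, t = 8, 4                                   # blocks, treatments (n = 32)
grand = sum(treat_totals)                     # 595.1
CF = grand ** 2 / (r * t)
SSY = 58.9297                                 # total SS, from the text

SST = sum(x ** 2 for x in treat_totals) / r - CF
SSE = SSY - SST - 29.1322                     # SSB = 29.1322 from the text
print(round(CF, 4), round(SST, 4), round(SSE, 4))
```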
Exercise 5
In the above example, which treatments and blocks are significantly different from one
another? Use LSD, TK and SNK to separate the means.
Experiments do not always go as planned. An animal might die in the course of the experiment. For an RCB such a missing value, m, can be estimated using the following formula.
m = (rBh + tTg - Y..) / ((r - 1)(t - 1))
where
m = missing value
Tg = total of all non-missing observation for treatment g.
Bh = total of all non-missing observation for block h.
Y.. = grand total of all non-missing observations.
r = number of blocks.
t = number of treatments
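The formula can be wrapped in a small helper; the totals below are hypothetical (they are not the data of Example 4.5):

```python
# Estimate of a missing value in an RCB:
# m = (r*Bh + t*Tg - Y..) / ((r - 1)(t - 1))
def rcb_missing(r, t, B_h, T_g, grand):
    return (r * B_h + t * T_g - grand) / ((r - 1) * (t - 1))

# Hypothetical totals: 5 blocks, 4 treatments.
m = rcb_missing(r=5, t=4, B_h=60.0, T_g=80.0, grand=400.0)
print(m)
```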
Example 4.5
Assume the following data on 4 diets fed to 5 different litters. During the experiment, one animal died.
Thus the RCB design was 81% more efficient than the completely randomized design in this case.
5.6.5 Confounding
Confounding is a general term used to mean that two effects are so mixed up that they cannot
be separated. An extreme example would be a trial using two-year old heifers and three-year-
old steers. Suppose that we find there is a difference in fatness between these two groups, we
have no way of telling whether this is due to age or due to sex because the two effects are
completely confounded.
Exercise 6
The following data represent the weight gain in grams of rabbit breeds under different
temperatures.
Treatment        Breed
                 1    2    3    4
1 (20°C)         54   21   48   68
2 (25°C)         63   17   50   56
3 (30°C)         50   15   46   49
4 (32°C)         55   13   45   70
5 (35°C)         60   19   47   37
The sum of squares for a contrast C is

SS = C²/(n Σc²)

where c represents the coefficients used (+3, -1, -1 and -1) and n = the number of plots in each total used (8 in this case).
Since 3² + (-1)² + (-1)² + (-1)² = 12, the SS we are looking for = C²/(12 x 8)
The corresponding SS are shown in the table below.
Treatment totals          Contrast
(n = 8)                   1     2     3
Treatment 1    141.3      +3    0     0
Treatment 2    153.2      -1    +1    +1
Treatment 3    149.0      -1    -2    0
Treatment 4    151.6      -1    +1    -1
Σc =                      0     0     0
Σc² =                     12    6     2
The following are the rules to be followed when making up an orthogonal set of contrasts using coefficients (or polynomials).
Rule 1. Any contrast must be between two quantities and thus represents 1 d.f. only.
Rule 2. The maximum number of contrasts available in one orthogonal set is equal to the d.f. available.
Rule 3. To be a valid contrast, the coefficients must sum to zero (Σc = 0).
Rule 4. For one contrast to be orthogonal to another, the sum of the products of their coefficients must be zero (Σc1c2 = 0).
If we test this last rule on the table of coefficients, we obtain for columns 1 and 2:
(3 x 0) + (-1 x +1) + (-1 x -2) + (-1 x +1) = 0 - 1 + 2 - 1 = 0.
Note that orthogonality is not transitive in general, so to know that you have an orthogonal set you should check every pair of columns in the same way.
If a set of contrasts obeys the rules for orthogonality given above, then the corresponding component SS will add up to the treatment SS. You will have noticed that:
Set 1 Set 2 Set 3
Contrasts: Contrasts: Contrasts:
1 2 3 1 2 3 1 2 3
A 3 0 0 1 1 1 1 1 0
B -1 2 0 1 -1 -1 1 -1 0
C -1 -1 1 -1 1 -1 -1 0 1
D -1 -1 -1 -1 -1 1 -1 0 -1
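The additivity claim can be verified numerically using the sheep-diet treatment totals and the coefficients from the treatment-totals table above (Σc² = 12, 6, 2); the component SS should sum to the treatment SS of 10.4359:

```python
# Component SS for an orthogonal set of contrasts should sum to the
# treatment SS (10.4359 for the sheep-diet totals used earlier).
totals = [141.3, 153.2, 149.0, 151.6]   # treatment totals, n = 8 each
n = 8
contrasts = [
    (3, -1, -1, -1),
    (0, 1, -2, 1),
    (0, 1, 0, -1),
]

ss = []
for c in contrasts:
    C = sum(ci * ti for ci, ti in zip(c, totals))       # contrast value
    ss.append(C ** 2 / (n * sum(ci ** 2 for ci in c)))  # SS = C²/(n Σc²)

print([round(s, 4) for s in ss], round(sum(ss), 4))
```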
Exercise 7
Using the information given in Example 4.3, find the corresponding SS for each contrast and
set in the table above.
A typical 4 x 4 layout, with breeds as columns, lactation numbers as rows and letters denoting treatments:

                  Breed
                  1    2    3    4
Lactation no. 1   A    B    C    D
              2   B    C    D    A
              3   C    D    A    B
              4   D    A    B    C
5.7.1 Characteristics of LS
1. Gives more accurate treatment comparisons
2. Has greater sensitivity because we have identified 2 additional sources of variation (rows and columns), thus our experimental error is low.
3. It is fairly easy to analyze.
4. The number of treatments determines the number of rows and columns. Their use is therefore restricted to cases where about four or five treatments are to be compared. If used to compare many treatments, the MSE tends to be inflated; since F is a ratio with MSE in the denominator, we would be dividing by a bigger number and the chance of detecting a significant effect would be reduced.
1. Select the type of square you are dealing with, i.e. select an LS plan with 5 treatments from a statistics book (see the existing plan above).
2. Randomize the row arrangement of the plan selected in step 1 using a randomization scheme of random numbers, e.g. select five 3-digit random numbers, for example 628, 846, 475, 902 and 452, then rank them from lowest to highest.
Use the rank to represent the existing row number of the selected plan and the sequence to represent the row number of the new plan.
New plan
C D A E B
D E B A C
B A E C D
E C D B A
A B C D E
3. Randomize the column arrangement using the same procedure used for the rows in step 2, e.g. select five 3-digit random numbers - 792, 032, 947, 293 and 196.
The rank will now be used to represent the column number of the plan obtained in step 2 and
the sequence will be used to represent the column number of the final plan.
Final plan
E C B A D
A D C B E
C B D E A
B E A D C
D A E C B
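The row/column randomization described above can be sketched as follows; shuffling whole rows and whole columns of a standard cyclic plan preserves the Latin-square property:

```python
import random

# Sketch of the row/column randomization described above: start from a
# standard cyclic 5x5 plan, then shuffle rows and columns.
t = 5
letters = ["A", "B", "C", "D", "E"]
plan = [[letters[(i + j) % t] for j in range(t)] for i in range(t)]

random.shuffle(plan)                         # randomize rows
cols = list(range(t))
random.shuffle(cols)                         # randomize columns
square = [[row[c] for c in cols] for row in plan]

# Every row and every column still contains each treatment exactly once.
for row in square:
    assert sorted(row) == letters
for j in range(t):
    assert sorted(sq[j] for sq in square) == letters
print(*[" ".join(r) for r in square], sep="\n")
```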
1. SSY = total sum of squares = Σ Y²ijk - CF
2. CF = correction factor = Y..²/n (note n = t²)
3. SST = sum of squares for the treatments = (Σi Y²i..)/t - CF
4. SSC = sum of squares for the columns = (Σj Y².j.)/t - CF
5. SSR = sum of squares for the rows = (Σk Y²..k)/t - CF
6. SSE = sum of squares for error = SSY - SST - SSC - SSR
Example 4.6
An animal scientist is testing 4 diets A, B, C and D using 4 breeds of cows, which are in lactation numbers 1, 2, 3 and 4. The animals were put on the new ration 10 days after the start of lactation for 3 months. The following are the total 3-month milk yields.
Breeds
Ayrshire Friesian Jersey Guernsey Lactation total
1 810: B 1080: C 700: A 910: D 3500
Lactation no. 2 1100: C 880: D 780: B 600: A 3360
3 840: D 540: A 1055: C 830: B 3265
4 650: A 740: B 1025: D 900: C 3315
Breed total 3400 3240 3560 3240 13440
2. Calculate the correction factor CF = Y..²/n = (13440)²/16 = 11289600
3. Calculate the total sum of squares (SSY)
SSY = Σ Y²ijk - CF = 11723250 - 11289600 = 433650
4. Calculate the treatment sum of squares (SST)
SST = (Σi Y²i..)/t - CF = [(2490)² + (3160)² + (4135)² + (3655)²]/4 - CF
= 11660737.5 - 11289600
= 371137.5
5. Calculate the column sum of squares (SSC)
SSC = (Σj Y².j.)/t - CF = [(3400)² + (3240)² + (3560)² + (3240)²]/4 - CF
= 11307200 - 11289600
= 17600
6. Calculate the row sum of squares (SSR)
SSR = (Σk Y²..k)/t - CF = [(3500)² + (3360)² + (3265)² + (3315)²]/4 - CF
= 11297262.5 - 11289600
= 7662.5
7. Calculate the error sum of squares (SSE) = SSY - SST - SSC - SSR
= 433650 - 371137.5 - 17600 - 7662.5 = 37250
8. Enter the sum of squares in an ANOVA table and calculate the means squares (MS =
SS/df).
9. Check F-ratios against the values tabulated to obtain the corresponding probability
estimates.
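All the sums of squares for Example 4.6 can be recomputed from the 4 x 4 table of yields and the diet letters in each cell:

```python
# Verify the Latin-square sums of squares for Example 4.6
# (rows = lactations, columns = breeds).
yields = [
    [810, 1080, 700, 910],    # lactation 1
    [1100, 880, 780, 600],    # lactation 2
    [840, 540, 1055, 830],    # lactation 3
    [650, 740, 1025, 900],    # lactation 4
]
diets = [
    ["B", "C", "A", "D"],
    ["C", "D", "B", "A"],
    ["D", "A", "C", "B"],
    ["A", "B", "D", "C"],
]

t = 4
n = t * t
grand = sum(sum(r) for r in yields)
CF = grand ** 2 / n
SSY = sum(y ** 2 for r in yields for y in r) - CF

treat = {}
for i in range(t):
    for j in range(t):
        treat[diets[i][j]] = treat.get(diets[i][j], 0) + yields[i][j]
SST = sum(v ** 2 for v in treat.values()) / t - CF
SSC = sum(sum(yields[i][j] for i in range(t)) ** 2 for j in range(t)) / t - CF
SSR = sum(sum(r) ** 2 for r in yields) / t - CF
SSE = SSY - SST - SSC - SSR
print(CF, SSY, SST, SSC, SSR, SSE)
```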
The ANOVA table looks as follows:

Source                 d.f.   SS         MS         F
Diets (treatments)     3      371137.5   123712.5   19.93
Breeds (columns)       3      17600      5866.7
Lactation no. (rows)   3      7662.5     2554.2
Experimental error     6      37250      6208.3
Total                  15     433650
Under the LS, ordinarily one does not perform tests for rows and columns because they were not randomized, and therefore the validity of such tests is questionable. Since the treatments were assigned at random, a significance test can be performed for them.
LSD(0.05) = t(0.025, 6 df) x √(2MSE/r) = 2.447 x √((2 x 6208)/4) = 136.3
i      Mean       A: 622.50   B: 790.00   D: 913.75
C      1033.75    411.25*     243.75*     120.00
D      913.75     291.25*     123.75
B      790.00     167.50*
A      622.50     0

(differences greater than the LSD of 136.3 are marked *)
When we have missing data in a row or column we have a formula which can be used:
m = [t(T + R + C) - 2G] / [(t - 1)(t - 2)]
where
T = treatment total corresponding to the missing value.
R = row total corresponding to the missing value.
C = column total corresponding to the missing value.
t = number of treatments.
G = sum of all observations.
Example 4.7
Calculate the missing value in the data below
Breeds
Ayrshire Friesian Jersey Guernsey Lactation total
1 810: B 1080: C 700: A 910: D 3500
Lactation no. 2 1100: C 880: D 780: B MISSING: A 2760
3 840: D 540: A 1055: C 830: B 3265
4 650: A 740: B 1025: D 900: C 3315
Breed total 3400 3240 3560 2640 12840
The treatment total for treatment A is 1890.

m = [4(1890 + 2760 + 2640) - 2(12840)] / [(3)(2)]

m = 580.
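As a check, the same arithmetic using the totals read off the table (treatment A = 1890, row 2 = 2760, Guernsey column = 2640, G = 12840):

```python
# Missing value in a Latin square: m = [t(T + R + C) - 2G]/[(t-1)(t-2)].
def ls_missing(t, T, R, C, G):
    return (t * (T + R + C) - 2 * G) / ((t - 1) * (t - 2))

# Totals read off the table above (t = 4 treatments).
m = ls_missing(t=4, T=1890, R=2760, C=2640, G=12840)
print(round(m))
```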
Exercise 7
A LS design was used to test the effect on egg weight of including molasses in the diet of laying hens at four concentrations (0, 70, 140 and 210 g/kg). Four groups, each of 48 birds, received each diet in turn for a period of 4 weeks. Data for the first 2 weeks after changing diets were discarded.
The table above shows mean egg weight for each group on each diet (based on weighing eggs
in bulk on weekdays in weeks 3 and 4 in each period). The Roman numerals in brackets
alongside the data give the period in which the results were obtained.
1. Carry out an analysis of variance of these data
2. Do the figures justify a conclusion that the inclusion of molasses in the diet depresses
egg weight?
5.8 Factorial experiments
It is not unusual to plan experiments that investigate two or more factors simultaneously. When the treatment consists of 2 or more factors, then the experiment is referred to as a factorial.
The term factorial refers to the treatment design (i.e. the relationship among treatments). For
example, (a) A 2 x 2 factorial experiment refers to a situation where we are dealing with 2
factors occurring at 2 levels. For example, one factor could be energy level (factor A) in the
diet which occurs at two levels - high energy or low energy levels. The other could be protein
level (factor B) also occurring at two levels - high and low.
Factor B
b1 b2
Factor A a1 a1b1 a1b2
a2 a2b1 a2b2
Where
a1b1 = high energy high protein.
a1b2 = high energy low protein.
a2b1 = low energy high protein.
a2b2 = low energy low protein.
Each animal receives both factors at one time. In a 2 x 2 factorial experiment, we need 4 treatment combinations; if each treatment combination is replicated r times, then the number of EU needed is 4 x r.
(b) 3 x 3 factorial experiment refers to a situation where we are dealing with 2 factors
occurring at 3 levels.
Factor B
b1 b2 b3
Factor A a1 a1b1 a1b2 a1b3
a2 a2b1 a2b2 a2b3
a3 a3b1 a3b2 a3b3
(c) 3 x 4 factorial experiment refers to experiment with 2 factors with one occurring at 3
levels and the other at 4 levels.
Factor B
b1 b2 b3 b4
a1 a1b1 a1b2 a1b3 a1b4
Factor A a2 a2b1 a2b2 a2b3 a2b4
a3 a3b1 a3b2 a3b3 a3b4
a4 a4b1 a4b2 a4b3 a4b4
Please note that "levels" does not always imply a numerical description: it can also imply qualitative or multidimensional differences, e.g. breed, which could differ in a number of quantifiable ways. We can also have factorial experiments with three, four or more factors. For example, a 2 x 2 x 2 factorial experiment simply means one with 8 treatment combinations: 3 factors, each having 2 levels.
1. Factorial experiments allow us to look at more than one factor at a time. Instead of conducting 2 or more separate experiments, one experiment is conducted which incorporates all the factors and levels.
2. The bigger the factorial the bigger the number of EU required.
3. In factorial, one is able to study the main effect as well as the interaction effect
between the factors under study.
4. Factorial experiments can be used with any experimental design.
Ai = effect of the ith level of factor A
Bj = effect of the jth level of factor B
(Block)k = effect due to the kth block
(AB)ij = effect of the interaction of factors A and B
eijkl = random error associated with Yijkl
5.8.3
Format for data recording and ANOVA in factorial experiment
under CRD
Data from a factorial experiment under CRD is normally organised as shown in the following
table
Factor B
1 2 …………… b A total
Assuming that the model is fixed and that the treatments are unstructured, the ANOVA is tabulated as follows:

Source               d.f.             SS     MS                           F
Factor A             a - 1            SSA    SSA/(a - 1) = MSA            MSA/MSE
Factor B             b - 1            SSB    SSB/(b - 1) = MSB            MSB/MSE
Interaction (AB)     (a - 1)(b - 1)   SSAB   SSAB/(a - 1)(b - 1) = MSAB   MSAB/MSE
Experimental error   ab(r - 1)        SSE    SSE/ab(r - 1) = MSE
Total                abr - 1          SSY

Where a = levels in factor A, b = levels in factor B and r = number of replications. Please note that abr = n.
1. SSY = total sum of squares = Σ Y²ijk - CF
2. CF = correction factor = Y...²/n
3. SAB = sum of squares for the AB subclasses = (Σij Y²ij.)/r - CF
4. SSA = sum of squares for factor A = (Σi Y²i..)/(br) - CF
5. SSB = sum of squares for factor B = (Σj Y².j.)/(ar) - CF
6. SSAB = sum of squares for the interaction between A and B = SAB - SSA - SSB
7. SSE = sum of squares for error = SSY - SAB
Example 4.8
The data below is from experiment in which chicks were fed protein from two different
sources with the objective of finding which one resulted in faster daily growth rates in grams.
Source one was fed at three levels while the other at 4 levels. Each combination was fed to a
12 groups of 3 chicks each from 7 to 21 days of age.
Factor B
b1 b2 b3 b4
11 8 12 9
a1 12 10 10 11
9 10 13 10
Factor A 13 14 8 9
a2 11 10 12 9
14 10 10 8
9 10 11 7
a3 9 8 11 11
9 11 9 6
2. Calculate the correction factor CF = Y...²/n = (364)²/36 = 3680.44
3. Calculate the total sum of squares (SSY)
SSY = Σ Y²ijk - CF = [(11)² + (12)² + ..... + (6)²] - CF
= 3798 - 3680.44
= 117.56
4. Calculate the AB subclass sum of squares (SAB)
SAB = (Σij Y²ij.)/r - CF = [(32)² + (28)² + ..... + (24)²]/3 - CF
= 3738.67 - 3680.44
= 58.23
5. Calculate the factor A sum of squares (SSA)
SSA = (Σi Y²i..)/(br) - CF = [(125)² + (128)² + (111)²]/(3 x 4) - CF
= 3694.17 - 3680.44
= 13.73
6. Calculate the factor B sum of squares (SSB)
SSB = (Σj Y².j.)/(ar) - CF = [(97)² + (91)² + (96)² + (80)²]/(3 x 3) - CF
= 3700.67 - 3680.44
= 20.23
7. Calculate the sum of squares for the interaction between A and B (SSAB) = SAB - SSA-
SSB
= 58.23 - 13.73 - 20.23
= 24.27
8. Calculate the error sum of squares (SSE) = SSY - SAB
= 117.56 - 58.23
= 59.33
9. Enter the sum of squares in an ANOVA table and calculate the means squares (MS =
SS/df).
10. Calculate F values as the ratio of each MS to MSE.
11. Check F-ratios against the values tabulated to obtain the corresponding probability
estimates.
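The whole factorial partition can be recomputed from the raw chick data above (the exact values agree with the text to within the text's intermediate rounding of about 0.01):

```python
# Verify the factorial (CRD) sums of squares for Example 4.8.
# data[i][j] holds the r = 3 observations for level i of A, level j of B.
data = [
    [[11, 12, 9], [8, 10, 10], [12, 10, 13], [9, 11, 10]],   # a1
    [[13, 11, 14], [14, 10, 10], [8, 12, 10], [9, 9, 8]],    # a2
    [[9, 9, 9], [10, 8, 11], [11, 11, 9], [7, 11, 6]],       # a3
]
a, b, r = 3, 4, 3
n = a * b * r

grand = sum(y for ai in data for cell in ai for y in cell)
CF = grand ** 2 / n
SSY = sum(y ** 2 for ai in data for cell in ai for y in cell) - CF
SAB = sum(sum(cell) ** 2 for ai in data for cell in ai) / r - CF
SSA = sum(sum(sum(cell) for cell in ai) ** 2 for ai in data) / (b * r) - CF
SSB = sum(sum(sum(data[i][j]) for i in range(a)) ** 2 for j in range(b)) / (a * r) - CF
SSAB = SAB - SSA - SSB
SSE = SSY - SAB
print(round(SSY, 2), round(SSA, 2), round(SSB, 2), round(SSAB, 2), round(SSE, 2))
```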
The ANOVA table looks as follows:

Source               d.f.   SS       MS     F
Factor A             2      13.73    6.86   2.78
Factor B             3      20.23    6.74   2.73
Interaction (AB)     6      24.27    4.04   1.64
Experimental error   24     59.33    2.47
Total                35     117.56
Exercise 8
An experiment was conducted to determine the effects of 4 different anthelminths on the
weight gain (kg) in 3 breeds of sheep. Four animals from each breed were randomly selected
from a flock of sheep and drenched. The four anthelminths were then randomly given to the
animals. Weight gain was obtained after one month.
The data appear in the table below. Set up an analysis of variance table, computing all the sums of squares and mean squares. Separate the means (use α = 0.05 for all F tests). What conclusions can you draw?
Anthelmintic
1 2 3 4
1 29 50 43 53
Breed 2 41 58 42 73
3 66 85 69 85
5.9 Split-plot design
This design can also be referred to as a factorial design with unequal replication. In most animal experiments where two or more factors are investigated, all the factorial treatment combinations receive equal replication. There are a few circumstances, however, where this is either not possible or not convenient, and you are then likely to end up with a split-plot design. In animal trials, split-plot designs usually arise because circumstances dictate them.
5.9.3 Randomization layout
1. Consists of two separate randomization processes for
(a) Main plots
(b) Subplot
2. In each replication main plot treatments are first randomly assigned to the main plots
followed by a random assignment of the subplot treatment within each main plot. Each is
done by any of the randomization schemes discussed earlier.
3. Using a as the number of main plots treatment, b as the number of subplot treatments and r
as the number of replications. For illustration, a two factor experiment involving 6 levels of
nitrogen (main plot treatments) and 4 rice varieties (subplot treatments) in 3 replications are
used.
Step 1
Divide the experimental area into r = 3 blocks, each of which is further divided into a = 6
main plots.
Replication 1
1 2 3 4 5 6
Replication 2
1 2 3 4 5 6
Replication 3
1 2 3 4 5 6
Step 2
Following the RCB randomization procedure with a = 6 treatments and r = 3 replications, randomly assign the 6 nitrogen treatments to the 6 main plots in each of the 3 blocks. The result may be as shown below.
Replication 1
N4 N3 N1 N0 N5 N2
Replication 2
N1 N0 N5 N2 N4 N3
Replication 3
N0 N1 N4 N3 N5 N2
Step 3
Divide each of the (r)(a) = 3 x 6 = 18 main plots into b = 4 subplots and, following the RCB randomization procedure for b = 4 treatments and (r)(a) = 18 replications, randomly assign the 4 varieties to the 4 subplots in each of the 18 main plots. The result may be as shown below.
Replication 1
N4 N3 N1 N0 N5 N2
V2 V1 V1 V2 V4 V3
V1 V4 V2 V3 V3 V2
V3 V2 V4 V1 V2 V1
V4 V3 V3 V4 V1 V4
Replication 2
N1 N0 N5 N2 N4 N3
V1 V4 V3 V1 V1 V3
V3 V1 V4 V2 V4 V2
V2 V2 V1 V4 V2 V4
V4 V3 V2 V3 V3 V1
Replication 3
N0 N1 N4 N3 N5 N2
V4 V3 V3 V1 V2 V1
V2 V4 V2 V3 V3 V4
V1 V1 V4 V2 V4 V2
V3 V2 V1 V4 V1 V3
Example 4.9
The following data (grain yield) were obtained in a two factor experiment involving 6 levels
of nitrogen (main plot treatments) and 4 rice varieties (subplot treatments) in 3 replications.
N1 (60 kg N/ha)
V1 5418 5166 6432
V2 6502 5858 5586
V3 4768 6004 5556
V4 5192 4604 4652
N2 (90 kg N/ha)
V1 6076 6420 6704
V2 6008 6127 6642
V3 6244 5724 6014
V4 4546 5744 4146
N3 (120 kg N/ha)
V1 6462 7056 6680
V2 7134 6982 6564
V3 5792 5880 6370
V4 2776 5036 3638
N4 (150 kg N/ha)
V1 7290 7848 7552
V2 7682 6594 6576
V3 7080 6662 6320
V4 1414 1960 2766
N5 (180 kg N/ha)
V1 8452 8832 8818
V2 6228 7387 6006
V3 5594 7122 5480
V4 2248 1380 2014
Nitrogen x variety yield totals
2. Compute the correction factor (CF) and sums of squares for the main plot analysis
CF = Y...²/n = (394481)²/((3)(6)(4)) = 2161323047
SSY = total sum of squares = Σ Y²ijk - CF = 204747916
SSBL = replication (block) sum of squares = (Σi R²i.)/(ab) - CF
= [(128873)² + (135604)² + (130004)²]/(6 x 4) - CF
= 1082577
SSA = sum of squares for the main plots (nitrogen) = (Σi A²i)/(rb) - CF
= [(48670)² + .... + (69561)²]/(3 x 4) - CF
= 30429200
SSEa = error between main plots = (Σij (RA)²ij)/b - CF - SSBL - SSA = 1419678

where (RA)ij are the replication x nitrogen (main plot) totals.
3. Compute the sums of squares for the subplot analysis.
SSB = sum of squares for the subplots (variety) = (Σi B²i)/(ra) - CF
= [(117964)² + .... + (65558)²]/(3 x 6) - CF
= 89888101
SSAB = sum of squares for the interaction between A and B = (Σij (AB)²ij)/r - CF - SSA - SSB = 69343487
SSEb = error within mainplots = SSY - SSBL - SSA - SSEa -SSB - SSAB
= 12584873
4. For each source of variation compute the MS
MSBL = SSBL/(r - 1) = 541289
MSA = SSA/(a - 1) = 6085840
MSEa = SSEa/(r - 1)(a - 1) = 141968
MSB = SSB/(b - 1) = 29962700
MSAB = SSAB/(a - 1)(b - 1) = 4622899
MSEb = SSEb/a(r - 1)(b - 1) = 349580
5. Compute F
F (A) = MSA/MSEa = 42.87
F (B) = MSB/MSEb = 85.71
F (AB)= MSAB/MSEb = 13.22
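Steps 4 and 5 follow mechanically from the sums of squares, so they can be checked in a few lines:

```python
# Check the split-plot mean squares and F ratios from the SS values above.
SS = {"BL": 1082577, "A": 30429200, "Ea": 1419678,
      "B": 89888101, "AB": 69343487, "Eb": 12584873}
r, a, b = 3, 6, 4
df = {"BL": r - 1, "A": a - 1, "Ea": (r - 1) * (a - 1),
      "B": b - 1, "AB": (a - 1) * (b - 1), "Eb": a * (r - 1) * (b - 1)}

MS = {k: SS[k] / df[k] for k in SS}
F_A = MS["A"] / MS["Ea"]     # main plot factor tested against error (a)
F_B = MS["B"] / MS["Eb"]     # subplot factor tested against error (b)
F_AB = MS["AB"] / MS["Eb"]
print(round(F_A, 2), round(F_B, 2), round(F_AB, 2))
```

Note that the main plot factor is tested against error (a) while the subplot factor and the interaction are tested against error (b).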
CV(a) = (√MSEa / Ȳ...) x 100, where Ȳ... = Y.../abr is the grand mean
= (√141968 / (394481/((3)(6)(4)))) x 100
= 6.9%
This indicates the degree of precision attached to the main plot factor.
CV(b) = (√MSEb / Ȳ...) x 100
= (√349580 / (394481/((3)(6)(4)))) x 100
= 10.8%
This indicates the precision of the subplot factor and its interaction with the main plot factor.
The ANOVA table looks as follows:

Source                d.f.   SS          MS         F
Replication (block)   2      1082577     541289
Nitrogen (A)          5      30429200    6085840    42.87
Error (a)             10     1419678     141968
Variety (B)           3      89888101    29962700   85.71
N x V (AB)            15     69343487    4622899    13.22
Error (b)             36     12584873    349580
Total                 71     204747916
Exercise 9
The figure below shows the plan of an experiment testing three temperatures (allocated to six rooms at random) in combination with four diets (A, B, C and D) allocated at random within rooms in a split-plot design. This experiment was planned to investigate the interaction between environmental temperature and dietary nutrient concentration. The question was whether putting more energy (and protein and minerals in due proportion) into the diet helps to overcome the adverse effects of heat stress on laying performance. The numbers in parentheses indicate the total number of eggs laid after a three-month period. Analyse these data and draw conclusions at α = 0.05.
Room (temp °C)   Diets (total eggs in parentheses)
1 (27)           C (56)   D (22)   A (85)   B (62)
2 (33)           A (61)   C (62)   B (60)   D (46)
3 (27)           B (74)   A (88)   C (71)   D (14)
4 (30)           C (71)   D (14)   B (77)   A (73)
5 (30)           B (66)   C (67)   D (20)   A (78)
6 (33)           C (57)   A (64)   B (61)   D (57)
6 REGRESSION AND CORRELATIONS
If we are interested in the question whether two variables are related, we speak about
correlation and focus our attention on the correlation coefficient, r; if we are interested in the
dependence of one variable on another, we call this regression and describe the relationship
with an equation such as Y = a + bX, where b is the regression coefficient. Regression and
correlation can be classified based on the following:
1. Number of variables
2. Form of relationship
If we have only 2 variables (independent and dependent), then we call that a simple regression
or correlation. If the number of variables is more than 2, we have multiple regression or
correlation. There are cases whereby there are k independent variables but only one
dependent variable. When classified by the form of relationship, two forms are
distinguishable: linear and non-linear. If the data appears to be approximated well by a
straight line, we say that a linear relationship exists between the variables. If a relationship
exists but it is not linear, then we call it a non-linear relationship. Non-linear relationships can
sometimes be reduced to linear relationships by appropriate transformation of variables.
The two bases of classification combine to give four types of regression and correlation,
namely: simple linear regression or correlation, simple non-linear regression or correlation,
multiple linear regression or correlation and multiple non-linear regression or correlation.
The correlation coefficient is given by

r = Σ(X − X̄)(Y − Ȳ) / √[Σ(X − X̄)² Σ(Y − Ȳ)²]
There is another way of writing the above formula which is generally easier and quicker to
use when performing actual calculations, this is
r = [ΣXY − (ΣX)(ΣY)/n] / √{[ΣX² − (ΣX)²/n][ΣY² − (ΣY)²/n]}
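The computational formula translates directly into Python (an illustrative sketch; `correlation` is our own helper name, not part of the notes):

```python
from math import sqrt

def correlation(x, y):
    """r = (ΣXY − ΣXΣY/n) / sqrt((ΣX² − (ΣX)²/n)(ΣY² − (ΣY)²/n))."""
    n = len(x)
    sxy = sum(u * v for u, v in zip(x, y)) - sum(x) * sum(y) / n
    sxx = sum(u * u for u in x) - sum(x) ** 2 / n
    syy = sum(v * v for v in y) - sum(y) ** 2 / n
    return sxy / sqrt(sxx * syy)

# Perfectly linear demo data, so r should be exactly 1
r = correlation([1, 2, 3, 4], [2, 4, 6, 8])
```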
Y values have to be obtained from several populations, each population providing the Y
values plus a corresponding X value all measured at the same time. Randomness of Y is
essential for probability theory to apply. X is fixed but may also be random.
The least-squares estimates are

b = SXY / SXX

a = Ȳ − bX̄

where SXX = Σ(X − X̄)² = ΣX² − (ΣX)²/n

and SXY = Σ(X − X̄)(Y − Ȳ) = ΣXY − (ΣX)(ΣY)/n
Note that in practical reality when a population is considered, lines computed in regression
problems are lines about which the pairs of values (X, Y) cluster; they are strictly not lines
upon which the points fall. A point on a regression line is therefore an estimate of a mean of a
population of Y's having the corresponding X value.
Example 5.1
In a random sample of n = 9 steers, the live weight and dressed weight were recorded. Let Y =
dressed weight (in hundreds of kg) and X = live weight (in hundreds of kg). Use the data
below to obtain a and b.
X Y
4.2 2.8
3.8 2.5
4.8 3.1
3.4 2.1
4.5 2.9
4.6 2.8
4.3 2.6
3.7 2.4
3.9 2.5
Total 37.2 23.7
SXX = ΣX² − (ΣX)²/n = 155.48 − (37.2)²/9 = 155.48 − 153.76 = 1.72

SXY = ΣXY − (ΣX)(ΣY)/n = 99.02 − (37.2)(23.7)/9 = 99.02 − 97.96 = 1.06

b = SXY/SXX = 1.06/1.72 = 0.616

X̄ = ΣX/n = 37.2/9 = 4.133

Ȳ = ΣY/n = 23.7/9 = 2.633

a = Ȳ − bX̄ = 2.633 − 0.616(4.133) = 0.087

Y = 0.087 + 0.616X
This is a deterministic model because the error term has been omitted; we have assumed it is
zero. Using this equation, we can predict Y for a given value of X; e.g., when X = 4,
Y = 0.087 + 0.616(4) = 2.551.
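The whole of Example 5.1 can be reproduced in Python (a sketch; carrying full precision gives a ≈ 0.086 rather than the hand-rounded 0.087, a negligible difference):

```python
def fit_line(x, y):
    """Least-squares estimates: b = SXY/SXX, a = Ybar − b·Xbar."""
    n = len(x)
    sxx = sum(v * v for v in x) - sum(x) ** 2 / n
    sxy = sum(u * v for u, v in zip(x, y)) - sum(x) * sum(y) / n
    b = sxy / sxx
    a = sum(y) / n - b * sum(x) / n
    return a, b

# Live weight (X) and dressed weight (Y) of the 9 steers, in hundreds of kg
x = [4.2, 3.8, 4.8, 3.4, 4.5, 4.6, 4.3, 3.7, 3.9]
y = [2.8, 2.5, 3.1, 2.1, 2.9, 2.8, 2.6, 2.4, 2.5]
a, b = fit_line(x, y)
# b ≈ 0.616; the predicted dressed weight at X = 4 is a + 4b ≈ 2.55
```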
6.2.4 Testing the significance of a and b
It is assumed that the Ys are normally distributed and hence the estimators are also normally
distributed. Hence we may base the confidence intervals and tests of hypotheses on the t-
distribution.
1. Compute the residual mean square S² as

S² = [Σ(Y − Ȳ)² − {Σ(X − X̄)(Y − Ȳ)}²/Σ(X − X̄)²] / (n − 2) = [SYY − (SXY)²/SXX] / (n − 2)

Remember that b = SXY/SXX, therefore S² = (SYY − bSXY)/(n − 2)

where SYY = Σ(Y − Ȳ)² = ΣY² − (ΣY)²/n

and SXY = Σ(X − X̄)(Y − Ȳ) = ΣXY − (ΣX)(ΣY)/n
2. Compute tb as

tb = b / √(S²/SXX)
3. Compare the computed tb value to the tabular t value with n − 2 degrees of freedom. The
slope b is judged to be significantly different from 0 if the absolute value of tb is greater
than the tabulated t value at the prescribed level of significance.
For Example 5.1:

SYY = ΣY² − (ΣY)²/n = 63.13 − (23.7)²/9 = 63.13 − 62.41 = 0.72

S² = [SYY − (SXY)²/SXX] / (n − 2) = [0.72 − (1.06)²/1.72] / (9 − 2) = (0.72 − 0.65)/7 = 0.01
tb = 0.616 / √(0.01/1.72) = 8.11
The tabulated t values at the 5% and 1% levels of significance with 7 (n − 2) degrees of freedom
are 2.365 and 3.499 respectively. Because the computed tb value is greater than the tabular t
values, it is concluded that there is a linear response of dressed weight to changes in live weight.
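Steps 1 and 2 applied to Example 5.1 can be sketched in Python (full precision gives tb ≈ 8.3 rather than the hand-rounded 8.11; the conclusion is unchanged):

```python
from math import sqrt

# Sums from Example 5.1
n = 9
sxx, sxy = 1.72, 1.06
syy = 63.13 - 23.7 ** 2 / 9            # SYY = ΣY² − (ΣY)²/n ≈ 0.72
b = sxy / sxx                          # slope, ≈ 0.616

s2 = (syy - sxy ** 2 / sxx) / (n - 2)  # residual mean square, ≈ 0.01
tb = b / sqrt(s2 / sxx)                # ≈ 8.3 at full precision
# tb exceeds t(0.05, 7 df) = 2.365, so the slope is significant
```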
To test whether the intercept differs from zero, compute ta = a / √[S²ΣX²/(nSXX)] and
compare it with the tabular t value at 0.05 with n − 2 df. If the calculated value is greater
than the tabular value, reject the null hypothesis that the intercept is zero.
Note that the sum of squares for regression has 1 degree of freedom, the sum of squares for
deviations has n − 2 degrees of freedom and the total sum of squares has, as usual, n − 1
degrees of freedom.
The standard errors of a and b are

SEa = √[S²ΣX² / (nSXX)] and SEb = √(S²/SXX)
Coefficient of determination (regression index)
This is the fraction of the total variation in Y that is accounted for by the association between
Y and X and is given by:
R² = SSR/SSY = bSXY / SYY
R² must always be between 0 and 1. If all the points are close to the line, the value of R² will
be close to one; but as the scatter of the points becomes greater, R² will become smaller,
indicating a poor fit. For this reason, R² is a useful measure of the strength of the relationship
between Y and X.
1 − R² is referred to as the coefficient of alienation: it is the fraction of the variation in Y
that is unaccounted for by X, i.e., the fraction associated with the errors of prediction.
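For Example 5.1 the coefficient of determination works out as follows (a sketch using the sums already computed; the rounded hand values are carried through):

```python
# R² = b·SXY / SYY, with the sums from Example 5.1
b, sxy, syy = 0.616, 1.06, 0.72
r2 = b * sxy / syy          # ≈ 0.91: X accounts for about 91% of the variation in Y
alienation = 1 - r2         # ≈ 0.09 is left to errors of prediction
```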
In multiple regression, the independent variables are assumed to be independent of each
other, and each is assumed to be linearly related to Y. The form of the equation depends on
the number of independent variables. If there are 2, then
Y = a + b1X1 + b2X2
If there are 3, then
Y = a + b1X1 + b2X2 + b3X3
Assuming that we have 2 independent variables X1 and X2, the intercept is

a = Ȳ − b1X̄1 − b2X̄2

where

X̄ = ΣX/n and Ȳ = ΣY/n

and the corrected sums of squares and products are

Σx² = Σ(X − X̄)²
Σx1x2 = Σ(X1 − X̄1)(X2 − X̄2)
Σxy = Σ(X − X̄)(Y − Ȳ)
Σy² = Σ(Y − Ȳ)²
Example 5.2
Assume the following data on different varieties of maize.
Variety number Grain yield kg/ha (Y) Plant height cm (X1) Tiller number (X2)
1 5755 110.5 14.5
2 5939 105.4 16.0
3 6010 118.1 14.6
4 6545 104.5 18.2
5 6730 93.6 15.4
6 6750 84.1 17.6
7 6899 77.8 17.9
8 7862 75.6 19.4
For these data:

Σx1² = 1753.72
Σx1x2 = −156.65
Σx2² = 23.22
Σx1y = −65194
Σx2y = 7210
Σy² = 3211504

(The negative values reflect the fact that, in this sample, yield declines as plant height increases.)
In general, the regression coefficients bi are obtained by solving the normal equations

b1Σx1² + b2Σx1x2 + ... + bkΣx1xk = Σx1y
...
b1Σx1xk + b2Σx2xk + ... + bkΣxk² = Σxky
In our case, we have 2 independent variables and thus the normal equations are
b1Σx1² + b2Σx1x2 = Σx1y
b1Σx1x2 + b2Σx2² = Σx2y
Solving these simultaneously gives

b1 = [(Σx2²)(Σx1y) − (Σx1x2)(Σx2y)] / [(Σx1²)(Σx2²) − (Σx1x2)²]

   = [(23.22)(−65194) − (−156.65)(7210)] / [(1753.72)(23.22) − (−156.65)²]

   = −23.75
and

b2 = [(Σx1²)(Σx2y) − (Σx1x2)(Σx1y)] / [(Σx1²)(Σx2²) − (Σx1x2)²]

   = [(1753.72)(7210) − (−156.65)(−65194)] / [(1753.72)(23.22) − (−156.65)²]

   = 150.27
4. Compute the regression sum of squares (SSR)

SSR = Σ bi(Σxiy), summed over i = 1, ..., k

    = (−23.75)(−65194) + (150.27)(7210) = 2631804
5. Residual sums of squares (SSE)
SSE = Σy² − SSR
    = 3211504 − 2631804
    = 579700
6. Compute F as

F = (SSR/k) / [SSE/(n − k − 1)]
  = (2631804/2) / [579700/(8 − 2 − 1)]
  = 11.35
Compare the computed F value to the tabular F value with df1 = k and df2 = (n -k -1). The R2
is significant if the computed F value is greater than the tabular F value at the prescribed level
of significance.
For our example, the tabular F values with df1 = 2 and df2 = 5 are 5.79 at the 5% level of
significance and 13.27 at the 1% level. Since the computed F value (11.35) exceeds 5.79, the
estimated equation Y = 6336 − 23.75X1 + 150.27X2 is significant at the 5% level.
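The whole calculation of Example 5.2 can be verified in Python (a sketch; small differences from the hand-rounded values are expected when full precision is carried):

```python
# Corrected sums of squares and products from Example 5.2
sx1x1, sx1x2, sx2x2 = 1753.72, -156.65, 23.22
sx1y, sx2y, syy = -65194.0, 7210.0, 3211504.0
n, k = 8, 2
ybar, x1bar, x2bar = 6561.25, 96.2, 16.7   # means from the data table

d = sx1x1 * sx2x2 - sx1x2 ** 2             # determinant of the normal equations
b1 = (sx2x2 * sx1y - sx1x2 * sx2y) / d     # ≈ -23.75
b2 = (sx1x1 * sx2y - sx1x2 * sx1y) / d     # ≈ 150.27
a = ybar - b1 * x1bar - b2 * x2bar         # ≈ 6336

ssr = b1 * sx1y + b2 * sx2y                # regression sum of squares
sse = syy - ssr                            # residual sum of squares
f = (ssr / k) / (sse / (n - k - 1))        # ≈ 11.35, compare with F(2, 5) = 5.79
```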
Exercise 10
An experiment was conducted to determine the association between minutes of mixing (X)
and an index of the textural quality (Y) of an animal feed. The data is shown below.
Number X Y Number X Y
1 5 67.45 16 20 55.65
2 5 67.90 17 20 57.74
3 5 69.41 18 20 53.54
4 5 67.64 19 20 57.05
5 5 64.17 20 20 56.98
6 10 61.94 21 25 54.78
7 10 60.32 22 25 51.91
8 10 62.47 23 25 49.45
9 10 64.78 24 25 52.25
10 10 67.96 25 25 53.94
11 15 60.83 26 30 49.11
12 15 55.78 27 30 50.29
13 15 55.90 28 30 43.63
14 15 63.89 29 30 50.82
15 15 59.00 30 30 48.76
(a) Calculate the values of a and b in the equation Y = a + bX
(b) Determine statistical significance of a and b by t test.
(c) Determine and interpret R2
(d) Calculate the standard deviation of Y when X = 10
(e) Show the ANOVA
Exercise 11
The following data were recorded for a regression study
Y 15 12 14 18 19 16 17 26 20 22 24
X1 6 7 7 8 8 9 9 10 10 11 12
X2 10 12 13 14 15 15 16 17 18 19 19
(a) Write the regression model
(b) Use the sample data to fit the model.
7 DISCRETE DATA
Normally for an analysis of variance to be carried out, the data should be drawn from a
continuous variable which is normally distributed. However, it is not uncommon to encounter
data that are not continuous, either because the results are qualitative, not quantitative (e.g.,
male or female, alive or dead, pregnant or not pregnant), or because the numbers are small
(e.g. litter size in sheep or goats). You cannot be 'a little bit pregnant' and a single ewe cannot
have a litter of 1.7 lambs.
If you happen to have several large groups of animals, then the numbers become, for all
practical purposes, continuous and approximately normal, even though, at its root, the
character is discontinuous. A herd of cows can have a conception rate (to single insemination)
of 53.7% and a flock of sheep can have a mean litter size of 1.73. Comparisons amongst
numerous herds or numerous flocks can thus be made by treating the data as continuous. But
in formal experiments employing large animals, it is not usually possible to allocate replicated
groups to each treatment; the basis of replication is almost always the individual animal.
Some of the data collected may then be categorical, meaning that the result for any one animal
falls into one or another of a small number of categories. Although litter size in cattle, sheep
or goats can be represented by a number, you can also think of it as a set of categories (single,
twins or triplets). 'Alive or dead' are clearly two mutually exclusive categories which cannot
be realistically represented by numbers, even though you might assign dummy values of 0
and 1 to these conditions for certain analytical purposes.
For pigs and rabbits, where the litters are larger, it is usually safe to treat litter size as though
it were a continuous variable, and the same goes for egg numbers at a single ovulation in
polytocous mammals or for egg laying by poultry over an extended period. Individual litter
sizes in pigs might range from 6 to 14 and the number of eggs laid by one chicken in a month
might range from 20 to 31, and such data will generate variances that can be treated as part of
a normal distribution. However, if you were considering eggs laid by individual hens on 1
single day, that would be a discrete variable with values limited to 0, 1 or 2.
The usual method of analysing categorical data is to employ a chi-squared (χ²) test. This is
easy to apply, but be aware that the test is approximate if the number in any one category is
less than 5.
Event A1 A2 A3 …. Ak
Observed frequency O1 O2 O3 …. Ok
Expected frequency E1 E2 E3 ..... Ek
Suppose that in a particular sample a set of possible events A1, A2, A3,…, Ak (see table above)
are observed to occur with frequencies O1, O2, O3,…Ok called observed frequencies, and that
according to probability rules they are expected to occur with frequencies E1, E2, E3, …, Ek
called expected or theoretical frequencies, then Chi-square is given by
χ² = (O1 − E1)²/E1 + (O2 − E2)²/E2 + ... + (Ok − Ek)²/Ek = Σ (Oi − Ei)²/Ei

with the sum taken over i = 1, ..., k.
If χ² = 0, observed and theoretical frequencies agree exactly; while if χ² > 0, they do not agree
exactly. The larger the value of χ², the greater the discrepancy between observed and
expected frequencies.
In practice, expected frequencies are computed on the basis of a hypothesis Ho. If under this
hypothesis the computed value of χ² is greater than some critical value (such as χ².95 or χ².99,
which are the critical values at the 0.05 and 0.01 significance levels respectively), we
conclude that observed frequencies differ significantly from expected frequencies and would
reject Ho at the corresponding level of significance. Otherwise we would accept it, or at least
not reject it. This procedure is called the chi-square test of hypothesis or significance.
It should be noted that we must look with suspicion upon circumstances where χ² is too close
to zero, since it is rare that observed frequencies agree too well with expected frequencies. To
examine such situations, we can determine whether the computed value of χ² is less than
χ².05 or χ².01, in which cases we would decide that the agreement is too good at the 0.05 or 0.01
levels of significance respectively.
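The χ² statistic is straightforward to compute (a sketch; the two-category counts below are hypothetical, invented purely to illustrate the formula):

```python
def chi_square(observed, expected):
    """χ² = Σ (Oi − Ei)² / Ei over the k categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical two-category example: 44 deaths and 56 survivals observed,
# where the null hypothesis expects 50 of each.
chi2 = chi_square([44, 56], [50, 50])   # (36/50) + (36/50) = 1.44
# 1.44 is below 3.841, the 5% critical value with 1 df,
# so the departure from expectation is not significant
```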
The following table shows a chi-squared analysis of deaths and survival in two groups of
animals.
In this table, the expected number dying in each group is readily derived from the null
hypothesis that there is no difference (other than chance) between the groups, and therefore
the expected proportion of deaths is that observed in the entire sample, i.e., 44/438.