Variance Decomposition in Unbalanced Data
6, 2021-01
Hugo Hernandez
ForsChem Research, 050030 Medellin, Colombia
[email protected]
doi: 10.13140/RG.2.2.16789.35043
Abstract
In this report, the basic principles of Variance Decomposition (ANOVA) are explained. These
principles are then used to understand the implications of analyzing unbalanced data, and to
propose some corrections in the calculations. For instance, by determining the equivalent
degrees of freedom of each factor, it is possible to mitigate the effect of non-orthogonality of
unbalanced data on the results. In addition, unwanted correlation effects between factors in
multi-factor analyses can also be compensated with correction factors in the calculation of the
sums of squares, while at the same time the law of total variance can be preserved. Different
examples of unbalanced data reported in the literature are presented in order to illustrate
the use of the proposed analysis, and to highlight the differences in the results obtained with
respect to conventional calculation methods. In addition, unbalanced simulated data obtained
from a pre-defined model are used to demonstrate the effect of confounding between main
effects and interactions, and to show the benefit of the proposed approach for reducing the risk
of reaching erroneous conclusions in these situations. The proposed ANOVA calculations were
implemented in MS-Excel (ForsChem Actinium XL).
Keywords
Adam’s Law, ANOVA, Eve’s Law, Factorial, Hypothesis Testing, Interactions, Model, Statistical
Significance, Sum of Squares, Unbalanced Designs, Variance
1. Introduction
The starting point of variance decomposition is the Law of Total Variance (also known as Eve's Law):

Var(Y) = E(Var(Y|X)) + Var(E(Y|X))
(1.1)

where Y and X are random variables and Y|X represents the conditional event when X is
known. In other words, if we consider that X describes different groups of data, then the
term E(Var(Y|X)) represents the within-group variation (average variation of Y within each
group of X) and Var(E(Y|X)) represents the between-group variation (variance of group
averages). Notice that the within-group variation is related to the residual error (variation not
explained by X) or experimental noise.
The Law of Total Variance results from the definition of the variance operator and the
application of Adam's Law [4]:

E(Y) = E(E(Y|X))
(1.2)

simply stating that the overall average is the weighted average of group averages.
The one-way (univariate) ANOVA [2] basically represents a statistical test of the following
hypotheses:
H0: Var(E(Y|X)) ≤ E(Var(Y|X))
H1: Var(E(Y|X)) > E(Var(Y|X))
(1.3)

Only when the between-group variation is significantly larger than the within-group variation
can we safely conclude that the effect of factor X on the response variable Y is statistically
significant.
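As a quick numerical illustration of Eq. (1.1), the decomposition can be verified directly on grouped data (a minimal sketch using population variances; the data values are arbitrary and chosen only for illustration):

```python
import numpy as np

# Arbitrary illustrative data: response values for three groups of a factor X
groups = [np.array([9.0, 7.0, 8.0]),
          np.array([5.0, 4.0]),
          np.array([7.0, 6.0, 6.0, 7.0])]
y = np.concatenate(groups)
n = np.array([len(g) for g in groups])
N = n.sum()

total_var = y.var()  # Var(Y), population form
# E(Var(Y|X)): group variances weighted by group frequency
within = sum(ni * g.var() for ni, g in zip(n, groups)) / N
# Var(E(Y|X)): variance of group means, weighted by group frequency
between = sum(ni * (g.mean() - y.mean())**2 for ni, g in zip(n, groups)) / N

# Law of Total Variance: Var(Y) = E(Var(Y|X)) + Var(E(Y|X))
assert abs(total_var - (within + between)) < 1e-9
```

The identity holds exactly for any grouping, balanced or not, as long as population (not sample) variances are used throughout.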
When two or more factors (X1, X2, …, Xm) are involved, the complexity of Adam and Eve's
Laws increases. In this case, Adam's Law becomes:

E(Y) = E(E(Y|X1, X2, …, Xm))
(1.4)

and Eve's Law:

Var(Y) = E(Var(Y|X1, X2, …, Xm)) + Var(E(Y|X1, X2, …, Xm))
(1.5)

which can be equivalently expressed as [5]:

Var(Y) = E(Var(Y|X1, X2, …, Xm)) + Var(E(Y|X1)) + E(Var(E(Y|X1, X2)|X1))
+ E(Var(E(Y|X1, X2, X3)|X1, X2)) + …
+ E(Var(E(Y|X1, …, Xm)|X1, …, Xm−1))
(1.6)
The result presented in Eq. (1.6) is obtained considering that (from Eq. 1.1 and 1.4):

Var(E(Y|X1, …, Xm)) = E(Var(E(Y|X1, …, Xm)|X1, …, Xm−1)) + Var(E(E(Y|X1, …, Xm)|X1, …, Xm−1))
= E(Var(E(Y|X1, …, Xm)|X1, …, Xm−1)) + Var(E(Y|X1, …, Xm−1))
(1.7)

along with the more general rule:

Var(E(Y|X1, …, Xk)) = E(Var(E(Y|X1, …, Xk)|X1, …, Xk−1)) + Var(E(Y|X1, …, Xk−1))
(1.8)

which is applied recursively until reaching Var(E(Y|X1)).
The first term on the right-hand side of Eq. (1.6) again represents the residual error, the variation
in Y not explained by any of the factors considered. The second term represents the variation
explained by X1, the third term represents the remaining variation explained by X2, and so on.
The last term on the right-hand side of Eq. (1.6) represents the remaining variation in Y
explained by Xm.
When the factors are independent from each other, the remaining variation explained by a
given factor corresponds to the total variation explained by that factor, and the Law of Total
Variance greatly simplifies into:

Var(Y) = E(Var(Y|X1, …, Xm)) + Var(E(Y|X1)) + Var(E(Y|X2)) + … + Var(E(Y|Xm))
(1.9)
Eq. (1.9) is the basis of the famous ANOVA table, and the corresponding hypotheses to be
tested become:
H0,i: Var(E(Y|Xi)) ≤ E(Var(Y|X1, …, Xm))
H1,i: Var(E(Y|Xi)) > E(Var(Y|X1, …, Xm))
(1.10)
Such independence between factors can be easily achieved by means of balanced and
orthogonal designs of experiments, which are basic principles of classical experimental design
[1]. However, when these criteria are not met, not only does the complexity of the calculations
increase, but one must also be careful with the interpretation of the results [6]. Even in
planned experimental designs, unbalanced data may result from missing or failed
experiments, or from unplanned repetitions of experiments. Unbalanced data can be balanced
simply by randomly discarding the results in excess, or by "fabricating" the missing data from
all other results. The latter approach is not recommended, because it borders on data
falsification (even though statistical techniques are used). On the other hand, while randomly
removing data seems a better approach, discarding valid data always represents a loss of
valuable information for the analysis (as well as a loss of degrees of freedom for improving the
estimation of the residual error). Thus, it would be desirable to directly analyze unbalanced
data without any sort of manipulation.
2. One-Way ANOVA
In order to better understand the problem considered in this report, let us begin by explaining
the simplest case of variance decomposition: the one-way ANOVA [2].
Let Y be a response variable measured at k different levels (groups) of a single factor X, with
n_i observations y_ij (j = 1, 2, …, n_i) at each level i. The overall mean and variance of the
response are:

E(Y) = (Σ_{i=1}^{k} Σ_{j=1}^{n_i} y_ij) / (Σ_{i=1}^{k} n_i)
(2.1)

Var(Y) = (Σ_{i=1}^{k} Σ_{j=1}^{n_i} (y_ij − E(Y))²) / (Σ_{i=1}^{k} n_i)
(2.2)

whereas the mean and variance within each group i are:

E(Y|X = i) = (Σ_{j=1}^{n_i} y_ij) / n_i
(2.3)

Var(Y|X = i) = (Σ_{j=1}^{n_i} (y_ij − E(Y|X = i))²) / n_i
(2.4)
Notice that (using Eq. 2.3 and 2.1):

E(Y) = (Σ_{i=1}^{k} n_i E(Y|X = i)) / (Σ_{i=1}^{k} n_i)
(2.5)

which corresponds to Adam's Law for a single factor.
On the other hand, the within-group variation is (using Eq. 2.4 and 2.3):

E(Var(Y|X)) = (Σ_{i=1}^{k} n_i Var(Y|X = i)) / (Σ_{i=1}^{k} n_i)
= (Σ_{i=1}^{k} Σ_{j=1}^{n_i} (y_ij − E(Y|X = i))²) / (Σ_{i=1}^{k} n_i)
(2.6)
and the between-group variation is (using Eq. 2.3 and 2.5):

Var(E(Y|X)) = (Σ_{i=1}^{k} n_i (E(Y|X = i) − E(Y))²) / (Σ_{i=1}^{k} n_i)
(2.7)
where the variance is evaluated taking into account the relative frequency of each group.
Replacing Eq. (2.6) and (2.7) in the Law of Total Variance (Eq. 1.1) results in:

(Σ_i Σ_j (y_ij − E(Y))²)/(Σ_i n_i) = (Σ_i Σ_j (y_ij − E(Y|X = i))²)/(Σ_i n_i) + (Σ_i n_i (E(Y|X = i) − E(Y))²)/(Σ_i n_i)
(2.8)
Notice that there is a common denominator Σ_i n_i for all terms in Eq. (2.8). An equivalent
equality can therefore be obtained:

Σ_i Σ_j (y_ij − E(Y))² = Σ_i Σ_j (y_ij − E(Y|X = i))² + Σ_i n_i (E(Y|X = i) − E(Y))²
(2.9)

representing the decomposition of the sum of squares, where the left-hand term is denoted as
the total sum of squares (SS_T), the first term at the right-hand side is denoted as the sum of
squares due to error (SS_E), and the last term is the sum of squares due to factor X (SS_X):

SS_T = SS_E + SS_X
(2.10)
Let us now consider each term independently. The total sum of squares consists of a sum of
Σ_i n_i squared terms. Assuming the response data (y_ij) to behave normally with a mean value
E(Y) and standard deviation σ, we can say that the total sum of squares behaves proportional
to a χ² random variable with Σ_i n_i − 1 degrees of freedom [7]:

SS_T/σ² = Σ_i Σ_j ((y_ij − E(Y))/σ)² ~ χ²(Σ_i n_i − 1)
(2.11)

ν_T = Σ_i n_i − 1
(2.12)
Similarly, assuming a normal behavior of the residual error:

SS_E/σ² = Σ_i Σ_j ((y_ij − E(Y|X = i))/σ)² ~ χ²(Σ_i n_i − k)
(2.13)

so that ν_E = Σ_i n_i − k. In this case, k degrees of freedom are subtracted from Σ_i n_i, one for
each of the following conditions:

Σ_{j=1}^{n_i} (y_ij − E(Y|X = i)) = 0, for i = 1, 2, …, k
(2.14)
Finally, assuming a normal behavior of the group averages:

SS_X/σ² = Σ_{i=1}^{k} n_i ((E(Y|X = i) − E(Y))/σ)²
(2.15)

If the data is balanced, that is, if n_i = n for i = 1, 2, …, k, then:

SS_X/σ² ~ χ²(k − 1)
(2.16)

proportional to a χ² random variable with ν_X degrees of freedom, where:

ν_X = k − 1
(2.17)
Thus, considering balanced, normal data, the following hypotheses for an F test (ratio of two
χ² distributions) are formulated:

H0: Var(E(Y|X)) ≤ E(Var(Y|X))
H1: Var(E(Y|X)) > E(Var(Y|X))
(2.18)

where

F(ν_X, ν_E) = (χ²(ν_X)/ν_X) / (χ²(ν_E)/ν_E)
(2.19)

and the estimation of F is given by:

F̂ = MS_X / MS_E
(2.20)

with

MS_X = SS_X / ν_X
(2.21)

MS_E = SS_E / ν_E
(2.22)
These calculations are summarized in the ANOVA Table presented in Table 1.
The estimated F-value (F̂) can then be compared with the critical F-value of a right-tail F
distribution considering a significance level α (F_cr(α, ν_X, ν_E)), rejecting the null
hypothesis (H0) in Eq. (2.18) when:

F̂ > F_cr(α, ν_X, ν_E)
(2.23)

Alternatively, a p-value of the right-tail F distribution can be calculated for F̂, rejecting the null
hypothesis when:

p = P(F(ν_X, ν_E) > F̂) < α
(2.24)
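The calculations above can be sketched numerically. The following snippet (a minimal sketch using only NumPy; the function name is my own) reproduces the uncorrected one-way ANOVA of the catalyst data later analyzed in Section 5 (Table 9 / Table 10):

```python
import numpy as np

def one_way_anova(groups):
    """Return SS_X, SS_E, SS_T and the F statistic for a one-way layout."""
    y = np.concatenate(groups)
    grand_mean = y.mean()
    k = len(groups)
    N = y.size
    # Between-group sum of squares (last term of Eq. 2.9)
    ss_x = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    # Within-group (error) sum of squares (first right-hand term of Eq. 2.9)
    ss_e = sum(((g - g.mean()) ** 2).sum() for g in groups)
    ss_t = ((y - grand_mean) ** 2).sum()
    f_hat = (ss_x / (k - 1)) / (ss_e / (N - k))  # Eq. (2.20)-(2.22)
    return ss_x, ss_e, ss_t, f_hat

# Reaction yields for catalysts A, B and C (Table 9)
yields = [np.array([91., 95., 92., 90.]),
          np.array([85., 88., 87., 86., 89.]),
          np.array([96., 98., 97., 97.])]
ss_x, ss_e, ss_t, f_hat = one_way_anova(yields)
print(round(ss_x, 4), ss_e, round(f_hat, 4))  # 223.0769 26.0 42.8994
```

The output matches the Catalyst and Error rows of Table 10 (SS 223.0769 and 26, F = 42.8994).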
If, on the other hand, the data is unbalanced, the one-way ANOVA table presented before is no
longer formally valid, since:

SS_X/σ² ≁ χ²(k − 1)
(2.25)

SS_X/σ² = n̄ Σ_{i=1}^{k} w_i z_i²
(2.26)

where n̄ = (Σ_l n_l)/k is the average number of elements per group, w_i = n_i/Σ_l n_l is the relative
weight of each term in a weighted sum of squares of the standardized group deviations z_i, and
Σ_i w_i = 1.

Let us now compare the properties of the balanced (Eq. 2.16) and unbalanced (Eq. 2.26) statistics:

E(SS_X,bal/σ²) = E(SS_X,unbal/σ²) = k − 1
(2.27)

Var(SS_X,bal/σ²) = 2(k − 1) ≤ Var(SS_X,unbal/σ²) = 2 k (k − 1) (Σ_i n_i²)/(Σ_i n_i)²
(2.28)
Now, since the variance of a χ² distribution with ν degrees of freedom is:

Var(χ²(ν)) = 2ν
(2.29)

then the variance of the unbalanced statistic can be considered equivalent to the variance of a χ²
distribution with equivalent degrees of freedom:

ν_eq = k (k − 1) (Σ_{i=1}^{k} n_i²) / (Σ_{i=1}^{k} n_i)²
(2.30)
Thus, by correcting only the degrees of freedom of the between-group variation in the
determination of the critical -value (or in the calculation of the -value), a more general test of
statistical significance is obtained, which is valid for both balanced and unbalanced data. The
ANOVA table presented in Table 1 remains unchanged, but now the criteria for rejecting the
null hypothesis become:
F̂ > F_cr(α, ν_eq, ν_E)
(2.31)

p = P(F(ν_eq, ν_E) > F̂) < α
(2.32)
Considering that F_cr and p values are tabulated or computed for integer degrees of freedom,
the equivalent F_cr and p values can be determined by linear interpolation as follows:

F_cr(α, ν_eq, ν_E) ≈ F_cr(α, ⌊ν_eq⌋, ν_E) + (ν_eq − ⌊ν_eq⌋) (F_cr(α, ⌈ν_eq⌉, ν_E) − F_cr(α, ⌊ν_eq⌋, ν_E))
(2.33)

p(ν_eq) ≈ p(⌊ν_eq⌋) + (ν_eq − ⌊ν_eq⌋) (p(⌈ν_eq⌉) − p(⌊ν_eq⌋))
(2.34)

where ⌊ ⌋ and ⌈ ⌉ represent the floor and ceiling rounding operators, and p(ν) = P(F(ν, ν_E) > F̂).
According to Eq. (2.30), the minimum equivalent degrees of freedom will be:

ν_eq,min = k − 1
(2.35)

obtained when all groups have the same size, and the maximum equivalent degrees of
freedom will be:

ν_eq,max = k (k − 1)
(2.36)

obtained when one of the groups is extremely large compared to all others.
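As a numerical check (a minimal sketch with my own function names; the tabulated critical values F_cr(5%; 2, 10) = 4.1028 and F_cr(5%; 3, 10) = 3.7083 are taken as given), the equivalent degrees of freedom and interpolated critical F-value for the catalyst example of Section 5 (k = 3 groups of sizes 4, 5 and 4) can be computed as:

```python
import math

def nu_eq(n):
    """Equivalent degrees of freedom for an unbalanced one-way layout (Eq. 2.30)."""
    k = len(n)
    return k * (k - 1) * sum(ni**2 for ni in n) / sum(n)**2

def interpolate_fcr(nu, fcr_floor, fcr_ceil):
    """Linear interpolation of the critical F-value between integer df (Eq. 2.33)."""
    frac = nu - math.floor(nu)
    return fcr_floor + frac * (fcr_ceil - fcr_floor)

nu = nu_eq([4, 5, 4])  # 3*2*(16+25+16)/13**2 = 342/169
print(round(nu, 4))    # 2.0237
# Tabulated 5% critical values for denominator df = 10 at integer numerator df 2 and 3
print(round(interpolate_fcr(nu, 4.1028, 3.7083), 4))  # 4.0935
```

The interpolated value 4.0935 matches the corrected critical F-value reported later in Table 11.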
Finally, let us notice that SS_E can be interpreted as the sum of squared residuals between the
experimental observations y_ij and the corresponding predictions ŷ_ij given by the linear model:

ŷ_ij = E(Y) + e_X(i)
(2.37)

where e_X(i) is the effect of factor X when it is at level i, and is given by:

e_X(i) = E(Y|X = i) − E(Y) = (Σ_j y_ij)/n_i − (Σ_i Σ_j y_ij)/(Σ_i n_i)
(2.38)

resulting in:

SS_E = Σ_i Σ_j (y_ij − ŷ_ij)²
(2.39)

Thus, the sum of squares due to factor X can be simply expressed as the variation not
considered in the residuals of model (2.37):

SS_X = SS_T − SS_E
(2.40)
3. Two-Way ANOVA
In the case of multi-factor analysis, data unbalance causes an additional effect. Let us consider
the simplest multi-factor case: two independent factors (X1 and X2). Assuming a balanced and
orthogonal design (e.g. a full factorial design), Eve's Law for two independent factors becomes
(from Eq. 1.9):

Var(Y) = E(Var(Y|X1, X2)) + Var(E(Y|X1)) + Var(E(Y|X2))
(3.1)

Following the ideas presented in Section 2, Eq. (3.1) can be equivalently expressed as the
following sum of squares:

SS_T = SS_E + SS_X1 + SS_X2
(3.2)
In this case,

SS_T = Σ_i Σ_j Σ_r (y_ijr − E(Y))²
(3.3)

where i represents one of the k1 levels of factor X1, j represents one of the k2 levels of factor
X2, and r is one of the replicates of the design.

The residuals represented by SS_E consider the variation not explained by the effects of X1 and
X2, as described by the following linear model§:

ŷ_ijr^(X1,X2) = E(Y) + e_X1(i) + e_X2(j)
(3.4)
where the superscripts indicate the terms considered in the model, and the effects are given
by:

e_Xf(l) = E(Y|Xf = l) − E(Y)
(3.5)

Combining Eq. (3.4) and (3.5) results in:

ŷ_ijr^(X1,X2) = E(Y|X1 = i) + E(Y|X2 = j) − E(Y)
(3.6)

Then, the SS_E for model (3.4) is:

SS_E^(X1,X2) = Σ_i Σ_j Σ_r (y_ijr − ŷ_ijr^(X1,X2))²
(3.7)
Now, the sum of squares due to each factor can be determined as follows:

SS_X1 = SS_T − SS_E^(X1)
SS_X2 = SS_T − SS_E^(X2)
(3.8)

where SS_E^(Xf) represents the sum of squares of the residuals of the following model:

ŷ_ijr^(Xf) = E(Y) + e_Xf
(3.9)

determined as:

SS_E^(Xf) = Σ_i Σ_j Σ_r (y_ijr − ŷ_ijr^(Xf))²
(3.10)

Eq. (3.8) can be obtained simply by performing a one-way ANOVA on factor Xf while neglecting
the effect of all other factors (whose contributions remain contained in the residual term
SS_E^(Xf)).
The degrees of freedom associated to the sum of squares due to each factor are:

ν_Xf = k_f − 1
(3.11)

whereas the residual degrees of freedom are:

ν_E = N − k1 − k2 + 1
(3.12)

where N is the total number of observations. The mean squares terms are calculated in general
for the source of variation s as:

MS_s = SS_s / ν_s
(3.13)

and the F-value for each factor is:

F̂_Xf = MS_Xf / MS_E
(3.14)

Finally, the statistical significance of each factor (rejecting the corresponding null hypothesis) is
determined by any of the following criteria:

F̂_Xf > F_cr(α, ν_Xf, ν_E)
(3.15)

p_Xf = P(F(ν_Xf, ν_E) > F̂_Xf) < α
(3.16)

§ The model presented does not consider the interaction between both factors. When the interaction is
considered, a third set of terms emerges in the model; this will be covered at the end of this section.
Since the data is balanced, no corrections are needed. Table 2 shows the corresponding
general two-way ANOVA table.
Table 2. Two-way ANOVA table for balanced normal data (without interaction)
Source of Variation | Sum of Squares | Degrees of Freedom | Mean Squares | F
Factor X1 | SS_X1 (Eq. 3.8) | ν_X1 (Eq. 3.11) | MS_X1 (Eq. 3.13) | F̂_X1 (Eq. 3.14)
Factor X2 | SS_X2 (Eq. 3.8) | ν_X2 (Eq. 3.11) | MS_X2 (Eq. 3.13) | F̂_X2 (Eq. 3.14)
Residual error | SS_E (Eq. 3.7) | ν_E (Eq. 3.12) | MS_E (Eq. 3.13) |
Total | SS_T (Eq. 3.3) | ν_T | |
When the data is unbalanced, Eq. (3.1) is no longer valid. The correct expression for the law of
total variance is in this case (from Eq. 1.6):

Var(Y) = E(Var(Y|X1, X2)) + Var(E(Y|X1)) + E(Var(E(Y|X1, X2)|X1))
(3.17)

In terms of sums of squares, Eq. (3.17) can be expressed as:

SS_T = SS_E + SS_X1 + SS_(X2|X1)
(3.18)

where SS_(X2|X1) represents the sum of squares due to factor X2 remaining after considering the
effect of factor X1. Notice that, unless both factors are independent:

SS_(X2|X1) ≠ SS_X2
(3.19)

The different types of sums of squares (SS_Xf and SS_(Xf|Xg)) were originally noticed by Yates
[8,9], and were later denoted as SS Type I (sequential) and SS Type II (partial) [10,11],
respectively.

The covariance of the level values between two factors is given by:

Cov(X1, X2) = E(X1 X2) − E(X1) E(X2)
(3.20)
Let us consider for example the following balanced design (a 2² factorial design with 2
replicates):

Table 3. Example of a randomized 2² factorial design with 2 replicates

Experiment # 1 2 3 4 5 6 7 8
X1 1 -1 1 -1 1 -1 -1 1
X2 1 1 -1 -1 -1 -1 1 1
Y 9 7 5 5 8 4 7 8

For this balanced design, the covariance between the factors is Cov(X1, X2) = 0, indicating that
they are independent.
Now assume that the last experiment failed and is not available for the analysis. Then, the
new covariance between the factors excluding experiment #8 is:

Cov(X1, X2) = E(X1 X2) − E(X1) E(X2) = −1/7 − (−1/7)(−1/7) = −8/49 ≈ −0.1633
(3.21)

Thus, it can be seen that unbalance in multi-factor data results in factor dependence
(correlation), and therefore: SS_(Xf|Xg) ≠ SS_Xf.
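This effect can be checked directly (a minimal sketch; the population covariance of Eq. 3.20 is used):

```python
import numpy as np

# Design columns of Table 3
x1 = np.array([1, -1, 1, -1, 1, -1, -1, 1])
x2 = np.array([1, 1, -1, -1, -1, -1, 1, 1])

def cov(a, b):
    # Population covariance: Cov(A,B) = E(AB) - E(A)E(B) (Eq. 3.20)
    return (a * b).mean() - a.mean() * b.mean()

print(cov(x1, x2))          # 0.0 -> balanced design, independent factors
print(cov(x1[:7], x2[:7]))  # -8/49 -> unbalanced (experiment #8 removed), correlated
```

Removing a single experiment is enough to make the two design columns correlated, which is precisely what breaks the orthogonal decomposition of the sums of squares.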
Since the factors in unbalanced data are correlated, each individual model

ŷ^(Xf) = E(Y) + e_Xf
(3.22)

will also capture part of the effect of the other factors correlated with Xf, and therefore in
unbalanced data:

SS_T ≠ SS_E + SS_X1 + SS_X2
(3.23)

On the other hand, Eq. (3.18) can also be alternatively expressed as:

SS_T = SS_E + SS_X2 + SS_(X1|X2)
(3.24)
resulting in a completely different ANOVA table, and possibly leading to different conclusions.
Table 4 shows the ANOVA table obtained after decomposing the variance according to Eq.
(3.18), while Table 5 shows the ANOVA obtained by decomposing the variance according to Eq.
(3.24). The ANOVA calculations were performed using the anova function on a linear model (lm)
in R (https://cran.r-project.org/), which uses Type I SS [10,11].
Table 4. ANOVA table obtained in R for the data presented in Table 3 when the last experiment
is missing. R model: lm(Y~X1+X2)
Df Sum Sq Mean Sq F value Pr(>F)
X1 1 4.2976 4.2976 3.4381 0.13732
X2 1 10.4167 10.4167 8.3333 0.04471 *
Residuals 4 5 1.25
Table 5. ANOVA table obtained in R for the data presented in Table 3 when the last experiment
is missing. R model: lm(Y~X2+X1)
Df Sum Sq Mean Sq F value Pr(>F)
X2 1 8.0476 8.0476 6.4381 0.06416 .
X1 1 6.6667 6.6667 5.3333 0.08209 .
Residuals 4 5 1.25
Notice that even though the calculations for the residuals are exactly the same, the
decomposition of variance changes depending on the order of the analysis. Most importantly,
considering a 5% significance level, the first ANOVA identifies only factor X2 as statistically
significant, whereas in the second ANOVA no factor is found statistically significant. Clearly, the
conclusion should not depend on the particular factor order considered for the analysis.
While other approaches for estimating the sums of squares are possible (including Type II and
Type III SS [11,12]), only the Type I SS preserves the validity of Eve's Law for variance
decomposition. Thus, one possible alternative approach for compensating the correlation effect
in the sum of squares due to each factor, while preserving the validity of Eve's Law, is defining
the following corrected sums of squares (SS*):

SS*_X1 = (SS_X1 / (SS_X1 + SS_X2)) (SS_T − SS_E)
SS*_X2 = (SS_X2 / (SS_X1 + SS_X2)) (SS_T − SS_E)
(3.25)

This way, the total explained variation (SS_T − SS_E) is distributed among the factors
proportionally to the relative magnitude of their individual (total) effects, and the law of total
variance is preserved:

SS_T = SS_E + SS*_X1 + SS*_X2
(3.26)
Table 6 shows the ANOVA table obtained by using the corrected sums of squares defined in Eq.
(3.25) for the example previously considered. Notice that the corrected sums of squares have
intermediate values between the sums of squares reported for each factor in the previous
ANOVA tables (Table 4 and Table 5), as expected. No factor is found statistically significant,
although factor X2 is at the very limit of significance. Additional data should be collected if a
more definitive conclusion regarding X2 is required.
Table 6. ANOVA table for the data presented in Table 3 (when the last experiment is missing)
obtained after correcting the sum of squares due to each factor using Eq. (3.25).
Source of Variation Sum of Squares Degrees of Freedom Mean Squares F P-value**
X1 5.1223 1 5.1223 4.0979 0.1129
X2 9.5920 1 9.5920 7.6736 0.0503
Residual error 5 4 1.25
Total 19.7143 6
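The corrected decomposition in Table 6 can be reproduced numerically (a minimal sketch with my own function names; the residual sum of squares is obtained here from a least-squares fit of the additive model, consistent with the regression formulation of Section 4):

```python
import numpy as np

# Table 3 data with experiment #8 missing (unbalanced)
x1 = np.array([1., -1., 1., -1., 1., -1., -1.])
x2 = np.array([1., 1., -1., -1., -1., -1., 1.])
y  = np.array([9., 7., 5., 5., 8., 4., 7.])

ss_t = ((y - y.mean())**2).sum()

def one_way_ss(x, y):
    # Total (one-way) SS of a factor, ignoring all other factors (Eq. 3.8)
    return sum(y[x == lvl].size * (y[x == lvl].mean() - y.mean())**2
               for lvl in np.unique(x))

ss_x1, ss_x2 = one_way_ss(x1, y), one_way_ss(x2, y)

# Residual SS of the additive (no-interaction) model fitted by least squares
A = np.column_stack([np.ones_like(y), x1, x2])
ss_e = ((y - A @ np.linalg.lstsq(A, y, rcond=None)[0])**2).sum()

# Corrected SS (Eq. 3.25): distribute SS_T - SS_E proportionally
explained = ss_t - ss_e
ss_corr1 = ss_x1 / (ss_x1 + ss_x2) * explained
ss_corr2 = ss_x2 / (ss_x1 + ss_x2) * explained
print(round(ss_corr1, 4), round(ss_corr2, 4), round(ss_e, 4))  # 5.1223 9.592 5.0
```

The corrected sums of squares 5.1223 and 9.5920 agree with Table 6, and by construction they add up to SS_T − SS_E, preserving the law of total variance.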
** The p-values were obtained considering the corrected degrees of freedom for each factor (Eq. 2.30),
corresponding in both cases to ν_eq = 50/49 ≈ 1.02.

When interactions are considered, an additional term is added to the model (3.4):

ŷ_ijr^(X1,X2,X1X2) = E(Y) + e_X1(i) + e_X2(j) + e_X1X2(i, j)
(3.27)

While the interaction could be considered as an additional factor, it must be treated slightly
differently, because the Type I sum of squares of the interaction contains most of the sums of
squares of the interacting factors. Thus, the Type II sum of squares of the interaction must
necessarily be considered, as follows:

SS_(X1X2|X1,X2) = SS_E^(X1,X2) − SS_E^(X1,X2,X1X2)
(3.28)
SS*_s = (SS_s / (SS_X1 + SS_X2 + SS_X1X2)) (SS_T − SS_E)
(3.29)

where s denotes any of the sources of variation (X1, X2 or X1X2), extending the proportional
correction of Eq. (3.25) to include the interaction as an additional source, with SS_X1X2 being the
total sum of squares explained when the interaction is treated as a single factor.
On the other hand, the degrees of freedom for the interaction term will be:

ν_X1X2 = (k1 − 1)(k2 − 1)
(3.30)

whereas the residual degrees of freedom for the model including the interaction become:

ν_E = N − k1 k2
(3.31)

The ANOVA table considering the interaction then becomes (in general for balanced or
unbalanced data):
Table 7. Two-way ANOVA table for balanced or unbalanced normal data (with interaction)
Source of Variation | Sum of Squares | Degrees of Freedom | Mean Squares | F
Factor X1 | SS*_X1 (Eq. 3.29) | ν_X1 (Eq. 3.11) | MS_X1 | F̂_X1
Factor X2 | SS*_X2 (Eq. 3.29) | ν_X2 (Eq. 3.11) | MS_X2 | F̂_X2
Interaction X1X2 | SS*_X1X2 (Eq. 3.29) | ν_X1X2 (Eq. 3.30) | MS_X1X2 | F̂_X1X2
Residual error | SS_E (Eq. 3.7) | ν_E (Eq. 3.31) | MS_E (Eq. 3.13) |
Total | SS_T (Eq. 3.3) | |
4. Multi-Factor Analysis
Let us now generalize the variance decomposition analysis of balanced or unbalanced data
considering the following multi-factor estimation model for the response variable:
̂ ∑∑
(4.1)
where and are model coefficients obtained from the experimental data by least
squares regression, and are binary variables defined as:
( )
{
( )
(4.2)
that is, they have a value of when the corresponding observation was obtained when factor
was at level , and a value of for all other levels of .
For this particular model (Eq. 4.1), the sum of squares due to the model residuals (SS_E) will be:

SS_E = Σ (y − ŷ)² = Σ (y − b_0 − Σ_f Σ_l b_{f,l} x_{f,l})²
(4.3)

where the outer sum runs over all available observations.
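A sketch of this estimation model using binary indicator variables and least squares (my own minimal implementation with hypothetical data; the paper's MS-Excel implementation is not reproduced here):

```python
import numpy as np

def indicator_matrix(factor_levels):
    """Binary variables x_{f,l} (Eq. 4.2) for one factor given as a list of labels."""
    levels = sorted(set(factor_levels))
    return np.array([[1.0 if obs == lvl else 0.0 for lvl in levels]
                     for obs in factor_levels]), levels

# Hypothetical observations of one factor at three levels (illustrative values)
factor = ["A", "B", "A", "C", "B", "C", "A"]
y = np.array([9., 5., 8., 7., 4., 6., 9.])

X, levels = indicator_matrix(factor)
A = np.column_stack([np.ones(len(y)), X])   # intercept b0 plus indicators
b, *_ = np.linalg.lstsq(A, y, rcond=None)   # least squares coefficients
y_hat = A @ b
ss_e = ((y - y_hat) ** 2).sum()             # Eq. (4.3)
# With a single factor, the least squares fit reproduces the group means
assert np.allclose(y_hat[0], np.mean([9., 8., 9.]))
```

Note that the intercept and the full set of indicators are linearly dependent; `lstsq` still returns a valid minimum-norm solution, and the fitted values (and hence SS_E) are unaffected by this redundancy.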
The MS_E values may be different for each possible combination of the factors considered in the
model. Thus, while the analysis of variance decomposition can be done for any model
proposed, the most reliable conclusions will be those obtained from the model with the lowest
MS_E value. Usually, terms with low F̂ values are less likely to contribute to lowering the MS_E
value. Thus, those terms can be removed stepwise in order to improve the model.
The F̂ values for each factor in the model, considering either balanced or unbalanced data, can
be determined as the ratio of mean squares:

F̂_Xf = MS_Xf / MS_E
(4.4)

where

MS_Xf = SS*_Xf / ν_Xf
(4.5)

MS_E = SS_E / ν_E
(4.6)

SS*_Xf = (SS_Xf / Σ_g SS_Xg) (SS_T − SS_E), with SS_Xf = Σ_l n_{f,l} (E(Y|Xf = l) − E(Y))²
(4.7)

SS_E = Σ (y − E(Y|X1, …, Xm))² (for the model containing all factors)
(4.8)

SS_T = Σ (y − E(Y))²
(4.9)

ν_E = N − 1 − Σ_f ν_Xf
(4.10)

ν_T = N − 1, where N is the total number of observations
(4.11)

ν_Xf = k_f − 1
(4.12)
Finally, the statistical significance of each factor is confirmed when any of the following criteria
is met:

F̂_Xf > F_cr(α, ν_eq,Xf, ν_E)
(4.12)

p_Xf = P(F(ν_eq,Xf, ν_E) > F̂_Xf) < α
(4.13)

where

ν_eq,Xf = (k_f − 1) k_f (Σ_l n_{f,l}²) / (Σ_l n_{f,l})²
(4.14)

n_{f,l} = Σ x_{f,l}
(4.15)

(summing over all available observations), and the critical F-values and p-values at the
equivalent (non-integer) degrees of freedom are obtained by linear interpolation, as in Eq.
(2.33) and (2.34):

F_cr(α, ν_eq,Xf, ν_E) ≈ F_cr(α, ⌊ν_eq,Xf⌋, ν_E) + (ν_eq,Xf − ⌊ν_eq,Xf⌋)(F_cr(α, ⌈ν_eq,Xf⌉, ν_E) − F_cr(α, ⌊ν_eq,Xf⌋, ν_E))
(4.16)

p_Xf ≈ p(⌊ν_eq,Xf⌋) + (ν_eq,Xf − ⌊ν_eq,Xf⌋)(p(⌈ν_eq,Xf⌉) − p(⌊ν_eq,Xf⌋))
(4.17)
All these calculations can be summarized in the multi-factor ANOVA table presented in Table 8.
Table 8. Multi-factor ANOVA table for balanced or unbalanced normal data using the proposed
correction on the sum of squares presented in Eq. (4.7)
Source of Variation | Sum of Squares | Degrees of Freedom | Mean Squares | F-values | p-values
Factor X1 | SS*_X1 (Eq. 4.7) | ν_X1 (Eq. 4.12) | MS_X1 (Eq. 4.5) | F̂_X1 (Eq. 4.4) | p_X1 (Eq. 4.17)
Factor X2 | SS*_X2 (Eq. 4.7) | ν_X2 (Eq. 4.12) | MS_X2 (Eq. 4.5) | F̂_X2 (Eq. 4.4) | p_X2 (Eq. 4.17)
Factor Xm | SS*_Xm (Eq. 4.7) | ν_Xm (Eq. 4.12) | MS_Xm (Eq. 4.5) | F̂_Xm (Eq. 4.4) | p_Xm (Eq. 4.17)
Residual error | SS_E (Eq. 4.3) | ν_E (Eq. 4.10) | MS_E (Eq. 4.6) | |
Total | SS_T (Eq. 4.9) | ν_T (Eq. 4.11) | |
If any of the factors included in the model represents the interaction between two factors Xf
and Xg, then the corresponding individual sum of squares must be determined from its partial
(Type II) contribution:

SS_(XfXg|rest) = SS_E^(model without XfXg) − SS_E^(model with XfXg)
(4.18)

and the corresponding degrees of freedom are:

ν_XfXg = k_f k_g − 1 − δ_Xf (k_f − 1) − δ_Xg (k_g − 1)
(4.19)

where δ_Xf is a binary variable taking a value of 1 when the main effect of factor Xf is
considered in the model, and a value of 0 when the main effect is not included. This expression
indicates that the interaction term gains the degrees of freedom of the interacting factors
when they are not included in the model.
5. Examples
Let us consider the first example presented by Ståhle and Wold [2] for the yield of a chemical
reaction using three different catalysts (A, B, and C), summarized in Table 9. The corresponding
uncorrected ANOVA table for this data is presented in Table 10.
Table 9. Unbalanced single-factor design: Reaction yield using different catalysts [2]
Yield (%)
A B C
91 85 96
95 88 98
92 87 97
90 86 97
89
Table 10. Uncorrected ANOVA table for an unbalanced single-factor design: Reaction yield using
different catalysts
Source of Variation Sum of Squares Degrees of Freedom Mean Squares F Fcr (5%) P-value
Catalyst 223.0769 2 111.5385 42.8994 4.1028 1.2394E-05
Error 26 10 2.6
Total 249.0769 12
Considering the effect of data unbalance, the equivalent degrees of freedom (Eq. 2.30) for the
catalyst term are:

ν_eq = 3(3 − 1)(4² + 5² + 4²)/(4 + 5 + 4)² = 342/169 ≈ 2.0237
(5.1)
The ANOVA corrected (in the calculation of and ) with the equivalent degrees of freedom
is presented in Table 11. All the calculations using the method proposed in this report were
done using the MS-Excel-based application ForsChem Actinium XL (also available to download
at www.forschem.org).
Table 11. Corrected ANOVA table for an unbalanced single-factor design: Reaction yield using
different catalysts
Source of Variation Sum of Squares Degrees of Freedom Mean Squares F Fcr (5%) P-value
Catalyst 223.0769 2 111.5385 42.8994 4.0935 1.2221E-05
Error 26 10 2.6
Total 249.0769 12
No big differences are observed (1.4% decrease in p-value after the correction), as the
equivalent degrees of freedom are close to the original value of 2 (1.18% increase after the
correction).
The model obtained from this analysis can be expressed as:

Y = 92 x_A + 87 x_B + 97 x_C + 1.6125 ε_III
(5.2)

where x_c is a binary variable representing the type of catalyst employed, ε_III represents a Type III
standard residual error (E(ε_III) = 0, Var(ε_III) = 1) [14], and 1.6125 = √2.6 = √MS_E. Since the
residuals of the model are normally distributed (cf. Figure 1), ε_III can be replaced by a
standard normal random variable z.

Since the catalyst type is a fixed categorical variable, the corresponding binary variables
considered in Eq. (5.2) are related by the following expression:

x_A + x_B + x_C = 1
(5.3)

Thus, Eq. (5.2) can be alternatively expressed as:

Y = 92 − 5 x_B + 5 x_C + 1.6125 z
(5.4)
In addition, let us compare the results obtained when the data is forced to be balanced. Table
12 to Table 16 show the ANOVA tables obtained when each of the observations for catalyst B
are independently removed. Table 17 shows the result obtained when the missing observations
of catalyst A and C are filled with the corresponding averages of the available data.
Table 12. ANOVA table obtained for a forced balanced design of reaction yield using different
catalysts by removing the first observation (85) of catalyst B.
Source of Variation Sum of Squares Degrees of Freedom Mean Squares F Fcr (5%) P-value
Catalyst 180.6667 2 90.3333 38.7143 4.2565 3.7943E-05
Error 21 9 2.3333
Total 201.6667 11
Table 13. ANOVA table obtained for a forced balanced design of reaction yield using different
catalysts by removing the second observation (88) of catalyst B.
Source of Variation Sum of Squares Degrees of Freedom Mean Squares F Fcr (5%) P-value
Catalyst 210.1667 2 105.0833 38.2121 4.2565 3.9992E-05
Error 24.75 9 2.75
Total 234.9167 11
Table 14. ANOVA table obtained for a forced balanced design of reaction yield using different
catalysts by removing the third observation (87) of catalyst B.
Source of Variation Sum of Squares Degrees of Freedom Mean Squares F Fcr (5%) P-value
Catalyst 200 2 100 34.6154 4.2565 5.9415E-05
Error 26 9 2.8889
Total 226 11
Table 15. ANOVA table obtained for a forced balanced design of reaction yield using different
catalysts by removing the fourth observation (86) of catalyst B.
Source of Variation Sum of Squares Degrees of Freedom Mean Squares F Fcr (5%) P-value
Catalyst 190.1667 2 95.0833 34.5758 4.2565 5.9686E-05
Error 24.75 9 2.75
Total 214.9167 11
Table 16. ANOVA table obtained for a forced balanced design of reaction yield using different
catalysts by removing the last observation (89) of catalyst B.
Source of Variation Sum of Squares Degrees of Freedom Mean Squares F Fcr (5%) P-value
Catalyst 220.6667 2 110.3333 47.2857 4.2565 1.6808E-05
Error 21 9 2.3333
Total 241.6667 11
Table 17. ANOVA table obtained for a forced balanced design of reaction yield using different
catalysts by adding average observations (92 for catalyst A and 97 for catalyst C).
Source of Variation Sum of Squares Degrees of Freedom Mean Squares F Fcr (5%) P-value
Catalyst 250 2 125 57.6923 3.8853 6.9885E-07
Error 26 12 2.1667
Total 276 14
Even though the conclusion regarding the statistical significance of the catalyst effect
(assuming a 5% significance level) did not change, important numerical differences are
observed for the different possible results. In particular, the differences in p-values (relative to
the uncorrected analysis) are summarized in Table 18. These results show that data
manipulation can lead to a wide range of possible p-values, potentially compromising the
conclusions.
Table 18. Comparison of -values for the different possible ANOVA tables obtained for the
unbalanced reaction yield data in Table 9.
ANOVA P-value P-value difference Relative difference (%)
Uncorrected 1.2394E-05
Corrected 1.2221E-05 -1.73E-07 -1.4%
Removing data 1 3.7943E-05 2.55E-05 206.1%
Removing data 2 3.9992E-05 2.76E-05 222.7%
Removing data 3 5.9415E-05 4.70E-05 379.4%
Removing data 4 5.9686E-05 4.73E-05 381.6%
Removing data 5 1.6808E-05 4.41E-06 35.6%
Adding data 6.9885E-07 -1.17E-05 -94.4%
Let us now consider four different examples of unbalanced data when two factors are
involved.
Shaw and Mitchell-Olds [13] present a hypothetical example for the growth of plants
considering the initial size of the plant (small or large), and the treatments (removal or not of
neighboring plants). The hypothetical data considered is presented in Table 19.
Using the SS values reported by the authors (considering different SS types), the following
ANOVA tables (Table 20 to Table 22) are obtained:
Table 20. ANOVA table obtained using Type I SS for the plant growth unbalanced design
Source of Variation Sum of Squares Degrees of Freedom Mean Squares F Fcr (5%) P-value
Treatment 35.3 1 35.3 0.330 5.5914 0.5834
Initial Size 4846.0 1 4846.0 45.4 5.5914 0.0003
Interaction 11.4 1 11.4 0.107 5.5914 0.7535
Error 747.7 7 106.8
Total Sum 5640.4 10
Table 21. ANOVA table obtained using Type II SS for the plant growth unbalanced design
Source of Variation Sum of Squares Degrees of Freedom Mean Squares F Fcr (5%) P-value
Treatment 590.2 1 590.2 5.525 5.5914 0.0510
Initial Size 4846.0 1 4846.0 45.4 5.5914 0.0003
Interaction 11.4 1 11.4 0.107 5.5914 0.7535
Error 747.7 7 106.8
Total Sum 6195.3 10
Table 22. ANOVA table obtained using Type III SS for the plant growth unbalanced design
Source of Variation Sum of Squares Degrees of Freedom Mean Squares F Fcr (5%) P-value
Treatment 597.2 1 597.2 5.591 5.5914 0.0500
Initial Size 4807.9 1 4807.9 45.0 5.5914 0.0003
Interaction 11.4 1 11.4 0.107 5.5914 0.7535
Error 747.7 7 106.8
Total Sum 6164.2 10
In the previous tables, the total sums of squares were obtained by applying the law of total
variance to the individual sums of squares (and not directly from the data), resulting in
different values for each case despite using the same data set. In addition, large differences in
the significance of the effect due to the treatment are observed between the different
methods. Using the correction for unbalanced data proposed in this report, the following
ANOVA table is obtained:
Table 23. ANOVA table obtained using the correction proposed in this report for the plant
growth unbalanced design
Source of Variation Sum of Squares Degrees of Freedom Mean Squares F Fcr (5%) P-value
Treatment 18.76 1 18.76 0.1756 5.5844 0.6890
Initial Size 2277.39 1 2277.39 21.3196 5.5844 0.0024
Interaction 2596.65 1 2596.65 24.3083 5.5138 0.0016
Error 747.75 7 106.82
Total 5640.55 10
Several differences can be observed with respect to the previously reported calculations. While
the SS due to the residual error is identical to that reported in the previous tables, the SSs due
to the factors and the interaction are different. Since the factors and the interaction are all
correlated with each other, the uncorrected ANOVA tables assign most of the effect of the
interaction to the individual factors. In order to clarify this, let us consider the model obtained:

(5.5)

where ε_III represents a standard random variable (since in this case it is normal, it can also be
denoted as z).
By looking at the model coefficients, the effect of the treatment appears to be much larger
than the effect of the interaction. However, we must take into account that the binary variable
of the treatment is confounded with the binary variables of the interaction according to the
following expression:

(5.6)

Similarly, the effect of the initial plant size is also confounded according to:

(5.7)

Therefore, the values of the coefficients cannot be used to assess with absolute certainty the
relative effect of the factors and interactions simultaneously.
Considering Eq. (5.6) and (5.7), Eq. (5.5) can be expressed in general as follows:

(5.8)

where the starred coefficients are the "true" values representing the non-interacting effects of
the factors, which are unknown due to confounding. Such confounding makes the ANOVA
analysis uncertain, and thus, assumptions are needed in order to reach any conclusion. By
assigning the confounded contributions entirely to the main effects, a minimal effect of the
interaction is obtained. By assigning them entirely to the interaction terms, the effect of the
interaction is maximal. A more conservative and equilibrated approach might be assuming
middle values (distributing the effects equally between the factors and the interactions),
resulting in:

(5.9)
Eq. (5.9) is not necessarily correct, but it minimizes the risk of error in the values of the model
coefficients. From this equation, it becomes clear that the effect of the interaction can be
larger than both individual effects of the factors, as described by the corrected ANOVA (Table
23). In this case, the interaction becomes significant, whereas all other approaches considered
this effect negligible. Notice that the uncertainty introduced by confounding is present in
both balanced and unbalanced data.
However, in order for the previous analysis to be valid, the best possible model (lowest
standard error) must be used. Since the estimation of the error depends on the available
degrees of freedom, it is always important to test whether or not including individual or
interaction terms results in a better model (i.e. applying the principle of parsimony [15]). By
removing the interaction, the following results are obtained:
Table 24. ANOVA table obtained using the correction proposed in this report for the plant
growth unbalanced design (no interaction considered)
Source of Variation Sum of Squares Degrees of Freedom Mean Squares F Fcr (5%) P-value
Treatment 39.88 1 39.88 0.4202 5.3106 0.5361
Initial Size 4841.51 1 4841.51 51.0198 5.3106 9.7202E-5
Error 759.16 8 94.89
Total 5640.55 10
Since the standard error decreases by removing the interaction (MS_E goes from 106.82 to
94.89), the non-interacting model can be considered a better model††:

(5.10)
Furthermore, by removing the interaction the confounding is also removed. Now, we can
conclude that the effect of the initial size is significant but the effect of the treatment is
negligible.
While the analysis of the model containing the interaction terms was not adequate for this
particular example because it did not provide an optimal model, it clearly illustrates the risk of
confounding in ANOVA when interactions are considered.
††
By further removing the effect of the treatment from the model, the increases to , and thus,
the model considering both main effects can be considered optimal.
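The parsimony check described above can be sketched numerically: fit the model with and without the interaction term and compare the residual standard errors. The 2x2 data below are hypothetical (generated from a purely additive model), not the plant growth data of this example.

```python
# Sketch of the parsimony check: compare residual standard errors of a
# two-factor model with and without the interaction term.
# The data are hypothetical (additive model, no true interaction).
import numpy as np

# Factor levels A, B in {0, 1}, 3 replicates per cell
A = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1], dtype=float)
B = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1], dtype=float)
y = np.array([1.1, 0.9, 1.0, 4.1, 3.9, 4.0,
              3.1, 2.9, 3.0, 6.1, 5.9, 6.0])

def std_error(X, y):
    """Residual standard error s = sqrt(SSE / (n - p)) of a least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ beta) ** 2)
    return np.sqrt(sse / (len(y) - X.shape[1]))

ones = np.ones_like(y)
X_full = np.column_stack([ones, A, B, A * B])   # with interaction
X_add  = np.column_stack([ones, A, B])          # main effects only

s_full, s_add = std_error(X_full, y), std_error(X_add, y)
print(f"s (with interaction) = {s_full:.4f}, s (additive) = {s_add:.4f}")
# Here the interaction explains nothing, so dropping it frees one degree
# of freedom and the additive model has the smaller standard error.
```

Because the true model is additive, the interaction term reduces the residual sum of squares negligibly while consuming a degree of freedom, so the additive model wins the comparison.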
Smith & Cribbie [12] describe a factorial example using simulated data. In this example,
the gambling behavior scores of different participants are tabulated according to gender and
athletic status. The results obtained are presented in Table 25.
The authors report different ANOVA tables using different types of SS, presented in
Tables 26 to 28. On the other hand, the ANOVA table obtained for this set of
unbalanced data using the proposed approach is presented in Table 29.
Table 28. ANOVA table obtained using Type III SS from [12].
Source of Variation | Sum of Squares | Degrees of Freedom | Mean Squares | F | Fcr (5%) | P-value
Gender | 3.897 | 1 | 3.897 | 186.888 | 4.4138 | 6.03E-11
Status | 35.989 | 2 | 17.995 | 862.971 | 3.5546 | 1.33E-18
Interaction | 0.121 | 2 | 0.061 | 2.906 | 3.5546 | 0.081
Error | 0.375 | 18 | 0.021
Total | 40.382 | 23
Table 29. ANOVA table obtained using the correction proposed in this report for the gambling
behavior unbalanced design
Source of Variation | Sum of Squares | Degrees of Freedom | Mean Squares | F | Fcr (5%) | P-value
Gender | 5.1562 | 1 | 5.1562 | 247.2779 | 4.4079 | 5.7959E-12
Status | 20.9676 | 2 | 10.4838 | 502.7757 | 3.5299 | 1.5178E-16
Interaction | 22.9604 | 2 | 11.4802 | 550.5611 | 3.4888 | 6.1180E-17
Error | 0.3753 | 18 | 0.02085
Total | 49.4596 | 23
The model obtained for the ANOVA presented in Table 29 is the following:
(5.11)
where represents a standard normal random variable.
Once again, the coefficient values in the model obtained do not necessarily reflect the factor
effects, since confounding between the individual terms and the interaction is present. However,
we can conclude that the conventional analyses consider the effect of the interaction
non-significant for all three types of SS, whereas the method proposed in this report considers all
terms significant. Furthermore, the proposed method considers the effect of the interaction to
be even greater than the effect of the athletic status (considering the correction for
unbalanced data). Let us recall that the proposed analysis represents a middle-point
distribution of the confounded effects, which minimizes the risk of error, since the available
data do not allow concluding whether the observed effect is caused entirely by the athletic
status, entirely by the interaction, or how the effects are actually distributed.
In this example, removing the interaction results in a model with a larger standard error. Thus,
model (5.11) can be considered optimal, and in this case confounding is inevitable.
The next case was obtained from data presented by Lewsey et al. [16]. They considered a
general factorial experimental design with possible unbalanced sets of data, assuming a
different number of replicates ( or ) for each treatment. All data were obtained
from a non-interacting model with normal error. The set of unbalanced data for the current
example, presented in Table 30, was randomly chosen among the 726 possibilities.
(5.12)
The incorporation of interaction terms increases the standard error of the model.
The corresponding ANOVA table using the correction for unbalanced data is presented in Table
31. The results obtained indicate that both factors are significant, and the interaction is not
even considered in the model, which is consistent with the nature of the original model.
Table 30. Unbalanced factorial design: Random example from data in [16]
Factor A | Factor B | Response
1 | 1 | 7.626
1 | 1 | 10.424
1 | 2 | 5.878
1 | 2 | 6.878
1 | 2 | 3.024
1 | 3 | 3.786
1 | 3 | 6.614
2 | 1 | 6.04
2 | 2 | 2.878
2 | 2 | 3.878
2 | 2 | 0.024
2 | 3 | 0.786
2 | 3 | 3.614
Table 31. ANOVA table obtained using the correction proposed in this report for an unbalanced
design from Lewsey et al. [16]
Source of Variation | Sum of Squares | Degrees of Freedom | Mean Squares | F | Fcr (5%) | P-value
Factor A | 33.6862 | 1 | 33.6862 | 10.8621 | 5.1123 | 0.00926
Factor B | 37.3127 | 2 | 18.6567 | 6.0158 | 4.1912 | 0.02089
Error | 27.9112 | 9 | 3.1012
Total | 98.9101 | 12
As a final example, let us consider the set of data presented in Table 32, again for an
unbalanced factorial design. The data in this case were randomly generated using the
following model:
(5.13)
where is a standard normal random variable.
Model (5.13) does not contain main effects, only interaction terms. However, the best model
obtained after analyzing the data in Table 32 is:
(5.14)
Table 32. Unbalanced factorial design: Random data (rounded to one decimal place)
obtained from model (5.13)
Factor A | Factor B | Response Y
2 | 2 | 4.4
1 | 2 | 7.7
2 | 1 | 5.7
2 | 2 | 2.7
2 | 1 | 3.6
2 | 3 | 10.4
2 | 1 | 4.7
2 | 3 | 7.8
1 | 3 | 8.7
1 | 2 | 5.6
1 | 1 | 7.2
1 | 2 | 5.6
2 | 2 | 2.3
Eq. (5.14) can alternatively be expressed (considering only interacting effects) as:
(5.15)
Of course, the original model (5.13) is not exactly recovered, due to random errors during
sampling. However, all the coefficients have errors of less than one unit.
The SS Type I ANOVA table for this data is presented in Table 33, whereas Table 34 shows the
ANOVA table with the correction proposed in this report.
Table 33. ANOVA table obtained using Type I SS for the unbalanced data presented in Table 32.
Source of Variation | Sum of Squares | Degrees of Freedom | Mean Squares | F | Fcr (5%) | P-value
Factor A | 9.531 | 1 | 9.531 | 6.0579 | 5.5914 | 0.04338
Factor B | 42.487 | 2 | 21.244 | 13.5023 | 4.7374 | 0.00396
Interaction | 5.991 | 2 | 2.996 | 1.9040 | 4.7374 | 0.21864
Error | 11.013 | 7 | 1.573
Total | 69.023 | 12
Table 34. ANOVA table obtained using the correction proposed in this report for the
unbalanced data presented in Table 32.
Source of Variation | Sum of Squares | Degrees of Freedom | Mean Squares | F | Fcr (5%) | P-value
Factor A | 5.236 | 1 | 5.236 | 3.3282 | 5.5460 | 0.11009
Factor B | 20.903 | 2 | 10.452 | 6.6430 | 4.6727 | 0.02322
Interaction | 31.870 | 2 | 15.935 | 10.1282 | 4.6034 | 0.00774
Error | 11.013 | 7 | 1.573
Total | 69.023 | 12
According to the Type I SS ANOVA (Table 33), the interaction is not significant, but it cannot be
removed from the model (since the standard error would increase). On the other hand, the
corrected ANOVA (Table 34) identifies the interaction as significant and as the most important
effect in the model; Factor B is also found significant, while Factor A is not.
The presence of confounding between main effects and interactions also seems to support the
opinion that testing main effects in the presence of interactions is an “uninteresting
hypothesis” [17], or an “exercise in fatuity” [18]. For that reason, since the proposed analysis
provides a middle point between main effects and interactions, it can be used to reduce the
risk of reaching an erroneous conclusion, if such an analysis needs to be done.
6. Conclusion
Data unbalance introduces unwanted correlations between the different effects in a model.
This lack of orthogonality compromises the validity of conventional ANOVA calculations. Also,
in the case of multi-factor analyses, if the law of total variance is preserved during ANOVA, the
results become dependent on the particular sequence selected for the terms in the model
(type I SS). Thus, a novel general proposal for calculating ANOVA considering either balanced
or unbalanced data is presented in this report. In the case of balanced data, the calculations
coincide with the conventional ANOVA method. For unbalanced data, two important
differences emerge:
- The degrees of freedom of each factor are “corrected” using the idea of equivalent
degrees of freedom (Eq. 2.30), which takes into account the effect of unbalance from
the relative number of experiments at each level of a factor. These equivalent degrees
of freedom are then used to test the ANOVA hypotheses (Eq. 2.18).
- The sums of squares of each factor (or interaction) are “corrected” considering their
relative effect on the sum of squares due to the model (Eq. 4.7). This correction makes
the sequence of terms in the model irrelevant, while preserving the law of total
variance.
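The sequence dependence of Type I SS mentioned above is easy to reproduce numerically. The sketch below uses a small hypothetical unbalanced 2x2 layout with noise-free responses y = 2A + 3B; the sum of squares attributed to Factor A changes sharply depending on whether A is entered before or after B.

```python
# Demonstration of Type I SS order dependence on unbalanced data.
# Hypothetical layout: cell counts 3, 1, 1, 3; noise-free y = 2*A + 3*B.
import numpy as np

A = np.array([0., 0., 0., 0., 1., 1., 1., 1.])
B = np.array([0., 0., 0., 1., 0., 1., 1., 1.])
y = 2 * A + 3 * B

def sse(X, y):
    """Residual sum of squares of a least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

ones = np.ones_like(y)
sst = sse(ones.reshape(-1, 1), y)

# Type I SS for A: entered first vs. entered after B
ss_a_first = sst - sse(np.column_stack([ones, A]), y)
ss_a_last  = sse(np.column_stack([ones, B]), y) - sse(np.column_stack([ones, B, A]), y)

print(f"SS_A entered first = {ss_a_first:.2f}, SS_A adjusted for B = {ss_a_last:.2f}")
# The unbalanced layout makes the dummy variables correlated, so the two
# values differ (24.5 vs 6.0); in a balanced design they would coincide.
```

This is precisely the non-orthogonality that the proposed correction is meant to compensate: a decomposition that depends on an arbitrary term ordering cannot by itself assign unambiguous sums of squares to the factors.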
When interactions are considered, confounding between main effects and interaction terms is
inevitable. While analyzing the significance of factors in the presence of interactions is quite
uncertain due to confounding, the proposed approach provides a balanced (intermediate)
estimation of all effects, reducing the risk of reaching a wrong conclusion. In any case, the
conclusions obtained under those conditions must always be considered provisional and
uncertain.
Even though ANOVA might be replaced by linear regression analysis [19], variance
decomposition remains a widely used tool in science and engineering. Furthermore,
understanding its basic principles can help improve the validity and capability of the
method, including its use with unbalanced data.
Acknowledgments
This research did not receive any specific grant from funding agencies in the public,
commercial, or not-for-profit sectors.
References
[1] Hinkelmann, K., & Kempthorne, O. (2008). Design and Analysis of Experiments. Volume 1:
Introduction to Experimental Design. 2nd Edition. John Wiley & Sons, Inc., New Jersey.
[2] Ståhle, L., & Wold, S. (1989). Analysis of Variance (ANOVA). Chemometrics and Intelligent
Laboratory Systems, 6, 259-272.
[4] Blitzstein, J. K., & Hwang, J. (2019). Introduction to probability. 2nd Edition. CRC Press/Taylor
& Francis Group, Boca Raton. Chapter 9.
[5] Bowsher, C. G., & Swain, P. S. (2012). Identifying sources of variation and the flow of
information in biochemical networks. Proceedings of the National Academy of Sciences,
109(20), E1320-E1328.
[6] Montgomery, D. C. (2017). Design and Analysis of Experiments. 9th Edition. John Wiley &
Sons, Inc., Hoboken. Section 15.2.
[9] Herr, D. G. (1986). On the history of ANOVA in unbalanced, factorial designs: The first 30
years. The American Statistician, 40(4), 265-270.
[10] Littell, R. C., Stroup, W. W., & Freund, R. J. (2002). SAS® for Linear Models. 4th Edition. SAS
Institute Inc., Cary. pp. 14-17.
[11] Langsrud, Ø. (2003). ANOVA for unbalanced data: Use Type II instead of Type III sums of
squares. Statistics and Computing, 13(2), 163-167.
[12] Smith, C. E., & Cribbie, R. (2014). Factorial ANOVA with unbalanced data: a fresh look at the
types of sums of squares. Journal of Data Science, 12, 385-404.
[13] Shaw, R. G., & Mitchell-Olds, T. (1993). ANOVA for unbalanced data: an overview. Ecology,
74(6), 1638-1645.
[15] Sober, E. (1981). The principle of Parsimony. The British Journal for the Philosophy of
Science, 32(2), 145-156.
[16] Lewsey, J. D., Gardiner, W. P., & Gettinby, G. (2001). A study of type II and type III power for
testing hypotheses from unbalanced factorial designs. Communications in Statistics-Simulation
and Computation, 30(3), 597-609.
[17] Nelder, J. A. (1994). The statistics of linear models: back to basics. Statistics and
Computing, 4(4), 221-234.
[18] Kempthorne, O. (1975). Fixed and mixed models in the analysis of variance. Biometrics,
473-486.
[19] Gelman, A. (2005). Analysis of variance—why it is more important than ever. The Annals of
Statistics, 33(1), 1-53.