Copyright
© Attribution Non-Commercial (BY-NC)
Vol. 6, 2021-01

Variance Decomposition in Unbalanced Data

Hugo Hernandez
ForsChem Research, 050030 Medellin, Colombia
[email protected]

doi: 10.13140/RG.2.2.16789.35043

Abstract

In this report, the basic principles of Variance Decomposition (ANOVA) are explained. These
principles are then used to understand the implications of analyzing unbalanced data, and to
propose some corrections in the calculations. For instance, by determining the equivalent
degrees of freedom of each factor, it is possible to mitigate the effect of non-orthogonality of
unbalanced data on the results. In addition, unwanted correlation effects between factors in
multi-factor analyses can also be compensated with correction factors in the calculation of the
sums of squares, while at the same time the law of total variance can be preserved. Different
examples of unbalanced data reported in the literature were presented in order to illustrate
the use of the proposed analysis, and to highlight the differences in the results obtained with
respect to conventional calculation methods. In addition, unbalanced simulated data obtained
from a pre-defined model were used to evidence the effect of confounding between main
effects and interactions, and to show the benefit of the proposed approach to reduce the risk
of reaching erroneous conclusions in these situations. The proposed ANOVA calculations were
implemented in MS-Excel (ForsChem Actinium XL).

Keywords

Adam’s Law, ANOVA, Eve’s Law, Factorial, Hypothesis Testing, Interactions, Model, Statistical
Significance, Sum of Squares, Unbalanced Designs, Variance

1. Introduction

Variance decomposition (usually referred to as Analysis of Variance, or ANOVA) is a statistical
technique widely used for analyzing the relative effect of different factors from experimental
data [1,2]. The theoretical foundation of ANOVA is the Law of Total Variance [3], also known as
the Variance Decomposition Formula or Eve’s Law [4], which for a single factor can be expressed
as:

27/01/2021 ForsChem Research Reports Vol. 6, 2021-01 (1 / 35)


www.forschem.org
Variance Decomposition
in Unbalanced Data
Hugo Hernandez
ForsChem Research
[email protected]

$$\mathrm{Var}(Y) = E(\mathrm{Var}(Y|X)) + \mathrm{Var}(E(Y|X)) \tag{1.1}$$

where $X$ and $Y$ are random variables and $Y|X$ represents the conditional event when $X$ is
known. In other words, if we consider that $X$ describes different groups of data, then the
term $E(\mathrm{Var}(Y|X))$ represents the within-group variation (average variation of $Y$ within each
group of $X$) and $\mathrm{Var}(E(Y|X))$ represents the between-group variation (variance of group
averages). Notice that the within-group variation is related to the residual error (variation not
explained by $X$) or experimental noise.

The Law of Total Variance results from the definition of the variance operator and the
application of Adam’s Law [4]:

$$E(Y) = E(E(Y|X)) \tag{1.2}$$

simply stating that the overall average is the weighted average of group averages.
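
Both laws can be checked numerically on a small unbalanced data set. The following Python sketch is only an illustration: the three groups and their values are hypothetical (they do not come from this report), and variances are computed with the number of observations in the denominator, so that group weights are relative frequencies.

```python
from statistics import fmean

# Hypothetical unbalanced data: three groups of different sizes (illustrative only).
groups = {
    "g1": [4.0, 6.0],
    "g2": [1.0, 2.0, 3.0],
    "g3": [10.0],
}

def pvar(values, center):
    """Population variance of `values` around `center` (denominator n)."""
    return sum((v - center) ** 2 for v in values) / len(values)

all_values = [v for ys in groups.values() for v in ys]
n_total = len(all_values)
grand_mean = fmean(all_values)                                   # E(Y)

group_means = {g: fmean(ys) for g, ys in groups.items()}         # E(Y|X)
weights = {g: len(ys) / n_total for g, ys in groups.items()}     # relative group frequencies

# Adam's Law: the overall mean is the weighted average of the group means.
adam = sum(weights[g] * group_means[g] for g in groups)

# Eve's Law: total variance = within-group part + between-group part.
total_var = pvar(all_values, grand_mean)
within = sum(weights[g] * pvar(groups[g], group_means[g]) for g in groups)       # E(Var(Y|X))
between = sum(weights[g] * (group_means[g] - grand_mean) ** 2 for g in groups)   # Var(E(Y|X))

print(adam, grand_mean)
print(total_var, within + between)
```

Each printed pair agrees up to floating-point rounding, even though the groups have different sizes.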

The one-way (univariate) ANOVA [2] basically represents a statistical test of the following
hypotheses:

$$\begin{aligned}
H_0 &: \mathrm{Var}(E(Y|X)) \le E(\mathrm{Var}(Y|X)) \\
H_1 &: \mathrm{Var}(E(Y|X)) > E(\mathrm{Var}(Y|X))
\end{aligned} \tag{1.3}$$

Only when the between-group variation is significantly larger than the within-group variation
can we safely conclude that the effect of factor $X$ on the response variable $Y$ is statistically
significant.

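When two or more factors ($X_1, X_2, \ldots, X_n$) are involved, the complexity of Adam and Eve’s
Laws increases. In this case, Adam’s Law becomes:

$$E(Y) = E(E(Y|X_1, X_2, \ldots, X_n)) \tag{1.4}$$

and Eve’s Law:

$$\mathrm{Var}(Y) = E(\mathrm{Var}(Y|X_1, \ldots, X_n)) + \mathrm{Var}(E(Y|X_1, \ldots, X_n)) \tag{1.5}$$

which can be equivalently expressed as [5]:

$$\begin{aligned}
\mathrm{Var}(Y) = {} & E(\mathrm{Var}(Y|X_1, \ldots, X_n)) + \mathrm{Var}(E(Y|X_1)) + E(\mathrm{Var}(E(Y|X_1, X_2)|X_1)) \\
& + E(\mathrm{Var}(E(Y|X_1, X_2, X_3)|X_1, X_2)) + \cdots \\
& + E(\mathrm{Var}(E(Y|X_1, \ldots, X_n)|X_1, \ldots, X_{n-1}))
\end{aligned} \tag{1.6}$$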

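The result presented in Eq. (1.6) is obtained considering that (from Eq. 1.1 and 1.4):

$$\begin{aligned}
\mathrm{Var}(E(Y|X_1, \ldots, X_n)) &= E(\mathrm{Var}(E(Y|X_1, \ldots, X_n)|X_1)) + \mathrm{Var}(E(E(Y|X_1, \ldots, X_n)|X_1)) \\
&= E(\mathrm{Var}(E(Y|X_1, \ldots, X_n)|X_1)) + \mathrm{Var}(E(Y|X_1))
\end{aligned} \tag{1.7}$$

along with the more general rule:

$$\begin{aligned}
E(\mathrm{Var}(E(Y|X_1, \ldots, X_n)|X_1, \ldots, X_k)) = {} & E(\mathrm{Var}(E(Y|X_1, \ldots, X_n)|X_1, \ldots, X_{k+1})) \\
& + E(\mathrm{Var}(E(Y|X_1, \ldots, X_{k+1})|X_1, \ldots, X_k))
\end{aligned} \tag{1.8}$$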

The first term in the right-hand side of Eq. (1.6) again represents the residual error, the variation
in $Y$ not explained by any of the factors considered. The second term represents the variation
explained by $X_1$, the third term represents the remaining variation explained by $X_2$, and so on.
The last term of the right-hand side of Eq. (1.6) represents the remaining variation in $Y$
explained by $X_n$.

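When the factors are independent from each other, the remaining variation explained by a
given factor corresponds to the total variation explained by the factor, and the Law of Total
Variance greatly simplifies into:

$$\mathrm{Var}(Y) = E(\mathrm{Var}(Y|X_1, \ldots, X_n)) + \mathrm{Var}(E(Y|X_1)) + \mathrm{Var}(E(Y|X_2)) + \cdots + \mathrm{Var}(E(Y|X_n)) \tag{1.9}$$

Eq. (1.9) is the basis of the famous ANOVA table, and the corresponding hypotheses to be
tested become:

$$\begin{aligned}
H_0 &: \mathrm{Var}(E(Y|X_k)) \le E(\mathrm{Var}(Y|X_1, \ldots, X_n)) \\
H_1 &: \mathrm{Var}(E(Y|X_k)) > E(\mathrm{Var}(Y|X_1, \ldots, X_n))
\end{aligned} \qquad k = 1, \ldots, n \tag{1.10}$$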

Such independence between factors can be easily achieved by means of balanced and
orthogonal designs of experiments, which are basic principles of classical experimental design
[1]. However, when these criteria are not met, not only the complexity of the calculations
increases but one must also be careful with the interpretation of the results [6].

Unbalanced data is commonly found in observational studies, where the number of
observations in each group is not necessarily defined by the analyst. Even when carrying out
planned experimental designs, unbalanced data may result from missing or failed
experiments, or from unplanned repetitions of experiments. Unbalanced data can be balanced
simply by randomly removing the results in excess, or by “fabricating” missing data from all other
results. The latter approach is not recommended, because it lies on the verge of data
falsification (even though statistical techniques are used). On the other hand, while randomly
removing data seems a better approach, discarding valid data is always a loss of valuable
information for an analysis (as well as a loss of degrees of freedom for improving the estimation
of the residual error). Thus, it would be desirable to directly analyze unbalanced data without
any sort of manipulation.

2. One-Way ANOVA

In order to better understand the problem considered in this report, let us begin by explaining the
simplest case of variance decomposition: the one-way ANOVA [2].

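Let us consider a set of single-factor experimental data containing $n_j$ different observations of
$Y$ for each $j$-th group ($X = x_j$) of the $m$ groups available in $X$, and let us denote each
observation as $y_{ij}$, where $i = 1, \ldots, n_j$ and $j = 1, \ldots, m$. Then we have:

$$E(Y) = \frac{\sum_{j=1}^{m} \sum_{i=1}^{n_j} y_{ij}}{\sum_{j=1}^{m} n_j} \tag{2.1}$$

$$\mathrm{Var}(Y) = \frac{\sum_{j} \sum_{i} (y_{ij} - E(Y))^2}{\sum_{j} n_j} \tag{2.2}$$

$$E(Y|X = x_j) = \frac{\sum_{i} y_{ij}}{n_j} \tag{2.3}$$

$$\mathrm{Var}(Y|X = x_j) = \frac{\sum_{i} (y_{ij} - E(Y|X = x_j))^2}{n_j} \tag{2.4}$$

Notice that (using Eq. 2.3 and 2.1):

$$E(E(Y|X)) = \frac{\sum_{j} n_j E(Y|X = x_j)}{\sum_{j} n_j} = \frac{\sum_{j} \sum_{i} y_{ij}}{\sum_{j} n_j} = E(Y) \tag{2.5}$$

corresponds to Adam’s Law for a single factor.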

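On the other hand, the within-group variation is (using Eq. 2.4 and 2.3):

$$E(\mathrm{Var}(Y|X)) = \frac{\sum_{j} n_j \mathrm{Var}(Y|X = x_j)}{\sum_{j} n_j} = \frac{\sum_{j} \sum_{i} (y_{ij} - E(Y|X = x_j))^2}{\sum_{j} n_j} \tag{2.6}$$

and the between-group variation is (using Eq. 2.3 and 2.5):

$$\mathrm{Var}(E(Y|X)) = \frac{\sum_{j} n_j (E(Y|X = x_j) - E(E(Y|X)))^2}{\sum_{j} n_j} = \frac{\sum_{j} n_j (E(Y|X = x_j) - E(Y))^2}{\sum_{j} n_j} \tag{2.7}$$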

where the variance is evaluated taking into account the relative frequency of each group.

Thus, Eve’s Law (Eq. 1.1) can be expressed as follows:

$$\frac{\sum_{j} \sum_{i} (y_{ij} - E(Y))^2}{\sum_{j} n_j} = \frac{\sum_{j} \sum_{i} (y_{ij} - E(Y|X = x_j))^2}{\sum_{j} n_j} + \frac{\sum_{j} n_j (E(Y|X = x_j) - E(Y))^2}{\sum_{j} n_j} \tag{2.8}$$

Notice that there is a common denominator $\sum_j n_j$ for all terms in Eq. (2.8). An equivalent
equality can be obtained:

$$\sum_{j} \sum_{i} (y_{ij} - E(Y))^2 = \sum_{j} \sum_{i} (y_{ij} - E(Y|X = x_j))^2 + \sum_{j} n_j (E(Y|X = x_j) - E(Y))^2 \tag{2.9}$$

representing the decomposition of the sum of squares, where the left-hand term is denoted as
the total sum of squares ($SS_T$), the first term at the right-hand side is denoted as the sum of
squares due to error ($SS_E$), and the last term is the sum of squares due to factor $X$ ($SS_X$).

Similarly, the hypotheses to be tested can be expressed as:

(2.10)

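Let us now consider each term independently. The total sum of squares consists of a sum of
$\sum_j n_j$ squared terms. Assuming the response data ($Y$) to behave normally with a mean value
$E(Y)$ and standard deviation $\sigma_Y$, we can say that the total sum of squares behaves proportional
to a $\chi^2$ random variable with $\sum_j n_j - 1$ degrees of freedom [7]:

$$SS_T = \sum_{j} \sum_{i} (y_{ij} - E(Y))^2 = \sigma_Y^2 \sum_{j} \sum_{i} z_{ij}^2 \sim \sigma_Y^2 \, \chi^2\Big(\sum_{j} n_j - 1\Big) \tag{2.11}$$

where $z_{ij}$ represent realizations of a standard normal random variable ($Z$). One degree of
freedom is deducted for $SS_T$ because the last value of $z_{ij}$ must necessarily guarantee that
$E(Z) = 0$. In other words:

$$\frac{\sum_{j} \sum_{i} z_{ij}}{\sum_{j} n_j} = 0 \tag{2.12}$$

Similarly, assuming a normal behavior of the residual error (with standard deviation $\sigma_\varepsilon$):

$$SS_E = \sum_{j} \sum_{i} (y_{ij} - E(Y|X = x_j))^2 = \sigma_\varepsilon^2 \sum_{j} \sum_{i} z_{ij}^2 \sim \sigma_\varepsilon^2 \, \chi^2\Big(\sum_{j} n_j - m\Big) \tag{2.13}$$

In this case, $m$ degrees of freedom are subtracted from $\sum_j n_j$, one for each of the following
conditions:

$$\sum_{i=1}^{n_j} z_{ij} = 0, \qquad j = 1, \ldots, m \tag{2.14}$$

Finally, assuming a normal behavior of the group averages (with standard deviation $\sigma_X$):

$$SS_X = \sum_{j} n_j (E(Y|X = x_j) - E(Y))^2 = \sigma_X^2 \sum_{j} n_j z_j^2 \tag{2.15}$$

If the data is balanced, that is, if $n_j = n$ for $j = 1, \ldots, m$, then:

$$SS_X = n \sigma_X^2 \sum_{j} z_j^2 \tag{2.16}$$

proportional to a $\chi^2$ random variable with $\nu_X$ degrees of freedom, where:

$$\nu_X = m - 1 \tag{2.17}$$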

Thus, considering balanced, normal data, the following hypotheses for an $F$ test (ratio of two
$\chi^2$ distributions) are formulated:

(2.18)
where

( )
( )
(2.19)
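and the estimation of $F$ is given by:

$$\hat{F} = \frac{MS_X}{MS_E} \tag{2.20}$$

with

$$MS_X = \frac{SS_X}{m - 1} \tag{2.21}$$

$$MS_E = \frac{SS_E}{m(n - 1)} \tag{2.22}$$

These calculations are summarized in the ANOVA Table presented in Table 1.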

Table 1. One-way ANOVA table for balanced normal data

Source of Variation   Sum of Squares       Degrees of Freedom   Mean Squares         F
Between-group         $SS_X$ (Eq. 2.15)    $m - 1$              $MS_X$ (Eq. 2.21)    $\hat{F}$ (Eq. 2.20)
Within-group          $SS_E$ (Eq. 2.13)    $m(n - 1)$           $MS_E$ (Eq. 2.22)
Total                 $SS_T$ (Eq. 2.11)    $mn - 1$

The estimated $F$-value ($\hat{F}$) can then be compared with the critical $F$-value of a right-tail
$F$ distribution considering a significance level $\alpha$ ($F_{cr}(\alpha; m - 1, m(n - 1))$), rejecting the null
hypothesis ($H_0$) in Eq. (2.18) when:

$$\hat{F} > F_{cr}(\alpha; m - 1, m(n - 1)) \tag{2.23}$$

Alternatively, a $p$-value of the right-tail $F$ distribution can be calculated for $\hat{F}$, rejecting the null
hypothesis when:

$$p(\hat{F}; m - 1, m(n - 1)) < \alpha \tag{2.24}$$

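If, on the other hand, the data is unbalanced, the one-way ANOVA table presented before is no
longer formally valid, since:

$$SS_X = \sigma_X^2 \sum_{j} n_j z_j^2 \tag{2.25}$$

However, Eq. (2.25) can be alternatively expressed as follows:

$$SS_X = m \bar{n} \sigma_X^2 \sum_{j} w_j z_j^2 \tag{2.26}$$

where $\bar{n} = \sum_j n_j / m$ is the average number of elements per group, $w_j = n_j / \sum_j n_j$ is a relative
weight of a weighted sum of squares of $z_j$, and $\sum_j w_j = 1$.

Let us now compare the properties of the balanced $SS_X$ (Eq. 2.16) and unbalanced $SS_X$ (Eq. 2.26):

$$\mathrm{Var}(SS_X) = 2 n^2 \sigma_X^4 (m - 1) \tag{2.27}$$

$$\mathrm{Var}(SS_X) = 2 \bar{n}^2 \sigma_X^4 \, m (m - 1) \sum_{j} w_j^2 \tag{2.28}$$

Now, since the variance of a $\chi^2$ distribution with $\nu$ degrees of freedom is:

$$\mathrm{Var}(\chi^2_\nu) = 2\nu \tag{2.29}$$

then the variance of the unbalanced $SS_X$ can be considered equivalent to the variance of a
$\chi^2$ distribution with equivalent degrees of freedom:

$$\nu_{eq} = m (m - 1) \sum_{j} w_j^2 \tag{2.30}$$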

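Thus, by correcting only the degrees of freedom of the between-group variation in the
determination of the critical $F$-value (or in the calculation of the $p$-value), a more general test of
statistical significance is obtained, which is valid for both balanced and unbalanced data. The
ANOVA table presented in Table 1 remains unchanged, but now the criteria for rejecting the
null hypothesis become:

$$\hat{F} > F_{cr}(\alpha; \nu_{eq}, m(\bar{n} - 1)) \tag{2.31}$$

$$p(\hat{F}; \nu_{eq}, m(\bar{n} - 1)) < \alpha \tag{2.32}$$

Considering that $F_{cr}$ and $p$ are calculated for integer degrees of freedom, the equivalent
$F_{cr}$ and $p$ values can be determined by interpolation as follows:

$$\begin{aligned}
F_{cr}(\alpha; \nu_{eq}, m(\bar{n} - 1)) = {} & F_{cr}(\alpha; \lfloor \nu_{eq} \rfloor, m(\bar{n} - 1)) + (\nu_{eq} - \lfloor \nu_{eq} \rfloor) \\
& \times \big( F_{cr}(\alpha; \lceil \nu_{eq} \rceil, m(\bar{n} - 1)) - F_{cr}(\alpha; \lfloor \nu_{eq} \rfloor, m(\bar{n} - 1)) \big)
\end{aligned} \tag{2.33}$$

$$\begin{aligned}
p(\hat{F}; \nu_{eq}, m(\bar{n} - 1)) = {} & p(\hat{F}; \lfloor \nu_{eq} \rfloor, m(\bar{n} - 1)) + (\nu_{eq} - \lfloor \nu_{eq} \rfloor) \\
& \times \big( p(\hat{F}; \lceil \nu_{eq} \rceil, m(\bar{n} - 1)) - p(\hat{F}; \lfloor \nu_{eq} \rfloor, m(\bar{n} - 1)) \big)
\end{aligned} \tag{2.34}$$

where $\lfloor \cdot \rfloor$ and $\lceil \cdot \rceil$ represent the floor and ceiling rounding operators.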
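
The interpolation over the degrees of freedom can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation used in the report: the function name `interpolate_over_dof` is hypothetical, the equivalent degrees of freedom (342/169) are those of the catalyst example of Section 5.1, the value 4.1028 for $F(2, 10)$ at 5% is the one reported in Table 10, and 3.7083 for $F(3, 10)$ at 5% is taken from standard $F$ tables.

```python
import math

def interpolate_over_dof(value_at, nu):
    """Linearly interpolate a tabulated quantity between floor(nu) and ceil(nu).

    `value_at` maps an integer number of degrees of freedom to the tabulated
    value (a critical F-value or a p-value); `nu` is the equivalent, possibly
    non-integer, number of degrees of freedom.
    """
    lo, hi = math.floor(nu), math.ceil(nu)
    if lo == hi:                      # nu is already an integer
        return value_at[lo]
    return value_at[lo] + (nu - lo) * (value_at[hi] - value_at[lo])

# 5% critical values of F(nu, 10) for nu = 2 and nu = 3 (see the lead-in for sources).
f_crit_table = {2: 4.1028, 3: 3.7083}

nu_eq = 342 / 169                     # equivalent degrees of freedom, about 2.0237
f_crit = interpolate_over_dof(f_crit_table, nu_eq)
print(round(f_crit, 4))
```

The interpolated critical value lies slightly below the tabulated value for 2 degrees of freedom, consistent with the corrected critical value reported in Table 11.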

According to Eq. (2.30), the minimum equivalent degrees of freedom will be:

$$\nu_{eq,min} = m - 1 \tag{2.35}$$

obtained when all groups have the same size, and the maximum equivalent degrees of
freedom will be:

$$\nu_{eq,max} = m (m - 1) \tag{2.36}$$

obtained when one of the groups is extremely large compared to all others.
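
These bounds can be illustrated numerically. The sketch below assumes that the equivalent degrees of freedom of Eq. (2.30) take the form $m(m-1)\sum_j w_j^2$ with $w_j = n_j / \sum_j n_j$, which reproduces both limits stated above; the function name `equivalent_dof` and the balanced and lopsided group sizes are hypothetical, while the sizes (4, 5, 4) are those of the catalyst data in Table 9.

```python
def equivalent_dof(group_sizes):
    """Equivalent degrees of freedom m(m-1)*sum(w_j^2), with w_j = n_j / sum(n_j)."""
    m = len(group_sizes)
    total = sum(group_sizes)
    return m * (m - 1) * sum((n / total) ** 2 for n in group_sizes)

balanced = equivalent_dof([5, 5, 5])        # all groups equal: m - 1
catalyst = equivalent_dof([4, 5, 4])        # group sizes of Table 9
lopsided = equivalent_dof([10**6, 1, 1])    # one dominant group: close to m(m - 1)

print(balanced, round(catalyst, 4), round(lopsided, 4))
```

For three groups the result moves from 2 (balanced) towards 6 as one group dominates, and a mild unbalance such as (4, 5, 4) barely changes it.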

Finally, let us notice that $SS_E$ can be interpreted as the sum of squared residuals between the
experimental observations $y_{ij}$ and the corresponding prediction $\hat{y}_{ij}$ given by the linear model:

$$\hat{y}_{ij} = E(Y) + \alpha_X(j) \tag{2.37}$$

where $\alpha_X(j)$ is the effect of factor $X$ when it is at level $j$, and is given by:

$$\alpha_X(j) = E(Y|X = x_j) - E(Y) = \frac{\sum_{i} y_{ij}}{n_j} - \frac{\sum_{j} \sum_{i} y_{ij}}{\sum_{j} n_j} \tag{2.38}$$

resulting in:

$$SS_E = \sum_{j} \sum_{i} (y_{ij} - \hat{y}_{ij})^2 \tag{2.39}$$

Thus, the sum of squares due to factor $X$ can be simply expressed as the variation not
considered in the residuals of model (2.37):

$$SS_X = SS_T - SS_E \tag{2.40}$$

3. Two-Way ANOVA

In the case of multi-factor analysis, data unbalance causes an additional effect. Let us consider
the simplest multi-factor case: two independent factors. Assuming a balanced and orthogonal
design (e.g. full factorial design), Eve’s law for two independent factors becomes (from Eq. 1.9):

$$\mathrm{Var}(Y) = E(\mathrm{Var}(Y|X_1, X_2)) + \mathrm{Var}(E(Y|X_1)) + \mathrm{Var}(E(Y|X_2)) \tag{3.1}$$

Following the ideas presented in Section 2, Eq. (3.1) can be equivalently expressed as the
following sum of squares:

$$SS_T = SS_E + SS_{X_1} + SS_{X_2} \tag{3.2}$$

In this case,

$$SS_T = \sum_{i} \sum_{j} \sum_{k} (y_{ijk} - E(Y))^2 \tag{3.3}$$

where $i$ represents one of the levels in factor $X_1$, $j$ represents one of the levels in factor
$X_2$, and $k$ is one of the replicates of the design.

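The residuals represented by $SS_E$ consider the variation not explained by the effects of $X_1$ and
$X_2$, as described by the following linear model§:

$$\hat{y}_{ijk}^{(X_1, X_2)} = E(Y) + \alpha_{X_1}(i) + \alpha_{X_2}(j) \tag{3.4}$$

where the superscripts indicate the terms considered in the model, and the effects are given
by:

$$\alpha_{X_l}(q) = E(Y|X_l = x_{l,q}) - E(Y) \tag{3.5}$$

with $x_{l,q}$ denoting level $q$ of factor $X_l$ ($q = i$ for $X_1$ and $q = j$ for $X_2$).
Combining Eq. (3.4) and (3.5) results in:

$$\hat{y}_{ijk}^{(X_1, X_2)} = E(Y|X_1 = x_{1,i}) + E(Y|X_2 = x_{2,j}) - E(Y) \tag{3.6}$$

Then, the $SS_E$ for model (3.4) is:

$$SS_E^{(X_1, X_2)} = \sum_{i} \sum_{j} \sum_{k} \big( y_{ijk} - \hat{y}_{ijk}^{(X_1, X_2)} \big)^2 \tag{3.7}$$

Now, the sum of squares due to each factor can be determined as follows:

$$SS_{X_l} = SS_T - SS_E^{(X_l)} \tag{3.8}$$

where $SS_E^{(X_l)}$ represents the sum of squares of the residuals of the following model:

$$\hat{y}_{ijk}^{(X_l)} = E(Y) + \alpha_{X_l}(q) \tag{3.9}$$

and determined as

$$SS_E^{(X_l)} = \sum_{i} \sum_{j} \sum_{k} \big( y_{ijk} - \hat{y}_{ijk}^{(X_l)} \big)^2 \tag{3.10}$$

Eq. (3.8) can be obtained simply by performing a one-way ANOVA on factor $X_l$ while neglecting
the effect of all other factors (which are contained in the term $SS_E^{(X_l)}$).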

The degrees of freedom associated to the sum of squares due to each factor are:

§
The model presented does not consider the interaction between both factors. When the interaction is
considered, a third set of terms emerges in the model and this will be covered at the end of this section.

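$$\nu_{X_l} = m_l - 1 \tag{3.11}$$

where $m_l$ is the number of levels of factor $X_l$, and consequently, the degrees of freedom
associated to the $SS_E$ are:

$$\nu_E = \nu_T - \nu_{X_1} - \nu_{X_2} \tag{3.12}$$

where $\nu_T$ denotes the degrees of freedom of $SS_T$ (the number of observations minus one).

The mean squares terms are calculated in general for the source of variation $s$ as:

$$MS_s = \frac{SS_s}{\nu_s} \tag{3.13}$$

and the $F$-value for each factor is:

$$\hat{F}_{X_l} = \frac{MS_{X_l}}{MS_E} \tag{3.14}$$

Finally, the statistical significance of each factor (rejecting the corresponding null hypothesis) is
determined by any of the following criteria:

$$\hat{F}_{X_l} > F_{cr}(\alpha; \nu_{X_l}, \nu_E) \tag{3.15}$$

$$p(\hat{F}_{X_l}; \nu_{X_l}, \nu_E) < \alpha \tag{3.16}$$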

Since the data is balanced, no corrections are needed. Table 2 shows the corresponding
general two-way ANOVA table.

Table 2. Two-way ANOVA table for balanced normal data (without interaction)

Source of Variation   Sum of Squares          Degrees of Freedom        Mean Squares             F
Factor $X_1$          $SS_{X_1}$ (Eq. 3.8)    $\nu_{X_1}$ (Eq. 3.11)    $MS_{X_1}$ (Eq. 3.13)    $\hat{F}_{X_1}$ (Eq. 3.14)
Factor $X_2$          $SS_{X_2}$ (Eq. 3.8)    $\nu_{X_2}$ (Eq. 3.11)    $MS_{X_2}$ (Eq. 3.13)    $\hat{F}_{X_2}$ (Eq. 3.14)
Residual error        $SS_E$ (Eq. 3.7)        $\nu_E$ (Eq. 3.12)        $MS_E$ (Eq. 3.13)
Total                 $SS_T$ (Eq. 3.3)
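
The additive decomposition for balanced data can be checked numerically. The Python sketch below is an illustration only: it uses the complete (balanced) data listed later in Table 3, predicts each response as the sum of the two marginal level means minus the overall mean, and the helper names `level_means` and `ss_factor` are hypothetical.

```python
from statistics import fmean

# Balanced 2x2 factorial design with 2 replicates (data of Table 3).
x1 = [1, -1, 1, -1, 1, -1, -1, 1]
x2 = [1, 1, -1, -1, -1, -1, 1, 1]
y = [9, 7, 5, 5, 8, 4, 7, 8]

grand_mean = fmean(y)

def level_means(x):
    """Mean response at each level of one factor, ignoring the other factor."""
    return {lv: fmean([yi for xi, yi in zip(x, y) if xi == lv]) for lv in set(x)}

def ss_factor(x):
    """One-way sum of squares of a factor: squared deviations of level means."""
    means = level_means(x)
    return sum((means[xi] - grand_mean) ** 2 for xi in x)

m1, m2 = level_means(x1), level_means(x2)

# Additive prediction without interaction: E(Y|X1) + E(Y|X2) - E(Y)
y_hat = [m1[a] + m2[b] - grand_mean for a, b in zip(x1, x2)]

ss_total = sum((yi - grand_mean) ** 2 for yi in y)
ss_error = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
ss_x1, ss_x2 = ss_factor(x1), ss_factor(x2)

print(ss_total, ss_error + ss_x1 + ss_x2)
```

For this balanced design the total sum of squares equals the residual sum of squares plus the two factor sums of squares, with no correction needed.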

When the data is unbalanced, Eq. (3.1) is no longer valid. The correct expression for the law of
total variance is in this case (from Eq. 1.6):

$$\mathrm{Var}(Y) = E(\mathrm{Var}(Y|X_1, X_2)) + \mathrm{Var}(E(Y|X_1)) + E(\mathrm{Var}(E(Y|X_1, X_2)|X_1)) \tag{3.17}$$

In terms of sum of squares, Eq. (3.17) can be expressed as:

$$SS_T = SS_E^{(X_1, X_2)} + SS_{X_1} + SS_{X_2|X_1} \tag{3.18}$$

where $SS_{X_2|X_1}$ represents the sum of squares due to factor $X_2$ remaining after considering the
effect of factor $X_1$. Notice that, unless both factors are independent:

$$SS_{X_2|X_1} \ne SS_{X_2} \tag{3.19}$$

Unfortunately, when unbalance is present in multi-factor data, covariance emerges between
the factors, indicating that they are no longer independent.

The different types of sums of squares ($SS_{X_2|X_1}$ and $SS_{X_2}$) were originally noticed by Yates
[8,9], and were later denoted as SS Type I (sequential) and SS Type II (partial) [10,11],
respectively.

The covariance of the level values between two factors is given by:

$$\mathrm{Cov}(X_1, X_2) = E(X_1 X_2) - E(X_1) E(X_2) \tag{3.20}$$

Let us consider for example the following balanced design ($2^2$ factorial design with 2
replicates):

Table 3. Example of a randomized $2^2$ factorial design with 2 replicates

Experiment #    1    2    3    4    5    6    7    8
$X_1$           1   -1    1   -1    1   -1   -1    1
$X_2$           1    1   -1   -1   -1   -1    1    1
$Y$             9    7    5    5    8    4    7    8

For this balanced design, the covariance between the factors is $\mathrm{Cov}(X_1, X_2) = 0$, indicating that
they are independent.

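Now assume that the last experiment failed and it is not available for the analysis. Then, the
new covariance between the factors excluding experiment #8 is:

$$\mathrm{Cov}(X_1, X_2) = E(X_1 X_2) - E(X_1) E(X_2) = \left(-\tfrac{1}{7}\right) - \left(-\tfrac{1}{7}\right)\left(-\tfrac{1}{7}\right) = -\tfrac{8}{49} \approx -0.163 \tag{3.21}$$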
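
The effect of the missing run on the factor covariance can be reproduced exactly with rational arithmetic. The following Python sketch uses the factor levels of Table 3; the function name `covariance` is hypothetical, and the covariance is the population form $E(X_1 X_2) - E(X_1)E(X_2)$.

```python
from fractions import Fraction

# Factor levels of the 8 runs in Table 3.
x1 = [1, -1, 1, -1, 1, -1, -1, 1]
x2 = [1, 1, -1, -1, -1, -1, 1, 1]

def covariance(a, b):
    """Population covariance E(AB) - E(A)E(B), computed exactly."""
    n = len(a)
    mean_a = Fraction(sum(a), n)
    mean_b = Fraction(sum(b), n)
    mean_ab = Fraction(sum(p * q for p, q in zip(a, b)), n)
    return mean_ab - mean_a * mean_b

cov_full = covariance(x1, x2)                # all 8 runs: balanced design
cov_missing = covariance(x1[:-1], x2[:-1])   # experiment #8 removed

print(cov_full, cov_missing)   # prints: 0 -8/49
```

Removing a single run is enough to make the two factors correlated.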

Thus, it can be seen that unbalance in multi-factor data results in factor dependence
(correlation) and therefore: $SS_{X_2|X_1} \ne SS_{X_2}$.

Since the factors in unbalanced data are correlated, each individual model

$$\hat{y}_{ijk}^{(X_l)} = E(Y) + \alpha_{X_l}(q) \tag{3.22}$$

will neglect the additional effect of $X_l$ which is correlated with other factors, and therefore in
unbalanced data:
( )

(3.23)
On the other hand, Eq. (3.18) can also be alternatively expressed as:

$$SS_T = SS_E^{(X_1, X_2)} + SS_{X_2} + SS_{X_1|X_2} \tag{3.24}$$

resulting in a completely different ANOVA table, and possibly leading to different conclusions.
Table 4 shows the ANOVA table obtained after decomposing the variance according to Eq.
(3.18), while Table 5 shows the ANOVA obtained by decomposing the variance according to Eq.
(3.24). The ANOVA calculations were performed using the anova function on a linear model (lm)
in R (https://cran.r-project.org/), which uses Type I SS [10,11].

Table 4. ANOVA table obtained in R for the data presented in Table 3 when the last experiment
is missing. R model: lm(Y~X1+X2)

            Df   Sum Sq    Mean Sq   F value   Pr(>F)
X1           1    4.2976    4.2976   3.4381    0.13732
X2           1   10.4167   10.4167   8.3333    0.04471 *
Residuals    4    5         1.25

Table 5. ANOVA table obtained in R for the data presented in Table 3 when the last experiment
is missing. R model: lm(Y~X2+X1)

            Df   Sum Sq    Mean Sq   F value   Pr(>F)
X2           1    8.0476    8.0476   6.4381    0.06416 .
X1           1    6.6667    6.6667   5.3333    0.08209 .
Residuals    4    5         1.25
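
The order dependence of the sequential (Type I) sums of squares can be reproduced without R. The Python sketch below is an independent illustration, not the code used for Tables 4 and 5: it fits each nested model by ordinary least squares through the normal equations, and the helper names `solve`, `sse` and `sequential_ss` are hypothetical.

```python
def solve(a, b):
    """Solve the small linear system a*x = b by Gauss-Jordan elimination."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [u - factor * v for u, v in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def sse(columns, y):
    """Residual sum of squares of the least-squares fit y ~ 1 + columns."""
    rows = [[1.0] + [float(c[i]) for c in columns] for i in range(len(y))]
    p = len(rows[0])
    xtx = [[sum(r[j] * r[k] for r in rows) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * yi for r, yi in zip(rows, y)) for j in range(p)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(rows, y))

# Data of Table 3 without experiment #8.
x1 = [1, -1, 1, -1, 1, -1, -1]
x2 = [1, 1, -1, -1, -1, -1, 1]
y = [9, 7, 5, 5, 8, 4, 7]

def sequential_ss(ordered_factors):
    """Type I sums of squares: drop in residual SS as each factor enters in turn."""
    result, included, previous = [], [], sse([], y)
    for factor in ordered_factors:
        included.append(factor)
        current = sse(included, y)
        result.append(previous - current)
        previous = current
    return result, previous

ss_12, sse_12 = sequential_ss([x1, x2])   # order X1, X2 (compare with Table 4)
ss_21, sse_21 = sequential_ss([x2, x1])   # order X2, X1 (compare with Table 5)
print([round(v, 4) for v in ss_12], round(sse_12, 4))
print([round(v, 4) for v in ss_21], round(sse_21, 4))
```

The residual sum of squares is the same for both orders, while the sum of squares attributed to each factor depends on which factor enters the model first.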

Notice that even though the calculations for the residuals are exactly the same, the
decomposition of variance changes depending on the order of the analysis. Most importantly,
considering a 5% significance level, the first ANOVA identifies only factor $X_2$ as statistically
significant, whereas in the second ANOVA no factor is found statistically significant. Clearly, the
conclusion should not depend on the particular factor order considered for the analysis.

While other approaches for estimating the sum of squares are possible (including Type II and
Type III SS [11,12]), only the Type I SS preserves the validity of Eve’s Law for variance
decomposition. Thus, one possible alternative approach for compensating the correlation effect
in the sum of squares due to each factor while preserving the validity of Eve’s Law is defining the
following corrected sum of squares ($SS_{X_l}^{c}$):

$$SS_{X_l}^{c} = \big( SS_T - SS_E^{(X_1, X_2)} \big) \frac{SS_T - SS_E^{(X_l)}}{\sum_{k=1}^{2} \big( SS_T - SS_E^{(X_k)} \big)} \tag{3.25}$$

This way, the sum of squares of each factor considering correlated effects is assumed to be
proportional to the relative magnitude of their partial (Type II) effects.

For balanced data, we simply obtain from Eq. (3.25):

$$SS_{X_l}^{c} = SS_T - SS_E^{(X_l)} = SS_{X_l} \tag{3.26}$$

Table 6 shows the ANOVA table obtained by using the corrected sum of squares defined in Eq.
(3.25) for the example previously considered. Notice that the corrected sums of squares
have intermediate values between the sums of squares reported for each factor in the
previous ANOVA tables (Table 4 and Table 5), as expected. No factor is found
statistically significant, although factor $X_2$ is at the very limit of significance. Additional data
should be collected if a more definitive conclusion regarding $X_2$ is required.

Table 6. ANOVA table for the data presented in Table 3 (when the last experiment is missing)
obtained after correcting the sum of squares due to each factor using Eq. (3.25).

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Squares   $\hat{F}$   $p$-value**
$X_1$                 5.1223           1                    5.1223         4.0979      0.1129
$X_2$                 9.5920           1                    9.5920         7.6736      0.0503
Residual error        5                4                    1.25
Total                 19.7143          6
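
The corrected sums of squares in Table 6 can be recovered from the values already reported in Tables 4 and 5. The sketch below is an illustration under an assumption: the explained variation (total minus residual sum of squares) is shared among the factors in proportion to the sum of squares each factor obtains when it enters the model first, which is the allocation that reproduces the numbers of Table 6. Variable names are hypothetical.

```python
# Values reported in Tables 4-6 (7 runs, experiment #8 missing).
ss_total = 19.7143
ss_error = 5.0                             # residual SS of the two-factor model
ss_single = {"X1": 4.2976, "X2": 8.0476}   # SS of each factor when it enters first

explained = ss_total - ss_error            # variation to be shared among the factors
scale = explained / sum(ss_single.values())
ss_corrected = {name: scale * ss for name, ss in ss_single.items()}

ms_error = ss_error / 4                    # 4 residual degrees of freedom
f_values = {name: ss / ms_error for name, ss in ss_corrected.items()}  # 1 d.o.f. per factor

for name in ss_single:
    print(name, round(ss_corrected[name], 4), round(f_values[name], 4))
```

By construction the corrected sums of squares add up to the explained variation, so the law of total variance is preserved regardless of the factor order.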

When interactions are considered, an additional term is added to the model (3.4):

**
The -values were obtained considering the corrected degrees of freedom for each factor (Eq. 2.30),
corresponding both to in this case.

$$\hat{y}_{ijk}^{(X_1, X_2, X_1 X_2)} = E(Y) + \alpha_{X_1}(i) + \alpha_{X_2}(j) + \alpha_{X_1 X_2}(i, j) \tag{3.27}$$

While the interaction could be considered as an additional factor, it must be treated slightly
differently because the type I sum of squares of the interaction contains most of the sums of
squares of the interacting factors. Thus, the type II sum of squares of the interaction must be
necessarily considered, as follows:

( ) ( ) ( )

(3.28)

The corrected sum of squares when the interaction is considered becomes:

( )
( )

(3.29)
On the other hand, the degrees of freedom for the interaction term will be:

(3.30)

and the degrees of freedom of the residual error are now:

(3.31)

The ANOVA table considering the interaction then becomes (in general for balanced or
unbalanced data):

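Table 7. Two-way ANOVA table for balanced or unbalanced normal data (with interaction)

Source of Variation       Sum of Squares                  Degrees of Freedom             Mean Squares       F
Factor $X_1$              $SS_{X_1}^{c}$ (Eq. 3.29)       $\nu_{X_1}$ (Eq. 3.11)         $MS_{X_1}$         $\hat{F}_{X_1}$
Factor $X_2$              $SS_{X_2}^{c}$ (Eq. 3.29)       $\nu_{X_2}$ (Eq. 3.11)         $MS_{X_2}$         $\hat{F}_{X_2}$
Interaction $X_1 X_2$     $SS_{X_1 X_2}^{c}$ (Eq. 3.29)   $\nu_{X_1 X_2}$ (Eq. 3.30)     $MS_{X_1 X_2}$     $\hat{F}_{X_1 X_2}$
Residual error            $SS_E$ (Eq. 3.7)                $\nu_E$ (Eq. 3.31)             $MS_E$ (Eq. 3.13)
Total                     $SS_T$ (Eq. 3.3)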

4. Multi-Factor Analysis

Let us now generalize the variance decomposition analysis of balanced or unbalanced data
considering the following multi-factor estimation model for the response variable:

$$\hat{y} = \beta_0 + \sum_{k=1}^{n} \sum_{j=1}^{m_k} \beta_{k,j} \, \delta_{k,j} \tag{4.1}$$

where $\beta_0$ and $\beta_{k,j}$ are model coefficients obtained from the experimental data by least
squares regression, and $\delta_{k,j}$ are binary variables defined as:

$$\delta_{k,j} = \begin{cases} 1 & (X_k \text{ at level } j) \\ 0 & (\text{otherwise}) \end{cases} \tag{4.2}$$

that is, they have a value of $1$ when the corresponding observation was obtained when factor
$X_k$ was at level $j$, and a value of $0$ for all other levels of $X_k$.
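
The binary variables described above amount to one indicator column per factor level. The following Python sketch shows one way to build them; the function name `indicator_columns` and the list of factor settings are hypothetical and serve only as an illustration.

```python
def indicator_columns(levels):
    """Return {level: 0/1 column} for one categorical factor."""
    distinct = sorted(set(levels))
    return {lv: [1 if obs == lv else 0 for obs in levels] for lv in distinct}

catalyst = ["A", "B", "C", "B", "A", "C", "B"]   # hypothetical factor settings
columns = indicator_columns(catalyst)

for level, column in columns.items():
    print(level, column)

# Each observation belongs to exactly one level, so the columns of a factor sum to 1.
row_sums = [sum(col[i] for col in columns.values()) for i in range(len(catalyst))]
```

Because the indicator columns of a factor always sum to one, they are linearly dependent on the intercept, which is why one level per factor is usually absorbed into the constant term in a least squares fit.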

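For this particular model (Eq. 4.1), the sum of squares due to model residuals ($SS_E$) will be:

$$\begin{aligned}
SS_E &= \sum_{i_1} \sum_{i_2} \cdots \sum_{i_n} \sum_{r} \big( y_{i_1 i_2 \ldots i_n r} - \hat{y}_{i_1 i_2 \ldots i_n r} \big)^2 \\
&= \sum_{i_1} \sum_{i_2} \cdots \sum_{i_n} \sum_{r} \Big( y_{i_1 i_2 \ldots i_n r} - \beta_0 - \sum_{k} \sum_{j} \beta_{k,j} \, \delta_{k,j} \Big)^2
\end{aligned} \tag{4.3}$$

where $i_k$ denotes the level of factor $X_k$ and $r$ the replicate of each observation.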

$MS_E$ values may be different for each possible combination of the factors considered in the
model. Thus, while the analysis of variance decomposition can be done for any model
proposed, the most reliable conclusions will be those obtained from the model with the lowest
$MS_E$ value. Usually, terms with low $\hat{F}$ values are less likely to contribute to lowering the $MS_E$
value. Thus, those terms can be stepwise removed in order to improve the model.

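The $\hat{F}$ values for each factor in the model considering either balanced or unbalanced data can
be determined as the ratio of mean squares:

$$\hat{F}_{X_k} = \frac{MS_{X_k}}{MS_E} \tag{4.4}$$

where

$$MS_{X_k} = \frac{SS_{X_k}^{c}}{\nu_{X_k}} \tag{4.5}$$

$$MS_E = \frac{SS_E}{\nu_E} \tag{4.6}$$

$$SS_{X_k}^{c} = ( SS_T - SS_E ) \frac{SS_T - SS_E^{(X_k)}}{\sum_{l=1}^{n} \big( SS_T - SS_E^{(X_l)} \big)} \tag{4.7}$$

$$SS_E^{(X_k)} = \sum_{i_1} \sum_{i_2} \cdots \sum_{i_n} \sum_{r} \big( y_{i_1 i_2 \ldots i_n r} - E(Y|X_k = x_{k,i_k}) \big)^2 \tag{4.8}$$

$$SS_T = \sum_{i_1} \sum_{i_2} \cdots \sum_{i_n} \sum_{r} \big( y_{i_1 i_2 \ldots i_n r} - E(Y) \big)^2 \tag{4.9}$$

$$\nu_E = \nu_T - \sum_{k=1}^{n} \nu_{X_k} \tag{4.10}$$

$$\nu_T = \sum_{i_1} \sum_{i_2} \cdots \sum_{i_n} n_{i_1 i_2 \ldots i_n} - 1 \tag{4.11}$$

$$\nu_{X_k} = m_k - 1 \tag{4.12}$$

with $n_{i_1 i_2 \ldots i_n}$ denoting the number of replicates available for each combination of factor levels.

Finally, the statistical significance of each factor is confirmed when any of the following criteria
is met:

$$\hat{F}_{X_k} > F_{cr}(\alpha; \nu_{eq,k}, \nu_E) \tag{4.12}$$

$$p(\hat{F}_{X_k}; \nu_{eq,k}, \nu_E) < \alpha \tag{4.13}$$

where

$$\nu_{eq,k} = m_k (m_k - 1) \sum_{j=1}^{m_k} w_{k,j}^2 \tag{4.14}$$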

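$$w_{k,j} = \frac{\sum_{i_1} \cdots \sum_{i_{k-1}} \sum_{i_{k+1}} \cdots \sum_{i_n} n_{i_1 \ldots i_{k-1} \, j \, i_{k+1} \ldots i_n}}{\sum_{i_1} \sum_{i_2} \cdots \sum_{i_n} n_{i_1 i_2 \ldots i_n}} \tag{4.15}$$

$$F_{cr}(\alpha; \nu_{eq,k}, \nu_E) = F_{cr}(\alpha; \lfloor \nu_{eq,k} \rfloor, \nu_E) + (\nu_{eq,k} - \lfloor \nu_{eq,k} \rfloor) \big( F_{cr}(\alpha; \lceil \nu_{eq,k} \rceil, \nu_E) - F_{cr}(\alpha; \lfloor \nu_{eq,k} \rfloor, \nu_E) \big) \tag{4.16}$$

$$p(\hat{F}_{X_k}; \nu_{eq,k}, \nu_E) = p(\hat{F}_{X_k}; \lfloor \nu_{eq,k} \rfloor, \nu_E) + (\nu_{eq,k} - \lfloor \nu_{eq,k} \rfloor) \big( p(\hat{F}_{X_k}; \lceil \nu_{eq,k} \rceil, \nu_E) - p(\hat{F}_{X_k}; \lfloor \nu_{eq,k} \rfloor, \nu_E) \big) \tag{4.17}$$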

All these calculations can be summarized in the multi-factor ANOVA table presented in Table 8.

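Table 8. Multi-factor ANOVA table for balanced or unbalanced normal data using the proposed
correction on the sum of squares presented in Eq. (4.7)

Source of Variation   Sum of Squares              Degrees of Freedom        Mean Squares            $F$-values                   $p$-values
Factor $X_1$          $SS_{X_1}^{c}$ (Eq. 4.7)    $\nu_{X_1}$ (Eq. 4.12)    $MS_{X_1}$ (Eq. 4.5)    $\hat{F}_{X_1}$ (Eq. 4.4)    $p_{X_1}$ (Eq. 4.17)
Factor $X_2$          $SS_{X_2}^{c}$ (Eq. 4.7)    $\nu_{X_2}$ (Eq. 4.12)    $MS_{X_2}$ (Eq. 4.5)    $\hat{F}_{X_2}$ (Eq. 4.4)    $p_{X_2}$ (Eq. 4.17)
...
Factor $X_n$          $SS_{X_n}^{c}$ (Eq. 4.7)    $\nu_{X_n}$ (Eq. 4.12)    $MS_{X_n}$ (Eq. 4.5)    $\hat{F}_{X_n}$ (Eq. 4.4)    $p_{X_n}$ (Eq. 4.17)
Residual error        $SS_E$ (Eq. 4.3)            $\nu_E$ (Eq. 4.10)        $MS_E$ (Eq. 4.6)
Total                 $SS_T$ (Eq. 4.9)            $\nu_T$ (Eq. 4.11)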

If any of the factors included in the model represents the interaction between two factors,
then the corresponding individual sum of squares must be determined as follows:

( )

(4.18)

In addition, the degrees of freedom of such interaction must be determined as:

( ) ( )
(4.19)

where the binary variable associated with each interacting factor takes a value of 1 when the main
effect of that factor is considered in the model, and a value of 0 when the main effect is not included.
This expression indicates that the interaction term gains the degrees of freedom of the interacting
factors when they are not included in the model.

5. Examples

5.1. Unbalanced Single-factor Design

Let us consider the first example presented by Ståhle and Wold [2] for the yield of a chemical
reaction using three different catalysts (A, B, and C), summarized in Table 9. The corresponding
uncorrected ANOVA table for this data is presented in Table 10.

Table 9. Unbalanced single-factor design: Reaction yield using different catalysts [2]

Yield (%)
A     B     C
91    85    96
95    88    98
92    87    97
90    86    97
      89

Table 10. Uncorrected ANOVA table for an unbalanced single-factor design: Reaction yield using
different catalysts

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Squares   F         Fcr (5%)   P-value
Catalyst              223.0769         2                    111.5385       42.8994   4.1028     1.2394E-05
Error                 26               10                   2.6
Total                 249.0769         12
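
The sums of squares and the $F$ ratio of Table 10 follow directly from the yields in Table 9. The Python sketch below is an independent check rather than the spreadsheet implementation used in the report; the function name `one_way_anova` is hypothetical, and the critical value and $p$-value are not recomputed because they require the $F$ distribution.

```python
from statistics import fmean

# Reaction yields (%) from Table 9.
yields = {
    "A": [91, 95, 92, 90],
    "B": [85, 88, 87, 86, 89],
    "C": [96, 98, 97, 97],
}

def one_way_anova(groups):
    """Return (SS_factor, SS_error, F) for a dict of group -> observations."""
    values = [v for ys in groups.values() for v in ys]
    grand_mean = fmean(values)
    ss_factor = sum(len(ys) * (fmean(ys) - grand_mean) ** 2 for ys in groups.values())
    ss_error = sum((v - fmean(ys)) ** 2 for ys in groups.values() for v in ys)
    df_factor = len(groups) - 1
    df_error = len(values) - len(groups)
    f_value = (ss_factor / df_factor) / (ss_error / df_error)
    return ss_factor, ss_error, f_value

ss_factor, ss_error, f_value = one_way_anova(yields)
print(round(ss_factor, 4), round(ss_error, 4), round(f_value, 4))
```

The group sizes enter the between-group sum of squares as weights, so no balancing of the data is needed to obtain the decomposition.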

Considering the effect of data unbalance, the equivalent degrees of freedom (Eq. 2.30) for the
catalyst term are:

$$\nu_{eq} = 3 (3 - 1) \left[ \left(\tfrac{4}{13}\right)^2 + \left(\tfrac{5}{13}\right)^2 + \left(\tfrac{4}{13}\right)^2 \right] = \tfrac{342}{169} \approx 2.0237 \tag{5.1}$$

The ANOVA corrected (in the calculation of $F_{cr}$ and $p$-value) with the equivalent degrees of freedom
is presented in Table 11. All the calculations using the method proposed in this report were
done using the MS-Excel-based application ForsChem Actinium XL (also available to download
at www.forschem.org).

Table 11. Corrected ANOVA table for an unbalanced single-factor design: Reaction yield using
different catalysts

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Squares   F         Fcr (5%)   P-value
Catalyst              223.0769         2                    111.5385       42.8994   4.0935     1.2221E-05
Error                 26               10                   2.6
Total                 249.0769         12

No big differences are observed (1.4% decrease in $p$-value after the correction) as the
equivalent degrees of freedom are close to the original value of 2 (1.18% increase after the
correction).

The model obtained for this data is:

( )
(5.2)

where is a binary variable representing the type of catalyst employed, represents a Type III
standard residual error ( ( ) , ( ) ) [14], and √ . Since the
residuals of the model are normally distributed (cf. Figure 1) then can be replaced by a
standard normal random variable .

Figure 1. Normality plot for the residuals of model (5.2)

Since the catalyst type is a fixed categorical variable, the corresponding binary variables
considered in Eq. (5.2) are related by the following expression:

(5.3)
Thus, Eq. (5.2) can be alternatively expressed as:

( )
(5.4)

In addition, let us compare the results obtained when the data is forced to be balanced. Table
12 to Table 16 show the ANOVA tables obtained when each of the observations for catalyst B
is independently removed. Table 17 shows the result obtained when the missing observations
of catalysts A and C are filled with the corresponding averages of the available data.

Table 12. ANOVA table obtained for a forced balanced design of reaction yield using different
catalysts by removing the first observation (85) of catalyst B.

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Squares   F         Fcr (5%)   P-value
Catalyst              180.6667         2                    90.3333        38.7143   4.2565     3.7943E-05
Error                 21               9                    2.3333
Total                 201.6667         11

Table 13. ANOVA table obtained for a forced balanced design of reaction yield using different
catalysts by removing the second observation (88) of catalyst B.

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Squares   F         Fcr (5%)   P-value
Catalyst              210.1667         2                    105.0833       38.2121   4.2565     3.9992E-05
Error                 24.75            9                    2.75
Total                 234.9167         11

Table 14. ANOVA table obtained for a forced balanced design of reaction yield using different
catalysts by removing the third observation (87) of catalyst B.

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Squares   F         Fcr (5%)   P-value
Catalyst              200              2                    100            34.6154   4.2565     5.9415E-05
Error                 26               9                    2.8889
Total                 226              11

Table 15. ANOVA table obtained for a forced balanced design of reaction yield using different
catalysts by removing the fourth observation (86) of catalyst B.

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Squares   F         Fcr (5%)   P-value
Catalyst              190.1667         2                    95.0833        34.5758   4.2565     5.9686E-05
Error                 24.75            9                    2.75
Total                 214.9167         11

Table 16. ANOVA table obtained for a forced balanced design of reaction yield using different
catalysts by removing the last observation (89) of catalyst B.
| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Squares | F | Fcr (5%) | P-value |
|---|---|---|---|---|---|---|
| Catalyst | 220.6667 | 2 | 110.3333 | 47.2857 | 4.2565 | 1.6808E-05 |
| Error | 21 | 9 | 2.3333 | | | |
| Total | 241.6667 | 11 | | | | |

Table 17. ANOVA table obtained for a forced balanced design of reaction yield using different
catalysts by adding average observations (92 for catalyst A and 97 for catalyst C).
| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Squares | F | Fcr (5%) | P-value |
|---|---|---|---|---|---|---|
| Catalyst | 250 | 2 | 125 | 57.6923 | 3.8853 | 6.9885E-07 |
| Error | 26 | 12 | 2.1667 | | | |
| Total | 276 | 14 | | | | |

Even though the conclusion regarding the statistical significance of the catalyst effect
(assuming a 5% significance level) did not change, important numerical differences are
observed among the possible results. In particular, the differences in P-values (relative to
the uncorrected analysis) are summarized in Table 18. These results show that data
manipulation can lead to a wide range of possible P-values (here from about 94% below to about
382% above the uncorrected value), potentially compromising the conclusion.

Table 18. Comparison of P-values for the different possible ANOVA tables obtained for the
unbalanced reaction yield data in Table 9.
| ANOVA | P-value | P-value difference | Relative difference (%) |
|---|---|---|---|
| Uncorrected | 1.2394E-05 | | |
| Corrected | 1.2221E-05 | -1.73E-07 | -1.4% |
| Removing data 1 | 3.7943E-05 | 2.55E-05 | 206.1% |
| Removing data 2 | 3.9992E-05 | 2.76E-05 | 222.7% |
| Removing data 3 | 5.9415E-05 | 4.70E-05 | 379.4% |
| Removing data 4 | 5.9686E-05 | 4.73E-05 | 381.6% |
| Removing data 5 | 1.6808E-05 | 4.41E-06 | 35.6% |
| Adding data | 6.9885E-07 | -1.17E-05 | -94.4% |
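
The difference columns of Table 18 are simple arithmetic on the P-values, taking the uncorrected analysis as the reference. A minimal sketch of that calculation (the variable names are illustrative only):

```python
# P-values from Table 18; the uncorrected analysis is the reference
p_uncorrected = 1.2394e-05
p_values = {
    "Corrected": 1.2221e-05,
    "Removing data 1": 3.7943e-05,
    "Removing data 2": 3.9992e-05,
    "Removing data 3": 5.9415e-05,
    "Removing data 4": 5.9686e-05,
    "Removing data 5": 1.6808e-05,
    "Adding data": 6.9885e-07,
}

# Relative difference (%) with respect to the uncorrected P-value
relative_difference = {
    name: 100.0 * (p - p_uncorrected) / p_uncorrected
    for name, p in p_values.items()
}

for name, rel in relative_difference.items():
    difference = p_values[name] - p_uncorrected
    print(f"{name:16s} {difference:+.2e} {rel:+7.1f}%")
```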

5.2. Unbalanced Two-factor Designs – Example 1

Let us now consider four different examples of unbalanced data when two factors are
involved.

Shaw and Mitchell-Olds [13] present a hypothetical example for the growth of plants
considering the initial size of the plant (small or large), and the treatments (removal or not of
neighboring plants). The hypothetical data considered is presented in Table 19.

Table 19. Unbalanced two-factor design: Plant growth [13]


Treatment Initial Size Final plant height
Control Small 50
Control Small 57
Removal Small 57
Removal Small 71
Removal Small 85
Control Large 91
Control Large 94
Control Large 102
Control Large 110
Removal Large 105
Removal Large 120
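
The dependence of Type I (sequential) sums of squares on the order of the terms can be checked directly on the data of Table 19. The sketch below is an illustration under stated assumptions, not the procedure of the report or of [13]: it assumes a 0/1 coding (Removal = 1, Large = 1), fits nested linear models by ordinary least squares with a small hand-written solver, and takes each sequential SS as the drop in residual sum of squares when a term is added. The helper names `solve` and `sse` are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def sse(columns, y):
    """Residual sum of squares of the least-squares fit of y on columns."""
    k = len(columns)
    xtx = [[sum(u * v for u, v in zip(columns[i], columns[j]))
            for j in range(k)] for i in range(k)]
    xty = [sum(u * v for u, v in zip(columns[i], y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(beta[i] * columns[i][r] for i in range(k))
              for r in range(len(y))]
    return sum((obs - fit) ** 2 for obs, fit in zip(y, fitted))


# Table 19, with an assumed 0/1 coding: Removal = 1, Large = 1
treat = [0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1]
size = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
height = [50, 57, 57, 71, 85, 91, 94, 102, 110, 105, 120]
ones = [1] * len(height)
inter = [t * s for t, s in zip(treat, size)]

sse_mean = sse([ones], height)                      # total SS
sse_t = sse([ones, treat], height)
sse_s = sse([ones, size], height)
sse_ts = sse([ones, treat, size], height)
sse_full = sse([ones, treat, size, inter], height)  # error SS

# Sequential SS with the treatment entered first
ss_treat_first = sse_mean - sse_t
ss_size_second = sse_t - sse_ts
ss_interaction = sse_ts - sse_full

# Sequential SS with the initial size entered first
ss_size_first = sse_mean - sse_s
ss_treat_second = sse_s - sse_ts

print(ss_treat_first, ss_size_second, ss_interaction, sse_full)
print(ss_size_first, ss_treat_second)
```

With the treatment entered first, the treatment SS is small; with the initial size entered first, the treatment SS adjusted for size is more than an order of magnitude larger, even though the data and the fitted full model are identical.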

Using the SS values reported by the authors (considering different SS types), the following
ANOVA tables (Table 20 to Table 22) are obtained:

Table 20. ANOVA table obtained using Type I SS for the plant growth unbalanced design
| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Squares | F | Fcr (5%) | P-value |
|---|---|---|---|---|---|---|
| Treatment | 35.3 | 1 | 35.3 | 0.330 | 5.5914 | 0.5834 |
| Initial Size | 4846.0 | 1 | 4846.0 | 45.4 | 5.5914 | 0.0003 |
| Interaction | 11.4 | 1 | 11.4 | 0.107 | 5.5914 | 0.7535 |
| Error | 747.7 | 7 | 106.8 | | | |
| Total Sum | 5640.4 | 10 | | | | |

Table 21. ANOVA table obtained using Type II SS for the plant growth unbalanced design
| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Squares | F | Fcr (5%) | P-value |
|---|---|---|---|---|---|---|
| Treatment | 590.2 | 1 | 590.2 | 5.525 | 5.5914 | 0.0510 |
| Initial Size | 4846.0 | 1 | 4846.0 | 45.4 | 5.5914 | 0.0003 |
| Interaction | 11.4 | 1 | 11.4 | 0.107 | 5.5914 | 0.7535 |
| Error | 747.7 | 7 | 106.8 | | | |
| Total Sum | 6195.3 | 10 | | | | |

Table 22. ANOVA table obtained using Type III SS for the plant growth unbalanced design
| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Squares | F | Fcr (5%) | P-value |
|---|---|---|---|---|---|---|
| Treatment | 597.2 | 1 | 597.2 | 5.591 | 5.5914 | 0.0500 |
| Initial Size | 4807.9 | 1 | 4807.9 | 45.0 | 5.5914 | 0.0003 |
| Interaction | 11.4 | 1 | 11.4 | 0.107 | 5.5914 | 0.7535 |
| Error | 747.7 | 7 | 106.8 | | | |
| Total Sum | 6164.2 | 10 | | | | |

In the previous tables, the total sums of squares were obtained by applying the law of total
variance to the individual sums of squares (rather than directly from the response data),
resulting in a different total for each SS type despite using the same data set. In addition,
the significance of the treatment effect differs widely between the methods: its P-value ranges
from 0.5834 (Type I) to 0.0500 (Type III). Using the correction for unbalanced data proposed in
this report, the following ANOVA table is obtained:

Table 23. ANOVA table obtained using the correction proposed in this report for the plant
growth unbalanced design
| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Squares | F | Fcr (5%) | P-value |
|---|---|---|---|---|---|---|
| Treatment | 18.76 | 1 | 18.76 | 0.1756 | 5.5844 | 0.6890 |
| Initial Size | 2277.39 | 1 | 2277.39 | 21.3196 | 5.5844 | 0.0024 |
| Interaction | 2596.65 | 1 | 2596.65 | 24.3083 | 5.5138 | 0.0016 |
| Error | 747.75 | 7 | 106.82 | | | |
| Total | 5640.55 | 10 | | | | |

Several differences can be observed with respect to the previously reported calculations. While
the SS due to residual error is identical to that reported in the previous tables, the SSs due to
the factors and the interaction are different. Since the factors and the interaction are all
mutually correlated, the uncorrected ANOVA tables assign most of the effect of the
interaction to the individual factors. To clarify this, let us consider the model obtained:

(5.5)

where represents a standard random variable (since in this case it is normal, it can also be
denoted as ).

By looking at the model coefficients, the effect of the treatment ( ) is apparently much
larger than the effect of the interaction ( ). However, we must take into account
that the binary variable is confounded with binary variables of the interaction
according to the following expression:

(5.6)

Similarly, the effect of initial plant size is also confounded according to:

(5.7)

Therefore, the values of the coefficients cannot be used to assess with absolute certainty the
relative effect of the factors and interactions simultaneously.
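
The correlation between the binary variables of the main effects and of the interaction can be illustrated numerically. The sketch below assumes a 0/1 coding (Removal = 1, Large = 1) with the interaction variable taken as the product of the two; this coding is an assumption and may differ from the one used in Eq. (5.5) to Eq. (5.7). It compares the design of Table 19 with a hypothetical balanced design having one run per cell.

```python
from math import sqrt


def correlation(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def design_correlations(treat, size):
    """Correlations between the two factor columns and their product."""
    inter = [t * s for t, s in zip(treat, size)]
    return {
        "treat-size": correlation(treat, size),
        "treat-inter": correlation(treat, inter),
        "size-inter": correlation(size, inter),
    }


# Unbalanced plant growth design (Table 19), assumed coding Removal = 1, Large = 1
r_unbalanced = design_correlations(
    [0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
)

# Hypothetical balanced 2x2 design with one run per cell, same coding
r_balanced = design_correlations([0, 0, 1, 1], [0, 1, 0, 1])

print(r_unbalanced)
print(r_balanced)
```

Under this coding, the two factor columns are uncorrelated in the balanced design but correlated in the unbalanced one, while the product column remains correlated with both factor columns in either case.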

Considering Eq. (5.6) and (5.7), Eq. (5.5) can be expressed in general as follows:

(5.8)

where and are the “true” values of the coefficients representing the non-
interacting effects of the factors, which are unknown due to confounding. Such confounding
makes the ANOVA analysis uncertain, and thus, assumptions are needed in order to reach any
conclusion. By assuming and , a minimal effect of the
interaction is obtained. By assuming , the effect of the interaction is
maximal. A more conservative and equilibrated approach might be assuming middle values
(distributing the effects equally between the factors and the interactions), where
and , resulting in:

(5.9)

Eq. (5.9) is not necessarily correct but it minimizes the risk of error in the values of the model
coefficients. From this equation, it becomes clear that the effect of the interaction can be
larger than both individual effects of the factors, as described by the corrected ANOVA (Table
23). In this case, the interaction becomes significant, whereas all other approaches considered
this effect negligible. Notice that the uncertainty introduced by confounding is present in
both balanced and unbalanced data.

However, for the previous analysis to be valid, the best possible model (lowest
standard error) must be used. Since the estimation of the error depends on the available
degrees of freedom, it is always important to test whether including individual or
interaction terms results in a better model (i.e. applying the principle of Parsimony [15]). By
removing the interaction, the following results are obtained:

Table 24. ANOVA table obtained using the correction proposed in this report for the plant
growth unbalanced design (no interaction considered)
| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Squares | F | Fcr (5%) | P-value |
|---|---|---|---|---|---|---|
| Treatment | 39.88 | 1 | 39.88 | 0.4202 | 5.3106 | 0.5361 |
| Initial Size | 4841.51 | 1 | 4841.51 | 51.0198 | 5.3106 | 9.7202E-5 |
| Error | 759.16 | 8 | 94.89 | | | |
| Total | 5640.55 | 10 | | | | |

Since the standard error decreases by removing the interaction (the error mean square goes from
106.82 in Table 23 to 94.89 in Table 24), the non-interacting model can be considered a better
model††:

(5.10)

Furthermore, by removing the interaction the confounding is also removed. Now, we can
conclude that the effect of the initial size is significant but the effect of the treatment is
negligible.
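
The model comparison used here reduces to comparing error mean squares. A minimal sketch, assuming that the standard error of a model is taken as the square root of its error mean square, and using the error sums of squares and degrees of freedom listed in Table 23 and Table 24:

```python
from math import sqrt

# (error sum of squares, error degrees of freedom) from Tables 23 and 24
models = {
    "with interaction": (747.75, 7),
    "without interaction": (759.16, 8),
}

# Assumed definition: standard error = sqrt(error SS / error degrees of freedom)
standard_error = {
    name: sqrt(ss_error / df_error)
    for name, (ss_error, df_error) in models.items()
}

# The preferred model is the one with the lowest standard error
best = min(standard_error, key=standard_error.get)

for name, value in standard_error.items():
    print(f"{name:20s} standard error = {value:.3f}")
print("preferred model:", best)
```

Dropping the interaction raises the error sum of squares slightly but frees one degree of freedom, and the second effect dominates in this example.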

While the analysis of the model containing the interaction terms was not adequate for this
particular example because it did not provide an optimal model, it clearly illustrates the risk of
confounding in ANOVA when interactions are considered.

†† By further removing the effect of the treatment from the model, the standard error increases,
and thus, the model considering both main effects can be considered optimal.

5.3. Unbalanced Two-factor Designs – Example 2

Smith & Cribbie [12] describe a factorial example using simulated data. In this example,
the gambling behavior scores of different participants are tabulated according to gender and
athletic status. The results obtained are presented in Table 25.

The authors report different ANOVA tables using different types of SS. These tables are
presented in Table 26 to Table 28. On the other hand, the ANOVA table obtained for this set of
unbalanced data using the proposed approach is presented in Table 29.

Table 25. Unbalanced factorial design: Gambling behavior [12]


Gender Athletic Status Gambling Score
Male Current 3.0
Male Current 3.0
Male Current 2.8
Male Former 5.1
Male Former 4.7
Male Former 4.9
Male Former 5.2
Male Former 4.9
Male Former 5.0
Male Non-athlete 2.1
Male Non-athlete 2.0
Male Non-athlete 1.9
Male Non-athlete 1.8
Female Current 2.3
Female Current 2.4
Female Current 2.1
Female Former 3.9
Female Former 4.1
Female Former 3.8
Female Non-athlete 1.2
Female Non-athlete 1.1
Female Non-athlete 1.3
Female Non-athlete 1.1
Female Non-athlete 1.0

Table 26. ANOVA table obtained using Type I SS from [12].


| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Squares | F | Fcr (5%) | P-value |
|---|---|---|---|---|---|---|
| Gender | 11.023 | 1 | 11.023 | 528.625 | 4.4138 | 8.57E-15 |
| Status | 37.94 | 2 | 18.97 | 909.757 | 3.5546 | 8.31E-19 |
| Interaction | 0.121 | 2 | 0.061 | 2.906 | 3.5546 | 0.081 |
| Error | 0.375 | 18 | 0.021 | | | |
| Total Sum | 49.459 | 23 | | | | |

Table 27. ANOVA table obtained using Type II SS from [12].


| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Squares | F | Fcr (5%) | P-value |
|---|---|---|---|---|---|---|
| Gender | 4.139 | 1 | 4.139 | 198.497 | 4.4138 | 3.66E-11 |
| Status | 37.94 | 2 | 18.97 | 909.757 | 3.5546 | 8.31E-19 |
| Interaction | 0.121 | 2 | 0.061 | 2.906 | 3.5546 | 0.081 |
| Error | 0.375 | 18 | 0.021 | | | |
| Total Sum | 42.575 | 23 | | | | |

Table 28. ANOVA table obtained using Type III SS from [12].
| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Squares | F | Fcr (5%) | P-value |
|---|---|---|---|---|---|---|
| Gender | 3.897 | 1 | 3.897 | 186.888 | 4.4138 | 6.03E-11 |
| Status | 35.989 | 2 | 17.995 | 862.971 | 3.5546 | 1.33E-18 |
| Interaction | 0.121 | 2 | 0.061 | 2.906 | 3.5546 | 0.081 |
| Error | 0.375 | 18 | 0.021 | | | |
| Total Sum | 40.382 | 23 | | | | |

Table 29. ANOVA table obtained using the correction proposed in this report for the gambling
behavior unbalanced design
| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Squares | F | Fcr (5%) | P-value |
|---|---|---|---|---|---|---|
| Gender | 5.1562 | 1 | 5.1562 | 247.2779 | 4.4079 | 5.7959E-12 |
| Status | 20.9676 | 2 | 10.4838 | 502.7757 | 3.5299 | 1.5178E-16 |
| Interaction | 22.9604 | 2 | 11.4802 | 550.5611 | 3.4888 | 6.1180E-17 |
| Error | 0.3753 | 18 | 0.02085 | | | |
| Total | 49.4596 | 23 | | | | |

The model obtained for the ANOVA presented in Table 29 is the following:

(5.11)
where represents a standard normal random variable.

Once again, the coefficient values in the model obtained do not necessarily reflect the factor
effects, since confounding between individual terms and the interaction is present. However,
we can conclude that conventional analyses consider the effect of the interaction
non-significant for all three types of SS (P = 0.081), whereas the method proposed in this
report considers all terms significant. Furthermore, the proposed method considers the effect of
the interaction to be even greater than the effect of the athletic status (considering the
correction for unbalanced data). Let us recall that the proposed analysis represents a
middle-point distribution of the confounded effects, which minimizes the risk of error, since
the available data do not allow concluding whether the observed effect is caused entirely by
the athletic status, entirely by the interaction, or by some particular distribution between
the two.

In this example, removing the interaction results in a model with a larger standard error. Thus,
model (5.11) can be considered optimal, and in this case confounding is inevitable.

5.4. Unbalanced Two-factor Designs – Example 3

The next case was obtained from data presented by Lewsey et al. [16]. They considered a
general factorial experimental design with 726 possible unbalanced sets of data, by
assuming a different number of replicates (1, 2 or 3) for each treatment. All data were obtained
from a non-interacting model with normal error. The set of unbalanced data for the current
example, presented in Table 30, was randomly chosen among the 726 possibilities.

The optimal model found in this case is:

(5.12)

The incorporation of interaction terms increases the standard error of the model.

The corresponding ANOVA table using the correction for unbalanced data is presented in Table
31. The results obtained indicate that both factors are significant, and the interaction is not
even considered in the model, which is consistent with the nature of the original model.

Table 30. Unbalanced factorial design: Random example from data in [16]
Factor A Factor B Response
1 1 7.626
1 1 10.424
1 2 5.878
1 2 6.878
1 2 3.024
1 3 3.786
1 3 6.614
2 1 6.04
2 2 2.878
2 2 3.878
2 2 0.024
2 3 0.786
2 3 3.614

Table 31. ANOVA table obtained using the correction proposed in this report for an unbalanced
design from Lewsey et al. [16]
| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Squares | F | Fcr (5%) | P-value |
|---|---|---|---|---|---|---|
| Factor A | 33.6862 | 1 | 33.6862 | 10.8621 | 5.1123 | 0.00926 |
| Factor B | 37.3127 | 2 | 18.6567 | 6.0158 | 4.1912 | 0.02089 |
| Error | 27.9112 | 9 | 3.1012 | | | |
| Total | 98.9101 | 12 | | | | |

5.5. Unbalanced Two-factor Designs – Example 4

As a final example, let us consider the set of data presented in Table 32, again for an
unbalanced factorial design. The data in this case was randomly obtained using the
following model:

(5.13)
where is a standard normal random variable.

Model (5.13) does not contain main effects, only interaction terms. However, the best model
obtained after analyzing the data in Table 32 is:

(5.14)

Table 32. Unbalanced factorial design: Random data (rounded to one decimal place)
obtained from model (5.13)
Factor A Factor B Response Y
2 2 4.4
1 2 7.7
2 1 5.7
2 2 2.7
2 1 3.6
2 3 10.4
2 1 4.7
2 3 7.8
1 3 8.7
1 2 5.6
1 1 7.2
1 2 5.6
2 2 2.3

Clearly, non-interacting terms emerge in this model as a result of confounding.

Eq. (5.14) can alternatively be expressed (considering only interacting effects) as:

(5.15)

Of course, the original model (5.13) is not accurately obtained due to random errors during
sampling. However, all the coefficients have errors of less than one unit.

The SS Type I ANOVA table for this data is presented in Table 33, whereas Table 34 shows the
ANOVA table with the correction proposed in this report.

Table 33. ANOVA table obtained using Type I SS for the unbalanced data presented in Table 32.
| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Squares | F | Fcr (5%) | P-value |
|---|---|---|---|---|---|---|
| Factor A | 9.531 | 1 | 9.531 | 6.0579 | 5.5914 | 0.04338 |
| Factor B | 42.487 | 2 | 21.244 | 13.5023 | 4.7374 | 0.00396 |
| Interaction | 5.991 | 2 | 2.996 | 1.9040 | 4.7374 | 0.21864 |
| Error | 11.013 | 7 | 1.573 | | | |
| Total Sum | 69.023 | 12 | | | | |
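
Three entries of Table 33 depend only on group means and can be verified from the data of Table 32 without fitting a model: the total SS (deviations from the grand mean), the error SS of the full model with interaction (deviations from the cell means), and the SS of Factor A when it is the first term entered (total SS minus the SS within the levels of A). The sketch below is a cross-check with illustrative names, not the calculation procedure of the report.

```python
from collections import defaultdict

# (Factor A, Factor B, Response Y) rows of Table 32
rows = [
    (2, 2, 4.4), (1, 2, 7.7), (2, 1, 5.7), (2, 2, 2.7), (2, 1, 3.6),
    (2, 3, 10.4), (2, 1, 4.7), (2, 3, 7.8), (1, 3, 8.7), (1, 2, 5.6),
    (1, 1, 7.2), (1, 2, 5.6), (2, 2, 2.3),
]


def within_ss(groups):
    """Sum of squared deviations of each value from its own group mean."""
    total = 0.0
    for values in groups.values():
        mean = sum(values) / len(values)
        total += sum((v - mean) ** 2 for v in values)
    return total


everything = {"all": [y for _, _, y in rows]}
by_a = defaultdict(list)
by_cell = defaultdict(list)
for a, b, y in rows:
    by_a[a].append(y)
    by_cell[(a, b)].append(y)

ss_total = within_ss(everything)          # deviations from the grand mean
ss_error = within_ss(by_cell)             # deviations from the cell means
ss_factor_a = ss_total - within_ss(by_a)  # SS of A when entered first
df_error = len(rows) - len(by_cell)       # observations minus occupied cells

print(ss_total, ss_factor_a, ss_error, df_error)
```

The remaining Type I entries (Factor B adjusted for A, and the interaction) require fitting the intermediate additive model and cannot be obtained from group means alone when the data are unbalanced.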

Table 34. ANOVA table obtained using the correction proposed in this report for the
unbalanced data presented in Table 32.
| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Squares | F | Fcr (5%) | P-value |
|---|---|---|---|---|---|---|
| Factor A | 5.236 | 1 | 5.236 | 3.3282 | 5.5460 | 0.11009 |
| Factor B | 20.903 | 2 | 10.452 | 6.6430 | 4.6727 | 0.02322 |
| Interaction | 31.870 | 2 | 15.935 | 10.1282 | 4.6034 | 0.00774 |
| Error | 11.013 | 7 | 1.573 | | | |
| Total Sum | 69.023 | 12 | | | | |

According to the Type I SS ANOVA (Table 33), the interaction is not significant but it cannot be
removed from the model (since the standard error would increase). On the other hand, the
corrected ANOVA (Table 34) identifies the interaction as significant and as the most important
effect in the model, and while Factor A is found non-significant, Factor B is identified as
significant.

The presence of confounding between main effects and interactions also seems to support the
opinion that testing main effects in the presence of interactions is an “uninteresting
hypothesis” [17], or an “exercise in fatuity” [18]. For that reason, since the proposed analysis
provides a middle point between main effects and interactions, it can be used to reduce the
risk of reaching an erroneous conclusion, if such an analysis needs to be done.

6. Conclusion

Data unbalance introduces unwanted correlations between the different effects in a model.
This lack of orthogonality compromises the validity of conventional ANOVA calculations. Also,
in the case of multi-factor analyses, if the law of total variance is preserved during ANOVA, the
results become dependent on the particular sequence selected for the terms in the model
(type I SS). Thus, a novel general proposal for calculating ANOVA considering either balanced
or unbalanced data is presented in this report. In the case of balanced data, the calculations
coincide with the conventional ANOVA method. For unbalanced data, two important
differences emerge:

• The degrees of freedom of each factor are “corrected” using the idea of equivalent
degrees of freedom (Eq. 2.30), which takes into account the effect of unbalance from
the relative number of experiments at each level of a factor. These equivalent degrees
of freedom are then used to test the ANOVA hypotheses (Eq. 2.18).

• The sums of squares of each factor (or interaction) are “corrected” considering their
relative effect on the sum of squares due to the model (Eq. 4.7). This correction makes
the sequence of terms in the model irrelevant, while preserving the law of total
variance.

When interactions are considered, confounding between main effects and interaction terms is
inevitable. While analyzing the significance of factors in the presence of interactions is quite
uncertain due to confounding, the proposed approach provides a balanced (intermediate)
estimation of all effects, reducing the risk of reaching a wrong conclusion. In any case, the
conclusions obtained under those conditions must always be considered provisional and
uncertain.

Even though ANOVA analyses might be replaced by linear regression analyses [19], variance
decomposition remains a widely used tool in science and engineering. Furthermore,
understanding the basic principles of ANOVA can help us improve the validity and capability of
ANOVA, including the use of unbalanced data.

Acknowledgments

This research did not receive any specific grant from funding agencies in the public,
commercial, or not-for-profit sectors.

References

[1] Hinkelmann, K., & Kempthorne, O. (2008). Design and Analysis of Experiments. Volume 1:
Introduction to Experimental Design. 2nd Edition. John Wiley & Sons, Inc., New Jersey.

[2] Ståhle, L., & Wold, S. (1989). Analysis of Variance (ANOVA). Chemometrics and Intelligent
Laboratory Systems, 6, 259-272.

[3] Mahmoud, H. F. F. (2020). Bayesian and Frequentist Approaches Robustness in Variance
Components Estimation. Journal of Statistics Applications & Probability, 9(3), 435-449.

[4] Blitzstein, J. K., & Hwang, J. (2019). Introduction to probability. 2nd Edition. CRC Press/Taylor
& Francis Group, Boca Raton. Chapter 9.

[5] Bowsher, C. G., & Swain, P. S. (2012). Identifying sources of variation and the flow of
information in biochemical networks. Proceedings of the National Academy of Sciences,
109(20), E1320-E1328.

[6] Montgomery, D. C. (2017). Design and Analysis of Experiments. 9th Edition. John Wiley &
Sons, Inc., Hoboken. Section 15.2.

[7] Simon, M. K. (2002). Probability Distributions Involving Gaussian Random Variables: A
Handbook for Engineers and Scientists. Springer Science+Business Media, New York.

[8] Yates, F. (1934). The analysis of multiple classifications with unequal numbers in the
different classes. Journal of the American Statistical Association, 29(185), 51-66.

[9] Herr, D. G. (1986). On the history of ANOVA in unbalanced, factorial designs: The first 30
years. The American Statistician, 40(4), 265-270.

[10] Littell, R. C., Stroup, W. W., & Freund, R. J. (2002). SAS ® for Linear Models. 4th Edition. SAS
Institute Inc., Cary. pp. 14-17.

[11] Langsrud, Ø. (2003). ANOVA for unbalanced data: Use Type II instead of Type III sums of
squares. Statistics and Computing, 13(2), 163-167.

[12] Smith, C. E., & Cribbie, R. (2014). Factorial ANOVA with unbalanced data: a fresh look at the
types of sums of squares. Journal of Data Science, 12, 385-404.

[13] Shaw, R. G., & Mitchell-Olds, T. (1993). ANOVA for unbalanced data: an overview. Ecology,
74(6), 1638-1645.

[14] Hernandez, H. (2018). Multidimensional Randomness, Standard Random Variables and
Variance Algebra. ForsChem Research Reports, 3, 2018-02. doi: 10.13140/RG.2.2.11902.48966.

[15] Sober, E. (1981). The principle of Parsimony. The British Journal for the Philosophy of
Science, 32(2), 145-156.

[16] Lewsey, J. D., Gardiner, W. P., & Gettinby, G. (2001). A study of type II and type III power for
testing hypotheses from unbalanced factorial designs. Communications in Statistics-Simulation
and Computation, 30(3), 597-609.

[17] Nelder, J. A. (1994). The statistics of linear models: back to basics. Statistics and
Computing, 4(4), 221-234.

[18] Kempthorne, O. (1975). Fixed and mixed models in the analysis of variance. Biometrics,
473-486.

[19] Gelman, A. (2005). Analysis of variance—why it is more important than ever. The Annals of
Statistics, 33(1), 1-53.
