Classical and Robust Regression Analysis With Compositional Data
https://doi.org/10.1007/s11004-020-09895-w
Received: 15 November 2019 / Accepted: 14 September 2020 / Published online: 6 October 2020
© The Author(s) 2020
1 Introduction
generalising the logit, and he defined several of them, most notably the centered
logratio (clr) transformation (Aitchison 1986)
$$\operatorname{clr}(\mathbf{x}) = \ln \frac{\mathbf{x}}{\left(\prod_{i=1}^{D} x_i\right)^{1/D}}, \qquad \text{inverse: } \operatorname{clr}^{-1}(\mathbf{x}^*) = \mathcal{C}\left[\exp(\mathbf{x}^*)\right],$$
with logs and exponentials applied component-wise. The sample space of a random
composition, the simplex S^D, has since the end of the nineties been recognized to have a vector space structure, induced by the operations of perturbation x ⊕ y = C[x_1 y_1, x_2 y_2, ..., x_D y_D] (Aitchison 1986) and powering λ ⊙ x = C[x_1^λ, x_2^λ, ..., x_D^λ] (Aitchison 1997). The neutral element with respect to this structure is proportional to a vector of D ones, n = C[1_D]. This structure can be extended (Pawlowsky-Glahn and Egozcue 2001b) to a Euclidean space structure (Aitchison et al. 2002) by the scalar product
$$\langle \mathbf{x}, \mathbf{y} \rangle_A = \frac{1}{2D} \sum_{i=1}^{D} \sum_{j=1}^{D} \ln\frac{x_i}{x_j} \ln\frac{y_i}{y_j} = \langle \operatorname{clr}(\mathbf{x}), \operatorname{clr}(\mathbf{y}) \rangle, \qquad (1)$$

$$d_A^2(\mathbf{x}, \mathbf{y}) = \frac{1}{2D} \sum_{i=1}^{D} \sum_{j=1}^{D} \left( \ln\frac{x_i}{x_j} - \ln\frac{y_i}{y_j} \right)^2, \qquad (2)$$
both fully compliant with the concept of relative importance conveyed in the modern
definition of compositions.
To take this relative character into account in an easy fashion when statistically
analyzing compositional data sets, the principle of working on coordinates is recom-
mended (Pawlowsky-Glahn 2003). This states that the statistical analysis should be
applied to the coordinates of the compositional observations, preferably in an orthonor-
mal basis of the Euclidean structure {S^D, ⊕, ⊙, ⟨·,·⟩_A}, and that results might be
expressed back as compositions by applying them to the basis used. A simple and
easy way to compute orthonormal coordinates is provided by the isometric log-ratio
(ilr) transformation (Egozcue et al. 2003)

$$\operatorname{ilr}(\mathbf{x}) = \mathbf{x}^* = \mathbf{V}^t \cdot \operatorname{clr}(\mathbf{x}), \qquad (3)$$

where V is a D × (D − 1) matrix satisfying

$$\mathbf{V}^t \cdot \mathbf{V} = \mathbf{I}_{D-1}, \qquad \mathbf{V} \cdot \mathbf{V}^t = \mathbf{I}_D - \frac{1}{D}\,\mathbf{1}_D \mathbf{1}_D^t. \qquad (4)$$
Each of these columns provides the clr coefficients of one of the compositional vectors forming the orthonormal basis used, so that the orthonormal basis on the simplex (with respect to the Aitchison geometry) is w_i = clr^{-1}(v_i). Conversely, given a vector of D − 1 ilr coordinates x*, the original composition is recovered as

$$\mathbf{x} = \bigoplus_{i=1}^{D-1} x_i^* \odot \mathbf{w}_i =: \operatorname{ilr}^{-1}(\mathbf{x}^*) = \mathcal{C}\left[\exp(\mathbf{V} \cdot \mathbf{x}^*)\right], \qquad (5)$$
and the clr coefficients can be recovered as well from the ilr coordinates with

$$\operatorname{clr}(\mathbf{x}) = \mathbf{V} \cdot \mathbf{x}^*. \qquad (6)$$
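For illustration, the following minimal R sketch implements Eqs. (3)–(6) for a toy composition; the helper names (clr_coef, contrast_V, ilr_coord, ilr_inv) are ad hoc choices of ours and not taken from any package, and the Helmert-type matrix below is just one of the many valid choices for V.

# Sketch of Eqs. (3)-(6): clr and ilr transformations built from a contrast matrix V
clr_coef <- function(x) log(x) - mean(log(x))             # clr(x)

contrast_V <- function(D) {                               # Helmert-type D x (D-1) matrix,
  V <- matrix(0, D, D - 1)                                # its columns fulfil Eq. (4)
  for (i in 1:(D - 1)) {
    V[1:i, i]   <-  1 / sqrt(i * (i + 1))
    V[i + 1, i] <- -i / sqrt(i * (i + 1))
  }
  V
}

ilr_coord <- function(x, V) as.vector(t(V) %*% clr_coef(x))                 # Eq. (3)
ilr_inv   <- function(xs, V) { z <- exp(V %*% xs); as.vector(z / sum(z)) }  # Eq. (5)

x  <- c(0.2, 0.3, 0.5)                                    # toy 3-part composition
V  <- contrast_V(3)
xs <- ilr_coord(x, V)
all.equal(ilr_inv(xs, V), x)                              # TRUE: x is recovered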
There are infinitely many ilr transformations, actually as many as matrices satisfying
the conditions of Eq. (4). Each represents a rotation of the coordinate system (see, e.g.,
Egozcue et al. 2003; Filzmoser et al. 2018). For the purposes of this contribution, it is
relevant to consider those matrices linked to particular subcompositions, i.e. subsets
of components. If one wants to split the parts in x into two groups, say the first s
against the last r = D − s, there is a vector that identifies the balance between the
two groups,
$$\mathbf{v} = \frac{1}{\sqrt{r s^2 + s r^2}} \Big[ \underbrace{r, r, \ldots, r}_{s},\ \underbrace{-s, -s, \ldots, -s}_{r} \Big]. \qquad (7)$$
Given that one can always split the resulting two subcompositions again into two
groups, this sort of structure can be reproduced D − 1 times, generating D − 1 vectors
of this kind with only three values every time (a positive value for one group, a
negative value for the other group and zero for those parts not involved in that particular
splitting). This is called a sequential binary partition, and the interested reader may
find the details of the procedure in Egozcue and Pawlowsky-Glahn (2005, 2011).
A particular case occurs when the splitting of interest places one individual variable against all others (a sort of “one-component subcomposition”). Considering the first component as the one to be isolated, the balancing vector is obtained from Eq. (7) taking r = 1 and s = D − 1. Balances isolating any other part can be obtained by permuting the resulting components. This sort of balance, which can be identified with so-called pivot coordinates (Filzmoser et al. 2018), is thus also useful to check whether one single part can be eliminated from the regression model.
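A short R sketch of Eq. (7) follows, as a continuation of the helper code above (the function names are ours): it builds the balancing vector between the first s and the last r = D − s parts and checks that it is a unit-norm clr direction.

# Balancing vector of Eq. (7): first s parts (weight r) vs last r = D - s parts (weight -s)
balance_vector <- function(D, s) {
  r <- D - s
  c(rep(r, s), rep(-s, r)) / sqrt(r * s^2 + s * r^2)
}

v <- balance_vector(D = 5, s = 2)
c(sum(v), sum(v^2))              # 0 and 1: zero-sum and unit norm, as required

# Pivot-type balance isolating a single part, following the text (r = 1, s = D - 1)
balance_vector(D = 4, s = 3)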
Three kinds of linear models involving compositions have been defined (van den
Boogaart and Tolosana-Delgado 2013; Pawlowsky-Glahn et al. 2015; Filzmoser et al.
2018): models with compositional response (Aitchison 1986; Daunis-i-Estadella et al.
2002), models with compositional explanatory variable (Aitchison 1986; Tolosana-
Delgado and van den Boogaart 2011), and models with compositions as both
explanatory variable and response (Egozcue et al. 2013). The next sections systemati-
cally build each regression model solely by means of geometric operations, and show
how these models are then represented in an arbitrary isometric logratio transforma-
tion.
$$\hat{\mathbf{Y}} = \bigoplus_{i=0}^{P} X_i \odot \mathbf{b}_i, \qquad \mathbf{Y} \sim \mathcal{N}_{\mathcal{S}^D}(\hat{\mathbf{Y}}, \boldsymbol{\Sigma}_\varepsilon), \qquad (8)$$
where N_{S^D}(Ŷ, Σ_ε) stands for the normal distribution on the simplex of Y (Mateu-
Figueras and Pawlowsky-Glahn 2008), parametrized in terms of a compositional mean
vector and a covariance matrix of the random composition in some ilr representation.
This reflects the fact that the normal distribution on the simplex of a random com-
position corresponds to the (usual) normal distribution of its ilr representation. This
regression model is useful for explanatory variables of type quantitative (regression),
categorical (ANOVA) or a combination of both (ANCOVA). Note that one can establish
this regression model for compositional data in a least-square sense (Mood et al. 1974,
Chapter X), free of the normality assumption, by using the Aitchison distance [Eq.
(2)] as Daunis-i-Estadella et al. (2002) proposed. However, the normality assumption
is needed in the context of hypothesis testing, which is one of the main contributions
of this paper. Specifically, it serves for deriving the distribution of the test statistics in
the classical (least squares) regression case and serves also as the reference model for
robust regression.
If a logratio transformation is applied to this model, this yields a conventional
multiple, multivariate linear regression model on coordinates
$$\hat{\mathbf{Y}}^* = \sum_{i=0}^{P} X_i \cdot \mathbf{b}_i^*, \qquad \mathbf{Y}^* \sim \mathcal{N}^{D-1}(\hat{\mathbf{Y}}^*, \boldsymbol{\Sigma}_\varepsilon). \qquad (9)$$
The model parameters are thus the slopes b*_0, b*_1, ..., b*_P, and the residual covariance matrix Σ_ε. Note that it is common to take X_0 ≡ 1, so that b*_0 represents the intercept of the model in the logratio coordinate system chosen. The specification in Eq. (9) has the advantage of being tractable with conventional software and solving
methods. Once estimates of the vector coefficients are available, they can be back-
transformed to compositional coefficients, e.g. b̂i = ilr −1 (b̂i∗ ) if calculations are done
in ilr coordinates. Alternatively, ilr coordinates can also be converted to clr coefficients
with b̂iclr = V · b̂i∗ .
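As a minimal sketch (not the authors' code), a Type 1 model can be estimated with the built-in lm() function; it assumes a hypothetical n × D matrix Y of compositional responses and a real covariable x, and reuses the helpers contrast_V, ilr_coord and ilr_inv defined after Eq. (6).

# Type 1 sketch: compositional response Y (n x D), one real explanatory variable x
V  <- contrast_V(ncol(Y))                        # any matrix fulfilling Eq. (4)
Ys <- t(apply(Y, 1, ilr_coord, V = V))           # n x (D-1) matrix of ilr coordinates

fit  <- lm(Ys ~ x)                               # multivariate LS regression, Eq. (9)
Bhat <- coef(fit)                                # rows: "(Intercept)" and "x", in ilr
V %*% Bhat["x", ]                                # clr representation of the slope
ilr_inv(Bhat["x", ], V)                          # compositional slope, via Eq. (5)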
It is important to emphasise that the predictions provided by this regression model
are unbiased both in terms of any logratio representation [Eq. (9)], and in terms of
the original composition [Eq. (8)] with respect to the Aitchison geometry discussed in
Sect. 2.1. This follows directly from the isometry of the ilr or clr mappings (Egozcue
et al. 2012; Pawlowsky-Glahn et al. 2015; Fišerová et al. 2016). If interest lies in
understanding the unbiasedness properties of predictions (8) with respect to the con-
ventional Euclidean geometry of the real multivariate space R D , i.e. on the nature
of the expected value of Ŷ − Y, then one can resort to numerical integration of the
model explicated by Eq. (8), which provides the conditional distribution of Y given
Ŷ (Aitchison 1986).
A model with a compositional explanatory variable X and one explained real variable
Y is (both in composition and coordinates)

$$\hat{Y} = b_0 + \langle \mathbf{b}, \mathbf{X} \rangle_A = b_0 + (\mathbf{b}^*)^t \cdot \mathbf{X}^*, \qquad Y \sim \mathcal{N}(\hat{Y}, \sigma_\varepsilon^2).$$
The model parameters are thus the intercept b0 , the gradient b∗ and the residual vari-
ance σε2 , which again can be estimated with any conventional statistical toolbox. The
gradient, once estimated in coordinates, can be back-transformed to a compositional
gradient as b̂ = ilr^{-1}(b̂*), or to its clr representation by b̂^clr = V · b̂*. Note that
solving a Type 2 model directly in clr would require the use of generalised inversion
of the covariance matrix of clr(X), which provides the same results but at a higher
computational cost.
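Analogously, a hedged sketch of a Type 2 fit with lm(), assuming a hypothetical n × D matrix X of explanatory compositions and a real response y:

# Type 2 sketch: real response y, compositional explanatory variable X (n x D)
V  <- contrast_V(ncol(X))
Xs <- t(apply(X, 1, ilr_coord, V = V))           # ilr coordinates of the composition

fit2   <- lm(y ~ Xs)                             # intercept b0 and gradient b* in ilr
b_star <- coef(fit2)[-1]                         # drop the intercept
V %*% b_star                                     # clr representation of the gradient
ilr_inv(b_star, V)                               # compositional gradient b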
along which each explanatory ilr coordinate can modify the response,
$$\hat{\mathbf{Y}} = \mathbf{b}_0 \oplus \bigoplus_{i=1}^{D_x - 1} X_i^* \odot \mathbf{b}_{\cdot i}.$$
Third, one can interpret B∗ as the matrix representation (on the chosen bases of the
two simplexes) of a linear mapping B: S^{D_x} → S^{D_y}, which is nothing else than the combination of a rotation on S^{D_x} and a rotation on S^{D_y}, together with a univariate
linear regression of each of the pairs of rotated axes:
$$\mathbf{B}^* = \sum_{i=1}^{R} \mathbf{u}_i \cdot d_i \cdot \mathbf{v}_i^t, \qquad R \le \min(D_x, D_y). \qquad (11)$$
Here, the matrix U = [ui ] of left vectors is the rotation on the image simplex S D y , and
that of the right vectors V = [vi ] the rotation on the origin simplex S Dx . The coefficient
di is then the slope of the regression between the pair of rotated directions. Note that
this representation coincides with a singular value decomposition of the matrix B∗ ,
and is reminiscent of methods such as canonical correlation analysis or redundancy
analysis (Graffelman and van Eeuwijk 2005). To recover clr representations of the matrix of coefficients, or of these singular vectors, one just needs to apply the respective basis matrices V_x and V_y, in analogy to Eq. (6).
These expressions apply to the model coefficients (B∗ ) and to their estimates (B̂∗ )
given later on in Sects. 3 and 4.
Note that the same issues about the unbiasedness of predictions raised in Sect. 2.2.1
apply to predictions obtained with Eq. (10).
One of the most common tasks of regression is the validation of a particular model
against data, in particular models of (linear) independence, partial or complete. In
a non-compositional framework, independence is identified with a slope or gradient
matrix/vector identically null (complete independence), or just with some null coef-
ficients (partial independence). Complete independence for compositional models is
also identified with a null slope, null gradient vectors, or null matrices of the model
established for coordinates (each slope bi∗ , the gradient b∗ resp. the matrix B∗ ). But
partially nullifying one single coefficient of these vectors or matrices just forces inde-
pendence of the covariable(s) with a certain logratio, not with the components this
logratio involves. The necessary concept in this context is thus rather one of subcompositional independence, i.e. that a whole subset of components has no influence on, resp. is not influenced by, a covariable. One must further distinguish two cases, namely internal and external subcompositional independence.
Consider the first s components of the composition as independent of a given
covariable. One can then construct a basis of S D with three blocks: an arbitrary basis
of s − 1 vectors comparing the first s components (independent subcomposition), the
balancing vector between the two subcompositions (Eq. 7), and an arbitrary basis of
r − 1 vectors comparing the last r = D − s components (dependent subcomposition).
In a Type 1 regression model (compositional response), internal independence of a
certain subcomposition with respect to the ith explanatory covariable X i means that
this covariable is unable to modify the relations between the components of the independent subcomposition, i.e. b*_{1i} = b*_{2i} = ⋯ = b*_{(s−1)i} = 0. External independence further assumes that the balance coordinate is independent of the covariable, b*_{si} = 0.
intercept, if that is included in the model). The regression parameters are denoted by
the vector b∗ = [b0∗ , b1∗ , . . . , b∗D−1 ]t , and the scale of the residuals is σε . The residuals
are denoted as ri (b∗ ) = yi − [b∗ ]t xi∗ , for i = 1, 2, . . . , n.
Considering the vector of all responses y = [y1 , . . . , yn ]t and the matrix of all
explanatory variables X* = [x_1*, ..., x_n*]^t (each row is an individual, the first column
is the constant 1, and each subsequent column an ilr coordinate), the least squares
estimators of the model parameters are

$$\hat{\mathbf{b}}^* = [(\mathbf{X}^*)^t \cdot \mathbf{X}^*]^{-1} \cdot (\mathbf{X}^*)^t \cdot \mathbf{y}$$

and

$$\hat{\sigma}_\varepsilon^2 = \frac{1}{n - D} \sum_{i=1}^{n} r_i^2(\hat{\mathbf{b}}^*).$$

Analogously, for the models with a compositional response, the least squares estimators are

$$\hat{\mathbf{B}}^* = [(\mathbf{X}^*)^t \cdot \mathbf{X}^*]^{-1} \cdot (\mathbf{X}^*)^t \cdot \mathbf{Y}^*$$

and

$$\hat{\boldsymbol{\Sigma}}_\varepsilon = \frac{1}{n - P} \sum_{i=1}^{n} \mathbf{r}_i(\hat{\mathbf{B}}^*)^t \cdot \mathbf{r}_i(\hat{\mathbf{B}}^*),$$

with the covariance matrix of the vectorized estimated regression parameters given by

$$\hat{\boldsymbol{\Sigma}}_{\hat{\mathbf{b}}} = \hat{\boldsymbol{\Sigma}}_\varepsilon \otimes [(\mathbf{X}^*)^t \cdot \mathbf{X}^*]^{-1}.$$
The classical theory of linear regression modeling provides a wide range of tests on
regression parameters, both in univariate regression (Type 2) and multivariate regres-
sion models (Types 1 and 3) (Johnson and Wichern 2007). Among them, we are
particularly interested in those that are able to cope with subcompositional indepen-
dence (in its external and internal forms, respectively), as introduced in Sect. 2.3.
For the model of Type 2 and the internal subcompositional independence, the cor-
responding hypothesis on the regression parameters can be expressed as Ab∗ = 0
with A = (0, I), where I is the (s − 1) × (s − 1) identity matrix and 0 stands for an
(s − 1) × (D − s + 1) matrix with all its elements zero. In the alternative hypothesis the
former equality does not hold. Note that for the case of external subcompositional inde-
pendence, the sizes of the matrices I and 0 would just change to s × s and s × (D − s),
respectively. In the following, only the internal subcompositional independence will be considered; the case of external independence can be derived analogously.
Under the model assumptions including normality on the simplex and the above null
hypothesis, the test statistic
$$T = \frac{(S_R - S)/(s - 1)}{S/(n - D)}, \qquad (13)$$

where

$$S = \sum_{i=1}^{n} r_i^2(\hat{\mathbf{b}}^*), \qquad S_R = \sum_{i=1}^{n} r_i^2(\hat{\mathbf{b}}_R^*),$$
follows an F distribution with s−1 and n− D degrees of freedom. Here, b∗R denotes the
LS estimates under the null hypothesis (i.e. just the submodel is taken for the estimation
of regression parameters). The hypothesis on internal subcompositional independence
is rejected if t ≥ Fs−1,n−D (1 − α), the (1 − α)-quantile of that distribution. This test
statistic coincides with the likelihood ratio test on the same hypothesis, that can be
easily generalized for the case of multivariate regression. The statistic can be written
also in the form
$$T = \frac{S_R^* - S^*}{s - 1}$$

for

$$S^* = \sum_{i=1}^{n} \frac{r_i^2(\hat{\mathbf{b}}^*)}{\hat{\sigma}_\varepsilon^2}, \qquad S_R^* = \sum_{i=1}^{n} \frac{r_i^2(\hat{\mathbf{b}}_R^*)}{\hat{\sigma}_\varepsilon^2}. \qquad (14)$$
Finally, note that one frequently uses the fact that the distribution of (s − 1)T converges in law to a χ² distribution with s − 1 degrees of freedom as n → ∞.
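In practice, the test of Eq. (13) amounts to a comparison of nested least squares fits; a sketch with base R's anova(), reusing the hypothetical objects y and Xs from the Type 2 sketch above and assuming that the first s − 1 columns of Xs are the coordinates spanning the candidate independent subcomposition:

# F test of internal subcompositional independence in a Type 2 model, cf. Eq. (13)
s <- 3                                           # size of the candidate subcomposition
full    <- lm(y ~ Xs)                            # all D - 1 ilr coordinates
reduced <- lm(y ~ Xs[, -(1:(s - 1))])            # submodel without the first s - 1 coordinates
anova(reduced, full)                             # F statistic with s - 1 and n - D df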
Similarly, it might be of interest if some of the (non-compositional) explana-
tory variables do not influence the compositional response (Type 1); in case of a
where Σ̂_{b_R} denotes the estimated covariance matrix of the estimated matrix of regres-
sion parameters in the submodel, formed under the null hypothesis. For n → ∞ the
statistic TM converges to a χ 2 distribution with (D − 1)(s − 1) degrees of freedom
(Johnson and Wichern 2007).
4 Robust MM Estimation
Many proposals for robust regression are available in the literature (see Maronna et al.
2006). The choice of an appropriate estimator depends on different criteria. First of
all, the estimator should have desired robustness properties, i.e. robustness against
a high level of contamination, and at the same time high statistical efficiency. MM
estimators for regression possess the maximum breakdown point of 50% (i.e. at least
50% of contaminated samples are necessary in order to make the estimator useless),
and they have a tunable efficiency. Although other regression estimators also achieve
a high breakdown point, like the LTS regression estimator, their efficiency can be
quite low (Maronna et al. 2006). Another criterion for the choice is the availability of
an appropriate implementation in software packages. MM estimators for regression
are available in the software environment R (R Development Core Team 2019). For
univariate response (Type 2 regression) we refer to the function lmrob of the R
package robustbase (Maechler et al. 2018), for multivariate response (Types 1
and 3) there is an implementation in the package FRB, which also provides inference
statistics using the fast robust bootstrap (Van Aelst and Willems 2013).
For a univariate response, the MM estimator of the regression coefficients is defined as

$$\hat{\mathbf{b}}^* = \operatorname*{argmin}_{\mathbf{b}} \sum_{i=1}^{n} \rho\!\left( \frac{r_i(\mathbf{b})}{\hat{\sigma}_\varepsilon} \right), \qquad (16)$$
where σ̂ε is a robust scale estimator of the residuals (Maronna et al. 2006). The function
ρ(·) should be bounded in order to achieve good robustness properties of the estimator
(for details, see Maronna et al. 2006). An example is the bisquare family, with
$$\rho(r) = \begin{cases} 3\left(\dfrac{r}{k}\right)^2 - 3\left(\dfrac{r}{k}\right)^4 + \left(\dfrac{r}{k}\right)^6 & \text{for } |r| \le k, \\ 1 & \text{else.} \end{cases} \qquad (17)$$
The constant k is a tuning parameter which gives a tradeoff between robustness and
efficiency. When k gets bigger, the resulting estimate tends to LS, thus being more
efficient but less robust. A choice of k = 0.9 leads to a good compromise with a given
efficiency.
The crucial point is to robustly estimate the residual scale which is needed for
the minimization problem (Eq. 16). This can be done with an M-estimator of scale,
defined as the solution of the implicit equation
$$\frac{1}{n} \sum_{i=1}^{n} \rho_1\!\left( \frac{r_i(\mathbf{b})}{\hat{\sigma}_\varepsilon} \right) = d. \qquad (18)$$
Regression S-estimators are highly robust but inefficient. However, one can compute
an S-estimator b̂*^(0) as a first approach to b̂*, and then compute σ̂_ε as an M-estimator of scale using the residuals from b̂*^(0) (see Maronna et al. 2006). Yohai (1987) has shown that the resulting MM estimator b̂* inherits the breakdown point of b̂*^(0), but its efficiency under the normal distribution is determined by tuning constants. The default
implementation of the R function lmrob attains a breakdown point of 50% and an
asymptotic efficiency of 95% (Maechler et al. 2018).
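A minimal sketch of the corresponding call for a Type 2 model (again with the hypothetical y and Xs from above); the defaults of lmrob already correspond to the 50% breakdown point and 95% efficiency mentioned in the text:

# Robust MM regression of a real response on ilr coordinates (Type 2)
library(robustbase)
fit_mm <- lmrob(y ~ Xs)
summary(fit_mm)                          # robust coefficients, standard errors, p values
weights(fit_mm, type = "robustness")     # small weights flag potential outliers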
Robust hypothesis tests in linear regression are not straightforward, because they have
to involve robust residuals, and some tests also rely on a robust estimation of the
covariance matrix of the regression coefficients. In the following we will focus on
tests which can cope with subcompositional independence.
For the univariate case (Type 2) a robust equivalent to the test mentioned in Sect. 3.3
is available. It is a likelihood ratio-type test which, unlike a Wald-type test, does not
require the estimation of the covariance matrix of b̂*. The hypothesis to be tested is the same as stated in Sect. 3.3, namely Ab* = 0, with A = (0, I) and I an identity matrix of order s − 1; under the alternative hypothesis Ab* ≠ 0. In analogy to the terms
in (14), the test is based on
$$S^* = \sum_{i=1}^{n} \rho\!\left( \frac{r_i(\hat{\mathbf{b}}^*)}{\hat{\sigma}_\varepsilon} \right), \qquad S_R^* = \sum_{i=1}^{n} \rho\!\left( \frac{r_i(\hat{\mathbf{b}}_R^*)}{\hat{\sigma}_\varepsilon} \right), \qquad (21)$$
where ρ(·) is a bounded function and σ̂ε is a robust scale estimator of the residuals,
see also Eq. (16). With the choice
$$\xi = \frac{2 \sum_{i=1}^{n} \psi'\!\big(r_i(\hat{\mathbf{b}}^*)/\hat{\sigma}_\varepsilon\big)}{\sum_{i=1}^{n} \psi^2\!\big(r_i(\hat{\mathbf{b}}_R^*)/\hat{\sigma}_\varepsilon\big)},$$

with ψ = ρ', the test statistic

$$T = \xi \, (S_R^* - S^*) \qquad (22)$$

follows asymptotically a χ² distribution with s − 1 degrees of freedom (Hampel et al. 1986). The null hypothesis is rejected at the significance level α if the value of the test statistic t > χ²_{s−1}(1 − α).
For regression Type 1 and 3 we can use the robust equivalent of the likelihood ratio
test mentioned in Sect. 3.3. According to Eq. (15), the covariance matrix of the esti-
mated matrix of regression parameters is needed. This can be obtained by bootstrap
as follows. In their R package FRB, Van Aelst and Willems (2013) provide function-
ality for inference statistics in multivariate MM regression by using the idea of fast
and robust bootstrap (Salibian-Barrera and Zamar 2002). A usual bootstrap procedure
would not be appropriate for robust estimators, since it could happen that a bootstrap
data set contains more outliers than the original one due to an over-representation of
outlying observations, thus causing breakdown of the estimator. Moreover, recalculat-
ing the robust estimates for every sample would be very time consuming. The idea of
fast and robust bootstrap (FRB) is to estimate the parameters only for the original data.
Let θ̂ contain all estimates B̂ and Σ̂_ε in vectorized form, and denote by Ω_Θ the set of possible values of this vectorized model parameter. MM-estimators can be written in the form of a system of fixed-point equations, i.e. there exists a function g: Ω_Θ → Ω_Θ such that θ̂ = g(θ̂). Indeed, if the function g is known, one can estimate θ as the fixed point of the equation. The function g depends on the sample, hence for a bootstrap sample we obtain a different function g_b. The idea is thus to use the original estimate and the fixed-point equation for the bootstrap sample, obtaining θ̂_b^1 := g_b(θ̂). This results in an approximation of the bootstrap estimates θ̂_b which would be obtained directly from the bootstrap sample, i.e. by solving θ̂_b = g_b(θ̂_b). Applying a Taylor expansion, an improved estimate θ̂_b^I can be derived, estimating the same limiting distribution as θ̂_b, and being consistent for θ. For more details concerning fast and robust bootstrap for the MM-estimator of regression see Salibian-Barrera et al. (2008).
The GEMAS (“Geochemical Mapping of Agricultural and grazing land Soil”) soil
survey geochemical campaign was conducted at European level, coordinated by
EuroGeoSurveys, the association of European Geological Surveys (Reimann et al.
2014a, b). It covered 33 countries, and it focused on those land uses that are vital for
food production. The area was sampled at a density of around 1 site per 2,500 km2 .
Samples were taken from agricultural soils (0 to 20 cm) and grazing land soils (0 to
10 cm). At each site, five samples were collected at the corners and in the center of a 10 m by 10 m square, and the composite sample was analyzed. Around 60 chemical elements were determined in samples of both kinds of soil. Soil textural composition
was also analyzed, i.e. the weight % of sand, silt and clay. Some parameters describ-
ing the climate (climate type, mean temperature or average annual precipitation) and
the background geology (rock type) are also available. More specifically, the average
annual precipitation and the annual mean temperature at the sample locations are taken
from Reimann et al. (2014a) and originate from www.worldclim.org. The subdivision
of the GEMAS project area into climate zones goes back to Baritz et al. (2005).
From the several variables available, we focus on the effects between the soil com-
position (either its chemistry or its sand-silt-clay texture) and the covariables: annual
average precipitation, soil pH (both as continuous variables) and climate zones [as cat-
egorical variable, with the respective sample sizes; the categories are Mediterranean
(Medi, 438), Temperate (Temp, 1,102), Boreal–Temperate (BoTe, 352) and Suprabo-
real (Spbo, 203) ]. Figure 1 shows a set of descriptive diagrams of these variables
and compositions. A total of n = 2095 samples of the GEMAS data set were used,
covering almost all of Europe, except Romania, Moldova, Belarus, Eastern Ukraine and Russia (Fig. 2). From a comparison between panels A and B (Fig. 1), one can conclude that the logarithm of Annual Precipitation is required for further treatment. Though symmetry or normality is not attained even with a logarithm [both p values of the Anderson–Darling test for normality (Anderson and Darling 1952) were zero], at least a view by the four climatic groups suggests that departures from symmetry are moderate to mild (Fig. 1c) and are not going to negatively affect further regression results. As indicated above, the data present a rather unbalanced design with respect to climatic areas (Fig. 1d), particularly due to the dominance of temperate climate, which accounts for more than 50% of the samples, see also Fig. 2.

Fig. 1 Descriptive diagrams of the sets of variables used. Displayed are histograms of Annual Precipitation in the original (a) and logarithmic (b) scales. Boxplots of log-scaled Annual Precipitation (c) and histogram of sample sizes (d) according to climate groups follow. The climate groups are used to color sand-silt-clay compositions in the ternary diagram (e). Finally, the multivariate data structure of chemical compositions is captured using the compositional biplot (f)
The sand-silt-clay textural composition is represented in Fig. 1e as a ternary diagram, with colors after the four climatic zones: these show a certain control on the amount of clay, and this will be explored later. With regard to the major oxide composition including SO3 and LOI (loss on ignition), this is represented in Fig. 1f as a centered logratio covariance biplot, as conventional in compositional data analysis (Aitchison 1997; Aitchison and Greenacre 2002). This shows a quite homogeneous data set without any strong grouping that could negatively affect the quality of the next regression steps.

Fig. 2 Sample location. Colors after climatic zones: red for Mediterranean, blue for Temperate, green for Boreal–Temperate and violet for Supraboreal
$$\mathbf{V}^t = \begin{pmatrix} \frac{-1}{\sqrt{6}} & \frac{-1}{\sqrt{6}} & \frac{+2}{\sqrt{6}} \\ \frac{-1}{\sqrt{2}} & \frac{+1}{\sqrt{2}} & 0 \end{pmatrix}, \qquad \begin{aligned} y_1^* &= \frac{1}{\sqrt{6}} \ln \frac{y_{\text{clay}}^2}{y_{\text{silt}} \cdot y_{\text{sand}}}, \\ y_2^* &= \frac{1}{\sqrt{2}} \ln \frac{y_{\text{silt}}}{y_{\text{sand}}}, \end{aligned} \qquad (23)$$

(the columns of V^t referring to sand, silt and clay, respectively),
Fig. 3 Grain size composition as a function of (log) annual precipitation; observed GEMAS data (dots) and
fitted models (red: classical; blue: robust). Symbol size (in the lower left panels) is inversely proportional
to the weights computed in robust regression
and a model of the form of Eq. (9) was fitted by the LS method. Table 1 shows the
logratio coefficients, as well as their values once back-transformed. Note that the back-
transformed values would be exactly the same, whatever other logratio transformation
would have been used for the calculations.
Table 1 reports the coefficients for the ilr coordinates defined in Eq. (23), the
corresponding p values, and the back-transformed coefficients, using LS and MM
estimators. The LS estimates show that the ratio silt-to-sand is not affected by annual
precipitation, while their relation to clay does depend on this covariable. In contrast,
for the MM estimators both coordinates are affected by annual precipitation. Figure 3
shows both the LS and MM models, re-expressed in each of the possible pairwise
logratios. Note that the slope and intercept given for the coordinate y2* in Table 1 correspond to panel (2,1) of Fig. 3. The intercepts and slopes for each of the other
panels can be obtained by transforming the coefficients (ysand , ysilt , yclay ) accordingly.
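For concreteness, a sketch of this fit in R, assuming a hypothetical data frame gemas with columns sand, silt, clay and prec (annual precipitation); the robust fit is done coordinate-wise here, which is a simplification of the multivariate MM estimation discussed in Sect. 4, and is not the exact code behind Table 1.

# ilr coordinates of Eq. (23) and LS/MM fits against log annual precipitation
y1 <- log(gemas$clay^2 / (gemas$silt * gemas$sand)) / sqrt(6)   # clay vs sand-silt
y2 <- log(gemas$silt / gemas$sand) / sqrt(2)                    # silt vs sand
lx <- log(gemas$prec)

fit_ls  <- lm(cbind(y1, y2) ~ lx)               # classical LS, both coordinates jointly
fit_mm1 <- robustbase::lmrob(y1 ~ lx)           # robust MM, coordinate-wise
fit_mm2 <- robustbase::lmrob(y2 ~ lx)

# back-transform the LS slope to a composition (cf. Table 1)
Vt <- rbind(c(-1, -1, 2) / sqrt(6),             # V^t of Eq. (23),
            c(-1,  1, 0) / sqrt(2))             # columns ordered as sand, silt, clay
slope <- exp(t(Vt) %*% coef(fit_ls)["lx", ])
as.vector(slope / sum(slope))                   # compositional slope (sand, silt, clay)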
Table 1 Regression models of grain size composition against (log) annual precipitation, using LS and MM
regression
The columns refer to the estimated parameters for the ilr coordinates, the corresponding p values, and the
back-transformed regression coefficients
Fig. 4 Scatter diagrams (left: ilr plane; right: ternary diagram) of the data and of the predictions of both
the LS (red) and MM (blue) models. The lines show the models extrapolated beyond the range of observed
annual precipitation. Symbol size is inversely proportional to the weights computed by robust regression
Figure 4 shows the model predictions for the classical (red) and the robust (blue)
model. The left plot presents the predictions in the ilr coordinates, as they are used
in the regression models, and the right plot shows the predictions for the original
composition. The symbol sizes are inversely proportional to the weights from robust
MM regression, and here it becomes obvious that, due to very small (rounded) values of clay, data artifacts are produced in the ilr coordinates, but these observations are
downweighted by MM regression. This is the main reason for the difference between
the LS and the MM model.
A regression of the grain size composition (response) against climate zones (explana-
tory variables) should take into account that the climate zones are ordered in a clear
sequence from Mediterranean (Medi), Temperate (Temp), Boreal–Temperate (BoTe)
to Supraboreal (Spbo), ordered from South to North. This is clearly seen in Fig. 5,
showing a relatively constant average sand/silt ratio across climatic zones, but a clear
Fig. 5 Logratios of the grain size composition for the different climate zones
monotonic trend of average sand/clay and silt/clay ratios northwards. Such a trend
is followed also by the compositional centers (Pawlowsky-Glahn et al. 2015) for the
respective climate categories, see Table 2.
Thus the following hypotheses of uncorrelation appear sensible:
1. the balance of sand to silt is uncorrelated with climate (i.e. the sand-silt subcom-
position is internally uncorrelated with climate)
2. the balance of clay to the other two depends on climate only in so-called linear
terms, as explained in the next paragraph.
Given these hypotheses, the same ilr coordinates as in the preceding section (Eq. 23)
will be used here.
In R (package stats; R Development Core Team 2019), a regression model with an ordered factor of 4 levels requires building an accessory (n × 3) design or contrast matrix X, where each row is taken as the corresponding row of Table 3. The labels L (“Linear”), Q (“Quadratic”) and C (“Cubic”) stand for the kind of trend between the four categories fitting the data, L implying that the differences between two consecutive categories are constant (Simonoff 2003).
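A brief sketch of this construction in base R, reusing the hypothetical gemas data frame and the coordinates y1, y2 from the previous sketch (the column name climate is an assumption of ours):

# Ordered factor with polynomial contrasts: the L, Q and C columns of Table 3
climate <- factor(gemas$climate,
                  levels = c("Medi", "Temp", "BoTe", "Spbo"), ordered = TRUE)
contrasts(climate)                        # 4 x 3 matrix of linear, quadratic, cubic contrasts

fit_clim <- lm(cbind(y1, y2) ~ climate)   # Type 1 model with the ordered factor
summary(fit_clim)                         # per-coordinate tests of the L, Q and C effects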
Table 4 summarizes the numerical output of this regression model, including esti-
mated coefficients (intercept, and effects L, Q and C) for each of the two balances,
the p values of the hypotheses of null coefficient, and the back-transformed coeffi-
cients. These results are given for both classical (LS) and robust (MM) regression.
Classical LS regression shows that C and Q effects can be discarded for y2∗ but not
L effects, i.e. the first hypothesis (inner uncorrelation of the sand-silt subcomposition
with climate) must be rejected. With regard to the second hypothesis, nullifying the
Table 3 Row vectors to construct the design matrix associated with the categorical variable climate
L Q C
These numbers result from applying the R function contrasts on the ordered factor variable climate. L
stands for linear effect, Q for quadratic effect and C for cubic effect
Table 4 Fitted coefficients and p values of the regression models of grain size composition versus climate
coefficients for L and Q effects on y1* are significantly different from zero (p values smaller than the 0.05 critical level), which implies that the second hypothesis is false as well. Nevertheless, C effects can be discarded. A global test in the fashion of what was explained in Sect. 3.3 gives a zero p value for the hypothesis of absence of Q or C effects, thus supporting these conclusions. Robust regression delivers a similar picture, except that here all effects are significant for y2*.
Of course, other contrasts could be used for this analysis, depending on the nature of the hypotheses of dependence that we are interested in testing. If, for instance, one wanted to check whether soils from different climatic zones have on average the same soil texture, one could have used the contr.treatment function of R to force this sort of comparison.
One way or another, in a categorical regression model like this, the intercept can be interpreted as a sort of global average value compensating for the lack of balance between the four categories. While the conventional compositional center is [sand; silt; clay] = [52.39%; 35.27%; 12.34%], the least squares regression delivers an estimate of [54.83%; 35.48%; 9.69%] and the robust regression [51.54%; 37.60%; 10.86%], both downweighting the importance of clay. Note that this intercept does not depend on which contrast set is chosen for capturing the categorical variable.
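The compositional center quoted above is simply the closed vector of geometric means; a one-line R sketch, again with the hypothetical gemas data frame:

# Compositional center of the grain size data: closed geometric means
gm <- exp(colMeans(log(gemas[, c("sand", "silt", "clay")])))
100 * gm / sum(gm)        # in percent, to be compared with the back-transformed intercepts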
Fig. 6 Least squares (left) and MM (right) regression coefficients of the clr transformed major oxide
composition versus log-annual precipitation
Much richer hypotheses can be tested if the composition used has many parts. To illus-
trate this, a regression of the major oxides against (log) annual precipitation follows. A
natural initial question is whether the soil geochemistry is influenced by precipitation.
For this purpose, Fig. 6 shows the clr coefficients estimated with classical least squares
and robust regression: clr coefficients should never be interpreted individually; rather,
differences between them can be understood as the influence on a particular pairwise
logratio. Thus, we are seeking the smallest differences between coefficients, as
they identify pairs of variables whose balance is not influenced by the explanatory
variable. As a subset of pairwise logratios identifies a (sub)composition, this gives
information about subcompositions that might be potentially internally independent
of the covariable, such as:
– TiO2–Fe2O3–MnO
– Al2O3–LOI (with Na2O in least squares regression, or MgO in robust regression)
– SiO2–K2O
A set of ilr coordinates is selected accordingly to contain balances between these
subcompositions. The matrix of signs to build these balances is given in Table 5.
Remember that in a sign table, + 1 indicates variables that appear in the numerator
of the balance, − 1 variables in the denominator, and 0 variables are not involved in
that particular balance. For instance, the balance between the subcompositions TiO2 –
Fe2 O3 –MnO and Al2 O3 –Na2 O–LOI is y4∗ , and the balances (y7∗ , y8∗ ) describe the
internal variability in the subcomposition TiO2 –Fe2 O3 –MnO.
Using this set of balances, a regression model with explanatory variable (log) annual
precipitation is fit, with LS and MM regression. Results are reported in Table 6.
Table 6 Intercept (int) and slope (slp) estimated coefficients and p values (.p), for least squares (LS) and robust (MM) regression

Paying attention to the p values of the slopes of the two models, we conclude that the subcomposition Al2O3–Na2O–LOI (y7*, y8*) is internally independent of annual precipitation (both classical and robust methods agree on that). Loosely speaking, the same applies to the balances SiO2/K2O (y10*) and MgO against all other components (y1*). Finally, the balance TiO2/Fe2O3 (y6*) appears to be uncorrelated with annual precipitation only from a least-squares perspective.
Now, global tests of internal and external independence of Al2O3–Na2O–LOI with respect to annual precipitation were performed after the methodology of Sect. 3.3, and delivered p values of 0.884 and 0, respectively. These results are somewhat at odds with the common understanding of weathering as a process of enrichment in Al2O3 (and perhaps LOI) at the expense of Na2O (and CaO). Annual precipitation, one of the factors of chemical weathering, does not show any significant effect on the logratio Al2O3/Na2O. The robust global test, on the other hand, results in significant effects: in both cases, the p values are zero.
Fig. 7 Regression gradient of pH against the major oxide composition, expressed in clr coefficients: least
squares estimates (LS, left) and robust estimates (MM, right)
negative weights we also find balances which could concentrate most of the predicting
power for pH: these are CaO/Na2 O (y2∗ ) and K2 O/LOI (y5∗ ).
The regression results are presented in Table 8 for LS and MM regression. The
table shows the estimated regression coefficients and the corresponding p values.
Both methods reveal that the coefficients for balances y2* and y5* are significant, while those for balances y6* and y10* are not. Using the methods from Sect. 3.3, the corresponding hypotheses of subcompositional independence can also be tested formally.
Table 8 LS and MM regression coefficients and p values for a regression of pH on the major oxide balances
Figure 8 (left panel) displays pH versus balance y2*, with colors indicating high (red) and low (blue) values of pH. One can see that the ratio of CaO and Na2O increases strongly with higher values of pH, leading to a non-linear relationship. The reason is seen in the middle panel of Fig. 8, using the same coloring, where high pH values are connected to high concentrations of CaO. These high-pH samples are indicated in red in the project area map (right panel). This supports the starting hypothesis of a strong control on pH by the buffering ability of carbonate soils. This trend can be explained as the contrast between silicic–clastic plus crystalline rocks with significant contributions of Na-rich silicates versus carbonate karstic landscapes, dominated by CaCO3 with its very strong buffering effect at slightly basic pH values. Such a complex trend could be better captured either with a non-linear regression method, or with stepwise linear regression carried out only for the samples which behave similarly (blue or red).
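A reduced sketch of such a fit, using only the two balances highlighted above rather than the full set of balances reported in Table 8 (the column names in gemas are again our assumption):

# pH explained by the CaO/Na2O and K2O/LOI balances, classical and robust
z_CaNa <- log(gemas$CaO / gemas$Na2O) / sqrt(2)      # y2*-type balance
z_KLOI <- log(gemas$K2O / gemas$LOI) / sqrt(2)       # y5*-type balance

fit_ph    <- lm(gemas$pH ~ z_CaNa + z_KLOI)
fit_ph_mm <- robustbase::lmrob(gemas$pH ~ z_CaNa + z_KLOI)
cbind(LS = coef(fit_ph), MM = coef(fit_ph_mm))       # compare estimated coefficients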
Fig. 8 pH versus balance y2∗ (left), Na2 O versus CaO (middle), and map of the area (right), with color
indicating pH values lower or higher than 7
Fig. 9 Robust regression diagnostic plot (left), observed versus fitted pH values using the major oxides as
predictors (right). Observations with pH values higher than 7 are marked in red
Fig. 10 MM regression residuals. Observations with pH values higher than 7 are marked in red
Table 9 Descriptive statistics of the variability of the total sum of %sand + %silt + %clay, which should
be 100%
To start with, we follow the same approach as in the previous cases and plot the
coefficients resulting from LS or robust regression in terms of clr coefficients, Fig. 7.
We can then look at similar contributions of the several oxides on each grain size
fraction to formulate hypotheses. Note that this leads to a matrix of regression coeffi-
cients, linking the grain size distribution as responses to the major oxide composition
as explanatory variables. Table 10 reports the coefficients of an LS Type 3 regression
model, albeit with all coefficients expressed in clr representation.
For establishing subcompositional independence, though, it is more convenient
to work in isometric logratios. Hence, and given the results obtained until now, it appears sensible to study the single-component independence of clay (vs. sand and silt in balance y2*) on the one hand, and on the other the internal subcompositional independence of the sand-silt subcomposition.
Table 10 Least-square coefficients of a Type 3 regression model, represented in clr coefficients (first 3 columns) and in ilr coordinates (last 2 columns)

            Sand      Silt      Clay      y1*       y2*
Intercept  −3.5756    3.3797    0.1958    4.9181    0.2398
SiO2        0.7096   −0.4581   −0.2515   −0.8257   −0.3080
TiO2       −0.8453    0.5952    0.2501    1.0186    0.3063
Al2O3       0.4882   −0.5858    0.0977   −0.7594    0.1196
Fe2O3      −0.0012    0.0434   −0.0422    0.0315   −0.0516
MnO        −0.1958    0.0029    0.1928    0.1405    0.2362
MgO        −0.0402   −0.1334    0.1737   −0.0659    0.2127
CaO        −0.0794   −0.0214    0.1008    0.0410    0.1235
Na2O        0.1614    0.2218   −0.3832    0.0428   −0.4693
K2O        −0.5873    0.1671    0.4203    0.5335    0.5147
P2O5        0.3422    0.0191   −0.3613   −0.2284   −0.4425
LOI         0.0480    0.1493   −0.1973    0.0716   −0.2417
Table 11 Least squares regression model coefficients and null p values (last three columns) for a tailored ilr representation of the major oxide composition (represented in the first ten columns) to explain the logratio sand to silt (note: intercept not included)

SiO2  TiO2  Al2O3  Fe2O3  MnO  MgO  CaO  Na2O  K2O  P2O5  LOI  Estim.  Stderr.  p value

Table 12 Least squares regression model coefficients and null p values (last three columns) for a tailored ilr representation of the major oxide composition (represented in the first ten columns) to explain the balance of clay against sand and silt (note: intercept not included)

SiO2  TiO2  Al2O3  Fe2O3  MnO  MgO  CaO  Na2O  K2O  P2O5  LOI  Estimate  Std.error  p value
Fig. 11 Simultaneous plot of the left and (scaled) right singular vectors of the regression coefficients
matrix, expressed in clr coefficients on both image and origin compositional spaces
picture for y2∗ is completely different: here we can only hope to simplify one coeffi-
cient, namely x8∗ , giving the balance of Fe with respect to the other mafic components.
This is nevertheless irrelevant for the sake of subcompositional independence testing,
because the rest of the balances between mafic components (x2∗ , x5∗ and x7∗ ) do show
significant coefficients.
Another way of looking at the model coefficients is to express them via the singu-
lar value decomposition of Eq. (11). A naive simultaneous plot of the left and right
singular vectors (the latter scaled by the singular values) is given in Fig. 11. In
this diagram, links joining two variables represent the direction (on the origin or on
the image simplexes, resp. Sx or S y ) associated with fastest change of the logratio
of the two variables involved. A pair of parallel links, one involving components of
Sx and the other linking components of S y , suggests that the logratio between the
involved response variables is strongly controlled by the logratio of the explanatory
variables. For instance, the silt–clay link is reasonably parallel to the link Na2 O–
Al2 O3 ; the same can be said of the links silt–sand versus TiO2 –SiO2 , or of clay–sand
versus K2 O–SiO2 . An analogous reasoning applies for orthogonal links: they indicate
lack of dependence between the two sets of variables involved. In other words, by
finding orthogonal links we identify subcompositions to test for potential subcom-
positional independence. For instance, the link sand-silt is roughly orthogonal to the
sets SiO2 –Al2 O3 and CaO–Fe2 O3 –MgO–Na2 O–LOI(–MnO), that is to the subcom-
positions that were previously tested. Similarly, the diagram suggests as well tests
for subcompositional independence of sand-clay with respect to the subcompositions
SiO2 –P2 O5 –Na2 O or Al2 O3 –Fe2 O3 (–LOI), or even K2 O–TiO2 .
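A sketch of how the coefficient matrix and the decomposition of Eq. (11) can be obtained in R; it assumes matrices Ys (n × 2) and Xs (n × 10) holding the ilr coordinates of the grain size and of the 11-part oxide composition, computed with contrast matrices Vy and Vx fulfilling Eq. (4) (e.g. from the contrast_V helper above). All object names are ours.

# Type 3 sketch: grain size composition (response) on major oxide composition (explanatory)
Vy <- contrast_V(3)                       # basis for the 3-part grain size composition
Vx <- contrast_V(11)                      # basis for the 11-part oxide composition
# Ys (n x 2) and Xs (n x 10) are assumed to hold ilr coordinates built with Vy and Vx
fit3 <- lm(Ys ~ Xs)                       # multivariate LS regression in coordinates
Bhat <- coef(fit3)[-1, ]                  # 10 x 2 slope matrix (intercept dropped)

dec <- svd(t(Bhat))                       # t(Bhat) maps x* to y*, as in Eq. (11)
dec$d                                     # slopes d_i between the pairs of rotated directions
Vy %*% dec$u                              # image-space directions (grain size), in clr
Vx %*% dec$v                              # origin-space directions (oxides), in clr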
Fig. 12 Observed versus fitted values of the grain size composition, using the major oxides as predictors.
Points on the dashed lines would indicate a perfect match between the observed and fitted values. The
contour lines indicate the density of the point distribution
Finally, we illustrate the model in terms of the fitted values. For this purpose one
can use any ilr coordinates representing the grain size composition, and any ilr coor-
dinates representing the major oxides, and perform LS regression to obtain the fitted
values in ilr coordinates. The appropriate inverse ilr transformation leads to the back-
transformed fitted values of the grain size distribution, which can be compared to
the observed values in Fig. 12 (as kernel density estimates). The same results will
be obtained with any other logratio transformation. The linear model has at least
some predictive power, and one can see a clearer relationship with sand and clay, and
a weaker with silt. This suggests that the major oxides are affecting the grain size
composition mainly by its sand and clay proportions. Obviously, several factors do
contribute to this discrepancy, among other the information effect (the regression line
of true values as a function of predicted values cannot lie above the 1:1 bisector), the
presence of outliers, the bad quality of the input grain size data, the non-linearity of
the back-transformation of predictions from logratios to original components, or the
highly complex relations between chemistry, mineralogy and texture that form the
basis to attempt such a prediction.With respect to outliers, the predictive power can
be improved by using a robust estimator.The non-linearity of the back-transformation
is something that can easily be corrected by means of Hermitian integration of the
conditional distribution of the soil grain size composition provided by Eq. (10), as
proposed by Aitchison (1986). But much more important than those effects are the
uncertainty on the textural data and the complexity of the relation we are trying to
capture here. Indeed, if the goal of the study were that prediction, linear regression
might not be the most appropriate technique. Tackling this complexity is a matter of
predictive models, beyond the scope of this contribution.
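A comparison in the spirit of Fig. 12 can be sketched by back-transforming the fitted ilr values of the Type 3 sketch above with the inverse ilr of Eq. (5):

# Back-transform fitted ilr values to grain size proportions (cf. Fig. 12)
Ys_fit <- fitted(fit3)                           # n x 2 fitted values in ilr coordinates
Y_fit  <- t(apply(Ys_fit, 1, ilr_inv, V = Vy))   # fitted compositions, closed to 1
# each column of Y_fit can now be plotted against the corresponding observed proportion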
6 Conclusions
The purpose of this contribution was to outline the concept of regression analysis
for compositional data, and to show how the analysis can be carried out in practice
with real data. We distinguished three types of regression models: Type 1, where
the response is a composition and the explanatory variable(s) is a (are) real non-
compositional variable(s); Type 2 with a composition as explanatory variables and a
real response, and Type 3 where both the responses and the explanatory variables are
compositions. Note that one could also consider the case where regression is done
within one composition, by splitting the compositional parts into a group forming the
responses, and a group representing the explanatory variables. This case has not been
treated here because it requires a so-called errors-in-variables model, see Hrůzová
et al. (2016) for details.
For all three types of models it is essential how the composition is treated for regres-
sion modeling. A geometrically sound approach is in terms of orthonormal coordinates,
so-called balances, which can be constructed in order to obtain an interpretation of
the regression coefficients and for testing different hypotheses. If the interest is not in
the statistical inference but only in the model fit and in the fitted values, any logra-
tio coordinates would be appropriate to represent the composition. Note that the clr
transformation would not be appropriate for Type 2 or Type 3 regression models, since
the resulting data matrix is singular, leading to problems for the parameter estimation
when the composition plays the role of the explanatory variables.
Classical least-squares (LS) regression as well as robust MM regression have been
considered to estimate the regression parameters and the corresponding p values for
the hypothesis tests. If the model requirements are fulfilled, the LS regression esti-
mator is the so-called best linear unbiased estimator (BLUE) with the corresponding
optimality properties (see, e.g., Johnson and Wichern 2007), but in that case also MM
regression leads to an estimator with high statistical efficiency. However, in case of
model violations, e.g., due to data outliers, these optimality properties are no longer
valid. Still, the MM estimator is reliable because it is highly robust against outliers,
both in the explanatory and in the response variables. In practical applications it might
not always be clear if outliers are present in the data at hand. In this case it could be
recommended to carry out both types of analysis and compare the results. In particular,
one could inspect diagnostics plots from robust regression (as it was done in Sect. 5.5)
in order to identify potential outliers that could have affected the LS estimator, see
Maronna et al. (2006).
The different regression types and estimators have been applied to an example data
set from the GEMAS project (Reimann et al. 2014a, b). All presented examples are
only for illustrative purposes, but they show how balances can be constructed and
how hypotheses can be tested. For the robust estimators, functions are available in the
R packages robustbase (Maechler et al. 2018) and FRB (Van Aelst and Willems
2013). It is important to note that not only the regression parameters are estimated
robustly with these packages, but robust estimation is also carried out for estimating
the standard errors and for hypothesis testing, for the residual variance, the multiple R²
measure, etc. We demonstrate the possibilities of regression diagnostics in Sect. 5.5.
In most examples, a comparison of LS and MM regression has been provided.
An important issue in the regression context is the problem of variable selection, or
subcompositional independence. In particular for Type 2 and 3 where the explanatory
variables are originating from a composition, it is not straightforward how to end up
with the “best subset” of compositional parts that does not contain non-informative
parts and still yields a model with similar predictive power as the full model. There are
approaches available in the literature to reduce the number of components, see, e.g.,
Pawlowsky-Glahn et al. (2011), Hron et al. (2013), Mert et al. (2015) and Greenacre
(2019). However, there are no methods of subcompositional independence which work
equivalently to non-compositional methods, such as forward or backward variable
selection; only a brief outlook for those in the compositional context was sketched in
Filzmoser et al. (2018). Those methods will be treated in our future research.
Acknowledgements Karel Hron and Peter Filzmoser gratefully acknowledge the support by
Czech Science Foundation GA19-01768S.
Funding Open access funding provided by ZHAW Zurich University of Applied Sciences
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License,
which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long
as you give appropriate credit to the original author(s) and the source, provide a link to the Creative
Commons licence, and indicate if changes were made. The images or other third party material in this
article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line
to the material. If material is not included in the article’s Creative Commons licence and your intended use
is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission
directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
References
Aitchison J (1986) The statistical analysis of compositional data. Monographs on Statistics and Applied
Probability, London (UK): Chapman & Hall, London. (Reprinted in 2003 with additional material by
The Blackburn Press), ISBN 0-412-28060-4
Aitchison J (1997) The one-hour course in compositional data analysis or compositional data analysis
is simple. In: Pawlowsky-Glahn V (ed) Proceedings of IAMG’97—The III annual conference of
the international association for mathematical geology, volume I, II and addendum, Barcelona (E):
International Center for Numerical Methods in Engineering (CIMNE), Barcelona (E), ISBN 84-87867-
97-9, pp 3–35
Aitchison J, Greenacre M (2002) Biplots for compositional data. J R Stat Soc Ser C (Appl Stat) 51(4):375–
392
Aitchison J, Barceló-Vidal C, Egozcue JJ, Pawlowsky-Glahn V (2002) A concise guide for the algebraic-
geometric structure of the simplex, the sample space for compositional data analysis. In: Bayer U,
Burger H, Skala W (eds) Proceedings of IAMG’02—The eighth annual conference of the International
Association for Mathematical Geology, volume I and II, Selbstverlag der Alfred-Wegener-Stiftung,
Berlin, pp 387–392, ISSN 0946-8978
Anderson TW, Darling DA (1952) Asymptotic theory of certain “goodness-of-fit” criteria based on stochastic
processes. Ann Math Stat 23:193–212
Barceló-Vidal C, Martín-Fernández JA, Pawlowsky-Glahn V (2001) Mathematical foundations of compo-
sitional data analysis. In Ross G (ed) Proceedings of IAMG’01—The VII annual conference of the
international association for mathematical geology, Cancun (Mex)
Baritz R, Fuchs M, Hartwich R, Krug D, Richter S (2005) Soil regions of the European Union and adjacent
countries 1:5,000,000 (Version 2.0)—Europaweite thematische Karten und Datensätze. European Soil
Bureau Network
Billheimer D, Guttorp P, Fagan W (2001) Statistical interpretation of species composition. J Am Stat Assoc
96(456):1205–1214
Coenders G, Martín-Fernández J, Ferrer-Rosell B (2017) When relative and absolute information matter:
compositional predictor with a total in generalized linear models. Stat Model 17(6):494–512
Daunis-i-Estadella J, Egozcue JJ, Pawlowsky-Glahn V (2002) Least squares regression in the Simplex.
In: Bayer U, Burger H, Skala W (eds) Proceedings of IAMG’02—the eighth annual conference of
the International Association for Mathematical Geology, volume I and II, Selbstverlag der Alfred-
Wegener-Stiftung, Berlin, ISSN 0946-8978, pp 411–416
Egozcue JJ, Pawlowsky-Glahn V (2005) Groups of parts and their balances in compositional data analysis.
Math Geol 37:795–828
Egozcue J J, Pawlowsky-Glahn V (2011) Basic concepts and procedures. In: Pawlowsky-Glahn V, Buccianti
A (eds) Compositional data analysis: theory and applications, Wiley, ISBN 978-0-470-71135-4, pp
12–28
Egozcue JJ, Pawlowsky-Glahn V, Mateu-Figueras G, Barceló-Vidal C (2003) Isometric logratio transfor-
mations for compositional data analysis. Math Geol 35(3):279–300, ISSN 0882-8121
Egozcue JJ, Daunis-i-Estadella J, Pawlowsky-Glahn V, Hron K, Filzmoser P (2012) Simplicial regression.
The normal model. J Appl Probab Stat 6(1&2):87–108
Egozcue J, Lovell D, Pawlowsky-Glahn V (2013) Regression between compositional data sets. In: Hron
K, Filzmoser P, Templ M (eds) Proceedings of the 5th international workshop on compositional data
analysis, Vorau
Filzmoser P, Hron K, Templ M (2018) Applied compositional data analysis: with worked examples in R.
Springer, Cham
Fišerová E, Donevska S, Hron K, Bábek O, Vaňkátová K (2016) Practical aspects of log-ratio coordinate
representations in regression with compositional response. Meas Sci Rev 16(5):235–243
Graffelman J, van Eeuwijk F (2005) Calibration of multivariate scatter plots for exploratory analysis of
relations within and between sets of variables in genomic research. Biom J 47(6):863–879
Greenacre M (2019) Variable selection in compositional data using pairwise logratios. Math Geosc 51:649–
682
Hampel F, Ronchetti E, Rousseeuw P, Stahel W (1986) Robust statistics. The approach based on influence
functions. Wiley, New York
Hron K, Donevska S, Fišerová E, Filzmoser P (2013) Covariance-based variable selection for compositional
data. Math Geosci 45(4):487–498
Hrůzová K, Todorov V, Hron K, Filzmoser P (2016) Classical and robust orthogonal regression between
parts of compositional data. Stat A J Theor Appl Stat 50(6):1261–1275
Johnson R, Wichern D (2007) Applied multivariate statistical analysis, 6th edn. Prentice Hall, New York,
p 800
Maechler M, Rousseeuw P, Croux C, Todorov V, Ruckstuhl A, Salibian-Barrera M, Verbeke T, Koller M,
Conceicao ELT, Anna di Palma M (2018) robustbase: basic robust statistics. R package version 0.93-3
Maronna R, Martin R, Yohai V (2006) Robust statistics: theory and methods. Wiley, New York
Mateu-Figueras G, Pawlowsky-Glahn V (2008) A critical approach to probability laws in geochemistry.
Math Geosci 40(5):489–502
Mert C, Filzmoser P, Hron K (2015) Sparse principal balances. Stat Model 15(2):159–174
Mood AM, Graybill FA, Boes DC (1974) Introduction to the theory of statistics, 3rd edn. McGraw-Hill,
New York
Pawlowsky-Glahn V (2003) Statistical modelling on coordinates. In: Thió-Henestrosa S, Martín-Fernández
JA (eds) Proceedings of CoDaWork’03, The 1st Compositional Data Analysis Workshop, Girona (E).
Universitat de Girona, ISBN 84-8458-111-X, http://ima.udg.es/Activitats/CoDaWork2003/
Pawlowsky-Glahn V, Egozcue JJ (2001a) Geometric approach to statistical analysis on the simplex. Stoch
Environ Res Risk Assess (SERRA) 15(5):384–398
Pawlowsky-Glahn V, Egozcue JJ (2001b) Geometric approach to statistical analysis on the simplex. Stoch
Environ Res Risk Assess (SERRA) 15(5):384–398
Pawlowsky-Glahn V, Egozcue JJ, Tolosana-Delgado R (2011) Principal balances. In: Egozcue JJ, Tolosana-
Delgado R, Ortego MI (eds) Proceedings of the 4th international workshop on compositional data
analysis (2011), CIMNE, Barcelona, Spain, ISBN 978-84-87867-76-7
Pawlowsky-Glahn V, Egozcue J, Tolosana-Delgado R (2015) Modeling and analysis of compositional data.
Wiley, Chichester
R Development Core Team (2019) R: a language and environment for statistical computing. R Foundation
for Statistical Computing, Vienna
Reimann C, Birke M, Demetriades A, Filzmoser P, O’Connor P (eds) (2014a) Chemistry of Europe’s
agricultural soils—part A: methodology and interpretation of the GEMAS data set. Geologisches
Jahrbuch (Reihe B 102). Schweizerbarth, Hannover