Appendix S3
ABC framework
Let y denote the observed data and θ a vector of parameters associated with a statistical model whose
likelihood is f(· | θ). Under the Bayesian parametric paradigm the posterior distribution
π(θ | y) ∝ f(y | θ)π(θ)
is of prime interest. It characterizes the distribution of θ given the observation y and can be
interpreted as an update of the prior distribution π(θ) by the likelihood of y. The likelihood is
hence pivotal, but unfortunately intractable in the evolutionary scenarios (models) we consider in
the present study, as well as in many other evolutionary studies. Indeed, the underlying
Kingman's coalescent process (Kingman, 1982) does not admit a closed-form expression for the likelihood,
because all the possible genealogies and mutational processes yielding y would have to be considered. To
circumvent this issue, likelihood-free methods have been developed that exploit the fact that, even though
the likelihood is not available, generating artificial (i.e. simulated) data for a given value of θ
remains feasible (e.g. Beaumont, 2010). Approximate Bayesian computation (ABC) is
one of them (Beaumont et al., 2002).
In a nutshell, ABC consists of generating parameters θ′ and associated pseudo-data z from
the scenario, and accepting θ′ as a realization from an approximate posterior if z is similar to y.
In standard ABC treatments, the notion of similarity is defined through the use of a distance ρ
to compare η(z) and η(y), where η(·) is a projection of the data into a lower-dimensional space of
summary statistics. Only pseudo-data yielding a distance lower than a threshold ε are retained.
The choice of ρ, η(·) and ε is a major issue in ABC (Beaumont, 2010).
ABC-RF for model choice
ABC-RF is a recently derived ABC approach based on the supervised machine learning tool
named Random Forest (see main text), whose major advantage is that it avoids the three above-
mentioned difficulties. Initially introduced in Pudlo et al. (2016) for model choice and then extended
to parameter inference in Raynal et al. (2019), ABC-RF relies on the use of random forests on a set
of pseudo-data simulated according to the generative Bayesian models under consideration. Let
us consider M Bayesian parametric models. For a given model index m ∈ {1, . . . , M}, a prior
probability P(M = m) is defined, with θm its associated parameters and fm(y | θm) its likelihood.
The generation process of a reference table made of H elements is described in Algorithm 1.
Algorithm 1: Generation of a reference table with H elements
1 for j ← 1 to H do
2     Generate m(j) from the prior P(M = m)
3     Generate θm(j) from the prior πm(j)(·)
4     Generate z(j) from the model fm(j)(· | θm(j))
5     Compute η(z(j)) = (η1(z(j)), . . . , ηd(z(j)))
6 end
The output takes the form of a matrix containing simulated model indexes, parameters and
summary statistics, as described below:
\[
\begin{pmatrix}
m^{(1)} & \theta_{m^{(1)}} & \eta_1(z^{(1)}) & \eta_2(z^{(1)}) & \cdots & \eta_d(z^{(1)}) \\
m^{(2)} & \theta_{m^{(2)}} & \eta_1(z^{(2)}) & \eta_2(z^{(2)}) & \cdots & \eta_d(z^{(2)}) \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
m^{(H)} & \theta_{m^{(H)}} & \eta_1(z^{(H)}) & \eta_2(z^{(H)}) & \cdots & \eta_d(z^{(H)})
\end{pmatrix}
\]
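To make the procedure concrete, here is a minimal Python sketch of Algorithm 1, where sample_prior_m, sample_prior_theta, simulate_data and summaries are hypothetical toy stand-ins for the prior P(M = m), the priors πm(·), the simulators fm(· | θm) and the projection η(·):

import numpy as np

rng = np.random.default_rng(42)

M = 3        # number of candidate scenarios
H = 10_000   # number of elements in the reference table
d = 5        # number of summary statistics

# Hypothetical toy stand-ins for P(M = m), pi_m(.), f_m(. | theta_m) and eta(.).
def sample_prior_m():
    return int(rng.integers(1, M + 1))

def sample_prior_theta(m):
    return rng.uniform(0.0, 10.0)            # toy one-dimensional prior pi_m(.)

def simulate_data(m, theta):
    return rng.normal(theta, m, size=100)    # toy generative model f_m(. | theta)

def summaries(z):
    return np.array([z.mean(), z.std(), np.median(z), z.min(), z.max()])

# Algorithm 1: fill the reference table row by row.
models = np.empty(H, dtype=int)   # m(j)
thetas = np.empty(H)              # theta_m(j)
stats = np.empty((H, d))          # eta_1(z(j)), ..., eta_d(z(j))
for j in range(H):
    models[j] = sample_prior_m()
    thetas[j] = sample_prior_theta(models[j])
    stats[j] = summaries(simulate_data(models[j], thetas[j]))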
A classification random forest is then trained on this reference table to predict the model index
from the summary statistics (Algorithm 2); when presented with the observed summaries η(y), each
tree of the forest casts a vote for one scenario. The selected scenario is the one with the highest
number of votes in its favor. In addition to this majority vote, the posterior probability of the
selected scenario can be computed as described in Algorithm 3.
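The classification step can be sketched with scikit-learn's RandomForestClassifier standing in for the forest described in the main text (models and stats come from the sketch above; eta_y is a placeholder for the observed summaries):

from sklearn.ensemble import RandomForestClassifier

# Classification forest trained on (summary statistics -> model index);
# oob_score=True keeps the out-of-bag information used further below.
clf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
clf.fit(stats, models)

eta_y = stats[0]  # placeholder: in practice, the summaries eta(y) of the observed data

# Each tree votes for one scenario at eta(y); individual trees return indices
# into clf.classes_, and the selected scenario is the modal vote.
votes = np.array([int(t.predict(eta_y.reshape(1, -1))[0]) for t in clf.estimators_])
selected = clf.classes_[np.bincount(votes).argmax()]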
Such a posterior probability provides a confidence measure for the previous prediction at the point
of interest η(y). It relies on the construction of a regression random forest designed to explain the
model prediction error. More specifically, as a first step, the posterior probability computation
makes use of out-of-bag predictions on the training dataset. Because each tree of the random forest
is built on a bootstrap sample of the H elements of the reference table (i.e. the training dataset),
about one third of the reference table remains unused by each tree; this ensemble of left-aside
datasets constitutes the "out-of-bag" samples. Thus, for each pseudo-dataset of the reference
table, one can obtain an out-of-bag prediction by aggregating all the classification trees for which
that pseudo-dataset was out-of-bag.
In a second step, the out-of-bag predictions m̂(η(z(h))) are used
to compute the indicators I{m(h) ≠ m̂(η(z(h)))}. These 0–1 values are used as response variables
for the regression random forest training, for which the explanatory variables are the summary
statistics of the reference table. Predicting at the observed point η(y) with this forest yields the
posterior probability of the selected model (Algorithm 3). Note that using the
out-of-bag procedure prevents over-fitting issues and is computationally parsimonious as it avoids
the generation of a second reference table for the regression random forest training.
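A sketch of this error-forest construction, continuing the previous example (the out-of-bag class predictions are recovered from the oob_decision_function_ attribute of the classifier trained above):

from sklearn.ensemble import RandomForestRegressor

# Out-of-bag class predictions for every pseudo-dataset of the reference table.
oob_pred = clf.classes_[clf.oob_decision_function_.argmax(axis=1)]

# 0-1 indicators of an out-of-bag misclassification, used as responses.
misclassified = (oob_pred != models).astype(float)

# Regression forest explaining the prediction error from the summaries.
err_rf = RandomForestRegressor(n_estimators=500, random_state=0)
err_rf.fit(stats, misclassified)

# Estimated local error at eta(y); its complement approximates the posterior
# probability of the selected scenario.
posterior_prob = 1.0 - err_rf.predict(eta_y.reshape(1, -1))[0]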
Model grouping. A recent useful add-on to ABC-RF is the model-grouping approach developed
in Estoup et al. (2018), where pre-defined groups of scenarios are analysed using Algorithms
2 and 3. The model indexes used in the training reference table are modified in a preliminary step
to match the corresponding groups, which are then used during the learning phase. When appropriate,
unused scenarios are discarded from the reference table. This improvement is particularly useful
when a large number of individual scenarios are considered that have been formalized through the
absence or presence of some key evolutionary events (e.g. admixture, bottleneck, ...). Such key
evolutionary events make it possible to define, and further consider, groups of scenarios that include
or exclude these events. This grouping approach allows one to evaluate the power of ABC-RF to make
inferences about evolutionary event(s) of interest over the entire prior space, and to assess (and
quantify) whether or not a particular evolutionary event is of prime importance to explain the
observed dataset (see Estoup et al. (2018) for details and illustrations).
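For illustration, the relabelling step amounts to a simple preliminary mapping of the reference table; the sketch below assumes, hypothetically, that scenarios 1-3 include the key event and scenarios 4-6 do not:

# Hypothetical grouping: scenarios 1-3 include the key event (group 1),
# scenarios 4-6 do not (group 2); scenarios absent from the mapping are dropped.
group_of = {1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2}
keep = np.isin(models, list(group_of))
grouped = np.array([group_of[m] for m in models[keep]])
stats_kept = stats[keep]
# Algorithms 2 and 3 are then run on (stats_kept, grouped) instead of (stats, models).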
Global prior errors
In both contexts, model choice or parameter estimation, a global quality measure of the predictor
can be computed which does not take into account the observed dataset (about which one wants
to make inferences). Random forests make it possible to compute such errors on the training
reference table, using the out-of-bag predictions previously described in the section "ABC-RF for
model choice".
For model choice, this type of error is called the prior error rate, which is the misclassification
error rate computed over the entire multidimensional prior space. It can be computed as
\[
\frac{1}{H} \sum_{h=1}^{H} \mathbb{I}\left\{ m^{(h)} \neq \hat{m}\left( \eta(z^{(h)}) \right) \right\}.
\]
For parameter estimation, the equivalent is the prior mean squared error (MSE) or the nor-
malised mean absolute error (NMAE), the latter being less sensitive to extreme values. These
errors are computed as
\[
\mathrm{MSE} = \frac{1}{H} \sum_{h=1}^{H} \left( \theta_{m^{(h)},k} - \hat{\theta}_{m^{(h)},k} \right)^2,
\qquad
\mathrm{NMAE} = \frac{1}{H} \sum_{h=1}^{H} \left| \frac{\theta_{m^{(h)},k} - \hat{\theta}_{m^{(h)},k}}{\theta_{m^{(h)},k}} \right|.
\]
They can be perceived as Monte Carlo approximations of expectations with respect to the prior
distribution.
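These global errors follow directly from out-of-bag predictions, as in the following sketch (continuing the running example; for parameter estimation a regression forest is trained to predict θ from the summaries, its oob_prediction_ attribute playing the role of θ̂):

# Prior error rate for model choice: out-of-bag misclassification rate.
prior_error_rate = np.mean(oob_pred != models)

# For parameter estimation: regression forest predicting theta from the summaries.
reg = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=0)
reg.fit(stats, thetas)
oob_theta = reg.oob_prediction_

mse = np.mean((thetas - oob_theta) ** 2)
nmae = np.mean(np.abs((thetas - oob_theta) / thetas))  # assumes theta bounded away from 0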
Local posterior errors
For model choice, the analogous local measure of error at the point of interest η(y) is
\[
1 - \hat{P}\left( M = \hat{m}(\eta(y)) \,\middle|\, \eta(y) \right),
\]
that is, the complement of the posterior probability of the selected scenario obtained from Algorithm 3.
For parameter estimation, when inferring θm,k, a point-wise analogous measure of local error
can be computed as the posterior expectations
\[
\mathbb{E}\left( \left( \theta_{m,k} - \hat{\theta}_{m,k} \right)^2 \,\middle|\, \eta(y) \right)
\quad\text{and}\quad
\mathbb{E}\left( \left| \frac{\theta_{m,k} - \hat{\theta}_{m,k}}{\theta_{m,k}} \right| \,\middle|\, \eta(y) \right). \tag{1}
\]
We again use the out-of-bag information to compute θ̂m(h),k, hence avoiding the (time-consuming)
production of a second reference table, and we assume that the weights wy from the regression
random forest are good enough to approximate any posterior expectation of a function of θm,k,
i.e. E(g(θm,k) | η(y)).
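As a sketch of one common construction of such weights (leaf co-membership weights in the spirit of quantile regression forests; whether this exactly matches the weights of Raynal et al. (2019) is an assumption here):

# Leaf co-membership weights: for each tree, training points falling in the same
# leaf as eta(y) receive weight 1/(leaf size); weights are averaged over trees.
leaves = reg.apply(stats)                      # (H, n_trees) leaf indices
query_leaves = reg.apply(eta_y.reshape(1, -1))[0]

w = np.zeros(H)
n_trees = leaves.shape[1]
for t in range(n_trees):
    in_leaf = leaves[:, t] == query_leaves[t]
    w[in_leaf] += 1.0 / in_leaf.sum()
w /= n_trees

# Weighted approximation of a posterior functional, e.g. E(theta | eta(y)).
post_mean = np.sum(w * thetas)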
Another, more expensive strategy to evaluate the posterior expectations (1) is to construct new
regression random forests using the out-of-bag vectors of values
\[
\left( \theta_{m^{(h)},k} - \hat{\theta}_{m^{(h)},k} \right)^2
\quad\text{or}\quad
\left| \frac{\theta_{m^{(h)},k} - \hat{\theta}_{m^{(h)},k}}{\theta_{m^{(h)},k}} \right|,
\]
depending on the targeted error. The observation η(y) is then given to the forests, targeting the
expectations (1).
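A sketch of this second, more expensive strategy, reusing the out-of-bag quantities from the previous sketches:

# Out-of-bag squared errors (or normalised absolute errors) as responses.
sq_err = (thetas - oob_theta) ** 2

local_rf = RandomForestRegressor(n_estimators=500, random_state=0)
local_rf.fit(stats, sq_err)

# Local posterior squared error at eta(y), targeting the first expectation in (1).
local_mse = local_rf.predict(eta_y.reshape(1, -1))[0]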
Note that the values θ̂m(h),k in the previous formulas can be replaced by either the approximated
posterior expectations Ê(θm(h),k | η(y)) or the posterior medians Q̂50%(θm(h),k | η(y)), again using
the out-of-bag information, to provide the local posterior errors. We found, both in the present
paper (see main text, Materials and Methods section) and in various tests that we carried out on
different inferential setups and datasets (results not shown), that the posterior median provides a
better accuracy of parameter estimation than the posterior expectation (aka the posterior mean).
This trend also holds for global prior errors, which can be computed using either the mean or the
median as point estimates.
As a final comment, it is worth noting that, so far, a common practice has consisted in evaluating
the quality of prediction (for model choice or parameter estimation) in the neighborhood of the
observed dataset, that is, around η(y) rather than exactly at η(y). For model choice, Estoup et al.
(2018) use the so-called posterior predictive error rate, which is an error of this type. In this case,
some simulated datasets of the reference table close to the observation are selected using a
Euclidean distance, new pseudo-observed datasets are then simulated using similar parameter values,
and the error is computed on these (see also Lippens et al., 2017, for a similar approach in a standard
ABC framework). However, the main problem with this procedure is the difficulty of specifying the
size of the area around the observation, especially when the number of summary statistics is large.
We therefore no longer recommend the use of such a "neighborhood" error, but rather advise
computing the local posterior errors detailed above, as the latter measure prediction quality exactly
at the position of interest η(y).
References
Beaumont, M. A. (2010). Approximate Bayesian Computation in Evolution and Ecology. Annual
Review of Ecology, Evolution, and Systematics, 41:379–406.
Beaumont, M. A., Zhang, W., and Balding, D. (2002). Approximate Bayesian Computation in
Population Genetics. Genetics, 162(4):2025–2035.
Estoup, A., Raynal, L., Verdu, P., and Marin, J.-M. (2018). Model choice using Approximate
Bayesian Computation and Random Forests: analyses based on model grouping to make infer-
ences about the genetic history of Pygmy human populations. Journal de la Société Française
de Statistique, 159(3):167–190.
Kingman, J. F. C. (1982). On the Genealogy of Large Populations. Journal of Applied Probability,
19(A):27–43.
Lippens, C., Estoup, A., Hima, M. K., Loiseau, A., Tatard, C., Dalecky, A., Bâ, K., Kane, M.,
Diallo, M., Sow, A., Niang, Y., Piry, S., Berthier, K., Leblois, R., Duplantier, J. M., and Brouat,
C. (2017). Genetic structure and invasion history of the house mouse (Mus musculus domesticus)
in Senegal, West Africa: a legacy of colonial and contemporary times. Heredity, 119(2):64–75.
Pudlo, P., Marin, J.-M., Estoup, A., Cornuet, J.-M., Gautier, M., and Robert, C. P. (2016). Reliable
ABC model choice via random forests. Bioinformatics, 32(6):859–866.
Raynal, L., Marin, J.-M., Pudlo, P., Ribatet, M., Robert, C. P., and Estoup, A. (2019). ABC
random forests for Bayesian parameter inference. Bioinformatics, 35(10):1720–1728.