
Supplementary Material S3: Overview of the ABC Random Forest (ABC-RF) methods used


In this supplementary material, we provide readers with an overview of the Approximate Bayesian
Computation Random Forest (hereafter ABC-RF) methods used in the present paper. We invite
the reader to consult Pudlo et al. (2016), Estoup et al. (2018), and Raynal et al. (2019) for more
in-depth explanations.

ABC framework
Let y denote the observed data and θ a vector of parameters associated with a statistical model whose likelihood is f(· | θ). Under the Bayesian parametric paradigm, the posterior distribution

π(θ | y) ∝ f(y | θ)π(θ)

is of prime interest. It characterizes the distribution of θ given the observation y and can be interpreted as an update of the prior distribution π(θ) by the likelihood of y. The likelihood is hence pivotal, but unfortunately intractable in the evolutionary scenarios (models) we consider in the present study, as well as in many other evolutionary studies. Indeed, the underlying Kingman's coalescent process (Kingman, 1982) does not admit a closed-form expression for the likelihood, because all the possible genealogies and mutational processes yielding y would have to be considered. To circumvent this issue, likelihood-free methods have been developed that exploit the fact that, even though the likelihood is not available, generating artificial (i.e. simulated) data for a given value of θ remains feasible and often easy (e.g. Beaumont, 2010). Approximate Bayesian computation (ABC) is one of them (Beaumont et al., 2002).
In a nutshell, ABC consists of generating parameters θ′ and associated pseudo-data z from the scenario, and accepting θ′ as a realization from an approximate posterior if z is similar to y. In standard ABC treatments, the notion of similarity is defined through the use of a distance ρ to compare η(z) and η(y), where η(·) is a projection of the data onto a lower-dimensional space of summary statistics. Only pseudo-data yielding a distance lower than a threshold ε are retained. The choice of ρ, η(·) and ε is a major issue in ABC (Beaumont, 2010).
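For concreteness, here is a minimal sketch of this rejection scheme on a hypothetical toy model: a Normal sample with unknown mean, the sample mean as the single summary statistic η, an absolute-difference distance ρ, and an arbitrary threshold ε. None of these choices come from the present paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: y is a Normal(theta, 1) sample of size 100,
# eta(.) is the sample mean, rho the absolute difference, eps a threshold.
y = rng.normal(2.0, 1.0, size=100)
eta_y = y.mean()

def rejection_abc(n_sims=100_000, eps=0.05):
    """Accept theta' whenever rho(eta(z), eta(y)) < eps."""
    accepted = []
    for _ in range(n_sims):
        theta = rng.normal(0.0, 10.0)          # theta' drawn from the prior
        z = rng.normal(theta, 1.0, size=100)   # pseudo-data z ~ f(. | theta')
        if abs(z.mean() - eta_y) < eps:        # keep only close pseudo-data
            accepted.append(theta)
    return np.array(accepted)

posterior_sample = rejection_abc()  # approximate draws from pi(theta | y)
```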
ABC-RF is a recently derived ABC approach based on the supervised machine learning tool named Random Forest (see main text), which has the major advantage of avoiding the three above-mentioned difficulties. Initially introduced in Pudlo et al. (2016) for model choice and then extended to parameter inference in Raynal et al. (2019), ABC-RF relies on the use of random forests trained on a set of pseudo-data simulated according to the generative Bayesian models under consideration. Let us consider M Bayesian parametric models. For a given model index m ∈ {1, . . . , M}, a prior probability P(M = m) is defined, with θm its associated parameters and fm(y | θm) its likelihood.
The generation process of a reference table made of H elements is described in Algorithm 1.
Algorithm 1: Generation of a reference table with H elements

1  for j ← 1 to H do
2      Generate m(j) from the prior P(M = m)
3      Generate θm(j) from the prior πm(j)(·)
4      Generate z(j) from the model fm(j)(· | θm(j))
5      Compute η(z(j)) = (η1(z(j)), . . . , ηd(z(j)))
6  end

The output takes the form of a matrix containing simulated model indexes, parameters and summary statistics, as described below:

\begin{pmatrix}
m^{(1)} & \theta_{m^{(1)}} & \eta_1(z^{(1)}) & \eta_2(z^{(1)}) & \cdots & \eta_d(z^{(1)}) \\
m^{(2)} & \theta_{m^{(2)}} & \eta_1(z^{(2)}) & \eta_2(z^{(2)}) & \cdots & \eta_d(z^{(2)}) \\
\vdots  & \vdots           & \vdots          & \vdots          & \ddots & \vdots \\
m^{(H)} & \theta_{m^{(H)}} & \eta_1(z^{(H)}) & \eta_2(z^{(H)}) & \cdots & \eta_d(z^{(H)})
\end{pmatrix}
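As an illustration, a minimal sketch of Algorithm 1 for a hypothetical pair of Normal models; the models, the priors and the three summary statistics are arbitrary choices, not those of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(m, theta, n=100):
    """Pseudo-data z ~ f_m(. | theta): two Normal models with different scales."""
    sigma = 1.0 if m == 0 else 2.0
    return rng.normal(theta, sigma, size=n)

def summaries(z):
    """eta(z) = (eta_1(z), ..., eta_d(z)), here with d = 3."""
    return np.array([z.mean(), z.var(), np.median(z)])

def reference_table(H=10_000):
    models, params, stats = [], [], []
    for _ in range(H):
        m = int(rng.integers(2))        # m(j) from the prior P(M = m)
        theta = rng.normal(0.0, 10.0)   # theta from the prior pi_m(.)
        z = simulate(m, theta)          # z(j) from the model f_m(. | theta)
        models.append(m)
        params.append(theta)
        stats.append(summaries(z))      # eta(z(j))
    return np.array(models), np.array(params), np.vstack(stats)

# The three column blocks of the matrix above.
m_idx, thetas, stats = reference_table()
```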

ABC-RF for model choice


The ABC-RF strategy for model choice is described in Algorithm 2. The output is the assignment of y to a model (scenario), this decision being based on the majority class among the RF tree votes.

Algorithm 2: ABC-RF for model choice

Input: a reference table used as learning set, made of H elements, each one composed of a model index m(h) and d summary statistics. A possibly large collection of summary statistics can be used, including some obtained by machine-learning techniques, but also by scientific theory and knowledge
Learning: construct a classification random forest m̂(·) to infer model indexes
Output: apply the random forest classifier to the observed data η(y) to obtain m̂(η(y))
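A minimal sketch of Algorithm 2 with scikit-learn's RandomForestClassifier, reusing the toy reference table above. The published analyses rely on the R package abcrf; this substitute is only illustrative, and averaging tree probabilities only approximates raw vote counting.

```python
from sklearn.ensemble import RandomForestClassifier

# Learn the classifier m_hat(.) on the reference table (Algorithm 2).
clf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=2)
clf.fit(stats, m_idx)

eta_obs = summaries(y)                         # summary statistics of the observed data
vote_shares = clf.predict_proba([eta_obs])[0]  # approximate share of tree votes per scenario
m_hat = clf.predict([eta_obs])[0]              # scenario with the majority of votes
```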

The selected scenario is the one with the highest number of votes in its favor. In addition to this majority vote, the posterior probability of the selected scenario can be computed as described in Algorithm 3.

Algorithm 3: ABC-RF computation of the posterior probability of the selected scenario

Input: the values of I{m(h) ≠ m̂(η(z(h)))} for the trained random forest and the corresponding summary statistics of the reference table, using the out-of-bag classifiers
Learning: construct a regression random forest Ê(·) to infer E(I{m ≠ m̂(η(y))} | η(y))
Output: an estimate of the posterior probability of the selected model m̂(η(y)):

P̂(m = m̂(η(y)) | η(y)) = 1 − Ê(I{m ≠ m̂(η(y))} | η(y))

This posterior probability provides a confidence measure of the previous prediction at the point of interest η(y). It relies on building a regression random forest designed to explain the model prediction error. More specifically, as a first step, the posterior probability computation makes use of out-of-bag predictions on the training dataset. Because each tree of the random forest is built on a bootstrap sample of the H elements of the reference table (i.e. the training dataset), about one third of the reference table remains unused in each tree; this ensemble of left-aside datasets constitutes the "out-of-bag" samples. Thus, for each pseudo-dataset of the reference table, one can obtain an out-of-bag prediction by aggregating all the classification trees in which that pseudo-dataset was out-of-bag. In a second step, the out-of-bag predictions m̂(η(z(h))) are used to compute the indicators I{m(h) ≠ m̂(η(z(h)))}. These 0-1 values are used as response variables for the regression random forest training, for which the explanatory variables are the summary statistics of the reference table. Feeding the observed data to this forest yields the posterior probability of the selected model (Algorithm 3). Note that using the out-of-bag procedure prevents over-fitting issues and is computationally parsimonious, as it avoids the generation of a second reference table for the regression random forest training.
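This two-step construction can be sketched as follows, again with scikit-learn as an illustrative stand-in: oob_decision_function_ exposes the out-of-bag class predictions of the classifier trained earlier.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Step 1: out-of-bag predictions and 0-1 misclassification indicators.
oob_pred = clf.classes_[np.argmax(clf.oob_decision_function_, axis=1)]
misclassified = (oob_pred != m_idx).astype(float)  # I{m(h) != m_hat(eta(z(h)))}

# Step 2: regression forest explaining the prediction error (Algorithm 3).
err_rf = RandomForestRegressor(n_estimators=500, random_state=3)
err_rf.fit(stats, misclassified)

posterior_prob = 1.0 - err_rf.predict([eta_obs])[0]  # P_hat(m = m_hat | eta(y))
```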

Model grouping. A recent useful add-on to ABC-RF is the model-grouping approach developed in Estoup et al. (2018), where pre-defined groups of scenarios are analysed using Algorithms 2 and 3. The model indexes used in the training reference table are modified in a preliminary step to match the corresponding groups, which are then used during the learning phase. When appropriate, unused scenarios are discarded from the reference table. This improvement is particularly useful when a high number of individual scenarios is considered, formalized through the absence or presence of some key evolutionary events (e.g. admixture, bottleneck, ...). Such key events allow one to define, and then compare, groups of scenarios that include or exclude them. This grouping approach makes it possible to evaluate the power of ABC-RF to make inferences about evolutionary event(s) of interest over the entire prior space, and to assess (and quantify) whether or not a particular evolutionary event is of prime importance to explain the observed dataset (see Estoup et al. (2018) for details and illustrations).
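In code, the preliminary relabelling step amounts to a simple mapping of scenario indexes to group labels before retraining the classifier. The sketch below assumes a hypothetical reference table with four scenarios, of which scenarios 0 and 2 include an admixture event; m_idx4 and stats4 denote the model-index and summary-statistic columns of such a table and are placeholders, not objects from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical grouping: map each scenario index in {0, 1, 2, 3} to its group.
group_of = np.array(["admixture", "no_admixture", "admixture", "no_admixture"])
groups = group_of[m_idx4]          # relabelled responses for the learning phase

grouped_clf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=4)
grouped_clf.fit(stats4, groups)    # then proceed exactly as in Algorithms 2 and 3
```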

ABC-RF for parameter estimation


Once the selected (i.e. best) scenario has been identified, the next step is the estimation of the parameters of interest under this scenario. The ABC-RF parameter estimation strategy is described in Algorithm 4 and has a structure similar to that of Algorithm 2. The idea is to use a regression random forest for each dimension of the parameter space (i.e. for each parameter). For a given parameter of interest, the output of the algorithm is a vector of weights wy that can be used to compute posterior quantities of interest such as the expectation, variance and quantiles. wy provides an empirical posterior distribution for θm,k; see Raynal et al. (2019) for more details.

Algorithm 4: ABC-RF for parameter estimation

Input: a vector of θm(h),k values (i.e. the k-th component of θm(h)) and d summary statistics
Learning: construct a regression random forest to infer parameter values
Output: apply the random forest to the observed data η(y) to deduce a vector of weights wy = {wy(1), . . . , wy(H)}, which provides an empirical posterior distribution for θm,k. wy is used to compute the estimators of the mean, variance and quantiles of the parameter of interest:

Ê(θm,k | η(y)), V̂(θm,k | η(y)), Q̂α(θm,k | η(y))
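A sketch of Algorithm 4 on the toy table above, computing the weights through leaf co-membership in the spirit of quantile regression forests. This is one reading of the weight construction, not the authors' implementation (the R package abcrf); min_samples_leaf is an arbitrary smoothing choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# One regression forest per parameter theta_{m,k}; oob_score=True keeps the
# out-of-bag predictions needed for the error measures discussed below.
reg = RandomForestRegressor(n_estimators=500, min_samples_leaf=5,
                            oob_score=True, random_state=5)
reg.fit(stats, thetas)

train_leaves = reg.apply(stats)        # leaf index of each training point, per tree
obs_leaves = reg.apply([eta_obs])[0]   # leaf reached by eta(y) in each tree

# w_y^(h): average over trees of 1{same leaf as eta(y)} / leaf size.
weights = np.zeros(len(thetas))
for t in range(reg.n_estimators):
    in_leaf = train_leaves[:, t] == obs_leaves[t]
    weights += in_leaf / in_leaf.sum()
weights /= reg.n_estimators            # weights sum to one: empirical posterior

post_mean = np.sum(weights * thetas)                    # E_hat(theta | eta(y))
post_var = np.sum(weights * (thetas - post_mean) ** 2)  # V_hat(theta | eta(y))
order = np.argsort(thetas)                              # weighted median Q_hat_50%
post_median = thetas[order][np.searchsorted(np.cumsum(weights[order]), 0.5)]
```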

Global prior errors
In both contexts, model choice or parameter estimation, a global quality measure of the predictor can be computed, which does not take into account the observed dataset (about which one wants to make inferences). Random forests make it possible to compute errors on the training reference table, using the out-of-bag predictions previously described in the section "ABC-RF for model choice".
For model choice, this type of error is called the prior error rate; it is the misclassification error rate computed over the entire multidimensional prior space. It can be computed as

\frac{1}{H} \sum_{h=1}^{H} I\{ m^{(h)} \neq \hat{m}(\eta(z^{(h)})) \}.

For parameter estimation, the equivalent is the prior mean squared error (MSE) or the normalised mean absolute error (NMAE), the latter being less sensitive to extreme values. These errors are computed as

\mathrm{MSE} = \frac{1}{H} \sum_{h=1}^{H} \left( \theta_{m^{(h)},k} - \hat{\theta}_{m^{(h)},k} \right)^2,

\mathrm{NMAE} = \frac{1}{H} \sum_{h=1}^{H} \frac{\left| \theta_{m^{(h)},k} - \hat{\theta}_{m^{(h)},k} \right|}{\theta_{m^{(h)},k}}.

They can be seen as Monte Carlo approximations of expectations with respect to the prior distribution.
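With the out-of-bag predictions already available from the forests fitted above, these global errors take a few lines; note that the NMAE, as written, assumes a positive-valued parameter.

```python
import numpy as np

# Prior error rate for model choice, from the classifier's out-of-bag predictions.
prior_error_rate = np.mean(oob_pred != m_idx)

# Prior MSE and NMAE for parameter estimation, from the regression forest's
# out-of-bag predictions (available because oob_score=True above).
theta_oob = reg.oob_prediction_
mse = np.mean((thetas - theta_oob) ** 2)
nmae = np.mean(np.abs(thetas - theta_oob) / thetas)  # assumes theta > 0
```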

Local posterior errors


In the present paper, we propose posterior versions of these errors, which target the quality of prediction with respect to the posterior distribution. As such errors take the observed dataset η(y) into account, we refer to them as local posterior errors.
For model choice, the posterior probability provided by Algorithm 3 is a confidence measure of the selected scenario given the observation. Therefore,

1 − P̂(m = m̂(η(y)) | η(y))

directly yields the posterior error associated with η(y): P̂(m ≠ m̂(η(y)) | η(y)).

For parameter estimation, when inferring θm,k, a point-wise analogous measure of a local error can be computed as the posterior expectations

E\left( \left( \theta_{m,k} - \hat{\theta}_{m,k} \right)^2 \,\middle|\, \eta(y) \right) \quad \text{and} \quad E\left( \frac{\left| \theta_{m,k} - \hat{\theta}_{m,k} \right|}{\theta_{m,k}} \,\middle|\, \eta(y) \right). \tag{1}

We approximate these expectations by

\sum_{h=1}^{H} w_y^{(h)} \left( \theta_{m^{(h)},k} - \hat{\theta}_{m^{(h)},k} \right)^2 \quad \text{and} \quad \sum_{h=1}^{H} w_y^{(h)} \frac{\left| \theta_{m^{(h)},k} - \hat{\theta}_{m^{(h)},k} \right|}{\theta_{m^{(h)},k}}.

We again use the out-of-bag information to compute θ̂m(h),k, hence avoiding the (time-consuming) production of a second reference table, and assume that the weights wy from the regression random forest are good enough to approximate any posterior expectation of a function of θm,k: E(g(θm,k) | η(y)).
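Under that assumption, the two weighted approximations are one line each, reusing the weights and the out-of-bag predictions computed in the sketches above.

```python
import numpy as np

# Weighted Monte Carlo approximations of the local posterior errors (1),
# with theta_oob playing the role of theta_hat_{m(h),k}.
local_mse = np.sum(weights * (thetas - theta_oob) ** 2)
local_nmae = np.sum(weights * np.abs(thetas - theta_oob) / thetas)
```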
Another, more expensive, strategy to evaluate the posterior expectations (1) is to construct new regression random forests using the out-of-bag vectors of values

\left( \theta_{m^{(h)},k} - \hat{\theta}_{m^{(h)},k} \right)^2 \quad \text{or} \quad \frac{\left| \theta_{m^{(h)},k} - \hat{\theta}_{m^{(h)},k} \right|}{\theta_{m^{(h)},k}},

depending on the targeted error. The observation η(y) is then fed to these forests to estimate the expectations (1).
Note that the values θ̂m(h),k in the previous formulas can be replaced by either the approximated posterior expectations Ê(θm(h),k | η(y)) or the posterior medians Q̂50%(θm(h),k | η(y)), again using the out-of-bag information, to provide the local posterior errors. Both in the present paper (see main text, Materials and Methods section) and in various tests that we carried out on different inferential setups and datasets (results not shown), we found that the posterior median provides better parameter estimation accuracy than the posterior expectation (aka posterior mean). This trend also holds for the global prior errors, which can be computed using either the mean or the median as point estimate.
As a final comment, it is worth noting that, so far, a common practice has been to evaluate the quality of prediction (for model choice or parameter estimation) in the neighborhood of the observed dataset, that is, around η(y) rather than exactly at η(y). For model choice, Estoup et al. (2018) use the so-called posterior predictive error rate, which is an error of this type. In this case, some simulated datasets of the reference table close to the observation are selected using a Euclidean distance, new pseudo-observed datasets are then simulated using similar parameter values, and the error is computed on these datasets (see also Lippens et al., 2017, for a similar approach in a standard ABC framework). However, the main problem with this approach is the difficulty of specifying the size of the area around the observation, especially when the number of summary statistics is large. We therefore no longer recommend the use of such a "neighborhood" error, but rather recommend computing the local posterior errors detailed above, as the latter measure prediction quality exactly at the position of interest η(y).

References
Beaumont, M. A. (2010). Approximate Bayesian Computation in Evolution and Ecology. Annual
Review of Ecology, Evolution, and Systematics, 41:379–406.

Beaumont, M. A., Zhang, W., and Balding, D. (2002). Approximate Bayesian Computation in
Population Genetics. Genetics, 162(4):2025–2035.
Estoup, A., Raynal, L., Verdu, P., and Marin, J.-M. (2018). Model choice using Approximate Bayesian Computation and Random Forests: analyses based on model grouping to make inferences about the genetic history of Pygmy human populations. Journal de la Société Française de Statistique, 159(3):167–190.

Kingman, J. F. C. (1982). On the Genealogy of Large Populations. Journal of Applied Probability,
19(A):27–43.
Lippens, C., Estoup, A., Hima, M. K., Loiseau, A., Tatard, C., Dalecky, A., Bâ, K., Kane, M.,
Diallo, M., Sow, A., Niang, Y., Piry, S., Berthier, K., Leblois, R., Duplantier, J. M., and Brouat,
C. (2017). Genetic structure and invasion history of the house mouse (Mus musculus domesticus)
in Senegal, West Africa: a legacy of colonial and contemporary times. Heredity, 119(2):64–75.
Pudlo, P., Marin, J.-M., Estoup, A., Cornuet, J.-M., Gautier, M., and Robert, C. P. (2016). Reliable
ABC model choice via random forests. Bioinformatics, 32(6):859–866.
Raynal, L., Marin, J.-M., Pudlo, P., Ribatet, M., Robert, C. P., and Estoup, A. (2019). ABC random forests for Bayesian parameter inference. Bioinformatics, 35(10):1720–1728.
