A Spatial Autoregressive Random Forest Algorithm For Small-Area Spatial Prediction
Keywords and phrases: Areal Unit Data, Random Forests, Property Prices, Spatial Autoregressive Models.
1. Introduction. Spatial areal unit data are prevalent in fields including ecology (Brewer
and Nolan, 2007), economics (Kawabata, Naoi and Yasuda, 2022), and epidemiology (Lee
and Anderson, 2023), and the aims of modelling these data include hotspot identification
(Knorr-Held and Raßer, 2000), boundary detection (Lee, Meeks and Pettersson, 2021), eco-
logical regression (Wang et al., 2022), and the quantification of spatial inequalities (Jack, Lee
and Dean, 2019). Unlike for point-level data, spatial prediction is not normally the inferen-
tial goal, because there is one data value for each areal unit and hence nothing to predict.
However, areal unit data sometimes contain missing values, making spatial prediction an im-
portant methodological challenge. These missing values could be caused by the observed
value not existing, not being measured or being suppressed, the latter occurring because it
may disclose the identity of individuals. Here, we model median property prices at a small-
area scale in Scotland, and these data are only publicly released if 5 or more properties sold
in a year, leading to around 9% of the small areas having missing values.
Statistical models for these areal unit data typically represent the mean function with a
linear combination of available features and a set of random effects, with the latter capturing
any residual spatial autocorrelation in the data after feature adjustment. Conditional autore-
gressive (Besag, York and Mollié, 1991) models and spatial autoregressive models (Whittle,
1954) are commonly used for this purpose, which capture this spatial autocorrelation by
smoothing the random effects in neighbouring areal units towards each other. In contrast,
machine learning algorithms are the state-of-the-art approach to non-spatial prediction, with
examples including random forests (Breiman, 2001), gradient boosting machines (Friedman,
2001) and neural networks (LeCun et al., 1990). These algorithms model the relationship be-
tween each feature and the target variable as a complex non-linear function, typically leading
to improved predictive performance compared to models with linear feature-target relation-
ships. These competing paradigms thus utilise different aspects of spatial areal unit data to
make predictions, with machine learning algorithms utilising complex non-linear feature-
target relationships whilst ignoring residual spatial autocorrelation, while spatial smoothing
models capture this autocorrelation at the expense of simpler feature-target relationships.
The use of machine learning algorithms in spatial statistics is a growing research area, with
Berrocal et al. (2020) and Credit (2022) comparing the predictive performance of traditional
spatial statistical models and machine learning algorithms. A number of hybrid methodolo-
gies have also been proposed, which for point-level spatial data include the random forest
regression Kriging (RFRK, Hengl et al., 2015) and random forest generalised least squares
(RF-GLS, Saha, Basu and Datta, 2023) algorithms. For areal unit data, Xia, Stewart and Fan
(2021) and Soltani et al. (2022) incorporate spatially lagged features into tree-based machine
learning models, while Georganos et al. (2021) propose a geographical random forest (GRF)
algorithm that fits a separate local random forest for each areal unit using only nearby data
points. In the related field of image analysis, convolutional neural networks (CNN, see LeCun, Bengio and Hinton, 2015) have been developed, which extend neural network models
by spatially smoothing the features and subsequent nodes in the network using a spatial mov-
ing average filter applied to each pixel’s 8 neighbouring pixels. These pixel-based models
have been extended to irregularly shaped areal unit data by graphical convolutional neural
networks (GCNN, see for example Kipf and Welling, 2017 and Zhu et al., 2022), which re-
place the regular spatial moving average filter with an irregular one based on the geographical
contiguity of the areal units. However, unlike the point-level RFRK and RF-GLS algorithms,
the above set of machine learning algorithms for areal unit data do not explicitly allow for
residual spatial autocorrelation in the target variable after feature adjustment.
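To make the filtering idea concrete, the following Python sketch (our own illustration, using an adjacency-list representation rather than code from any of the cited packages) applies one pass of the contiguity-based moving-average filter that spatially lagged features and graph convolutions build on:

```python
def row_standardise(W):
    """Turn a binary adjacency list {unit: [neighbours]} into row-standardised
    weights, so that each unit's weights sum to one."""
    return {i: [(j, 1.0 / len(nbrs)) for j in nbrs] for i, nbrs in W.items()}

def spatial_lag(x, W_tilde):
    """One pass of the moving-average filter: each unit is replaced by the
    weighted mean of its neighbours' values."""
    return {i: sum(w * x[j] for j, w in nbrs) for i, nbrs in W_tilde.items()}

# A three-unit chain 0 - 1 - 2: each unit is smoothed towards its neighbours
W = {0: [1], 1: [0, 2], 2: [1]}
lagged = spatial_lag({0: 0.0, 1: 3.0, 2: 6.0}, row_standardise(W))
```

A GCNN applies such a filter repeatedly, interleaved with non-linear transformations of the node values.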
Therefore, this paper proposes an iterative SPatial AutoregRessive random forest algorithm called SPAR-Forest for predicting spatial areal unit data, which is a novel fusion of
spatial correlation (smoothing) models and random forests that overcomes the above limita-
tion. This algorithm incorporates flexible feature-target relationships via a random forest and
residual spatial autocorrelation via a spatial random effects model, and it iteratively re-fits
each component based on the current value of the other. The total number of iterations is
one of the tuning parameters of the algorithm, which collectively are optimised via a 10-fold
cross validation procedure. A random forest is the machine learning algorithm used to capture
non-linear feature-target effects due to its computational efficiency and inbuilt bootstrapping
procedure, because the latter allows approximately out-of-sample predictions to be obtained
for the training set via out-of-bag predictions. For details of why this is needed see Section 4.
This methodology is motivated by a new study aiming to predict median property prices in
2018 at the small-area scale in Scotland, and details of this study are presented in Section 2.
Section 3 provides a review of competitor prediction models, while our novel SPAR-Forest
algorithm is described in Section 4. The study design used for assessing predictive perfor-
mance is outlined in Section 5, along with the metrics used to measure predictive accuracy.
Section 6 presents the results of a simulation study that compares the prediction accuracy of
a range of models under different fixed conditions, while the results of the motivating study
are presented in Section 7. Finally, the paper ends in Section 8 with a summary of the main
findings and areas for future work.
2. Motivating study. The aim of the study is to predict median property prices at the
small-area scale in Scotland in 2018, which is the most recent year of data that are publicly
available. The data relate to spatial units called Data Zones (DZ), which are a small-area
geography containing between 500 and 1,000 people. Data Zones nest within 32 larger Local
Authorities (LA), which are the administrative units that run public services such as schools
and rubbish collections. Three of these LAs (Na h-Eileanan Siar, Orkney, and Shetland) are
island communities that contain only 95 DZs in total, which are removed to avoid having
small numbers of DZs in an LA when splitting the data into training and test sets. This leaves
N = 6,881 DZs as the study region, which comprise mainland Scotland and some of the
islands. The data used in this study are described below, and unless otherwise stated were
obtained from https://statistics.gov.scot/home.
2.1. Target variable. The target variable is the median selling price of all properties sold
in 2018, with the median being used because it is robust to outlying observations. Median
prices that are based on fewer than 5 sales are suppressed (or do not exist in the case of zero
sales) to ensure individual properties are not identifiable, which results in around 9% of DZs
having missing values. Additionally, one DZ had a median price of just £600, and as this
is likely to be an error this value is treated as missing. The remaining data exhibit a skewed
distribution (see Section 1.1 of the supplemental material) that ranges between £19,500 and £878,000, with a median value of £139,282. Figure 1 displays the spatial patterns in median
property prices across the two largest cities of Edinburgh (A, top) and Glasgow (B, bottom),
while the whole of Scotland is not shown because most DZs would then be too small to see.
In the figure DZs with missing property prices are not shaded, which in some cases makes
them appear to be white / very light grey when plotted over the background map. The figure
shows that prices are higher in Edinburgh than in Glasgow, with median prices of £230,000 and £122,000 respectively. Glasgow also exhibits a much higher proportion of DZs with missing property prices than Edinburgh, at 16.8% and 4.0% respectively. These missing values appear to be spatially clustered in Glasgow, whereas in Edinburgh they appear to be more randomly scattered. In Glasgow, three of the most prominent clusters
of missing values are in the residential areas of Drumchapel in the far north-west (south-west
of Bearsden), Castlemilk in the south (north-east of Clarkston), and in the east-end of the city
(south of Stepps).
2.2. Features. A number of features that are likely to explain the spatial variation in median property prices were obtained, including characteristics of each DZ itself and the properties situated within it. Some of these features contain a small number of missing values,
which are imputed using the K nearest neighbours (KNN) algorithm with K = 5 as recom-
mended by Kuhn and Johnson (2019). Additionally, a very small number of clear outliers
were assumed to be data errors and imputed as above. The numeric features were then stan-
dardised to have a mean of zero and a standard deviation of one. The set of features is sum-
marised below, with additional exploratory analysis of their distributions given in Section 1.2
of the supplemental material.
2.2.1. Property characteristics. Average property size is measured by the mean number
of rooms excluding bathrooms and kitchens, while property type is summarised by the per-
centages of: (i) flats; and (ii) semi-detached / detached houses; in each DZ. Additionally, the
density of properties is summarised by the number of dwellings per hectare. Finally, council
tax is a levy paid by each householder for public services, and the council tax band of a prop-
erty provides a crude measure of a property’s worth. The latter has 8 levels labelled A to H,
with the cheapest properties in band A and the most expensive in band H. The percentages
of properties in each of these 8 bands are available, but as they are highly correlated, principal components analysis (PCA) is applied to obtain independent features. The first 5 PCs explain over 95% of the variation in these variables, and hence are used in the prediction model.
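As an illustration of this dimension reduction step, the following Python sketch (our own code, not the study's R implementation) retains the fewest principal components explaining at least 95% of the variance:

```python
import numpy as np

def pca_reduce(X, var_explained=0.95):
    """Replace correlated columns (e.g. the 8 council tax band percentages) with
    the fewest principal component scores explaining >= var_explained."""
    Xc = X - X.mean(axis=0)                          # centre each column
    evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(evals)[::-1]                  # largest eigenvalue first
    evals, evecs = evals[order], evecs[:, order]
    cum = np.cumsum(evals) / evals.sum()             # cumulative variance proportion
    q = int(np.searchsorted(cum, var_explained)) + 1
    return Xc @ evecs[:, :q], cum                    # uncorrelated scores + profile
```

The returned scores are mutually uncorrelated by construction, which is the property exploited in the prediction model.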
FIG 1. Maps of median property prices in each DZ in Edinburgh (A, top) and Glasgow (B, bottom). DZs with missing median property prices have no colour shading.
2.3. Study aims. Within the overarching aim of spatial areal unit prediction, this study
addresses three key questions. Firstly, how does the predictive performance of the proposed
SPAR-Forest algorithm compare to machine learning algorithms and spatial CAR / SAR
smoothing models? Secondly, how does property price predictability vary regionally across
Scotland, and which areas can be predicted with the greatest and least amounts of accuracy
and precision? Thirdly, what are the likely median property prices for the 9% of Data Zones
that have missing values, and how do these predictions compare to the prices in the remaining
Data Zones? This paper will thus provide users with information on average property prices
in their local areas, as well as access to a state-of-the-art prediction algorithm for spatial areal
unit data.
3.1. Normal linear model. The simplest baseline model is the normal linear model, which when applied to the training set is given by

(1)   Y(Ak) = β0 + x(Ak)⊤β + ϵ(Ak),   ϵ(Ak) ∼ N(0, σ²),   for k = 1, . . . , K.
3.2. Spatial smoothing models. Residual spatial autocorrelation not accounted for by the features is ubiquitous in areal unit data, and it can be modelled by adding autocorrelated random effects ϕ = [ϕ(A1), . . . , ϕ(AK)] to (1) via

(2)   Y(Ak) = β0 + x(Ak)⊤β + ϕ(Ak) + ϵ(Ak),   ϵ(Ak) ∼ N(0, σ²),   for k = 1, . . . , K.
In this model a row standardised neighbourhood matrix W̃ is commonly used rather than
the original binary matrix W. Again ρ is a spatial dependence parameter, with ρ = 0 cor-
responding to independence (the precision matrix again simplifies to the identity matrix)
while stronger autocorrelation is captured as ρ increases. The full spatial model comprises
the data likelihood model (2), one of the random effects models (3) or (4), and prior dis-
tributions for the parameters (β0, β = (β1, . . . , βp), σ², τ, ρ). Weakly informative priors are assumed here for these parameters to let the data speak for themselves, which are the ones recommended by the INLA software used for inference (Rue, Martino and Chopin, 2009). Specifically: (i) βj ∼ N(0, 100,000) for j = 0, . . . , p; (ii) ln(σ⁻²) ∼ log-gamma(1, 0.01); (iii) τ ∼ log-gamma(1, 0.01); and (iv) ln[ρ/(1 − ρ)] ∼ N(0, 100). Once fitted to the training
set the model is used to predict property prices in the test set by sampling from the posterior
predictive distribution, and details are provided in Section 2.1 of the supplemental material
accompanying this paper.
3.3. Random forest model. Random forests (RF) are one of the best performing ma-
chine learning prediction algorithms (Boehmke and Greenwell, 2020), and were originally
proposed by Breiman (2001). They are based on the additive decomposition
4. Methodology. This section proposes a novel iterative spatial prediction algorithm for
areal unit data called SPAR-Forest, which uses random forests to estimate non-linear
feature-target relationships and Bayesian spatial autoregressive models to allow for any resid-
ual spatial autocorrelation. A Bayesian approach to inference using INLA is taken for the
spatial smoothing model, because it provides estimates of the spatial random effects for the
training set which are used in our iterative algorithm. We note that maximum likelihood approaches can be used to estimate spatial random effects models, such as via the R package
spmodel. However, these estimation algorithms typically integrate out the random effects
rather than estimating them, which precludes their use here. In principle, any spatial smooth-
ing model that is appropriate for areal unit data could be used, but here we illustrate our
approach with CAR and SAR models as they are the most popular in the areal unit modelling
literature. The rationale for our algorithm is outlined in Section 4.1, while algorithmic details
are provided in Section 4.2.
4.1. Overall approach and rationale. The observed data {Y (Ak )} represent error-prone
measurements of the true values {m[x(Ak )]}, leading to the decomposition
4.2. Implementation. The iterative SPAR-Forest prediction algorithm has the following
tuning parameters: (i) the number of iterations of the algorithm R; (ii) the random forest specific tuning parameters (mtry, minnode); and (iii) the CAR / SAR model tuning parameter D. All of these are estimated using a 10-fold cross validation procedure applied to the training
set, details of which are given in the next section. Thus, the algorithm below is presented for
a fixed set of tuning parameters.
Algorithm - SPAR-Forest
Stage 0: Initialise the random effects by setting ϕ̃(Ak) = 0 for all training set observations, and fix the tuning parameters (mtry, minnode, D, R).
Stage 1: Iterate the following steps r = 1, . . . , R times.
A. Compute the decorrelated target variable Z(Ak) = Y(Ak) − ϕ̃(Ak) for observations in the training set k = 1, . . . , K.
B. Fit a random forest model with tuning parameters (mtry, minnode) to the training set with features {x(Ak)} and target variable {Z(Ak)} for k = 1, . . . , K, to estimate the effects of the features on the decorrelated target variable. Use this model to produce out-of-sample predictions {m̂(k)[x(Ak)]} for k = 1, . . . , N, covering both the training and test sets, with the former being produced using the out-of-bag approach.
C. Fit the following spatial random effects model described in Section 3.2 to the training
data using INLA:
Further details about random forests (Stage 1 B.) and the Bayesian CAR / SAR model
(Stage 1 C. and 2) are provided in Section 3 of the main paper and Section 2 of the supple-
mental material. The above algorithm is implemented in R, and software allowing others to
apply the method to their own data is available at https://github.com/vinnydavies/SPARforest
and described in more detail in Section 3 of the supplemental material. Specifically, the soft-
ware fits two variants of the SPAR-Forest algorithm, the first using the CAR model (3) to
represent the spatial random effects and the second replacing this with the spatial autoregres-
sive model (4). The software uses the ranger (Wright and Ziegler, 2017) package to fit the
random forests, and the INLA package (Rue, Martino and Chopin, 2009) to fit the Gaussian
CAR / SAR models.
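To make the structure of the iterations concrete, the following Python sketch illustrates Stages 0 and 1 under strong simplifications: an ordinary least squares fit stands in for the random forest (the actual software uses ranger's out-of-bag predictions), and a single generalised-ridge solve with a SAR-type precision matrix stands in for the Bayesian CAR / SAR fit in INLA. All function names and default values below are our own illustration.

```python
import numpy as np

def spar_forest_sketch(y, X, W_tilde, R=10, rho=0.9, lam=1.0):
    """Illustrative loop: alternate between (A) decorrelating the target,
    (B) fitting a mean model, and (C) re-estimating the spatial random
    effects from the residuals of the mean model."""
    K = len(y)
    B = np.eye(K) - rho * W_tilde            # SAR-type operator (I - rho * W~)
    Q = B.T @ B                              # implied random-effect precision structure
    phi = np.zeros(K)                        # Stage 0: initialise random effects at zero
    Xd = np.column_stack([np.ones(K), X])    # design matrix with intercept
    for _ in range(R):
        z = y - phi                          # Stage 1A: decorrelated target
        beta, *_ = np.linalg.lstsq(Xd, z, rcond=None)   # Stage 1B: stand-in for the RF
        m_hat = Xd @ beta
        resid = y - m_hat                    # Stage 1C: smooth the residuals spatially
        phi = np.linalg.solve(np.eye(K) + lam * Q, resid)
    return m_hat, phi
```

In the full algorithm the number of iterations R is itself a tuning parameter chosen by cross validation, rather than a fixed argument as here.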
The SPAR-Forest algorithm uses the random forest to make out-of-sample predictions {m̂(k)[x(Ak)]} for the K observations in the training set via an out-of-bag approach, which
are subsequently used in step C. An out-of-bag prediction of Y (Ak ) is made by averaging
the predictions from the sub-forest of trees that were fitted without using Y (Ak ), which is
possible as random forests use a different bootstrapped (sampled with replacement) copy of
the training data when fitting each tree in the forest. Out-of-bag predictions are needed so
that the training and test set predictions are generated in the same way, i.e., without using
the data point in question. If one instead replaced {m̂(k) [x(Ak )]}K k=1 with in-sample fitted
values, then they would likely be closer to the observed data compared to those in the test
set, leading to overfitting of the training set and an underestimation of predictive uncertainty
(see Section 4.3 of the supplemental material for an example).
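The out-of-bag mechanics can be illustrated with a deliberately trivial "forest" in which each tree simply predicts the mean of its own bootstrap sample; everything below is our own illustration rather than the ranger implementation.

```python
import random

def oob_predictions(y, n_trees=500, seed=1):
    """Out-of-bag averaging: each point is predicted only by the trees whose
    bootstrap sample never contained it."""
    rng = random.Random(seed)
    K = len(y)
    sums, counts = [0.0] * K, [0] * K
    for _ in range(n_trees):
        idx = [rng.randrange(K) for _ in range(K)]   # bootstrap: sample with replacement
        in_bag = set(idx)
        pred = sum(y[i] for i in idx) / K            # this 'tree's' constant prediction
        for k in range(K):
            if k not in in_bag:                      # k was never drawn: out-of-bag
                sums[k] += pred
                counts[k] += 1
    return [sums[k] / counts[k] if counts[k] else None for k in range(K)]
```

Because each bootstrap sample omits roughly 37% of the observations, every training point accumulates out-of-bag predictions from a substantial sub-forest.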
This ability to produce out-of-bag predictions for the training set in a computationally effi-
cient manner is the main reason why random forests are used to capture feature-target effects
in our algorithm, rather than using other machine learning algorithms such as neural net-
works that only produce in-sample fitted values by default. In principle however, one could
apply a bootstrapping approach to a neural network for this purpose, as bootstrapping neural
networks has been implemented in a variety of contexts (see for example, Franke and Neu-
mann, 2000 and Palmer et al., 2022). In practice, however, this is computationally infeasible. For example, running a single random forest on the motivating study data for 1,000 bootstrapped trees in the forest takes 3.6 seconds on an iMac computer with a 3.8 GHz processor and 32GB of memory, whereas running a neural network (with 3 hidden layers each having 64 nodes and run for 1,000 epochs) repeatedly for 1,000 bootstrapped data samples takes 8.3
hours. Thus, incorporating a neural network with such a bootstrapping procedure within our
proposed algorithm for a combined R = 20 iterations would take approximately 167 hours
for one model run, and the algorithm would need to be run large numbers of times for tuning
via a cross validation approach (see below).
An alternative would be to run the neural network only once for each of the R iterations of our algorithm, and use in-sample predictions {m̂[x(Ak)]}, k = 1, . . . , K, for the training set in the
spatial smoothing step C. However, initial testing showed that this approach leads to poor
performance, and details are given in Section 4 of the supplemental material. Also included
in that section is a comparison of using a random forest and a neural network for predicting
the motivating property price data, because it shows that the random forest performs better
and is hence likely to be more appropriate for our proposed algorithm.
5. Study design for assessing predictive performance. In both the simulation (Section
6) and property price (Section 7) studies the predictive performance of two variants of the
SPAR-Forest algorithm are assessed, namely CAR-Forest, which uses (3) for the random effects, and SAR-Forest, which uses (4). These iterative prediction
algorithms are compared against the following competitors: (i) a normal linear model (LM
- Section 3.1); (ii) spatial CAR and SAR models (CAR / SAR - Section 3.2); (iii) a random
forest model (RF - Section 3.3); and (iv) a geographical random forest model (GRF - Section
2.3 supplemental material). Additionally, for the motivating property price study we also
present results from a simplified non-iterative form of the SPAR-Forest algorithm equivalent
to R = 1, but as it did not perform as well as the full algorithm its results are shown in Section
6.2 of the supplemental material. The normal linear model is included for its simplicity,
while the remaining competitors comprise state-of-the-art models in spatial statistics, machine
learning and existing fusions of these paradigms.
The predictive performance of each model is assessed by randomly splitting the Data
Zones into an 80% training set and a 20% test set, which for the motivating property price
study include 5,011 and 1,253 Data Zones respectively. Additionally, to ensure the results
for the property price study are not affected by the particular choice of training-test split, we
repeat the prediction experiment on 5 independent training-test splits. All the models except
the normal linear model contain tuning parameter(s), which are initially optimised using the
training set. This optimisation is done using a 10-fold cross validation procedure, which splits
the training set into ten random subsets of approximately equal size. Each model is fitted to
nine of these subsets with different combinations of tuning parameters, and for each combi-
nation the observations in the tenth subset, known as the validation set, are predicted. This
process is repeated treating each of the ten subsets as the validation set once, and the optimal
values of the tuning parameters are the combination that minimise the root mean square error
(see below for the definition) of the predictions. This process is repeated independently for
each of the five training and test splits.
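The cross validation tuning procedure described above can be sketched as follows; the fold construction and grid search are generic, and the fit / predict callables are placeholders for whichever model is being tuned (this is our own sketch, not the study's code).

```python
import random

def kfold_indices(n, k=10, seed=42):
    """Randomly partition indices 0..n-1 into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def tune(grid, fit, predict, X, y, k=10):
    """Grid search: for each tuning combination, average the validation RMSE
    over the k folds and return the combination minimising it."""
    folds = kfold_indices(len(y), k)
    best, best_rmse = None, float("inf")
    for params in grid:
        sq_err, n_val = 0.0, 0
        for fold in folds:
            held = set(fold)
            train = [i for i in range(len(y)) if i not in held]
            model = fit([X[i] for i in train], [y[i] for i in train], params)
            for i in fold:                    # predict the held-out validation set
                sq_err += (predict(model, X[i]) - y[i]) ** 2
                n_val += 1
        rmse = (sq_err / n_val) ** 0.5
        if rmse < best_rmse:
            best, best_rmse = params, rmse
    return best, best_rmse
```

After tuning, the model is refitted to the full training set at the selected parameter values, exactly as described in the text.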
Once the optimal tuning parameters have been chosen, each model is refitted to the full
training set using these optimal values, and is then used to make out-of-sample predictions for
the test set. As median property price is a continuous measurement, the quality of these pre-
dictions is assessed using the following standard metrics. In what follows {Y (Ar ), Ỹ (Ar )}
respectively denote the observed median property price and the prediction for the rth areal
unit in the test set, where following the notation in Section 3, r = K + 1, . . . , N .
Root mean square error − RMSE = √( [1/(N − K)] Σ_{r=K+1}^{N} [Ỹ(Ar) − Y(Ar)]² ).

Median absolute error − MAE = Median_{r=K+1,...,N} |Ỹ(Ar) − Y(Ar)|.

Coverage probability − CP = the proportion of the N − K 95% prediction intervals that contain the true value.

Average interval width − AIW = the average width of the N − K 95% prediction intervals.
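As a concrete reference, the four metrics can be computed as follows (a plain Python sketch of the definitions above, not the study's own code):

```python
import statistics

def rmse(y_obs, y_pred):
    """Root mean square error over the test set."""
    return (sum((p - o) ** 2 for o, p in zip(y_obs, y_pred)) / len(y_obs)) ** 0.5

def mae_median(y_obs, y_pred):
    """Median absolute error: robust to a few units with very large errors."""
    return statistics.median(abs(p - o) for o, p in zip(y_obs, y_pred))

def coverage_and_width(y_obs, lower, upper):
    """CP: share of 95% intervals containing the truth; AIW: their average width."""
    n = len(y_obs)
    cp = sum(l <= o <= u for o, l, u in zip(y_obs, lower, upper)) / n
    aiw = sum(u - l for l, u in zip(lower, upper)) / n
    return cp, aiw
```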
The accuracy of the point predictions is summarised by both the RMSE and MAE metrics, with the best model minimising both quantities. We present both metrics because the RMSE, which averages squared errors, is much less robust to individual DZs with big prediction errors than the MAE. The appropriateness of the 95% prediction
intervals is quantified by the coverage probability and average interval width, and the former
should be close to 0.95 if predictive uncertainty is appropriately captured. Finally, the average
interval width should be as small as possible as long as the coverage probability is close to
0.95.
6. Simulation study. This section presents a simulation study, whose aim is to compare
the predictive performance of the SPAR-Forest algorithm against the competitor prediction
models outlined above in a number of controlled scenarios.
6.1. Data generation. The study is based on the N = 746 Data Zones contained within
the Glasgow City local authority, because using all of mainland Scotland would make the
simulation study computationally infeasible. This is because the complete study involves fit-
ting each of the models described in the main paper thousands of times, due to both the
number of simulated data sets generated under multiple scenarios and the optimisation of the
tuning parameters required for each model. Each simulated data set consists of a continuous
target variable {Y(Ak)}, five features {x(Ak) = [x1(Ak), . . . , x5(Ak)]}, k = 1, . . . , N, and the easting and northing spatial coordinates of the Data Zone centroids. The features are assumed to
be independent in space, which ensures they are not collinear with the additional residual
spatial autocorrelation induced into the target variable (see below). Each feature is generated
by sampling N realisations from an independent uniform random variable, which has a min-
imum value of 0 and a maximum value of 2π . These limits are chosen so that the non-linear
feature-target relationships outlined below exhibit sizeable non-linearity.
The target variable is generated as Y (Ak ) = f [x(Ak )] + ϕ(Ak ) + ϵ(Ak ), a linear combi-
nation of the true value f [x(Ak )] + ϕ(Ak ) and independent zero-mean Gaussian error ϵ(Ak )
with standard deviation σ = 1. The true value thus depends on both the features x(Ak ) and
residual spatial autocorrelation induced by the random effect ϕ(Ak ), and the exact specifi-
cation of f [x(Ak )] + ϕ(Ak ) is varied across the four scenarios described below. The set of
spatial random effects for all DZs are generated from a zero-mean multivariate Gaussian dis-
tribution, where the covariance matrix is equivalent to that from the CAR model proposed by
Leroux, Lei and Breslow (2000). Here, the spatial neighbourhood matrix W is constructed
using the 5 nearest neighbours rule, and we set ρ = 0.9 to ensure strong spatial dependence.
The variance of these spatially autocorrelated random effects controls the size of its influence
on the target variable, and this is varied across the scenarios described below.
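A minimal Python sketch of this data generating process, with illustrative parameter values and our own function names (the study itself was run in R), is:

```python
import numpy as np

def simulate_dataset(coords, rho=0.9, tau2=1.0, sigma=1.0, seed=0):
    """One synthetic data set in the spirit of Section 6.1: five Uniform(0, 2*pi)
    features, a scenario-1-style non-linear f, and Leroux-CAR random effects
    built from a 5-nearest-neighbour W."""
    rng = np.random.default_rng(seed)
    N = len(coords)
    X = rng.uniform(0.0, 2.0 * np.pi, size=(N, 5))
    f = 3.0 * np.sin(X[:, 0]) + 2.0 * np.log(X[:, 1])          # only x1, x2 matter
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    W = np.zeros((N, N))
    for i in range(N):
        for j in np.argsort(dist[i])[1:6]:                     # 5 nearest neighbours rule
            W[i, j] = W[j, i] = 1.0
    Q = rho * (np.diag(W.sum(axis=1)) - W) + (1.0 - rho) * np.eye(N)  # Leroux precision
    Sigma = np.linalg.inv(Q)
    phi = rng.multivariate_normal(np.zeros(N), tau2 * (Sigma + Sigma.T) / 2.0)
    y = f + phi + rng.normal(0.0, sigma, N)                    # truth + independent error
    return y, X, phi
```

Varying tau2 relative to the spread of f reproduces the balance between feature effects and spatial autocorrelation that distinguishes the four scenarios.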
6.2. Study design. Fifty simulated data sets are generated under each of four different
scenarios, and the set of models outlined in Section 5 are compared in this study, which
include both the CAR and SAR variants of the SPAR-Forest algorithm and a range of com-
petitors from both spatial statistics and machine learning. In all of the scenarios the first two
features {x1 (Ak ), x2 (Ak )}N
k=1 have relationships with the response while the remaining 3
features do not, with the latter included in all models to ensure the results are not adversely
affected by the presence of unimportant features. Additionally, the easting and northing co-
ordinates of each Data Zone’s centroid are included as features in the linear model, random
forest and geographical random forest because they have no other way to capture the spatial
structure in the data, while this is not necessary for the CAR / SAR and the SPAR-Forest
approaches. The 4 different scenarios and their rationales are described below.
Scenario 1: The features have non-linear relationships with the response via f [x(Ak )] =
3 sin[x1 (Ak )] + 2 ln[x2 (Ak )], and the random effects variance is chosen so that the
marginal standard deviations of the feature component f [x(Ak )] and the random effects
ϕ(Ak) are similar. This scenario thus gives the features and the residual spatial autocorrelation roughly equal prominence in influencing the true values.
Scenario 2: The features have non-linear relationships with the response via f [x(Ak )] =
3 sin[x1 (Ak )] + 2 ln[x2 (Ak )], and the random effects variance is chosen so that the
marginal standard deviation of the feature component f [x(Ak )] is twice that of the random
effects ϕ(Ak). This scenario thus makes the features more important than the residual spatial autocorrelation in influencing the true values.
Scenario 3: The features have non-linear relationships with the response via f [x(Ak )] =
1.5 sin[x1 (Ak )] + ln[x2 (Ak )], and the random effects variance is chosen so that the
marginal standard deviation of the feature component f [x(Ak )] is half that of the ran-
dom effects ϕ(Ak). This scenario thus makes the residual spatial autocorrelation more important than the features in influencing the true values.
Scenario 4: The features have linear relationships with the response via f [x(Ak )] =
x1 (Ak ) + x2 (Ak ), and the random effects variance is chosen so that the marginal stan-
dard deviations of the feature and random effect components are similar. This scenario thus compares the models when the feature effects are exactly linear rather than non-linear.
The predictive performances of all the models are quantified using the approach outlined in
Section 5, with the exception that only a single training-test split is considered for each sim-
ulated data set. The sets of possible tuning parameters for the models used in this simulation
study are described in Section 5.1 of the supplemental material.
6.3. Results of the simulation study. The predictive performances of all models across all
scenarios are summarised in Figure 2 by their RMSE values, while the corresponding results
for the MAE, CP and AIW metrics are presented in Section 5.2 of the supplemental material.
Each boxplot represents the set of out-of-sample RMSE values for a single model across the
50 simulated data sets in a single scenario, and the four panels relate to the four scenarios.
Additionally, the numbers at the top of each boxplot present the mean RMSE over the 50
simulated data sets for ease of reference.
F IG 2. Boxplots showing the out-of-sample RMSE for the test set predictions for each simulated data set in each
scenario from each model. The mean values for the boxplots are presented at the top of each graph.
The figure shows a number of key findings, the first being that the simple linear model performs the worst across the board, because it cannot accommodate non-linear feature effects of unknown shape or residual spatial autocorrelation. In contrast, the SPAR-Forest algorithms perform the best across scenarios 1 to 3, where both non-linear feature-target relationships and residual spatial autocorrelation are present. The CAR-Forest and SAR-Forest
variants show almost identical results, which is not surprising given the similarities between
their specifications. For scenario 4, where all features have linear effects on the target vari-
able, the CAR / SAR models slightly outperform the SPAR-Forest algorithm, which is be-
cause they are the models that are closest to the data generating mechanism. However in this
case, the SPAR-Forest algorithm could easily be extended to accommodate linear feature-
target relationships, simply by putting those features thought to have linear effects into the
CAR / SAR component of the algorithm rather than into the random forest component. Fi-
nally as expected, the random forest models outperform the CAR / SAR models when the feature effects dominate the residual spatial autocorrelation (Scenario 2), while the opposite
is true when the residual spatial autocorrelation is dominant (Scenario 3). In contrast, when
the two components have a similar influence on the target variable (Scenario 1) then these
models perform similarly.
7. Results from the property price study. This section presents the results of the motivating study, focusing on the 3 questions outlined in Section 2.3. Section 7.1 describes the implementation of the models, while Section 7.2 compares their predictive abilities. Section 7.3 provides a local authority comparison of predictive performance from the best performing model, while Section 7.4 predicts median property prices for the Data Zones with missing values. Additional results from this study are presented in Section 6 of the supplemental material, including an assessment of the stability of the iterative SPAR-Forest algorithm across the R iterations (Section 6.1), and predictive performance metrics for simpler non-iterative SPAR-Forest algorithms where R = 1 (Section 6.2).
7.1. Model implementation. All models are applied with the complete set of features described in Section 2, which initial analysis showed performed similarly to using an additional forwards or backwards stepwise feature selection approach. The easting and northing coordinates of each Data Zone's centroid are included as features in the linear model, random forest and geographical random forest because these models have no other way of capturing spatial location, while this is not necessary for the remaining models, which explicitly capture spatial structure.
Initially, the normal linear model was fitted to both median property price and its natural logarithm, and as the residual normality assumption is only plausible on the log scale, all models are applied on this scale in the interests of fairness. The resulting predictions are then back-transformed to the original scale when computing the predictive performance metrics outlined in Section 5. As a number of the models are based on a normality assumption, the back-transformation for the point predictions follows the log-normal result that if X ∼ N(µ, σ²), then E[exp(X)] = exp(µ + σ²/2). Here, the linear, CAR, SAR and SPAR-Forest methods (both CAR and SAR variants) provide estimates of σ² directly from the model, while for the random forest and its geographical extension σ² is estimated by the sample variance of the out-of-bag prediction errors following the ideas in Zhang et al. (2020).
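For concreteness, the back-transformation and the out-of-bag variance estimate described above can be sketched as follows. This is a minimal illustration, not the paper's own implementation, and all function and variable names are ours:

```python
import numpy as np

def backtransform(mu_log, sigma2):
    """Log-normal mean: if X ~ N(mu, sigma^2), then E[exp(X)] = exp(mu + sigma^2 / 2)."""
    return np.exp(mu_log + sigma2 / 2.0)

def oob_sigma2(y_log, oob_pred_log):
    """Estimate sigma^2 by the sample variance of out-of-bag prediction
    errors on the log scale (the random forest / GRF case)."""
    return np.var(y_log - oob_pred_log, ddof=1)

# Example: a log-scale point prediction of 11.8 with estimated sigma^2 = 0.25
price = backtransform(11.8, 0.25)
```

The σ²/2 correction matters in practice: simply exponentiating the log-scale prediction would estimate the median, not the mean, of the log-normal predictive distribution.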
The CAR and SAR models have a single tuning parameter D that determines the construction of the neighbourhood matrix W, and the candidate values considered comprise D ∈ {5, 7, 9}. All models that incorporate a random forest component are implemented with Ntr = 1,000 trees, which initial analyses showed was sufficient for the prediction error to stabilise. Additionally, each random forest model was optimised over all possible combinations of the tuning parameters mtry ∈ {10, 20, 30, 40, 51} and minnode ∈ {1, 5, 10}, where the largest value of mtry is chosen to be equivalent to bagging. The GRF model contains all the random forest tuning parameters as well as the additional parameters (bw, α). Following Georganos et al. (2021), these latter parameters are optimised over all possible combinations of bw ∈ {100, 500, 1,000} and α ∈ {0.25, 0.5, 0.75, 1}. Finally, the SPAR-Forest algorithm (both CAR and SAR variants) was optimised with respect to the tuning parameters from the CAR / SAR model (D), the random forest model (mtry, minnode) and the total number of iterations of the algorithm (R). The same sets of candidate values described above were considered for the first two, together with all possible values R = 1, . . . , 20.
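The SPAR-Forest tuning grid is therefore the Cartesian product of these candidate sets. A quick sketch (variable names are ours; note also that in practice all values of R can typically be assessed along a single run of the iterative algorithm rather than by separate refits):

```python
from itertools import product

# Candidate values from the text: D for the CAR/SAR neighbourhood matrix,
# mtry/minnode for the random forest, R for the number of iterations.
D_vals = [5, 7, 9]
mtry_vals = [10, 20, 30, 40, 51]   # 51 features in total, so mtry = 51 is bagging
minnode_vals = [1, 5, 10]
R_vals = range(1, 21)

# Full grid of candidate configurations evaluated per training-test split
grid = list(product(D_vals, mtry_vals, minnode_vals, R_vals))
# 3 * 5 * 3 * 20 = 900 combinations
```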
7.2. Comparing the predictive ability of the models. The predictive abilities of the models for each training-test split, as well as the mean over all five splits, are summarised in Table 1, which presents the four metrics outlined in Section 5, namely RMSE, MAE, CP and AIW. The two variants of the proposed SPAR-Forest algorithm (CAR-Forest and SAR-Forest) produce the best point predictions in terms of both RMSE and MAE in almost all cases, with the lowest average RMSE coming from SAR-Forest (£39,723) and the lowest average MAE from CAR-Forest (£16,732). These point predictions show an average improvement over all five training-test splits of £1,761 (RMSE) and £814 (MAE) when comparing the best SPAR-Forest algorithm against the best competitor model. The GRF model performs the best of the competitor models in both RMSE and MAE, while the spatial CAR / SAR models slightly outperform a non-spatial random forest.
The 95% prediction intervals from all models exhibit close to their nominal coverage probabilities (CP), with values ranging between 0.939 (SAR model) and 0.955 (GRF) on average across all 5 training-test splits. The SAR-Forest intervals are the narrowest in all cases, with an average width of £136,827, which is £4,392 narrower than those from the existing competitor model with the narrowest intervals (SAR). Overall then, both variants of the proposed SPAR-Forest algorithm outperform their spatial smoothing and machine learning competitors, and there are only small differences between the CAR and SAR variants of the new algorithm. The remainder of this section presents results from the SAR-Forest variant, because it has the smallest RMSE and substantially narrower prediction intervals, albeit with slightly lower coverage than the CAR-Forest variant.
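The four metrics in Table 1 can be computed directly from the held-out observations and the back-transformed predictions and interval limits. A minimal sketch (function and variable names are ours), with MAE implemented as the median absolute error as defined in the Table 1 caption:

```python
import numpy as np

def prediction_metrics(y, yhat, lower, upper):
    """RMSE, median absolute error (MAE as defined in Table 1), empirical
    coverage of the 95% prediction intervals (CP), and average interval
    width (AIW), all on the original price scale."""
    err = y - yhat
    return {
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MAE": float(np.median(np.abs(err))),
        "CP": float(np.mean((y >= lower) & (y <= upper))),
        "AIW": float(np.mean(upper - lower)),
    }
```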
7.3. Comparing the accuracy of property price predictions by local authority. To get a regional view of the predictability of property prices in Scotland, we compute the RMSE and MAE metrics from the SAR-Forest model separately for each of the 29 local authorities in the study. These regional metrics are plotted against each other in Figure 3, which presents the mean values over all 5 training-test splits. The shading of the labels in the figure denotes the median price across the DZs in each LA, with darker colours indicating higher prices. The left panel (A) displays the absolute RMSE and MAE values, while the right panel (B) presents both metrics as percentages of their median property prices, to remove the scaling effect caused by some regions having more expensive properties than others.
Both panels show moderate to strong linear relationships between the RMSE and the MAE metrics, suggesting that the relative accuracy of property price predictions by local authority
TABLE 1
Comparison of the out-of-sample predictive abilities of each model for each data split, and the overall mean over all 5 splits. The acronyms in the table denote: RMSE - root mean square error; MAE - median absolute error; CP - coverage probability; and AIW - average interval width.

Model           Split 1     Split 2     Split 3     Split 4     Split 5        Mean

RMSE
LM              £46,992     £50,132     £49,569     £45,879     £42,331     £46,981
CAR             £41,617     £44,528     £45,084     £40,845     £37,115     £41,838
SAR             £41,812     £44,709     £45,038     £40,904     £37,264     £41,945
RF              £41,222     £46,407     £44,988     £42,622     £36,884     £42,424
GRF             £40,562     £45,713     £44,640     £41,218     £35,288     £41,484
CAR-Forest      £39,240     £43,066     £43,441     £39,964     £35,660     £40,274
SAR-Forest      £39,605     £42,935     £42,001     £39,269     £34,806     £39,723

MAE
LM              £20,401     £19,816     £20,718     £19,753     £18,988     £19,935
CAR             £18,565     £17,519     £17,614     £17,577     £17,388     £17,733
SAR             £18,549     £17,976     £17,844     £17,422     £17,489     £17,856
RF              £18,124     £17,736     £18,255     £18,588     £17,181     £17,977
GRF             £17,680     £17,352     £17,600     £18,585     £16,513     £17,546
CAR-Forest      £16,222     £16,838     £17,184     £17,313     £16,104     £16,732
SAR-Forest      £16,869     £16,645     £17,419     £16,484     £16,357     £16,755

CP
LM                0.932       0.943       0.936       0.943       0.950       0.941
CAR               0.947       0.947       0.938       0.946       0.949       0.945
SAR               0.942       0.940       0.928       0.939       0.945       0.939
RF                0.935       0.951       0.952       0.944       0.960       0.949
GRF               0.947       0.954       0.954       0.951       0.966       0.955
CAR-Forest        0.940       0.944       0.939       0.945       0.958       0.945
SAR-Forest        0.935       0.939       0.943       0.936       0.952       0.941

AIW
LM             £155,091    £148,898    £152,074    £151,954    £153,574    £152,318
CAR            £148,860    £147,725    £150,017    £146,681    £147,125    £148,082
SAR            £144,202    £139,263    £140,811    £139,416    £142,407    £141,219
RF             £148,397    £150,477    £151,874    £148,213    £150,840    £149,960
GRF            £150,160    £148,381    £152,490    £149,820    £151,432    £150,457
CAR-Forest     £144,511    £138,633    £143,863    £140,063    £144,179    £142,250
SAR-Forest     £139,748    £134,057    £137,834    £134,466    £138,032    £136,827
FIG 3. Comparison of the point prediction accuracy of the SAR-Forest algorithm by local authority, as measured by MAE and RMSE. Panel (A) presents the unscaled MAE / RMSE values, while panel (B) presents metrics scaled by the median property price in each LA.
is similar regardless of which metric is used. The left panel (A) shows that LAs with more expensive properties generally have higher prediction errors, with the City of Edinburgh, East Lothian and East Renfrewshire having the least accurate predictions in absolute terms, while Clackmannanshire and Dundee City have the most accurate predictions. However, once these prediction metrics have been scaled to account for differences in median property prices (panel (B)), East Lothian and West Dunbartonshire have the least predictable prices, while Clackmannanshire, Dundee City and Midlothian are the most predictable. These results also show that there are no clear spatial or urban-rural trends in the relative predictability of Scotland's housing market, as nearby or similar areas do not necessarily have similar prediction metrics.
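The regional breakdown above amounts to applying the prediction metrics within each local authority, with panel (B) rescaling by each authority's median price. An illustrative sketch with made-up data (the column names and values are ours, not the study's):

```python
import numpy as np
import pandas as pd

# Illustrative frame: one row per Data Zone, with its local authority (LA),
# observed median price and SAR-Forest prediction.
df = pd.DataFrame({
    "LA":    ["A", "A", "B", "B"],
    "price": [100000.0, 120000.0, 200000.0, 260000.0],
    "pred":  [110000.0, 110000.0, 210000.0, 240000.0],
})

def la_metrics(g):
    err = g["price"] - g["pred"]
    med = g["price"].median()
    return pd.Series({
        "RMSE": np.sqrt(np.mean(err ** 2)),
        "MAE": np.median(np.abs(err)),
        # Panel (B): metrics as a percentage of the LA's median price,
        # removing the scaling effect of more expensive regions.
        "RMSE_pct": 100 * np.sqrt(np.mean(err ** 2)) / med,
        "MAE_pct": 100 * np.median(np.abs(err)) / med,
    })

by_la = df.groupby("LA")[["price", "pred"]].apply(la_metrics)
```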
7.4. Predicting missing median property prices. The SAR-Forest algorithm is now fitted to the data for the 6,264 Data Zones that have non-missing property prices, and is subsequently used to predict prices for the remaining 617 DZs with missing values. The model is fitted with Ntr = 1,000 trees as before, while the tuning parameters are chosen as the medians of the optimal values identified in the 5 training-test splits, specifically: R = 19, D = 9, mtry = 30 and minnode = 5. The distributions of the predictions (617 DZs) and the observed prices (6,264 DZs) are displayed in panel (A) of Figure 4 as density estimates, with the predictions in dark grey and the observed data in light grey. The figure shows that the two distributions are right-skewed with broadly similar shapes, but that the Data Zones with missing values have lower predicted prices on average (median £96,961) than those with available data (median £139,282). The likely reason for this price differential is illustrated in panel (B) of Figure 4, which presents the predictions (dark grey, point estimates and 95% prediction intervals) and observed prices (light grey, observed values) against a measure of socio-economic deprivation, specifically the first principal component of the education domain of the SIMD. The figure shows that property prices decrease as the level
FIG 4. Comparison of the Data Zones with predicted (dark grey - predictions and 95% prediction intervals) and observed (light grey - observations) property prices. The left panel (A) shows density estimates, while the right panel (B) shows the relationship between price / prediction and socio-economic deprivation as measured by the first principal component of the education domain of the SIMD.
of socio-economic deprivation increases as expected, and that a high proportion of the Data
Zones with missing property price data are socio-economically deprived.
8. Discussion. This paper has proposed a novel fusion of random forests and autoregressive spatial smoothing models for the prediction of spatial areal unit data with missing values. It is thus one of the first coherent frameworks for applying random forests to spatially autocorrelated areal unit data, because unlike the RFRK (Hengl et al., 2015) and GLS (Saha, Basu and Datta, 2023) algorithms for point-level data, existing areal unit level methods such as geographical random forests (Georganos et al., 2021) and graph convolutional neural networks (Zhu et al., 2022) do not explicitly allow for the residual spatial autocorrelation in the data after the feature effects have been accounted for. The improved predictive performance of our SPAR-Forest algorithm compared with a number of state-of-the-art alternatives has been evidenced in both simulated and real data settings, and its superior performance is likely to be because it captures flexible feature-target relationships via random forests and residual spatial autocorrelation via random effects. In principle, any spatial smoothing model that allows predictions to be made for both training and test set observations could be used in our algorithm in place of the CAR / SAR component, such as two-dimensional spline-based smoothers, but CAR / SAR models were used here as they are the predominant approaches to smoothing spatial areal unit data. In contrast, random forests are a critical part of our algorithm, due to their ability to produce out-of-bag predictions for both the training and test sets in a computationally efficient manner. Other machine learning approaches could in principle be used, but as discussed in Section 4.2 they may be computationally infeasible (e.g., neural networks).
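The division of labour described above can be caricatured as a backfitting loop: a random forest is fitted to the target minus the current spatial effects, and the spatial component is then refitted to the residuals from the forest's out-of-bag predictions, which guard against overfitting the offset. The sketch below is schematic only, not the paper's method: a crude neighbour-averaging smoother with a fixed weight rho stands in for the full CAR / SAR fit, test-set prediction is omitted, and all names are ours.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def spar_forest_sketch(X, y, W, R=5, rho=0.5):
    """Schematic backfitting loop: a random forest captures feature effects
    via out-of-bag (OOB) predictions, while a neighbour-average smoother
    stands in for the CAR/SAR random-effect fit."""
    Wn = W / W.sum(axis=1, keepdims=True)   # row-standardised neighbourhood matrix
    phi = np.zeros(len(y))                  # spatial random effects, initially zero
    for _ in range(R):
        # Fit the forest to the de-spatialised target
        rf = RandomForestRegressor(n_estimators=200, oob_score=True,
                                   random_state=0).fit(X, y - phi)
        f_oob = rf.oob_prediction_          # OOB predictions avoid overfitting
        # Smooth the feature-adjusted residuals towards their neighbours
        phi = rho * (Wn @ (y - f_oob))
    return rf, phi

# Tiny synthetic example on a chain of 50 'areas'
rng = np.random.default_rng(1)
n = 50
X = rng.normal(size=(n, 3))
W = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)  # chain adjacency
y = X[:, 0] + rng.normal(scale=0.1, size=n)
rf, phi = spar_forest_sketch(X, y, W, R=3)
```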
In the motivating study the SAR-Forest variant narrowly outperformed the CAR-Forest variant in most of the prediction metrics, although the differences between these two similar models were generally not large. Furthermore, SAR-Forest provided improved predictive performance compared to the best existing competitor by around 4.2% (£1,761) in RMSE, 4.5% (£791) in MAE, and 3.2% (£4,464) in the precision of its 95% prediction intervals. These improvements in point prediction are benchmarked against the geographical random forest model, which generally outperforms the linear, random forest and CAR / SAR models in our study. Finally, a comparison of the random forest and CAR / SAR models shows that the latter slightly outperform the former in terms of point prediction, suggesting that for these property price data the ability to capture residual spatial autocorrelation is more important than the ability to capture non-linear feature-target relationships.
The simulation study illustrates that the SPAR-Forest algorithm performs as one would expect under controlled conditions, which provides further evidence of its efficacy and robustness. In the presence of both non-linear feature-target relationships and residual spatial autocorrelation, SPAR-Forest outperforms its competitors, because it is the only method that can accommodate both of these components. However, SPAR-Forest does not perform as well as the CAR / SAR models when all the feature-target relationships are exactly linear, because in this case the additional, unnecessary flexibility of the random forest results in poorer predictive performance than rigidly enforcing the relationships to be linear. Therefore, in practical applications one should identify in advance any features that exhibit close to linear relationships with the target variable via exploratory scatterplots, and include those features as linear terms in the CAR / SAR component of the algorithm rather than in the random forest.
This paper has focused exclusively on predictive performance for purely spatial data, and hence there is much scope to further develop our methodology to address more complex scenarios. One such challenge is how to model multivariate spatio-temporal data, where different random forest models would be needed for each target variable and possibly each time period, while complex residual autocorrelations over space, time and between the target variables would need to be accounted for. A second challenge concerns the inferential goal, which here was restricted to prediction for areal units with missing data values. In this context the effects of the features on the target variable were not of direct interest, which allowed us to plug their estimated effects from the random forest into the second stage CAR / SAR model as a fixed offset. However, in the field of spatial ecological regression the relationships between features and the target variable are the primary inferential goal, meaning that the effects of the features and their uncertainty need to be quantified. One possible solution is to split the features into a set of those that are of primary interest and a second set of confounders, with the confounders remaining in the random forest component due to its flexibility, while the features of primary interest are included in the Bayesian CAR / SAR model as suggested above in the context of linear feature-target relationships. A second possible solution is to use interpretable machine learning tools for feature effects, such as partial dependence plots (Friedman, 2001) and variable importance plots (see Greenwell and Boehmke, 2020).
Finally, as noted in the introduction, spatial areal unit data are prevalent in many fields, which gives the SPAR-Forest algorithm a wide range of possible application areas beyond the property price example that motivated this paper. One important such area is disease mapping (Lawson, 2018), where the fusion of machine learning and spatial autocorrelation models has the potential to help answer a number of public health questions, such as: where are the hotspots of disease risk; which features affect disease risk; and how big are the health inequalities between different communities, and how are they changing over time?
REFERENCES
BERROCAL, V., GUAN, Y., MUYSKENS, A., WANG, H., REICH, B., MULHOLLAND, J. and CHANG, H. (2020). A comparison of statistical and machine learning methods for creating national daily maps of ambient PM2.5 concentration. Atmospheric Environment 222 117130.
BESAG, J., YORK, J. and MOLLIÉ, A. (1991). Bayesian image restoration with two applications in spatial statistics. Annals of the Institute of Statistical Mathematics 43 1-59.
BOEHMKE, B. and GREENWELL, B. (2020). Hands-on machine learning with R. Chapman & Hall/CRC.
BREIMAN, L. (1984). Classification and regression trees. Routledge.
BREIMAN, L. (2001). Random forests. Machine Learning 45 5-32.
BREWER, M. and NOLAN, A. (2007). Variable smoothing in Bayesian intrinsic autoregressions. Environmetrics 18 841-857.
CREDIT, K. (2022). Spatial models or random forest? Evaluating the use of spatially explicit machine learning methods to predict employment density around new transit stations in Los Angeles. Geographical Analysis 54 58-83.
FARAWAY, J. (2014). Linear models with R. Chapman & Hall/CRC.
FRANKE, J. and NEUMANN, M. (2000). Bootstrapping neural networks. Neural Computation 12 1929-1949.
FRIEDMAN, J. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics 29 1189-1232.
GEORGANOS, S., GRIPPA, T., GADIAGA, A., LINARD, C., LENNERT, M., VANHUYSSE, S., MBOGA, N., WOLFF, E. and KALOGIROU, S. (2021). Geographical random forests: a spatial extension of the random forest algorithm to address spatial heterogeneity in remote sensing and population modelling. Geocarto International 36 121-136.
GREENWELL, B. and BOEHMKE, B. (2020). Variable importance plots — An introduction to the vip package. The R Journal 20 343-366.
HAINING, B. and LI, G. (2020). Modelling spatial and spatio-temporal data: A Bayesian approach. Chapman & Hall/CRC.
HENGL, T., HEUVELINK, G., KEMPEN, B., LEENAARS, J., WALSH, M., SHEPHERD, K., SILA, A., MACMILLAN, R., MENDES DE JESUS, J., TAMENE, L. and TONDOH, J. (2015). Mapping soil properties of Africa at 250m resolution: random forests significantly improve current predictions. PLOS ONE 10 1-26.
JACK, E., LEE, D. and DEAN, N. (2019). Estimating the changing nature of Scotland's health inequalities by using a multivariate spatiotemporal model. Journal of the Royal Statistical Society Series A: Statistics in Society 182 1061-1080.
KAWABATA, M., NAOI, M. and YASUDA, S. (2022). Earthquake risk reduction and residential land prices in Tokyo. Journal of Spatial Econometrics 3 5.
KIPF, T. and WELLING, M. (2017). Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations.
KNORR-HELD, L. and RASSER, G. (2000). Bayesian detection of clusters and discontinuities in disease maps. Biometrics 56 13-21.
KUHN, M. and JOHNSON, K. (2019). Feature engineering and selection: A practical approach for predictive models. Chapman & Hall/CRC.
LAWSON, A. (2018). Bayesian disease mapping: hierarchical modeling in spatial epidemiology. Chapman & Hall/CRC.
LECUN, Y., BENGIO, Y. and HINTON, G. (2015). Deep learning. Nature 521 436-444.
LECUN, Y., BOSER, B., DENKER, J., HENDERSON, D., HOWARD, R., HUBBARD, W. and JACKEL, L. (1990). Handwritten digit recognition with a back-propagation network. Advances in Neural Information Processing Systems 396-404.
LEE, D. and ANDERSON, C. (2023). Delivering spatially comparable inference on the risks of multiple severities of respiratory disease from spatially misaligned disease count data. Biometrics 79 2691-2704.
LEE, D., MEEKS, K. and PETTERSSON, W. (2021). Improved inference for areal unit count data using graph-based optimisation. Statistics and Computing 31 51.
LEROUX, B., LEI, X. and BRESLOW, N. (2000). Estimation of disease rates in small areas: A new mixed model for spatial dependence. In Statistical Models in Epidemiology, the Environment and Clinical Trials (M. Halloran and D. Berry, eds.) 135-178. Springer-Verlag, New York.
PALMER, G., DU, S., POLITOWICZ, A., EMORY, J., YANG, X., GAUTAM, A., GUPTA, G., LI, Z., JACOBS, R. and MORGAN, D. (2022). Calibration after bootstrap for accurate uncertainty quantification in regression models. npj Computational Materials 8 115.
RIEBLER, A., SØRBYE, S., SIMPSON, D. and RUE, H. (2016). An intuitive Bayesian spatial model for disease mapping that accounts for scaling. Statistical Methods in Medical Research 25 1145-1165.
RUE, H., MARTINO, S. and CHOPIN, N. (2009). Approximate Bayesian inference for latent Gaussian models using integrated nested Laplace approximations (with discussion). Journal of the Royal Statistical Society Series B 71 319-392.
SAHA, A., BASU, S. and DATTA, A. (2023). Random forests for spatially dependent data. Journal of the American Statistical Association 118 665-683.
SOLTANI, A., HEYDARI, M., AGHAEI, F. and PETTIT, C. (2022). Housing price prediction incorporating spatio-temporal dependency into machine learning algorithms. Cities 131 103941.
WANG, X., WANG, X., REN, X. and WEN, F. (2022). Can digital financial inclusion affect CO2 emissions of China at the prefecture level? Evidence from a spatial econometric approach. Energy Economics 109 105966.
WHITTLE, P. (1954). On stationary processes in the plane. Biometrika 41 434-449.
WRIGHT, M. and ZIEGLER, A. (2017). ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software 77 1-17.
XIA, Z., STEWART, K. and FAN, J. (2021). Incorporating space and time into random forest models for analyzing geospatial patterns of drug-related crime incidents in a major U.S. metropolitan area. Computers, Environment and Urban Systems 87 101599.
ZHANG, H., ZIMMERMAN, J., NETTLETON, D. and NORDMAN, D. (2020). Random forest prediction intervals. The American Statistician 74 392-406.
ZHU, D., LIU, Y., YAO, X. and FISCHER, M. (2022). Spatial regression graph convolutional neural networks: A deep learning paradigm for spatial multivariate distributions. Geoinformatica 26 645-676.
MacBride, Cara, Davies, Vinny and Lee, Duncan (2025). A spatial autoregressive random forest algorithm for small-area spatial prediction. Annals of Applied Statistics, 9(1), pp. 485-504. (doi: 10.1214/24-AOAS1969)
http://eprints.gla.ac.uk/337422/