A Spatial Autoregressive Random Forest Algorithm For Small-Area Spatial Prediction
Keywords and phrases: Areal Unit Data, Random Forests, Property Prices, Spatial Autoregressive Models.
1. Introduction. Spatial areal unit data are prevalent in fields including ecology (Brewer
and Nolan, 2007), economics (Kawabata, Naoi and Yasuda, 2022), and epidemiology (Lee
and Anderson, 2023), and the aims of modelling these data include hotspot identification
(Knorr-Held and Raßer, 2000), boundary detection (Lee, Meeks and Pettersson, 2021), eco-
logical regression (Wang et al., 2022), and the quantification of spatial inequalities (Jack, Lee
and Dean, 2019). Unlike for point-level data, spatial prediction is not normally the inferen-
tial goal, because there is one data value for each areal unit and hence nothing to predict.
However, areal unit data sometimes contain missing values, making spatial prediction an im-
portant methodological challenge. These missing values could be caused by the observed
value not existing, not being measured or being suppressed, the latter occurring because it
may disclose the identity of individuals. Here, we model median property prices at a small-
area scale in Scotland, and these data are only publicly released if 5 or more properties sold
in a year, leading to around 9% of the small areas having missing values.
Statistical models for these areal unit data typically represent the mean function with a
linear combination of available features and a set of random effects, with the latter capturing
any residual spatial autocorrelation in the data after feature adjustment. Conditional autore-
gressive (Besag, York and Mollié, 1991) models and spatial autoregressive models (Whittle,
1954) are commonly used for this purpose, which capture this spatial autocorrelation by
smoothing the random effects in neighbouring areal units towards each other. In contrast,
machine learning algorithms are the state-of-the-art approach to non-spatial prediction, with
examples including random forests (Breiman, 2001), gradient boosting machines (Friedman,
2001) and neural networks (LeCun et al., 1990). These algorithms model the relationship be-
tween each feature and the target variable as a complex non-linear function, typically leading
to improved predictive performance compared to models with linear feature-target relation-
ships. These competing paradigms thus utilise different aspects of spatial areal unit data to
make predictions, with machine learning algorithms utilising complex non-linear feature-
target relationships whilst ignoring residual spatial autocorrelation, while spatial smoothing
models capture this autocorrelation at the expense of simpler feature-target relationships.
The use of machine learning algorithms in spatial statistics is a growing research area, with
Berrocal et al. (2020) and Credit (2022) comparing the predictive performance of traditional
spatial statistical models and machine learning algorithms. A number of hybrid methodolo-
gies have also been proposed, which for point-level spatial data include the random forest
regression Kriging (RFRK, Hengl et al., 2015) and random forest generalised least squares
(RF-GLS, Saha, Basu and Datta, 2023) algorithms. For areal unit data, Xia, Stewart and Fan
(2021) and Soltani et al. (2022) incorporate spatially lagged features into tree-based machine
learning models, while Georganos et al. (2021) propose a geographical random forest (GRF)
algorithm that fits a separate local random forest for each areal unit using only nearby data
points. In the related field of image analysis, convolutional neural networks (CNN, see LeCun, Bengio and Hinton, 2015) have been developed, which extend neural network models
by spatially smoothing the features and subsequent nodes in the network using a spatial mov-
ing average filter applied to each pixel’s 8 neighbouring pixels. These pixel-based models
have been extended to irregularly shaped areal unit data by graphical convolutional neural
networks (GCNN, see for example Kipf and Welling, 2017 and Zhu et al., 2022), which re-
place the regular spatial moving average filter with an irregular one based on the geographical
contiguity of the areal units. However, unlike the point-level RFRK and RF-GLS algorithms,
the above set of machine learning algorithms for areal unit data do not explicitly allow for
residual spatial autocorrelation in the target variable after feature adjustment.
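To make the filtering idea concrete, the following Python sketch (our own illustration, using an adjacency-list representation rather than code from any of the cited packages) applies one pass of the contiguity-based moving-average filter that spatially lagged features and graph convolutions build on:

```python
def row_standardise(W):
    """Turn a binary adjacency list {unit: [neighbours]} into row-standardised
    weights, so that each unit's weights sum to one."""
    return {i: [(j, 1.0 / len(nbrs)) for j in nbrs] for i, nbrs in W.items()}

def spatial_lag(x, W_tilde):
    """One pass of the moving-average filter: each unit is replaced by the
    weighted mean of its neighbours' values."""
    return {i: sum(w * x[j] for j, w in nbrs) for i, nbrs in W_tilde.items()}

# A three-unit chain 0 - 1 - 2: each unit is smoothed towards its neighbours
W = {0: [1], 1: [0, 2], 2: [1]}
lagged = spatial_lag({0: 0.0, 1: 3.0, 2: 6.0}, row_standardise(W))
```

A GCNN applies such a filter repeatedly, interleaved with non-linear transformations of the node values.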
Therefore, this paper proposes an iterative SPatial AutoregRessive random forest algorithm called SPAR-Forest for predicting spatial areal unit data, which is a novel fusion of
spatial correlation (smoothing) models and random forests that overcomes the above limita-
tion. This algorithm incorporates flexible feature-target relationships via a random forest and
residual spatial autocorrelation via a spatial random effects model, and it iteratively re-fits
each component based on the current value of the other. The total number of iterations is
one of the tuning parameters of the algorithm, which collectively are optimised via a 10-fold
cross validation procedure. A random forest is the machine learning algorithm used to capture
non-linear feature-target effects due to its computational efficiency and inbuilt bootstrapping
procedure, because the latter allows approximately out-of-sample predictions to be obtained
for the training set via out-of-bag predictions. For details of why this is needed see Section 4.
This methodology is motivated by a new study aiming to predict median property prices in
2018 at the small-area scale in Scotland, and details of this study are presented in Section 2.
Section 3 provides a review of competitor prediction models, while our novel SPAR-Forest
algorithm is described in Section 4. The study design used for assessing predictive perfor-
mance is outlined in Section 5, along with the metrics used to measure predictive accuracy.
Section 6 presents the results of a simulation study that compares the prediction accuracy of
a range of models under different fixed conditions, while the results of the motivating study
are presented in Section 7. Finally, the paper ends in Section 8 with a summary of the main
findings and areas for future work.
2. Motivating study. The aim of the study is to predict median property prices at the
small-area scale in Scotland in 2018, which is the most recent year of data that are publicly
available. The data relate to spatial units called Data Zones (DZ), which are a small-area
geography containing between 500 and 1,000 people. Data Zones nest within 32 larger Local
Authorities (LA), which are the administrative units that run public services such as schools
and rubbish collections. Three of these LAs (Na h-Eileanan Siar, Orkney, and Shetland) are
island communities that contain only 95 DZs in total, which are removed to avoid having
small numbers of DZs in an LA when splitting the data into training and test sets. This leaves
N = 6,881 DZs as the study region, which comprise mainland Scotland and some of the
islands. The data used in this study are described below, and unless otherwise stated were
obtained from https://statistics.gov.scot/home.
2.1. Target variable. The target variable is the median selling price of all properties sold
in 2018, with the median being used because it is robust to outlying observations. Median
prices that are based on fewer than 5 sales are suppressed (or do not exist in the case of zero
sales) to ensure individual properties are not identifiable, which results in around 9% of DZs
having missing values. Additionally, one DZ had a median price of just £600, and as this
is likely to be an error this value is treated as missing. The remaining data exhibit a skewed
distribution (see Section 1.1 of the supplemental material) that ranges between £19,500 and £878,000, with a median value of £139,282. Figure 1 displays the spatial patterns in median
property prices across the two largest cities of Edinburgh (A, top) and Glasgow (B, bottom),
while the whole of Scotland is not shown because most DZs would then be too small to see.
In the figure DZs with missing property prices are not shaded, which in some cases makes
them appear to be white / very light grey when plotted over the background map. The figure
shows that prices are higher in Edinburgh than in Glasgow, with median prices of £230,000 and £122,000 respectively. Glasgow also exhibits a much higher proportion of DZs with missing property prices than Edinburgh, at 16.8% and 4.0% respectively. These missing values appear to be spatially clustered in Glasgow, whereas in Edinburgh they appear to be more randomly scattered. In Glasgow, three of the most prominent clusters
of missing values are in the residential areas of Drumchapel in the far north-west (south-west
of Bearsden), Castlemilk in the south (north-east of Clarkston), and in the east-end of the city
(south of Stepps).
2.2. Features. A number of features that are likely to explain the spatial variation in median property prices were obtained, including characteristics of each DZ itself and the properties situated within it. Some of these features contain a small number of missing values,
which are imputed using the K nearest neighbours (KNN) algorithm with K = 5 as recom-
mended by Kuhn and Johnson (2019). Additionally, a very small number of clear outliers
were assumed to be data errors and imputed as above. The numeric features were then stan-
dardised to have a mean of zero and a standard deviation of one. The set of features is sum-
marised below, with additional exploratory analysis of their distributions given in Section 1.2
of the supplemental material.
2.2.1. Property characteristics. Average property size is measured by the mean number
of rooms excluding bathrooms and kitchens, while property type is summarised by the per-
centages of: (i) flats; and (ii) semi-detached / detached houses; in each DZ. Additionally, the
density of properties is summarised by the number of dwellings per hectare. Finally, council
tax is a levy paid by each householder for public services, and the council tax band of a prop-
erty provides a crude measure of a property’s worth. The latter has 8 levels labelled A to H,
with the cheapest properties in band A and the most expensive in band H. The percentages
of properties in each of these 8 bands are available, but as they are highly correlated, principal components analysis (PCA) is applied to obtain independent features. The first 5 PCs explain over 95% of the variation in these variables, and hence are used in the prediction model.
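As an illustration of this dimension reduction step, the following Python sketch (our own code, not the study's R implementation) retains the fewest principal components explaining at least 95% of the variance:

```python
import numpy as np

def pca_reduce(X, var_explained=0.95):
    """Replace correlated columns (e.g. the 8 council tax band percentages) with
    the fewest principal component scores explaining >= var_explained."""
    Xc = X - X.mean(axis=0)                          # centre each column
    evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(evals)[::-1]                  # largest eigenvalue first
    evals, evecs = evals[order], evecs[:, order]
    cum = np.cumsum(evals) / evals.sum()             # cumulative variance proportion
    q = int(np.searchsorted(cum, var_explained)) + 1
    return Xc @ evecs[:, :q], cum                    # uncorrelated scores + profile
```

The returned scores are mutually uncorrelated by construction, which is the property exploited in the prediction model.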
FIG 1. Maps of median property prices in each DZ in Edinburgh (A, top) and Glasgow (B, bottom). DZs with missing median property prices have no colour shading.
2.3. Study aims. Within the overarching aim of spatial areal unit prediction, this study
addresses three key questions. Firstly, how does the predictive performance of the proposed
SPAR-Forest algorithm compare to machine learning algorithms and spatial CAR / SAR
smoothing models? Secondly, how does property price predictability vary regionally across
Scotland, and which areas can be predicted with the greatest and least amounts of accuracy
and precision? Thirdly, what are the likely median property prices for the 9% of Data Zones
that have missing values, and how do these predictions compare to the prices in the remaining
Data Zones? This paper will thus provide users with information on average property prices
in their local areas, as well as access to a state-of-the-art prediction algorithm for spatial areal
unit data.
3.1. Normal linear model. The simplest baseline model is the normal linear model, which when applied to the training set is given by

(1)   Y(Ak) = β0 + x(Ak)⊤β + ϵ(Ak),   ϵ(Ak) ∼ N(0, σ²),   for k = 1, . . . , K.
3.2. Spatial smoothing models. Residual spatial autocorrelation not accounted for by the features is ubiquitous in areal unit data, and it can be modelled by adding autocorrelated random effects ϕ = [ϕ(A1), . . . , ϕ(AK)] to (1) via

(2)   Y(Ak) = β0 + x(Ak)⊤β + ϕ(Ak) + ϵ(Ak),   ϵ(Ak) ∼ N(0, σ²),   for k = 1, . . . , K.
In this model a row standardised neighbourhood matrix W̃ is commonly used rather than
the original binary matrix W. Again ρ is a spatial dependence parameter, with ρ = 0 cor-
responding to independence (the precision matrix again simplifies to the identity matrix)
while stronger autocorrelation is captured as ρ increases. The full spatial model comprises
the data likelihood model (2), one of the random effects models (3) or (4), and prior dis-
tributions for the parameters (β0, β = (β1, . . . , βp), σ², τ, ρ). Weakly informative priors are assumed here for these parameters to let the data speak for themselves, which are the ones recommended by the INLA software used for inference (Rue, Martino and Chopin, 2009). Specifically: (i) βj ∼ N(0, 100,000) for j = 0, . . . , p; (ii) ln(σ⁻²) ∼ log-gamma(1, 0.01); (iii) τ ∼ log-gamma(1, 0.01); and (iv) ln[ρ/(1 − ρ)] ∼ N(0, 100). Once fitted to the training
set the model is used to predict property prices in the test set by sampling from the posterior
predictive distribution, and details are provided in Section 2.1 of the supplemental material
accompanying this paper.
3.3. Random forest model. Random forests (RF) are one of the best performing ma-
chine learning prediction algorithms (Boehmke and Greenwell, 2020), and were originally
proposed by Breiman (2001). They are based on the additive decomposition
4. Methodology. This section proposes a novel iterative spatial prediction algorithm for
areal unit data called SPAR-Forest, which uses random forests to estimate non-linear
feature-target relationships and Bayesian spatial autoregressive models to allow for any resid-
ual spatial autocorrelation. A Bayesian approach to inference using INLA is taken for the
spatial smoothing model, because it provides estimates of the spatial random effects for the
training set which are used in our iterative algorithm. We note that maximum likelihood approaches can be used to estimate spatial random effects models, such as via the R package
spmodel. However, these estimation algorithms typically integrate out the random effects
rather than estimating them, which precludes their use here. In principle, any spatial smooth-
ing model that is appropriate for areal unit data could be used, but here we illustrate our
approach with CAR and SAR models as they are the most popular in the areal unit modelling
literature. The rationale for our algorithm is outlined in Section 4.1, while algorithmic details
are provided in Section 4.2.
4.1. Overall approach and rationale. The observed data {Y (Ak )} represent error-prone
measurements of the true values {m[x(Ak )]}, leading to the decomposition
4.2. Implementation. The iterative SPAR-Forest prediction algorithm has the following
tuning parameters: (i) the number of iterations of the algorithm R; (ii) the random forest specific tuning parameters (mtry, minnode); and (iii) the CAR / SAR model tuning parameter D. All of these are estimated using a 10-fold cross validation procedure applied to the training
set, details of which are given in the next section. Thus, the algorithm below is presented for
a fixed set of tuning parameters.
Algorithm - SPAR-Forest
Stage 0: Initialise the random effects by setting ϕ̃(Ak) = 0 for all training set observations, and fix the tuning parameters (mtry, minnode, D, R).
Stage 1: Iterate the following steps r = 1, . . . , R times.
A. Compute the decorrelated target variable Z(Ak) = Y(Ak) − ϕ̃(Ak) for observations in the training set k = 1, . . . , K.
B. Fit a random forest model with tuning parameters (mtry, minnode) to the training set with features {x(Ak)} and target variable {Z(Ak)} for k = 1, . . . , K, to estimate the effects of the features on the decorrelated target variable. Use this model to produce out-of-sample predictions {m̂(k)[x(Ak)]} for k = 1, . . . , N, covering both the training and test sets, with the former being produced using the out-of-bag approach.
C. Fit the following spatial random effects model described in Section 3.2 to the training
data using INLA:
Further details about random forests (Stage 1 B.) and the Bayesian CAR / SAR model
(Stage 1 C. and 2) are provided in Section 3 of the main paper and Section 2 of the supple-
mental material. The above algorithm is implemented in R, and software allowing others to
apply the method to their own data is available at https://github.com/vinnydavies/SPARforest
and described in more detail in Section 3 of the supplemental material. Specifically, the soft-
ware fits two variants of the SPAR-Forest algorithm, the first using the CAR model (3) to
represent the spatial random effects and the second replacing this with the spatial autoregres-
sive model (4). The software uses the ranger (Wright and Ziegler, 2017) package to fit the
random forests, and the INLA package (Rue, Martino and Chopin, 2009) to fit the Gaussian
CAR / SAR models.
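To make the structure of the iterations concrete, the following Python sketch illustrates Stages 0 and 1 under strong simplifications: an ordinary least squares fit stands in for the random forest (the actual software uses ranger's out-of-bag predictions), and a single generalised-ridge solve with a SAR-type precision matrix stands in for the Bayesian CAR / SAR fit in INLA. All function names and default values below are our own illustration.

```python
import numpy as np

def spar_forest_sketch(y, X, W_tilde, R=10, rho=0.9, lam=1.0):
    """Illustrative loop: alternate between (A) decorrelating the target,
    (B) fitting a mean model, and (C) re-estimating the spatial random
    effects from the residuals of the mean model."""
    K = len(y)
    B = np.eye(K) - rho * W_tilde            # SAR-type operator (I - rho * W~)
    Q = B.T @ B                              # implied random-effect precision structure
    phi = np.zeros(K)                        # Stage 0: initialise random effects at zero
    Xd = np.column_stack([np.ones(K), X])    # design matrix with intercept
    for _ in range(R):
        z = y - phi                          # Stage 1A: decorrelated target
        beta, *_ = np.linalg.lstsq(Xd, z, rcond=None)   # Stage 1B: stand-in for the RF
        m_hat = Xd @ beta
        resid = y - m_hat                    # Stage 1C: smooth the residuals spatially
        phi = np.linalg.solve(np.eye(K) + lam * Q, resid)
    return m_hat, phi
```

In the full algorithm the number of iterations R is itself a tuning parameter chosen by cross validation, rather than a fixed argument as here.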
The SPAR-Forest algorithm uses the random forest to make out-of-sample predictions {m̂(k)[x(Ak)]} for the K observations in the training set via an out-of-bag approach, which
are subsequently used in step C. An out-of-bag prediction of Y (Ak ) is made by averaging
the predictions from the sub-forest of trees that were fitted without using Y (Ak ), which is
possible as random forests use a different bootstrapped (sampled with replacement) copy of
the training data when fitting each tree in the forest. Out-of-bag predictions are needed so
that the training and test set predictions are generated in the same way, i.e., without using
the data point in question. If one instead replaced {m̂(k) [x(Ak )]}K k=1 with in-sample fitted
values, then they would likely be closer to the observed data compared to those in the test
set, leading to overfitting of the training set and an underestimation of predictive uncertainty
(see Section 4.3 of the supplemental material for an example).
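The out-of-bag mechanics can be illustrated with a deliberately trivial "forest" in which each tree simply predicts the mean of its own bootstrap sample; everything below is our own illustration rather than the ranger implementation.

```python
import random

def oob_predictions(y, n_trees=500, seed=1):
    """Out-of-bag averaging: each point is predicted only by the trees whose
    bootstrap sample never contained it."""
    rng = random.Random(seed)
    K = len(y)
    sums, counts = [0.0] * K, [0] * K
    for _ in range(n_trees):
        idx = [rng.randrange(K) for _ in range(K)]   # bootstrap: sample with replacement
        in_bag = set(idx)
        pred = sum(y[i] for i in idx) / K            # this 'tree's' constant prediction
        for k in range(K):
            if k not in in_bag:                      # k was never drawn: out-of-bag
                sums[k] += pred
                counts[k] += 1
    return [sums[k] / counts[k] if counts[k] else None for k in range(K)]
```

Because each bootstrap sample omits roughly 37% of the observations, every training point accumulates out-of-bag predictions from a substantial sub-forest.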
This ability to produce out-of-bag predictions for the training set in a computationally effi-
cient manner is the main reason why random forests are used to capture feature-target effects
in our algorithm, rather than using other machine learning algorithms such as neural net-
works that only produce in-sample fitted values by default. In principle however, one could
apply a bootstrapping approach to a neural network for this purpose, as bootstrapping neural
networks has been implemented in a variety of contexts (see for example, Franke and Neu-
mann, 2000 and Palmer et al., 2022). In practice, however, this is computationally infeasible. For example, running a single random forest on the motivating study data for 1,000 bootstrapped trees in the forest takes 3.6 seconds on an iMac computer with a 3.8 GHz processor and 32GB of memory, whereas running a neural network (with 3 hidden layers each having 64 nodes and run for 1,000 epochs) repeatedly for 1,000 bootstrapped data samples takes 8.3
hours. Thus, incorporating a neural network with such a bootstrapping procedure within our
proposed algorithm for a combined R = 20 iterations would take approximately 167 hours
for one model run, and the algorithm would need to be run large numbers of times for tuning
via a cross validation approach (see below).
An alternative would be to run the neural network only once for each of the R iterations of our algorithm, and use in-sample predictions {m̂[x(Ak)]}, k = 1, . . . , K, for the training set in the
spatial smoothing step C. However, initial testing showed that this approach leads to poor
performance, and details are given in Section 4 of the supplemental material. Also included
in that section is a comparison of using a random forest and a neural network for predicting
the motivating property price data, because it shows that the random forest performs better
and is hence likely to be more appropriate for our proposed algorithm.
5. Study design for assessing predictive performance. In both the simulation (Section
6) and property price (Section 7) studies the predictive performance of two variants of the
SPAR-Forest algorithm are assessed, namely CAR-Forest, which uses (3) for the random effects, and SAR-Forest, which uses (4). These iterative prediction
algorithms are compared against the following competitors: (i) a normal linear model (LM
- Section 3.1); (ii) spatial CAR and SAR models (CAR / SAR - Section 3.2); (iii) a random
forest model (RF - Section 3.3); and (iv) a geographical random forest model (GRF - Section
2.3 supplemental material). Additionally, for the motivating property price study we also
present results from a simplified non-iterative form of the SPAR-Forest algorithm equivalent
to R = 1, but as it did not perform as well as the full algorithm its results are shown in Section
6.2 of the supplemental material. The normal linear model is included for its simplicity,
while the remaining competitors comprise state-of-the-art models in spatial statistics, machine
learning and existing fusions of these paradigms.
The predictive performance of each model is assessed by randomly splitting the Data
Zones into an 80% training set and a 20% test set, which for the motivating property price
study include 5,011 and 1,253 Data Zones respectively. Additionally, to ensure the results
for the property price study are not affected by the particular choice of training-test split, we
repeat the prediction experiment on 5 independent training-test splits. All the models except
the normal linear model contain tuning parameter(s), which are initially optimised using the
training set. This optimisation is done using a 10-fold cross validation procedure, which splits
the training set into ten random subsets of approximately equal size. Each model is fitted to
nine of these subsets with different combinations of tuning parameters, and for each combi-
nation the observations in the tenth subset, known as the validation set, are predicted. This
process is repeated treating each of the ten subsets as the validation set once, and the optimal
values of the tuning parameters are the combination that minimise the root mean square error
(see below for the definition) of the predictions. This process is repeated independently for
each of the five training and test splits.
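The cross validation tuning procedure described above can be sketched as follows; the fold construction and grid search are generic, and the fit / predict callables are placeholders for whichever model is being tuned (this is our own sketch, not the study's code).

```python
import random

def kfold_indices(n, k=10, seed=42):
    """Randomly partition indices 0..n-1 into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def tune(grid, fit, predict, X, y, k=10):
    """Grid search: for each tuning combination, average the validation RMSE
    over the k folds and return the combination minimising it."""
    folds = kfold_indices(len(y), k)
    best, best_rmse = None, float("inf")
    for params in grid:
        sq_err, n_val = 0.0, 0
        for fold in folds:
            held = set(fold)
            train = [i for i in range(len(y)) if i not in held]
            model = fit([X[i] for i in train], [y[i] for i in train], params)
            for i in fold:                    # predict the held-out validation set
                sq_err += (predict(model, X[i]) - y[i]) ** 2
                n_val += 1
        rmse = (sq_err / n_val) ** 0.5
        if rmse < best_rmse:
            best, best_rmse = params, rmse
    return best, best_rmse
```

After tuning, the model is refitted to the full training set at the selected parameter values, exactly as described in the text.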
Once the optimal tuning parameters have been chosen, each model is refitted to the full
training set using these optimal values, and is then used to make out-of-sample predictions for
the test set. As median property price is a continuous measurement, the quality of these pre-
dictions is assessed using the following standard metrics. In what follows {Y (Ar ), Ỹ (Ar )}
respectively denote the observed median property price and the prediction for the rth areal
unit in the test set, where following the notation in Section 3, r = K + 1, . . . , N .
Root mean square error − RMSE = √( [1/(N − K)] Σ_{r=K+1}^{N} [Ỹ(Ar) − Y(Ar)]² ).

Median absolute error − MAE = Median_{r=K+1,...,N} |Ỹ(Ar) − Y(Ar)|.

Coverage probability − CP = the proportion of the N − K 95% prediction intervals that contain the true value.

Average interval width − AIW = the average width of the N − K 95% prediction intervals.
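As a concrete reference, the four metrics can be computed as follows (a plain Python sketch of the definitions above, not the study's own code):

```python
import statistics

def rmse(y_obs, y_pred):
    """Root mean square error over the test set."""
    return (sum((p - o) ** 2 for o, p in zip(y_obs, y_pred)) / len(y_obs)) ** 0.5

def mae_median(y_obs, y_pred):
    """Median absolute error: robust to a few units with very large errors."""
    return statistics.median(abs(p - o) for o, p in zip(y_obs, y_pred))

def coverage_and_width(y_obs, lower, upper):
    """CP: share of 95% intervals containing the truth; AIW: their average width."""
    n = len(y_obs)
    cp = sum(l <= o <= u for o, l, u in zip(y_obs, lower, upper)) / n
    aiw = sum(u - l for l, u in zip(lower, upper)) / n
    return cp, aiw
```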
The accuracy of the point predictions is summarised by both the RMSE and MAE metrics, with the best model minimising both quantities. We present both metrics because the RMSE, which averages squared errors, is much less robust to individual DZs with big prediction errors than the MAE. The appropriateness of the 95% prediction
intervals is quantified by the coverage probability and average interval width, and the former
should be close to 0.95 if predictive uncertainty is appropriately captured. Finally, the average
interval width should be as small as possible as long as the coverage probability is close to
0.95.
6. Simulation study. This section presents a simulation study, whose aim is to compare
the predictive performance of the SPAR-Forest algorithm against the competitor prediction
models outlined above in a number of controlled scenarios.
6.1. Data generation. The study is based on the N = 746 Data Zones contained within
the Glasgow City local authority, because using all of mainland Scotland would make the
simulation study computationally infeasible. This is because the complete study involves fit-
ting each of the models described in the main paper thousands of times, due to both the
number of simulated data sets generated under multiple scenarios and the optimisation of the
tuning parameters required for each model. Each simulated data set consists of a continuous
target variable {Y(Ak)}, five features {x(Ak) = [x1(Ak), . . . , x5(Ak)]}, k = 1, . . . , N, and the easting and northing spatial coordinates of the Data Zone centroids. The features are assumed to
be independent in space, which ensures they are not collinear with the additional residual
spatial autocorrelation induced into the target variable (see below). Each feature is generated
by sampling N realisations from an independent uniform random variable, which has a min-
imum value of 0 and a maximum value of 2π . These limits are chosen so that the non-linear
feature-target relationships outlined below exhibit sizeable non-linearity.
The target variable is generated as Y (Ak ) = f [x(Ak )] + ϕ(Ak ) + ϵ(Ak ), a linear combi-
nation of the true value f [x(Ak )] + ϕ(Ak ) and independent zero-mean Gaussian error ϵ(Ak )
with standard deviation σ = 1. The true value thus depends on both the features x(Ak ) and
residual spatial autocorrelation induced by the random effect ϕ(Ak ), and the exact specifi-
cation of f [x(Ak )] + ϕ(Ak ) is varied across the four scenarios described below. The set of
spatial random effects for all DZs are generated from a zero-mean multivariate Gaussian dis-
tribution, where the covariance matrix is equivalent to that from the CAR model proposed by
Leroux, Lei and Breslow (2000). Here, the spatial neighbourhood matrix W is constructed
using the 5 nearest neighbours rule, and we set ρ = 0.9 to ensure strong spatial dependence.
The variance of these spatially autocorrelated random effects controls the size of its influence
on the target variable, and this is varied across the scenarios described below.
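A minimal Python sketch of this data generating process, with illustrative parameter values and our own function names (the study itself was run in R), is:

```python
import numpy as np

def simulate_dataset(coords, rho=0.9, tau2=1.0, sigma=1.0, seed=0):
    """One synthetic data set in the spirit of Section 6.1: five Uniform(0, 2*pi)
    features, a scenario-1-style non-linear f, and Leroux-CAR random effects
    built from a 5-nearest-neighbour W."""
    rng = np.random.default_rng(seed)
    N = len(coords)
    X = rng.uniform(0.0, 2.0 * np.pi, size=(N, 5))
    f = 3.0 * np.sin(X[:, 0]) + 2.0 * np.log(X[:, 1])          # only x1, x2 matter
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    W = np.zeros((N, N))
    for i in range(N):
        for j in np.argsort(dist[i])[1:6]:                     # 5 nearest neighbours rule
            W[i, j] = W[j, i] = 1.0
    Q = rho * (np.diag(W.sum(axis=1)) - W) + (1.0 - rho) * np.eye(N)  # Leroux precision
    Sigma = np.linalg.inv(Q)
    phi = rng.multivariate_normal(np.zeros(N), tau2 * (Sigma + Sigma.T) / 2.0)
    y = f + phi + rng.normal(0.0, sigma, N)                    # truth + independent error
    return y, X, phi
```

Varying tau2 relative to the spread of f reproduces the balance between feature effects and spatial autocorrelation that distinguishes the four scenarios.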
6.2. Study design. Fifty simulated data sets are generated under each of four different
scenarios, and the set of models outlined in Section 5 are compared in this study, which
include both the CAR and SAR variants of the SPAR-Forest algorithm and a range of com-
petitors from both spatial statistics and machine learning. In all of the scenarios the first two
features {x1 (Ak ), x2 (Ak )}N
k=1 have relationships with the response while the remaining 3
features do not, with the latter included in all models to ensure the results are not adversely
affected by the presence of unimportant features. Additionally, the easting and northing co-
ordinates of each Data Zone’s centroid are included as features in the linear model, random
forest and geographical random forest because they have no other way to capture the spatial
structure in the data, while this is not necessary for the CAR / SAR and the SPAR-Forest
approaches. The 4 different scenarios and their rationales are described below.
Scenario 1: The features have non-linear relationships with the response via f [x(Ak )] =
3 sin[x1 (Ak )] + 2 ln[x2 (Ak )], and the random effects variance is chosen so that the
marginal standard deviations of the feature component f [x(Ak )] and the random effects
ϕ(Ak) are similar. This scenario thus gives the features and the residual spatial autocorrelation roughly equal prominence in influencing the true values.
Scenario 2: The features have non-linear relationships with the response via f [x(Ak )] =
3 sin[x1 (Ak )] + 2 ln[x2 (Ak )], and the random effects variance is chosen so that the
marginal standard deviation of the feature component f [x(Ak )] is twice that of the random
effects ϕ(Ak). This scenario thus makes the features more important than the residual spatial autocorrelation in influencing the true values.
Scenario 3: The features have non-linear relationships with the response via f [x(Ak )] =
1.5 sin[x1 (Ak )] + ln[x2 (Ak )], and the random effects variance is chosen so that the
marginal standard deviation of the feature component f [x(Ak )] is half that of the ran-
dom effects ϕ(Ak). This scenario thus makes the residual spatial autocorrelation more important than the features in influencing the true values.
Scenario 4: The features have linear relationships with the response via f [x(Ak )] =
x1 (Ak ) + x2 (Ak ), and the random effects variance is chosen so that the marginal stan-
dard deviations of the feature and random effect components are similar. This scenario thus compares the models when the feature effects are exactly linear rather than non-linear.
The predictive performances of all the models are quantified using the approach outlined in
Section 5, with the exception that only a single training-test split is considered for each sim-
ulated data set. The sets of possible tuning parameters for the models used in this simulation
study are described in Section 5.1 of the supplemental material.
6.3. Results of the simulation study. The predictive performances of all models across all
scenarios are summarised in Figure 2 by their RMSE values, while the corresponding results
for the MAE, CP and AIW metrics are presented in Section 5.2 of the supplemental material.
Each boxplot represents the set of out-of-sample RMSE values for a single model across the
50 simulated data sets in a single scenario, and the four panels relate to the four scenarios.
Additionally, the numbers at the top of each boxplot present the mean RMSE over the 50
simulated data sets for ease of reference.
F IG 2. Boxplots showing the out-of-sample RMSE for the test set predictions for each simulated data set in each
scenario from each model. The mean values for the boxplots are presented at the top of each graph.
The figure shows a number of key findings, the first being that the simple linear model performs the worst across the board, because it cannot accommodate non-linear feature effects of unknown shape or residual spatial autocorrelation. In contrast, the SPAR-Forest algorithms perform the best across scenarios 1 to 3, where both non-linear feature-target relationships and residual spatial autocorrelation are present. The CAR-Forest and SAR-Forest
variants show almost identical results, which is not surprising given the similarities between
their specifications. For scenario 4, where all features have linear effects on the target vari-
able, the CAR / SAR models slightly outperform the SPAR-Forest algorithm, which is be-
cause they are the models that are closest to the data generating mechanism. However in this
case, the SPAR-Forest algorithm could easily be extended to accommodate linear feature-
target relationships, simply by putting those features thought to have linear effects into the
CAR / SAR component of the algorithm rather than into the random forest component. Fi-
nally as expected, the random forest models outperform the CAR / SAR models when the feature effects dominate the residual spatial autocorrelation (Scenario 2), while the opposite
is true when the residual spatial autocorrelation is dominant (Scenario 3). In contrast, when
the two components have a similar influence on the target variable (Scenario 1) then these
models perform similarly.
7. Results from the property price study. This section presents the results of the motivating study, focusing on the 3 questions outlined in Section 2.3. Section 7.1 describes the implementation of the models, while Section 7.2 compares their predictive abilities. Section 7.3 provides a local authority comparison of predictive performance from the best performing model, while Section 7.4 predicts median property prices for the Data Zones with missing values. Additional results from this study are presented in Section 6 of the supplemental material, including an assessment of the stability of the iterative SPAR-Forest algorithm across the R iterations (Section 6.1), and predictive performance metrics for simpler non-iterative SPAR-Forest algorithms where R = 1 (Section 6.2).
7.1. Model implementation. All models are applied with the complete set of features described in Section 2, which initial analysis showed performed similarly to using an additional forwards or backwards stepwise feature selection approach. The easting and northing coordinates of each Data Zone's centroid are included as features in the linear model, random forest and geographical random forest because these models have no other way of capturing spatial location, while this is not necessary for the remaining models, which explicitly capture spatial structure.
Initially, the normal linear model was fitted to both median property price and its natural logarithm, and as the residual normality assumption is only plausible on the log scale, all models are applied on this scale in the interests of fairness. The resulting predictions are then back-transformed to the original scale when computing the predictive performance metrics outlined in Section 5. As a number of the models are based on a normality assumption, the back-transformation for the point predictions follows the log-normal result that if X ∼ N(µ, σ²), then E[exp(X)] = exp(µ + σ²/2). Here, the linear, CAR, SAR and SPAR-Forest methods (both CAR and SAR variants) provide estimates of σ² directly from the model, while for the random forest and its geographical extension σ² is estimated by the sample variance of the out-of-bag prediction errors following the ideas in Zhang et al. (2020).
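For concreteness, the back-transformation and the out-of-bag variance estimate described above can be sketched as follows. This is a minimal illustration, not the paper's own implementation, and all function and variable names are ours:

```python
import numpy as np

def backtransform(mu_log, sigma2):
    """Log-normal mean: if X ~ N(mu, sigma^2), then E[exp(X)] = exp(mu + sigma^2 / 2)."""
    return np.exp(mu_log + sigma2 / 2.0)

def oob_sigma2(y_log, oob_pred_log):
    """Estimate sigma^2 by the sample variance of out-of-bag prediction
    errors on the log scale (the random forest / GRF case)."""
    return np.var(y_log - oob_pred_log, ddof=1)

# Example: a log-scale point prediction of 11.8 with estimated sigma^2 = 0.25
price = backtransform(11.8, 0.25)
```

The σ²/2 correction matters in practice: simply exponentiating the log-scale prediction would estimate the median, not the mean, of the log-normal predictive distribution.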
The CAR and SAR models have a single tuning parameter D that determines the construction of the neighbourhood matrix W, and the candidate values considered comprise D ∈ {5, 7, 9}. All models that incorporate a random forest component are implemented with Ntr = 1,000 trees, which initial analyses showed was sufficient for the prediction error to stabilise. Additionally, each random forest model was optimised over all possible combinations of the tuning parameters mtry ∈ {10, 20, 30, 40, 51} and minnode ∈ {1, 5, 10}, where the largest value of mtry is chosen to be equivalent to bagging. The GRF model contains all the random forest tuning parameters as well as the additional parameters (bw, α). Following Georganos et al. (2021), these latter parameters are optimised over all possible combinations of bw ∈ {100, 500, 1,000} and α ∈ {0.25, 0.5, 0.75, 1}. Finally, the SPAR-Forest algorithm (both CAR and SAR variants) was optimised with respect to the tuning parameters from the CAR / SAR model (D), the random forest model (mtry, minnode) and the total number of iterations of the algorithm (R). The same sets of candidate values described above were considered for the first two, together with all possible values R = 1, . . . , 20.
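The SPAR-Forest tuning grid is therefore the Cartesian product of these candidate sets. A quick sketch (variable names are ours; note also that in practice all values of R can typically be assessed along a single run of the iterative algorithm rather than by separate refits):

```python
from itertools import product

# Candidate values from the text: D for the CAR/SAR neighbourhood matrix,
# mtry/minnode for the random forest, R for the number of iterations.
D_vals = [5, 7, 9]
mtry_vals = [10, 20, 30, 40, 51]   # 51 features in total, so mtry = 51 is bagging
minnode_vals = [1, 5, 10]
R_vals = range(1, 21)

# Full grid of candidate configurations evaluated per training-test split
grid = list(product(D_vals, mtry_vals, minnode_vals, R_vals))
# 3 * 5 * 3 * 20 = 900 combinations
```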
7.2. Comparing the predictive ability of the models. The predictive abilities of the models for each training-test split, as well as the mean over all five splits, are summarised in Table 1, which presents the four metrics outlined in Section 5, namely RMSE, MAE, CP and AIW. The two variants of the proposed SPAR-Forest algorithm (CAR-Forest and SAR-Forest) produce the best point predictions in terms of both RMSE and MAE in almost all cases, with the lowest average RMSE coming from SAR-Forest (£39,723) and the lowest average MAE from CAR-Forest (£16,732). These point predictions show an average improvement over all five training-test splits of £1,761 (RMSE) and £814 (MAE) when comparing the best SPAR-Forest algorithm against the best competitor model. The GRF model performs the best of the competitor models in both RMSE and MAE, while the spatial CAR / SAR models slightly outperform a non-spatial random forest.
The 95% prediction intervals from all models exhibit close to their nominal coverage probabilities (CP), with values ranging between 0.939 (SAR model) and 0.955 (GRF) on average across all 5 training-test splits. The SAR-Forest intervals are the narrowest in all cases, with an average width of £136,827, which is £4,392 narrower than those from the existing competitor model with the narrowest intervals (SAR). Overall then, both variants of the proposed SPAR-Forest algorithm outperform their spatial smoothing and machine learning competitors, and there are only small differences between the CAR and SAR variants of the new algorithm. The remainder of this section presents results from the SAR-Forest variant, because it has the smallest RMSE and substantially narrower prediction intervals, albeit with slightly lower coverage than the CAR-Forest variant.
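The four metrics in Table 1 can be computed directly from the held-out observations and the back-transformed predictions and interval limits. A minimal sketch (function and variable names are ours), with MAE implemented as the median absolute error as defined in the Table 1 caption:

```python
import numpy as np

def prediction_metrics(y, yhat, lower, upper):
    """RMSE, median absolute error (MAE as defined in Table 1), empirical
    coverage of the 95% prediction intervals (CP), and average interval
    width (AIW), all on the original price scale."""
    err = y - yhat
    return {
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MAE": float(np.median(np.abs(err))),
        "CP": float(np.mean((y >= lower) & (y <= upper))),
        "AIW": float(np.mean(upper - lower)),
    }
```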
7.3. Comparing the accuracy of property price predictions by local authority. To get a regional view of the predictability of property prices in Scotland, we compute the RMSE and MAE metrics from the SAR-Forest model separately for each of the 29 local authorities in the study. These regional metrics are plotted against each other in Figure 3, which presents the mean values over all 5 training-test splits. The shading of the labels in the figure denotes the median price across the DZs in each LA, with darker colours indicating higher prices. The left panel (A) displays the absolute RMSE and MAE values, while the right panel (B) presents both metrics as percentages of their median property prices, to remove the scaling effect caused by some regions having more expensive properties than others.
Both panels show moderate to strong linear relationships between the RMSE and the MAE metrics, suggesting that the relative accuracy of property price predictions by local authority
TABLE 1
Comparison of the out-of-sample predictive abilities of each model for each data split, and the overall mean over all 5 splits. The acronyms in the table denote: RMSE - root mean square error; MAE - median absolute error; CP - coverage probability; and AIW - average interval width.

Model           Split 1     Split 2     Split 3     Split 4     Split 5        Mean

RMSE
LM              £46,992     £50,132     £49,569     £45,879     £42,331     £46,981
CAR             £41,617     £44,528     £45,084     £40,845     £37,115     £41,838
SAR             £41,812     £44,709     £45,038     £40,904     £37,264     £41,945
RF              £41,222     £46,407     £44,988     £42,622     £36,884     £42,424
GRF             £40,562     £45,713     £44,640     £41,218     £35,288     £41,484
CAR-Forest      £39,240     £43,066     £43,441     £39,964     £35,660     £40,274
SAR-Forest      £39,605     £42,935     £42,001     £39,269     £34,806     £39,723

MAE
LM              £20,401     £19,816     £20,718     £19,753     £18,988     £19,935
CAR             £18,565     £17,519     £17,614     £17,577     £17,388     £17,733
SAR             £18,549     £17,976     £17,844     £17,422     £17,489     £17,856
RF              £18,124     £17,736     £18,255     £18,588     £17,181     £17,977
GRF             £17,680     £17,352     £17,600     £18,585     £16,513     £17,546
CAR-Forest      £16,222     £16,838     £17,184     £17,313     £16,104     £16,732
SAR-Forest      £16,869     £16,645     £17,419     £16,484     £16,357     £16,755

CP
LM                0.932       0.943       0.936       0.943       0.950       0.941
CAR               0.947       0.947       0.938       0.946       0.949       0.945
SAR               0.942       0.940       0.928       0.939       0.945       0.939
RF                0.935       0.951       0.952       0.944       0.960       0.949
GRF               0.947       0.954       0.954       0.951       0.966       0.955
CAR-Forest        0.940       0.944       0.939       0.945       0.958       0.945
SAR-Forest        0.935       0.939       0.943       0.936       0.952       0.941

AIW
LM             £155,091    £148,898    £152,074    £151,954    £153,574    £152,318
CAR            £148,860    £147,725    £150,017    £146,681    £147,125    £148,082
SAR            £144,202    £139,263    £140,811    £139,416    £142,407    £141,219
RF             £148,397    £150,477    £151,874    £148,213    £150,840    £149,960
GRF            £150,160    £148,381    £152,490    £149,820    £151,432    £150,457
CAR-Forest     £144,511    £138,633    £143,863    £140,063    £144,179    £142,250
SAR-Forest     £139,748    £134,057    £137,834    £134,466    £138,032    £136,827
FIG 3. Comparison of the point prediction accuracy of the SAR-Forest algorithm by local authority, as measured by MAE and RMSE. Panel (A) presents the unscaled MAE / RMSE values, while panel (B) presents metrics scaled by the median property price in each LA.
is similar regardless of which metric is used. The left panel (A) shows that LAs with more expensive properties generally have higher prediction errors, with the City of Edinburgh, East Lothian and East Renfrewshire having the least accurate predictions in absolute terms, while Clackmannanshire and Dundee City have the most accurate predictions. However, once these prediction metrics have been scaled to account for differences in median property prices (panel (B)), East Lothian and West Dunbartonshire have the least predictable prices, while Clackmannanshire, Dundee City and Midlothian are the most predictable. These results also show that there are no clear spatial or urban-rural trends in the relative predictability of Scotland's housing market, as nearby or similar areas do not necessarily have similar prediction metrics.
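The regional breakdown above amounts to applying the prediction metrics within each local authority, with panel (B) rescaling by each authority's median price. An illustrative sketch with made-up data (the column names and values are ours, not the study's):

```python
import numpy as np
import pandas as pd

# Illustrative frame: one row per Data Zone, with its local authority (LA),
# observed median price and SAR-Forest prediction.
df = pd.DataFrame({
    "LA":    ["A", "A", "B", "B"],
    "price": [100000.0, 120000.0, 200000.0, 260000.0],
    "pred":  [110000.0, 110000.0, 210000.0, 240000.0],
})

def la_metrics(g):
    err = g["price"] - g["pred"]
    med = g["price"].median()
    return pd.Series({
        "RMSE": np.sqrt(np.mean(err ** 2)),
        "MAE": np.median(np.abs(err)),
        # Panel (B): metrics as a percentage of the LA's median price,
        # removing the scaling effect of more expensive regions.
        "RMSE_pct": 100 * np.sqrt(np.mean(err ** 2)) / med,
        "MAE_pct": 100 * np.median(np.abs(err)) / med,
    })

by_la = df.groupby("LA")[["price", "pred"]].apply(la_metrics)
```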
7.4. Predicting missing median property prices. The SAR-Forest algorithm is now fitted to the data for the 6,264 Data Zones that have non-missing property prices, and is subsequently used to predict prices for the remaining 617 DZs with missing values. The model is fitted with Ntr = 1,000 trees as before, while the tuning parameters are chosen as the medians of the optimal values identified in the 5 training-test splits, specifically: R = 19, D = 9, mtry = 30 and minnode = 5. The distributions of the predictions (617 DZs) and the observed prices (6,264 DZs) are displayed in panel (A) of Figure 4 as density estimates, with the predictions in dark grey and the observed data in light grey. The figure shows that the two distributions are right-skewed with broadly similar shapes, but that the Data Zones with missing values have lower predicted prices on average (median £96,961) than those with available data (median £139,282). The likely reason for this price differential is illustrated in panel (B) of Figure 4, which presents the predictions (dark grey, point estimates and 95% prediction intervals) and observed prices (light grey, observed values) against a measure of socio-economic deprivation, specifically the first principal component of the education domain of the SIMD. The figure shows that property prices decrease as the level
FIG 4. Comparison of the Data Zones with predicted (dark grey - predictions and 95% prediction intervals) and observed (light grey - observations) property prices. The left panel (A) shows density estimates, while the right panel (B) shows the relationship between price / prediction and socio-economic deprivation as measured by the first principal component of the education domain of the SIMD.
of socio-economic deprivation increases as expected, and that a high proportion of the Data
Zones with missing property price data are socio-economically deprived.
8. Discussion. This paper has proposed a novel fusion of random forests and autoregressive spatial smoothing models for the prediction of spatial areal unit data with missing values. It is thus one of the first coherent frameworks for applying random forests to spatially autocorrelated areal unit data, because unlike the RFRK (Hengl et al., 2015) and GLS (Saha, Basu and Datta, 2023) algorithms for point-level data, existing areal unit level methods such as geographical random forests (Georganos et al., 2021) and graph convolutional neural networks (Zhu et al., 2022) do not explicitly allow for the residual spatial autocorrelation in the data after the feature effects have been accounted for. The improved predictive performance of our SPAR-Forest algorithm compared with a number of state-of-the-art alternatives has been evidenced in both simulated and real data settings, and its superior performance is likely to be because it captures flexible feature-target relationships via random forests and residual spatial autocorrelation via random effects. In principle, any spatial smoothing model that allows predictions to be made for both training and test set observations could be used in our algorithm in place of the CAR / SAR component, such as two-dimensional spline-based smoothers, but CAR / SAR models were used here as they are the predominant approaches to smoothing spatial areal unit data. In contrast, random forests are a critical part of our algorithm, due to their ability to produce out-of-bag predictions for both the training and test sets in a computationally efficient manner. Other machine learning approaches could in principle be used, but as discussed in Section 4.2 they may be computationally infeasible (e.g., neural networks).
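The division of labour described above can be caricatured as a backfitting loop: a random forest is fitted to the target minus the current spatial effects, and the spatial component is then refitted to the residuals from the forest's out-of-bag predictions, which guard against overfitting the offset. The sketch below is schematic only, not the paper's method: a crude neighbour-averaging smoother with a fixed weight rho stands in for the full CAR / SAR fit, test-set prediction is omitted, and all names are ours.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def spar_forest_sketch(X, y, W, R=5, rho=0.5):
    """Schematic backfitting loop: a random forest captures feature effects
    via out-of-bag (OOB) predictions, while a neighbour-average smoother
    stands in for the CAR/SAR random-effect fit."""
    Wn = W / W.sum(axis=1, keepdims=True)   # row-standardised neighbourhood matrix
    phi = np.zeros(len(y))                  # spatial random effects, initially zero
    for _ in range(R):
        # Fit the forest to the de-spatialised target
        rf = RandomForestRegressor(n_estimators=200, oob_score=True,
                                   random_state=0).fit(X, y - phi)
        f_oob = rf.oob_prediction_          # OOB predictions avoid overfitting
        # Smooth the feature-adjusted residuals towards their neighbours
        phi = rho * (Wn @ (y - f_oob))
    return rf, phi

# Tiny synthetic example on a chain of 50 'areas'
rng = np.random.default_rng(1)
n = 50
X = rng.normal(size=(n, 3))
W = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)  # chain adjacency
y = X[:, 0] + rng.normal(scale=0.1, size=n)
rf, phi = spar_forest_sketch(X, y, W, R=3)
```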
In the motivating study the SAR-Forest variant narrowly outperformed the CAR-Forest variant in most of the prediction metrics, although the differences between these two similar models were generally not large. Furthermore, SAR-Forest provided improved predictive performance compared to the best existing competitor by around 4.2% (£1,761) in RMSE, 4.5% (£791) in MAE, and 3.2% (£4,464) in the precision of its 95% prediction intervals. These improvements in point prediction are benchmarked against the geographical random forest model, which generally outperforms the linear, random forest and CAR / SAR models in our study. Finally, a comparison of the random forest and CAR / SAR models shows that the latter slightly outperform the former in terms of point prediction, suggesting that for these property price data the ability to capture residual spatial autocorrelation is more important than the ability to capture non-linear feature-target relationships.
The simulation study illustrates that the SPAR-Forest algorithm performs as one would expect under controlled conditions, which provides further evidence of its efficacy and robustness. In the presence of both non-linear feature-target relationships and residual spatial autocorrelation, SPAR-Forest outperforms its competitors, because it is the only method that can accommodate both of these components. However, SPAR-Forest does not perform as well as the CAR / SAR models when all the feature-target relationships are exactly linear, because in this case the additional, unnecessary flexibility of the random forest results in poorer predictive performance than rigidly enforcing the relationships to be linear. Therefore, in practical applications one should identify in advance any features that exhibit close to linear relationships with the target variable via exploratory scatterplots, and include those features as linear terms in the CAR / SAR component of the algorithm rather than in the random forest.
This paper has focused exclusively on predictive performance for purely spatial data, and hence there is much scope to further develop our methodology to address more complex scenarios. One such challenge is how to model multivariate spatio-temporal data, where different random forest models would be needed for each target variable and possibly each time period, while complex residual autocorrelations over space, time and between the target variables would need to be accounted for. A second challenge concerns the inferential goal, which here was restricted to prediction for areal units with missing data values. In this context the effects of the features on the target variable were not of direct interest, which allowed us to plug their estimated effects from the random forest into the second stage CAR / SAR model as a fixed offset. However, in the field of spatial ecological regression the relationships between features and the target variable are the primary inferential goal, meaning that the effects of the features and their uncertainty need to be quantified. One possible solution is to split the features into a set of those that are of primary interest and a second set of confounders, with the confounders remaining in the random forest component due to its flexibility, while the features of primary interest are included in the Bayesian CAR / SAR model as suggested above in the context of linear feature-target relationships. A second possible solution is to use interpretable machine learning tools for feature effects, such as partial dependence plots (Friedman, 2001) and variable importance plots (see Greenwell and Boehmke, 2020).
Finally, as noted in the introduction, spatial areal unit data are prevalent in many fields, which gives the SPAR-Forest algorithm a wide range of possible application areas beyond the property price example that motivated this paper. One important such area is disease mapping (Lawson, 2018), where the fusion of machine learning and spatial autocorrelation models has the potential to help answer a number of public health questions, such as: where are the hotspots of disease risk; which features affect disease risk; and how big are the health inequalities between different communities, and how are they changing over time?
REFERENCES
BERROCAL, V., GUAN, Y., MUYSKENS, A., WANG, H., REICH, B., MULHOLLAND, J. and CHANG, H. (2020). A comparison of statistical and machine learning methods for creating national daily maps of ambient PM2.5 concentration. Atmospheric Environment 222 117130.
BESAG, J., YORK, J. and MOLLIÉ, A. (1991). Bayesian image restoration with two applications in spatial statistics. Annals of the Institute of Statistical Mathematics 43 1-59.
BOEHMKE, B. and GREENWELL, B. (2020). Hands-on machine learning with R. Chapman & Hall/CRC.
BREIMAN, L. (1984). Classification and regression trees. Routledge.
BREIMAN, L. (2001). Random forests. Machine Learning 45 5-32.
BREWER, M. and NOLAN, A. (2007). Variable smoothing in Bayesian intrinsic autoregressions. Environmetrics 18 841-857.
CREDIT, K. (2022). Spatial models or random forest? Evaluating the use of spatially explicit machine learning methods to predict employment density around new transit stations in Los Angeles. Geographical Analysis 54 58-83.
FARAWAY, J. (2014). Linear models with R. Chapman & Hall/CRC.
FRANKE, J. and NEUMANN, M. (2000). Bootstrapping neural networks. Neural Computation 12 1929-1949.
FRIEDMAN, J. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics 29 1189-1232.
GEORGANOS, S., GRIPPA, T., GADIAGA, A., LINARD, C., LENNERT, M., VANHUYSSE, S., MBOGA, N., WOLFF, E. and KALOGIROU, S. (2021). Geographical random forests: a spatial extension of the random forest algorithm to address spatial heterogeneity in remote sensing and population modelling. Geocarto International 36 121-136.
GREENWELL, B. and BOEHMKE, B. (2020). Variable importance plots — An introduction to the vip package. The R Journal 20 343-366.
HAINING, B. and LI, G. (2020). Modelling spatial and spatio-temporal data: A Bayesian approach. Chapman & Hall/CRC.
HENGL, T., HEUVELINK, G., KEMPEN, B., LEENAARS, J., WALSH, M., SHEPHERD, K., SILA, A., MACMILLAN, R., MENDES DE JESUS, J., TAMENE, L. and TONDOH, J. (2015). Mapping soil properties of Africa at 250m resolution: random forests significantly improve current predictions. PLOS ONE 10 1-26.
JACK, E., LEE, D. and DEAN, N. (2019). Estimating the changing nature of Scotland's health inequalities by using a multivariate spatiotemporal model. Journal of the Royal Statistical Society Series A: Statistics in Society 182 1061-1080.
KAWABATA, M., NAOI, M. and YASUDA, S. (2022). Earthquake risk reduction and residential land prices in Tokyo. Journal of Spatial Econometrics 3 5.
KIPF, T. and WELLING, M. (2017). Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations.
KNORR-HELD, L. and RASSER, G. (2000). Bayesian detection of clusters and discontinuities in disease maps. Biometrics 56 13-21.
KUHN, M. and JOHNSON, K. (2019). Feature engineering and selection: A practical approach for predictive models. Chapman & Hall/CRC.
LAWSON, A. (2018). Bayesian disease mapping: hierarchical modeling in spatial epidemiology. Chapman & Hall/CRC.
LECUN, Y., BENGIO, Y. and HINTON, G. (2015). Deep learning. Nature 521 436-444.
LECUN, Y., BOSER, B., DENKER, J., HENDERSON, D., HOWARD, R., HUBBARD, W. and JACKEL, L. (1990). Handwritten digit recognition with a back-propagation network. Advances in Neural Information Processing Systems 396-404.
LEE, D. and ANDERSON, C. (2023). Delivering spatially comparable inference on the risks of multiple severities of respiratory disease from spatially misaligned disease count data. Biometrics 79 2691-2704.
LEE, D., MEEKS, K. and PETTERSSON, W. (2021). Improved inference for areal unit count data using graph-based optimisation. Statistics and Computing 31 51.
LEROUX, B., LEI, X. and BRESLOW, N. (2000). Estimation of disease rates in small areas: A new mixed model for spatial dependence. In Statistical Models in Epidemiology, the Environment and Clinical Trials (M. Halloran and D. Berry, eds.) 135-178. Springer-Verlag, New York.
PALMER, G., DU, S., POLITOWICZ, A., EMORY, J., YANG, X., GAUTAM, A., GUPTA, G., LI, Z., JACOBS, R. and MORGAN, D. (2022). Calibration after bootstrap for accurate uncertainty quantification in regression models. npj Computational Materials 8 115.
RIEBLER, A., SØRBYE, S., SIMPSON, D. and RUE, H. (2016). An intuitive Bayesian spatial model for disease mapping that accounts for scaling. Statistical Methods in Medical Research 25 1145-1165.
RUE, H., MARTINO, S. and CHOPIN, N. (2009). Approximate Bayesian inference for latent Gaussian models using integrated nested Laplace approximations (with discussion). Journal of the Royal Statistical Society Series B 71 319-392.
SAHA, A., BASU, S. and DATTA, A. (2023). Random forests for spatially dependent data. Journal of the American Statistical Association 118 665-683.
SOLTANI, A., HEYDARI, M., AGHAEI, F. and PETTIT, C. (2022). Housing price prediction incorporating spatio-temporal dependency into machine learning algorithms. Cities 131 103941.
WANG, X., WANG, X., REN, X. and WEN, F. (2022). Can digital financial inclusion affect CO2 emissions of China at the prefecture level? Evidence from a spatial econometric approach. Energy Economics 109 105966.
WHITTLE, P. (1954). On stationary processes in the plane. Biometrika 41 434-449.
WRIGHT, M. and ZIEGLER, A. (2017). ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software 77 1-17.
XIA, Z., STEWART, K. and FAN, J. (2021). Incorporating space and time into random forest models for analyzing geospatial patterns of drug-related crime incidents in a major U.S. metropolitan area. Computers, Environment and Urban Systems 87 101599.
ZHANG, H., ZIMMERMAN, J., NETTLETON, D. and NORDMAN, D. (2020). Random forest prediction intervals. The American Statistician 74 392-406.
ZHU, D., LIU, Y., YAO, X. and FISCHER, M. (2022). Spatial regression graph convolutional neural networks: A deep learning paradigm for spatial multivariate distributions. Geoinformatica 26 645-676.
MacBride, Cara, Davies, Vinny and Lee, Duncan (2025). A spatial autoregressive random forest algorithm for small-area spatial prediction. Annals of Applied Statistics, 9(1), pp. 485-504. (doi: 10.1214/24-AOAS1969)
http://eprints.gla.ac.uk/337422/