Spatial Analysis with Linear Regression and Kernels

P. Milton, H. Coupland, E. Giorgi, S. Bhatt

Epidemics 29 (2019) 100362
Abstract: Kernel methods are a popular technique for extending linear models to handle non-linear spatial problems via a mapping to an implicit, high-dimensional feature space. While kernel methods are computationally cheaper than an explicit feature mapping, they are still subject to cubic cost in the number of points. Given only a few thousand locations, this computational cost rapidly outstrips the currently available computational power. This paper aims to provide an overview of kernel methods from first principles (with a focus on ridge regression) and progress to a review of random Fourier features (RFF), a method that enables the scaling of kernel methods to big datasets. We show how the RFF method is capable of approximating the full kernel matrix, providing a significant computational speed-up for a negligible cost to accuracy, and can be incorporated into many existing spatial methods using only a few lines of code. We give an example of the implementation of RFFs on a simulated spatial data set to illustrate these properties. Lastly, we summarise the main issues with RFFs and highlight some of the advanced techniques aimed at alleviating them. At each stage, the associated R code is provided.

Keywords: Regression; Random Fourier features; Kernel methods; Kernel approximation
1. Introduction

The mapping of infectious disease has become a cornerstone of global health. Maps can represent the distribution of an infectious disease through space and often time. From pre-intervention planning through to near-elimination settings, and from the global to the village scale, mapping can inform policy and decision making (Tatem et al., 2010; Noma et al., 2002; Cuadros et al., 2017; Gleason et al., 2017; Mena et al., 2016). Despite their importance, only a small subset of infectious diseases have been comprehensively mapped, estimated at 4% (Hay et al., 2013). As the world becomes increasingly connected, it is likely that more data will become available to enable mapping of many more diseases. Multiple methods have been developed to allow for the flexible and non-linear analysis of spatial data, particularly kernel methods including Gaussian processes. While these models provide excellent flexibility to model the complex dynamics of infectious diseases through time and space, they can be computationally intensive. Combining computationally intensive algorithms with lots of data can require greater resources than are available to the average researcher. The goal of this paper is to introduce a powerful, computationally favourable new approach called random Fourier features (RFF), which represents a simple method to extend kernel methods to large spatial problems.

Various papers on RFF have shown they can provide significant computational speed-ups for kernel methods with minimal loss in accuracy, both theoretically and in practice (Rahimi and Recht, 2007; Yang et al., 2012; Avron et al., 2018). However, these papers are mathematically rigorous and it is not always apparent how these methods work, nor how to incorporate them into spatial analysis. This paper aims to provide the reader with the fundamentals of how the RFF method works and to illustrate how RFF can be incorporated into existing spatial methods using only a few lines of code. The first half of the paper introduces a large body of theory known as model-based geostatistics (Diggle et al., 1998), building from linear models to kernel methods, specifically kernel ridge regression, capable of capturing complex non-linear spatial patterns. This paper focuses on the modelling of spatial processes, but these methods naturally extend to spatio-temporal processes. The second half of the paper focuses on RFF, showing how they can speed up kernel methods by approximating the implicit feature mappings associated with shift-invariant kernels. The paper concludes with a discussion of some of the advantages and disadvantages of RFF and highlights some of the advanced methods that improve upon standard RFF. Code is presented wherever relevant, including a brief toy example in R where we fit a spatial problem using nothing more than linear regression and some transforms.

There is considerable overlap between our introduction of kernel learning and the more traditional formulations based on Gaussian process regression.
For an introduction to the Gaussian process and model-based geostatistics, we refer the reader to Rasmussen and Williams (2005) and Diggle and Ribeiro (2007). For a detailed description of the mathematical correspondence between kernels and Gaussian processes, we refer the reader to Kanagawa et al. (2018).

2. Linear model

In spatial analysis, the aim is to find a model that converts a set of inputs to a corresponding output of interest at N locations in space and time. The output is termed the response variable, and represents the variable that we are trying to predict at each of the N locations, such as case counts or prevalence for a disease under investigation (Gething et al., 2016) (or ancillary epidemiological variables including anthropometric indicators like height and weight (Osgood-Zimmerman et al., 2018; Josepha et al., 2019), or socioeconomic indicators such as access to water or education (Graetz et al., 2018; Andres et al., 2018)). The inputs are referred to as explanatory variables and consist of multiple independent variables recorded at the same locations as the response variables. The choice of explanatory variables is highly dependent on the specifics of a disease, but common examples are population size, age, precipitation, urbanicity and spatial or space-time coordinates.

Probably the simplest model to link the explanatory and response variables is the linear model. A linear model assumes that responses are a weighted linear combination of the explanatory variables with some additional uncorrelated (independent) noise, ϵ, which may be written as (McCullagh and Nelder, 1989):

y = Xw + ϵ    (1)

where y = (y_1, y_2, …, y_N) ∈ ℝ^N denotes the vector of response variables given at the N locations. The explanatory variables are generally given as a matrix X = [x_1, x_2, …, x_N] ∈ ℝ^{N×d}, often referred to as the design matrix. The design matrix consists of N rows, one for each location, and d columns, one for each explanatory variable. The vector w = (w_1, w_2, …, w_d) ∈ ℝ^d represents the weights, which are used to adjust the influence of each explanatory variable on the model's prediction of the response. For any location i, the model is given as y_i = Σ_{j=1}^{d} x_{i,j} w_j + ϵ_i.

The task is to find an optimal set of weights such that the transform of the explanatory variables is as close as possible to the response variables. In order to do this, we first need to define what is meant by "as close as possible" using a loss function. A common choice is the squared loss function, which computes the sum of the squared differences between the predicted responses given by the model and the observed response variables for a given set of weights, written as:

S(w) = (1/2)‖y − Xw‖²    (2)

The closer the model's predicted responses (Xw) for a given set of weights are to the observed responses (y), the smaller the value of the loss function. Therefore, the set of weights that minimises the loss function will represent the optimal set of weights and give the smallest difference between the model's predicted and the observed response variables. Note, the multiplication by half in Eq. (2) is used solely to simplify the derivative of the loss function and does not affect the solutions. Finding the set of weights that minimise the loss function requires solving min_{w ∈ ℝ^d} ‖y − Xw‖². Elementary calculus tells us that this minimal value can be found by taking the derivative of the loss function (with respect to the weights), setting it equal to zero and rearranging the equation to solve for the optimal weights:

∂S(w)/∂w = Xᵀ(Xw − y) = 0

ŵ = (XᵀX)⁻¹Xᵀy    (3)

The value of the weights that minimises the squared loss function is denoted ŵ and is called the ordinary least squares (OLS) estimator. Given the explanatory variables at location i, the model's prediction of the response, ŷ_i, is computed as ŷ_i = x_i ŵ.
Minimising the loss function to find the weights that best capture the data is referred to as training. The overall model performance after training can be summarised by its error, often calculated as some measure of the difference between the model's predicted and the observed response variables. Mean squared error (MSE) is a very common choice, which measures the average squared difference between the model's predicted and the observed responses, MSE = (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)². However, a good model should not only be able to accurately predict the response variables at locations used in training, but also at different locations not used in training the model. To demonstrate the model's capability to do this, a small proportion of the total available data is set aside and excluded from training, forming a test dataset. The error between the predicted and observed response variables at these test locations is calculated to check that the model is able to make accurate predictions at locations that were not used to train the model; this process is termed testing. A model with small training and testing error should be capable of generalising to unobserved locations where response variables have not been recorded.
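Continuing the sketch above, the train/test workflow can be written as follows (the 20% hold-out proportion is our choice for illustration, not a prescription from the paper):

test_idx <- sample(N, size = 0.2 * N)                  # hold out 20% of locations for testing
X_train <- X[-test_idx, ]; y_train <- y[-test_idx]
X_test  <- X[test_idx, ];  y_test  <- y[test_idx]
w_hat <- solve(t(X_train) %*% X_train, t(X_train) %*% y_train)
mse <- function(y, y_hat) mean((y - y_hat)^2)          # mean squared error
mse(y_train, X_train %*% w_hat)                        # training error
mse(y_test,  X_test %*% w_hat)                         # testing error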
The OLS estimator is the best linear unbiased estimator (BLUE) when the assumptions of the Gauss-Markov theorem are met (see Davidson et al. (2004) for the standard proof). The Gauss-Markov assumptions include that the error terms, ϵ, have a mean of zero, constant variance and are pairwise uncorrelated. However, there are two critical assumptions required for many analyses. The first is that the response variable is assumed to be a linear function of the explanatory variables specified in the model. Some problems are non-linear, such that even the best linear model is an inappropriate representation. The second is that the design matrix, X, is full rank. The design matrix will not be full rank if any of the explanatory variables are perfectly multicollinear, which refers to the situation when one explanatory variable can be expressed as an exact linear combination of one or multiple other explanatory variables (Farrar and Glauber, 1967). When the data contain perfect multicollinearity, it is no longer possible to invert the matrix XᵀX, preventing the derivation of the weights in Eq. (3). It is more common, however, for variables to show multicollinearity (rather than perfect multicollinearity), when there is an approximate but not exact linear relationship between two or more explanatory variables. Multicollinearity is common in epidemiological data, where seemingly independent explanatory variables can actually correlate through some latent variable (for example, many variables can be correlated with socioeconomic status) (Vatcheva et al., 2016). In cases of strong (but not perfect) multicollinearity, inversion algorithms may still fail to find the inverse of XᵀX, or may generate inaccurate solutions. Except for the assumption of perfect multicollinearity, violating one or more of the Gauss-Markov assumptions does not prevent the fitting of a linear model, but results in an estimator that is not the BLUE.

3. A linear model of non-linear features

For many spatial problems the response variables cannot be described as a linear function of the explanatory variables. One approach to introduce non-linearity to the model is to transform the explanatory variables using non-linear transformations, such that the responses are described as a linear combination of non-linear terms. For example, rather than a weighted sum of linear terms (i.e. x_1 + x_2 + x_3), we may instead use terms with exponents, logarithms or trigonometric functions (i.e. exp(x_1) + log(x_2) + sin(x_3)). Transforming the inputs rather than changing the model allows us to continue using the convenient maths we derived for linear models and apply it to non-linear systems. The transformation of the explanatory variables is called a feature mapping, and the new space to which the data are mapped is called the feature space. Fig. 1 presents an example of a mapping to a feature space such that the response can be expressed in linear terms (code for this example is given in Supplementary Code 1). Generally, a mapping is denoted by Φ and the general form of a feature mapping can be written as:
Φ: x_i ∈ ℝ^d ↦ Φ(x_i) ∈ ℝ^D,   Φ: X ∈ ℝ^{N×d} ↦ Φ(X) ∈ ℝ^{N×D}    (4)

Fig. 1. An example of non-linear feature mapping, where the data are mapped from (A) an input space in which the problem is non-linear into a new feature space (B) in which the outputs can be described as a linear combination of the inputs.

The explanatory variables at location i, x_i, are mapped from a vector of length d to a vector of length D, denoted Φ(x_i). Applying this mapping to the entire design matrix gives a new design matrix that exists in the feature space, Φ(X) ∈ ℝ^{N×D}. A mapping can project into a higher-dimensional (d < D) or lower-dimensional (d > D) space, although in the context of spatial analysis, mapping to a higher-dimensional space is more common. The same set of equations can be used to solve a model that is linear in feature space after the mapping, simply by replacing the design matrix, X, with the new design matrix in the feature space, Φ(X):

ŵ = (Φ(X)ᵀΦ(X))⁻¹Φ(X)ᵀy,   ŷ_i = Φ(x_i)ŵ    (5)

Depending on the choice of mapping, the model is now capable of capturing non-linear relationships between the explanatory and response variables. The solution after the feature mapping now requires solving for the D weights, corresponding to a weight for each column of the design matrix in feature space, Φ(X). Specifically, the solution is of computational complexity O(D²N) in the best case, where the big O notation, O(·), is used to denote how the relative running time or space requirements grow as the input size grows. This complexity is extremely useful because, even when the number of locations, N, is large, the dimensions of the explanatory variables dominate the solution.
4. Overfitting and ridge regression

Any high-dimensional mapping that significantly expands the number of features (D ≫ d) greatly increases the risk of overfitting. A model is overfitted if it too closely or exactly predicts the training data but fails to accurately predict the test data. Overfitting is characterised by a very small training error but a high testing error. All data, but especially epidemiological data, contain random irreducible noise. When a model is overfitted, it captures the random noise in the training data as well as the relationships between the explanatory and response variables. However, this pattern of noise will vary between training and test data. Therefore, while capturing the noise in the training data reduces the training error, it increases the testing error by reducing the capability of the model to generalise to the test data, which will inherently have different random noise. An inability to accurately predict response variables for test data demonstrates that an overfitted model will not be able to make reliable predictions when presented with data from locations where the response variable has not been recorded. This balance between reducing either the training error or the testing error forms the famed result in machine learning termed the bias-variance trade-off (Geman et al., 1992; Domingos, 2000). Overfitting occurs when the model contains more explanatory variables than are appropriate to describe the data, so that the additional variables are fitted to the noise. As an extreme example, if the number of explanatory variables is equal to or greater than the number of locations, then a linear model can pass through every point exactly. Any high-dimensional mapping greatly increases the risk of excess explanatory variables and thus overfitting. An approach that is frequently adopted to prevent overfitting is to apply regularisation.

Broadly, regularisation acts to penalise more complex models. Different forms of regularisation define the complexity of a model differently, but a common choice of regularisation is called Tikhonov regularisation, or ridge regression in the statistical literature (Tikhonov, 1963; Bell et al., 1978; Hoerl and Kennard, 1970). The idea behind ridge regression is to control complexity by penalising models with many large weights. This prevents the model from using the redundant explanatory variables to capture the noise and forces the model to focus on the relationship between explanatory and response variables. When a model has large weights, only a small change in the explanatory variables is required to induce a large change in the response. Therefore, a model with small weights cannot change as quickly as a model with large weights. For a linear model this effectively controls the steepness of the regression line, but for non-linear regression (or linear regression in non-linear feature space) this corresponds to the wiggliness of the regression curve. Ridge regularisation consists of two key components: a Euclidean norm term, ‖w‖₂, and a regularisation parameter λ. The Euclidean norm term computes the positive square root of the sum of squares of all the weights in the model, ‖w‖₂ := √(w_1² + w_2² + … + w_d²). A model with many large weights (whether positive or negative) will have a large norm. The λ > 0 parameter scales the norm and controls the amount of regularisation. Ridge regularisation is achieved by adding a penalisation term to the loss function that depends on the norm of the weights. The loss function for ridge regression is given by:

S(w) = (1/2)‖y − Xw‖² + (λ/2)‖w‖²    (6)
When λ is big, the ‖w‖² term significantly increases the value of the loss function if the norm of the weights is large, so that the act of minimising the loss function favours models with small weights. As λ approaches zero, large norms are less heavily penalised, allowing models with larger weights to represent optimal solutions to Eq. (6). When λ = 0 the loss function is equal to the OLS solution. Regularisation results in model weights that are biased towards zero, and thus (by design) the ridge estimate is a biased estimator.

As with the linear model, training requires finding the weights that minimise the ridge regression loss function. This is calculated in the same way, by taking the derivative of the ridge loss function, setting it equal to zero, and solving for the optimal ridge weights ŵ_ridge:

∂S(w)/∂w = Xᵀ(Xw − y) + λw = 0

ŵ_ridge = (XᵀX + λI_d)⁻¹Xᵀy    (7)

where I_d is the identity matrix (a square matrix in which all the elements of the principal diagonal are ones and all other elements are zeros) with d × d dimensions. The optimal regularisation parameter λ is often unknown but can be estimated during training through methods including cross-validation or restricted maximum likelihood. The solution in Eq. (7) is known as the primal solution of the ridge regression and has computational complexity O(d²N). As with linear regression, the primal can be applied in feature space by replacing X with Φ(X), in which case the cost scales with the dimension D of the feature mapping.
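A minimal sketch of the primal ridge solution in Eq. (7), continuing the feature-mapped example from Section 3; here λ is fixed by hand purely for illustration rather than estimated by cross-validation or restricted maximum likelihood:

ridge_fit <- function(X, y, lambda) {
  # (X'X + lambda * I_d)^{-1} X'y, the primal ridge estimator of Eq. (7)
  solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
}
w_ridge <- ridge_fit(Phi_X, y, lambda = 1)
mean((y - Phi_X %*% w_ridge)^2)                        # training MSE of the regularised model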
The dual solution of ridge regression can be derived independent of the primal by expressing the problem as a constrained minimisation problem and solving using Lagrangian multipliers (Supplementary Equations 1). The dual solution involves computing and inverting the matrix XXᵀ, with computational complexity O(N³) in the worst case. With the dual solution, no matter how high dimensional the feature mapping, the dual will only require solving for the same number of dual variables and be of the same computational complexity. Therefore, for any high-dimensional mapping where D > N, the dual solution is computationally easier to solve than the primal.

Less obviously, Eq. (9) computes XXᵀ, with the resulting matrix a symmetric positive semidefinite matrix called the Gram matrix, G ∈ ℝ^{N×N}. The Gram matrix contains the inner products of the explanatory variables between every pair of the N locations. An inner product is a way to multiply vectors together, with the result of this multiplication being a scalar measure of the vectors' similarity. This notion of similarity is central to spatial or temporal analysis, where we want to leverage the fact that points close to each other in space or time should be more similar than those far apart. Explicitly, the inner product of two vectors x_i and x_j ∈ ℝ^d is given by ⟨x_i, x_j⟩ = x_iᵀx_j = Σ_{k=1}^{d} x_{i,k} x_{j,k}.
A kernel function is a symmetric positive semi-definite function that corresponds to computing the inner product in a corresponding reproducing kernel Hilbert space (RKHS) (Shawe-Taylor et al., 2004). This RKHS is specific to the chosen kernel function and simply represents a feature space that has a valid inner product (the notion of similarity is conserved). For a detailed explanation of the mathematics of kernels we recommend Shawe-Taylor et al. (2004). If the kernel function is applied to all pairs of locations in the training data, the resulting matrix is termed the kernel matrix, K ∈ ℝ^{N×N}, and can be thought of as the Gram matrix in the RKHS:

K = ⟨Φ(X), Φ(X)⟩ = Φ(X)Φ(X)ᵀ    (15)

Importantly, the dual solution only requires inner products to solve. Therefore, the Gram matrix in Eq. (12) can be replaced by the kernel matrix:

ŷ = K(K + λI)⁻¹y    (16)

This is equivalent to solving the dual in the RKHS associated with the kernel. However, the data is never actually mapped into this implicit feature space. Instead, the kernel function directly computes the inner product in the RKHS and uses those inner products to solve the dual. Given that the data is never explicitly mapped to the RKHS, the feature space can be of very high or even infinite dimensions; as long as we have the associated kernel function, and thus the inner product in this feature space, the dual can be solved.

The replacement of the explicit mapping with the implicit mapping associated with a kernel function is called the kernel trick and can be applied to many algorithms with dual solutions. Eq. (16) combines ridge regression with the kernel trick to create kernel ridge regression (KRR), which effectively solves the model in the RKHS, with the ridge penalty preventing overfitting and ensuring small loss for both training and testing. The dual solution still only requires computing the N dual variables with complexity O(N³). Thus, given an appropriate kernel, the KRR dual can be solved in infinite-dimensional feature space with no added computational complexity. Therefore, the combination of the dual solution and kernels is a powerful tool capable of extending linear models to very high dimensional feature spaces with the ability to handle nearly any non-linear problem.

Kernel functions can be derived by direct construction: finding the function that corresponds to taking the inner product in the feature space (see Supplementary Equations 2 for an example of constructing the polynomial kernel from an explicit feature mapping). Generally, kernels are described by showing that the proposed function satisfies Mercer's conditions, proving the function is a symmetric positive-definite function and guaranteeing there exists an RKHS (Shawe-Taylor et al., 2004). A huge number of kernels have already been described, with specific kernels often used to solve specific tasks. Common kernels for spatial analysis include the squared exponential and Matérn kernels. The squared exponential kernel is popular because it generates smooth functions that are appropriate for spatial interpolation, while the Matérn kernel allows a better balance between smoothness and roughness of the resulting functions that may better represent true spatial processes (Stein, 2012). The exponential kernel is a popular choice for modelling temporal data, where the similarity between two points is assumed to decay exponentially as the time between them increases. For example, the squared exponential kernel is given by:

k_SE(x_i, x_j) = exp(−‖x_i − x_j‖² / (2ℓ²))    (17)

where ℓ ≥ 0 is the length-scale. The length-scale controls the smoothness of the resulting functions. The kernel parameters are often learned from the data alongside the model variables. The squared exponential kernel corresponds to the inner product in an infinite dimensional feature space, shown in Supplementary Equations 3.
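The sketch below illustrates kernel ridge regression with the squared exponential kernel of Eq. (17), solved through the dual as in Eq. (16). The data, length-scale and λ are illustrative assumptions, not values from the paper:

se_kernel <- function(X1, X2, l = 1) {
  # squared exponential kernel matrix between the rows of X1 and X2, Eq. (17)
  D2 <- outer(rowSums(X1^2), rowSums(X2^2), "+") - 2 * X1 %*% t(X2)
  exp(-D2 / (2 * l^2))
}
set.seed(3)
X <- matrix(runif(200 * 2, -2, 2), 200, 2)             # 200 spatial locations
y <- X[, 1]^2 + X[, 2]^2 + rnorm(200, sd = 0.1)        # non-linear spatial response
lambda <- 0.1
K     <- se_kernel(X, X)
alpha <- solve(K + lambda * diag(nrow(K)), y)          # the N dual variables, (K + lambda*I)^{-1} y
X_new <- matrix(runif(10 * 2, -2, 2), 10, 2)
y_new <- se_kernel(X_new, X) %*% alpha                 # predictions at unobserved locations, Eq. (16)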
7. The big N problem

One of the primary motivations of combining the dual and kernels is that this process only requires computing and inverting the kernel matrix K ∈ ℝ^{N×N}, and solving for the N dual variables, even when the dimension of the (implicit) feature space is infinitely large. Historically this led to the widespread adoption of kernel methods to solve difficult problems on small and medium datasets. However, the dual still requires O(N²) storage for N observations and O(N³) complexity. Given only a few thousand points, these costs can rapidly outstrip the storage and computational power available to most researchers.

A plethora of methods have been developed to allow the scaling of kernel methods to large datasets. These methods aim to find smaller or simpler matrices that provide good approximations of the full kernel matrix. The three main techniques used are low-rank approximations, sparse approximations and spectral methods. Low-rank approximations of a matrix aim to find smaller representations of the kernel matrix that contain all (or nearly all) of the information in the full kernel (Bach and Jordan, 2005). For example, the popular Nyström method approximates the full kernel matrix through a subset of its columns and rows (Williams and Seeger, 2001). In contrast, sparse methods aim to find representations of the matrix that are mostly zeros, because efficient algorithms exist for the storage of and computation with such matrices (Rue and Held, 2005; Straeter, 1971; Saad and Schultz, 1986). One of the best examples is the sparse matrix generated when modelling spatial data as a Gaussian Markov random field (GMRF) that is the solution to a stochastic partial differential equation (SPDE) (Lindgren et al., 2011; Whittle, 1954, 1963). However, the remainder of this paper will focus on an exciting, new subset of spectral methods called random Fourier features (RFF).

RFF combine the flexibility of kernels with the computational benefits of the primal solution. RFF use the Fourier transform of a kernel function to explicitly map the data to a relatively low-dimensional space that approximates the implicit feature space associated with the kernel. The data in this feature space can either be used to construct an unbiased estimator of the full kernel or be used to solve the primal at a significantly reduced computational cost. This is distinct from many low-rank or sparse approximations that rely on approximations of the kernel matrix and thus the dual. The RFF method approximates the entire kernel function at once, does not rely on inducing points or "knots", does not require throwing away any data, can easily be incorporated into many existing linear models, and has steps no more complicated than sampling from a probability distribution and applying trigonometric functions.

8. Random Fourier features

RFF and other spectral methods rely on the characterisation of the kernel function through its Fourier transform. Any function can be decomposed into the periodic, trigonometric functions (sin and cos) of different frequencies that make it up. As an analogy, consider how a complex musical chord is composed of a number of individual notes, where each note is just a string vibrating at a particular frequency. The Fourier transform tells us how much of the trigonometric functions of each frequency must be added to construct a given target function. The central idea behind spectral methods is that a good approximation of the frequencies that make up the kernel function will naturally yield a good approximation to the kernel function itself. For the mathematics of Fourier transforms, we refer the reader to Bracewell and Bracewell (1986). All spectral methods are based on the same mathematical foundation: the celebrated Bochner's theorem (Bochner and Chandrasekharan, 1949). Loosely, Bochner's theorem states that a shift-invariant kernel function (where the output of the kernel is only dependent on the difference between the inputs and not the explicit values of the inputs themselves), k(x_i, x_j) = k(x_i − x_j), on ℝ^d can be expressed through a Fourier transform (Rudin, 1990):
k(x_i − x_j) = ∫_{ℝ^d} e^{iωᵀ(x_i − x_j)} ρ(ω) dω    (18)

This theorem tells us that the Fourier transform of a shift-invariant kernel takes the form of a probability distribution, ρ(ω). This distribution is called the spectral density of the kernel and is the distribution of the amount of a given frequency, ω, that must be added to construct the kernel function. The larger the spectral density for a given ω, the greater the amount of that frequency that must be added to reproduce the kernel function. Applying Euler's identity (e^{iπ} = cos(π) + i sin(π)) to the exponential and ignoring the imaginary component, let us consider Bochner's theorem in terms of the trigonometric functions:

k(x_i − x_j) = ∫_{ℝ^d} [cos(ωᵀx_i) cos(ωᵀx_j) + sin(ωᵀx_i) sin(ωᵀx_j)] ρ(ω) dω    (19)

However, a major problem is that evaluating the integral in Eq. (19) requires integrating over the infinite set of all possible frequencies. To avoid this, we can approximate this infinite integral by a finite one using Monte Carlo integration. In Monte Carlo integration, the full integral of a function is approximated by computing the value of the function evaluated at a random set of points and averaging. In RFF, the integral is approximated by averaging the sum of the function evaluated at random samples of ω drawn from the probability distribution ρ(ω). The greater the number of samples that are evaluated, the closer the approximation gets to the value of the full integral. Indeed, one of the best properties of random Fourier features is the uniform convergence of the Monte Carlo approximation to the entire kernel function (rather than pointwise convergence) (Rahimi and Recht, 2007). Therefore, the infinite integral in Eq. (19) can be converted to a finite approximation by taking m independent samples of ω from the power spectral density, and computing the Monte Carlo approximation of the kernel function as:

k(x_i − x_j) ≈ (1/m) Σ_{s=1}^{m} [cos(ω_sᵀx_i) cos(ω_sᵀx_j) + sin(ω_sᵀx_i) sin(ω_sᵀx_j)],   {ω_s}_{s=1}^{m} ~ i.i.d. ρ(ω)    (20)

When the frequencies are sampled from the power spectral density, the RFF approximation of the kernel function is an unbiased estimator of the kernel function (Rahimi and Recht, 2007). Given that the spectral densities represent probability distributions, it is trivial to sample frequencies from them. For example, generating the frequencies for approximating a squared exponential kernel requires independently sampling frequencies from a Gaussian distribution, or from a Cauchy distribution for a Laplacian kernel (Table 1). This is visualised in Fig. 2, which shows different spectral densities (Fig. 2A,C,E,G,I) and the resulting functions produced by sampling from the kernel generated by each spectral density (Fig. 2B,D,F,H,J).

In Fig. 2A, the spectral density is composed of two delta functions such that sampled frequencies, ω, can only take values equal to 1 or 2. The functions generated by sampling from this spectral density show strong periodicity and closely resemble the standard trigonometric functions with corresponding frequencies (Fig. 2B). When the frequencies of the two possible delta functions are increased so that they lie in the set {10, 20}, the functions are again highly cyclical but, due to their higher frequencies, have rougher sample paths and a much smaller period (Fig. 2C,D). By expanding the spectral density to contain five possible frequencies (Fig. 2E,G), the sample paths show considerably more variation due to the inclusion of a larger variety of frequencies (Fig. 2F,H). Finally, Fig. 2I and K show sample functions generated by sampling frequencies from a Gaussian and a Cauchy distribution respectively. The Gaussian spectral density corresponds to the spectral density of the squared exponential kernel (Gaussian kernel) and gives rise to smooth sample functions with a huge amount of variety when compared to the simpler spectral densities (Fig. 2J). The Cauchy distribution corresponds to the spectral density generated by the Fourier transform of the Laplacian kernel and generates functions with a high degree of roughness (Fig. 2L) due to the inclusion of very high frequencies in the long tails of the distribution (Fig. 2K). The code for sampling the spectral densities and generating functions is given in Supplementary Code 2.

Eq. (20) shows how the RFF can be used to approximate the kernel function, and thus the whole kernel matrix through pairwise application. Incredibly, this can all be written in just 4 lines of R code:

Code 1. Example of creating random Fourier features to approximate a squared exponential kernel matrix.
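The original listing for Code 1 is not reproduced here; the following is our reconstruction of the idea, with the length-scale l and the number of features m chosen purely for illustration:

rff_se <- function(X, m = 100, l = 1) {
  # random Fourier features for a squared exponential kernel with length-scale l
  omega <- matrix(rnorm(m * ncol(X), sd = 1 / l), m, ncol(X))  # frequencies from the Gaussian spectral density
  proj  <- X %*% t(omega)                                      # N x m matrix with entries omega_s' x_i
  cbind(cos(proj), sin(proj)) / sqrt(m)                        # Phi_RFF(X), an N x 2m feature matrix
}
Phi <- rff_se(X, m = 100, l = 1)
K_approx <- Phi %*% t(Phi)                   # pairwise application of Eq. (20): an approximation of K
max(abs(K_approx - se_kernel(X, X)))         # compare with the exact kernel from the earlier sketch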
Table 1
Common shift-invariant kernels and their spectral densities.

Kernel | Kernel function, k(x_i, x_j) | Power spectral density, ρ(ω)
Squared exponential ‡ | exp(−‖x_i − x_j‖²₂ / (2ℓ²)) | (ℓ² / (2π))^{D/2} exp(−ℓ²‖ω‖²₂ / 2)
Matérn *, †, ‡ | (2^{1−ν} / Γ(ν)) (√(2ν)‖x_i − x_j‖₂ / ℓ)^ν K_ν(√(2ν)‖x_i − x_j‖₂ / ℓ) | (2^D π^{D/2} Γ(ν + D/2)(2ν)^ν / (Γ(ν) ℓ^{2ν})) (2ν/ℓ² + 4π²‖ω‖²₂)^{−(ν + D/2)}
Laplacian ᵃ | exp(−σ‖x_i − x_j‖₁) | ∏_{i=1}^{D} (σ/π) / (σ² + ω_i²)

* Γ(·) is the gamma function and K_ν(·) is the modified Bessel function of the second kind.
† Parameter ν > 0. If ν = 0.5 the Matérn equates to the exponential kernel. As ν → ∞ the Matérn converges to the squared exponential kernel.
‡ Parameter ℓ > 0.
ᵃ Parameter σ > 0.
Fig. 2. Power spectral densities (A,C,E,G,I,K) and the functions produced by sampling from the resulting kernel (B,D,F,H,J,L). The delta-function spectral densities (arrowheads) allow the sampled frequencies to take only the point values corresponding to each delta. (I) is a Gaussian distribution corresponding to the spectral density of the squared exponential kernel. (K) is a Cauchy distribution corresponding to the spectral density of the Laplacian kernel.
However, one of the key observations about RFF is that it defines a feature space of its own:

(1/m) Σ_{s=1}^{m} [cos(ω_sᵀx_i) cos(ω_sᵀx_j) + sin(ω_sᵀx_i) sin(ω_sᵀx_j)] = Φ_RFF(x_i)ᵀ Φ_RFF(x_j),   {ω_s}_{s=1}^{m} ~ i.i.d. ρ(ω)    (21)

achieve the generalisation error as if we had used all points. However, the full theoretical properties of RFF estimators are still far from fully understood.

9. Variation in the linear model
Fig. 3. (A) All 500 points generated from the latent spatial process given by y_i = x_{i,1}² + x_{i,2}² + ϵ_i, where ϵ_i ~ N(µ = 0, σ² = 1), and (B) the subset of data points used to train the regression models.
10. Toy example of random Fourier features for spatial analysis

As an example of how to use RFF, we simulate a simple non-linear spatial regression problem. A set of random points in space is generated such that each location has unique coordinates (longitude and latitude). Each location has a response variable generated by the function y = x_1² + x_2² + ϵ, where ϵ is random Gaussian noise (ϵ ~ N(µ = 0, σ² = 1)). Fig. 3A shows 500 random points drawn from the spatial process. The simulated data were used to train three models: a linear regression model, a kernel regression model (no regularisation) and a KRR model. Both the kernel regression and KRR models use 100 Fourier features to approximate a squared exponential kernel. For KRR, k-fold cross-validation was used to find the optimal regularisation parameter (λ). We assume, as is common for nearly all real-world spatial processes, that we only observe a subset of all possible locations. Therefore, the models are trained on only 20% of all the generated points, shown in Fig. 3B. The remaining data not used in training are then used for testing. The code for this example is provided in Supplementary Code 3. The code can easily be changed to any function of the user's choice, and the user can specify the number of Fourier features. Note, the provided code generates points at random and therefore a user's results may differ from the exact results shown here.
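Supplementary Code 3 is not reproduced here; the sketch below shows the same workflow under our own illustrative settings (the λ values are hand-picked rather than chosen by k-fold cross-validation), reusing rff_se() from the earlier sketch:

set.seed(4)
N <- 500
coords <- matrix(runif(N * 2, -2, 2), N, 2)            # longitude and latitude
y <- coords[, 1]^2 + coords[, 2]^2 + rnorm(N)          # latent spatial process plus Gaussian noise
train <- sample(N, 0.2 * N)                            # train on 20% of the locations
Phi <- rff_se(coords, m = 100, l = 1)                  # 100 Fourier features approximating an SE kernel
ridge <- function(A, y, lambda) solve(t(A) %*% A + lambda * diag(ncol(A)), t(A) %*% y)
w_krr <- ridge(Phi[train, ], y[train], lambda = 4)     # kernel ridge regression via the primal
w_kr  <- ridge(Phi[train, ], y[train], lambda = 1e-6)  # (nearly) unregularised kernel regression
mse <- function(a, b) mean((a - b)^2)
mse(y[-train], Phi[-train, ] %*% w_krr)                # KRR testing error
mse(y[-train], Phi[-train, ] %*% w_kr)                 # kernel regression testing error (overfits)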
Each model was trained using the same training data and their predictive performance measured by MSE for both the training and testing data. The training and testing performance of the three models are shown in Table 2. As expected, the non-linear nature of the spatial process results in very poor performance of the linear model, with large values of both the training and testing error. The kernel regression model has excellent training performance, with the infinite feature space of the squared exponential kernel able to capture the non-linear relationship between the spatial coordinates and the response. However, in the absence of regularisation, the kernel regression model greatly overfits the training data, resulting in poor testing performance. In comparison, the regularisation applied in kernel ridge regression helps prevent overfitting, with the KRR model (with λ = 3.98) having marginally higher training error than the kernel regression model but less than half the testing error. Therefore, the KRR model trades a small decrease in training performance for a significant increase in generalisability.

The importance of regularisation is further illustrated by comparing how the training and testing performance of kernel regression and KRR varies with the number of Fourier features (Fig. 4). Increasing the number of sampled Fourier features results in a steady reduction in the training error of kernel regression (Fig. 4A, blue line). As the number of features increases, the kernel regression model shows increasing testing error (Fig. 4B, blue line) as the model is significantly overfitted to the training data. In comparison, for KRR the training error remains constant above 10 features (Fig. 4A, red line) and the model maintains a stable testing error that is significantly lower than kernel regression even when additional Fourier features are added (Fig. 4B, red line). The regularisation prevents KRR from overfitting even as more features are added, by increasing the magnitude of the regularisation parameter, λ (Fig. 4B, inset).

Table 2
Training and testing mean squared error of linear, kernel and kernel ridge regression models for a basic non-linear spatial problem.

Model | Training error (MSE) | Testing error (MSE)
Linear | 2.26 | 2.73
Kernel regression* | 0.64 | 2.70
Kernel ridge regression*, † | 0.88 | 1.19

* Using a squared exponential kernel approximated using 100 random Fourier features.
† Optimal ridge parameter of λ = 3.98 estimated by k-fold cross-validation.

11. Advanced methods for random Fourier features

11.1. Limitations of RFF

Given the good empirical performance of the RFF method, little has been published on their limitations, including in the context of spatial analysis. Firstly, RFF can be poor at capturing very fine-scale variation, as noted in Ton et al. (2018). This is likely due to fine-scale features being captured by the tails of the spectral density, which will be infrequently sampled in the Monte Carlo integration. Secondly, from a computational perspective, RFF are very efficient but can still be outperformed by some state-of-the-art spatial statistics approaches. For example, the sparse matrix approaches based on the GMRF solution to an SPDE provide impressive savings with complexity O(m^1.5) (compared to O(m³) for the RFF primal solution) (Lindgren et al., 2011). Other methods, such as the multiresolution kernel approximation (MRA), provide incredible performance but are only valid for two dimensions (Ding et al., 2017). Thirdly, while the convergence properties of RFF suggest excellent predictive capability (Rudi and Rosasco, 2017), alternative data-dependent methods, including versions of the Nyström approximation, can perform much better in some settings (Yang et al., 2012; Rudi et al., 2015). The following sections will discuss the current methods that address some of these limitations. From here onward, we call the RFF method described in the previous section the standard RFF method.
Fig. 4. An example of the bias-variance trade-off for kernel regression (blue) and kernel ridge regression model (red) with a random Fourier feature approximation of
a squared exponential kernel. The mean squared error (MSE) is calculated for both the (A) training data and the testing data (B) for an increasing number of sampled
Fourier features, with optimal λ (estimated by k-fold cross-validation) for the ridge regression model shown (B inset).
11.2. Quasi-Monte Carlo features (QMC RFF)

One of the most significant limitations of standard RFF lies within the Monte Carlo integration. The infinite integral that describes the kernel function is converted to a finite approximation by sampling frequencies from the spectral density (Eq. (20)). The convergence of Monte Carlo integration to the true integral occurs at the rate O(m^−0.5), which means that for some problems a large number of features is required to approximate the integral accurately.

A popular alternative is to use Quasi-Monte Carlo (QMC) integration (Avron et al., 2018). In QMC integration, the set of points used to approximate the integral is chosen using a deterministic, low-discrepancy sequence (Niederreiter, 1978). In this context, low-discrepancy means the points generated appear random even though they are generated from a deterministic, non-random process. An example is the Halton sequence, which generates points on a uniform hypercube before transforming the points through a quantile function (inverse cumulative distribution function) (Halton, 1964). Low-discrepancy sequences prevent clustering and enforce more uniformity in the sampled frequencies, allowing QMC to converge at close to O(m^−1) (Asmussen and Glynn, 2007), and can provide substantial improvements in the accuracy of the approximation of the kernel matrix for the same computational complexity. Crucially, QMC is trivial to implement within the RFF framework for some distributions. For example, for the squared exponential kernel, instead of generating features by taking random samples from a Gaussian, we generate them as follows:

Code 2. Example of Quasi-Monte Carlo sampling of a Gaussian power spectral density of a squared exponential kernel using a Halton sequence.
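The original listing for Code 2 is not reproduced here; the sketch below shows the idea, assuming the randtoolbox package for the Halton sequence (the package choice and parameter values are our assumptions):

library(randtoolbox)                                   # assumed dependency providing halton()
qmc_rff_se <- function(X, m = 100, l = 1) {
  u <- halton(m, dim = ncol(X))                        # deterministic low-discrepancy points in [0, 1]^d
  omega <- matrix(qnorm(as.vector(u), sd = 1 / l), m, ncol(X))  # quantile transform to the Gaussian spectral density
  proj <- X %*% t(omega)
  cbind(cos(proj), sin(proj)) / sqrt(m)                # QMC random Fourier features
}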
11.3. Leverage score sampling

In the standard RFF method, frequencies are sampled with a probability proportional to their spectral density. However, the power spectral density of a kernel does not depend on the data, X (see Table 1). Therefore, the sampling probability of a given frequency is data-independent. Data-independent sampling is sub-optimal and can yield very poor approximations (Bach, 2012; Mahoney and Drineas, 2009; Gittens and Mahoney, 2016), and has been identified as one of the reasons RFFs perform poorly in certain situations (Li et al., 2018). An alternative is a data-dependent approach that considers the importance of various features given some data. Several data-dependent approaches for RFF have been proposed (Ionescu et al., 2017; Rudi and Rosasco, 2017; Li et al., 2018), but one of the most promising and easiest to implement is sampling from the leverage distribution of the RFF (abbreviated to LRFF) (Li et al., 2018).

Leverage scores are popular across statistics and are a key tool for regression diagnostics and outlier detection (Hoaglin and Welsch, 1978; Velleman and Welsch, 1981). A leverage score measures the importance of a given observation on the solution of a regression problem. However, the perspective of leverage scores as a measure of importance can be extended to any matrix. The leverage scores of a matrix A are given by the diagonal of the matrix T = A(AᵀA)⁻¹Aᵀ, with the leverage score for the ith row of matrix A equal to the ith diagonal element of T, denoted τ_{i,i} and calculated as:

τ_{i,i} = a_iᵀ(AᵀA)⁻¹a_i = [A(AᵀA)⁻¹Aᵀ]_{ii}    (24)

τ_{i,i} can also be seen as a measure of the importance of the row a_i. Most leverage score sampling methods apply ridge regularisation to the leverage score equation, controlled by the regularisation parameter λ_LRFF:

τ_{i,i}(λ) = a_iᵀ(AᵀA + λ_LRFF I)⁻¹a_i    (25)
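A minimal sketch of Eqs. (24) and (25): ridge leverage scores of a feature matrix A, one score per row. Here A could be the RFF feature matrix Phi from the earlier sketches; the value of λ_LRFF is illustrative:

ridge_leverage <- function(A, lambda) {
  # tau_ii(lambda) = a_i' (A'A + lambda * I)^{-1} a_i, the diagonal of A (A'A + lambda * I)^{-1} A'
  diag(A %*% solve(t(A) %*% A + lambda * diag(ncol(A)), t(A)))
}
tau <- ridge_leverage(Phi, lambda = 1)                 # large values mark the most influential rows of A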
The resulting scores are termed ridge leverage scores (El Alaoui and Mahoney, 2014). The regularisation serves a nearly identical purpose as when applied in the context of linear regression: ensuring the matrix inversion is always possible, with scores that are less sensitive to perturbations in the underlying matrix. The addition of the ridge regularisation

11.4. Orthogonal random features

concatenated into a frequency matrix, Ω ∈ ℝ^{m×d}. If we consider a squared exponential kernel, Ω is actually just a random Gaussian matrix, as sampled frequencies are simply standard normal random variables scaled by the kernel parameter, σ. Therefore, the matrix Ω = σ⁻¹G, where G is a random Gaussian matrix of dimension
Felix et al. (2016) extended the ORF method further to a method known as structured ORF (SORF). The SORF method avoids the computationally expensive steps of deriving the orthogonal matrix (O(N³) time) and computing the random basis matrix (O(N²) time) by replacing the random orthogonal matrix, O, with a class of specially structured matrices (consisting of products of binary diagonal matrices and Walsh-Hadamard matrices) that have orthogonality with near-Gaussian entries and can use highly efficient algorithms (such as the fast Walsh-Hadamard transform) (Felix et al., 2016). The SORF method maintains a lower approximation error than the standard RFF method and is significantly more computationally efficient than ORF, with computing Φ_SORF(X) taking only O(N log(N)) time. However, technically the resulting approximation of the kernel is no longer unbiased.

11.5. Non-stationary and arbitrary kernel functions

One of the most significant limitations of the standard RFF method is the restriction to shift-invariant kernels, where k(x_i, x_j) = k(x_i − x_j). This restriction means that the kernel value is only dependent on the lag or distance between the points rather than the actual locations. This property imposes stationarity on the spatiotemporal process. While this assumption is not unreasonable, and non-stationarity is often unidentifiable, in some cases the relaxation of stationarity can significantly improve model performance (Paciorek and Schervish, 2006).

Extending the RFF method to non-stationary kernels requires a more general representation of Bochner's theorem, capable of capturing the spectral characteristics of both stationary and non-stationary kernels. This extension (Yaglom, 1987) states that any kernel (stationary or non-stationary) can be expressed as its Fourier transform in the form:

k(x_i, x_j) = ∫_{ℝ^d × ℝ^d} e^{i(ω_1ᵀx_i − ω_2ᵀx_j)} ρ(ω_1) ρ(ω_2) dω_1 dω_2    (28)

This equation is nearly identical to the original derivation of Bochner's theorem given in Eq. (18), but now we have two spectral densities on ℝ^d to integrate over. It is easy to see that if the two spectral densities are the same, the function returns to the definition for a stationary kernel. Applying the same treatment, Monte Carlo integration can be performed to give the feature space of the non-stationary kernel (Ton et al., 2018), now given by:

Φ_RFF(x) = [cos(Ω_1 x) + cos(Ω_2 x); sin(Ω_1 x) + sin(Ω_2 x)],   with the rows of Ω_1 and Ω_2, {ω_{l,1}, …, ω_{l,m}}, sampled i.i.d. from ρ_l(ω)    (29)

Note that this derivation requires drawing independent samples from both of the spectral densities, ρ_l(ω), such that we generate two frequency matrices, Ω_l ∈ ℝ^{m×d}.
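A sketch of the non-stationary feature map of Eq. (29), drawing the two frequency matrices from two different Gaussian spectral densities; the length-scales and the 1/sqrt(4m) normalisation are our own illustrative choices:

nsrff <- function(X, m = 100, l1 = 1, l2 = 0.3) {
  d <- ncol(X)
  O1 <- matrix(rnorm(m * d, sd = 1 / l1), m, d)        # frequencies from the first spectral density
  O2 <- matrix(rnorm(m * d, sd = 1 / l2), m, d)        # frequencies from the second spectral density
  P1 <- X %*% t(O1); P2 <- X %*% t(O2)
  cbind(cos(P1) + cos(P2), sin(P1) + sin(P2)) / sqrt(4 * m)   # Phi_RFF(X) for a non-stationary kernel
}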
In both the stationary and non-stationary case, the choice of the kernel is often arbitrary or made with knowledge of the process being modelled. For example, if the spatial data is expected to be very smooth, then a squared exponential kernel can be used. It is, however, possible to treat ω as unknown kernel parameters and infer their values (Ton et al., 2018). This is equivalent to deriving an empirical spectral distribution. This strategy is data dependent and can achieve impressive results; however, great care must be taken to avoid overfitting (Ton et al., 2018).

12. Conclusion

Regression is a key technique for nearly all scientific disciplines and can be extended from its simplest forms to highly complex and flexible models capable of describing nearly any type of data using feature mapping. Kernels permit working with infinite dimensional feature spaces but are not computationally feasible with large datasets. The RFF method is capable of approximating the full kernel matrix and is valid for kernels in high dimensions, but its key advantage over other methods is that using RFF to solve the primal rather than the dual provides a significant computational benefit. All these advantages are achieved using only a probability distribution and trigonometric functions, and can be encapsulated in 4 lines of code. While some advanced derivatives of existing sparse or low-rank methods may outperform the standard RFF, several advanced RFF methods are emerging that continue to improve on the standard RFF method. To that end, random Fourier features and their extensions represent an exciting new tool for multi-dimensional spatial analysis on large datasets.

Author contributions

PM drafted the manuscript, which was read, revised for critical intellectual content, modified and then approved by all authors. HC provided significant support on the mathematical aspects, and EG provided significant support on the spatial statistics. SB conceived the paper and provided significant support on the structuring, drafting and editing of the paper itself.

Funding

PM would like to acknowledge the Medical Research Council Doctoral Training Programme at Imperial College London. HC's research is funded by a studentship from the Wellcome Trust. SB receives financial support from the Bill and Melinda Gates Foundation (1606H5002/JH6). EG is supported by an MRC Strategic Skills Fellowship in Biostatistics (MR/M015297/1). The funders had no role in the preparation or publication of the manuscript. The views, opinions, assumptions or any other information set out in this article are solely those of the authors. The authors thank the UK National Institute for Health Research Health Protection Research Unit (NIHR HPRU) in Modelling Methodology at Imperial College London, in partnership with Public Health England (PHE), for funding (grant HPRU-2012-10080).

Conflicts of interest

None declared.

Acknowledgements

We gratefully acknowledge Reviewers 1 and 2 for their insightful comments in the manuscript review process.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at https://doi.org/10.1016/j.epidem.2019.100362.
References

Andres, L.A., et al., 2018. Geo-Spatial Modeling of Access to Water and Sanitation in Nigeria. The World Bank.
Asmussen, S., Glynn, P.W., 2007. Stochastic Simulation: Algorithms and Analysis. Springer Science & Business Media.
Avron, H., et al., 2018. Random Fourier Features for Kernel Ridge Regression: Approximation Bounds and Statistical Guarantees. CoRR abs/1804.09893. arXiv:1804.09893.
Bach, F.R., Jordan, M.I., 2005. Predictive low-rank decomposition for kernel methods. In: Proceedings of the 22nd International Conference on Machine Learning – ICML'05. ISBN: 1595931805.
Bach, F.R., 2012. Sharp Analysis of Low-Rank Kernel Matrix Approximations. CoRR abs/1208.2015. arXiv:1208.2015.
Bell, J.B., Tikhonov, A.N., Arsenin, V.Y., 1978. Solutions of ill-posed problems. Math. Comput. ISSN: 00255718.
Bochner, S., Chandrasekharan, K., 1949. Fourier Transforms. Princeton University Press.
Boyd, S., Vandenberghe, L., 2004. Convex Optimization. Cambridge University Press.
Bracewell, R.N., Bracewell, R.N., 1986. The Fourier Transform and Its Applications. McGraw-Hill, New York.
Cameron, A.C., Trivedi, P.K., 2013. Regression Analysis of Count Data. Cambridge University Press.
Carlin, B.P., Louis, T.A., 2008. Bayesian Methods for Data Analysis. CRC Press.
Cohen, M.B., Musco, C., Musco, C., 2017. Input sparsity time low-rank approximation via ridge leverage score sampling. In: Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1758–1777.
Cuadros, D.F., et al., 2017. Mapping the spatial variability of HIV infection in Sub-Saharan Africa: effective information for localized HIV prevention and control. Sci. Rep. 7, 9093.
Davidson, R., MacKinnon, J.G., et al., 2004. Econometric Theory and Methods. Oxford University Press, New York.
Diggle, P., Ribeiro, P., 2007. Model-Based Geostatistics. Springer. ISBN: 978-0387329079.
Diggle, P.J., Tawn, J., Moyeed, R., 1998. Model-based geostatistics. J. R. Stat. Soc. Ser. C (Appl. Stat.) 47, 299–350.
Ding, Y., Kondor, R., Eskreis-Winkler, J., 2017. Multiresolution kernel approximation for Gaussian process regression. In: Advances in Neural Information Processing Systems, pp. 3740–3748.
Domingos, P., 2000. A Unified Bias-Variance Decomposition and its Applications. Science.
Drineas, P., Magdon-Ismail, M., Mahoney, M.W., Woodruff, D.P., 2012. Fast approximation of matrix coherence and statistical leverage. J. Mach. Learn. Res. 13, 3475–3506.
El Alaoui, A., Mahoney, M., 2014. Fast Randomized Kernel Methods With Statistical Guarantees, vol. 1411. arXiv preprint.
Farrar, D.E., Glauber, R.R., 1967. Multicollinearity in regression analysis: the problem revisited. Rev. Econ. Stat. 92–107.
Felix, X.Y., Suresh, A.T., Choromanski, K.M., Holtmann-Rice, D.N., Kumar, S., 2016. Orthogonal random features. In: Advances in Neural Information Processing Systems, pp. 1975–1983.
Geman, S., Bienenstock, E., Doursat, R., 1992. Neural networks and the bias/variance dilemma. Neural Comput. ISSN: 0899-7667.
Gentle, J.E., 2012. Numerical Linear Algebra for Applications in Statistics. Springer Science & Business Media.
Gething, P.W., et al., 2016. Mapping Plasmodium falciparum mortality in Africa between 1990 and 2015. New Engl. J. Med. 375, 2435–2445.
Gittens, A., Mahoney, M.W., 2016. Revisiting the Nyström method for improved large-scale machine learning. J. Mach. Learn. Res. 17, 3977–4041.
Gleason, B.L., et al., 2017. Geospatial analysis of household spread of Ebola virus in a quarantined village, Sierra Leone, 2014. Epidemiol. Infect. 145, 2921–2929.
Graetz, N., et al., 2018. Mapping local variation in educational attainment across Africa. Nature 555, 48.
Halton, J.H., 1964. Algorithm 247: radical-inverse quasi-random point sequence. Commun. ACM 7, 701–702. https://doi.org/10.1145/355588.365104.
Hay, S.I., et al., 2013. Global mapping of infectious disease. Philos. Trans. R. Soc. B: Biol. Sci. 368, 20120250.
Hoaglin, D.C., Welsch, R.E., 1978. The hat matrix in regression and ANOVA. Am. Stat. 32, 17–22.
Hoerl, A.E., Kennard, R.W., 1970. Ridge regression: biased estimation for nonorthogonal problems. Technometrics. ISSN: 15372723.
Ionescu, C., Popa, A., Sminchisescu, C., 2017. Large-scale data-dependent kernel approximation. In: Singh, A., Zhu, J. (Eds.), Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, vol. 54. PMLR, Fort Lauderdale, FL, USA, pp. 19–27.
Josepha, G., Gething, P.W., Bhatt, S., Ayling, S.C., 2019. Understanding the Geographical Distribution of Stunting in Tanzania: A Geospatial Analysis of the 2015–16 Demographic and Health Survey.
Kanagawa, M., Hennig, P., Sejdinovic, D., Sriperumbudur, B.K., 2018. Gaussian Processes and Kernel Methods: A Review on Connections and Equivalences. arXiv preprint arXiv:1807.02582.
Li, Z., Ton, J.-F., Oglic, D., Sejdinovic, D., 2018. A Unified Analysis of Random Fourier Features. arXiv preprint arXiv:1806.09178.
Lindgren, F., Rue, H., Lindström, J., 2011. An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach. J. R. Stat. Soc.: Ser. B (Stat. Methodol.) 73, 423–498. https://doi.org/10.1111/j.1467-9868.2011.00777.x.
Mahoney, M.W., Drineas, P., 2009. CUR matrix decompositions for improved data analysis. Proc. Natl. Acad. Sci. pnas-0803205106.
McCullagh, P., Nelder, J.A., 1989. Generalized Linear Models. Chapman and Hall, London, UK.
Mena, I., et al., 2016. Origins of the 2009 H1N1 influenza pandemic in swine in Mexico. Elife 5, e16777.
Musco, C., Musco, C., 2017. Recursive sampling for the Nyström method. In: Advances in Neural Information Processing Systems, pp. 3833–3845.
Nelder, J.A., Wedderburn, R.W., 1972. Generalized linear models. J. R. Stat. Soc.: Ser. A (Gen.) 135, 370–384.
Niederreiter, H., 1978. Quasi-Monte Carlo methods and pseudo-random numbers. Bull. Am. Math. Soc. 84, 957–1041.
Noma, M., et al., 2002. Rapid epidemiological mapping of onchocerciasis (REMO): its application by the African Programme for Onchocerciasis Control (APOC). Ann. Trop. Med. Parasitol. 96, S29–S39. https://doi.org/10.1179/000349802125000637.
Osgood-Zimmerman, A., et al., 2018. Mapping child growth failure in Africa between 2000 and 2015. Nature 555, 41.
Paciorek, C.J., Schervish, M.J., 2006. Spatial modelling using a new class of nonstationary covariance functions. Environmetrics 17, 483–506.
Rahimi, A., Recht, B., 2007. Random features for large-scale kernel machines. Adv. Neural Inf. Process. Syst.
Rasmussen, C.E., Williams, C.K.I., 2005. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press. ISBN: 026218253X.
Rudi, A., Rosasco, L., 2017. Generalization properties of learning with random features. In: Advances in Neural Information Processing Systems, pp. 3215–3225.
Rudi, A., Camoriano, R., Rosasco, L., 2015. Less is more: Nyström computational regularization. In: Advances in Neural Information Processing Systems, pp. 1657–1665.
Rudi, A., Calandriello, D., Carratino, L., Rosasco, L., 2018. On fast leverage score sampling and optimal learning. In: Advances in Neural Information Processing Systems, pp. 5673–5683.
Rudin, W., 1990. Fourier Analysis on Groups. ISBN: 047152364X.
Rue, H., Held, L., 2005. Gaussian Markov Random Fields: Theory and Applications. CRC Press.
Saad, Y., Schultz, M.H., 1986. GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Stat. Comput. 7, 856–869.
Shawe-Taylor, J., Cristianini, N., et al., 2004. Kernel Methods for Pattern Analysis. Cambridge University Press.
Stein, M.L., 2012. Interpolation of Spatial Data: Some Theory for Kriging. Springer Science & Business Media.
Straeter, T.A., 1971. On the Extension of the Davidon-Broyden Class of Rank One, Quasi-Newton Minimization Methods to an Infinite Dimensional Hilbert Space With Applications to Optimal Control Problems.
Tatem, A.J., et al., 2010. Ranking of elimination feasibility between malaria-endemic countries. Lancet 376, 1579–1591.
Tikhonov, A.N., 1963. Solution of incorrectly formulated problems and the regularization method. Soviet Math. ISSN: 10634584.
Ton, J.-F., Flaxman, S., Sejdinovic, D., Bhatt, S., 2018. Spatial mapping with Gaussian processes and nonstationary Fourier features. Spat. Stat. 28, 59–78. ISSN: 2211-6753. http://www.sciencedirect.com/science/article/pii/S2211675317302890.
Vatcheva, K.P., Lee, M., McCormick, J.B., Rahbar, M.H., 2016. Multicollinearity in regression analyses conducted in epidemiologic studies. Epidemiology (Sunnyvale, Calif.) 6.
Velleman, P.F., Welsch, R.E., 1981. Efficient computing of regression diagnostics. Am. Stat. 35, 234–242.
Whittle, P., 1954. On stationary processes in the plane. Biometrika 434–449.
Whittle, P., 1963. Stochastic processes in several dimensions. Bull. Int. Stat. Inst. 40, 974–994.
Williams, C.K.I., Seeger, M., 2001. Using the Nyström method to speed up kernel machines. In: Leen, T.K., Dietterich, T.G., Tresp, V. (Eds.), Advances in Neural Information Processing Systems, vol. 13. MIT Press, pp. 682–688. http://papers.nips.cc/paper/1866-using-the-nystrommethod-to-speed-up-kernel-machines.pdf.
Yaglom, A.M., 1987. Correlation Theory of Stationary and Related Random Functions. Springer, New York, NY.
Yang, T., Li, Y.-f., Mahdavi, M., Jin, R., Zhou, Z.-H., 2012. Nyström method vs random Fourier features: a theoretical and empirical comparison. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (Eds.), Advances in Neural Information Processing Systems, vol. 25. Curran Associates, Inc., pp. 476–484. http://papers.nips.cc/paper/4588-nystrom-method-vs-random-fourier-features-a-theoretical-andempirical-comparison.pdf.