0% found this document useful (0 votes)

25 views12 pages

Análise Espacial Com Regressão Linear e Kernel

Artigo com análise espacial e kernel

Uploaded by

s4nfkbjwmn

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

25 views12 pages

Análise Espacial Com Regressão Linear e Kernel

Artigo com análise espacial e kernel

Uploaded by

s4nfkbjwmn

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 12

Epidemics 29 (2019) 100362

Contents lists available at ScienceDirect

Epidemics
journal homepage: www.elsevier.com/locate/epidemics

Spatial analysis made easy with linear regression and kernels T

a,⁎ a b a
Philip Milton , Helen Coupland , Emanuele Giorgi , Samir Bhatt
a
MRC Centre for Outbreak Analysis and Modelling, Department of Infectious Disease Epidemiology, Imperial College London, London, UK
b
CHICAS, Lancaster Medical School, Lancaster University, Lancaster, UK

ARTICLE INFO ABSTRACT

Keywords: Kernel methods are a popular technique for extending linear models to handle non-linear spatial problems via a
Regression mapping to an implicit, high-dimensional feature space. While kernel methods are computationally cheaper than
Random Fourier features an explicit feature mapping, they are still subject to cubic cost on the number of points. Given only a few
Kernel methods thousand locations, this computational cost rapidly outstrips the currently available computational power. This
Kernel approximation
paper aims to provide an overview of kernel methods from first-principals (with a focus on ridge regression) and
progress to a review of random Fourier features (RFF), a method that enables the scaling of kernel methods to big
datasets. We show how the RFF method is capable of approximating the full kernel matrix, providing a sig-
nificant computational speed-up for a negligible cost to accuracy and can be incorporated into many existing
spatial methods using only a few lines of code. We give an example of the implementation of RFFs on a simulated
spatial data set to illustrate these properties. Lastly, we summarise the main issues with RFFs and highlight some
of the advanced techniques aimed at alleviating them. At each stage, the associated R code is provided.

1. Introduction computational speed-ups for kernel methods with minimal loss in ac-
curacy, both theoretically and in practice (Rahimi and Recht, 2007;
The mapping of infectious disease has become a cornerstone of Yang et al., 2012; Avron et al., 2018). However, these papers are
global health. Maps can represent the distribution of an infectious mathematically rigorous and it is not always apparent how these
disease through space and often time. From pre-intervention planning methods work, nor how to incorporate them into spatial analysis. This
through to near near-elimination settings and from the global to the paper aims to provide the reader with a fundamentals of the RFF
village-scale, mapping can inform policy and decision making (Tatem method works and illustrate how RFF can be incorporated into existing
et al., 2010; Noma et al., 2002; Cuadros et al., 2017; Gleason et al., spatial methods using only a few lines of code. The first half of the
2017; Mena et al., 2016). Despite their importance, only a small subset paper introduces a large body of theory known as model-based geos-
of infectious diseases have been comprehensively mapped, estimated at tatistics (Diggle et al., 1998), building from linear models to kernel
4% (Hay et al., 2013). As the world becomes increasingly connected, it methods, specifically kernel ridge regression, capable of capturing
is likely that more data will become available to enable mapping of complex non-linear spatial patterns. This paper focuses on the model-
many more diseases. Multiple methods have been developed to allow ling of spatial processes but these methods naturally extend to spatio-
for the flexible and non-linear analysis of spatial data, particularly temporal processes. The second half of the paper focuses on RFF,
kernel methods including Gaussian processes. While these models showing how they can speed up kernel methods by approximating the
proved excellent flexibility to model the complex dynamics of infectious implicit feature mappings associated with shift-invariant kernels. The
diseases through time and space, they can be computationally in- paper concludes with a discussion of some of the advantages and dis-
tensive. Combining computationally intensive algorithms with lots of advantages of RFF and highlights some of the advanced methods to
data can require greater resources than what is available to the average improve upon standard RFF. Code is presented wherever relevant, in-
researcher. The goal of this paper is to introduce a powerful compu- clude a brief toy example in R where we fit a spatial problem using
tationally favourable new approach called random Fourier features nothing more than linear regression and some transforms.
(RFF), that represents a simple method to extend kernel methods to There is considerable overlap between our introduction of kernel
large spatial problems. learning and the more traditional formulations based on Gaussian
Various papers on RFF have shown they can provide significant process regression. For an introduction to the Gaussian process and

⁎
Corresponding author.
E-mail addresses: [email protected] (P. Milton), [email protected] (H. Coupland), [email protected] (E. Giorgi), [email protected] (S. Bhatt).

https://doi.org/10.1016/j.epidem.2019.100362
Received 22 February 2019; Received in revised form 5 August 2019; Accepted 19 August 2019
Available online 21 August 2019
1755-4365/ © 2019 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license
(http://creativecommons.org/licenses/BY/4.0/).
P. Milton, et al. Epidemics 29 (2019) 100362

model-based geostatistics, we refer the reader here (Rasmussen and Minimising the loss function to find weights that best capture the
Williams, 2005; Diggle and Ribeiro, 2007). For a detailed description of data is referred to as training. The overall model performance after
the mathematical correspondence between kernels and Gaussian pro- training can be summarised by its error, often calculated as some
cesses, we refer the reader here (Kanagawa et al., 2018). measure of the difference between the models predicted and the ob-
served response variables. Mean squared error (MSE) is a very common
2. Linear model choice which measures the average squared difference between the
estimated models predicted and the observed response
1 N
In spatial analysis, the aim is to find a model that coverts a set of (MSE = N i (yi yˆi ) ). However, a good model should have not only
inputs to a corresponding output of interest at N locations in space and be able to accurately predict the response variables at locations used in
time. The output is termed the response variable, and represents the training, but also at different locations not used in training the model.
variable that we are trying to predict at each of the N locations, such as To demonstrate the model's capability to do this, a small proportion of
case counts or prevalence for a disease under investigation (Gething the total available data is set aside and excluded from training which
et al., 2016) (or ancillary epidemiological variables including anthro- forms a test dataset. The error between the predicted and observed
pocentric indicators like height and weight (Osgood-Zimmerman et al., response variables at these test locations is calculated to check that the
2018; Josepha et al., 2019) or socioeconomic indicators such as access model is able to make accurate predictions at locations that were not
to water or education (Graetz et al., 2018; Andres et al., 2018)). The used to train the model, this process is termed testing. A model with
inputs are referred to as explanatory variables and consist of multiple small training and testing error should be capable of generalising to
independent variables recorded at the same locations as the response unobserved location where response variables have not been recorded.
variables. The choice of explanatory variables is highly dependent on The OLS estimator is a best linear unbiased estimator (BLUE) when
the specifics of a disease, but common examples are population size, the assumptions of the Gauss-Markov theorem are met (see (Davidson
age, precipitation, urbanicity and spatial or space-time coordinates. et al., 2004) for the standard proof). The Gauss-Markov assumptions
Probably the simplest model to link the explanatory and response include that the error terms, ϵ, have a mean of zero, constant variance
variables is the linear model. A linear model assumes that responses are and are pairwise uncorrelated. However, there are two critical as-
a weighted linear combination of the explanatory variables with some sumptions required for many analyses. The first is that the response
additional uncorrelated (independent) noise, ϵ, which may be written variable is assumed to be a linear function of the explanatory variables
as (McCullagh and Nelder, 1989): specified in the model. Some problems are non-linear such that even the
best linear model is a inappropriate representation. The second is that
y = Xw + (1)
the design matrix, X, is full rank. The design matrix will not be full rank
where y = (y1 , y2 , …, yN ) Ndenotes the vector of response variables if any of the explanatory variables are perfectly multicollinear, which
given at N locations. The explanatory variables are generally given as refers to the situation when one explanatory variable can be expressed
matrix X = [x1, x2, …, xN ] N × d , often referred to as the design matrix.
as an exact linear combination of one or multiple other explanatory
The design matrix consists of N rows for each location, and d columns variables (Farrar and Glauber, 1967). When the data contains perfect
for each explanatory variable. The vector w = (w1, w2, …, wd) d,
multicollinearity, it is no longer possible to invert the matrix XTX,
represents the weights which are used to adjust the influence of each preventing the derivation of the weights in Eq. (3). However, it is more
explanatory variable on the model's prediction of the response. For any common for variables show multicollinearity (rather than perfect
multicollinearity), when there is an approximate but not exact linear
d
location i, the model is given as yi = j = 1 xi, j wj + i .
The task is to find an optimal set of weights such the transform of relationship between two or more explanatory variables. Multi-
the explanatory variables is as close as possible to the response vari- collinearity is common in epidemiological data where seemingly in-
ables. In order to do this, we first need to define what is meant by “as dependent explanatory variables can actually correlate through some
close as possible” using a loss function. A common choice of loss func- latent variable (for example, many variables can be correlated with
tion is the squared loss function that computes the sum of the squared socioeconomic status) (Vatcheva et al., 2016). In cases of strong (but
differences between the predicted responses given by the model and the not perfect) multicollinearity, inversion algorithms may still fail in
observed response variables for a given set of weights written as: finding the inverse of XTX, or generate inaccurate solutions. Except for
the assumption of perfect multicollinearity, violating one of more the
1
Gauss-Markov assumptions does not prevent the fitting of a linear
S (w) = y Xw 2
2 (2)
model, but results in an estimator that is not the BLUE.
The closer the model's predicted responses (Xw) for a given set of
weight are to the observed responses (y), then the smaller the value of
the loss function. Therefore, the set of weights that minimise the loss 3. A linear model of non-linear features
function will represent the optimal set of weights and give the smallest
difference between the model's predicted and the observed response For many spatial problems the response variables cannot be de-
variables. Note, the multiplication by half in Eq. (2) is used solely to scribed as a linear function of the explanatory variables. One approach
simplify the derivative of the loss function and does not affect the so- to introduce non-linearity to the model is to transform the explanatory
lutions. To find the set of weights that minimise the loss function re- variables using non-linear transformations, such that the responses are
quires solving minw d y X w 2 . Elementary calculus tells us that described as a linear combination of non-linear terms. For example,
this minimal value can be found by taking the derivative of the loss rather than a weighted sum of linear terms (i.e. x1 + x2 + x3), we may
function (with respect to the weights), setting it equal to zero and re- instead use terms with exponents, logarithms or trigonometric func-
arranging the equation to solve for the optimal weights: tions (i.e. exp(x1) + log(x2) + sin(x3)). Transforming the inputs rather
S (w) changing the model allows us to continue using the convenient maths
= X T (X w y ) = 0 we derived for linear models and apply it to nonlinear systems. The
w
transformation of the explanatory variables is called a feature mapping
wˆ = (X T X ) 1X T y (3)
and the new space to which the data is mapped is called the feature
The value of the weights that minimises the squared loss function is space. Fig. 1 presents an example of a mapping to a feature space such
denoted as ŵ , and is called the ordinary least squares (OLS) estimator. that the response can be expressed in linear terms (code for this ex-
Given the explanatory variables at location i, the model's prediction of ample is given in Supplementary Code 1). Generally, a mapping is de-
the response, ŷi , is computed as yˆi = xi w
ˆ. noted by Φ and the general form of a feature mapping can be written as:

2
P. Milton, et al. Epidemics 29 (2019) 100362

Fig. 1. An example of non-linear feature mapping, where the space is mapped for (A) an input space in which the problem is non-linear into a new feature space (B)
in which the outputs can be described as a linear combination of the inputs.

: xi d (x i) D will not be able to make reliable predictions when presented with data
:X N ×d (X ) N ×D
(4) from locations where the response variable has not been recorded. This
balance between either reducing the training error or testing error
The explanatory variables at location i, xi, are mapped from a vector of forms the famed result in machine learning termed the bias-variance
length d to a vector of length D denoted as Φ(xi). Applying this mapping trade-off (Geman et al., 1992; Domingos, 2000). Overfitting occurs
to the entire design matrix gives a new design matrix that exists in the when the model contains more explanatory variables than are appro-
feature space, (X ) N ×D
. A mapping can project to into higher-di- priate to describe the data, so that the additional variables are fitted to
mensional (d < D) or lower-dimensional (d > D) space, although in the noise. As an extreme example, if the number of explanatory vari-
the context of spatial analysis, mapping to a higher-dimensional space ables is equal to or greater than the number of locations, then a linear
is more common. The same set of equations can be used to solve a model can pass through every point exactly. Any high dimensional
model that is linear in feature space after the mapping simply by re- mapping greatly increases the risk of excess explanatory variables and
placing design matrix, X, with the new design matrix in the feature thus overfitting. An approach that is frequently adopted to prevent
space, Φ(X): overfitting is to apply regularisation.
wˆ = ( (X )T (X )) 1 (X )T y Broadly, regularisation acts to penalise more complex models.
Different forms of regularisation define the complexity of a model dif-
yˆi = ˆ
(x i) w (5)
ferently but a common choice of regularisation is called Tikhonov
Depending on the choice of mapping, the model is now capable of regularisation or ridge regression in the statistical literature (Tikhonov,
capturing non-linear relationships between the explanatory and re- 1963; Bell et al., 1978; Hoerl and Kennard, 1970). The idea behind
sponse variables. The solution after the feature mapping now requires ridge regression is to control complexity by penalising models with
solving for the D weights, corresponding to a weight for each column of many large weights. This prevents the model from using the redundant
the design matrix in feature space, Φ(X). Specifically, the solution is of explanatory variables to capture the noise and forces the model to focus
computational complexity in the best case, where the big O notation, , on the relationship between explanatory and response. When a model
is used to denote how the relative running time or space requirements has large weights only a small change in the explanatory variables is
grow as the input size grows. This complexity of is extremely useful required to induce a large change in the response. Therefore, a model
because even when the number of locations, N, is large the dimensions with small weights cannot change as quickly as a model with large
of the explanatory variables dominate the solution. weights. For a linear model this effectively controls the steepness of the
regression line, but for non-linear regression (or linear regression in
4. Overfitting and ridge regression non-linear feature space) this corresponds to the wiggliness of the re-
gression curve. Ridge regularisation consists of two key components; a
Any high-dimensional mapping that significantly expands the Euclidean norm term, w 2 , and and a regularisation parameter λ. The
number of features (D ≫ d) greatly increases the risk of overfitting. A Euclidean norm terms computes the positive square root of the sum of
model is overfitted if it too closely or exactly predicts the training data, squares of all the weights in the model, w 2 : = w12 + w22+ + wd2 . A
but fails to accurately predict the test data. Overfitting is characterised model with many large weights (whether positive or negative) will have
by a very small training error but a high testing error. All data, but a large norm. The λ > 0 parameter scales the norm and controls the
especially epidemiological data, contains random irreducible noise. amount of regularisation. Ridge regularisation is achieved by adding a
When a model is overfitted, it captures the random noise in the training penalisation term to the loss function that depends on the norm of the
data as well as the relationships between the explanatory and response weights. The loss function for ridge regression is given by:
variables. However, this pattern of noise will vary between training and
1
test data. Therefore, while capturing the noise in the training data re- S (w) = (y X w) 2 + w 2
2 2 (6)
duces the training error, it increases testing error by reducing the
capability of the model to generalise to the test data which will in- When λ is big, the w term significantly increases the value of the
2

herently have different random noise. An inability to accurately predict loss function if the norm of the weights is large so that the act of
response variables for test data demonstrates that an overfitted model minimising the loss function favours models with small weights. As λ

3
P. Milton, et al. Epidemics 29 (2019) 100362

approaches zero, large norms are less heavily penalised, allowing for independent of the primal by expressing the problem as a constrained
models with larger weights representing optimal solutions to Eq. (6). minimisation problem and solving using Lagrangian multipliers (Sup-
When λ = 0 the loss function is equal to the OLS solution. Regularisa- plementary Equations 1). The dual solution involves computing and
tion results in model weights that are biased towards zero, and thus (by inverting the matrix XXT, with computational complexity (N 3) in the
design) the ridge estimate is a biased estimator. worst case. With the dual solution, no matter how high dimensional the
As with the linear model, training requires finding the weights that feature mapping, the dual solution will only require solving for the
minimise the ridge regression loss function. This is calculated in the same number of dual variables and be of the same computational
same way, by taking the derivative of the ridge loss function, setting it complexity. Therefore, for any high-dimensional mapping where
equal to zero, and solving for the optimal ridge weights ŵridge : D > N, the dual solution is computationally easier to solve than the
primal.
S (w)
= X T (X w y ) + w = 0 Less obviously, Eq. (9) computes XXT with the resulting matrix a
w
symmetric positive semidefinite matrix called the Gram matrix,
wˆ ridge = (X T X + In) 1X T y (7) G N ×N
. The gram matrix contains the inner products of explanatory
where In is the identity matrix (a square matrix in which all the ele- variables between every all of N locations. An inner product is a way to
ments of the principal diagonal are ones and all other elements are multiply vectors together with the result of this multiplication being a
zeros) with N × N dimensions. The optimal regularisation parameter λ scalar measure of the vectors similarity. This notion of similarity is
is often unknown but can be estimated during training through methods central to spatial or temporal analysis where we want to leverage the
including cross-validation or restricted maximum likelihood. The so- fact that points close to each other in space or time should be more
lution in Eq. (7) is known as the primal solution of the ridge regression similar than those far apart. Explicitly, the inner product of two vectors
and has computational complexity (d 2N ) . As with linear regression, xi and x j d is given by:

the design matrix X can be replaced with (X ) N ×D

to perform ridge xi, xj = (x i,1, xi,2, …, x i, d ), (xj,1, xj,2, …, xj, d)
regression in feature space with complexity . Note, the addition of λIn = x i,1 xj,1 + x i,2 xj,2 + + x i, d xj, d (11)
ensures that when λ > 0 the matrix (XTX + λIn) is always invertible
even when X is not full rank. Although useful for ill-posed problems and where ·,· is used to signify the inner product. Therefore, the Gram
multicollinearity, we consider ridge regression primarily for its ability matrix can be thought of as containing elements which represent the
to prevent overfitting. similarity between all pairs of inputs. Taking Eqs. (9) and (10) and
substituting XXT for the full gram matrix, G, and xiXT for the ith row of
5. The dual and the gram matrix the gram matrix, Gi (corresponding to the pairwise inner product be-
tween location i and all N other locations), lets the dual solution for the
The primal solution to ridge regression required finding the set of d weights and model prediction be rewritten as:
optimal weights, corresponding to a weight for each the column of the = (G + In) 1y
design matrix. However, a high dimensional feature mapping (D ≫ d) yˆi = Gi (12)
significantly increases the number of weights that must be found, with
the computational complexity of solving the primal solution exhibiting Therefore, the solution to the dual only requires inner products, or in
quadratic growth with the number of features (). An alternative method other words, the dual solution only requires similarity between points
is called the dual solution, that seeks to find optimal variables of length and not their explicit values. We can exploit this property to allow us to
N, corresponding to a variable for each row of the design matrix. Given work with the linear regression model in very-high or even infinite-
that the number of rows of the design matrix stays constant when the dimensional feature space using kernels.
data is mapped, this dual solution has a computational complexity that
is independent of the dimensions of the feature mapping. Consider the 6. Kernel functions, kernel matrix and the kernel trick
following matrix identity:
It is rarely apparent a priori what the most appropriate mapping is to
(X T X + In) X T = X T XXT + X T
apply to the data. Therefore, while the dual solution is of constant
= X T (XXT + In) (8)
complexity even in very high dimensional feature spaces, the question
On the left hand side of the equation, the bracketed terms are multi- remains, which mapping should we use? How many terms should be
plied by XT which may be expanded and then, using matrix algebra, it added? How do we capture interactions between terms? A brute force
may be written as an alternative factorisation. By applying this identity, approach quickly becomes combinatorially large. Given that the solu-
(X T X + In) 1X T in Eq. (7) can be replaced with X T (XXT + In) 1 to tion to the dual is independent of the number of features, and reg-
give a new equation for the optimal weights: ularisation allows us to limit model complexity to prevent overfitting,
the ideal situation would be to define a high-dimensional, if not infinite
ˆ ridge = X T (XXT + In) 1y
w (9) mapping capable of modelling nearly any function and then apply
This new equation can be further simplified by letting regularisation. An infinite feature mapping can be expressed as:
= (XXT + In) 1y . The variables N
are the dual variables and, in : xi d (x i)
contrast to the primal solution, correspond to a variable for each row of
:X N ×d (X ) N×
(13)
the design matrix. Substituting α allows us to express the equations for
the weights as w ˆ ridge = X T = iN= 1 i x i which demonstrates that the However, we cannot compute the infinite terms required for explicit
vector of weights can be written as a weighted linear combination of infinite-dimensional feature mapping. To work with these very high or
the N training points. The predicted response given by the model at infinite dimensional spaces, we need to move away from explicit to
location i becomes: implicit feature maps which is achieved through kernel functions.
Broadly, kernel functions, denoted as k(· , ·), take in the explanatory
yˆi = x i X T (10)
variables at two locations and directly computes their inner product in a
Given that the dual has been derived directly from the primal it is easy corresponding feature space:
to see that both solutions are equivalent, with the two solutions said to k (x i , x j ) = (xi), (xj) (14)
exhibit strong duality (Boyd and Vandenberghe, 2004). However, the
dual form can be derived directly from ridge regression loss function More formally, a kernel function is defined as a two-argument,

4
P. Milton, et al. Epidemics 29 (2019) 100362

symmetric positive semi-definite function that corresponds to com- 7. The big N problem
puting the inner product a corresponding reproducing kernel Hilbert
space (RKHS) (Shawe-Taylor et al., 2004). This RKHS and is specific to One of the primary motivations of combining the dual and kernels is
the chosen kernel function and simply represents a feature space that that this process only requires computing and inverting the kernel
has a valid inner product (the notion of similarity is conserved). For a matrix K N ×N
, and solving for the N dual variables, even when the
detailed explanation of the mathematics of kernels we recommend dimensions of the (implicit) feature space is infinitely large. Historically
(Shawe-Taylor et al., 2004). If the kernel function is applied to all pairs this led to the widespread adoption of kernel methods to solve difficult
of locations in the training data, the resulting matrix generated is problems on small/medium datasets. However, the dual still requires
termed the kernel matrix, K N ×N
, and can be thought of as the gram (N 2) storage for N observations and (N 3) complexity. Given only a
matrix in RKHS: few thousand points, these costs can rapidly outstrip the storage and
computational power available to most researchers.
K= (X ), (X ) = (X ) (X )T (15) A plethora of methods have been developed to allow the scaling of
kernels methods to large datasets. These methods aim to find smaller or
Importantly, the dual solution only requires inner products to solve.
simpler matrices that provide good approximations of the full kernel
Therefore, the gram matrix in Eq. (12) can be replaced by the kernel
matrix. The three main techniques used are low-rank approximations,
matrix:
sparse approximations and spectral methods. Low-rank approximations
yˆ = K (K + In) 1y (16) of a matrix aim to find smaller representations of the kernel matrix that
contains all (or nearly all) of the information in the full kernel (Bach
This is equivalent to solving the dual in RKHS associated with the and Jordan, 2005). For example, the popular Nyström approximates the
kernel. However, the data is never actually mapped into this implicit full kernel matrix through a subset of its columns and rows (Williams
feature space. Instead the kernel function direct computes the inner and Seeger, 2001). In contrast, sparse methods aim to find re-
product in the RKHS and uses those inner products to solve the dual. presentations of the matrix that are mostly zeros because efficient al-
Given that the data is never actually explicitly mapped to RKHS, the gorithms exist for the storage of and computation with such matrices
feature space can be of very high or even infinite dimensions, as long as (Rue and Held, 2005; Straeter, 1971; Saad and Schultz, 1986). One of
we have the associated kernel function, and thus the inner product in the best examples is the sparse matrix generated when modelling spa-
this feature space, the the dual can be solved. tial data as a Gaussian Markov random field (GMRF) that are solutions
The replacement of the explicit mapping with the implicit mapping to Stochastic Partial Differential Equation (SPDE) (Lindgren et al.,
associated with a kernel function is called the kernel trick and can be 2011; Whittle, 1954, 1963). However, the remainder of this paper will
applied to many algorithms with dual solutions. Eq. (16) combines focus on an exciting, new subset of spectral methods called random
ridge regression with the kernel trick to create kernel ridge regression Fourier features (RFF).
(KRR), that effectively solves the model in the RKHS with the ridge RFF combines the flexibility of kernels with the computational
penalty preventing overfitting and ensure small loss for both training benefits of the primal solution. RFF use the Fourier transform of a
and testing. The dual solution still only requires computing the N dual kernel function to explicitly map the data to a relatively low-dimen-
variables with complexity (N 3) . Thus, given an appropriate kernel, sional space that approximates the implicit feature space associated
the KRR dual can be solved in infinite-dimensional feature space with with the kernel. The data in this feature space can either be used to
no added computational complexity. Therefore, the combination of the construct an unbiased estimator of the full kernel or be used to solve the
dual solution and kernels is a powerful tool capable of extending linear primal at a significantly reduced computational cost. This is distinct
models to very high dimensional feature spaces with the ability to from many low-rank or sparse approximations that rely on the ap-
handle nearly any non-linear problem. proximations of kernel matrix and the thus the dual. The RFF method
Kernel functions can be derived by direct construction; finding the approximates the entire kernel function at once, does not rely on in-
function that corresponds to the taking the inner product in the feature ducing points or “knots”, does not require throwing away any data, can
space (see Supplementary Equations 2 for an example of constructing easily be incorporated to many existing linear models and has steps no
the polynomial kernel from an explicit feature mapping). Generally, more complicated than sampling from some probability distribution
kernels are described by showing that the proposed function satisfies and applying trigonometric functions.
Mercer's conditions; proving the function is symmetric positive-definite
function and guaranteeing there exists a RKHS (Shawe-Taylor et al., 8. Random Fourier features
2004). A huge number of kernels are already described, with specific
kernels often used to solve specific tasks. Common kernels for spatial RFF and other spectral method rely on the characterisation of the
analysis include the squared exponential and Matérn kernels. The kernel function through its Fourier transform. Any function can be
squared exponential kernel is popular because it generates smooth decomposed into the periodic, trigonometric functions (sin and cos) of
functions that are appropriate for spatial interpolation, while the Ma- different frequencies that make it up. As an analogy, consider how a
térn kernel allows a better balance between smoothness and roughness complex musical chord is composed of a number of individual notes,
of the resulting functions that may better represent true spatial pro- where each note is just a string vibrating at a particular frequency. The
cesses (Stein, 2012). The exponential kernel is popular choices for Fourier transform tells us the how much trigonometric functions of each
modelling temporal data, where the similarity between two points is frequency must be added to construct a given target function. The
assumed to steadily exponentially decay as the time between them in- central idea behind spectral methods is that a good approximation of
crease. For example, the squared exponential kernel is given by: the frequencies that make up the kernel function will naturally yield a
good approximation to the kernel function itself. For the mathematics
(x i x j) 2 of Fourier transforms, we refer the reader here (Bracewell and
kS E (xi , xj) = exp
2 2
(17) Bracewell, 1986). All spectral methods are based on the same mathe-
matical foundation; the celebrated Bochner's theorem (Bochner and
where 0 is the length-scale. The length-scale controls the Chandrasekharan, 1949). Loosely, Bochner's theorem states that a shift-
smoothness of the resulting functions. The kernel parameters are often invariant kernel functions (where the output of the kernel is only de-
learned from the data alongside the model variables. The squared ex- pendent on the difference between the inputs and not the explicit values
ponential kernel corresponds to the inner product in an infinite di- of the input themselves), k(xi, xj) = k(xi − xj), on d can be expressed
mensional feature space shown in Supplementary Equations 3. through a Fourier transform (Rudin, 1990):

5
P. Milton, et al. Epidemics 29 (2019) 100362

k (x i x j) = ei
T
(xi xj) ( ) d frequencies from them. For example, generating the frequencies for
d (18)
approximating a squared exponential requires independently sampling
This theorem tells us that the Fourier transform of a shift-invariant frequencies from a Gaussian distribution, or from Cauchy distribution
kernel takes the form of probability distribution, ( ) . This distribution for a Laplacian kernel (Table 1). This is visualised in Fig. 2 that shows
is called the spectral density of the kernel and is the distribution of the different spectral densities (Fig. 2A,C,E,G,I) and the resulting functions
amount of a given frequency, ω, that must be added to construct the produced by sampling from the kernel generated by each spectral
kernel function. The larger the spectral density for a given ω, the density (Fig. 2B,D,F,H,J).
greater the amount of that frequency must be added to reproduce the In Fig. 2A, the spectral density is composed of two delta functions
kernel function. Applying Euler's identity (eiπ = cos(π) + i sin(π)) to the such that sampled frequencies, ω, can only take values equal to 1 or 2.
exponential and ignoring the imaginary component, let us consider The functions generated by sampling from this spectral density show
Bochner's theorem in terms of the trigonometric functions: strong periodicity and closely resemble the standard trigonometric
T
functions with corresponding frequencies (Fig. 2B). When the fre-
cos( Tx
j)
cos( Tx
i) quencies of the two possible delta functions are increased so that they
k (x i x j) = D ( ) d
lie in the set {10, 20}, the functions are again highly cyclical but, due to
sin( Tx Tx )
i) sin( j (19)
their higher frequencies, have rougher sample paths and a much
However, a major problem is that evaluating the integral in Eq. (19) smaller period (Fig. 2C,D). By expanding the spectral density to contain
requires integrating over the infinite set of all possible frequencies. To five possible frequencies (Fig. 2E,G), the sample paths show con-
avoid this, we can approximate this infinite integral by a finite one siderably more variation due to the inclusion of a larger variety of
using Monte Carlo integration. In Monte Carlo integration the full in- frequencies (Fig. 2F,H). Finally, Fig. 2I and K show samples functions
tegral of a function is approximated by computing the value of the generated by sampling frequencies form a Gaussian and Cauchy dis-
function evaluated at a random set of points and averaging. In RFF, the tribution respectively. The Gaussian spectral density corresponds to a
integral is approximated by averaging the sum of the function evaluated spectral density of the squared exponential kernel (Gaussian kernel)
at random samples of ω drawn from drawn from the probability dis- and gives rise to smooth sample functions with a huge amount of
tribution ( ) . The greater the number of samples that are evaluated, variety when compared to the simpler spectral densities (Fig. 2J). The
the closer the approximation gets to the value of the full integral. In- Cauchy distribution corresponds to a spectral density generated by the
deed, one of the best properties of random Fourier features is that of Fourier transform of the Laplacian kernel and generates functions with
uniform convergence of the Monte Carlo approximation of the entire a high degree of roughness (Fig. 2L) due to the inclusion of very high
kernel function (rather than pointwise) (Rahimi and Recht, 2007). frequencies in the long tails of the distribution (Fig. 2K). The code for
Therefore, the infinite integral in Eq. (19) can be converted to a finite sampling the spectral densities and generating functions is given in
approximation by taking m independent samples of ω from the power Supplementary Code 2.
spectral density, and computing the Monte Carlo approximation of the Eq. (20) shows how the RFF can be used to approximate the kernel
kernel function as: function and thus the whole kernel matrix through pairwise applica-
m T T
cos( T tion. Incredibly, this can all be written in just 4 lines of R-code:
1 cos( s x i) s x j) m i .i.d .
k (x i x j) T T
, { s }s = 1 ( ) Code 1 Example of creating random Fourier features to approximate
m sin( s x i) sin( s x j)
s=1 a squared exponential kernel matrix.
(20)

When the frequencies are sampled from the power spectral density, the
RFF approximation of the kernel function is an unbiased estimator of
the kernel function (Rahimi and Recht, 2007). Given that the spectral
densities represent probability distributions, it is trivial to sample

Table 1
Common shift-invariant kernels and their spectral densities.
Kernel Kernel function, k(xi, xj) Power spectral density, ( )

‡
Squared Exponential xi xj 22 D
2) 2 exp( 2 2
exp (2 2 2)
2 2

Matén*, †, ‡
21 2 xi xj 2 2 xi xj 2 D D
2D+ ( + D)
( )
k 2 ( + )
2 2 2 2 2
( ) +4 2
( ) 2 2

Laplaciana exp( xi xj 1 ) D
()
2 2 D
i=1 2+ 2
i

* Γ(·) is the gamma function and kλ(·) is the modified Bessel function of the second kind.
†
Parameter υ > 0. If υ = 0.5 the Matén equates to the exponential kernel. As υ→ ∞ the Matén converges to the squared exponential kernel.
‡
Parameter ℓ > 0.
a
Parameter σ > 0.

6
P. Milton, et al. Epidemics 29 (2019) 100362

Fig. 2. Power spectral densities (A,C,E,G,I,K) and the functions produced by sampling from the resulting kernel (B,D,F,H,J,L). The spectral densities correspond to
sampling from delta functions (arrowheads) such that the sampled frequencies can only take the point values corresponding to each delta. (I) is Gaussian distribution
corresponding to a spectral density of the squared exponential kernel. (K) is Cauchy distribution corresponding to a spectral density of the Laplacian kernel.

However, one of the key observations about RFF is that it defines a achieve the generalisation error as if we had used all points. However,
feature space of its own: the full theoretical properties of RFF estimators are still far from fully
m T T m
understood.
T cos(
1 cos( s x i) s x j) 1 T m
T T
= RFF (x i) RFF (x j), { s }s = 1
m s=1 sin( s x i) sin( s x j) m s=1
i . i . d. 9. Variation in the linear model
( ) (21)

where Throughout this paper, we have focused on squared loss/Gaussian

likelihoods, but Fourier features can be used with any loss function. For
T
cos( s x) m i.i .d . example, when performing classification with binary data, (i.e. y ∈ {0,
RFF (x) = , { s }s = 1 ( )
sin( T
s x) (22) 1}) the cross-entropy loss (also known as log loss) can be used which is
written as:
The data is mapped to a Fourier feature space, ΦRFF, that approximates
the previously implicit feature space of the kernel. Technically, ΦRFF is S (w) = [y log( 1 (X w)) + (1 y )log(1 1 (X w))]
(23)
a function space that is dense in a RKHS - the same space of functions
from our kernel matrix. Applying this mapping to the entire design where φ is the Sigmoid or Logit function. The design matrix X can be
matrix is given by RFF(X ) = [cos(X T )sin(X T )] N × 2m where
substituted for the Fourier feature matrix, ΦRFF(X) to derive a compu-
m× d
is the frequency matrix with each rows corresponding to a tationally efficient extension of kernel regression. Other examples in-
sampled ω (ω1, …, ωm). Matrix ΦRFF(X) is termed the Fourier feature clude the use of a Poisson likelihood to model count data (Cameron and
matrix, with each column representing a Fourier feature. See Trivedi, 2013) or a generalised linear model (GLM) for many other non-
Supplementary Equations 4 for a more comprehensive walk-through of Gaussian response variables (Nelder and Wedderburn, 1972). GLMs still
derivation of the Fourier feature matrix from Bochner's theorem. use a linear combination of explanatory variables (with a link function
The Fourier feature matrix contains all the data explicitly mapped controlling how the expected value of the response relates to the linear
into a finite dimensional space that approximates the implicit (and combination of explanatory variables) and thus can still use feature
potentially infinite) feature mapping of the kernel function. The ex- mapping including RFF to map the model into feature space. The linear
plicitly mapped data in the Fourier feature matrix can be used to solve model can easily be extended to include uncertainty using Bayesian
the primal. The primal solution now requires computing and inverting inference through the inclusion of appropriate likelihood function and
RFF (X )
T
RFF (X )
2m × 2m
with resulting linear model takes the form priors on the weights (Carlin and Louis, 2008). Many other algorithms
y RFF (X ) ˆ
w . The computational complexity of this primal solution is use kernels, notably kernel methods, a group of non-parametric, prob-
(Nm2) (where m is the number of i . i . d . samples from ( ) ), pro- abilistic machine learning algorithms used for pattern analysis. These
viding a significant computational speed-up when m < < N, and is include support vector machines for classification or Gaussian processes
particularly relevant for large datasets. The original paper by Rahimi for regression and can effectively be constructed by applying kernels to
and Recht (Rahimi and Recht, 2007) showed that every entry in our Bayesian linear models. These methods work in implicit feature space
log(N )
kernel matrix is approximated to an error of ± ξ with m = 2 Fourier associated with the kernel and are capable of modelling huge variety of
features. A more recent result shows that only N log(N ) features can non-linear problems, but also suffer from the “Big N” problem re-
achieve the same learning bounds as full kernel ridge regression with stricting their use to small/medium datasets. Many kernel methods
squared loss (Rudi and Rosasco, 2017). For example, given N = 100, have primal solutions that can be solved by sunstituting in the Fourier
000 data points, we would only need m ≈ 3600 Fourier features to feature matrix for a very large computational speed-up.

7
P. Milton, et al. Epidemics 29 (2019) 100362

Fig. 3. (A) All 500 points generated from the latent spatial process given by yi = x i2,1 + x i2,2 + i where i (µ = 0, 2 = 1) ), and (B) the subset of data points used to
train the regression models.

10. Toy example of random Fourier features for spatial analysis training data resulting in poor testing performance. In comparison, the
regularisation applied in kernel ridge regression helps prevent overfitting
As an example of how to use RFF, we simulate a simple non-linear with the KRR model (with λ = 3.98) having marginally higher training
spatial regression problem. A set of random points in space is generated error than the kernel regression model but less than half the testing error.
such that each location has unique coordinates (longitude and latitude). Therefore, the KRR model trades a small decrease in training perfor-
Each location has a response variable generated the function mance for a significant increase in generalisability.
y = x12 + x 22 + where ϵ is random Gaussian noise The importance regularisation is further illustrated by comparing how
( (µ = 0, 2 = 1) ). Fig. 3A shows 500 random points drawn from the training and testing performance of kernel regression and KRR varies
the spatial process. The simulated data were used to train three models, with the number of Fourier features (Fig. 4). Increasing the number of
a linear regression model, a kernel regression model (no regularisation) sampled Fourier features increases results in a steady reduction in training
and a KRR model. Both the kernel regression and KRR models use 100 error of kernel regression (Fig. 4A, blue line). As the number of features
Fourier features to approximate a squared exponential kernel. For KRR, increases the kernel regression model shows increasing testing error
k-fold cross-validation was used to find the optimal regularisation (Fig. 4B, blue line) as the model is significantly overfitted to the training
parameter (λ). We assume, as is common for nearly all real-world data. In comparison, KRR the training error remains constant above 10
spatial processes, that we only observe a subset of all possible locations. features (Fig. 4A, red line) and maintains a stable testing error that is
Therefore, the models are trained on only 20% of all the generated significantly lower than kernel regression even when additional Fourier
points, shown in Fig. 3B. The remaining data not used to training is then features are added (Fig. 4B, red line). The KRR prevents overfitting even as
used for testing. The code for this example is provided in Supplemen- more features are added the regularisation by increasing the magnitude of
tary Code 3. The code can easily be changed to any function of the the regularisation parameter, λ (Fig. 4B, inset).
user's choice and the user can specify the number of Fourier features.
Note, the provided code generates points at random and therefore a
user's results may differ from the exact results shown here. 11. Advanced methods for random Fourier features
Each model was trained using the same training data and their pre-
dictive performance measured by MSE for both the training and testing 11.1. Limitations of RFF
data. The training and testing performance of the three models are shown
in Table 2. As expected, the non-linear nature of the spatial process re- Given the good empirical performance of the RFF method, little has
sults in very poor performance of the linear model with large values of been published on their limitations, including in the context of spatial
both the training and testing error. The Kernel regression model has analysis. Firstly, RFF can be poor at capturing very fine-scale variation
excellent training performance, with the infinite feature space of the as noted in Ton et al. (2018). This is likely due to fine-scale features
squared exponential kernel able to capture the non-linear relationship being captured by the tails of the spectral density that will be infre-
between the spatial coordinates and the response. However, in the ab- quently sampled in the Monte Carlo integration. Secondly, from a
sence of regularisation, the kernel regression model greatly overfits the computational perspective, RFF are very efficient but can still be out-
preformed by some state-of-the-art spatial statistics approaches. For
Table 2 example, the sparse matrix approaches based on SPDE as solution
Training and testing mean squared error of linear, kernel and Kernel ridge re- GMRF provide impressive savings with complexity (m1.5) (compared
gression models for a basic non-linear spatial problem. to (m3) from the RFF primal solution) (Lindgren et al., 2011). Other
Model Training error (MSE) Testing error (MSE) methods such as the multiresolution Kernel approximation (MRA)
provide incredible performance but are only valid for two dimensions
Linear 2.26 2.73 (Ding et al., 2017). Thirdly, while the convergence properties of RFF
Kernel Regression* 0.64 2.70
suggest excellent predictive capability (Rudi and Rosasco, 2017), al-
Kernel Ridge Regression*, †
0.88 1.19
ternative data-dependent methods including versions of the Nyström
* Using a squared exponential kernel approximated using the 100 random approximation can perform much better in some settings (Yang et al.,
Fourier features. 2012; Rudi et al., 2015). The following sections will discuss the current
†
Optimal ridge parameter of λ = 3.98 estimated by k-fold validation. methods that address some of these limitations. From here onward, we

8
P. Milton, et al. Epidemics 29 (2019) 100362

Fig. 4. An example of the bias-variance trade-off for kernel regression (blue) and kernel ridge regression model (red) with a random Fourier feature approximation of
a squared exponential kernel. The mean squared error (MSE) is calculated for both the (A) training data and the testing data (B) for an increasing number of sampled
Fourier features, with optimal λ (estimated by k-fold cross-validation) for the ridge regression model shown (B inset).

call the RFF method described in the previous section as the standard 11.3. Leverage score sampling
RFF method.
In the standard RFF method, frequencies are sampled with a prob-
ability proportional to their spectral density. However, the power
11.2. Quasi-Monte-Carlo Features (QMC RFF) spectral density of a kernel does not depend on the data, X (see
Table 1). Therefore, the sampling probability of a given frequency is
One of the most significant limitations of standard RFF lies within data-independent. Data-independent sampling is sub-optimal and can
the Monte Carlo integration. The infinite integral that describes the yield very poor approximations (Bach, 2012; Mahoney and Drineas,
kernel function is converted to a finite approximation by sampling 2009; Gittens and Mahoney, 2016), and has been identified as one of
frequencies from the spectral density (Eq. (20)). The convergence of the reasons RFFs perform poorly in certain situations (Li et al., 2018).
Monte Carlo integration to the true integral occurs at the rate (m0.5) , An alternative is a data-dependent approach, that considers the im-
which means for some problems a large number of features are required portance of various features given some data. Several data-dependent
to approximate the integral accurately. approaches for RFF have been proposed (Ionescu et al., 2017; Rudi and
A popular alternative is to use Quasi-Monte Carlo (QMC) integration Rosasco, 2017; Li et al., 2018), but one of the most promising and ea-
(Avron et al., 2018). In QMC integration, the set of points chosen to siest to implement is sampling from the leverage distribution of the RFF
approximate the integral is chosen using a deterministic, low-dis- (abbreviated to LRFF) (Li et al., 2018).
crepancy sequence (Niederreiter, 1978). In this context, low-dis- Leverage scores are popular across statistics and are a key tool for
crepancy means the points generated appear random even though they regression diagnostics and outlier detection (Hoaglin and Welsch, 1978;
are generated from a deterministic, non-random process. For example, Velleman and Welsch, 1981). A leverage score measures the importance
the Halton sequence, that generates points on a uniform hypercube of a given observation on the solution of a regression problem. How-
before transforming the points through a quantile function (inverse ever, the perspective of leverage scores as a measure of importance can
cumulative distribution function) (Halton, 1964). Low-discrepancy se- be extended to any matrix. The leverage scores of a matrix A is given by
quences prevent clustering and enforce more uniformity in the sampled the diagonal matrix T = A (AAT ) 1AT with leverage score for ith row of
frequencies, allowing QMC to converge at close to (m 1) (Asmussen matrix A is equal to the ith diagonal element in matrix T, denoted as τi,i
and Glynn, 2007) and can provide substantial improvements in the and calculated by:
accuracy of the approximation of the kernel matrix for the same com-
putational complexity. Crucially, QMC is trivial to implement within i, i = aTi (AAT ) 1ai = [A (AAT ) 1AT ]ii (24)
the RFF framework for some distributions. For example, for the squared
exponential kernel, instead of generating features as by taking random τi,i can also be seen as a measure of the importance of the row ai.
samples from a Gaussian, we generate them as: Most leverage score sampling methods apply ridge regularisation to
Code 2 Example of Quasi-Monte Carlo sampling of a Gaussian power the leverage scores equation, controlled by the regularisation parameter
spectral density of a squared exponential kernel using a Halton se- λLRFF given by:
quence.
i, i ( ) = aTi (AAT + LRFF I )
1a
i (25)

9
P. Milton, et al. Epidemics 29 (2019) 100362

The resulting scores are termed ridge leverage scores (El Alaoui and concatenated into a frequency matrix, m× d
. If we consider a
Mahoney, 2014). The regularisation serves a nearly identical purpose as squared exponential kernel, Ω is actually just a random Gaussian ma-
when applied in the context of linear regression; ensuring the matrix trix, as sampled frequencies are simply standard normal distribution
inversion is always possible with scores that less sensitive to pertur- random variables scaled by the kernel parameter, σ. Therefore, the
bations in the underlying matrix. The addition of the ridge regularisa- matrix = G , where G is a random Gaussian matrix of dimension
1

tion stabilises leverage scores computation and permits fast leverage m ×d

. In orthogonal random features (ORF), the aim is to impose or-
score sampling methods that approximate leverage scores using subsets thogonality on Ω, such that it contains significantly less redundancy
of the full data (Musco and Musco, 2017; Rudi et al., 2018; Drineas than a random Gaussian matrix, capable of faster convergence to the
et al., 2012; Cohen et al., 2017). full kernel matrix with lower variance estimates. The G in this section
Leverage scores improve upon the standard RFF method by sam- refers to the random Gaussian matrix and should not be confused with
pling ω's with probability proportional to importance rather than their the gram matrix, but the G notation is used to be consistent with the
spectral density. The important features are extracted by sub-sampling original paper (Felix et al., 2016).
columns of a matrix ΦRFF(X) with probability proportional to their The simplest method to impose orthogonality would be to replace G
leverage scores. The resulting sub-sample should contain all the im- with a random orthogonal matrix, O. However, simply replacing G with
portant features such that an equally accurate approximation to the full O means that the RFF will no longer be an unbiased estimator (Rahimi
matrix can be achieved using a smaller number of features. A key dis- and Recht, 2007). For example, for the squared exponential/Gaussian
tinction to note is that the formula for the leverage score in Eq. (25), kernel, the row norms of the G matrix will follow a Chi distribution. In
and much of the literature, seeks the leverage scores of the rows (τi,i comparison, the orthogonal matrix, O, will have rows with unit norm
give the importance of the ith row of A). For RFF, we want the leverage by definition. Thus, replacing G with O does not produce an equivalent
scores of the columns as they correspond to the Fourier features, and unbiased estimator. To return to equivalency, the orthogonal matrix, O,
thus the transpositions in Eq. (25) must be swapped. The ridge leverage must be scaled by the diagonal matrix, S, with diagonal entries that are
score of the Fourier features matrix is given by: random variables drawn from a Chi distribution with D degrees of
freedom. This ensures that the row norms of G and SO will be identi-
T ( ) = diag( (X )( (X )T (X ) + LRFF I )
1 (X )T ) cally distributed and can be used to construct a matrix of orthogonal
random frequencies given by ORF = SO. The elements of S are only
1
(X )T (26)
1 (xi)T
i, i ( ) = (x i)( (X ) + LRFF I )
Chi distributed random variables when G is a random Gaussian matrix.
Note, in Eq. (26) RFF subscript is suppressed for legibility For other kernels, the diagonal elements of matrix S are computed as
(Φ(X) = ΦRFF(X)) and Φ(xi) = ΦRFF(xi). The computation and inversion the L2 norm for the corresponding row in G (the definition of a Chi
of the matrix ΦRFF(X) increases the computational burden of this ap- distributed random variable is identical to taking the L2 norm of a set of
proach, but only has to be calculated once for a given regularisation standard normal distributed variables). Therefore, the ith diagonal
parameter. With suitable scaling, we can now sample Fourier features element of S is calculated by:
with a probability proportional to the leverage distribution, allowing us
to sample features proportional to their importance. The code is given D

by: si, i = Gi 2 = Gi2, j

Code 3 LRFF example. j =1 (27)

Thus, generating orthogonal random features for a given kernel re-

quires three steps. First, derive the orthogonal matrix O by performing
11.4. Orthogonal random features QR decomposition on the feature matrix G (where O corresponding to
the Q matrix of QR decomposition). See Gentle (2012) for an excellent
One of the benefits of RFF is that they allow us to define kernels in summary of QR decomposition. Second, compute the entries of the
higher dimensions. For example, one can use a kernel in 4-dimensions diagonal matrix, S, by talking the L2 norms of corresponding rows of G.
Finally, compute the orthogonal feature matrix as ORF = SO. This
1
to represent Cartesian spatial coordinates x, y, z and time, t, the foun-
dation for any spatiotemporal modelling. However, increasing di- orthogonal frequency matrix replaces the random frequency matrix to
mensionality comes at a cost of increased variance of the RFF estimate generate the orthogonal random feature matrix ΦORF(X), computed as
and significantly more features to achieve an accurate approximation
T T
ORF (X ) = [cos(X ORF ) sin(X ORF )]
N × 2m
. The matrix ΦORF(X)
(Felix et al., 2016). One proposed approach to solve this issue with high will contain orthogonal Fourier features with significantly less re-
dimensional kernels is to draw each new feature dimension orthogon- dundant features than the standard RFF method. The R code for ORF is
ally. as follows:
In the standard RFF method, the sampled frequencies can be Code 4 ORF example.

10
P. Milton, et al. Epidemics 29 (2019) 100362

12. Conclusion
Felix et al. (2016) extended the ORF method further to a method
known as structured ORF (SORF). The SORF method avoids the com- Regression is a key technique for nearly all scientific disciplines and
putationally expensive steps of deriving the orthogonal matrix ( (N 3) can be extended from its simplest forms to highly complex and flexible
time) and computing random basis matrix ( (N 2) time) by replacing models capable of describing nearly any type of data using feature
the random orthogonal matrix, O, by a class of specially structured mapping. Kernels permit working with infinite dimensional feature
matrices (consisting of products of binary diagonal matrices and Walsh- spaces but are not computationally feasible with large datasets. The RFF
Hadamard matrices) that have orthogonality with near-Gaussian entries method is capable of approximating the full kernel matrix and valid for
and can use highly efficient algorithms (such as the fast Walsh-Hada- kernels in high dimensions, but its key advantage over other methods is
mard transform) (Felix et al., 2016). The SORF method maintains a that using RFF to solve the primal rather than the dual solution provides
lower approximation error than the standard RFF method and is sig- significant computational benefit. All these advantages are achieved
nificantly more computationally efficient than ORF, with computing using only a probability distributions and trigonometric functions and
ΦSORF(X) taking only (N log(N )) time. However, technically the re- can be encapsulated in 4 lines of code. While some advanced deriva-
sulting approximation of the kernel is no longer unbiased. tives of existing sparse or low-rank methods may outperform the
standard RFF, several advanced RFF methods are emerging that con-
11.5. Non-stationary and arbitrary kernel functions tinue to improve on the standard RFF method. To that end, random
Fourier features and their extensions represent an exciting new tool for
One of the most significant limitations to the standard RFF method multi-dimensional spatial analysis on large datasets.
is the restriction to shift-invariant kernels, where k(xi, xj) = k(xi − xj).
This restriction means that the kernel value is only dependent on the lag Author contributions
or distance between the points rather than the actual locations. This
property imposes stationarity on the spatiotemporal process. While this PM drafted the manuscript which was read, revised for critical in-
assumption is not unreasonable, and non-stationarity is often uni- tellectual content, modified and then approved by all authors. HC
dentifiable, in some cases the relaxation of stationarity can significantly provided significant support on the mathematical aspects, and EG
improve model performance (Paciorek and Schervish, 2006). provided significant support on the spatial statistics. SB conceived and
To extend the RFF method to non-stationary kernels requires a more the paper and provided significant support on the structuring, drafting
general representation of Bochner's theorem capable of capturing the and editing of the paper itself.
spectral characteristics of both stationary and non-stationary kernels.
This extension (Yaglom, 1987) states than any kernel (stationary or Funding
non-stationary) can be expressed as its Fourier transform in the form of:
T T PM would like to acknowledge the Medical Research Council-
k (x i , x j ) = ei ( 1 xi 2 xj) ( 1) ( 2) d 1d 2 (28)
Rd × Rd Doctoral Training Programme at Imperial College London. HC's re-
search is funded by a studentship from the Wellcome Trust. SB receives
This equation is nearly identical to the original derivation of
financial support from the Bill and Melinda Gates Foundation
Bochner's theorem given in Eq. (18), but now we have two spectral
densities on D to integrate over. It is easy if that if the two spectral (1606H5002/JH6). EG is supported by an MRC Strategic Skills
Fellowship in Biostatistics (MR/M015297/1). The funders had no role
densities are the same, the function returns to the definition for a sta-
tionary kernel. Applying the same treatment and Monte Carlo integra- preparation or publication of the manuscript. The views, opinions, as-
sumptions or any other information set out in this article are solely
tion can be performed to give the feature space of the non-stationary
kernel (Ton et al., 2018), now given by: those of the authors. The authors thank the UK National Institute for
Health Research Health Protection Research Unit (NIHR HPRU) in
cos( T
1 x) + cos( T
2 x)
Modelling Methodology at Imperial College London in partnership with
RFF (x) = T T
, {1,2} ={ {1,2},1, …, {1,2}, m } Public Health England (PHE) for funding (grant HPRU-2012-10080).
sin( 1 x) + sin( 2 x)
i .i.d .
{1,2} ( ) (29) Conflicts of interest
Note, that this derivation requires drawing independent samples for
None declared.
both of the spectral densities, l ( ) , such that we generate two fre-
quency matrices, l m× d.

In both the stationary and non-stationary case the choice of the Acknowledgements
kernel is often arbitrary or made with knowledge of the process being
modelled. For example, if the spatial data is expected to be very We gratefully acknowledge Reviewer 1 and 2 for their insightful
smooth, then a squared exponential kernel can be used. It is, however, comments in the manuscript review process.
possible to treat ω as unknown kernel parameters variables and infer
their values (Ton et al., 2018). This is equivalent to deriving an em- Appendix A. Supplementary data
pirical spectral distribution. This strategy is data dependent and can
achieve impressive results; however, great care must be taken to avoid Supplementary data associated with this article can be found, in the
overfitting (Ton et al., 2018). online version, at https://doi.org/10.1016/j.epidem.2019.100362.

11
P. Milton, et al. Epidemics 29 (2019) 100362

References arXiv:1807.02582.
Li, Z., Ton, J.-F., Oglic, D., Sejdinovic, D., 2018. A Unified Analysis of Random Fourier
Features. arXiv preprint arXiv:1806.09178.
Andres, L.A., et al., 2018. Geo-Spatial Modeling of Access to Water and Sanitation in Lindgren, F., Rue, H., Lindström, J., 2011. An explicit link between Gaussian fields and
Nigeria. The World Bank. Gaussian Markov random fields: the stochastic partial differential equation approach.
Asmussen, S., Glynn, P.W., 2007. Stochastic Simulation: Algorithms and Analysis. J. R. Stat. Soc.: Ser. B (Stat. Methodol.) 73, 423–498. https://doi.org/10.1111/j.
Springer Science & Business Media. 1467-9868.2011.00777.x.
Avron, H., et al., 2018. Random Fourier Features for Kernel Ridge Regression: Mahoney, M.W., Drineas, P., 2009. CUR matrix decompositions for improved data ana-
Approximation Bounds and Statistical Guarantees. CoRR abs/1804.09893. arXiv: lysis. Proc. Natl. Acad. Sci pnas-0803205106.
1804.09893. McCullagh, P., Nelder, J.A., 1989. Generalized Linear Models. Chapman and Hall,
Bach, F.R., Jordan, M.I., 2005. Predictive low-rank decomposition for kernel methods. In: London, UK.
Proceedings of the 22nd international conference on Machine learning – ICML’05. Mena, I., et al., 2016. Origins of the 2009 H1N1 influenza pandemic in swine in Mexico.
ISBN: 1595931805. Elife 5, e16777.
Bach, F.R., 2012. Sharp Analysis of Low-Rank Kernel Matrix Approximations. CoRR abs/ Musco, C., Musco, C., 2017. Recursive Sampling for the Nystrom Method in Advances in
1208.2015. arXiv: 1208.2015. Neural Information Processing Systems. pp. 3833–3845.
Bell, J.B., Tikhonov, A.N., Arsenin, V.Y., 1978. Solutions of ill-posed problems. Math. Nelder, J.A., Wedderburn, R.W., 1972. Generalized linear models. J. R. Stat. Soc.: Ser. A
Comput ISSN: 00255718. (Gen.) 135, 370–384.
Bochner, S., Chandrasekharan, K., 1949. Fourier Transforms. Princeton University Press. Niederreiter, H., 1978. Quasi-Monte Carlo methods and pseudo-random numbers. Bull.
Boyd, S., Vandenberghe, L., 2004. Convex Optimization. Cambridge University Press. Am. Math. Soc. 84, 957–1041.
Bracewell, R.N., Bracewell, R.N., 1986. The Fourier Transform and Its Applications. Noma, M., et al., 2002. Rapid epidemiological mapping of onchocerciasis (REMO): its
McGraw-Hill, New York. application by the African Programme for Onchocerciasis Control (APOC). Ann. Trop.
Cameron, A.C., Trivedi, P.K., 2013. Regression Analysis of Count Data. Cambridge Med. Parasitol. 96, S29–S39. https://doi.org/10.1179/000349802125000637. ISSN:
University Press. 0003-4983.
Carlin, B.P., Louis, T.A., 2008. Bayesian Methods for Data Analysis. CRC Press. Osgood-Zimmerman, A., et al., 2018. Mapping child growth failure in Africa between
Cohen, M.B., Musco, C., Musco, C., 2017. Input sparsity time low-rank approximation via 2000 and 2015. Nature 555, 41.
ridge leverage score sampling. Proceedings of the Twenty-Eighth Annual ACM-SIAM Paciorek, C.J., Schervish, M.J., 2006. Spatial modelling using a new class of nonstationary
Symposium on Discrete Algorithms 1758–1777. covariance functions. Environmetrics 17, 483–506.
Cuadros, D.F., et al., 2017. Mapping the spatial variability of HIV infection in Sub- Rahimi, A., Recht, B., 2007. Random features for large scale kernel machines. Adv. Neural
Saharan Africa: effective information for localized HIV prevention and control. Sci. Inf. Process. Syst. ISSN: 0033-6599, arXiv: arXiv:1409.1151v1.
Rep. 7, 9093. Rasmussen, C.E., Williams, C.K.I., 2005. Gaussian Processes for Machine Learning
Davidson, R., MacKinnon, J.G., et al., 2004. Econometric Theory and Methods. Oxford (Adaptive Computation and Machine Learning). The MIT Press ISBN: 026218253X.
University Press, New York. Rudi, A., Rosasco, L., 2017. Generalization Properties of Learning With Random Features
Diggle, P., Ribeiro, P., 2007. Model-Based Geostatistics. English. Springer ISBN: in Advances in Neural Information Processing Systems. pp. 3215–3225.
0387329072 978-0387329079. Rudi, A., Camoriano, R., Rosasco, L., 2015. Less is More: Nyström Computational
Diggle, P.J., Tawn, J., Moyeed, R., 1998. Model-based geostatistics. J. R. Stat. Soc. Ser. C Regularization in Advances in Neural Information Processing Systems. pp.
(Appl. Stat.) 47, 299–350. 1657–1665.
Ding, Y., Kondor, R., Eskreis-Winkler, J., 2017. Multiresolution Kernel Approximation for Rudi, A., Calandriello, D., Carratino, L., Rosasco, L., 2018. On Fast Leverage Score
Gaussian Process Regression in Advances in Neural Information Processing Systems. Sampling and Optimal Learning in Advances in Neural Information Processing
pp. 3740–3748. Systems. pp. 5673–5683.
Domingos, P., 2000. A Unified Bias-Variance Decomposition and its Applications. Science. Rudin, W., 1990. Fourier Analysis On Groups. ISBN: 047152364X.
Drineas, P., Magdon-Ismail, M., Mahoney, M.W., Woodruff, D.P., 2012. Fast approx- Rue, H., Held, L., 2005. Gaussian Markov Random Fields: Theory and Applications. CRC
imation of matrix coherence and statistical leverage. J. Mach. Learn. Res. 13, Press.
3475–3506. Saad, Y., Schultz, M.H., 1986. GMRES: a generalized minimal residual algorithm for
El Alaoui, A., Mahoney, M., 2014. Fast Randomized Kernel Methods With Statistical solving nonsymmetric linear systems. SIAM J. Sci. Stat. Comput. 7, 856–869.
Guarantees, vol. 1411 arXiv preprint. arXiv. Shawe-Taylor, J., Cristianini, N., et al., 2004. Kernel Methods for Pattern Analysis.
Farrar, D.E., Glauber, R.R., 1967. Multicollinearity in regression analysis: the problem Cambridge University Press.
revisited. Rev. Econ. Stat. 92–107. Stein, M.L., 2012. Interpolation of Spatial Data: Some Theory for Kriging. Springer
Felix, X.Y., Suresh, A.T., Choromanski, K.M., Holtmann-Rice, D.N., Kumar, S., 2016. Science & Business Media.
Orthogonal Random Features in Advances in Neural Information Processing Systems. Straeter, T.A., 1971. On the Extension of the Davidon-Broyden Class of Rank One, Quasi-
pp. 1975–1983. Newton Minimization Methods to an Infinite Dimensional Hilbert Space With
Geman, S., Bienenstock, E., Doursat, R., 1992. Neural networks and the bias/variance Applications to Optimal Control Problems.
dilemma. Neural Comput. ISSN: 0899-7667. arXiv: arXiv:1011.1669v3. Tatem, A.J., et al., 2010. Ranking of elimination feasibility between malaria-endemic
Gentle, J.E., 2012. Numerical Linear Algebra for Applications in Statistics. Springer countries. Lancet 376, 1579–1591.
Science & Business Media. Tikhonov, A.N., 1963. Solution of incorrectly formulated problems and the regularization
Gething, P.W., et al., 2016. Mapping Plasmodium falciparum mortality in Africa between method. Soviet Math ISSN: 10634584.
1990 and 2015. New Engl. J. Med. 375, 2435–2445. Ton, J.-F., Flaxman, S., Sejdinovic, D., Bhatt, S., 2018. Spatial mapping with Gaussian
Gittens, A., Mahoney, M.W., 2016. Revisiting the Nyström method for improved large- processes and nonstationary Fourier features. Spat. Stat. 28, 59–78. One world, one
scale machine learning. J. Mach. Learn. Res. 17, 3977–4041. health, ISSN: 2211-6753. http://www.sciencedirect.com/science/article/pii/
Gleason, B.L., et al., 2017. Geospatial analysis of household spread of Ebola virus in a S2211675317302890.
quarantined village-Sierra Leone, 2014. Epidemiol. Infect. 145, 2921–2929. Vatcheva, K.P., Lee, M., McCormick, J.B., Rahbar, M.H., 2016. Multicollinearity in re-
Graetz, N., et al., 2018. Mapping local variation in educational attainment across Africa. gression analyses conducted in epidemiologic studies. Epidemiology (Sunnyvale,
Nature 555, 48. Calif.) 6.
Halton, J.H., 1964. Algorithm 247: radical-inverse quasi-random point sequence. Velleman, P.F., Welsch, R.E., 1981. Efficient computing of regression diagnostics. Am.
Commun. ACM 7, 701–702. https://doi.org/10.1145/355588.365104. ISSN: 0001- Stat. 35, 234–242.
0782. Whittle, P., 1954. On stationary processes in the plane. Biometrika 434–449.
Hay, S.I., et al., 2013. Global mapping of infectious disease. Philos. Trans. R. Soc. B: Biol. Whittle, P., 1963. Stochastic-processes in several dimensions. Bull. Int. Stat. Inst. 40,
Sci. 368, 20120250. 974–994.
Hoaglin, D.C., Welsch, R.E., 1978. The hat matrix in regression and ANOVA. Am. Stat. 32, Williams, C.K.I., Seeger, M., 2001. In: In: Leen, T.K., Dietterich, T.G., Tresp, V. (Eds.),
17–22. Advances in Neural Information Processing Systems, vol. 13. MIT Press, pp. 682–688.
Hoerl, A.E., Kennard, R.W., 1970. Ridge regression: biased estimation for nonorthogonal http://papers.nips.cc/paper/1866-using-the-nystrommethod-to-speed-up-kernel-
problems. Technometrics. ISSN: 15372723. arXiv: 9809069v1 [arXiv:gr-qc]. machines.pdf.
Ionescu, C., Popa, A., Sminchisescu, C., 2017. Large-scale data-dependent kernel ap- Yaglom, A.M., 1987. Correlation Theory of Stationary and Related Random Functions.
proximation. In: In: Singh, A., Zhu, J. (Eds.), Proceedings of the 20th International Springer New York, New York, NY.
Conference on Artificial Intelligence and Statistics, vol. 54. PMLR, Fort Lauderdale, Yang, T., Li, Y.-f., Mahdavi, M., Jin, R., Zhou, Z.-H., 2012. In: In: Pereira, F., Burges,
FL, USA, pp. 19–27 April. C.J.C., Bottou, L., Weinberger, K.Q. (Eds.), Advances in Neural Information
Josepha, G., Gething, P.W., Bhatt, S., Ayling, S.C., 2019. Understanding the Geographical Processing Systems, vol. 25. Curran Associates, Inc., pp. 476–484. http://papers.nips.
Distribution of Stunting in Tanzania: A Geospatial Analysis of the 2015–16. cc/paper/4588-nystrom-method-vs-random-fourier-features-a-theoretical-
Demographic and Health Survey. andempirical-comparison.pdf.
Kanagawa, M., Hennig, P., Sejdinovic, D., Sriperumbudur, B.K., 2018. Gaussian Processes
and Kernel Methods: A Review on Connections and Equivalences. arXiv preprint

Geospatial Health Data: Modeling and Visualization With R-INLA and Shiny (Chapman & Hall/CRC Biostatistics Series) - ISBN 036735795X, 978-0367357955
100% (31)
Geospatial Health Data: Modeling and Visualization With R-INLA and Shiny (Chapman & Hall/CRC Biostatistics Series) - ISBN 036735795X, 978-0367357955
23 pages
JNTUK R20 B.Tech CSE 4-1 Deep Learning Techniques Unit 1 Notes
No ratings yet
JNTUK R20 B.Tech CSE 4-1 Deep Learning Techniques Unit 1 Notes
15 pages
Geospatial Health Data Modeling and Visualization with R INLA and Shiny 1st Edition Digital Download
100% (14)
Geospatial Health Data Modeling and Visualization with R INLA and Shiny 1st Edition Digital Download
15 pages
Spatial Analysis in Epidemiology Full Text EPUB
100% (9)
Spatial Analysis in Epidemiology Full Text EPUB
14 pages
Spatial Analysis in Epidemiology Enhanced eBook Download
100% (1)
Spatial Analysis in Epidemiology Enhanced eBook Download
17 pages
Using R for Bayesian Spatial and Spatio Temporal Health Modeling - 1st Edition Scribd PDF Download
100% (10)
Using R for Bayesian Spatial and Spatio Temporal Health Modeling - 1st Edition Scribd PDF Download
16 pages
Disease Mapping From Foundations to Multidimensional Modeling - 1st Edition Optimized DOCX Download
100% (8)
Disease Mapping From Foundations to Multidimensional Modeling - 1st Edition Optimized DOCX Download
16 pages
Isatis Neo Techrefs
100% (2)
Isatis Neo Techrefs
142 pages
Using R for Bayesian Spatial and Spatio Temporal Health Modeling - 1st Edition High-Resolution PDF Download
100% (1)
Using R for Bayesian Spatial and Spatio Temporal Health Modeling - 1st Edition High-Resolution PDF Download
16 pages
Lecture_11_AGD_restart_lower_bounds
No ratings yet
Lecture_11_AGD_restart_lower_bounds
5 pages
Coursework 5 - Web
No ratings yet
Coursework 5 - Web
16 pages
Disease Mapping From Foundations to Multidimensional Modeling 1st Edition Miguel A. Martinez-Beneito - Quickly access the ebook and start reading today
100% (2)
Disease Mapping From Foundations to Multidimensional Modeling 1st Edition Miguel A. Martinez-Beneito - Quickly access the ebook and start reading today
60 pages
Using R For Bayesian Spatial And Spatiotemporal Health Modeling Andrew B Lawson pdf download
No ratings yet
Using R For Bayesian Spatial And Spatiotemporal Health Modeling Andrew B Lawson pdf download
86 pages
Super Resolution: Ms. Manisha A. Bhusa
No ratings yet
Super Resolution: Ms. Manisha A. Bhusa
5 pages
Buku Spatial Epidemilogy
No ratings yet
Buku Spatial Epidemilogy
443 pages
Andrew B Lawson - Using R for Bayesian Spatial and Spatio-Temporal Health Modeling-CRC Press (2021)
No ratings yet
Andrew B Lawson - Using R for Bayesian Spatial and Spatio-Temporal Health Modeling-CRC Press (2021)
300 pages
Parameter-Free Plug-And-play Admm For Image Restoration
No ratings yet
Parameter-Free Plug-And-play Admm For Image Restoration
5 pages
Spatial Project
No ratings yet
Spatial Project
7 pages
FRK Large Spatial SpatioTemporal Datasets
No ratings yet
FRK Large Spatial SpatioTemporal Datasets
41 pages
zhang15d
No ratings yet
zhang15d
42 pages
Fourier Feature Approximations For Periodic Kernels
No ratings yet
Fourier Feature Approximations For Periodic Kernels
8 pages
arning Time Series Classification with Fisher Information
No ratings yet
arning Time Series Classification with Fisher Information
22 pages
Liu et al. - 2021 - Random Features for Kernel Approximation A Survey on Algorithms, Theory, and Beyond
No ratings yet
Liu et al. - 2021 - Random Features for Kernel Approximation A Survey on Algorithms, Theory, and Beyond
35 pages
(Legal Code) Disclaimer
No ratings yet
(Legal Code) Disclaimer
58 pages
Interpolation Slides - EFTF
No ratings yet
Interpolation Slides - EFTF
8 pages
Lecture03_kernel
No ratings yet
Lecture03_kernel
28 pages
Suriya Gunasekar CV
No ratings yet
Suriya Gunasekar CV
5 pages
Learning From Noisy Labels With Deep Neural Networks Survey
No ratings yet
Learning From Noisy Labels With Deep Neural Networks Survey
19 pages
Fields Vignette
No ratings yet
Fields Vignette
135 pages
Geophysical Prospecting - 2024 - Li - One‐dimensional deep learning inversion of marine controlled‐source electromagnetic (1)
No ratings yet
Geophysical Prospecting - 2024 - Li - One‐dimensional deep learning inversion of marine controlled‐source electromagnetic (1)
21 pages
Estimation Methods in Network Analysis
No ratings yet
Estimation Methods in Network Analysis
57 pages
On The Selection Stability of Stability Selection
No ratings yet
On The Selection Stability of Stability Selection
20 pages
10.3934_math.2021633
No ratings yet
10.3934_math.2021633
17 pages
Period Kernal Approx
No ratings yet
Period Kernal Approx
11 pages
Methods For Scalar-On-Function-Regression
No ratings yet
Methods For Scalar-On-Function-Regression
26 pages
Varad CPP
No ratings yet
Varad CPP
51 pages
A Penalized Functional Linear Cox Regression Model For Spatially-Defined Environmental Exposure With An Estimated Buffer Distance
No ratings yet
A Penalized Functional Linear Cox Regression Model For Spatially-Defined Environmental Exposure With An Estimated Buffer Distance
27 pages
Class04 Feature+Kernels
No ratings yet
Class04 Feature+Kernels
35 pages
Data Fusion in Radial Basis Function Networks For Spatial Regression
No ratings yet
Data Fusion in Radial Basis Function Networks For Spatial Regression
13 pages
Collaborative Filtering - Dotx
No ratings yet
Collaborative Filtering - Dotx
36 pages
MappingRegionalClimate
No ratings yet
MappingRegionalClimate
198 pages
A_Seismic_Sensor_based_Human_Activity_Recognition_Framework_using_Deep_Learning
No ratings yet
A_Seismic_Sensor_based_Human_Activity_Recognition_Framework_using_Deep_Learning
8 pages
Functional Singular Value Decomposition: Jianbin Tan, Pixu Shi and Anru R. Zhang
No ratings yet
Functional Singular Value Decomposition: Jianbin Tan, Pixu Shi and Anru R. Zhang
89 pages
Estimating The Support of A High-Dimensional Distribution
No ratings yet
Estimating The Support of A High-Dimensional Distribution
28 pages
Kriging vs. Simulation, a 2D Map Example — GeostatsPy Well-documented Demonstration Geostatistical Workflows
No ratings yet
Kriging vs. Simulation, a 2D Map Example — GeostatsPy Well-documented Demonstration Geostatistical Workflows
16 pages
Linear Regression: Volker Tresp 2017
No ratings yet
Linear Regression: Volker Tresp 2017
25 pages
Lecture17 Kernels
No ratings yet
Lecture17 Kernels
23 pages
Random Fourier Features — Random Walks
No ratings yet
Random Fourier Features — Random Walks
10 pages
A Survey On Transfer Learning: Sinno Jialin Pan and Qiang Yang, Fellow, IEEE
No ratings yet
A Survey On Transfer Learning: Sinno Jialin Pan and Qiang Yang, Fellow, IEEE
15 pages
Tire Noise Sound Synthesis
No ratings yet
Tire Noise Sound Synthesis
16 pages
Integrating Geostatistical Maps and Infectious Disease Transmission Models Using Adaptive Multiple Importance Sampling
No ratings yet
Integrating Geostatistical Maps and Infectious Disease Transmission Models Using Adaptive Multiple Importance Sampling
19 pages
Supervised Learning Algorithms cheat sheet
No ratings yet
Supervised Learning Algorithms cheat sheet
20 pages
EVT - Inductrial 6 12
No ratings yet
EVT - Inductrial 6 12
7 pages
Physics-Based Learning Models For Ship Hydrodynamics
No ratings yet
Physics-Based Learning Models For Ship Hydrodynamics
22 pages
Spatial Statistics in R
No ratings yet
Spatial Statistics in R
29 pages
Disease Mapping
No ratings yet
Disease Mapping
35 pages
Spatial_Statistical_Modelling_of_Insurance_Claims
No ratings yet
Spatial_Statistical_Modelling_of_Insurance_Claims
79 pages
mFD
No ratings yet
mFD
93 pages
TFM Oviedo de La Fuente
No ratings yet
TFM Oviedo de La Fuente
92 pages
Statistical Computing in Functional Data Analysis
No ratings yet
Statistical Computing in Functional Data Analysis
28 pages
Week 9 Notes
No ratings yet
Week 9 Notes
6 pages
Divide and Conquer Kernel Ridge Regression: University of California, Berkeley University of California, Berkeley
No ratings yet
Divide and Conquer Kernel Ridge Regression: University of California, Berkeley University of California, Berkeley
26 pages
Spatial Relationships Between Two Georeferenced Variables With Applications in R ISBN 3030566803, 9783030566807 One-Click eBook Download
No ratings yet
Spatial Relationships Between Two Georeferenced Variables With Applications in R ISBN 3030566803, 9783030566807 One-Click eBook Download
16 pages
Intro Spatial Models INLA-3-43
No ratings yet
Intro Spatial Models INLA-3-43
41 pages
Machine Larning-Assisted Modeling of Composite Materials and Structures - A Review
No ratings yet
Machine Larning-Assisted Modeling of Composite Materials and Structures - A Review
21 pages
Intro&NP Stat
No ratings yet
Intro&NP Stat
122 pages
A Survey of Regularization Methods For First-Kind Volterra Equations
No ratings yet
A Survey of Regularization Methods For First-Kind Volterra Equations
30 pages
Spatial Statistics in R
No ratings yet
Spatial Statistics in R
29 pages
Robust Principal Component Functional Logistic Regression
No ratings yet
Robust Principal Component Functional Logistic Regression
23 pages
Matrix Factorization - A Simple Tutorial and Implementation in Python
No ratings yet
Matrix Factorization - A Simple Tutorial and Implementation in Python
9 pages
Spatial Point Patterns
No ratings yet
Spatial Point Patterns
27 pages
2017 Unser (Slides) Biomedical Image Reconstruction
No ratings yet
2017 Unser (Slides) Biomedical Image Reconstruction
37 pages
Data Science Interview Questions
100% (1)
Data Science Interview Questions
300 pages
Revisiting Gaussian Markov random fields and Bayesian disease mapping
No ratings yet
Revisiting Gaussian Markov random fields and Bayesian disease mapping
19 pages
Application of the iteratively regularized Gauss–Newton method to parameter identification problems in Computational Fluid Dynamics
No ratings yet
Application of the iteratively regularized Gauss–Newton method to parameter identification problems in Computational Fluid Dynamics
12 pages
Ex 4
No ratings yet
Ex 4
15 pages
Datos Funcionales Multivariados Muestreo y Predicción
No ratings yet
Datos Funcionales Multivariados Muestreo y Predicción
18 pages
Kriging
No ratings yet
Kriging
4 pages
Introduction To Geostatistics: Andr As B Ardossy Institute of Hydraulic Engineering University of Stuttgart
100% (3)
Introduction To Geostatistics: Andr As B Ardossy Institute of Hydraulic Engineering University of Stuttgart
134 pages
8 Machine Learning in Trading
No ratings yet
8 Machine Learning in Trading
17 pages
A General Frequency Domain Method For Assessing Spatial Covariance Structures
No ratings yet
A General Frequency Domain Method For Assessing Spatial Covariance Structures
25 pages
Gadgetron: An Open Source Framework For Medical Image Reconstruction
No ratings yet
Gadgetron: An Open Source Framework For Medical Image Reconstruction
9 pages
Fast Doc
No ratings yet
Fast Doc
22 pages
The Evolving Interface: A Journey Through Computational Geometry, Fluid Mechanics, Computer Vision, and Materials Science
From Everand
The Evolving Interface: A Journey Through Computational Geometry, Fluid Mechanics, Computer Vision, and Materials Science
Pasquale De Marco
No ratings yet
Big-O Notation Demystified: Definitive Reference for Developers and Engineers
From Everand
Big-O Notation Demystified: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
An Introduction to Information Theory
From Everand
An Introduction to Information Theory
Fazlollah M. Reza
No ratings yet
Some Mathematical Methods of Physics
From Everand
Some Mathematical Methods of Physics
Gerald Goertzel
No ratings yet
Support Vector Machine: Fundamentals and Applications
From Everand
Support Vector Machine: Fundamentals and Applications
Fouad Sabry
No ratings yet
Finite Elements and Approximation
From Everand
Finite Elements and Approximation
O. C. Zienkiewicz
4.5/5 (4)
Radial Basis Networks: Fundamentals and Applications for The Activation Functions of Artificial Neural Networks
From Everand
Radial Basis Networks: Fundamentals and Applications for The Activation Functions of Artificial Neural Networks
Fouad Sabry
No ratings yet

Análise Espacial Com Regressão Linear e Kernel

Uploaded by

Análise Espacial Com Regressão Linear e Kernel

Uploaded by

Epidemics 29 (2019) 100362

Contents lists available at ScienceDirect

Spatial analysis made easy with linear regression and kernels T

ARTICLE INFO ABSTRACT

the design matrix X can be replaced with (X ) N ×D

where Throughout this paper, we have focused on squared loss/Gaussian

tion stabilises leverage scores computation and permits fast leverage m ×d

by: si, i = Gi 2 = Gi2, j

Thus, generating orthogonal random features for a given kernel re-

You might also like