
SIAM J. SCI. COMPUT.


© 2016 Society for Industrial and Applied Mathematics
Vol. 38, No. 4, pp. B619–B644

A RECURSIVE LOCAL POLYNOMIAL APPROXIMATION METHOD


USING DIRICHLET CLOUDS AND RADIAL BASIS FUNCTIONS∗

ARTA A. JAMSHIDI† AND WARREN B. POWELL‡

Abstract. We present a recursive function approximation technique that does not require the
storage of the arrival data stream. Our work is motivated by algorithms in stochastic optimization
which require approximating functions in a recursive setting such as a stochastic approximation
algorithm. The unique collection of these features in this technique is essential for nonlinear modeling
of large data sets where the storage of the data becomes prohibitively expensive and in circumstances
where our knowledge about a given query point increases as new information arrives. The algorithm
presented here employs radial basis functions (RBFs) to provide locally adaptive parametric models
(such as linear models). The local models are updated using recursive least squares and only store the
statistical representative of the local approximations. The resulting scheme is very fast and memory
efficient without compromising accuracy in comparison to methods well accepted as the standard
and some advanced techniques used for functional data analysis in the literature. We motivate the
algorithm using synthetic data and illustrate the algorithm on several real data sets.

Key words. radial basis functions, function approximation, local polynomials, data fitting

AMS subject classifications. 65D10, 65D15, 62M10, 94A12

DOI. 10.1137/15M1008592

1. Introduction. There are three major classes of function approximation methods: look-up tables, parametric models (linear or nonlinear), and nonparametric models. Parametric regression techniques (such as linear regression [36]) assume that the
els. Parametric regression techniques (such as linear regression [36]) assume that the
underlying structure of the data is known a priori and is in the span of the regressor
function. Due to the simplicity of this approach it is commonly used for regression.
Nonparametric models [15, 38, 59] do not assume a specific structure underlying the
data. Nonparametric methods use the raw data to build local approximations of the
function, producing a flexible but data-intensive representation. Although nonpara-
metric models are generally data hungry, the resulting approximation may be more
accurate [20, 28, 56] and is less sensitive to structural errors arising from a parametric
model. Most nonparametric models require keeping track of all observed data points,
which makes function evaluations increasingly expensive as the algorithm progresses,
a serious problem in stochastic search. Regression based on least squares, Lasso re-
gression, ridge regression, and least absolute deviation all have the same underlying
structure as ordinary least squares, except the objective/penalty terms are slightly
modified. In this work we formulate a problem that strikes a balance between para-
metric and nonparametric models to benefit from the advantages of both modeling
strategies. Semiparametric models are discussed in [54].

∗ Submitted to the journal’s Computational Methods in Science and Engineering section February 17, 2015; accepted for publication (in revised form) April 28, 2016; published electronically
August 2, 2016. This work was partially supported by grant FA9550-08-1-0195 from the Air Force
Office of Scientific Research. Any opinions, findings, and conclusions or recommendations expressed
in this material are those of the authors and do not necessarily reflect the views of the Air Force
Office of Scientific Research.
http://www.siam.org/journals/sisc/38-4/M100859.html
† School of Electrical and Computer Engineering, University of Tehran, Tehran, Iran (arta.[email protected]). Previously at Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544.
‡ Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544 ([email protected]).

Our work is motivated by the need to approximate functions within stochastic search algorithms where new observations arrive recursively. The most classical representation of the problem is given by

    min_x E F(x, ν),

where E denotes the expectation operator, x is a deterministic parameter, and ν is a random variable. Other applications include approximate dynamic programming,
where we need to approximate expectations of value functions. Since we are unable to
compute the expectation directly, we depend on the use of Monte Carlo samples. We
are interested in the class of algorithms which replaces E F(x, ν) with an approximation F^n(x) which can be quickly optimized. As we obtain new information from each iteration, we need a fast and flexible method for updating the approximation F^n(x).
We require an approximation method that is more flexible than classical parametric
models offer, but we need a fast, compact representation to minimize computational
overhead.
Bayesian techniques for function approximation or regression are computationally
intense and require storage of all the data points [20]. Updating the model requires
revisiting all the data points in the history. These limitations make these methods
impractical in circumstances where there is a need to update the model as a stream of
new information arrives and in situations where there is a need for fast algorithms with
limited storage space. In addition these methods have multiple tunable parameters.
There is a rich literature on nonparametric Bayesian methods for regression [26].
Our work is closest in spirit to [55] with a proposed model that mixes over both
the covariates and response, and the response is drawn from a multinomial logistic
model. This work handles various response shapes. Dirichlet process mixtures of
generalized linear models (DP-GLMs) proposed in [26] are a generalization of this
idea for various response types. Other statistical techniques that are widely used for
regression include regression trees [4], where data is divided into a fixed, tree-based
partitioning and a regression model is fit to data in each leaf of the tree, and Gaussian
processes (GPs), which assume the observations arise from a Gaussian process model
with known covariance function. This method does not handle variations in the
variance of the response unless a Dirichlet process mixture of GPs is assumed [49].
Locally weighted scatterplot smoothing (LOWESS) [11, 12, 13] does not require
a global function to represent the underlying function. This procedure produces a
model based on segments of data based on nearest neighborhood. At each point this
method assigns a low-degree polynomial using weighted least squares to the segment of
data that are closest to the point of interest. The weighted least squares assigns more
weight to the points that are closer to the query point. This method is computationally
intensive and keeps track of all the data points. In addition, it does not incorporate
an updating method as new data points arrive.
Locally linear models [16, 17], which use a weighted mixture of locally linear
regression models, offer advantages over regular kernel smoothing regression methods
such as [60] and [41]. This technique builds linear models around each observation and
keeps track of all the data points. In addition this method does not provide a global
mathematical formula for the regression function. This method performs better than
spline methods [16].
The multilayer-perceptron and the associated back-propagation algorithm [61, 53]
and radial basis functions (RBFs) [5, 45, 44] have received considerable attention for
function approximation. Back-propagation networks require a lot of training data


and as a result are quite slow. One of the main attractions of RBFs is that the
resulting optimization problem can be broken efficiently into linear and nonlinear

subproblems. An automatic function approximation technique using RBF that does


not need to tune ad hoc parameters is proposed in [31]; the multivariate extension
of this technique is proposed in [33]. For a comprehensive treatment of various RBF
techniques that build a model by adding or pruning RBFs to fine-tune the model, see
[31] and references therein. The issue of selecting the number of basis functions with
growing and pruning algorithms from a Bayesian perspective is described in [29]. In
[1], a hierarchical full Bayesian model for RBFs is proposed. Normalized RBFs are
presented in [34] which reduce the order of the model compared to nonnormalized
techniques.
We seek a general purpose approximation method that can approximate a wide
range of functions using a compact representation which can be updated recursively
with fast function evaluations. Toward this goal, we propose a novel recursive and
fast approximation scheme for modeling nonlinear data streams that does not require
storage of the data history. The covariates may be continuous or categorical, but
we assume that the response function is continuous, although not necessarily differen-
tiable. The new method is robust in the presence of homoscedastic and heteroscedastic
noise. Our method automatically determines the model order for a given data set and
produces an analytical formulation describing the underlying function. Unlike similar
algorithms in the literature, our method has one tunable parameter. Our proposed
scheme introduces a cover over the input space to define local regions. This scheme
only stores a statistical representation of data in local regions and locally approximates
the data with a low order polynomial such as a linear model. The local model param-
eters are quickly updated using recursive least squares. A nonlinear weighting system
is associated with each local model which determines the contribution of this local
model to the overall model output. The combined effect of the local approximations
and the weights associated to them produces a nonlinear function approximation tool.
On the data sets considered in this study, our new algorithm provides superior com-
putational efficiency, both in terms of computational time and memory requirements
compared to the existing nonparametric regression methods without loss of accuracy.
The test results on various synthesized and real multivariate data sets are carried out
and compared with nonparametric regression methods in the literature. In this work
we have made an attempt to bridge various fields in statistics, approximation theory,
numerical analysis, and function approximation. Other potential areas of application
for this algorithm include active learning, to estimate where to sample a space such as
a parameter space of a complex system [14], and tracking global optima in dynamic
environments; see, e.g., [40].
The organization of this paper is as follows: Section 2 provides an overview of
RBFs and motivates the use of normalized RBFs with polynomial modulated response
terms. Section 3 introduces various aspects of the proposed algorithm, including the
notion of a cloud cover for the input space, the model building procedure, and the
updating roles. This section also provides the stopping criterion for the algorithm
and the procedure to measure the goodness of fit. Section 4 shows the convergence
properties of the proposed algorithm with finite data and asymptotically. Section 5
demonstrates the performance and robustness of the algorithm using various synthetic
and real data sets with various input dimensions and noise levels. This section provides
a comparison of the new method and available algorithms in the literature. Section 6
provides concluding remarks and discusses future work.


2. Normalized radial basis functions. RBFs are powerful tools for function
approximation [5, 7, 46]. Over the years RBFs have been used successfully for solving

a wide range of function approximation problems (see, e.g., [3]). RBFs have also been
used for global optimization (see, e.g., [51, 52, 25, 30]). An RBF expansion is a linear
summation of special nonlinear basis functions. In general, an RBF is a mapping
f : R^n → R that is represented by

(1)                f(x) = Σ_{i=1}^{Nc} α_i φ(‖x − c_i‖_{W_i}),

where x is an input pattern, φ is the RBF centered at location ci , and αi denotes the
weight for the ith RBF. Nc denotes the total number of RBFs. The term Wi , which
is a symmetric positive definite matrix, denotes the parameters in the weighted inner
product, ‖x‖_W = √(xᵀ W x).
Note that throughout this work Wi is diagonal, as explained in section 3.1. The
dimensions of the input n and output m are specified by the dimensions of the input-
output pairs. Universal approximation properties of RBFs are established in [42, 43].
Normalized RBFs of the following form were proposed by Moody and Darken [37]:
(2)                f(x) = Σ_{i=1}^{Nc} α_i φ(‖x − c_i‖_{W_i}) / Σ_{i=1}^{Nc} φ(‖x − c_i‖_{W_i}).

Normalized RBFs appear to have advantages over regular RBFs, especially in the
domain of pattern classification; see, e.g., [6]. In that work it is reported that using normalized RBFs reduced the order of the model and improved the robustness of generalization.
It has also been reported that normalized RBFs require less data when training models
of dynamical systems [34].
The polynomial modulation of the normalized RBF expansion leads to the expansion

(3)                f(x) = Σ_{i=1}^{Nc} p_i(x) φ(‖x − c_i‖_{W_i}) / Σ_{i=1}^{Nc} φ(‖x − c_i‖_{W_i}),

where p_i(x) is assumed to be a low order polynomial such as a linear function (p_i(x) = α_i x + β_i). Note that the denominator does not become zero for Nc > 1 and distinct c_i. This is a property of the RBF φ. Let ψ contain all the parameters, α_i, β_i, c_i, W_i, Nc, that are used to define f(x) as shown in (3). We make these unknown parameters explicit in the argument of f(x) by writing f(x, ψ); later we use this notation when optimizing the parameters contained in ψ.
The above formulation leads us to the idea of normalized RBFs that have poly-
nomial response terms. In what follows we provide a fast algorithm that recursively
updates the response terms and the weights associated to them as new data points
arrive. During the training phase the unknown parameters in the model need to be
calculated, including the number of RBFs, Nc , in the function expansion.
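To make the form of (3) concrete, the sketch below evaluates a normalized RBF expansion with linear response terms and a Gaussian kernel at a single query point. This is an illustration in Python/NumPy rather than the authors' implementation; the centers, per-dimension scales, and response coefficients are assumed to be given, and dividing the distance by the per-dimension scale is one common convention for the weighted norm.

```python
import numpy as np

def eval_normalized_rbf(x, centers, widths, thetas):
    """Evaluate the normalized RBF expansion (3) with linear responses at a point x.

    x       : (n,) query point
    centers : (Nc, n) kernel centers c_i
    widths  : (Nc, n) per-dimension scales (diagonal W_i), assumed positive
    thetas  : (Nc, n+1) local linear coefficients [beta_i, alpha_i_1, ..., alpha_i_n]
    """
    x = np.asarray(x, dtype=float)
    # Weighted distances r_i = ||x - c_i||_{W_i} with diagonal weights.
    r = np.sqrt(np.sum(((x - centers) / widths) ** 2, axis=1))
    phi = np.exp(-r ** 2)                          # Gaussian kernel phi(r) = exp(-r^2)
    p = thetas[:, 0] + thetas[:, 1:] @ x           # local linear responses p_i(x)
    return float(np.dot(phi, p) / np.sum(phi))     # weighted average of local models

# Tiny usage example with two kernels in one dimension: p_1(x) = x, p_2(x) = 2.
centers = np.array([[0.0], [2.0]])
widths = np.array([[1.0], [1.0]])
thetas = np.array([[0.0, 1.0], [2.0, 0.0]])
print(eval_normalized_rbf(np.array([1.0]), centers, widths, thetas))
```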
3. Online learning scheme. The goal of our proposed algorithm is to find a
mapping f from x ∈ R^n, a vector in n-dimensional space, to R such that y^k = f(x^k) as the input-output pair {(x^k, y^k)} becomes available. We expect to have a total of K arrival data points where k ∈ {1, . . . , K}, with X = {x^k}_{k=1}^K and Y = {y^k}_{k=1}^K


representing the domain and range observable values. We assume that data arrive as
a stream. Throughout the process we would like to find the parameters associated

with the model provided in (3) to find an accurate model for the data. The observed
data might be noisy, and we would like to have a model that has good generalization
ability. In this regression framework, the problem is to minimize the cost function
    E(ψ) = (1/2) Σ_{k=1}^{K} ‖f(x^k, ψ) − y^k‖²,

given the available data points. In this work, the Euclidean inner product is used for
the metric ‖·‖, and we use the Gaussian kernel,

    φ(r) = exp(−r²).

Note that other local kernels could be used. For a comprehensive review of RBF
kernels and recently developed skew and compactly supported expansions, see [32]. In
what follows the description of an algorithm in this framework is presented that does
not require the storage of the data stream and locally approximates the underlying
functional behavior of the data. The response model parameters are updated quickly
using recursive least squares, and the weights associated to each linear response are
updated via statistical techniques. Note that the model order is also determined in
the algorithm.
3.1. A data driven space cover. The algorithm proposed here works by form-
ing a cover for the domain of the underlying function f upon arrival of new data points,
xk . We define a cover
C = {Ui : i ∈ Δ},
where Ui denotes an indexed family of sets, indexed with the elements of set Δ. C is a
cover of X if X ⊆ ∪_{i∈Δ} U_i. On the arrival of the first data point, x^1, a cover element
or cloud U1 is formed with the center on this point and with a distance threshold of
DT to form a ball around x1 . When the system receives a new data point x2 with
associated y 2 value, if the data point is within distance DT of x1 , x2 is assigned to
the same cloud as x1 , and the cloud’s centroid and variance are updated; otherwise
a new cloud is created. New data points xk are assigned to clouds using the distance
metric

(4)                D_i = ‖x^k − c_i‖,

where ci is the centroid of cloud i ∈ {1, . . . , Nc }.


Let

    I* = arg min_i D_i.

Let I be the smallest index in set I* (there might be multiple minimizers). Compare
DI , with distance threshold DT ; if DI < DT , update the ball indexed with UI .
Otherwise spawn a new cloud UNc +1 , according to the previous description. When a
new element is added to the cloud, the number of local balls, Nc , is also updated to
Nc = Nc + 1.
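As an illustration of this assignment rule (a hedged sketch, not the published implementation; the data structures and names are ours), a new point is attached to the nearest existing cloud if it lies within DT of that cloud's centroid, and otherwise triggers the creation of a new cloud:

```python
import numpy as np

def assign_cloud(x_k, centroids, D_T):
    """Return the index of the cloud that should absorb x_k, or None to spawn a new cloud.

    centroids : list of current cloud centroids c_i (each an (n,) array)
    D_T       : distance threshold, the single tunable parameter of the algorithm
    """
    if not centroids:
        return None                                      # first data point: form cloud U_1
    dists = [np.linalg.norm(x_k - c) for c in centroids]  # D_i = ||x^k - c_i||, eq. (4)
    I = int(np.argmin(dists))                            # smallest index among the minimizers
    return I if dists[I] < D_T else None                  # within D_T: update U_I; else spawn a new cloud
```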
The algorithm retains a statistical representation of the data over a cover that is
specified by the data stream and a distance threshold. The distance threshold is the
only parameter in this algorithm that requires tuning. The value of this threshold
depends on the variations of the response values over the domain (i.e., complexity of


the surface landscape). The data that is stored in the model are the number of points
in each cloud, Ki , the centroid of each cloud, ci , and the variance in each dimension

of the data in each cloud, Wi (the diagonal of the sample covariance matrix). In
addition the response coefficients for each cloud are also stored. Note that the model
can be evaluated at any point of its domain after adapting the model using the new
information. If the geometry of the data is more complex, there will be a need for
more clouds in the cover to capture this behavior. The proposed space cover is a part
of our function approximation technique. We have developed this independently. In
this procedure, the clouds form, move, and could dissipate (if there are not enough
data, a possibility studied in future work) to model the topology of the underlying
data. Note that the data is not a priori sorted or organized in any way, and the cloud
centers must be learned adaptively from the data. To the best of our understanding
there are some similarities to clustering procedures such as those found in [57, 24].
3.2. Recursive update of the model. There are two parts in the model that
are updated during the training: the response to the local model associated to cloud
I, and its weights.
Recursive update for response. We solve a local least squares problem of the form min_{θI} ‖X_I^{kI} θI − Y_I^{kI}‖_2^2, where X_I^{kI} is a matrix of the form

    X_I^{kI} = [ 1  x^1 ; ⋮  ⋮ ; 1  x^{kI} ],

where the vectors x^1, . . . , x^{kI} are domain values. The vector Y_I^{kI} contains all the associated range values, Y_I^{kI} = [y^1, . . . , y^{kI}]. The vector θIᵀ = [βI, αI^1, . . . , αI^n] contains all the parameters for the linear response. The direct solution to this system is θI = (X_I^{kI ᵀ} X_I^{kI})⁻¹ X_I^{kI ᵀ} Y_I^{kI}. After receiving n + 1 affinely independent data points in a local neighborhood (we refer to this state of the model as the no-knowledge state), the matrix X_I^{kI} is formed with kI = n + 1. The recursive least squares update [10] is used to compute the linear response model parameters. For kI = n + 1, we initialize the recursion with P_I^{kI} = (X_I^{kI ᵀ} X_I^{kI})⁻¹. When a new data point (x^k, y^k) is assigned to this local neighborhood, with kI > n + 1, let aᵀ = [1, x^k]. The recursion formula is then

(5)                P_I^{kI+1} = P_I^{kI} − (P_I^{kI} a aᵀ P_I^{kI}) / (1 + aᵀ P_I^{kI} a),

(6)                θ_I^{kI+1} = θ_I^{kI} + P_I^{kI+1} a (y^k − aᵀ θ_I^{kI}),

where kI is the recursion index for the cloud indexed I. These equations are easily
modified to include a discount factor which puts more weight on recent observations.
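A direct transcription of the rank-one updates (5)–(6) is given below as a sketch in NumPy. It assumes P and θ have already been initialized from n + 1 affinely independent points, as described above; adding a discount factor would only scale P inside the update.

```python
import numpy as np

def rls_update(P, theta, x_k, y_k):
    """One recursive least squares step for a local cloud, eqs. (5)-(6).

    P     : (n+1, n+1) current matrix P_I^{k_I}
    theta : (n+1,) current response parameters [beta_I, alpha_I]
    x_k   : (n,) new domain point assigned to this cloud
    y_k   : scalar response observed with x_k
    """
    a = np.concatenate(([1.0], np.atleast_1d(x_k)))      # a^T = [1, x^k]
    Pa = P @ a
    P_next = P - np.outer(Pa, Pa) / (1.0 + a @ Pa)       # eq. (5)
    theta_next = theta + P_next @ a * (y_k - a @ theta)  # eq. (6)
    return P_next, theta_next
```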
Recursive update for the weights. When a new data point (x^k, y^k) is assigned to the local neighborhood indexed I, the weight matrix WI associated with cloud UI is also updated. For this purpose the number of data points that have been assigned to this local ball is updated as kI = kI + 1. The centers of the kernels are updated via the recursion

(7)                c_I^{kI} = c_I^{kI−1} + (x^k − c_I^{kI−1}) / kI.


The width or scale of the model in each dimension of the cloud I is updated using
the Welford formula [35]. Initialize S1 = 0. For each additional data point xk assigned

to this cloud, use the recurrence formulas

(8)                S^{kI} = S^{kI−1} + (x^k − c_I^{kI−1})ᵀ (x^k − c_I^{kI}).

The kI th estimate of the variance is given by the diagonal elements of

(9)                W_I² = S^{kI} / (kI − 1).

The weighted inner product in the argument of the RBF function (as shown in
(3)) is calculated from this recursive update of the standard deviation of data in each
dimension. Note that in this study the weight matrices are diagonal. For cases where
the standard deviation of the data along a specific dimension is zero, we introduce a
penalty term to avoid singularity by replacing Wi with Wi + P , where P is a specified
constant. The required storage is proportional to Nc (3n + 1).
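The per-cloud statistics can therefore be refreshed in O(n) work per arrival. The following sketch (variable names are ours) applies the centroid recursion (7) and the Welford recurrence (8)–(9) in their diagonal (per-dimension) form, adding the penalty P only along dimensions with zero spread:

```python
import numpy as np

def update_cloud_statistics(c, S, k, x_k, P=1e-6):
    """Update centroid c, Welford accumulator S, and count k of one cloud with point x_k.

    Returns (c_new, S_new, k_new, W) where W holds the per-dimension scales.
    """
    k_new = k + 1
    c_new = c + (x_k - c) / k_new                      # eq. (7)
    S_new = S + (x_k - c) * (x_k - c_new)              # eq. (8), elementwise (diagonal case)
    var = S_new / (k_new - 1) if k_new > 1 else np.zeros_like(S_new)  # eq. (9)
    std = np.sqrt(var)
    W = np.where(std > 0.0, std, std + P)              # replace W_I by W_I + P on degenerate dimensions
    return c_new, S_new, k_new, W
```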
Ideally, we assume that the arrival data are scattered. At initial stages of model
construction it is best to have data that are sampled randomly from the input space.
The order of the data plays a role in the initial stage of the placement of the centroids;
however, the centers of the clouds stabilize after there are enough data points in a given cloud: from (7), ‖c_I^{kI} − c_I^{kI−1}‖ = ‖x^k − c_I^{kI−1}‖ / kI < DT / kI. As the number of data points in a cloud, kI, increases, DT / kI becomes smaller. In a case where the data points arrive in one direction, a new cloud is created after a finite number of data points.
3.3. Model evaluation. This section describes how to compute the approximation F^k(x) at a query point x. As shown in (3), the output of the proposed model
is the weighted average of the linear responses over the local clouds Ui associated to
the cover C. In this study a Gaussian kernel, φ(r) = exp(−r2 ), determines the weight
of each response function. The widths, Wi , and the centroids, ci , of the kernels,
i ∈ {1, . . . , Nc }, were computed recursively as mentioned in section 3.2. The model
output is formulated as

(10)                f(x) = Σ_{i=1}^{Nc} φ(‖x − c_i‖_{W_i}) ([1, x] θ_i) / Σ_{i=1}^{Nc} φ(‖x − c_i‖_{W_i}).

Nc is the total number of clouds. Note that θi is the least squares solution of the
response over the ith local cloud. Note that if a cloud does not accumulate n + 1
affinely independent data points, the models built in other clouds are evaluated at a
query in this region.
A summary of the above procedure is presented in Algorithm 1.
3.4. Goodness of fit and stopping criteria. In data modeling one of the
major goals is to model data in such a way that the model does not over- or underfit
the data that is not used for training but is generated from the same process as the
training data. One general approach to finding such a smooth model to observed
data is known as regularization. Regularization describes the process of fitting a
smooth function through the data set using a modified optimization problem that
penalizes variation [58]. A standard technique to achieve regularization is via cross-
validation [22, 59]. Such methods involve partitioning the data into subsets of training,
validation, and testing data; for details see, e.g., [28]. To determine the tunable


Algorithm 1 DC-RBF algorithm.


Input DT , calculated using cross-validation.
Initialize Nc = 0, k = 1, Δ = ∅
while xk do
if k = 1 then
c1 = x1 , Nc = 1, k1 = 1, Δ = {1}, form cloud U1
else
compute D_i = ‖x^k − c_i‖ according to (4) for all i ∈ Δ
compute I* = arg min_i D_i and let I = min I*
if DI < DT then
update cloud UI , kI = kI + 1
update the cloud centroid, cI , using (7)
update the scale of cloud, WI , using (8) and (9) if the standard deviation of
data is zero along a dimension use WI + P , where P is a specified constant
if kI ≥ n + 1 and the points are affinely independent then
update the response parameter, θI , using (5) and (6)
else
store xk assigned to cloud I
end if
else
spawn new cloud U_{Nc+1}, Nc = Nc + 1, k_{Nc} = 1, Δ = {1, . . . , Nc}
end if
end if
k =k+1
end while
evaluate the model as needed using (10)

parameter of the model proposed in this work, i.e., the distance threshold, DT , we
use t-fold cross-validation techniques [19].
In this scheme, the original data set is randomly partitioned into t subsample sets.
One of the t subsample sets is chosen as the validation data for testing the model,
and the remaining subsample sets are used as training data. This process is repeated
t times with each of the t subsample sets used only once as the validation data. All
the t results are averaged to provide an overall estimate of the error.
The advantage of this method over repeated random subsampling is that all ob-
servations are used for both training and validation, and each observation is used for
validation exactly once [27]. We have chosen to use 5-fold cross-validation. We record
the accuracy of the model on the testing data set. This procedure is repeated for all
five folds, and the mean squared error (MSE) on the test data sets is computed at
each time using
(11)                MSE = (1/L) Σ_{l=1}^{L} (f(x^l) − y^l)²,

where L is the size of the testing data set. Overfitting is reduced by tuning the
parameter, DT , to minimize M SE calculated through cross-validation, as opposed to
minimizing the MSE of points within the training data set [28].
To increase the number of estimates, one could run the above t-fold cross-valida-
tion multiple times. For this purpose the data needs to be repartitioned each time.
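A minimal sketch of this selection procedure follows. The routines fit_dc_rbf and predict are placeholders standing in for a streaming implementation of Algorithm 1 and its evaluation rule (10); they are not functions defined in this paper.

```python
import numpy as np

def select_distance_threshold(X, Y, DT_grid, fit_dc_rbf, predict, t=5, seed=0):
    """Choose D_T by t-fold cross-validation, scoring each candidate with the MSE of eq. (11)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), t)
    cv_error = []
    for DT in DT_grid:
        fold_mse = []
        for j in range(t):
            test = folds[j]
            train = np.concatenate([folds[i] for i in range(t) if i != j])
            model = fit_dc_rbf(X[train], Y[train], DT)      # stream the training fold through the model
            residual = predict(model, X[test]) - Y[test]
            fold_mse.append(np.mean(residual ** 2))          # held-out MSE for this fold
        cv_error.append(np.mean(fold_mse))
    return DT_grid[int(np.argmin(cv_error))]                 # D_T with the smallest averaged CV error
```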


To determine the predictive ability of the models, we use several measures to show
the accuracy of the predictions. The R² value is calculated through the following formula:

    R² = 1 − Σ_{l=1}^{L} (y^l − f(x^l))² / Σ_{l=1}^{L} (y^l − ȳ)²,

where ȳ = (1/L) Σ_{l=1}^{L} y^l is the mean of y^l, L is the size of the testing data set, and T_L
is the size of the training set. The numerator measures the squared deviations of the model output from the observed values; the denominator measures the variance of the observations about their mean. This measure shows how well the model is performing in comparison to the case where the mean value is used as the
predictor. An R2 value of 1 indicates an exactly fitted model, while a value of 0
indicates a model that adds no predictive power. We have also used the notion of
mean absolute error (MAE) to record the performance of the model. The MAE is
defined as follows:
(12)                MAE = (1/L) Σ_{l=1}^{L} |f(x^l) − y^l|.

If the distance threshold is set too high, the proposed method will fit a single
hyperplane to the entire data set. In practice one could determine the DT by observing
a part of data and use this for the rest of the input data.
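The three measures above can be computed directly from predictions on a held-out test set, as in the short sketch below (f_hat denotes the model output on the test inputs; the names are ours).

```python
import numpy as np

def goodness_of_fit(f_hat, y):
    """Return MSE (11), MAE (12), and R^2 computed on a test set of size L."""
    f_hat = np.asarray(f_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    mse = np.mean((f_hat - y) ** 2)
    mae = np.mean(np.abs(f_hat - y))
    r2 = 1.0 - np.sum((y - f_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    return mse, mae, r2
```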
4. The convergence properties of the DC-RBF algorithm. This section
describes facts about the DC-RBF algorithm. Theorems describe the finite data and
asymptotic behavior of this algorithm. We assume scattered data arrive and the
underlying structure is continuously differentiable. Note that the proofs are carried
out for the algorithm in steady state where there are enough clouds to cover the desired
domain of the function, the clouds have been stabilized (the movement of the centers
are negligible), and there are enough points in each cloud to form a response surface,
as explained in the previous section. Theorems are provided for both homoscedastic
and heteroscedastic types of noise.
Assume we have

(13)                y^k = θ x^k + ε^k

for k ∈ {1, . . . , K}. Vector θ is the parameter that needs to be estimated. To simplify our analysis we assume the origin of the coordinate system is adjusted such that the intercept is zero. The set D_K = {(x^k, y^k)}, with k ∈ {1, . . . , K}, denotes the collection of sequentially observed data points. The error variables, ε^k, are random.
Remark 1 (Gauss–Markov theorem [18]). The Gauss–Markov theorem states that
in a linear regression model in which the errors have expectation zero and are uncor-
related and have equal variances, the lowest possible MSE (variance) of the linear un-
biased estimator (BLUE) of the coefficients is given by the ordinary least squares esti-
mator. Actual errors need not be normal, nor independent and identically distributed
(only uncorrelated and homoscedastic). That is, E(ε^k) = 0, var(ε^k) = σ² < ∞, and cov(ε^{k1}, ε^{k2}) = 0 for k1 ≠ k2 and k1, k2 ∈ {1, . . . , K}.
The proof of this follows from a contradiction on the mean and variance of another
linear estimator for coefficient θ. An example could be adding another term to the
least squares solution given by (Xᵀ X)⁻¹ Xᵀ. It can be shown that this new estimator is not unbiased and produces a variance that is greater than what is produced by the MSE solution, i.e., σ²(Xᵀ X)⁻¹.


Note that in our work we use the recursive least squares solution on the local
clouds as new data points arrive at that local region.

Lemma 2. For a given DT, if the arrival data points sample an n-dimensional ambient space, the DC-RBF algorithm will produce null output for a cloud cover that does not accumulate Mo = n + 1 affinely independent data points (n is the dimension of the ambient space). Therefore DT > 0.5‖N_{Mo}(x^{k1}) − x^{k2}‖ for x^{k1}, x^{k2} ∈ X, where N_{Mo}(x) denotes the Mo th nearest neighbor of point x.
Proof. According to the construction of the model in the DC-RBF algorithm,
there is a need for n + 1 affinely independent data points to construct a hyperplane
from Rn to R. Therefore if DT is sufficiently small, then the model produces no
output.
Proposition 3. Given DT = diam(X ), assuming the underlying data structure
is a hyperplane (line in dimension 1) with homoscedastic additive noise, the DC-RBF
algorithm is unbiased asymptotically and with a finite number of data points larger
than n + 1. The diameter of domain X is defined as

    diam(X) = max_{1 ≤ k1, k2 ≤ K} ‖x^{k1} − x^{k2}‖.

Proof. Under the hypothesis HIk (null hypothesis at cloud I and iteration k) that
cloud I has a linear structure, the rank one update of the recursive least squares
solution at each iteration given by (5) and (6) provides the best linear unbiased
estimator for the coefficient θ. Note that in this argument the local iteration counter
kI (the kth data point that arrives to cloud I) is the same as the global counter,
kI = k. We offer a proof based on induction. At k = n − 1 the argument is true since
the algorithm saves n − 1 data points to initialize matrix P , and the least squares
error is computed directly. Letting the argument be true for k = m ∈ N, we show
that at iteration k = m + 1, the rank one update solution via the recursive least
squares solution is also unbiased. This follows from the derivation of the recursive
least squares via the Sherman–Morrison formula,

    (A + uvᵀ)⁻¹ = A⁻¹ − (A⁻¹u)(vᵀA⁻¹) / (1 + vᵀA⁻¹u).

Let G_m = X_mᵀ X_m and a_{m+1} = [x^{m+1}, 1]. Then θ_{m+1} = G_{m+1}⁻¹ X_{m+1}ᵀ Y_{m+1} = G_{m+1}⁻¹ (X_mᵀ Y_m + a_{m+1} y^{m+1}). Notice that X_mᵀ Y_m = G_m G_m⁻¹ X_mᵀ Y_m = G_m θ_m = G_{m+1} θ_m − a_{m+1} a_{m+1}ᵀ θ_m. Therefore, θ_{m+1} = θ_m − G_{m+1}⁻¹ a_{m+1} (a_{m+1}ᵀ θ_m − y^{m+1}). Let P_{m+1} = G_{m+1}⁻¹ = (G_m + a_{m+1} a_{m+1}ᵀ)⁻¹. Therefore,

    θ_{m+1} = θ_m + P_{m+1} a_{m+1} (y^{m+1} − a_{m+1}ᵀ θ_m).

Hence, at each iteration the estimate is unbiased.


Asymptotically, when the number of data points becomes very large the same
induction is true. Therefore when the number of local iterations k → ∞ we deduce
that θ^{k+1} = θ^k + P^{k+1} a (y^k − aᵀ θ^k) leads to an unbiased estimator for θ, according
to Remark 1.
Lemma 4. Given DT = diam(X), assuming the underlying data structure is a
hyperplane with heteroscedastic or correlated additive noise, the DC-RBF algorithm
is unbiased asymptotically and with a finite number of data points larger than n + 1.


Proof. The proof follows that of Proposition 3 and incorporates the notion of gen-
eralized least squares. Generalized least squares assumes the conditional variance of Y

given X is a known or estimated matrix Ω. This matrix is used as the weight in com-
puting the minimum squared Mahalanobis length; hence β̂ = (Xᵀ Ω⁻¹ X)⁻¹ Xᵀ Ω⁻¹ Y.
Ω rescales the input data to make it uncorrelated, and then the Gauss–Markov the-
orem is applied. For the case where the variance of the noise changes, the weight
matrix Ω is diagonal, and this is a weighted least squares problem.
We denote the bias generated by the DC-RBF algorithm at each point of the domain of function f(x) as B(x) = (f(x) − E[y|x])². The variance at each point of the domain is V(x) = (1/K)(f(x) − f̄(x))², where f̄(x) is the mean of f(x) over its domain values x.
Proposition 5. If the underlying nonlinear functional structure g(x) is Lipschitz
continuous with Lipschitz constant Lc , φ(.) is uniform, and the observed data points
are drawn from

(14)                y^k = g(x^k) + ε^k,

then Σ_{x∈X} B(x) < Q(Lc, ω1, ω2) = Lc (ω1 − ω2)²/2 + O. Here DT = diam(X) is the chosen distance threshold, ω1 and ω2 determine the intersecting points of the linear estimation and the underlying true structure g(x), and O denotes a residual term.
Proof. We assume that the underlying function g(x) is Lipschitz continuous; i.e.,
there are restrictions on the variations in the underlying structure. This implies
that for every pair of points on the graph of this function, the absolute value of the
slope of the line connecting them is no greater than a definite real number (Lipschitz
constant, Lc(g)). The slope of the secant line passing through x^{k1}, x^{k2} ∈ X is equal to the difference quotient (g(x^{k1}) − g(x^{k2}))/|x^{k1} − x^{k2}| < Lc when x^{k1} ≠ x^{k2}. Without loss of generality
we only consider the portion of the data that has an ascent followed by a descent slope
in its structure. This is achieved by the appropriate choice of DT . By assumption,
φ(.) is uniform weights. First we show that the least squares solution to this structure
intersects at two points, ω1 and ω2 (see Figure 1(a) for a visual presentation). The
proof follows from a contradiction argument.

(a) First order model. (b) Second order model.

Fig. 1. Visual demonstration of the regression lines intersecting the underlying nonlinear func-
tion g(x). In this figure we provide the geometry for the case where the model consists of a single
line, followed by the case where the model consists of two lines.

We denote the residuals of the least squares fit line at iteration k, f^k(x) = a^k x, with the truth function g(x), as e^k(x). Note that the residuals at each point are


discrete samplings of the mapping

e : x ∈ [ωl , ωr ] → ek (x) ∈ R.

We assume that the data used for training is given by XI ∈ UI (index I denotes
a specific cloud, in this case the whole data diameter with I = 1). Note that the
global iteration k is the same as the local index kI . The following analysis is carried
out for a given iteration k. These data are contained in the closed interval x ∈ BI =
[ωl , ωr ] ∈ UI . In addition, we assume that both g and f intersect at the origin of the
coordinate system, i.e., e(ω1) = 0, and that e(x) = 0 at x ∉ BI.
Let a⁺ denote the optimized parameters for the linear estimator f, i.e.,

    a⁺ = argmin_a Σ_{x^l ∈ BI} (a x^l − e(x^l))².

We would like to show that there exists x ∈ [ωl, ωr] such that

    a⁺x − e(x) = 0.

This indicates that the line a⁺x intersects g(x) at another point. If a⁺x − e(x) ≠ 0 for all x ∈ [ωl, ωr], then, without loss of generality, one could assume a⁺x − e(x) > 0 for all x ∈ [ωl, ωr].
So, there must be a smallest distance between the optimal hyperplane a⁺x and the residual curve e(x). This distance is defined as

    α⁺ = min_{x ∈ [ωl, ωr]} (a⁺x − e(x)).

Say this minimum occurs at the point x+ . Note that this point x+ may not actually
correspond to a sampled data point xk .
Now consider another hyperplane a∗ x which is obtained by scaling the optimal
hyperplane so that it produces another point of intersection with e(x), i.e.,

a∗ x − e(x) = 0

for x ≠ 0 and x ∈ BI. By assumption we have a*x = κa⁺x, where 0 < κ < 1, so

    0 < e(x) ≤ a*x < a⁺x

for all x ∈ (ωl, ωr). Hence a*x − e(x) < a⁺x − e(x) and

    Σ_{x^l ∈ [ωl, ωr]} (a* x^l − e(x^l))² < Σ_{x^l ∈ [ωl, ωr]} (a⁺ x^l − e(x^l))²,

which is a contradiction, as we assumed a⁺ was the optimal solution. Thus, the parameters a⁺ could not have been obtained using least squares minimization.
Using the geometry that is produced by the intersection line and the nonlinear geometry, we could contain the residuals within three triangles that are generated by the lines passing through the intersecting points ω1 and ω2 with slope Lc (see Figure 1(a) for a visual presentation). We conclude that

    Σ_{x^l ∈ BI} |f(x^l) − g(x^l)| < Lc ( (ω1 − ω2)²/2 + (ωl − ω1)² + (ωr − ω2)² ).


Note that BI is an interval that contains the region of interest, diam(BI ) > |ω1 − ω2 |,
where ωr and ωl are the right and left ends of the interval BI , respectively. f k is the

line structure that is the least squares solution. We present this proof for a single
dimension; however, the higher dimensional argument follows directly by considering a
ball BI with a diameter that covers the intersecting hyperplane with codimension 2 at
{Ω1 , . . . , Ωn } with g(x). The argument follows the volume that is contained between
the hyperplane of codimension 1 and the polyhedron (tetrahedra in dimension 2) that
contains the nonlinear structure and has slope equal to Lc for all its faces.
Note that in one dimension we have the notion of a line and a tangent line, and
the model output is the weighted sum of piecewise linear approximation to the data.
In higher dimensions we have the notion of a hyperplane with the tangent plane to
make a weighted piecewise linear approximation to the underlying manifold.
Definition 6 (partition of unity [39]). A partition of unity of a topological space
X is a set of continuous functions, {ρi }i∈Δ (Δ is an index set), from X to the unit
interval [0, 1], such that for every point x ∈ X, there is a neighborhood of x where all but a finite number of the functions are 0, and the sum of all the function values at x is 1, i.e., Σ_{i∈Δ} ρ_i(x) = 1.
The notion of partition of unity allows the extension of a local construction to the
whole space. We use the following method to identify our partition of unity. Given
any open cover Ui , i ∈ Δ (Δ is an index set), of a space, there exists a partition ρi ,
i ∈ Δ, such that supp ρi ∈ Ui (supp ρi indicates the support of function ρi ). Such a
partition is said to be subordinate to the open cover Ui . Thus we choose to have the
supports indexed by the open cover.
If functions ρi are compactly supported, given any open cover Ui , i ∈ Δ, of a
space, there exists a partition ρj , j ∈ Λ, indexed over a possibly distinct index set Λ
such that each ρj has compact support, and for each j ∈ Λ, supp ρj ∈ Ui for some
i ∈ Δ.
Theorem 7. Assuming the underlying data structure, g(x), is nonlinear and Lipschitz continuous, and the distance threshold parameter is D_{T1} = diam(X), the DC-RBF algorithm with uniform RBF produces an upper bound on the bias denoted by Q_{DT1} (Q is defined in Proposition 5); however, if D_{T2} = diam(X)/N (N ∈ N is a natural number) with N > 1, with upper bound on the bias Q_{DT2}, then Q_{DT2} < Q_{DT1}.
Proof. As new data points arrive, more of the true underlying geometry reveals
itself. We begin by posing as a null hypothesis that the underlying structure is
described by a line. As the number of observations increases, we test this hypothesis
against alternative hypotheses that the underlying structure is described by a series of
lines over regions. At this stage the DC-RBF algorithm, given DT = diamX , attempts
to model the secant line passing through the curvature (see Figure 1(a) for a visual
presentation). What we mean by the secant line is the regression line passing through
the data points associated to cloud indexed I, where at this stage I = 1. Note that
the linear solution produces bias, with an upper bound provided in Proposition 5.
Similar to Proposition 5, we assume the regression line intersects the underlying
nonlinear structure at points ω1 and ω2 with upper bound on the bias Q. The domain
of interest is BI = [ωl, ωr]. As DT shrinks (without loss of generality, we assume DT = diam(X)/2), the space is broken down into smaller sets (see Figure 1(b) for a visual presentation). For a one-dimensional space, this choice of DT breaks the space down into two parts, hence Δ = {1, 2}. The normalized local RBFs, ρ_i(x) = φ(‖x − c_i‖_{W_i}) / Σ_{j∈Δ} φ(‖x − c_j‖_{W_j}),


with i ∈ Δ, form a partition of unity as provided in Definition 6. This would allow the
expansion of each solution to a larger domain to produce a smooth transition from

one local model to the neighboring models. With this setup, we assume that the two
regression lines l1 and l2 intersect at point ωM . According to Proposition 5, and with
a uniform choice for kernel φ(.), we get the following bounds for the bias on l1 and l2
denoted by Ql1 and Ql2 , respectively:

    Q_{l1} = Lc ( (ω_{l11} − ω_{l12})²/2 + (ωl − ω_{l11})² + (ωM − ω_{l12})² )

and

    Q_{l2} = Lc ( (ω_{l21} − ω_{l22})²/2 + (ωM − ω_{l21})² + (ωr − ω_{l22})² ).
One could simply verify that Ql1 + Ql2 < Q. This would result in reduction of
the upper bound on the bias, hence QDT2 < QDT1 , with the cost of having more
clouds in the cover (more parameters in the model) and a corresponding increase in
the variance of the model, VDT1 < VDT2 .
The argument provided here is for one dimension, but, using similar arguments
made in Proposition 5, could be generalized to higher dimensions.
Variable subset selection and shrinkage are methods that introduce bias and try to
reduce the variance of the estimate. These methods trade a little bias for a larger re-
duction in variance. In practice we choose DT via cross-validation over other available
methods that are developed for this purpose.
Theorem 8. Asymptotically, the DC-RBF algorithm is unbiased for a continuously differentiable underlying nonlinear structure and heteroscedastic noise.
Proof. This theorem follows from Theorem 7 and takes into account that in the
limit DT → 0 and K → ∞. Let

    f_Δ(x) = Σ_{i∈Δ} p_i(x) φ(‖x − c_i‖_{W_i}) / Σ_{i∈Δ} φ(‖x − c_i‖_{W_i}),

where p_i(x) represents a linear model and Δ is an index set. As DT → 0, |Δ| → ∞ (ι = |Δ| denotes the cardinality of Δ); therefore, at x = c_i,

    lim_{ι→∞} f_Δ(x) = ∫ p_i(x) φ(‖x − c_i‖_{W_i}) dx / ∫ φ(‖x − c_i‖_{W_i}) dx.

Since DT → 0 leads to W_i → 0, lim_{W→0} φ(‖x − c_i‖_W) = δ(x − c_i), where δ(.) is the Dirac delta function. We then have lim_{ι→∞} ∫ φ(x − c_i) dx = 1. Note that K → ∞ guarantees that every local ball B(x, ξ) around a given point x in the domain with radius ξ > 0 contains at least n + 1 data points, as suggested by Lemma 2; thus lim_{ι→∞} ∫ p_i(x) φ(‖x − c_i‖_{W_i}) dx = p_i(c_i), where x = c_i.
If the true underlying nonlinear function g ∈ C^∞, then by the Taylor expansion,

    g(x, a) = Σ_{n=0}^{∞} (g^{(n)}(x)/n!) (x − a)^n,

where g(x, a) denotes the approximation of function g(x) around point a and g^{(n)}(x) denotes the nth derivative of the function g at point x. If the underlying function g(x) ∈ C¹, we consider a first order linear approximation to function g(x) at each


point, i.e., g(x) ≈ f (x) = g 1 (x − a)1 . In other words, at each point x, the first order
approximation is a hyperplane. Since in the current implementation of the DC-RBF

algorithm pi (x) is chosen to be a linear function, pi (x) is the tangent line to each point
x. This is true since the limit of the secant line passing through the points xk1 and
xk2 when xk1 → xk2 is a tangent line at point xk1 . When xk1 → xk2 , ||xk1 − xk2 || → 0
and in the limit this secant line forms the tangent at the meeting point xj . As a result
the difference quotient (f(x^{k1}) − f(x^{k2}))/‖x^{k1} − x^{k2}‖ approaches the slope of the tangent line of g(x) at that x^{k2}, i.e.,

    lim_{‖x^{k1} − x^{k2}‖ → 0} (f(x^{k1}) − f(x^{k2}))/‖x^{k1} − x^{k2}‖ = g^1(x^{k2}).

This is true as K → ∞. Up to the first order approximation within a ball B(x, ξ) with ξ > 0, the function is assumed to be linear. According to Remark 1, and considering
the fact that the variation of the noise within B(x, ξ) is negligible, the solution is
BLUE. Since this is true for every point in domain X , the DC-RBF algorithm is
therefore asymptotically unbiased.
5. The empirical results. Here we demonstrate the performance of the algo-
rithm on a variety of synthesized and real data sets. The data sets have distinct
features in terms of input dimension, complexity of the response function, as well as
noise content. Note that the algorithm provides a functional representation of the un-
derlying data. The first three synthetic examples concentrate on the performance and
discussions on DC-RBF algorithm, followed by two real examples with comparisons
to other statistical techniques.
5.1. Synthesized data sets.
One-dimensional newsvendor data set. In the newsvendor problem, a news-
vendor is trying to determine how many units of an item x to stock. The stocking cost
and selling price for the item are c and p, respectively. One could assume an arbitrary
distribution on demand D. The expected profit is given by
F (x) = E [p min(x, D)] − cx.
This problem poses a challenge for online data analysis due to the special behav-
ior in the function around the optimal solution, which is highly dependent on the
characteristics of a data set.
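The observations used in this experiment can be reproduced, up to the random seed, by sampling single realizations of the profit p min(x, D) − cx at randomly chosen stock levels. The sketch below uses the parameter values listed in the caption of Figure 2; the exact sampling scheme of the original experiments is an assumption.

```python
import numpy as np

def newsvendor_samples(n, c=50, p=60, d_low=50, d_high=60, seed=0):
    """Generate noisy (stock level, realized profit) pairs for the newsvendor problem."""
    rng = np.random.default_rng(seed)
    x = rng.integers(20, 81, size=n)                     # stock levels drawn uniformly from {20, ..., 80}
    D = rng.integers(d_low, d_high + 1, size=n)          # uniform integer demand
    profit = p * np.minimum(x, D) - c * x                # one realization of p*min(x, D) - c*x
    return x.astype(float), profit.astype(float)
```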
Figure 2 describes the experimental setup and the model output for two different
distributions on D. Figure 2 shows the training, testing, and model output for each
case. The specification of the algorithm for both problems is identical. The number
and position of training and testing data are the same in both experiments. We employ
our function approximation technique to find the maximum number of units to stock
to maximize expected profit. The MSE value for the first experiment is 107.39 with
a model that has three clouds. The second experiment has resulted in a model with
three clouds with MSE value of 426.63. We observe that the maximum stocking point
is well identified in both experiments. Note that the algorithm learns the underlying
functional behavior of the given data set within the scope that is determined by the
radius of the local balls. The model generalizes well within local regions. Figure 3
shows the performance of the proposed algorithm for different random ordering of
input data and a different distance threshold. The order of the data plays a role in
the initial stage of the model construction; however, the centers of the clouds stabilize
after there are enough data points in a given cloud.

[Figure 2: two panels plotting average profit against units stocked, each showing training data, testing data, and the model output.]

(a) D ∼ (50, 60). (b) D ∼ (30, 70).

Fig. 2. Newsvendor data set generated using E[p min(x, D)] − cx. The training and testing data
sets consist of 24 and 36 data points, respectively. To generate this data set, c = 50, p = 60, and
D is a random uniform integer between 50 and 60 for panel (a) and between 30 and 70 for panel
(b). Inventory stock levels x were sampled from a random uniform integer distribution from 20 to
80. The MSE and MAE of the model output on the test data set in the original scale of the data
are 33847 and 107.39 for panel (a) and 453470 and 426.63 for panel (b), respectively. The model
distance threshold is 15. There are three clouds in each model. The centroids of the clouds in both
models are 29.80, 51, 69.85 with standard deviations of 6.54, 5.71, and 4.87, respectively.

[Figure 3: two panels plotting average profit against units stocked, each showing training data, testing data, and the model output.]

(a) Different input ordering. (b) Different value of DT.

Fig. 3. Performance of DC-RBF algorithm on a different random ordering of input data and distance threshold of 35 for the newsvendor problem shown in Figure 2(a). The centroids of
and distance threshold of 35 for the newsvendor problem shown in Figure 2(a). The centroids of
the clouds in panel (a) are 28.55, 48.28, 68.62 with standard deviations of 5.54, 5.21, and 5.70,
respectively. For panel (b), the centroid of the clouds are 34.71 and 65.80 with standard deviations
of 9.83 and 7.8, respectively.

The underlying function forms a shape that is not in the span of a polynomial
of degree p, in particular for problems where D has low variance (making these more
challenging problems). Therefore parametric methods do not perform well, especially
given the fact that the basis functions of a parametric model must be known a priori.
For the noisy data set, interpolation methods such as cubic splines (as opposed to regression schemes) do not perform well and often result in overfitting the data. In-


terpolation techniques also require more data points to produce an accurate response;
see, e.g., [59]. We observe that, in this experiment, our method outperforms kernel

smoothing regression both in noisy and noise-free data sets. The poor performance
of kernel smoothing in this experiment is due to its local averaging property, which
tends to underfit the function around the local extreme points. Therefore, the pre-
dicted value of the maximum is worse than the true value. In addition, this technique
requires storage of the history of data points and does not provide an analytical form
for the underlying function.
We have tested our algorithm on various other one-dimensional noisy data sets
and have observed a similar conclusion. Therefore we only report on a newsvendor
data set which plays a key role in resource management and produces an interesting
case for data analysis due to the shape around the optimum result. Our method has
the lowest or a very comparable MSE compared to the techniques described in this
section and captures well the local structure of the signal.
An oscillatory data set with varying additive noise. To demonstrate the
ability of the algorithm to discover the underlying nonlinear function while the obser-
vations are highly corrupted with various levels of noise, we have tested the method
on a synthesized data set that is produced by y = sin(πx/4) + 0.1x N(0, 1). Figure 4(a)
shows the noisy and noise-free data sets. The challenge is to find the underlying func-
tion sin(.) from the noisy input signal with large variation in the variance of noise.
Figure 4(b) shows the testing data set and the model output. One could observe that
the proposed algorithm has performed well in recovering the underlying function with
only five clouds in the function expansion. The final MSE is 0.8 and MAE 0.62. The
distance threshold is 3.
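For reference, the heteroscedastic test signal can be generated with a few lines (a sketch; the sample size and interval follow the caption of Figure 4):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 16.0, size=200)                               # 200 points on [0, 16]
y = np.sin(np.pi * x / 4.0) + 0.1 * x * rng.standard_normal(200)   # noise scale grows with x
```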

[Figure 4: two panels plotting y against x on [0, 16]; panel (a) shows the noise-free and noisy data, panel (b) shows the training data, testing data, and model output.]

(a) Actual data. (b) Model output.

Fig. 4. Panel (a) shows a data set of 200 data points in the interval [0, 16] generated with y = sin(πx/4) + 0.1x N(0, 1). Both noise-free and noisy data are shown in this figure. Panel (b)
shows the training and testing data sets, each with 100 points, as well as the model output. In this
graph the data is rescaled using the standard deviation of the whole data. The centroids of the clouds are 1.24, 4.58, 7.93, 11.60, and 14.60, with standard deviations of 0.81, 1.13, 1.06, 1.04, and 0.79, respectively.

Two-dimensional saddle data set. A data set generated from z = x² − y² + N(0, √5), which produces a noisy saddle shape, is used in this study. Figure 5(a) shows
the data points that are used for training and testing. Figure 6 demonstrates four

(a) Training and testing data. (b) Final model.



Fig. 5. Saddle data set generated by z = x^2 − y^2 + N(0, 5). There are 125 training and 100
testing data points over the domain (x, y) ∈ [−3, 3] × [−3, 3]. The final model with four clouds has an MSE
of 4.49 and an MAE of 1.7. The distance threshold is 3.
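
A similar sketch for generating the saddle data and its train/test split follows. Uniform sampling of the domain is an assumption (the text fixes only the counts, the domain, and the generative formula), and the Gaussian noise is taken here with variance 5; whether N(0, 5) denotes the variance or the standard deviation is not settled by the text.

```python
import numpy as np

rng = np.random.default_rng(1)

def saddle_sample(n, noise_var=5.0):
    """Sample n points uniformly on [-3, 3]^2 and evaluate z = x^2 - y^2 plus Gaussian noise."""
    xy = rng.uniform(-3.0, 3.0, size=(n, 2))
    z = xy[:, 0] ** 2 - xy[:, 1] ** 2 + np.sqrt(noise_var) * rng.standard_normal(n)
    return xy, z

X_train, z_train = saddle_sample(125)   # 125 training points, as in Figure 5
X_test, z_test = saddle_sample(100)     # 100 testing points
```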

Figure 6 demonstrates four major snapshots of the model-making process. At first there are not enough points
to form a model. Then, as new data points arrive, the first cloud is formed, as shown
in Figure 6(a). Depending on the spatial location of the arrival points, the first cloud
might get updated or the second cloud might get formed. The same process repeats
until the fourth cloud of the model is shaped, as shown in Figure 6(d). Each panel in this
figure shows the instant when a new cloud is formed with three affinely independent
data points. The third data point that activates the cloud, i.e., the first data point
that is assigned to a cloud that results in construction of a plane over its designated
cloud, is also plotted in each panel. Finally, Figure 5(b) shows the final model after
presenting 125 data points in the presence of the 100 testing points.
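
The assignment logic behind these snapshots can be sketched as follows. This is a simplified illustration only: it keeps one running centroid and one recursive least squares plane per cloud, uses a Euclidean distance test against the threshold DT, and omits the affine-independence check that activates a cloud as well as the normalized-RBF weighting used for prediction; all class and function names here are ours, not part of any published implementation.

```python
import numpy as np

class Cloud:
    """Statistical representative of one region: running centroid plus a local linear model."""
    def __init__(self, x, y):
        x = np.asarray(x, dtype=float)
        self.n = 0
        self.mean = np.zeros_like(x)              # running centroid of the cloud
        self.theta = np.zeros(x.size + 1)         # local plane coefficients [intercept, slopes...]
        self.P = 1e3 * np.eye(x.size + 1)         # recursive least squares "covariance" matrix
        self.absorb(x, y)

    def absorb(self, x, y):
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.mean += (x - self.mean) / self.n     # update the centroid recursively
        phi = np.concatenate(([1.0], x))          # regressor [1, x] for the local plane
        Pphi = self.P @ phi
        gain = Pphi / (1.0 + phi @ Pphi)          # recursive least squares gain
        self.theta += gain * (y - phi @ self.theta)
        self.P -= np.outer(gain, Pphi)

def observe(clouds, x, y, DT):
    """Route a new observation to the nearest cloud within DT, else open a new cloud."""
    x = np.asarray(x, dtype=float)
    if clouds:
        d = [np.linalg.norm(x - c.mean) for c in clouds]
        k = int(np.argmin(d))
        if d[k] <= DT:
            clouds[k].absorb(x, y)
            return clouds
    clouds.append(Cloud(x, y))
    return clouds
```

The point of the sketch is that only the per-cloud statistics (count, centroid, plane coefficients, and the RLS matrix) are retained; the raw observation is discarded after the update, which is what keeps the memory footprint of the approximation fixed.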
We observe that the accuracy of the model is similar to that of LOESS (locally estimated
scatterplot smoothing). However, unlike LOESS, our method does not require the storage
of the data points and has superior speed. In addition, DC-RBF provides an analyt-
ical formulation for the underlying structure. Our test results on other synthesized
surfaces provide the same conclusion. For further elaboration on the LOESS method
and its comparison to DC-RBF, see section 6.

5.2. Real benchmark data sets and comparison results. In this section
we show the performance of the proposed method on two real data sets and compare
results to related methods in the literature in terms of accuracy, speed, batch
vs. online operation, and the requirement of storing previous data points. We select two
data sets with continuous response values. The selected data sets represent regression
challenges on noise heteroscedasticity and moderate dimensionality. We compare our
results to techniques that we find are closest to the spirit of our work.
The list of the benchmark algorithms is as follows:
• Dirichlet process mixtures of generalized linear models (DP-GLM): A Bayesian
nonparametric method that finds a global model of the joint input-response
pair distribution through a mixture of local generalized linear models [26].
• Ordinary least squares (OLS): A parametric method that is widely used for
data fitting problems and often provides a reasonable fit to data. In this
study [1, x]^T has been chosen as the basis functions (x_i, i ∈ {1, . . . , n}, denotes
the ith coordinate).
• CART: This is a nonparametric tree regression method [4]. This method is available in the MATLAB function classregtree.

(a) Arrival of first cloud. (b) Arrival of second cloud.
(c) Arrival of third cloud. (d) Arrival of fourth cloud.

Fig. 6. The model behavior at the starting point of adding a new cloud to the model. The final
model consists of four clouds.

• Bayesian CART: A tree regression model with a prior over tree size [8], im-
plemented in R with the tgp package.
• Bayesian treed linear model: A tree regression model with a prior over tree
size and a linear model in each of the leaves [9], implemented in R with the
tgp package.
• Gaussian processes (GP): A nonparametric method for continuous inputs and
responses [50]. This algorithm is available in MATLAB via the gpml package.
• Treed Gaussian processes: A tree regression model with a prior over tree size
and a GP on each leaf node [23]. This is implemented in R with the tgp
package.
DP-GLM [26] is closest to the spirit of our work, in that it also forms local linear
regressions, although algorithmically they are quite distinct. So we have emphasized
this comparison in this work and used results reported in [26] on the performance of
other competing methods in the literature. DP-GLM was designed for batch appli-
cations and is much slower and more data intensive than the DC-RBF method proposed
in this paper. In comparison to other methods that obtained relatively good results
such as GP and treed GP, GP regression simply uses a locally weighted estimate
which makes no attempt to model the higher order relationships that we are able to
capture with DC-RBF through our local linear models.

Fig. 7. The CMB data set that consists of 899 data points. The training and testing data sets
and the model output for the CMB data set are shown in this figure. The training data set consists
of 500 randomly selected points, and the remaining 499 are used for testing. The centroids of the
clouds are 62.22, 195.01, 647.96, 328.01, 816.23, 425.90, and 530.63, with standard deviations of 38.25,
43.43, 43.98, 33.67, 49.94, 31.94, and 31.19, respectively.

Treed GP simply generalizes GP by capturing low-dimensional interactions between a small subset of variables (3
or fewer), but still produces locally constant estimates. Both of these methods are
designed for very low dimensional settings. DC-RBF can handle higher-dimensional
interactions in the locally linear approximations.
Cosmic microwave background (CMB) [2]. This data set maps positive
integers x = 1, 2, . . . , 899, called "multipole moments," to the power spectrum C_l. It
consists of 899 observations, shown in Figure 7, where 500 randomly selected points
are used as training data, and the remaining 499 are used for testing. This data set is
highly nonlinear and heteroscedastic. These features make this data set an interesting
benchmark choice. The underlying function relates continuous domain values to their
corresponding continuous responses.
Figure 7 also shows the model output. The distance threshold parameter value,
95, is determined using 5-fold cross-validation. The output is a continuous plot of
a parsimonious model with only seven clouds. The MSE and MAE for the testing
data are 0.62 and 0.41, respectively. The result shows that the method is capable of
handling the heteroscedasticity in this data set quite well. The comparison between
the DC-RBF method and the six methods described above is summarized in Table 1.
In terms of error rates our proposed algorithm slightly underperforms on the smaller
data sets compared to the DP-GLM, although with a dramatic reduction in CPU
time. However, it is quite competitive on larger data sets. Note that even though
our proposed algorithm is in the same algorithmic class as DP-GLM, it only uses a
given data point once and does not require storage of the data points. In contrast,
DP-GLM is a batch algorithm and relies on exhaustive computation to compute a
model and requires all the history to rebuild the model once a new data point arrives.
The new method has one adjustable hyperparameter, while the DP-GLM has multiple
hyperparameters to tune.

Table 1
This table summarizes the performance of various regression techniques on the CMB data set.
The MSEs and MAEs on the testing data are reported for different training set sizes. The presented
results are the mean performance of each method over different random partitions of the data
points into training and testing sets. For DC-RBF, the hyperparameter DT is kept fixed
for a given data set size and is identified using a 5-fold cross-validation technique.
The hyperparameters of the other techniques are calculated by sweeping over a set of candidate
parameters that varies by 3 to 5 orders of magnitude, and then the parameter that does best on
the training set is chosen; these numbers are reported from [26].

                      Mean absolute error             Mean square error
Training set size     30    50    100   250   500     30    50    100   250   500

Bayesian CART         0.66  0.64  0.54  0.50  0.47    1.04  1.01  0.93  0.94  0.84
Bayesian TLM          0.64  0.52  0.49  0.48  0.46    1.10  0.95  0.93  0.95  0.85
Gaussian process      0.55  0.53  0.50  0.51  0.47    1.06  0.97  0.93  0.96  0.85
Treed GP              0.52  0.49  0.48  0.48  0.46    1.03  0.95  0.95  0.96  0.89
Linear regression     0.66  0.65  0.63  0.65  0.63    1.08  1.04  1.01  1.04  0.96
CART                  0.62  0.60  0.60  0.56  0.56    1.45  1.34  1.43  1.29  1.41
DP-GLM                0.58  0.51  0.49  0.48  0.45    1.00  0.94  0.91  0.94  0.83
DC-RBF                0.61  0.53  0.49  0.47  0.46    1.40  1.13  0.98  0.91  0.85

The DC-RBF method is orders of magnitude faster, since DP-GLM requires the use of Markov chain Monte Carlo methods to reoptimize the
clustering as each data point is added, whereas DC-RBF updates instantly.
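
The 5-fold cross-validation used here to select the distance threshold can be sketched as below. The `fit` and `predict` callables are placeholders for whatever implementation of the approximation method is at hand (they are not part of any particular library), and the candidate grid is supplied by the user.

```python
import numpy as np

def select_distance_threshold(X, y, candidates, fit, predict, k=5, seed=0):
    """Pick the DT value with the lowest mean held-out MSE over k folds.

    fit(X, y, dt) -> model and predict(model, X) -> predictions are supplied
    by the caller; they stand in for the user's own implementation.
    """
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), k)
    scores = []
    for dt in candidates:
        fold_mse = []
        for i in range(k):
            test = folds[i]
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            model = fit(X[train], y[train], dt)
            fold_mse.append(np.mean((predict(model, X[test]) - y[test]) ** 2))
        scores.append(np.mean(fold_mse))
    return candidates[int(np.argmin(scores))]
```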
Concrete compressive strength (CCS) [62]. The CCS data set provides a
mapping from an eight-dimensional input space to a continuous output. This data
set has low noise and variance. The eight continuous input dimensions are as follows:
cement components, blast furnace slag, fly ash, water, superplasticizer, coarse aggre-
gate, and fine aggregate, all measured in kg per m3 , and the age of the mixture in
days. The response is the compressive strength of the resulting concrete. There are
1,030 observations in this data set. The challenge in working with this data set is to
find a continuous mapping that represents the input-output relationship well in this
multidimensional setting.
Similar to the experiment using the CMB data set, we have summarized the
comparison results for this data set in Table 2. The distance threshold parameter
value, 255, is determined using 5-fold cross-validation. We observe that as the number
of observations increases, the performance of the new method is enhanced due to the
nature of our design, which requires a certain number of data points to activate a
cloud. From Table 2, we see that in terms of error our method remains competitive
with DP-GLM, which is computationally much more demanding.
6. Concluding remarks. We propose a fast, recursive function approximation
method which avoids the need to store the complete history of data (typical of non-
parametric methods) or to specify hyperparameters for Bayesian priors. The
method assigns locally linear parametric models to regions of the covariate space that
are created dynamically based on the domain values of the input data, without the
need for prespecified classification schemes. A weighting scheme is associated with each
locally linear approximation using normalized RBFs. Each local model is updated
recursively with the arrival of each new observation, which is then used to update the
cloud representation before the data are discarded.
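
In formula form, the resulting predictor is a normalized-RBF mixture of local planes, f(x) = Σ_k w_k(x) (θ_k^T [1, x]), where w_k(x) is an RBF centered at the kth cloud centroid and normalized so the weights sum to one. A minimal sketch of this evaluation step follows, assuming Gaussian kernels with per-cloud widths; the exact kernel and width rule of DC-RBF are not fixed by this sketch.

```python
import numpy as np

def evaluate(x, centers, widths, coeffs):
    """Normalized-RBF combination of local linear models.

    centers : (K, d)   array of cloud centroids
    widths  : (K,)     array of kernel widths (e.g., per-cloud spreads)
    coeffs  : (K, d+1) array of local plane coefficients [intercept, slopes...]
    """
    x = np.asarray(x, dtype=float)
    d2 = np.sum((np.asarray(centers) - x) ** 2, axis=1)   # squared distances to centroids
    w = np.exp(-0.5 * d2 / np.asarray(widths) ** 2)
    w /= w.sum()                                          # normalized RBF weights
    phi = np.concatenate(([1.0], x))                      # regressor [1, x]
    return float(w @ (np.asarray(coeffs) @ phi))          # weighted sum of local planes
```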
The new method is robust in the presence of homoscedastic and heteroscedastic noise.

Table 2
The performance table for the CCS data set. The results are reported in terms of MSE
and MAE for various competing methods, including DP-GLM. The experimental setup is the
same as that described for the CMB data set.

                      Mean absolute error             Mean square error
Training set size     30    50    100   250   500     30    50    100   250   500

Bayesian CART         0.78  0.72  0.63  0.55  0.54    0.95  0.80  0.61  0.49  0.46
Bayesian TLM          1.08  0.95  0.60  0.35  1.10    7.85  9.56  4.28  0.26  1,232
Gaussian process      0.53  0.52  0.38  0.31  0.26    0.49  0.45  0.26  0.18  0.14
Treed GP              0.73  0.40  0.47  0.28  0.22    1.40  0.30  3.40  0.20  0.11
Linear regression     0.61  0.56  0.51  0.50  0.50    0.66  0.50  0.43  0.41  0.40
CART                  0.72  0.62  0.52  0.43  0.34    0.87  0.65  0.46  0.33  0.23
DP-GLM                0.54  0.50  0.45  0.42  0.40    0.47  0.41  0.33  0.28  0.27
DC-RBF                0.57  0.54  0.51  0.47  0.39    0.54  0.50  0.44  0.40  0.29

Through the sole parameter DT, our method automatically determines the model order for a given data set (one of the most challenging tasks in nonlinear
function approximation). Unlike similar algorithms in the literature, our method has
only one tunable parameter and is asymptotically unbiased.
Table 3 provides a comparison of different statistical methods that we have con-
sidered in this paper. This table brings together features such as speed, complexity
of implementation, storage requirement, recursivity, the type of noise that can be
handled, and the number of tunable parameters.
Table 3
The benchmark table comparing various statistical function approximation techniques in terms
of overall speed of model updating and model evaluation (Ultra fast, Fast, Very slow), com-
plexity of implementation (Complex, Intermediate, Simple), storage requirement (All the history,
Statistical Representation, None), recursivity (Yes, No), type of noise they can handle (HEt-
eroscedasticity, HOmoscedasticity), and the number of tunable parameters (None, One, Few).

Algorithm            Speed   Complexity   Storage   Recursivity   Noise   TP

DP-GLM               V       C            A         N             HE      F
Bayesian CART        V       C            A         N             HO      F
Bayesian TLM         F       I            A         N             HO      F
Gaussian process     V       C            A         N             HO      F
Treed GP             V       C            A         N             HO      F
Linear regression    U       S            N         Y             HO      N
CART                 F       I            A         N             HO      F
Kernel regression    F       I            A         N             HO      F
Locally linear       F       I            A         Y             HO      F
LOWESS               F       I            A         N             HO      F
DC-RBF               U       S            SR        Y             HE      O

More specifically, in contrast to DP-GLM, DC-RBF offers orders of magnitude
faster updates, does not require as many tunable parameters, avoids its implementation
complexity, and can be updated recursively without the need to store
the data history. DP-GLM was designed for a batch data set; in a dynamic setting,
it requires repeating the entire clustering process (itself an iterative procedure) for
each new data point.
The power of DC-RBF lies in its adaptability to approximate complex surfaces,
compactness in model representation, parsimony in model specification, recursivity,

and speed. Parametric methods will always suffer from the need to tune the choice
of basis functions. Nonparametric methods require the storage of the entire history,
which complicates function computations in stochastic optimization settings (our motivating application). DP-GLM, which fits local polynomial approximations to clus-
ters, is extremely slow and cannot be adapted to a recursive setting. GP models can
work very well, but only in particular problem classes with continuous covariates, and
they also require more tunable parameters. GP models, as well as techniques such as
splines, also introduce the risk of overfitting.
Our experience with Bayesian methods (in particular, in the development of DP-
GLM, for which the second author was a developer) is that they are quite sensitive
to the tuning of the hyperparameters that make up the priors. This introduces an
undesirable degree of human intervention that is likely to limit their usefulness in
black box stochastic optimization algorithms.
DC-RBF appears to satisfy most of our needs for approximating continuous func-
tions in the recursive setting of stochastic optimization. It starts with a linear, para-
metric model and adds clouds only as the range of covariates expands. It can adapt
to quite general surfaces and requires only that we retain the history in the form
of a relatively compact set of clouds. It does not require the specification of basis
functions (we believe the local linear models should be kept very simple) and offers
fast, recursive updating algorithms.
Performance in terms of solution quality is always going to be an evaluation
limited to a specific set of experimental tests. We used the newsvendor setting to
demonstrate the ability of DC-RBF to adapt to both a sharp, nondifferentiable func-
tion (which is difficult to approximate using low order polynomials) and a smooth
function. The method works well, theoretically and experimentally, in the presence
of heteroscedastic noise. Finally, it appears to be competitive against much more
complex algorithms such as DP-GLM.
DC-RBF does introduce the need to tune a distance threshold parameter DT .
This is not a minor requirement, as the choice of DT captures the overall behavior
of the surface. We suspect the choice of DT will generally become apparent within
any specific problem class. We also doubt that this method would be effective for
high-dimensional applications (say, more than 10 or 20 covariates).
We believe that there are opportunities to build on this basic idea to overcome
the need to specify DT . A natural extension is to specify a dictionary of possible
values of DT (possibly on a log scale), which can be viewed as models at different
levels of granularity. We might then estimate a family of approximations, one for each
DT , which can then be combined using a hierarchical weighting scheme such as that
proposed in [21].
We have shown numerical results on synthetic and real data which demonstrate
the success of the algorithm for multidimensional function approximation. Unlike
a model generated by DP-GLM, which remains biased toward the early observations
(a form of modeling that may not serve well in applications that provide updates to
existing information, such as ADP), DC-RBF can adapt very quickly to new infor-
mation. This type of approximation tool is especially useful for value function approxi-
mation in the context of approximate dynamic programming, where a stream of exoge-
nous information contributes to the system dynamics. We intend to use the approxi-
mation technique for value function approximation in the context of approximate dy-
namic programming [47] and optimal learning [48] for energy resource allocation under
uncertainty. In future studies we plan to enhance the algorithm to adapt the resolu-
tion of DT as more complicated patterns are presented to the algorithm over time.


REFERENCES

[1] C. Andrieu, N. Freitas, and A. Doucet, Robust full Bayesian learning for radial basis
networks, Neural Comput., 13 (2001), pp. 2359–2407.


[2] L. Bennett, M. Halpern, G. Hinshaw, N. Jarosik, A. Kogut, M. Limon, S. S. Meyer,
L. Page, D. N. Spergel, G. S. Tucker, E. Wollack, E. L. Wright, C. Barnes,
M. R. Greason, R. S. Hill, E. Komatsu, M. R. Nolta, N. Odegard, H. V. Peiris,
L. Verde, and J. L. Weiland, First-year Wilkinson microwave anisotropy probe (WMAP)
1 observations: Preliminary maps and basic results, Astrophys. J. Suppl. Ser., 148 (2003),
pp. 1–27.
[3] C. M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, Oxford, UK,
1995.
[4] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression
Trees, Chapman & Hall/CRC, New York, 1984.
[5] D. S. Broomhead and D. Lowe, Multivariable functional interpolation and adaptive networks,
Complex Systems, 2 (1988), pp. 321–355.
[6] G. Bugmann, Normalized Gaussian radial basis function networks, Neurocomput., 20 (1998),
pp. 97–110.
[7] M. D. Buhmann, Radial Basis Functions, Cambridge University Press, Cambridge, UK, 2003.
[8] H. A. Chipman, E. I. George, and R. E. McCulloch, Bayesian CART model search, J.
Amer. Statist. Assoc., 93 (1998), pp. 935–948.
[9] H. A. Chipman, E. I. George, and R. E. McCulloch, Bayesian treed models, Machine
Learning, 48 (2002), pp. 299–320.
[10] E. K. P. Chong, An Introduction to Optimization, Wiley, 2008.
[11] W. S. Cleveland, Robust locally weighted regression and smoothing scatterplots, J. Amer.
Statist. Assoc., 74 (1979), pp. 829–836.
[12] W. S. Cleveland, LOWESS: A program for smoothing scatterplots by robust locally weighted
regression, Amer. Statistician, 35 (1981), p. 54.
[13] W. S. Cleveland and S. J. Devlin, Locally weighted regression: An approach to regression
analysis by local fitting, J. Amer. Statist. Assoc., 83 (1988), pp. 596–610.
[14] D. A. Cohn, Z. Ghahramani, and M. I. Jordan, Active learning with statistical models, J.
Artificial Intelligence, 4 (1996), pp. 129–145.
[15] R. L. Eubank, Spline Smoothing and Nonparametric Regression, Marcel Dekker, New York,
1988.
[16] J. Fan, Design-adaptive nonparametric regression, J. Amer. Statist. Assoc., 87 (1992), pp. 998–
1004.
[17] J. Fan and I. Gijbels, Local Polynomial Modelling and Its Applications, Chapman & Hall/
CRC Monographs on Statistics & Applied Probability, London, 1996.
[18] C. F. Gauss, Theoria Combinationis Observationum Erroribus Minimis Obnoxiae, Gottingae,
1825.
[19] S. Geisser, Predictive Inference: An Introduction, Chapman & Hall/CRC, New York, 1993.
[20] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin, Bayesian Data Analysis, 2nd ed.,
Chapman & Hall/CRC, Boca Raton, FL, 2003.
[21] A. George, W. B. Powell, and S. Kulkarni, Value function approximation using multi-
ple aggregation for multiattribute resource management, J. Mach. Learn. Res., 9 (2008),
pp. 2079–2111.
[22] F. Girosi, M. Jones, and T. Poggio, Regularization theory and neural network architectures,
Neural Comput., 7 (1995), pp. 219–269.
[23] R. B. Gramacy and H. K. H. Lee, Bayesian treed Gaussian process models with an application
to computer modeling, J. Amer. Statist. Assoc., 103 (2008), pp. 1119–1130.
[24] I. D. Guedalia, M. London, and M. Werman, An on-line agglomerative clustering method
for nonstationary data, Neural Comput., 11 (1999), pp. 521–540.
[25] H.-M. Gutmann, A radial basis function method for global optimization, J. Global Optim., 19
(2001), pp. 201–227.
[26] L. A. Hannah, D. M. Blei, and W. B. Powell, Dirichlet process mixtures of generalized
linear models, J. Mach. Learn. Res., 12 (2011), pp. 1923–1953.
[27] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer,
New York, 2009.
[28] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed., Prentice Hall, Upper
Saddle River, NJ, 1999.
[29] C. C. Holmes and B. K. Mallick, Bayesian radial basis functions of variable dimension,
Neural Comput., 10 (1998), pp. 1217–1233.

[30] T. Ishikawa and M. Matsunami, An optimization method based on radial basis function, IEEE
Trans. Magnetics, 33 (1997), pp. 1868–1871.
[31] A. A. Jamshidi and M. J. Kirby, Towards a black box algorithm for nonlinear function
approximation over high-dimensional domains, SIAM J. Sci. Comput., 29 (2007), pp. 941–
963, doi:10.1137/050646457.
[32] A. A. Jamshidi and M. J. Kirby, Skew-radial basis function expansions for empirical modeling,
SIAM J. Sci. Comput., 31 (2010), pp. 4715–4743, doi:10.1137/08072293X.
[33] A. A. Jamshidi and M. J. Kirby, Modeling multivariate time series on manifolds with skew
radial basis functions, Neural Comput., 23 (2011), pp. 97–123.
[34] R. D. Jones, Y. C. Lee, C. W. Barnes, G. W. Flake, K. Lee, P. S. Lewis, and S. Qian,
Function approximation and time series prediction with neural networks, in Proceedings
of the 1990 International Joint Conference on Neural Networks (IJCNN), Vol. 1, IEEE,
1990, pp. 649–665.
[35] D. E. Knuth, Art of Computer Programming. Vol. 2. Seminumerical Algorithms, 3rd ed.,
Addison-Wesley, Reading, MA, 1998.
[36] D. C. Montgomery, E. A. Peck, and G. G. Vining, Introduction to Linear Regression
Analysis, 3rd ed., Wiley-Interscience, New York, 2001.
[37] J. Moody and C. Darken, Fast learning in networks of locally-tuned processing units, Neural
Comput., 1 (1989), pp. 281–294.
[38] H. G. Müller, Weighted local regression and kernel methods for nonparametric curve fitting,
J. Amer. Statist. Assoc., 82 (1988), pp. 231–238.
[39] J. Munkres, Topology, 2nd ed., Prentice Hall, Upper Saddle River, NJ, 2000.
[40] S. Morales-Enciso and J. Branke, Tracking global optima in dynamic environments with
efficient global optimization, European J. Oper. Res., 242 (2015), pp. 744–755.
[41] E. A. Nadaraya, On estimating regression, Theory Probab. Appl., 9 (1964), pp. 141–142.
[42] J. Park and I. W. Sandberg, Universal approximation using radial-basis-function networks,
Neural Comput., 3 (1991), pp. 246–257.
[43] J. Park and I. W. Sandberg, Approximation and radial-basis-function networks, Neural Com-
put., 5 (1993), pp. 305–316.
[44] T. Poggio and F. Girosi, Regularization algorithm for learning that are equivalent to multi-
layer networks, Science, 247 (1990), pp. 978–982.
[45] M. J. D. Powell, Radial basis functions for multivariable interpolation: A review, in Algo-
rithms for Approximation, J. C. Mason and M. G. Cox, eds., Clarendon Press, Oxford,
1987, pp. 143–167.
[46] M. J. D. Powell, The theory of radial basis function approximation in 1990, in Advances
in Numerical Analysis, Vol. II, W. Light, ed., Oxford University Press, New York, 1992,
pp. 105–210.
[47] W. B. Powell, Approximate Dynamic Programming, Wiley, 2011.
[48] W. B. Powell and I. O. Ryzhov, Optimal Learning, Wiley, Hoboken, NJ, 2012.
[49] C. E. Rasmussen and Z. Ghahramani, Infinite mixtures of Gaussian process experts, in Ad-
vances in Neural Information Processing Systems 14, MIT Press, Cambridge, MA, 2001,
pp. 881–888.
[50] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning, MIT
Press, Cambridge, MA, 2006.
[51] R. G. Regis and C. A. Shoemaker, A stochastic radial basis function method for the global
optimization of expensive functions, INFORMS J. Comput., 19 (2007), pp. 497–509.
[52] R. G. Regis and C. A. Shoemaker, Improved strategies for radial basis function methods for
global optimization, J. Global Optim., 37 (2007), pp. 113–135.
[53] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning internal representations by
error propagation, in Parallel Distributed Processing, D. E. Rumelhart and J. L. McClel-
land, eds., 1986, pp. 318–362.
[54] D. Ruppert, M. P. Wand, and R. J. Carroll, Semiparametric Regression, Cambridge Series
in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, UK,
2003.
[55] B. Shahbaba and R. M. Neal, Nonlinear models using Dirichlet process mixtures, J. Mach.
Learning Res., 10 (2009), pp. 1829–1850.
[56] J. S. Simonoff, Smoothing Methods in Statistics, Springer, New York, 1996.
[57] Q. Song and N. Kasabov, ECM: A novel on-line, evolving clustering method and its applica-
tions, in Foundations of Cognitive Science, MIT Press, Cambridge, UK, 2001, pp. 631–682.
[58] A. N. Tikhonov and V. Y. Arsenin, Solutions of Ill-Posed Problems, John Wiley & Sons,
New York, 1977.
[59] G. Wahba, Spline bases, regularization, and generalized cross validation for solving approxi-
mation problems with large quantities of data, in Approximation Theory III, W. Cheney,
ed., Academic Press, 1980, pp. 905–912.
[60] G. S. Watson, Smooth regression analysis, Sankhya Ser. A, 26 (1964), pp. 359–372.
[61] P. J. Werbos, Beyond Regression: New Tools for Prediction and Analysis in the Behavioral
Sciences, Ph.D. dissertation, Division of Applied Mathematics, Harvard University, Boston,
MA, 1974.
[62] I. C. Yeh, Modeling of strength of high-performance concrete using artificial neural networks,
Cement Concrete Res., 28 (1998), pp. 1797–1808.
