A Recursive Local Polynomial Approximation Method Using Dirichlet Clouds and Radial Basis Functions
© 2016 Society for Industrial and Applied Mathematics
Vol. 38, No. 4, pp. B619–B644
Abstract. We present a recursive function approximation technique that does not require the
storage of the arriving data stream. Our work is motivated by algorithms in stochastic optimization
that require approximating functions in a recursive setting, such as stochastic approximation
algorithms. The unique combination of features in this technique is essential for nonlinear modeling
of large data sets, where storing the data becomes prohibitively expensive, and in circumstances
where our knowledge about a given query point increases as new information arrives. The algorithm
presented here employs radial basis functions (RBFs) to provide locally adaptive parametric models
(such as linear models). The local models are updated using recursive least squares, and only a
statistical representation of the local approximations is stored. The resulting scheme is very fast and
memory efficient without compromising accuracy in comparison with standard methods and with
advanced techniques used for functional data analysis in the literature. We motivate the
algorithm using synthetic data and illustrate the algorithm on several real data sets.
Key words. radial basis functions, function approximation, local polynomials, data fitting
DOI. 10.1137/15M1008592
∗ Submitted to the journal’s Computational Methods in Science and Engineering section Febru-
ary 17, 2015; accepted for publication (in revised form) April 28, 2016; published electronically
August 2, 2016. This work was partially supported by grant FA9550-08-1-0195 from the Air Force
Office of Scientific Research. Any opinions, findings, and conclusions or recommendations expressed
in this material are those of the authors and do not necessarily reflect the views of the Air Force
Office of Scientific Research.
http://www.siam.org/journals/sisc/38-4/M100859.html
† School of Electrical and Computer Engineering, University of Tehran, Tehran, Iran (arta.
‡ Princeton University, Princeton, NJ 08544 ([email protected]).
and as a result are quite slow. One of the main attractions of RBFs is that the
resulting optimization problem can be broken efficiently into linear and nonlinear
2. Normalized radial basis functions. RBFs are powerful tools for function
approximation [5, 7, 46]. Over the years RBFs have been used successfully for solving
a wide range of function approximation problems (see, e.g., [3]). RBFs have also been
used for global optimization (see, e.g., [51, 52, 25, 30]). An RBF expansion is a linear
summation of special nonlinear basis functions. In general, an RBF is a mapping
f : R^n \to R that is represented by

(1)   f(x) = \sum_{i=1}^{N_c} \alpha_i\, \phi(\|x - c_i\|_{W_i}),

where x is an input pattern, φ is the RBF centered at location c_i, and α_i denotes the
weight for the ith RBF. N_c denotes the total number of RBFs. The term W_i, which
is a symmetric positive definite matrix, contains the parameters of the weighted inner
product

\|x\|_W = \sqrt{x^T W x}.
Note that throughout this work Wi is diagonal, as explained in section 3.1. The
dimensions of the input n and output m are specified by the dimensions of the input-
output pairs. Universal approximation properties of RBFs are established in [42, 43].
Normalized RBFs of the following form were proposed by Moody and Darken [37]:
(2)   f(x) = \frac{\sum_{i=1}^{N_c} \alpha_i\, \phi(\|x - c_i\|_{W_i})}{\sum_{i=1}^{N_c} \phi(\|x - c_i\|_{W_i})}.
Normalized RBFs appear to have advantages over regular RBFs, especially in the
domain of pattern classification; see, e.g., [6]. In that work it is reported that using
normalized RBFs reduced the order of the model and improved the robustness of generalization.
It has also been reported that normalized RBFs require less data when training models
of dynamical systems [34].
The polynomial modulation of the normalized RBF expansion leads to the expansion

(3)   f(x) = \frac{\sum_{i=1}^{N_c} p_i(x)\, \phi(\|x - c_i\|_{W_i})}{\sum_{i=1}^{N_c} \phi(\|x - c_i\|_{W_i})},
where p_i(x) is assumed to be a low order polynomial such as a linear function (p_i(x) = \alpha_i^T x + \beta_i).
Note that the denominator does not become zero for N_c > 1 and distinct c_i; this is a
property of the RBF φ. Let ψ contain all the parameters, \alpha_i, \beta_i, c_i, W_i, N_c,
that are used to define f(x) as shown in (3). We make the dependence on these unknown
parameters explicit by writing f(x, ψ); later we use this notation when optimizing the parameters
contained in ψ.
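For concreteness, the following Python sketch evaluates the expansion (3) for the Gaussian kernel used later in this paper and linear responses p_i(x) = α_i^T x + β_i; the array layout and the function name are illustrative choices, not part of the original formulation.

```python
import numpy as np

def normalized_rbf_eval(x, centers, W_diags, alphas, betas):
    """Evaluate the polynomially modulated normalized RBF expansion (3).

    x       : (n,)     query point
    centers : (Nc, n)  kernel centers c_i
    W_diags : (Nc, n)  diagonals of the weight matrices W_i
    alphas  : (Nc, n)  slopes of the linear responses p_i
    betas   : (Nc,)    intercepts of the linear responses p_i
    """
    x = np.asarray(x, float)
    d = x - centers                              # displacements x - c_i
    r2 = np.sum(W_diags * d**2, axis=1)          # squared weighted norms ||x - c_i||_{W_i}^2
    phi = np.exp(-r2)                            # Gaussian kernel values
    responses = alphas @ x + betas               # p_i(x) = alpha_i^T x + beta_i
    return phi @ responses / phi.sum()           # normalized, polynomially modulated output
```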
The above formulation leads us to the idea of normalized RBFs that have poly-
nomial response terms. In what follows we provide a fast algorithm that recursively
updates the response terms and the weights associated to them as new data points
arrive. During the training phase the unknown parameters in the model need to be
calculated, including the number of RBFs, Nc , in the function expansion.
3. Online learning scheme. The goal of our proposed algorithm is to find a
mapping f from x ∈ Rn , a vector in n-dimensional space, to R such that y k = f (xk )
as the input-output pair {(xk , y k )} becomes available. We expect to have a total of
K arrival data points, where k ∈ {1, . . . , K}, with X = \{x^k\}_{k=1}^{K} and Y = \{y^k\}_{k=1}^{K}
representing the domain and range observable values. We assume that data arrive as
a stream. Throughout the process we would like to find the parameters associated
with the model provided in (3) to find an accurate model for the data. The observed
data might be noisy, and we would like to have a model that has good generalization
ability. In this regression framework, the problem is to minimize the cost function
E(\psi) = \frac{1}{2} \sum_{k=1}^{K} \left\| f(x^k, \psi) - y^k \right\|^2,

given the available data points. In this work, the Euclidean inner product is used for
the metric \| \cdot \|, and we use the Gaussian kernel

\phi(r) = \exp(-r^2).
Note that other local kernels could be used. For a comprehensive review of RBF
kernels and recently developed skew and compactly supported expansions, see [32]. In
what follows the description of an algorithm in this framework is presented that does
not require the storage of the data stream and locally approximates the underlying
functional behavior of the data. The response model parameters are updated quickly
using recursive least squares, and the weights associated to each linear response are
updated via statistical techniques. Note that the model order is also determined in
the algorithm.
3.1. A data driven space cover. The algorithm proposed here works by form-
ing a cover for the domain of the underlying function f upon arrival of new data points,
xk . We define a cover
C = \{U_i : i \in \Delta\},

where \{U_i\} is a family of sets indexed by the elements of the set Δ. C is a
cover of X if X \subseteq \bigcup_{i \in \Delta} U_i. On the arrival of the first data point, x^1, a cover element
or cloud U1 is formed with the center on this point and with a distance threshold of
DT to form a ball around x1 . When the system receives a new data point x2 with
associated y 2 value, if the data point is within distance DT of x1 , x2 is assigned to
the same cloud as x1 , and the cloud’s centroid and variance are updated; otherwise
a new cloud is created. New data points xk are assigned to clouds using the distance
metric
(4)   D_i = \|x^k - c_i\|,
the surface landscape). The data stored in the model are the number of points
in each cloud, K_i, the centroid of each cloud, c_i, and the variance in each dimension
of the data in each cloud, W_i (the diagonal of the sample covariance matrix). In
addition the response coefficients for each cloud are also stored. Note that the model
can be evaluated at any point of its domain after adapting the model using the new
information. If the geometry of the data is more complex, there will be a need for
more clouds in the cover to capture this behavior. The proposed space cover is a part
of our function approximation technique. We have developed this independently. In
this procedure, the clouds form, move, and could dissipate (if there are not enough
data, a possibility studied in future work) to model the topology of the underlying
data. Note that the data is not a priori sorted or organized in any way, and the cloud
centers must be learned adaptively from the data. To the best of our understanding
there are some similarities to clustering procedures such as those found in [57, 24].
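A minimal sketch of the cloud bookkeeping just described, assuming only the per-cloud summaries listed above (point count, centroid, running variance) are kept; the container and function names are illustrative.

```python
import numpy as np

class Cloud:
    """Statistical summary of one cover element U_i; no raw data is retained."""
    def __init__(self, x):
        self.k = 1                       # number of points assigned to the cloud
        self.c = np.array(x, float)      # centroid
        self.S = np.zeros_like(self.c)   # running sum used by the Welford variance update

def assign_or_spawn(x, clouds, DT):
    """Assign x to the nearest cloud if it lies within DT, otherwise spawn a new cloud.
    Returns the index of the cloud that receives x."""
    x = np.asarray(x, float)
    if clouds:
        dists = [np.linalg.norm(x - cl.c) for cl in clouds]   # D_i = ||x - c_i||, eq. (4)
        I = int(np.argmin(dists))
        if dists[I] < DT:
            return I
    clouds.append(Cloud(x))
    return len(clouds) - 1
```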
3.2. Recursive update of the model. There are two parts of the model that
are updated during training: the linear response of the local model associated with cloud
I, and its weights.
Recursive update for response. We solve a local least squares problem of the
form \min_{\theta_I} \|X_I^{k_I} \theta_I - Y_I^{k_I}\|_2^2, where X_I^{k_I} is a matrix of the form

X_I^{k_I} = \begin{bmatrix} 1 & x^1 \\ \vdots & \vdots \\ 1 & x^{k_I} \end{bmatrix},
where the vectors x^1, . . . , x^{k_I} are domain values. The vector Y_I^{k_I} contains all the
associated range values, Y_I^{k_I} = [y^1, . . . , y^{k_I}]. The vector \theta_I^T = [\beta_I, \alpha_1^I, . . . , \alpha_n^I] contains
all the parameters for the linear response. The direct solution to this system is
\theta_I = (X_I^{k_I\,T} X_I^{k_I})^{-1} X_I^{k_I\,T} Y_I^{k_I}. After receiving n + 1 affinely independent data points
in a local neighborhood (we refer to this state of the model as the no-knowledge state),
the matrix X_I^{k_I} is formed with k_I = n + 1. The recursive least squares update [10] is
used to compute the linear response model parameters. For k_I = n + 1, we initialize
the recursion with P_I^{k_I} = (X_I^{k_I\,T} X_I^{k_I})^{-1}. When a new data point (x^k, y^k) is assigned
to this local neighborhood, with k_I > n + 1, let a^T = [1, x^k].
The recursion formulas are then

(5)   P_I^{k_I+1} = P_I^{k_I} - \frac{P_I^{k_I} a a^T P_I^{k_I}}{1 + a^T P_I^{k_I} a},

(6)   \theta_I^{k_I+1} = \theta_I^{k_I} + P_I^{k_I+1} a \left( y^k - a^T \theta_I^{k_I} \right),
where kI is the recursion index for the cloud indexed I. These equations are easily
modified to include a discount factor which puts more weight on recent observations.
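In code, the rank one updates (5) and (6) take roughly the following form; an optional discount factor λ ≤ 1 implements the weighting of recent observations mentioned above, and all names are illustrative.

```python
import numpy as np

def rls_update(P, theta, x, y, lam=1.0):
    """One recursive least squares step for a cloud's linear response.

    P     : (n+1, n+1) current matrix P_I^{k_I}
    theta : (n+1,)     current coefficients [beta_I, alpha_I]
    x, y  : new input point assigned to the cloud and its response
    lam   : discount factor; lam < 1 puts more weight on recent observations
    """
    a = np.concatenate(([1.0], np.asarray(x, float)))        # regressor a^T = [1, x^k]
    Pa = P @ a
    P_new = (P - np.outer(Pa, Pa) / (lam + a @ Pa)) / lam    # eq. (5), with discounting
    theta_new = theta + P_new @ a * (y - a @ theta)          # eq. (6): correct by the prediction error
    return P_new, theta_new
```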
Recursive update for the weights. When a new data point (xk , y k ) is assigned
to the local neighborhood indexed I, the weights WI associated with cloud UI is also
updated. For this purpose the number of data points that have been assigned to this
local ball is updated as k_I = k_I + 1. The centers of the kernels are updated via the
recursion

(7)   c_I^{k_I} = c_I^{k_I - 1} + \frac{x^k - c_I^{k_I - 1}}{k_I}.
The width or scale of the model in each dimension of the cloud I is updated using
the Welford formula [35]. Initialize S_1 = 0. For each additional data point x^k assigned
to cloud I, the running sum and the per-dimension variance are updated as

(8)   S_{k_I} = S_{k_I-1} + (x^k - c_I^{k_I-1}) \odot (x^k - c_I^{k_I}),

(9)   W_I = \operatorname{diag}\!\left( S_{k_I} / (k_I - 1) \right),

where \odot and the division are taken componentwise.
The weighted inner product in the argument of the RBF function (as shown in
(3)) is calculated from this recursive update of the standard deviation of data in each
dimension. Note that in this study the weight matrices are diagonal. For cases where
the standard deviation of the data along a specific dimension is zero, we introduce a
penalty term to avoid singularity by replacing Wi with Wi + P , where P is a specified
constant. The required storage is proportional to Nc (3n + 1).
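A sketch of the recursive centroid and scale updates (7)–(9) for a single cloud, continuing the illustrative Cloud container from the sketch in section 3.1; applying the penalty P per dimension is one possible reading of the zero-variance safeguard described above.

```python
def update_cloud_statistics(cloud, x, penalty=1e-6):
    """Recursively update the centroid (7) and the per-dimension scale (8)-(9) of a cloud.
    x is a NumPy array holding the new point already assigned to this cloud."""
    cloud.k += 1
    c_old = cloud.c.copy()
    cloud.c = c_old + (x - c_old) / cloud.k           # eq. (7): running mean of the assigned points
    cloud.S = cloud.S + (x - c_old) * (x - cloud.c)   # eq. (8): Welford running sum per dimension
    var = cloud.S / (cloud.k - 1)                     # eq. (9): per-dimension sample variance
    cloud.W = var + penalty * (var == 0)              # add the penalty P where the variance is zero
    return cloud
```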
Ideally, we assume that the arrival data are scattered. At initial stages of model
construction it is best to have data that are sampled randomly from the input space.
The order of the data plays a role in the initial stage of the placement of the centroids;
however, the centers of the clouds stabilize once there are enough data points in a
given cloud, since from (7)

\|c_I^{k_I} - c_I^{k_I-1}\| = \frac{\|x^k - c_I^{k_I-1}\|}{k_I} < \frac{D_T}{k_I}.

As the number of data points k_I in a cloud increases, D_T / k_I becomes smaller. In the
case where the data points arrive in one direction, a new cloud is created after a
finite number of data points.
3.3. Model evaluation. This section describes how to compute the approximation
f(x) at a query point x. As shown in (3), the output of the proposed model
is the weighted average of the linear responses over the local clouds Ui associated to
the cover C. In this study a Gaussian kernel, φ(r) = exp(−r2 ), determines the weight
of each response function. The widths, Wi , and the centroids, ci , of the kernels,
i ∈ {1, . . . , Nc }, were computed recursively as mentioned in section 3.2. The model
output is formulated as
(10)   f(x) = \frac{\sum_{i=1}^{N_c} \phi(\|x - c_i\|_{W_i}) \left( [1, x]\, \theta_i \right)}{\sum_{i=1}^{N_c} \phi(\|x - c_i\|_{W_i})}.
Nc is the total number of clouds. Note that θi is the least squares solution of the
response over the ith local cloud. Note that if a cloud does not accumulate n + 1
affinely independent data points, the models built in other clouds are evaluated at a
query in this region.
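A sketch of the evaluation step (10) using the stored per-cloud quantities; only clouds whose linear response has been activated are assumed to be passed in, and the names are illustrative.

```python
import numpy as np

def dcrbf_predict(x, centroids, W_diags, thetas):
    """Evaluate the DC-RBF model (10) at a query point x.

    centroids : (Nc, n)   cloud centroids c_i
    W_diags   : (Nc, n)   diagonal kernel weights of the clouds
    thetas    : (Nc, n+1) response coefficients theta_i = [beta_i, alpha_i]
    """
    x = np.asarray(x, float)
    d = x - centroids
    phi = np.exp(-np.sum(W_diags * d**2, axis=1))   # Gaussian kernel weights per cloud
    a = np.concatenate(([1.0], x))                  # [1, x]
    responses = thetas @ a                          # [1, x] theta_i for each cloud
    return phi @ responses / phi.sum()              # weighted average of local responses, eq. (10)
```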
A summary of the above procedure is presented in Algorithm 1.
3.4. Goodness of fit and stopping criteria. In data modeling one of the
major goals is to model data in such a way that the model does not over- or underfit
the data that is not used for training but is generated from the same process as the
training data. One general approach to finding such a smooth model for the observed
data is known as regularization. Regularization describes the process of fitting a
smooth function through the data set using a modified optimization problem that
penalizes variation [58]. A standard technique to achieve regularization is via cross-
validation [22, 59]. Such methods involve partitioning the data into subsets of training,
validation, and testing data; for details see, e.g., [28].
Algorithm 1. The DC-RBF online learning scheme.
while a new data point (x^k, y^k) arrives do
    if k = 1 then
        c_1 = x^1, N_c = 1, k_1 = 1, Δ = {1}; form cloud U_1
    else
        compute D_i = \|x^k - c_i\| according to (4) for all i ∈ Δ
        compute I^* = arg min_i D_i and let I = min I^*
        if D_I < D_T then
            update cloud U_I, k_I = k_I + 1
            update the cloud centroid, c_I, using (7)
            update the scale of the cloud, W_I, using (8) and (9); if the standard deviation of
            the data is zero along a dimension, use W_I + P, where P is a specified constant
            if k_I ≥ n + 1 and the points are affinely independent then
                update the response parameters, θ_I, using (5) and (6)
            else
                store x^k assigned to cloud I
            end if
        else
            spawn a new cloud U_{N_c+1}: N_c = N_c + 1, k_{N_c} = 1, Δ = {1, . . . , N_c}
        end if
    end if
    k = k + 1
end while
evaluate the model as needed using (10)
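The sketch below ties the earlier pieces together into one online update in the spirit of Algorithm 1, reusing the illustrative helpers assign_or_spawn, update_cloud_statistics, and rls_update from sections 3.1 and 3.2; for brevity, the affine-independence test is replaced by a pseudoinverse.

```python
import numpy as np

def dcrbf_observe(x, y, clouds, DT, n):
    """Process one arriving pair (x^k, y^k), updating the cloud cover in place."""
    x = np.asarray(x, float)
    n_before = len(clouds)
    I = assign_or_spawn(x, clouds, DT)           # section 3.1: nearest cloud, or spawn a new one
    cloud = clouds[I]
    if len(clouds) == n_before:                  # x joined an existing cloud
        update_cloud_statistics(cloud, x)        # eqs. (7)-(9)
    if not hasattr(cloud, "P"):                  # no-knowledge state: collect n + 1 points first
        cloud.buffer = getattr(cloud, "buffer", []) + [(x, y)]
        if len(cloud.buffer) == n + 1:           # direct least squares to initialize P and theta
            X = np.array([np.concatenate(([1.0], xi)) for xi, _ in cloud.buffer])
            Y = np.array([yi for _, yi in cloud.buffer])
            cloud.P = np.linalg.pinv(X.T @ X)    # pseudoinverse stands in for the rank check
            cloud.theta = cloud.P @ X.T @ Y
            cloud.buffer = []                    # raw points can now be discarded
    else:
        cloud.P, cloud.theta = rls_update(cloud.P, cloud.theta, x, y)   # eqs. (5)-(6)
```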
To determine the tunable parameter of the model proposed in this work, i.e., the distance threshold, D_T, we
use t-fold cross-validation techniques [19].
In this scheme, the original data set is randomly partitioned into t subsample sets.
One of the t subsample sets is chosen as the validation data for testing the model,
and the remaining subsample sets are used as training data. This process is repeated
t times with each of the t subsample sets used only once as the validation data. All
the t results are averaged to provide an overall estimate of the error.
The advantage of this method over repeated random subsampling is that all ob-
servations are used for both training and validation, and each observation is used for
validation exactly once [27]. We have chosen to use 5-fold cross-validation. We record
the accuracy of the model on the testing data set. This procedure is repeated for all
five folds, and the mean squared error (MSE) on the test data sets is computed at
each time using
(11)   MSE = \frac{1}{L} \sum_{l=1}^{L} \left( f(x^l) - y^l \right)^2,
where L is the size of the testing data set. Overfitting is reduced by tuning the
parameter, D_T, to minimize the MSE calculated through cross-validation, as opposed to
minimizing the MSE of points within the training data set [28].
To increase the number of estimates, one could run the above t-fold cross-valida-
tion multiple times. For this purpose the data needs to be repartitioned each time.
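A sketch of the t-fold cross-validation loop used to select D_T; train_dcrbf and predict_dcrbf are assumed wrappers around the update and evaluation sketches given earlier, and X, Y are arrays of inputs and responses.

```python
import numpy as np

def cross_validate_DT(X, Y, DT_candidates, t=5, seed=0):
    """Choose the distance threshold with the lowest average validation MSE over t folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), t)
    scores = {}
    for DT in DT_candidates:
        errs = []
        for f in range(t):
            val = folds[f]
            train = np.concatenate([folds[j] for j in range(t) if j != f])
            model = train_dcrbf(X[train], Y[train], DT)                  # assumed training wrapper
            pred = np.array([predict_dcrbf(model, x) for x in X[val]])   # assumed evaluation wrapper
            errs.append(np.mean((pred - Y[val]) ** 2))                   # validation MSE, eq. (11)
        scores[DT] = np.mean(errs)                                       # average over the t folds
    return min(scores, key=scores.get)
```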
To determine the predictive ability of the models, we use several measures to show
the accuracy of the predictions. The R^2 value is calculated through the following formula:

R^2 = 1 - \frac{\sum_{l=1}^{L} \left( y^l - f(x^l) \right)^2}{\sum_{l=1}^{L} \left( y^l - \bar{y} \right)^2},

where \bar{y} = \frac{1}{T_L} \sum_{l=1}^{T_L} y^l is the mean of the y^l over the training set, L is the size of
the testing data set, and T_L is the size of the training set. The numerator measures the
variation of the model output with respect to the observed values, and the denominator measures
the variation of the data with respect to the mean of the outcomes. This measure shows how well the
model performs in comparison to the case where the mean value is used as the
predictor. An R^2 value of 1 indicates an exactly fitted model, while a value of 0
indicates a model that adds no predictive power. We have also used the notion of
mean absolute error (MAE) to record the performance of the model. The MAE is
defined as follows:
(12)   MAE = \frac{1}{L} \sum_{l=1}^{L} \left| f(x^l) - y^l \right|.
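A small helper computing the error measures (11), (12), and the R^2 score used in this section; the baseline mean ȳ is passed explicitly so it can be computed over whichever data set is preferred.

```python
import numpy as np

def fit_metrics(y_pred, y_true, y_bar=None):
    """Return (MSE, MAE, R2) on a test set; see (11), (12), and the R^2 formula above."""
    y_pred = np.asarray(y_pred, float)
    y_true = np.asarray(y_true, float)
    mse = np.mean((y_pred - y_true) ** 2)        # eq. (11)
    mae = np.mean(np.abs(y_pred - y_true))       # eq. (12)
    if y_bar is None:
        y_bar = y_true.mean()                    # default baseline: mean of the test responses
    r2 = 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_bar) ** 2)
    return mse, mae, r2
```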
If the distance threshold is set too high, the proposed method will fit a single
hyperplane to the entire data set. In practice one could determine D_T by observing
a portion of the data and then use this value for the rest of the input stream.
4. The convergence properties of the DC-RBF algorithm. This section
describes facts about the DC-RBF algorithm. Theorems describe the finite data and
asymptotic behavior of this algorithm. We assume scattered data arrive and the
underlying structure is continuously differentiable. Note that the proofs are carried
out for the algorithm in steady state where there are enough clouds to cover the desired
domain of the function, the clouds have stabilized (the movement of the centers
is negligible), and there are enough points in each cloud to form a response surface,
as explained in the previous section. Theorems are provided for both homoscedastic
and heteroscedastic types of noise.
Assume we have

(13)   y^k = \theta^T x^k + \epsilon^k

for k ∈ {1, . . . , K}. The vector θ is the parameter that needs to be estimated. To simplify
our analysis we assume the origin of the coordinate system is adjusted such that the
intercept is zero. The set D_K = \{(x^k, y^k)\}, with k ∈ {1, . . . , K}, denotes the collection
of sequentially observed data points. The error variables, \epsilon^k, are random.
Remark 1 (Gauss–Markov theorem [18]). The Gauss–Markov theorem states that
in a linear regression model in which the errors have expectation zero, are uncorrelated,
and have equal variances, the linear unbiased estimator with the lowest possible variance
(the best linear unbiased estimator, BLUE) of the coefficients is given by the ordinary least
squares estimator. The actual errors need not be normal, nor independent and identically
distributed (only uncorrelated and homoscedastic). That is, E(\epsilon^k) = 0, var(\epsilon^k) = \sigma^2 < \infty, and
cov(\epsilon^{k_1}, \epsilon^{k_2}) = 0 for k_1 \neq k_2 and k_1, k_2 ∈ \{1, . . . , K\}.
The proof follows from a contradiction on the mean and variance of another
linear estimator for the coefficient θ, for example one obtained by adding another term to the
least squares solution given by (X^T X)^{-1} X^T. It can be shown that such a new estimator
is either biased or produces a variance that is greater than the variance of the least
squares solution, \sigma^2 (X^T X)^{-1}.
Note that in our work we use the recursive least squares solution on the local
clouds as new data points arrive at that local region.
Proof. Under the hypothesis HIk (null hypothesis at cloud I and iteration k) that
cloud I has a linear structure, the rank one update of the recursive least squares
solution at each iteration given by (5) and (6) provides the best linear unbiased
estimator for the coefficient θ. Note that in this argument the local iteration counter
kI (the kth data point that arrives to cloud I) is the same as the global counter,
kI = k. We offer a proof based on induction. At k = n + 1 the argument is true since
the algorithm saves n + 1 data points to initialize the matrix P, and the least squares
solution is computed directly. Letting the argument be true for k = m ∈ N, we show
that at iteration k = m + 1 the rank one update obtained via the recursive least
squares solution is also unbiased. This follows from the derivation of the recursive
least squares update via the Sherman–Morrison formula,

(A + a a^T)^{-1} = A^{-1} - \frac{A^{-1} a a^T A^{-1}}{1 + a^T A^{-1} a},

applied with A = X_I^{k_I\,T} X_I^{k_I}, which yields the rank one update (5).
Proof. The proof follows that of Proposition 3 and incorporates the notion of generalized
least squares. Generalized least squares assumes the conditional variance of Y
given X is a known or estimated matrix Ω. This matrix is used as the weight in computing
the minimum squared Mahalanobis length; hence \hat{\beta} = (X^T \Omega^{-1} X)^{-1} X^T \Omega^{-1} Y.
Ω rescales the input data to make it uncorrelated, and then the Gauss–Markov theorem
is applied. For the case where the variance of the noise changes, the weight
matrix Ω is diagonal, and this is a weighted least squares problem.
We denote the bias generated by the DC-RBF algorithm at each point of the
domain of the function f(x) as B(x) = (f(x) - E[y|x])^2. The variance at each point
of the domain is V(x) = \frac{1}{K} (f(x) - \bar{f}(x))^2, where \bar{f}(x) is the mean of f(x) over its
domain values x.
Proposition 5. If the underlying nonlinear functional structure g(x) is Lipschitz
continuous with Lipschitz constant L_c, φ(·) is uniform, and the observed data points
are drawn from

(14)   y^k = g(x^k) + \epsilon^k,

then \sum_{x \in \mathcal{X}} B(x) < Q(L_c, \omega_1, \omega_2) = L_c \frac{(\omega_1 - \omega_2)^2}{2} + O. Here D_T = diam(\mathcal{X}) is the
chosen distance threshold, \omega_1 and \omega_2 determine the intersecting points of the linear
estimation and the underlying true structure g(x), and O denotes a residual term.
Proof. We assume that the underlying function g(x) is Lipschitz continuous; i.e.,
there are restrictions on the variations in the underlying structure. This implies
that for every pair of points on the graph of this function, the absolute value of the
slope of the line connecting them is no greater than a definite real number (Lipschitz
constant, L_c(g)). The slope of the secant line passing through x^{k_1}, x^{k_2} ∈ \mathcal{X} is equal to
the difference quotient

\frac{g(x^{k_1}) - g(x^{k_2})}{|x^{k_1} - x^{k_2}|} < L_c \quad \text{when } x^{k_1} \neq x^{k_2}.

Without loss of generality we only consider the portion of the data that has an ascent followed by a descent slope
in its structure. This is achieved by an appropriate choice of D_T. By assumption,
φ(·) provides uniform weights. First we show that the least squares solution to this structure
intersects at two points, ω1 and ω2 (see Figure 1(a) for a visual presentation). The
proof follows from a contradiction argument.
Fig. 1. Visual demonstration of the regression lines intersecting the underlying nonlinear func-
tion g(x). In this figure we provide the geometry for the case where the model consists of a single
line, followed by the case where the model consists of two lines.
e : x ∈ [ωl , ωr ] → ek (x) ∈ R.
We assume that the data used for training is given by XI ∈ UI (index I denotes
a specific cloud, in this case the whole data diameter with I = 1). Note that the
global iteration k is the same as the local index kI . The following analysis is carried
out for a given iteration k. These data are contained in the closed interval x ∈ BI =
[\omega_l, \omega_r] \subseteq U_I. In addition, we assume that both g and f intersect at the origin of the
coordinate system, i.e., e(\omega_1) = 0, and that e(x) = 0 for x \notin B_I.
Let a^+ denote the optimized parameters for the linear estimator f, i.e.,

a^+ = \arg\min_a \sum_{x^l \in B_I} \left( a x^l - e(x^l) \right)^2.
a+ x − e(x) = 0.
This indicates that the line a^+ x intersects g(x) at another point. If a^+ x - e(x) \neq 0
for all x ∈ [\omega_l, \omega_r], then, without loss of generality, one could assume a^+ x - e(x) > 0
for all x ∈ [\omega_l, \omega_r].
So, there must be a smallest distance between the optimal hyperplane a+ x and
the residual curve e(x). This distance is defined as
\alpha^+ = \min_{x \in [\omega_l, \omega_r]} \left( a^+ x - e(x) \right).
Say this minimum occurs at the point x+ . Note that this point x+ may not actually
correspond to a sampled data point xk .
Now consider another hyperplane a∗ x which is obtained by scaling the optimal
hyperplane so that it produces another point of intersection with e(x), i.e.,
a^* x - e(x) = 0

for some x ∈ B_I, while reducing the sum of squared residuals \sum_{x^l \in B_I} \left( a^* x^l - e(x^l) \right)^2
below that of a^+, which contradicts the optimality of a^+.
Note that BI is an interval that contains the region of interest, diam(BI ) > |ω1 − ω2 |,
where ωr and ωl are the right and left ends of the interval BI , respectively. f k is the
line structure that is the least squares solution. We present this proof for a single
dimension; however, the higher dimensional argument follows directly by considering a
ball BI with a diameter that covers the intersecting hyperplane with codimension 2 at
{Ω1 , . . . , Ωn } with g(x). The argument follows the volume that is contained between
the hyperplane of codimension 1 and the polyhedron (tetrahedra in dimension 2) that
contains the nonlinear structure and has slope equal to Lc for all its faces.
Note that in one dimension we have the notion of a line and a tangent line, and
the model output is the weighted sum of piecewise linear approximation to the data.
In higher dimensions we have the notion of a hyperplane with the tangent plane to
make a weighted piecewise linear approximation to the underlying manifold.
Definition 6 (partition of unity [39]). A partition of unity of a topological space
X is a set of continuous functions, {ρi }i∈Δ (Δ is an index set), from X to the unit
interval [0, 1], such that for every point x ∈ X there is a neighborhood of x where all
but a finite number of the functions are 0, and the sum of all the function values at x
is 1, i.e., \sum_{i \in \Delta} \rho_i(x) = 1.
The notion of partition of unity allows the extension of a local construction to the
whole space. We use the following method to identify our partition of unity. Given
any open cover U_i, i ∈ Δ (Δ is an index set), of a space, there exists a partition \rho_i,
i ∈ Δ, such that supp \rho_i \subseteq U_i (supp \rho_i indicates the support of the function \rho_i). Such a
partition is said to be subordinate to the open cover Ui . Thus we choose to have the
supports indexed by the open cover.
If the functions \rho_i are compactly supported, then given any open cover U_i, i ∈ Δ, of a
space, there exists a partition \rho_j, j ∈ Λ, indexed over a possibly distinct index set Λ,
such that each \rho_j has compact support and, for each j ∈ Λ, supp \rho_j \subseteq U_i for some
i ∈ Δ.
Theorem 7. Assume the underlying data structure, g(x), is nonlinear and Lipschitz
continuous, and the distance threshold parameter is D_{T_1} = diam \mathcal{X}. Then the DC-RBF
algorithm with uniform RBF produces an upper bound on the bias denoted by Q_{D_{T_1}} (Q
is defined in Proposition 5); however, if D_{T_2} = diam \mathcal{X} / N with N ∈ N, N > 1, and the
corresponding upper bound on the bias is Q_{D_{T_2}}, then Q_{D_{T_2}} < Q_{D_{T_1}}.
Proof. As new data points arrive, more of the true underlying geometry reveals
itself. We begin by posing as a null hypothesis that the underlying structure is
described by a line. As the number of observations increases, we test this hypothesis
against alternative hypotheses that the underlying structure is described by a series of
lines over regions. At this stage the DC-RBF algorithm, given DT = diamX , attempts
to model the secant line passing through the curvature (see Figure 1(a) for a visual
presentation). What we mean by the secant line is the regression line passing through
the data points associated to cloud indexed I, where at this stage I = 1. Note that
the linear solution produces bias, with an upper bound provided in Proposition 5.
Similar to Proposition 5, we assume the regression line intersects the underlying
nonlinear structure at points ω1 and ω2 with upper bound on the bias Q. The domain
of interest is BI = [ωl , ωr ]. As DT shrinks (without loss of generality, we assume
D_T = diam \mathcal{X} / 2), the space is broken down into smaller sets (see Figure 1(b) for a visual
presentation). For a one-dimensional space, this choice of D_T breaks the space down
into two parts, hence Δ = {1, 2}. The local RBFs φ(r), with

\sum_{i \in \Delta} \frac{\phi(\|x - c_i\|_{W_i})}{\sum_{j \in \Delta} \phi(\|x - c_j\|_{W_j})} = 1,

form a partition of unity as provided in Definition 6. This would allow the
expansion of each solution to a larger domain to produce a smooth transition from
one local model to the neighboring models. With this setup, we assume that the two
regression lines l1 and l2 intersect at point ωM . According to Proposition 5, and with
a uniform choice for kernel φ(.), we get the following bounds for the bias on l1 and l2
denoted by Ql1 and Ql2 , respectively:
Q_{l_1} = L_c \frac{(\omega^{l_1}_1 - \omega^{l_1}_2)^2}{2} + (\omega_l - \omega^{l_1}_1)^2 + (\omega_M - \omega^{l_1}_2)^2

and

Q_{l_2} = L_c \frac{(\omega^{l_2}_1 - \omega^{l_2}_2)^2}{2} + (\omega_M - \omega^{l_2}_1)^2 + (\omega_r - \omega^{l_2}_2)^2.
One could simply verify that Ql1 + Ql2 < Q. This would result in reduction of
the upper bound on the bias, hence QDT2 < QDT1 , with the cost of having more
clouds in the cover (more parameters in the model) and a corresponding increase in
the variance of the model, VDT1 < VDT2 .
The argument provided here is for one dimension but, using arguments similar to those
made in Proposition 5, it can be generalized to higher dimensions.
Variable subset selection and shrinkage are methods that introduce bias and try to
reduce the variance of the estimate. These methods trade a little bias for a larger re-
duction in variance. In practice we choose DT via cross-validation over other available
methods that are developed for this purpose.
Theorem 8. Asymptotically, the DC-RBF algorithm is unbiased for a continuously
differentiable underlying nonlinear structure and heteroscedastic noise.
Proof. This theorem follows from Theorem 7, considering the limits D_T → 0 and K → ∞. Let

f_\Delta(x) = \frac{\sum_{i \in \Delta} p_i(x)\, \phi(\|x - c_i\|_{W_i})}{\sum_{i \in \Delta} \phi(\|x - c_i\|_{W_i})},

where g(x, a) denotes the approximation of the function g(x) around the point a and g^{(n)}(x)
denotes the nth derivative of the function g at the point x. If the underlying function
g(x) ∈ C^1, we consider a first order linear approximation to the function g(x) at each
point, i.e., g(x) ≈ f(x) = g^{(1)}(a)(x - a). In other words, at each point x, the first order
approximation is a hyperplane. Since in the current implementation of the DC-RBF
algorithm p_i(x) is chosen to be a linear function, p_i(x) is the tangent line to each point
x. This is true since the limit of the secant line passing through the points xk1 and
xk2 when xk1 → xk2 is a tangent line at point xk1 . When xk1 → xk2 , ||xk1 − xk2 || → 0
and in the limit this secant line forms the tangent at the meeting point. As a result
the difference quotient approaches the slope of the tangent line of g(x) at x^{k_2}, i.e.,

\lim_{\|x^{k_1} - x^{k_2}\| \to 0} \frac{f(x^{k_1}) - f(x^{k_2})}{\|x^{k_1} - x^{k_2}\|} = g^{(1)}(x^{k_2}).
This is true as K → ∞. Up to the first order approximation within a ball B(x, ξ) with
ξ > 0, the function is assumed to be linear. According to Remark 1, and considering
the fact that the variation of the noise within B(x, ξ) is negligible, the solution is
BLUE. Since this is true for every point in domain X , the DC-RBF algorithm is
therefore asymptotically unbiased.
5. The empirical results. Here we demonstrate the performance of the algo-
rithm on a variety of synthesized and real data sets. The data sets have distinct
features in terms of input dimension, complexity of the response function, as well as
noise content. Note that the algorithm provides a functional representation of the un-
derlying data. The first three synthetic examples concentrate on the performance of the
DC-RBF algorithm, followed by two real examples with comparisons to other statistical
techniques.
5.1. Synthesized data sets.
One-dimensional newsvendor data set. In the newsvendor problem, a news-
vendor is trying to determine how many units of an item x to stock. The stocking cost
and selling price for the item are c and p, respectively. One could assume an arbitrary
distribution on demand D. The expected profit is given by
F (x) = E [p min(x, D)] − cx.
This problem poses a challenge for online data analysis due to the special behav-
ior in the function around the optimal solution, which is highly dependent on the
characteristics of a data set.
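A sketch of how the newsvendor samples and the exact expected profit F(x) = E[p min(x, D)] − cx can be generated for a discrete uniform demand; the constants follow the setup reported in Figure 2, and the function names are illustrative.

```python
import numpy as np

def newsvendor_samples(rng, num, c=50, p=60, d_low=50, d_high=60, x_low=20, x_high=80):
    """Draw (stock level, realized profit) pairs with profit = p*min(x, D) - c*x."""
    x = rng.integers(x_low, x_high + 1, size=num)        # inventory levels sampled uniformly
    D = rng.integers(d_low, d_high + 1, size=num)        # uniform integer demand
    return x, p * np.minimum(x, D) - c * x

def expected_profit(x, c=50, p=60, d_low=50, d_high=60):
    """Exact F(x) = E[p*min(x, D)] - c*x for D uniform on {d_low, ..., d_high}."""
    D = np.arange(d_low, d_high + 1)
    return p * np.minimum(x, D).mean() - c * x

rng = np.random.default_rng(0)
x_train, y_train = newsvendor_samples(rng, 24)           # 24 training pairs, as in Figure 2(a)
```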
Figure 2 describes the experimental setup and the model output for two different
distributions on D. Figure 2 shows the training, testing, and model output for each
case. The specification of the algorithm for both problems is identical. The number
and position of training and testing data are the same in both experiments. We employ
our function approximation technique to find the maximum number of units to stock
to maximize expected profit. The MSE value for the first experiment is 107.39 with
a model that has three clouds. The second experiment has resulted in a model with
three clouds with MSE value of 426.63. We observe that the maximum stocking point
is well identified in both experiments. Note that the algorithm learns the underlying
functional behavior of the given data set within the scope that is determined by the
radius of the local balls. The model generalizes well within local regions. Figure 3
shows the performance of the proposed algorithm for different random ordering of
input data and a different distance threshold. The order of the data plays a role in
the initial stage of the model construction; however, the centers of the clouds stabilize
after there are enough data points in a given cloud.
Fig. 2. Newsvendor data set generated using [p min(x, D)] − cx. The training and testing data
sets consist of 24 and 36 data points, respectively. To generate this data set, c = 50, p = 60, and
D is a random uniform integer between 50 and 60 for panel (a) and between 30 and 70 for panel
(b). Inventory stock levels x were sampled from a random uniform integer distribution from 20 to
80. The MSE and MAE of the model output on the test data set in the original scale of the data
are 33847 and 107.39 for panel (a) and 453470 and 426.63 for panel (b), respectively. The model
distance threshold is 15. There are three clouds in each model. The centroids of the clouds in both
models are 29.80, 51, 69.85 with standard deviations of 6.54, 5.71, and 4.87, respectively.
[Figure 3: average profit versus units stocked for different random orderings of the input data and a different distance threshold.]
The underlying function forms a shape that is not in the span of a polynomial
of degree p, in particular for problems where D has low variance (making these more
challenging problems). Therefore parametric methods do not perform well, especially
given the fact that the basis functions of a parametric model must be known a priori.
For the noisy data set, interpolation methods such as cubic splines (unlike the
regression schemes) do not perform well and often result in overfitting the data. In-
terpolation techniques also require more data points to produce an accurate response;
see, e.g., [59]. We observe that, in this experiment, our method outperforms kernel
smoothing regression both in noisy and noise-free data sets. The poor performance
of kernel smoothing in this experiment is due to its local averaging property, which
tends to underfit the function around the local extreme points. Therefore, the pre-
dicted value of the maximum is worse than the true value. In addition, this technique
requires storage of the history of data points and does not provide an analytical form
for the underlying function.
We have tested our algorithm on various other one-dimensional noisy data sets
and have observed a similar conclusion. Therefore we only report on a newsvendor
data set which plays a key role in resource management and produces an interesting
case for data analysis due to the shape around the optimum result. Our method has
the lowest or a very comparable MSE compared to the techniques described in this
section and captures well the local structure of the signal.
An oscillatory data set with varying additive noise. To demonstrate the
ability of the algorithm to discover the underlying nonlinear function while the obser-
vations are highly corrupted with various levels of noise, we have tested the method
on a synthesized data set that is produced by y = sin(πx/4) + 0.1 x N(0, 1). Figure 4(a)
shows the noisy and noise-free data sets. The challenge is to find the underlying func-
tion sin(.) from the noisy input signal with large variation in the variance of noise.
Figure 4(b) shows the testing data set and the model output. One could observe that
the proposed algorithm has performed well in recovering the underlying function with
only five clouds in the function expansion. The final MSE is 0.8 and MAE 0.62. The
distance threshold is 3.
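A sketch generating the oscillatory data set y = sin(πx/4) + 0.1 x N(0, 1) with the 100/100 split used in Figure 4; the random seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 16.0, size=200)                  # 200 points on [0, 16]
y_clean = np.sin(np.pi * x / 4.0)                     # underlying function
y = y_clean + 0.1 * x * rng.standard_normal(200)      # noise variance grows with x

perm = rng.permutation(200)                           # 100 training and 100 testing points
x_train, y_train = x[perm[:100]], y[perm[:100]]
x_test, y_test = x[perm[100:]], y[perm[100:]]
```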
Fig. 4. Panel (a) shows a data set of 200 data points in the interval [0, 16] generated with
y = sin(πx/4) + 0.1 x N(0, 1). Both noise-free and noisy data are shown in this figure. Panel (b)
shows the training and testing data sets, each with 100 points, as well as the model output. In this
graph the data are rescaled using the standard deviation of the whole data set. The centroids of the clouds
are 1.24, 4.58, 7.93, 11.60, and 14.60, with standard deviations of 0.81, 1.13, 1.06, 1.04, and 0.79,
respectively.
Two-dimensional saddle data set. A data set generated from z = x^2 - y^2 + N(0, \sqrt{5}),
which produces a noisy saddle shape, is used in this study. Figure 5(a) shows
the data points that are used for training and testing. Figure 6 demonstrates four
[Figure 5: panel (a) shows the noisy saddle data used for training and testing; panel (b) shows the final model.]
major snapshots of the model-making process. At first there are not enough points
to form a model. Then, as new data points arrive, the first cloud is formed, as shown
in Figure 6(a). Depending on the spatial location of the arrival points, the first cloud
might get updated or the second cloud might get formed. The same process repeats
until the fourth cloud of the model is formed, as shown in Figure 6(d). Each panel in this
figure shows the incident when a new cloud is formed with three affinely independent
data points. The third data point that activates the cloud, i.e., the first data point
that is assigned to a cloud that results in construction of a plane over its designated
cloud, is also plotted in each panel. Finally, Figure 5(b) shows the final model, after
presenting 125 data points, plotted together with the 100 testing points.
We observe that the accuracy of the model is similar to LOESS (locally weighted
scatterplot smoothing). However, unlike LOESS our method does not require the storage
of the data points and has superior speed. In addition, DC-RBF provides an analyt-
ical formulation for the underlying structure. Our test results on other synthesized
surfaces provide the same conclusion. For further elaboration on the LOESS method
and its comparison to DC-RBF, see section 6.
5.2. Real benchmark data sets and comparison results. In this section
we show the performance of the proposed method on two real data sets and compare
results to the related methods in the literature in terms of accuracy, speed, batch
vs. online, and the requirement of storing the previous data points. We select two
data sets with continuous response values. The selected data sets represent regression
challenges on noise heteroscedasticity and moderate dimensionality. We compare our
results to techniques that we find are closest to the spirit of our work.
The list of the benchmark algorithms is as follows:
• Dirichlet process mixtures of generalized linear models (DP-GLM): A Bayesian
nonparametric method that finds a global model of the joint input-response
pair distribution through a mixture of local generalized linear models [26].
• Ordinary least squares (OLS): A parametric method that is widely used for
data fitting problems and often provides a reasonable fit to the data. In this
study [1, x]^T has been chosen as the vector of basis functions (x_i, i ∈ {1, . . . , n},
denotes the ith coordinate).
• CART: This is a nonparametric tree regression method [4]. This method is
Fig. 6. The model behavior at the starting point of adding a new cloud to the model. The final
model consists of four clouds.
Fig. 7. The CMB data set that consists of 899 data points. The training and testing data sets
and the model output for the CMB data set are shown in this figure. The training data set consists
of 500 randomly selected points, and the remaining 499 are used for testing. The centroids of the
clouds are 62.22, 195.01, 647.96, 328.01, 816.23, 425.90, and 530.63, with standard deviations of 38.25,
43.43, 43.98, 33.67, 49.94, 31.94, and 31.19, respectively.
results are the mean of the performances for various methods for different permutations of data
points selected for training and testing data sets. For DC-RBF, the hyperparameter DT is kept fixed
for a given data set size. The hyperparameter is identified using a 5-fold cross-validation technique.
The hyperparameters of the other techniques are calculated by sweeping over a set of candidate
parameters that varies by 3 to 5 orders of magnitude, and then the parameter that does the best on
the training set is chosen; these numbers are reported from [26].
DP-GLM requires the use of Markov chain Monte Carlo methods to reoptimize the
clustering as each data point is added, whereas DC-RBF updates instantly.
Concrete compressive strength (CCS) [62]. The CCS data set provides a
mapping from an eight-dimensional input space to a continuous output. This data
set has low noise and variance. The eight continuous input dimensions are as follows:
cement components, blast furnace slag, fly ash, water, superplasticizer, coarse aggre-
gate, and fine aggregate, all measured in kg per m3 , and the age of the mixture in
days. The response is the compressive strength of the resulting concrete. There are
1,030 observations in this data set. The challenge in working with this data set is to
find a continuous mapping that represents the input-output relationship well in this
multidimensional data.
Similar to the experiment using the CMB data set, we have summarized the
comparison results for this data set in Table 2. The distance threshold parameter
value, 255, is determined using 5-fold cross-validation. We observe that as the number
of observations increases, the performance of the new method is enhanced due to the
nature of our design, which requires a certain number of data points to activate a
cloud. From Table 2, we see that in terms of error our method remains competitive
with DP-GLM, which is computationally much more demanding.
6. Concluding remarks. We propose a fast, recursive, function approximation
method which avoids the need to store the complete history of data (typical of non-
parametric methods) or the specification of hyperparameters for Bayesian priors. The
method assigns locally linear parametric models for regions of the covariate space that
are created dynamically based on the domain values of the input data without the
need for prespecified classification schemes. A weighting scheme is associated with each
locally linear approximation using normalized RBFs. Each local model is updated
recursively with the arrival of each new observation, which is then used to update the
cloud representation before the data are discarded.
The new method is robust in the presence of homoscedastic and heteroscedastic
noise. Through the sole parameter DT , our method automatically determines the
model order for a given data set (one of the most challenging tasks in nonlinear
function approximation). Unlike similar algorithms in the literature, our method has
only one tunable parameter and is asymptotically unbiased.

Table 2
The performance table for the CCS data. In this table the results are reported in terms of MSE
and MAE for various competing methods, including DP-GLM. The experimental setup here is the
Table 3 provides a comparison of different statistical methods that we have con-
sidered in this paper. This table brings together features such as speed, complexity
of implementation, storage requirement, recursivity, the type of noise that can be
handled, and the number of tunable parameters.
Table 3
The benchmark table comparing various statistical function approximation techniques in terms
of overall speed of model updating and model evaluation (Ultra fast, Fast, Very slow), complexity
of implementation (Complex, Intermediate, Simple), storage requirement (All the history,
Statistical representation, None), recursivity (Yes, No), the type of noise they can handle (HEteroscedastic,
HOmoscedastic), and the number of tunable parameters (None, One, Few).
and speed. Parametric methods will always suffer from the need to tune the choice
of basis functions. Nonparametric methods require the storage of the entire history,
REFERENCES
[1] C. Andrieu, N. Freitas, and A. Doucet, Robust full Bayesian learning for radial basis
[30] T. Ishikawa and M. Matsunami, An optimization method based on radial basis function, IEEE
Trans. Magnetics, 33 (1997), pp. 1868–1871.
[31] A. A. Jamshidi and M. J. Kirby, Towards a black box algorithm for nonlinear function
approximation over high-dimensional domains, SIAM J. Sci. Comput., 29 (2007), pp. 941–
963, doi:10.1137/050646457.
[32] A. A. Jamshidi and M. J. Kirby, Skew-radial basis function expansions for empirical modeling,
SIAM J. Sci. Comput., 31 (2010), pp. 4715–4743, doi:10.1137/08072293X.
[33] A. A. Jamshidi and M. J. Kirby, Modeling multivariate time series on manifolds with skew
radial basis functions, Neural Comput., 23 (2011), pp. 97–123.
[34] R. D. Jones, Y. C. Lee, C. W. Barnes, G. W. Flake, K. Lee, P. S. Lewis, and S. Qian,
Function approximation and time series prediction with neural networks, in Proceedings
of the 1990 International Joint Conference on Neural Networks (IJCNN), Vol. 1, IEEE,
1990, pp. 649–665.
[35] D. E. Knuth, Art of Computer Programming. Vol. 2. Seminumerical Algorithms, 3rd ed.,
Addison-Wesley, Reading, MA, 1998.
[36] D. C. Montgomery, E. A. Peck, and G. G. Vining, Introduction to Linear Regression
Analysis, 3rd ed., Wiley-Interscience, New York, 2001.
[37] J. Moody and C. Darken, Fast learning in networks of locally-tuned processing units, Neural
Comput., 1 (1989), pp. 281–294.
[38] H. G. Müller, Weighted local regression and kernel methods for nonparametric curve fitting,
J. Amer. Statist. Assoc., 82 (1988), pp. 231–238.
[39] J. Munkres, Topology, 2nd ed., Prentice Hall, Upper Saddle River, NJ, 2000.
[40] S. Morales-Enciso and J. Branke, Tracking global optima in dynamic environments with
efficient global optimization, European J. Oper. Res., 242 (2015), pp. 744–755.
[41] E. A. Nadaraya, On estimating regression, Theory Probab. Appl., 9 (1964), pp. 141–142.
[42] J. Park and I. W. Sandberg, Universal approximation using radial-basis-function networks,
Neural Comput., 3 (1991), pp. 246–257.
[43] J. Park and I. W. Sandberg, Approximation and radial-basis-function networks, Neural Com-
put., 5 (1993), pp. 305–316.
[44] T. Poggio and F. Girosi, Regularization algorithms for learning that are equivalent to multilayer
networks, Science, 247 (1990), pp. 978–982.
[45] M. J. D. Powell, Radial basis functions for multivariable interpolation: A review, in Algo-
rithms for Approximation, J. C. Mason and M. G. Cox, eds., Clarendon Press, Oxford,
1987, pp. 143–167.
[46] M. J. D. Powell, The theory of radial basis function approximation in 1990, in Advances
in Numerical Analysis, Vol. II, W. Light, ed., Oxford University Press, New York, 1992,
pp. 105–210.
[47] W. B. Powell, Approximate Dynamic Programming, Wiley, 2011.
[48] W. B. Powell and I. O. Ryzhov, Optimal Learning, Wiley, Hoboken, NJ, 2012.
[49] C. E. Rasmussen and Z. Ghahramani, Infinite mixtures of Gaussian process experts, in Ad-
vances in Neural Information Processing Systems 14, MIT Press, Cambridge, MA, 2001,
pp. 881–888.
[50] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning, MIT
Press, Cambridge, MA, 2006.
[51] R. G. Regis and C. A. Shoemaker, A stochastic radial basis function method for the global
optimization of expensive functions, INFORMS J. Comput., 19 (2007), pp. 497–509.
[52] R. G. Regis and C. A. Shoemaker, Improved strategies for radial basis function methods for
global optimization, J. Global Optim., 37 (2007), pp. 113–135.
[53] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning internal representations by
error propagation, in Parallel Distributed Processing, D. E. Rumelhart and J. L. McClel-
land, eds., 1986, pp. 318–362.
[54] D. Ruppert, M. P. Wand, and R. J. Carroll, Semiparametric Regression, Cambridge Series
in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, UK,
2003.
[55] B. Shahbaba and R. M. Neal, Nonlinear models using Dirichlet process mixtures, J. Mach.
Learning Res., 10 (2009), pp. 1829–1850.
[56] J. S. Simonoff, Smoothing Methods in Statistics, Springer, New York, 1996.
[57] Q. Song and N. Kasabov, ECM: A novel on-line, evolving clustering method and its applica-
tions, in Foundations of Cognitive Science, MIT Press, Cambridge, UK, 2001, pp. 631–682.
[58] A. N. Tikhonov and V. Y. Arsenin, Solutions of Ill-Posed Problems, John Wiley & Sons,
New York, 1977.
[59] G. Wahba, Spline bases, regularization, and generalized cross validation for solving approxi-
mation problems with large quantities of data, in Approximation Theory III, W. Cheney,
ed., Academic Press, 1980, pp. 905–912.
[60] G. S. Watson, Smooth regression analysis, Sankhya Ser. A, 26 (1964), pp. 359–372.
[61] P. J. Werbos, Beyond Regression: New Tools for Prediction and Analysis in the Behavioral
Sciences, Ph.D. dissertation, Division of Applied Mathematics, Harvard University, Boston,
MA, 1974.
[62] I. C. Yeh, Modeling of strength of high-performance concrete using artificial neural networks,
Cement Concrete Res., 28 (1998), pp. 1797–1808.