A Recursive Local Polynomial Approximation Method Using Dirichlet Clouds and Radial Basis Functions
© 2016 Society for Industrial and Applied Mathematics
Vol. 38, No. 4, pp. B619–B644
Abstract. We present a recursive function approximation technique that does not require the
storage of the arriving data stream. Our work is motivated by algorithms in stochastic optimization
that require approximating functions in a recursive setting, such as stochastic approximation
algorithms. The unique combination of features in this technique is essential for nonlinear modeling
of large data sets, where storing the data becomes prohibitively expensive, and in circumstances
where our knowledge about a given query point increases as new information arrives. The algorithm
presented here employs radial basis functions (RBFs) to provide locally adaptive parametric models
(such as linear models). The local models are updated using recursive least squares, and only a
statistical representation of the local approximations is stored. The resulting scheme is very fast and
memory efficient without compromising accuracy in comparison with standard methods and with
advanced techniques used for functional data analysis in the literature. We motivate the
algorithm using synthetic data and illustrate the algorithm on several real data sets.
Key words. radial basis functions, function approximation, local polynomials, data fitting
DOI. 10.1137/15M1008592
∗ Submitted to the journal’s Computational Methods in Science and Engineering section Febru-
ary 17, 2015; accepted for publication (in revised form) April 28, 2016; published electronically
August 2, 2016. This work was partially supported by grant FA9550-08-1-0195 from the Air Force
Office of Scientific Research. Any opinions, findings, and conclusions or recommendations expressed
in this material are those of the authors and do not necessarily reflect the views of the Air Force
Office of Scientific Research.
http://www.siam.org/journals/sisc/38-4/M100859.html
† School of Electrical and Computer Engineering, University of Tehran, Tehran, Iran (arta.
‡ Princeton University, Princeton, NJ 08544 ([email protected]).
and as a result are quite slow. One of the main attractions of RBFs is that the
resulting optimization problem can be broken efficiently into linear and nonlinear
2. Normalized radial basis functions. RBFs are powerful tools for function
approximation [5, 7, 46]. Over the years RBFs have been used successfully for solving
a wide range of function approximation problems (see, e.g., [3]). RBFs have also been
used for global optimization (see, e.g., [51, 52, 25, 30]). An RBF expansion is a linear
summation of special nonlinear basis functions. In general, an RBF is a mapping
f : R^n \to R that is represented by

(1)   f(x) = \sum_{i=1}^{N_c} \alpha_i\, \phi(\|x - c_i\|_{W_i}),

where x is an input pattern, φ is the RBF centered at location c_i, and α_i denotes the
weight for the ith RBF. N_c denotes the total number of RBFs. The term W_i, which
is a symmetric positive definite matrix, contains the parameters of the weighted inner
product

\|x\|_W = \sqrt{x^T W x}.
Note that throughout this work Wi is diagonal, as explained in section 3.1. The
dimensions of the input n and output m are specified by the dimensions of the input-
output pairs. Universal approximation properties of RBFs are established in [42, 43].
Normalized RBFs of the following form were proposed by Moody and Darken [37]:
(2)   f(x) = \frac{\sum_{i=1}^{N_c} \alpha_i\, \phi(\|x - c_i\|_{W_i})}{\sum_{i=1}^{N_c} \phi(\|x - c_i\|_{W_i})}.
Normalized RBFs appear to have advantages over regular RBFs, especially in the
domain of pattern classification; see, e.g., [6]. In that work it is reported that using
normalized RBFs reduced the order of the model and improved the robustness of generalization.
It has also been reported that normalized RBFs require less data when training models
of dynamical systems [34].
The polynomial modulation of the normalized RBF expansion leads to the expansion

(3)   f(x) = \frac{\sum_{i=1}^{N_c} p_i(x)\, \phi(\|x - c_i\|_{W_i})}{\sum_{i=1}^{N_c} \phi(\|x - c_i\|_{W_i})},
where p_i(x) is assumed to be a low order polynomial such as a linear function (p_i(x) = \alpha_i^T x + \beta_i).
Note that the denominator does not become zero for N_c > 1 and distinct c_i; this is a
property of the RBF φ. Let ψ contain all the parameters, \alpha_i, \beta_i, c_i, W_i, N_c,
that are used to define f(x) as shown in (3). We make the dependence on these unknown
parameters explicit by writing f(x, ψ); later we use this notation when optimizing the parameters
contained in ψ.
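For concreteness, the following Python sketch evaluates the expansion (3) for the Gaussian kernel used later in this paper and linear responses p_i(x) = α_i^T x + β_i; the array layout and the function name are illustrative choices, not part of the original formulation.

```python
import numpy as np

def normalized_rbf_eval(x, centers, W_diags, alphas, betas):
    """Evaluate the polynomially modulated normalized RBF expansion (3).

    x       : (n,)     query point
    centers : (Nc, n)  kernel centers c_i
    W_diags : (Nc, n)  diagonals of the weight matrices W_i
    alphas  : (Nc, n)  slopes of the linear responses p_i
    betas   : (Nc,)    intercepts of the linear responses p_i
    """
    x = np.asarray(x, float)
    d = x - centers                              # displacements x - c_i
    r2 = np.sum(W_diags * d**2, axis=1)          # squared weighted norms ||x - c_i||_{W_i}^2
    phi = np.exp(-r2)                            # Gaussian kernel values
    responses = alphas @ x + betas               # p_i(x) = alpha_i^T x + beta_i
    return phi @ responses / phi.sum()           # normalized, polynomially modulated output
```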
The above formulation leads us to the idea of normalized RBFs that have poly-
nomial response terms. In what follows we provide a fast algorithm that recursively
updates the response terms and the weights associated to them as new data points
arrive. During the training phase the unknown parameters in the model need to be
calculated, including the number of RBFs, Nc , in the function expansion.
3. Online learning scheme. The goal of our proposed algorithm is to find a
mapping f from x ∈ Rn , a vector in n-dimensional space, to R such that y k = f (xk )
as the input-output pair {(xk , y k )} becomes available. We expect to have a total of
K arrival data points, where k ∈ {1, . . . , K}, with X = \{x^k\}_{k=1}^{K} and Y = \{y^k\}_{k=1}^{K}
representing the domain and range observable values. We assume that data arrive as
a stream. Throughout the process we would like to find the parameters associated
with the model provided in (3) to find an accurate model for the data. The observed
data might be noisy, and we would like to have a model that has good generalization
ability. In this regression framework, the problem is to minimize the cost function
E(\psi) = \frac{1}{2} \sum_{k=1}^{K} \left\| f(x^k, \psi) - y^k \right\|^2,

given the available data points. In this work, the Euclidean inner product is used for
the metric \| \cdot \|, and we use the Gaussian kernel

\phi(r) = \exp(-r^2).
Note that other local kernels could be used. For a comprehensive review of RBF
kernels and recently developed skew and compactly supported expansions, see [32]. In
what follows the description of an algorithm in this framework is presented that does
not require the storage of the data stream and locally approximates the underlying
functional behavior of the data. The response model parameters are updated quickly
using recursive least squares, and the weights associated to each linear response are
updated via statistical techniques. Note that the model order is also determined in
the algorithm.
3.1. A data driven space cover. The algorithm proposed here works by form-
ing a cover for the domain of the underlying function f upon arrival of new data points,
xk . We define a cover
C = \{U_i : i \in \Delta\},

where \{U_i\} is a family of sets indexed by the elements of the set Δ. C is a
cover of X if X \subseteq \bigcup_{i \in \Delta} U_i. On the arrival of the first data point, x^1, a cover element
or cloud U1 is formed with the center on this point and with a distance threshold of
DT to form a ball around x1 . When the system receives a new data point x2 with
associated y 2 value, if the data point is within distance DT of x1 , x2 is assigned to
the same cloud as x1 , and the cloud’s centroid and variance are updated; otherwise
a new cloud is created. New data points xk are assigned to clouds using the distance
metric
(4)   D_i = \|x^k - c_i\|,
the surface landscape). The data stored in the model are the number of points
in each cloud, K_i, the centroid of each cloud, c_i, and the variance in each dimension
of the data in each cloud, W_i (the diagonal of the sample covariance matrix). In
addition the response coefficients for each cloud are also stored. Note that the model
can be evaluated at any point of its domain after adapting the model using the new
information. If the geometry of the data is more complex, there will be a need for
more clouds in the cover to capture this behavior. The proposed space cover is a part
of our function approximation technique. We have developed this independently. In
this procedure, the clouds form, move, and could dissipate (if there are not enough
data, a possibility studied in future work) to model the topology of the underlying
data. Note that the data is not a priori sorted or organized in any way, and the cloud
centers must be learned adaptively from the data. To the best of our understanding
there are some similarities to clustering procedures such as those found in [57, 24].
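A minimal sketch of the cloud bookkeeping just described, assuming only the per-cloud summaries listed above (point count, centroid, running variance) are kept; the container and function names are illustrative.

```python
import numpy as np

class Cloud:
    """Statistical summary of one cover element U_i; no raw data is retained."""
    def __init__(self, x):
        self.k = 1                       # number of points assigned to the cloud
        self.c = np.array(x, float)      # centroid
        self.S = np.zeros_like(self.c)   # running sum used by the Welford variance update

def assign_or_spawn(x, clouds, DT):
    """Assign x to the nearest cloud if it lies within DT, otherwise spawn a new cloud.
    Returns the index of the cloud that receives x."""
    x = np.asarray(x, float)
    if clouds:
        dists = [np.linalg.norm(x - cl.c) for cl in clouds]   # D_i = ||x - c_i||, eq. (4)
        I = int(np.argmin(dists))
        if dists[I] < DT:
            return I
    clouds.append(Cloud(x))
    return len(clouds) - 1
```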
3.2. Recursive update of the model. There are two parts of the model that
are updated during training: the linear response of the local model associated with cloud
I, and its weights.
Recursive update for response. We solve a local least squares problem of the
form \min_{\theta_I} \|X_I^{k_I} \theta_I - Y_I^{k_I}\|_2^2, where X_I^{k_I} is a matrix of the form

X_I^{k_I} = \begin{bmatrix} 1 & x^1 \\ \vdots & \vdots \\ 1 & x^{k_I} \end{bmatrix},
where the vectors x^1, . . . , x^{k_I} are domain values. The vector Y_I^{k_I} contains all the
associated range values, Y_I^{k_I} = [y^1, . . . , y^{k_I}]. The vector \theta_I^T = [\beta_I, \alpha_1^I, . . . , \alpha_n^I] contains
all the parameters for the linear response. The direct solution to this system is
\theta_I = (X_I^{k_I\,T} X_I^{k_I})^{-1} X_I^{k_I\,T} Y_I^{k_I}. After receiving n + 1 affinely independent data points
in a local neighborhood (we refer to this state of the model as the no-knowledge state),
the matrix X_I^{k_I} is formed with k_I = n + 1. The recursive least squares update [10] is
used to compute the linear response model parameters. For k_I = n + 1, we initialize
the recursion with P_I^{k_I} = (X_I^{k_I\,T} X_I^{k_I})^{-1}. When a new data point (x^k, y^k) is assigned
to this local neighborhood, with k_I > n + 1, let a^T = [1, x^k].
The recursion formulas are then

(5)   P_I^{k_I+1} = P_I^{k_I} - \frac{P_I^{k_I} a a^T P_I^{k_I}}{1 + a^T P_I^{k_I} a},

(6)   \theta_I^{k_I+1} = \theta_I^{k_I} + P_I^{k_I+1} a \left( y^k - a^T \theta_I^{k_I} \right),
where kI is the recursion index for the cloud indexed I. These equations are easily
modified to include a discount factor which puts more weight on recent observations.
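In code, the rank one updates (5) and (6) take roughly the following form; an optional discount factor λ ≤ 1 implements the weighting of recent observations mentioned above, and all names are illustrative.

```python
import numpy as np

def rls_update(P, theta, x, y, lam=1.0):
    """One recursive least squares step for a cloud's linear response.

    P     : (n+1, n+1) current matrix P_I^{k_I}
    theta : (n+1,)     current coefficients [beta_I, alpha_I]
    x, y  : new input point assigned to the cloud and its response
    lam   : discount factor; lam < 1 puts more weight on recent observations
    """
    a = np.concatenate(([1.0], np.asarray(x, float)))        # regressor a^T = [1, x^k]
    Pa = P @ a
    P_new = (P - np.outer(Pa, Pa) / (lam + a @ Pa)) / lam    # eq. (5), with discounting
    theta_new = theta + P_new @ a * (y - a @ theta)          # eq. (6): correct by the prediction error
    return P_new, theta_new
```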
Recursive update for the weights. When a new data point (xk , y k ) is assigned
to the local neighborhood indexed I, the weights WI associated with cloud UI is also
updated. For this purpose the number of data points that have been assigned to this
local ball is updated as k_I = k_I + 1. The centers of the kernels are updated via the
recursion

(7)   c_I^{k_I} = c_I^{k_I - 1} + \frac{x^k - c_I^{k_I - 1}}{k_I}.
The width or scale of the model in each dimension of the cloud I is updated using
the Welford formula [35]. Initialize S_1 = 0. For each additional data point x^k assigned
to cloud I, the running sum and the per-dimension variance are updated as

(8)   S_{k_I} = S_{k_I-1} + (x^k - c_I^{k_I-1}) \odot (x^k - c_I^{k_I}),

(9)   W_I = \operatorname{diag}\!\left( S_{k_I} / (k_I - 1) \right),

where \odot and the division are taken componentwise.
The weighted inner product in the argument of the RBF function (as shown in
(3)) is calculated from this recursive update of the standard deviation of data in each
dimension. Note that in this study the weight matrices are diagonal. For cases where
the standard deviation of the data along a specific dimension is zero, we introduce a
penalty term to avoid singularity by replacing Wi with Wi + P , where P is a specified
constant. The required storage is proportional to Nc (3n + 1).
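A sketch of the recursive centroid and scale updates (7)–(9) for a single cloud, continuing the illustrative Cloud container from the sketch in section 3.1; applying the penalty P per dimension is one possible reading of the zero-variance safeguard described above.

```python
def update_cloud_statistics(cloud, x, penalty=1e-6):
    """Recursively update the centroid (7) and the per-dimension scale (8)-(9) of a cloud.
    x is a NumPy array holding the new point already assigned to this cloud."""
    cloud.k += 1
    c_old = cloud.c.copy()
    cloud.c = c_old + (x - c_old) / cloud.k           # eq. (7): running mean of the assigned points
    cloud.S = cloud.S + (x - c_old) * (x - cloud.c)   # eq. (8): Welford running sum per dimension
    var = cloud.S / (cloud.k - 1)                     # eq. (9): per-dimension sample variance
    cloud.W = var + penalty * (var == 0)              # add the penalty P where the variance is zero
    return cloud
```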
Ideally, we assume that the arrival data are scattered. At initial stages of model
construction it is best to have data that are sampled randomly from the input space.
The order of the data plays a role in the initial stage of the placement of the centroids;
however, the centers of the clouds stabilize once there are enough data points in a
given cloud, since from (7)

\|c_I^{k_I} - c_I^{k_I-1}\| = \frac{\|x^k - c_I^{k_I-1}\|}{k_I} < \frac{D_T}{k_I}.

As the number of data points k_I in a cloud increases, D_T / k_I becomes smaller. In the
case where the data points arrive in one direction, a new cloud is created after a
finite number of data points.
3.3. Model evaluation. This section describes how to compute the approximation
f(x) at a query point x. As shown in (3), the output of the proposed model
is the weighted average of the linear responses over the local clouds Ui associated to
the cover C. In this study a Gaussian kernel, φ(r) = exp(−r2 ), determines the weight
of each response function. The widths, Wi , and the centroids, ci , of the kernels,
i ∈ {1, . . . , Nc }, were computed recursively as mentioned in section 3.2. The model
output is formulated as
(10)   f(x) = \frac{\sum_{i=1}^{N_c} \phi(\|x - c_i\|_{W_i}) \left( [1, x]\, \theta_i \right)}{\sum_{i=1}^{N_c} \phi(\|x - c_i\|_{W_i})}.
Nc is the total number of clouds. Note that θi is the least squares solution of the
response over the ith local cloud. Note that if a cloud does not accumulate n + 1
affinely independent data points, the models built in other clouds are evaluated at a
query in this region.
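A sketch of the evaluation step (10) using the stored per-cloud quantities; only clouds whose linear response has been activated are assumed to be passed in, and the names are illustrative.

```python
import numpy as np

def dcrbf_predict(x, centroids, W_diags, thetas):
    """Evaluate the DC-RBF model (10) at a query point x.

    centroids : (Nc, n)   cloud centroids c_i
    W_diags   : (Nc, n)   diagonal kernel weights of the clouds
    thetas    : (Nc, n+1) response coefficients theta_i = [beta_i, alpha_i]
    """
    x = np.asarray(x, float)
    d = x - centroids
    phi = np.exp(-np.sum(W_diags * d**2, axis=1))   # Gaussian kernel weights per cloud
    a = np.concatenate(([1.0], x))                  # [1, x]
    responses = thetas @ a                          # [1, x] theta_i for each cloud
    return phi @ responses / phi.sum()              # weighted average of local responses, eq. (10)
```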
A summary of the above procedure is presented in Algorithm 1.
3.4. Goodness of fit and stopping criteria. In data modeling one of the
major goals is to model data in such a way that the model does not over- or underfit
the data that is not used for training but is generated from the same process as the
training data. One general approach to finding such a smooth model for the observed
data is known as regularization. Regularization describes the process of fitting a
smooth function through the data set using a modified optimization problem that
penalizes variation [58]. A standard technique to achieve regularization is via cross-
validation [22, 59]. Such methods involve partitioning the data into subsets of training,
validation, and testing data; for details see, e.g., [28].
Algorithm 1. The DC-RBF online learning scheme.
while a new data point (x^k, y^k) arrives do
    if k = 1 then
        c_1 = x^1, N_c = 1, k_1 = 1, Δ = {1}; form cloud U_1
    else
        compute D_i = \|x^k - c_i\| according to (4) for all i ∈ Δ
        compute I^* = arg min_i D_i and let I = min I^*
        if D_I < D_T then
            update cloud U_I, k_I = k_I + 1
            update the cloud centroid, c_I, using (7)
            update the scale of the cloud, W_I, using (8) and (9); if the standard deviation of
            the data is zero along a dimension, use W_I + P, where P is a specified constant
            if k_I ≥ n + 1 and the points are affinely independent then
                update the response parameters, θ_I, using (5) and (6)
            else
                store x^k assigned to cloud I
            end if
        else
            spawn a new cloud U_{N_c+1}: N_c = N_c + 1, k_{N_c} = 1, Δ = {1, . . . , N_c}
        end if
    end if
    k = k + 1
end while
evaluate the model as needed using (10)
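The sketch below ties the earlier pieces together into one online update in the spirit of Algorithm 1, reusing the illustrative helpers assign_or_spawn, update_cloud_statistics, and rls_update from sections 3.1 and 3.2; for brevity, the affine-independence test is replaced by a pseudoinverse.

```python
import numpy as np

def dcrbf_observe(x, y, clouds, DT, n):
    """Process one arriving pair (x^k, y^k), updating the cloud cover in place."""
    x = np.asarray(x, float)
    n_before = len(clouds)
    I = assign_or_spawn(x, clouds, DT)           # section 3.1: nearest cloud, or spawn a new one
    cloud = clouds[I]
    if len(clouds) == n_before:                  # x joined an existing cloud
        update_cloud_statistics(cloud, x)        # eqs. (7)-(9)
    if not hasattr(cloud, "P"):                  # no-knowledge state: collect n + 1 points first
        cloud.buffer = getattr(cloud, "buffer", []) + [(x, y)]
        if len(cloud.buffer) == n + 1:           # direct least squares to initialize P and theta
            X = np.array([np.concatenate(([1.0], xi)) for xi, _ in cloud.buffer])
            Y = np.array([yi for _, yi in cloud.buffer])
            cloud.P = np.linalg.pinv(X.T @ X)    # pseudoinverse stands in for the rank check
            cloud.theta = cloud.P @ X.T @ Y
            cloud.buffer = []                    # raw points can now be discarded
    else:
        cloud.P, cloud.theta = rls_update(cloud.P, cloud.theta, x, y)   # eqs. (5)-(6)
```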
To determine the tunable parameter of the model proposed in this work, i.e., the distance threshold, D_T, we
use t-fold cross-validation techniques [19].
In this scheme, the original data set is randomly partitioned into t subsample sets.
One of the t subsample sets is chosen as the validation data for testing the model,
and the remaining subsample sets are used as training data. This process is repeated
t times with each of the t subsample sets used only once as the validation data. All
the t results are averaged to provide an overall estimate of the error.
The advantage of this method over repeated random subsampling is that all ob-
servations are used for both training and validation, and each observation is used for
validation exactly once [27]. We have chosen to use 5-fold cross-validation. We record
the accuracy of the model on the testing data set. This procedure is repeated for all
five folds, and the mean squared error (MSE) on the test data sets is computed at
each time using
(11)   MSE = \frac{1}{L} \sum_{l=1}^{L} \left( f(x^l) - y^l \right)^2,
where L is the size of the testing data set. Overfitting is reduced by tuning the
parameter, D_T, to minimize the MSE calculated through cross-validation, as opposed to
minimizing the MSE of points within the training data set [28].
To increase the number of estimates, one could run the above t-fold cross-valida-
tion multiple times. For this purpose the data needs to be repartitioned each time.
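A sketch of the t-fold cross-validation loop used to select D_T; train_dcrbf and predict_dcrbf are assumed wrappers around the update and evaluation sketches given earlier, and X, Y are arrays of inputs and responses.

```python
import numpy as np

def cross_validate_DT(X, Y, DT_candidates, t=5, seed=0):
    """Choose the distance threshold with the lowest average validation MSE over t folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), t)
    scores = {}
    for DT in DT_candidates:
        errs = []
        for f in range(t):
            val = folds[f]
            train = np.concatenate([folds[j] for j in range(t) if j != f])
            model = train_dcrbf(X[train], Y[train], DT)                  # assumed training wrapper
            pred = np.array([predict_dcrbf(model, x) for x in X[val]])   # assumed evaluation wrapper
            errs.append(np.mean((pred - Y[val]) ** 2))                   # validation MSE, eq. (11)
        scores[DT] = np.mean(errs)                                       # average over the t folds
    return min(scores, key=scores.get)
```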
To determine the predictive ability of the models, we use several measures to show
the accuracy of the predictions. The R^2 value is calculated through the following formula:

R^2 = 1 - \frac{\sum_{l=1}^{L} \left( y^l - f(x^l) \right)^2}{\sum_{l=1}^{L} \left( y^l - \bar{y} \right)^2},

where \bar{y} = \frac{1}{T_L} \sum_{l=1}^{T_L} y^l is the mean of the y^l over the training set, L is the size of
the testing data set, and T_L is the size of the training set. The numerator measures the
variation of the model output with respect to the observed values, and the denominator measures
the variation of the data with respect to the mean of the outcomes. This measure shows how well the
model performs in comparison to the case where the mean value is used as the
predictor. An R^2 value of 1 indicates an exactly fitted model, while a value of 0
indicates a model that adds no predictive power. We have also used the notion of
mean absolute error (MAE) to record the performance of the model. The MAE is
defined as follows:
(12)   MAE = \frac{1}{L} \sum_{l=1}^{L} \left| f(x^l) - y^l \right|.
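A small helper computing the error measures (11), (12), and the R^2 score used in this section; the baseline mean ȳ is passed explicitly so it can be computed over whichever data set is preferred.

```python
import numpy as np

def fit_metrics(y_pred, y_true, y_bar=None):
    """Return (MSE, MAE, R2) on a test set; see (11), (12), and the R^2 formula above."""
    y_pred = np.asarray(y_pred, float)
    y_true = np.asarray(y_true, float)
    mse = np.mean((y_pred - y_true) ** 2)        # eq. (11)
    mae = np.mean(np.abs(y_pred - y_true))       # eq. (12)
    if y_bar is None:
        y_bar = y_true.mean()                    # default baseline: mean of the test responses
    r2 = 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_bar) ** 2)
    return mse, mae, r2
```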
If the distance threshold is set too high, the proposed method will fit a single
hyperplane to the entire data set. In practice one could determine D_T by observing
a portion of the data and then use this value for the rest of the input stream.
4. The convergence properties of the DC-RBF algorithm. This section
describes facts about the DC-RBF algorithm. Theorems describe the finite data and
asymptotic behavior of this algorithm. We assume scattered data arrive and the
underlying structure is continuously differentiable. Note that the proofs are carried
out for the algorithm in steady state where there are enough clouds to cover the desired
domain of the function, the clouds have stabilized (the movement of the centers
is negligible), and there are enough points in each cloud to form a response surface,
as explained in the previous section. Theorems are provided for both homoscedastic
and heteroscedastic types of noise.
Assume we have

(13)   y^k = \theta^T x^k + \epsilon^k

for k ∈ {1, . . . , K}. The vector θ is the parameter that needs to be estimated. To simplify
our analysis we assume the origin of the coordinate system is adjusted such that the
intercept is zero. The set D_K = \{(x^k, y^k)\}, with k ∈ {1, . . . , K}, denotes the collection
of sequentially observed data points. The error variables, \epsilon^k, are random.
Remark 1 (Gauss–Markov theorem [18]). The Gauss–Markov theorem states that
in a linear regression model in which the errors have expectation zero, are uncorrelated,
and have equal variances, the linear unbiased estimator with the lowest possible variance
(the best linear unbiased estimator, BLUE) of the coefficients is given by the ordinary least
squares estimator. The actual errors need not be normal, nor independent and identically
distributed (only uncorrelated and homoscedastic). That is, E(\epsilon^k) = 0, var(\epsilon^k) = \sigma^2 < \infty, and
cov(\epsilon^{k_1}, \epsilon^{k_2}) = 0 for k_1 \neq k_2 and k_1, k_2 ∈ \{1, . . . , K\}.
The proof follows from a contradiction on the mean and variance of another
linear estimator for the coefficient θ, for example one obtained by adding another term to the
least squares solution given by (X^T X)^{-1} X^T. It can be shown that such a new estimator
is either biased or produces a variance that is greater than the variance of the least
squares solution, \sigma^2 (X^T X)^{-1}.
Note that in our work we use the recursive least squares solution on the local
clouds as new data points arrive at that local region.
Proof. Under the hypothesis HIk (null hypothesis at cloud I and iteration k) that
cloud I has a linear structure, the rank one update of the recursive least squares
solution at each iteration given by (5) and (6) provides the best linear unbiased
estimator for the coefficient θ. Note that in this argument the local iteration counter
kI (the kth data point that arrives to cloud I) is the same as the global counter,
kI = k. We offer a proof based on induction. At k = n + 1 the argument is true since
the algorithm saves n + 1 data points to initialize the matrix P, and the least squares
solution is computed directly. Letting the argument be true for k = m ∈ N, we show
that at iteration k = m + 1 the rank one update obtained via the recursive least
squares solution is also unbiased. This follows from the derivation of the recursive
least squares update via the Sherman–Morrison formula,

(A + a a^T)^{-1} = A^{-1} - \frac{A^{-1} a a^T A^{-1}}{1 + a^T A^{-1} a},

applied with A = X_I^{k_I\,T} X_I^{k_I}, which yields the rank one update (5).
Proof. The proof follows that of Proposition 3 and incorporates the notion of generalized
least squares. Generalized least squares assumes the conditional variance of Y
given X is a known or estimated matrix Ω. This matrix is used as the weight in computing
the minimum squared Mahalanobis length; hence \hat{\beta} = (X^T \Omega^{-1} X)^{-1} X^T \Omega^{-1} Y.
Ω rescales the input data to make it uncorrelated, and then the Gauss–Markov theorem
is applied. For the case where the variance of the noise changes, the weight
matrix Ω is diagonal, and this is a weighted least squares problem.
We denote the bias generated by the DC-RBF algorithm at each point of the
domain of the function f(x) as B(x) = (f(x) - E[y|x])^2. The variance at each point
of the domain is V(x) = \frac{1}{K} (f(x) - \bar{f}(x))^2, where \bar{f}(x) is the mean of f(x) over its
domain values x.
Proposition 5. If the underlying nonlinear functional structure g(x) is Lipschitz
continuous with Lipschitz constant L_c, φ(·) is uniform, and the observed data points
are drawn from

(14)   y^k = g(x^k) + \epsilon^k,

then \sum_{x \in \mathcal{X}} B(x) < Q(L_c, \omega_1, \omega_2) = L_c \frac{(\omega_1 - \omega_2)^2}{2} + O. Here D_T = diam(\mathcal{X}) is the
chosen distance threshold, \omega_1 and \omega_2 determine the intersecting points of the linear
estimation and the underlying true structure g(x), and O denotes a residual term.
Proof. We assume that the underlying function g(x) is Lipschitz continuous; i.e.,
there are restrictions on the variations in the underlying structure. This implies
that for every pair of points on the graph of this function, the absolute value of the
slope of the line connecting them is no greater than a definite real number (Lipschitz
constant, L_c(g)). The slope of the secant line passing through x^{k_1}, x^{k_2} ∈ \mathcal{X} is equal to
the difference quotient

\frac{g(x^{k_1}) - g(x^{k_2})}{|x^{k_1} - x^{k_2}|} < L_c \quad \text{when } x^{k_1} \neq x^{k_2}.

Without loss of generality we only consider the portion of the data that has an ascent followed by a descent slope
in its structure. This is achieved by an appropriate choice of D_T. By assumption,
φ(·) provides uniform weights. First we show that the least squares solution to this structure
intersects at two points, ω1 and ω2 (see Figure 1(a) for a visual presentation). The
proof follows from a contradiction argument.
Fig. 1. Visual demonstration of the regression lines intersecting the underlying nonlinear func-
tion g(x). In this figure we provide the geometry for the case where the model consists of a single
line, followed by the case where the model consists of two lines.
e : x ∈ [ωl , ωr ] → ek (x) ∈ R.
We assume that the data used for training is given by XI ∈ UI (index I denotes
a specific cloud, in this case the whole data diameter with I = 1). Note that the
global iteration k is the same as the local index kI . The following analysis is carried
out for a given iteration k. These data are contained in the closed interval x ∈ BI =
[\omega_l, \omega_r] \subseteq U_I. In addition, we assume that both g and f intersect at the origin of the
coordinate system, i.e., e(\omega_1) = 0, and that e(x) = 0 for x \notin B_I.
Let a^+ denote the optimized parameters for the linear estimator f, i.e.,

a^+ = \arg\min_a \sum_{x^l \in B_I} \left( a x^l - e(x^l) \right)^2.
a+ x − e(x) = 0.
This indicates that the line a^+ x intersects g(x) at another point. If a^+ x - e(x) \neq 0
for all x ∈ [\omega_l, \omega_r], then, without loss of generality, one could assume a^+ x - e(x) > 0
for all x ∈ [\omega_l, \omega_r].
So, there must be a smallest distance between the optimal hyperplane a+ x and
the residual curve e(x). This distance is defined as
\alpha^+ = \min_{x \in [\omega_l, \omega_r]} \left( a^+ x - e(x) \right).
Say this minimum occurs at the point x+ . Note that this point x+ may not actually
correspond to a sampled data point xk .
Now consider another hyperplane a∗ x which is obtained by scaling the optimal
hyperplane so that it produces another point of intersection with e(x), i.e.,
a^* x - e(x) = 0

for some x ∈ B_I, while reducing the sum of squared residuals \sum_{x^l \in B_I} \left( a^* x^l - e(x^l) \right)^2
below that of a^+, which contradicts the optimality of a^+.
Note that BI is an interval that contains the region of interest, diam(BI ) > |ω1 − ω2 |,
where ωr and ωl are the right and left ends of the interval BI , respectively. f k is the
line structure that is the least squares solution. We present this proof for a single
dimension; however, the higher dimensional argument follows directly by considering a
ball BI with a diameter that covers the intersecting hyperplane with codimension 2 at
{Ω1 , . . . , Ωn } with g(x). The argument follows the volume that is contained between
the hyperplane of codimension 1 and the polyhedron (tetrahedra in dimension 2) that
contains the nonlinear structure and has slope equal to Lc for all its faces.
Note that in one dimension we have the notion of a line and a tangent line, and
the model output is the weighted sum of piecewise linear approximation to the data.
In higher dimensions we have the notion of a hyperplane with the tangent plane to
make a weighted piecewise linear approximation to the underlying manifold.
Definition 6 (partition of unity [39]). A partition of unity of a topological space
X is a set of continuous functions, {ρi }i∈Δ (Δ is an index set), from X to the unit
interval [0, 1], such that for every point x ∈ X there is a neighborhood of x where all
but a finite number of the functions are 0, and the sum of all the function values at x
is 1, i.e., \sum_{i \in \Delta} \rho_i(x) = 1.
The notion of partition of unity allows the extension of a local construction to the
whole space. We use the following method to identify our partition of unity. Given
any open cover U_i, i ∈ Δ (Δ is an index set), of a space, there exists a partition \rho_i,
i ∈ Δ, such that supp \rho_i \subseteq U_i (supp \rho_i indicates the support of the function \rho_i). Such a
partition is said to be subordinate to the open cover Ui . Thus we choose to have the
supports indexed by the open cover.
If the functions \rho_i are compactly supported, then given any open cover U_i, i ∈ Δ, of a
space, there exists a partition \rho_j, j ∈ Λ, indexed over a possibly distinct index set Λ,
such that each \rho_j has compact support and, for each j ∈ Λ, supp \rho_j \subseteq U_i for some
i ∈ Δ.
Theorem 7. Assume the underlying data structure, g(x), is nonlinear and Lipschitz
continuous, and the distance threshold parameter is D_{T_1} = diam \mathcal{X}. Then the DC-RBF
algorithm with uniform RBF produces an upper bound on the bias denoted by Q_{D_{T_1}} (Q
is defined in Proposition 5); however, if D_{T_2} = diam \mathcal{X} / N with N ∈ N, N > 1, and the
corresponding upper bound on the bias is Q_{D_{T_2}}, then Q_{D_{T_2}} < Q_{D_{T_1}}.
Proof. As new data points arrive, more of the true underlying geometry reveals
itself. We begin by posing as a null hypothesis that the underlying structure is
described by a line. As the number of observations increases, we test this hypothesis
against alternative hypotheses that the underlying structure is described by a series of
lines over regions. At this stage the DC-RBF algorithm, given DT = diamX , attempts
to model the secant line passing through the curvature (see Figure 1(a) for a visual
presentation). What we mean by the secant line is the regression line passing through
the data points associated to cloud indexed I, where at this stage I = 1. Note that
the linear solution produces bias, with an upper bound provided in Proposition 5.
Similar to Proposition 5, we assume the regression line intersects the underlying
nonlinear structure at points ω1 and ω2 with upper bound on the bias Q. The domain
of interest is BI = [ωl , ωr ]. As DT shrinks (without loss of generality, we assume
D_T = diam \mathcal{X} / 2), the space is broken down into smaller sets (see Figure 1(b) for a visual
presentation). For a one-dimensional space, this choice of D_T breaks the space down
into two parts, hence Δ = {1, 2}. The local RBFs φ(r), with

\sum_{i \in \Delta} \frac{\phi(\|x - c_i\|_{W_i})}{\sum_{j \in \Delta} \phi(\|x - c_j\|_{W_j})} = 1,

form a partition of unity as provided in Definition 6. This would allow the
expansion of each solution to a larger domain to produce a smooth transition from
one local model to the neighboring models. With this setup, we assume that the two
regression lines l1 and l2 intersect at point ωM . According to Proposition 5, and with
a uniform choice for kernel φ(.), we get the following bounds for the bias on l1 and l2
denoted by Ql1 and Ql2 , respectively:
Q_{l_1} = L_c \frac{(\omega^{l_1}_1 - \omega^{l_1}_2)^2}{2} + (\omega_l - \omega^{l_1}_1)^2 + (\omega_M - \omega^{l_1}_2)^2

and

Q_{l_2} = L_c \frac{(\omega^{l_2}_1 - \omega^{l_2}_2)^2}{2} + (\omega_M - \omega^{l_2}_1)^2 + (\omega_r - \omega^{l_2}_2)^2.
One could simply verify that Ql1 + Ql2 < Q. This would result in reduction of
the upper bound on the bias, hence QDT2 < QDT1 , with the cost of having more
clouds in the cover (more parameters in the model) and a corresponding increase in
the variance of the model, VDT1 < VDT2 .
The argument provided here is for one dimension but, using arguments similar to those
made in Proposition 5, it can be generalized to higher dimensions.
Variable subset selection and shrinkage are methods that introduce bias and try to
reduce the variance of the estimate. These methods trade a little bias for a larger re-
duction in variance. In practice we choose DT via cross-validation over other available
methods that are developed for this purpose.
Theorem 8. Asymptotically, the DC-RBF algorithm is unbiased for a continuously
differentiable underlying nonlinear structure and heteroscedastic noise.
Proof. This theorem follows from Theorem 7, considering the limits D_T → 0 and K → ∞. Let

f_\Delta(x) = \frac{\sum_{i \in \Delta} p_i(x)\, \phi(\|x - c_i\|_{W_i})}{\sum_{i \in \Delta} \phi(\|x - c_i\|_{W_i})},

where g(x, a) denotes the approximation of the function g(x) around the point a and g^{(n)}(x)
denotes the nth derivative of the function g at the point x. If the underlying function
g(x) ∈ C^1, we consider a first order linear approximation to the function g(x) at each
point, i.e., g(x) ≈ f(x) = g^{(1)}(a)(x - a). In other words, at each point x, the first order
approximation is a hyperplane. Since in the current implementation of the DC-RBF
algorithm p_i(x) is chosen to be a linear function, p_i(x) is the tangent line to each point
x. This is true since the limit of the secant line passing through the points xk1 and
xk2 when xk1 → xk2 is a tangent line at point xk1 . When xk1 → xk2 , ||xk1 − xk2 || → 0
and in the limit this secant line forms the tangent at the meeting point. As a result
the difference quotient approaches the slope of the tangent line of g(x) at x^{k_2}, i.e.,

\lim_{\|x^{k_1} - x^{k_2}\| \to 0} \frac{f(x^{k_1}) - f(x^{k_2})}{\|x^{k_1} - x^{k_2}\|} = g^{(1)}(x^{k_2}).
This is true as K → ∞. Up to the first order approximation within a ball B(x, ξ) with
ξ > 0, the function is assumed to be linear. According to Remark 1, and considering
the fact that the variation of the noise within B(x, ξ) is negligible, the solution is
BLUE. Since this is true for every point in domain X , the DC-RBF algorithm is
therefore asymptotically unbiased.
5. The empirical results. Here we demonstrate the performance of the algo-
rithm on a variety of synthesized and real data sets. The data sets have distinct
features in terms of input dimension, complexity of the response function, as well as
noise content. Note that the algorithm provides a functional representation of the un-
derlying data. The first three synthetic examples concentrate on the performance of the
DC-RBF algorithm, followed by two real examples with comparisons to other statistical
techniques.
5.1. Synthesized data sets.
One-dimensional newsvendor data set. In the newsvendor problem, a news-
vendor is trying to determine how many units of an item x to stock. The stocking cost
and selling price for the item are c and p, respectively. One could assume an arbitrary
distribution on demand D. The expected profit is given by
F (x) = E [p min(x, D)] − cx.
This problem poses a challenge for online data analysis due to the special behav-
ior in the function around the optimal solution, which is highly dependent on the
characteristics of a data set.
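A sketch of how the newsvendor samples and the exact expected profit F(x) = E[p min(x, D)] − cx can be generated for a discrete uniform demand; the constants follow the setup reported in Figure 2, and the function names are illustrative.

```python
import numpy as np

def newsvendor_samples(rng, num, c=50, p=60, d_low=50, d_high=60, x_low=20, x_high=80):
    """Draw (stock level, realized profit) pairs with profit = p*min(x, D) - c*x."""
    x = rng.integers(x_low, x_high + 1, size=num)        # inventory levels sampled uniformly
    D = rng.integers(d_low, d_high + 1, size=num)        # uniform integer demand
    return x, p * np.minimum(x, D) - c * x

def expected_profit(x, c=50, p=60, d_low=50, d_high=60):
    """Exact F(x) = E[p*min(x, D)] - c*x for D uniform on {d_low, ..., d_high}."""
    D = np.arange(d_low, d_high + 1)
    return p * np.minimum(x, D).mean() - c * x

rng = np.random.default_rng(0)
x_train, y_train = newsvendor_samples(rng, 24)           # 24 training pairs, as in Figure 2(a)
```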
Figure 2 describes the experimental setup and the model output for two different
distributions on D. Figure 2 shows the training, testing, and model output for each
case. The specification of the algorithm for both problems is identical. The number
and position of training and testing data are the same in both experiments. We employ
our function approximation technique to find the maximum number of units to stock
to maximize expected profit. The MSE value for the first experiment is 107.39 with
a model that has three clouds. The second experiment has resulted in a model with
three clouds with MSE value of 426.63. We observe that the maximum stocking point
is well identified in both experiments. Note that the algorithm learns the underlying
functional behavior of the given data set within the scope that is determined by the
radius of the local balls. The model generalizes well within local regions. Figure 3
shows the performance of the proposed algorithm for different random ordering of
input data and a different distance threshold. The order of the data plays a role in
the initial stage of the model construction; however, the centers of the clouds stabilize
after there are enough data points in a given cloud.
Fig. 2. Newsvendor data set generated using [p min(x, D)] − cx. The training and testing data
sets consist of 24 and 36 data points, respectively. To generate this data set, c = 50, p = 60, and
D is a random uniform integer between 50 and 60 for panel (a) and between 30 and 70 for panel
(b). Inventory stock levels x were sampled from a random uniform integer distribution from 20 to
80. The MSE and MAE of the model output on the test data set in the original scale of the data
are 33847 and 107.39 for panel (a) and 453470 and 426.63 for panel (b), respectively. The model
distance threshold is 15. There are three clouds in each model. The centroids of the clouds in both
models are 29.80, 51, 69.85 with standard deviations of 6.54, 5.71, and 4.87, respectively.
[Figure 3: average profit versus units stocked for different random orderings of the input data and a different distance threshold.]
The underlying function forms a shape that is not in the span of a polynomial
of degree p, in particular for problems where D has low variance (making these more
challenging problems). Therefore parametric methods do not perform well, especially
given the fact that the basis functions of a parametric model must be known a priori.
For the noisy data set, interpolation methods such as cubic splines (unlike the
regression schemes) do not perform well and often result in overfitting the data. In-
terpolation techniques also require more data points to produce an accurate response;
see, e.g., [59]. We observe that, in this experiment, our method outperforms kernel
smoothing regression both in noisy and noise-free data sets. The poor performance
of kernel smoothing in this experiment is due to its local averaging property, which
tends to underfit the function around the local extreme points. Therefore, the pre-
dicted value of the maximum is worse than the true value. In addition, this technique
requires storage of the history of data points and does not provide an analytical form
for the underlying function.
We have tested our algorithm on various other one-dimensional noisy data sets
and have observed a similar conclusion. Therefore we only report on a newsvendor
data set which plays a key role in resource management and produces an interesting
case for data analysis due to the shape around the optimum result. Our method has
the lowest or a very comparable MSE compared to the techniques described in this
section and captures well the local structure of the signal.
An oscillatory data set with varying additive noise. To demonstrate the
ability of the algorithm to discover the underlying nonlinear function while the obser-
vations are highly corrupted with various levels of noise, we have tested the method
on a synthesized data set that is produced by y = sin(πx/4) + 0.1 x N(0, 1). Figure 4(a)
shows the noisy and noise-free data sets. The challenge is to find the underlying func-
tion sin(.) from the noisy input signal with large variation in the variance of noise.
Figure 4(b) shows the testing data set and the model output. One could observe that
the proposed algorithm has performed well in recovering the underlying function with
only five clouds in the function expansion. The final MSE is 0.8 and MAE 0.62. The
distance threshold is 3.
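A sketch generating the oscillatory data set y = sin(πx/4) + 0.1 x N(0, 1) with the 100/100 split used in Figure 4; the random seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 16.0, size=200)                  # 200 points on [0, 16]
y_clean = np.sin(np.pi * x / 4.0)                     # underlying function
y = y_clean + 0.1 * x * rng.standard_normal(200)      # noise variance grows with x

perm = rng.permutation(200)                           # 100 training and 100 testing points
x_train, y_train = x[perm[:100]], y[perm[:100]]
x_test, y_test = x[perm[100:]], y[perm[100:]]
```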
Fig. 4. Panel (a) shows a data set of 200 data points in the interval [0, 16] generated with
y = sin(πx/4) + 0.1 x N(0, 1). Both noise-free and noisy data are shown in this figure. Panel (b)
shows the training and testing data sets, each with 100 points, as well as the model output. In this
graph the data are rescaled using the standard deviation of the whole data set. The centroids of the clouds
are 1.24, 4.58, 7.93, 11.60, and 14.60, with standard deviations of 0.81, 1.13, 1.06, 1.04, and 0.79,
respectively.
Two-dimensional saddle data set. A data set generated from z = x^2 - y^2 + N(0, \sqrt{5}),
which produces a noisy saddle shape, is used in this study. Figure 5(a) shows
the data points that are used for training and testing. Figure 6 demonstrates four
[Figure 5: panel (a) shows the noisy saddle data used for training and testing; panel (b) shows the final model.]
major snapshots of the model-making process. At first there are not enough points
to form a model. Then, as new data points arrive, the first cloud is formed, as shown
in Figure 6(a). Depending on the spatial location of the arrival points, the first cloud
might get updated or the second cloud might get formed. The same process repeats
until the fourth cloud of the model is formed, as shown in Figure 6(d). Each panel in this
figure shows the incident when a new cloud is formed with three affinely independent
data points. The third data point that activates the cloud, i.e., the first data point
that is assigned to a cloud that results in construction of a plane over its designated
cloud, is also plotted in each panel. Finally, Figure 5(b) shows the final model, after
presenting 125 data points, plotted together with the 100 testing points.
We observe that the accuracy of the model is similar to LOESS (locally weighted
scatterplot smoothing). However, unlike LOESS our method does not require the storage
of the data points and has superior speed. In addition, DC-RBF provides an analyt-
ical formulation for the underlying structure. Our test results on other synthesized
surfaces provide the same conclusion. For further elaboration on the LOESS method
and its comparison to DC-RBF, see section 6.
5.2. Real benchmark data sets and comparison results. In this section
we show the performance of the proposed method on two real data sets and compare
results to the related methods in the literature in terms of accuracy, speed, batch
vs. online, and the requirement of storing the previous data points. We select two
data sets with continuous response values. The selected data sets represent regression
challenges on noise heteroscedasticity and moderate dimensionality. We compare our
results to techniques that we find are closest to the spirit of our work.
The list of the benchmark algorithms is as follows:
• Dirichlet process mixtures of generalized linear models (DP-GLM): A Bayesian
nonparametric method that finds a global model of the joint input-response
pair distribution through a mixture of local generalized linear models [26].
• Ordinary least squares (OLS): A parametric method that is widely used for
data fitting problems and often provides a reasonable fit to the data. In this
study [1, x]^T has been chosen as the vector of basis functions (x_i, i ∈ {1, . . . , n},
denotes the ith coordinate).
• CART: This is a nonparametric tree regression method [4]. This method is
Fig. 6. The model behavior at the starting point of adding a new cloud to the model. The final
model consists of four clouds.
Fig. 7. The CMB data set that consists of 899 data points. The training and testing data sets
and the model output for the CMB data set are shown in this figure. The training data set consists
of 500 randomly selected points, and the remaining 499 are used for testing. The centroids of the
clouds are 62.22, 195.01, 647.96, 328.01, 816.23, 425.90, and 530.63, with standard deviations of 38.25,
43.43, 43.98, 33.67, 49.94, 31.94, and 31.19, respectively.
results are the mean of the performances for various methods for different permutations of data
points selected for training and testing data sets. For DC-RBF, the hyperparameter DT is kept fixed
for a given data set size. The hyperparameter is identified using a 5-fold cross-validation technique.
The hyperparameters of the other techniques are calculated by sweeping over a set of candidate
parameters that varies by 3 to 5 orders of magnitude, and then the parameter that does the best on
the training set is chosen; these numbers are reported from [26].
DP-GLM requires the use of Markov chain Monte Carlo methods to reoptimize the
clustering as each data point is added, whereas DC-RBF updates instantly.
Concrete compressive strength (CCS) [62]. The CCS data set provides a
mapping from an eight-dimensional input space to a continuous output. This data
set has low noise and variance. The eight continuous input dimensions are as follows:
cement components, blast furnace slag, fly ash, water, superplasticizer, coarse aggre-
gate, and fine aggregate, all measured in kg per m3 , and the age of the mixture in
days. The response is the compressive strength of the resulting concrete. There are
1,030 observations in this data set. The challenge in working with this data set is to
find a continuous mapping that represents the input-output relationship well in this
multidimensional data.
Similar to the experiment using the CMB data set, we have summarized the
comparison results for this data set in Table 2. The distance threshold parameter
value, 255, is determined using 5-fold cross-validation. We observe that as the number
of observations increases, the performance of the new method is enhanced due to the
nature of our design, which requires a certain number of data points to activate a
cloud. From Table 2, we see that in terms of error our method remains competitive
with DP-GLM, which is computationally much more demanding.
6. Concluding remarks. We propose a fast, recursive, function approximation
method which avoids the need to store the complete history of data (typical of non-
parametric methods) or the specification of hyperparameters for Bayesian priors. The
method assigns locally linear parametric models for regions of the covariate space that
are created dynamically based on the domain values of the input data without the
need for prespecified classification schemes. A weighting scheme is associated with each
locally linear approximation using normalized RBFs. Each local model is updated
recursively with the arrival of each new observation, which is then used to update the
cloud representation before the data are discarded.
The new method is robust in the presence of homoscedastic and heteroscedastic
noise. Through the sole parameter DT , our method automatically determines the
model order for a given data set (one of the most challenging tasks in nonlinear
function approximation). Unlike similar algorithms in the literature, our method has
only one tunable parameter and is asymptotically unbiased.

Table 2
The performance table for the CCS data. In this table the results are reported in terms of MSE
and MAE for various competing methods, including DP-GLM. The experimental setup here is the
Table 3 provides a comparison of different statistical methods that we have con-
sidered in this paper. This table brings together features such as speed, complexity
of implementation, storage requirement, recursivity, the type of noise that can be
handled, and the number of tunable parameters.
Table 3
The benchmark table comparing various statistical function approximation techniques in terms
of overall speed of model updating and model evaluation (Ultra fast, Fast, Very slow), complexity
of implementation (Complex, Intermediate, Simple), storage requirement (All the history,
Statistical representation, None), recursivity (Yes, No), the type of noise they can handle (HEteroscedastic,
HOmoscedastic), and the number of tunable parameters (None, One, Few).
and speed. Parametric methods will always suffer from the need to tune the choice
of basis functions. Nonparametric methods require the storage of the entire history,
REFERENCES
[1] C. Andrieu, N. Freitas, and A. Doucet, Robust full Bayesian learning for radial basis
[30] T. Ishikawa and M. Matsunami, An optimization method based on radial basis function, IEEE
Trans. Magnetics, 33 (1997), pp. 1868–1871.
[31] A. A. Jamshidi and M. J. Kirby, Towards a black box algorithm for nonlinear function
approximation over high-dimensional domains, SIAM J. Sci. Comput., 29 (2007), pp. 941–
963, doi:10.1137/050646457.
[32] A. A. Jamshidi and M. J. Kirby, Skew-radial basis function expansions for empirical modeling,
SIAM J. Sci. Comput., 31 (2010), pp. 4715–4743, doi:10.1137/08072293X.
[33] A. A. Jamshidi and M. J. Kirby, Modeling multivariate time series on manifolds with skew
radial basis functions, Neural Comput., 23 (2011), pp. 97–123.
[34] R. D. Jones, Y. C. Lee, C. W. Barnes, G. W. Flake, K. Lee, P. S. Lewis, and S. Qian,
Function approximation and time series prediction with neural networks, in Proceedings
of the 1990 International Joint Conference on Neural Networks (IJCNN), Vol. 1, IEEE,
1990, pp. 649–665.
[35] D. E. Knuth, Art of Computer Programming. Vol. 2. Seminumerical Algorithms, 3rd ed.,
Addison-Wesley, Reading, MA, 1998.
[36] D. C. Montgomery, E. A. Peck, and G. G. Vining, Introduction to Linear Regression
Analysis, 3rd ed., Wiley-Interscience, New York, 2001.
[37] J. Moody and C. Darken, Fast learning in networks of locally-tuned processing units, Neural
Comput., 1 (1989), pp. 281–294.
[38] H. G. Müller, Weighted local regression and kernel methods for nonparametric curve fitting,
J. Amer. Statist. Assoc., 82 (1988), pp. 231–238.
[39] J. Munkres, Topology, 2nd ed., Prentice Hall, Upper Saddle River, NJ, 2000.
[40] S. Morales-Enciso and J. Branke, Tracking global optima in dynamic environments with
efficient global optimization, European J. Oper. Res., 242 (2015), pp. 744–755.
[41] E. A. Nadaraya, On estimating regression, Theory Probab. Appl., 9 (1964), pp. 141–142.
[42] J. Park and I. W. Sandberg, Universal approximation using radial-basis-function networks,
Neural Comput., 3 (1991), pp. 246–257.
[43] J. Park and I. W. Sandberg, Approximation and radial-basis-function networks, Neural Com-
put., 5 (1993), pp. 305–316.
[44] T. Poggio and F. Girosi, Regularization algorithms for learning that are equivalent to multilayer
networks, Science, 247 (1990), pp. 978–982.
[45] M. J. D. Powell, Radial basis functions for multivariable interpolation: A review, in Algo-
rithms for Approximation, J. C. Mason and M. G. Cox, eds., Clarendon Press, Oxford,
1987, pp. 143–167.
[46] M. J. D. Powell, The theory of radial basis function approximation in 1990, in Advances
in Numerical Analysis, Vol. II, W. Light, ed., Oxford University Press, New York, 1992,
pp. 105–210.
[47] W. B. Powell, Approximate Dynamic Programming, Wiley, 2011.
[48] W. B. Powell and I. O. Ryzhov, Optimal Learning, Wiley, Hoboken, NJ, 2012.
[49] C. E. Rasmussen and Z. Ghahramani, Infinite mixtures of Gaussian process experts, in Ad-
vances in Neural Information Processing Systems 14, MIT Press, Cambridge, MA, 2001,
pp. 881–888.
[50] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning, MIT
Press, Cambridge, MA, 2006.
[51] R. G. Regis and C. A. Shoemaker, A stochastic radial basis function method for the global
optimization of expensive functions, INFORMS J. Comput., 19 (2007), pp. 497–509.
[52] R. G. Regis and C. A. Shoemaker, Improved strategies for radial basis function methods for
global optimization, J. Global Optim., 37 (2007), pp. 113–135.
[53] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning internal representations by
error propagation, in Parallel Distributed Processing, D. E. Rumelhart and J. L. McClel-
land, eds., 1986, pp. 318–362.
[54] D. Ruppert, M. P. Wand, and R. J. Carroll, Semiparametric Regression, Cambridge Series
in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, UK,
2003.
[55] B. Shahbaba and R. M. Neal, Nonlinear models using Dirichlet process mixtures, J. Mach.
Learning Res., 10 (2009), pp. 1829–1850.
[56] J. S. Simonoff, Smoothing Methods in Statistics, Springer, New York, 1996.
[57] Q. Song and N. Kasabov, ECM: A novel on-line, evolving clustering method and its applica-
tions, in Foundations of Cognitive Science, MIT Press, Cambridge, UK, 2001, pp. 631–682.
[58] A. N. Tikhonov and V. Y. Arsenin, Solutions of Ill-Posed Problems, John Wiley & Sons,
New York, 1977.
[59] G. Wahba, Spline bases, regularization, and generalized cross validation for solving approxi-
mation problems with large quantities of data, in Approximation Theory III, W. Cheney,
ed., Academic Press, 1980, pp. 905–912.
[60] G. S. Watson, Smooth regression analysis, Sankhya Ser. A, 26 (1964), pp. 359–372.
[61] P. J. Werbos, Beyond Regression: New Tools for Prediction and Analysis in the Behavioral
Sciences, Ph.D. dissertation, Division of Applied Mathematics, Harvard University, Boston,
MA, 1974.
[62] I. C. Yeh, Modeling of strength of high-performance concrete using artificial neural networks,
Cement Concrete Res., 28 (1998), pp. 1797–1808.