Chapter 1
FFX: Fast, Scalable, Deterministic Symbolic Regression Technology
Trent McConaghy¹
¹ Solido Design Automation Inc., Canada
Abstract
Symbolic regression is a common application for genetic programming (GP).
This paper presents a new non-evolutionary technique for symbolic regression
that, compared to competent GP approaches on real-world problems, is orders of
magnitude faster (taking just seconds), returns simpler models, has comparable
or better prediction on unseen data, and converges reliably and deterministically.
We dub the approach FFX, for Fast Function Extraction. FFX uses a recently-
developed machine learning technique, pathwise regularized learning, to rapidly
prune a huge set of candidate basis functions down to compact models. FFX is
verified on a broad set of real-world problems having 13 to 1468 input variables,
outperforming GP as well as several state-of-the-art regression techniques.
1. Introduction
Consider when we type “A/B” into a math package. This is a least-squares
(LS) linear regression problem. The software simply returns an answer. We do
not need to consider the intricacies of the theory, algorithms, and implemen-
tations of LS regression because others have already done it. LS regression is
fast, scalable, and deterministic. It just works.
This gets to the concept of “technology” as used by Boyd: “We can say that
solving least-squares problems is a (mature) technology, that can be reliably
used by many people who do not know, and do not need to know, the details”
(Boyd and Vandenberghe, 2004). Boyd cites LS and linear programming as
representative examples, with convex optimization getting close. Other examples might include linear algebra, classical statistics, Monte Carlo methods, software compilers, SAT solvers, and CLP solvers.
(McConaghy et al., 2010) asked: “What does it take to make genetic pro-
gramming (GP) a technology? . . . to be adopted into broader use beyond that
of expert practitioners? . . . so that it becomes another standard, off-the-shelf
method in the ’toolboxes’ of scientists and engineers around the world?”
This paper asks what it takes to make symbolic regression (SR) a technology.
SR is the automated extraction of whitebox models that map input variables
to output variables. GP (Koza, 1992) is a popular approach to do SR, with
successful applications to real-world problems such as industrial processing
(Smits et al., 2010; Castillo et al., 2010), finance (Korns, 2010; Kim et al.,
2008), robotics (Schmidt and Lipson, 2006), and integrated circuit design
(McConaghy and Gielen, 2009).
Outside the GP literature, SR is rare; there are only scattered references such
as (Langley et al., 1987). In contrast, the GP literature has dozens of papers on
SR every year; even the previous GPTP had seven papers involving SR (Riolo
et al., 2010). In a sense, the home field of SR is GP. This means, of course,
that when authors aim at SR, they start with GP, and look to modify GP to
improve speed, scalability, reliability, interpretability, etc. The improvements
are typically 2x to 10x, but fall short of the performance that would make SR a
“technology” the way LS or linear programming is.
We are aiming for SR as a technology. What if we did not constrain ourselves
to using GP? To GP researchers, this may seem heretical at first glance. But if
the aim is truly to improve SR, then this should pose no issue. And in fact, we
argue that the GP literature is still an appropriate home for such work, because
(a) GP authors doing SR deeply care about SR problems, and (b) as already
2. SR Problem Definition
Given: X and y, a set of {x_j, y_j}, j = 1..N data samples, where x_j is an n-dimensional point and y_j is a corresponding output value. Determine: a set of symbolic models M = {m_1, m_2, ...} that provide the Pareto-optimal tradeoff between minimizing model complexity f_1(m) and minimizing expected future model prediction error f_2(m) = E_{x,y} L(m), where L(m) is the squared-error loss function (y - m(x))^2. Each model m maps an n-dimensional input x to a scalar
¹ To be precise: attacked directly, without pre-filtering input variables or transforming to a smaller dimensionality.
output value ŷ, i.e. ŷ = m(x). Complexity is some measure that differentiates
the degrees of freedom between different models; we use the number of basis
functions.
We restrict ourselves to the class of generalized linear models (GLMs) (Nelder and Wedderburn, 1972). A GLM is a linear combination of N_B basis functions B_i, i = 1, 2, ..., N_B:

\hat{y} = m(x) = a_0 + \sum_{i=1}^{N_B} a_i B_i(x)    (1.1)
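For concreteness, here is a minimal Python sketch of equation (1.1); the basis functions and coefficients below are placeholders for illustration, not ones FFX would produce.

```python
# Minimal illustration of equation (1.1): yhat = a0 + sum_i a_i * B_i(x).
# The basis functions and coefficients are arbitrary placeholders.
import numpy as np

def glm_predict(x, a0, coefs, bases):
    """Evaluate a generalized linear model at one input vector x."""
    return a0 + sum(a_i * B_i(x) for a_i, B_i in zip(coefs, bases))

bases = [lambda x: x[0],                   # B_1(x) = x0
         lambda x: np.log10(x[1]),         # B_2(x) = log10(x1)
         lambda x: max(0.0, x[2]) * x[0]]  # B_3(x) = max(0, x2) * x0
x = np.array([1.0, 10.0, -2.0])
print(glm_predict(x, a0=0.5, coefs=[2.0, -1.0, 3.0], bases=bases))  # prints 1.5
```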
¹ The middle term (quadratic term, like ridge regression) encourages correlated variables to group together rather than letting a single variable dominate, and makes convergence more stable. The last term (l1 term, like lasso) drives towards a sparse model with few coefficients, but discourages any coefficient from being too large. ||a||_1 = \sum_i |a_i|.
Figure 1-1. A path of regularized regression solutions: each vertical slice of the plot gives a vector of coefficient values a for each of the respective basis functions. Going left to right (decreasing λ), each coefficient a_i follows its own path, starting at zero then increasing in magnitude (and sometimes decreasing).
each ai one at a time, updating the ai through a trivial formula while holding
the rest of the parameters fixed, and repeating until a stabilizes. For speed, it
uses “hot starts”: at each new point on the path, coordinate descent starts with
the previous point’s a.
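To make this concrete, below is a bare-bones sketch of that coordinate-descent update for the elastic net, following the soft-threshold formula in (Friedman et al., 2010); it assumes centered y and standardized columns of X, and is illustrative only, not the FFX implementation. The a_init argument is the hook for hot starts.

```python
# Sketch of elastic-net coordinate descent: update each coefficient a[j] in turn
# via a soft-threshold formula, holding the others fixed, until `a` stabilizes.
# Assumes centered y and standardized columns of X; illustrative only.
import numpy as np

def soft_threshold(z, gamma):
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def coordinate_descent(X, y, lam, rho=0.5, n_passes=100, tol=1e-8, a_init=None):
    N, n = X.shape
    a = np.zeros(n) if a_init is None else a_init.copy()  # a_init enables "hot starts"
    for _ in range(n_passes):
        a_old = a.copy()
        for j in range(n):
            r_j = y - X @ a + X[:, j] * a[j]               # residual excluding feature j
            z = (X[:, j] @ r_j) / N
            a[j] = soft_threshold(z, lam * rho) / ((X[:, j] @ X[:, j]) / N + lam * (1.0 - rho))
        if np.max(np.abs(a - a_old)) < tol:                # `a` has stabilized
            break
    return a
```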
Some highly useful properties of pathwise regularized learning are:
• Learning speed is comparable to or better than LS.
• Unlike LS, it can learn when there are fewer samples than coefficients (N < n).
• Can learn thousands or more coefficients.
• It returns a whole family of coefficient vectors, with different tradeoffs
between number of nonzero coefficients and training accuracy.
For further details, we refer the reader to (Zou and Hastie, 2005; Friedman
et al., 2010).
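As a concrete, hedged illustration of these properties, the sketch below uses scikit-learn's enet_path, which sweeps λ from large to small with warm starts internally; the data is synthetic, and nothing here is specific to FFX.

```python
# Sketch: compute a whole elastic-net path on synthetic data and report how
# many coefficients are nonzero at each lambda (called alpha in scikit-learn).
import numpy as np
from sklearn.linear_model import enet_path

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                 # N=200 samples, n=50 candidate features
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)

# l1_ratio mixes the l1 (lasso-like) and quadratic (ridge-like) penalty terms.
alphas, coefs, _ = enet_path(X, y, l1_ratio=0.5, n_alphas=100, eps=1e-4)

# Each column of `coefs` is one coefficient vector a along the path; moving to
# smaller alphas, more coefficients become nonzero.
for alpha, a in list(zip(alphas, coefs.T))[::20]:
    print(f"alpha={alpha:.4g}  nonzero coefficients={np.count_nonzero(a)}")
```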
4. FFX Algorithm
The FFX algorithm has three steps, which we now elaborate.
FFX Step One. Here, FFX generates a massive set of basis functions, where
each basis function combines one or more interacting nonlinear subfunctions.
Table 1-1 gives the pseudocode. Steps 1-10 generate univariate bases, and
steps 11-20 generate bivariate bases (and higher orders of univariate bases). The
algorithm simply has nested loops to generate all the bases. The eval function (lines 5, 9, and 18) evaluates a base b given input data X; that is, it runs the function defined by b on the input vectors in X. The ok() function returns False if any evaluated value is inf, -inf, or NaN, e.g. as caused by divide-by-zero, log on negative values, or negative exponents on negative values. Therefore, ok() filters away all poorly-behaving expressions. Line 16 means that expressions of the form op() * op() are not allowed; these are deemed too complex.
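The following rough Python sketch conveys the flavor of step one, using the operator and exponent settings described in section 5; all function and variable names are illustrative, and Table 1-1's pseudocode remains authoritative.

```python
# Rough sketch of FFX step one: enumerate univariate bases (exponents on each
# variable, plain and wrapped in unary operators), then bivariate products,
# filtering out poorly-behaved expressions with ok(). Illustrative only.
import numpy as np

UNARY_OPS = {"abs": np.abs,
             "log10": np.log10,
             "max0": lambda v: np.maximum(0.0, v),
             "min0": lambda v: np.minimum(0.0, v)}

def ok(values):
    """Reject a base if any evaluated value is inf, -inf, or NaN."""
    return bool(np.all(np.isfinite(values)))

def univariate_bases(X, var_names, exponents=(0.5, 1.0, 2.0)):
    """Candidate one-variable bases as (expression, evaluated values) pairs."""
    bases = []
    with np.errstate(all="ignore"):          # ok() filters the inf/NaN cases
        for j, name in enumerate(var_names):
            for e in exponents:
                v = X[:, j] ** e
                if ok(v):
                    bases.append((f"{name}^{e:g}", v))
                for op_name, op in UNARY_OPS.items():
                    w = op(X[:, j] ** e)
                    if ok(w):
                        bases.append((f"{op_name}({name}^{e:g})", w))
    return bases

def bivariate_bases(uni_bases):
    """Products of univariate bases; op()*op() forms are deemed too complex and skipped."""
    bases = []
    for i, (expr_i, v_i) in enumerate(uni_bases):
        for expr_j, v_j in uni_bases[i + 1:]:
            if "(" in expr_i and "(" in expr_j:
                continue                     # both factors operator-wrapped: not allowed
            prod = v_i * v_j
            if ok(prod):
                bases.append((f"{expr_i} * {expr_j}", prod))
    return bases
```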
FFX Step Two. Here, FFX uses pathwise regularized learning (Zou and Hastie,
2005) to identify the best coefficients and bases when there are 0 bases, 1 base,
2 bases, and so on.
Table 1-2 gives the pseudocode. Steps 1-2 create a large matrix X_B, which holds each basis function in B evaluated on the input matrix X. Steps 3-4 determine a log-spaced set of N_λ values of λ; see (Zou and Hastie, 2005) for the motivation. Steps 5-16 are the main work, doing pathwise learning.
At each iteration of the loop, it performs an elastic-net linear fit (line 11) from X_B to y to find the linear coefficients a. As the loop iterates, N_bases tends to increase, because with smaller λ there is more pressure to explain the training data better, therefore requiring more nonzero coefficients. Once a coefficient value a_i is nonzero, its magnitude tends to increase, though sometimes it will decrease as another coefficient proves more useful in explaining the data.
FFX step two is like standard pathwise regularized learning, except that whereas the standard approach covers the whole range of λ such that all coefficients eventually get included, FFX stops as soon as there are more than N_max-bases (e.g. 5) nonzero coefficients (line 9). Naturally, this is because in the SR application, expressions with more than N_max-bases bases are no longer interpretable. In practice, this makes an enormous difference to runtime; for example, if there are 7000 possible bases but the maximum number of bases is 5, and assuming that coefficients get added approximately uniformly with decreasing λ, then only 5/7000 = 1/1400 ≈ 0.07% of the path must be covered.
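A hedged sketch of this early-stopping loop, using scikit-learn's ElasticNet with warm_start as the per-λ fitter (its penalty parameterization differs slightly from the paper's, and the names below are illustrative, not the reference implementation):

```python
# Sketch of FFX step two: sweep lambda from large to small, warm-starting each
# elastic-net fit from the previous one, and stop as soon as more than
# max_bases coefficients are nonzero. XB is the (N x Nbases) matrix of
# evaluated basis functions. Illustrative only.
import numpy as np
from sklearn.linear_model import ElasticNet

def pathwise_fit(XB, y, n_lambdas=1000, rho=0.5, eps=1e-70, max_bases=5):
    lambda_max = np.max(np.abs(XB.T @ y)) / (len(y) * rho)
    lambdas = np.logspace(np.log10(lambda_max),
                          np.log10(eps * lambda_max), n_lambdas)
    model = ElasticNet(l1_ratio=rho, warm_start=True, fit_intercept=True,
                       max_iter=5000)
    coef_path = []
    for lam in lambdas:                          # large lambda -> small lambda
        model.set_params(alpha=lam)
        model.fit(XB, y)                         # hot start from previous coefficients
        coef_path.append((lam, model.intercept_, model.coef_.copy()))
        if np.count_nonzero(model.coef_) > max_bases:
            break                                # model no longer interpretable: stop
    return coef_path
```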
# Compute X_B
1. for i = 1 to length(B)
2.     X_B[i] = eval(B[i], X)

# Nondominated filtering
9.  f1 = numBases(m), for each m in M_cand
10. f2 = testError(m) or trainError(m), for each m in M_cand
11. J = nondominatedIndices(f1, f2)
12. M = M_cand[j], for each j in J
13. return M
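The nondominatedIndices idea in the fragment above can be sketched as follows (minimizing both number of bases and error); this is illustrative, not the paper's exact pseudocode:

```python
def nondominated_filter(models):
    """Keep only models that are Pareto-optimal in (number of bases, error).
    `models` is a list of (num_bases, error, model) tuples."""
    front = []
    for nb, err, m in sorted(models, key=lambda t: (t[0], t[1])):
        # Keep a model only if its error beats every simpler model kept so far.
        if not front or err < front[-1][1]:
            front.append((nb, err, m))
    return front
```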
5. Medium-Dimensional Experiments
This section presents experiments on medium-dimensional problems. (Sec-
tion 7 will give higher-dimensional results.)
Experimental Setup
Problem Setup. We use a test problem used originally in (Daems et al.,
2003) for posynomial fitting, but also studied extensively using GP-based SR
(McConaghy and Gielen, 2009). The aim is to model performances of a well-
known analog circuit, a CMOS operational transconductance amplifier (OTA).
The goal is to find expressions for the OTA's performance measures: low-frequency gain (ALF), phase margin (PM), positive and negative slew rate (SRp, SRn), input-referred offset voltage (Voffset), and unity-gain frequency (fu).
Each problem has 13 input variables. Input variable space was sampled with
orthogonal-hypercube Design-Of-Experiments (DOE) (Montgomery, 2009),
with scaled dx=0.1 (where dx is % change in variable value from center value),
to get 243 samples. Each point was simulated with SPICE. These points were
used as training data inputs. Testing data points were also sampled with DOE
and 243 samples, but with dx=0.03. Thus, this experiment leads to somewhat
localized models; we could just as readily model a broader design space, but
this allows us to compare the results to (Daems et al., 2003). We calculate
normalized mean-squared error on the training data and on the separate testing data: nmse = \sqrt{ \sum_i ((\hat{y}_i - y_i) / (\max(y) - \min(y)))^2 }.
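Written out as code, assuming numpy arrays and the formula as reconstructed above:

```python
import numpy as np

def nmse(yhat, y):
    """Normalized error: sqrt of summed squared errors, scaled by the range of y."""
    return float(np.sqrt(np.sum(((yhat - y) / (np.max(y) - np.min(y))) ** 2)))
```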
FFX Setup. Up to N_max-bases = 5 bases are allowed. Operators allowed are: abs(x), log10(x), min(0, x), and max(0, x); exponents on variables are x^0.5 (= sqrt(x)), x^1 (= x), and x^2. By default, denominators are allowed; but if they are turned off, then negative exponents are also allowed: x^-0.5 (= 1/sqrt(x)), x^-1 (= 1/x), and x^-2 (= 1/x^2). The elastic net settings were ρ = 0.5, λ_max = max|X^T y|/(N·ρ), eps = 10^-70, and N_λ = 1000.
Because the algorithm is not GP, there are no settings for population size,
number of generations, mutation/crossover rate, selection, etc. We emphasize
that the settings in the previous paragraph are very simple, with no tuning
needed by users.
Each FFX run took ≈5 s on a 1-GHz single-core CPU.
Reference GP-SR Algorithm Setup. CAFFEINE is a state-of-the-art GP-
based SR approach that uses a thoughtfully-designed grammar to constrain SR
functional forms such that they are interpretable by construction. Key settings
are: up to 15 bases, population size 200, and 5000 generations. Details are in
(McConaghy and Gielen, 2009). Each CAFFEINE run took ≈10 minutes on a
1-GHz CPU.
Experimental Results
This section experimentally investigates FFX behavior, and validates its
prediction abilities on the set of six benchmark functions.
FFX Data Flow. To start with, we examine FFX behavior in detail on a test
problem. Recall that FFX has three steps: generating the bases, pathwise
learning on the bases, and pruning the results via nondominated filtering. We
examine the data flow of these steps on the ALF problem.
The first step in FFX generated 176 candidate one-variable bases, as shown
in Table 1-4. These bases combined to make 3374 two-variable bases, some
of which are shown in Table 1-5. This made a total of 3550 bases for the numerator, and another 3550 for the denominator.
Table 1-4. For FFX step 1: The 176 candidate one-variable bases. For each of the 13 input variables v ∈ {vsg1, vgs2, vds2, vsg3, vsg4, vsg5, vsd5, vsd6, id1, id2, ib1, ib2, ib3}, the candidates are the exponentiated forms v^0.5, v, and v^2, each used directly and wrapped in the unary operators abs(·), max(0, ·), min(0, ·), and log10(·); e.g. vsg1^0.5, abs(vsg1^0.5), max(0, vsg1^0.5), min(0, vsg1^0.5), log10(vsg1^0.5), vsg1, abs(vsg1), ..., log10(vsg1^2).
Table 1-5. For FFX step 1: Some candidate two-variable bases (there are 3374 total).
The second FFX step applied pathwise regularized learning on the 7100 bases (3550 numerator + 3550 denominator), as illustrated in Figure 1-1 (shown earlier to introduce pathwise learning). It started with the maximum lambda (λ), where all coefficient values were 0.0, and therefore there are 0 bases (far left of the figure). Then, it iteratively decreased λ and updated the coefficient estimates. The first base to get a nonzero coefficient was min(0, vds2^2) * vds2^2 (in the denominator). At a slightly smaller λ, the second base to get a nonzero coefficient was min(0, vsd5^2) * vsd5^2 (also in the denominator). These remain the only two bases for several iterations, until finally, when λ shrinks below 1e4, a third base is added. A fourth base is added shortly after. Pathwise learning continued until the maximum number of bases (nonzero coefficients) was hit.
Figure 1-2. For FFX step 2: Pathwise regularized learning on ALF.
The third and final FFX step applies nondominated filtering to the candidate
models, to generate the Pareto-optimal sets that trade off error versus number of
bases (complexity). Figure 1-3 shows the outcome of nondominated filtering,
for the case when error is training error, and for the case when error is testing
error. Training error for this data is higher than testing error because the training
data covers a broader input variable range (dx = 0.1) than the testing data (dx
= 0.03), as section 5 discussed.
Extracted Whitebox Models. Table 1-6 shows the lowest test-error functions
extracted by FFX, for each test problem. First, we see that the test errors are
all very low, <5% in all cases. Second, we see that the functions themselves
are fairly simple and interpretable, at most having two basis functions. For
ALF , P M , and SRn , FFX determined that using a denominator was better.
We continue to find it remarkable that functions like this can be extracted in
such a computationally lightweight fashion. For SRp , FFX determined that
the most predictive function was simply a constant (2.35e7). Interestingly, it
combined univariate bases of the same variable to get higher-order bases, for example min(0, vds2^2) * vds2^2 in ALF.
Recall that FFX is designed to return not just the function with the
lowest error, but a whole set of functions that trade off error and complexity. It
Figure 1-3. For FFX step 3: results of nondominated filtering to get the Pareto optimal tradeoff
of error versus number of bases, in modeling ALF . Two cases are shown: when error is on the
training data, and when error is on testing data.
Table 1-6. Functions with lowest test error as extracted by FFX, for each test problem. Ex-
traction time per problem was ≈5 s on a 1-GHz machine.
Problem   Test error (%)   Extracted Function
ALF       3.45             37.020 / (1.0 - 1.22e-4*min(0, vds2^2)*vds2^2 - 4.72e-5*min(0, vsd5^2)*vsd5^2)
PM        1.51             90.148 / (1.0 - 8.79e-6*min(0, vsg1^2)*vsg1^2 + 2.28e-6*min(0, vds2^2)*vds2^2)
SRn       2.10             -5.21e7 / (1.0 - 8.22e-5*min(0, vgs2^2)*vgs2^2)
does this efficiently by exploiting pathwise learning. Table 1-7 illustrates the
Pareto optimal set extracted by FFX for the ALF problem.
Prediction Abilities. Figure 1-4 compares FFX to GP-SR, linear models, and
quadratic models in terms of average test error and build time. The linear
and quadratic models took <1 s to build, using LS learning. GP-SR and FFX
predict very well, and linear and quadratic models predict poorly. GP-SR has
much longer model-building time than the rest. In sum, FFX has the speed of
linear/quadratic models with the prediction abilities of GP-SR.
Table 1-7. Pareto optimal set (complexity vs. test error) for ALF extracted by FFX.
Test error (%)   Extracted Function
3.72             37.619
3.55             37.379 / (1.0 - 6.78e-5*min(0, vds2^2)*vds2^2)
3.45             37.020 / (1.0 - 1.22e-4*min(0, vds2^2)*vds2^2 - 4.72e-5*min(0, vsd5^2)*vsd5^2)
Figure 1-4. Average test error (across six test problems) versus build time, comparing linear,
quadratic, FFX, and GP-SR models.
Table 1-8 compares the test error for linear, quadratic, FFX, and GP-SR
models; plus the approaches originally compared in (McConaghy and Gielen,
2005): posynomial (Daems et al., 2003), a modern feedforward neural network
(FFNN) (Ampazis and Perantonis, 2002), boosting the FFNNs, multivariate
adaptive regression splines (MARS) (Friedman, 1991), least-squares support
vector machines (SVM) (Suykens et al., 2002), and kriging (gaussian process
models) (Sacks et al., 1989). Lowest-error values are in bold.
From Table 1-8, we see that of all the modeling approaches, FFX has the
best average test error; and best test error in four of the six problems, coming
close in the remaining two.
Table 1-8. Test error (%) on the six medium-dimensional test problems.
Approach   ALF   PM   SRn   SRp   Voffset   fu   Avg.
6. FFX Scaling
Experimental Setup
So far, we have tested FFX on several problems with 13 input variables.
What about larger real-world problems? We consider the real-world integrated
circuits listed in Table 1-9. The aim is to map process variables to circuit
performance outputs. Therefore, these problems have hundreds or thousands
of input variables.
The data was generated by performing Monte Carlo sampling: drawing
process points from the process variables’ pdf, and simulating each process
point using Hspice™, to get output values. The opamp and voltage reference
had 800 Monte Carlo sample points, the comparator and GMC filter 2000,
and bitcell and sense amp 5000. The data is chosen as follows: sort the data
according to the y-values; every 4th point is used for testing; and the rest are
used for training.
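A sketch of this split, assuming numpy arrays X and y (illustrative only):

```python
import numpy as np

def sorted_split(X, y):
    """Sort samples by y, hold out every 4th point for testing, train on the rest."""
    order = np.argsort(y)
    test_idx = order[::4]                     # every 4th point, in sorted-by-y order
    train_idx = np.setdiff1d(order, test_idx)
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]
```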
Table 1-10. FFX runs on each of these settings, and merges the results.
Interactions   Denominator   Exponentials   Log/Abs Operators   Hinge Functions   Notes
linear
Y quadratic
Y Y Y
Y Y Y Y
Y Y
Y Y
Y Y Y
Y Y Y
Y Y
Y Y Y
FFX settings were like in section 5, except up to 250 bases were allowed.
The overall runtime per problem was ≈30 s on a single-core 1-GHz CPU.
7. High-Dimensional Experiments
This section presents results using the scaled-up FFX, on the high-dimensional
modeling problems described in section 6.
Table 1-11 shows the lowest test error found by FFX, compared to other
approaches. FFX always gets the lowest test error, and many other approaches
failed badly. FFX did find it easier to capture some mappings than others.
Table 1-11. Test error (%) on the twelve high-dimensional test problems. The quadratic model failed because it had too few samples for the number of coefficients. GP-SR and FFNN failed, either because test error was ≫100% or because model build time was unreasonably long (several hours).
Approach     opamp AV   opamp BW   opamp PM   opamp SR   bitcell celli   sense amp delay
Lin (LS) 1.7 1.3 1.3 3.2 12.7 3.4
Quad (LS) FAIL FAIL FAIL FAIL 12.5 3.5
FFX 1.0 0.9 1.0 2.0 12.4 3.0
GP-SR FAIL FAIL FAIL FAIL FAIL FAIL
FFNN FAIL FAIL FAIL FAIL FAIL FAIL
Figure 1-5 shows the tradeoff of equations, for each modeling approach.
Each dot represents a different model, having its own complexity and test error.
For a given subplot, the simplest model is a constant, at the far left. It also has
the highest error. As new bases are added (higher complexity), error drops.
The curves have different signatures. For example, we see that when the opamp
BW model (top center) gets 2 bases, its error drops from 6.8% to 1.9%. After
that, additional bases steadily improve error, until the most complex model
having 31 bases has 1% error. Or, for opamp P M (top right), there is little
reduction in error after 15 bases.
In many modeling problems, FFX determined that just linear and quadratic terms were appropriate for the best equations. These include the simpler opamp PM functions, GMC filter IL, GMC filter ATTEN, opamp SR (for errors > 2.5%), and bitcell celli. But in some problems, FFX used more strongly nonlinear functions. These include voltage reference DVREF, sense amp delay, and sense amp pwr. Let us explore some models in more detail.
Figure 1-5. Test error vs. complexity. Top row, left-to-right: opamp AV, opamp BW, opamp PM. Second row: opamp SR, bitcell celli, sense amp delay. Third row: sense amp pwr, voltage reference DVREF, voltage reference PWR. Bottom row: GMC filter ATTEN, GMC filter IL, comparator BW.
Table 1-12 shows some functions that FFX extracted for opamp PM. At 0
bases is a constant, of course. From 1 to 4 bases, FFX adds one more linear base
at a time, gradually adding resolution to the model. At 5 bases, it adds a base
that has both an abs() operator, and an interaction term: abs(dvthn) ∗ dvthn.
It keeps adding bases up to a maximum of 46 bases. By the time it gets to
46 bases, it has actually started using a rational model, as indicated by the
/(1 + . . .) term.
Table 1-13 shows some functions that FFX extracted for voltage reference DVREF. It always determines that a rational with a constant numerator is
the best option. It uses the hinge basis functions, including interactions when
3 or more bases are used. It only needs 8 bases (in the denominator) to capture
error of 0.9%. Of the 105 possible variables, FFX determined that variable
dvthn was highly useful, by reusing it in many ways. dvthp and dxw also had
prominence.
8. Related Work
Related Work in GP
Some GP papers use regularized learning. (McConaghy and Gielen, 2009) runs gradient directed regularization on a large set of enumerated basis functions, and uses those to bias the choice of function building blocks during GP search. FFX is similar, except that it does not run a GP search after the regularized learning step, and it exploits pathwise learning to get an error-complexity tradeoff. (Nikolaev and
Iba, 2001) and (McConaghy et al., 2005) use ridge regression and the PRESS
statistic, respectively, as part of the individual’s fitness function.
1 The authors claim CMA-ES is a “completely” derandomized algorithm, but that is not quite accurate,
because CMA-ES still relies on drawing samples from a pdf. To be completely derandomized, an algorithm
has to be deterministic.
9. Conclusion
This paper presented FFX, a new SR technique that approaches “technology”
level speed, scalability, and reliability. Rather than evolutionary learning,
it uses a recently-developed technique from the machine learning literature:
pathwise regularized learning (Friedman et al., 2010). FFX applies pathwise
learning to an enormous set of nonlinear basis functions, and exploits the path
structure to generate a set of models that trade off error versus complexity. FFX
was verified on six real-world medium-sized SR problems: average training
time is ≈5 s (compared to 10 min with GP-SR), prediction error is comparable
or better than GP-SR, and the models are at least as compact. FFX was scaled
up to perform well on real-world problems with >1000 input variables. Due to
its simplicity and deterministic nature, FFX’s computational complexity could
readily be determined: O(N * n^2), where N is the number of samples and n is the number of input dimensions.
A Python implementation of FFX, along with the real-world benchmark datasets used in this paper, is available at trent.st/ffx.
FFX’s success on a problem traditionally approached by GP raises several
points. First, stochasticity is not necessarily a virtue: FFX’s deterministic
nature means no wondering whether a new run on the same problem would
work. Second, this paper showed how doing SR does not have to mean doing
GP. What about other problems traditionally associated with GP? GP’s greatest
virtue is perhaps its convenience. But GP is not necessarily the only way;
there is the possibility of dramatically different approaches. The problem may
be reframed to be deterministic or even convex. As in the case of FFX for
SR, there could be benefits like speed, scalability, simplicity, and adoptability;
plus a deeper understanding of the problem itself. Such research can help
crystallize insight into which problems benefit most from GP, and where research on GP might be most fruitful; for example, answering specific questions
about the nature of evolution, of emergence and complexity, and of computer
science.
10. Acknowledgment
Funding for the reported research results from Solido Design Automation Inc. is gratefully acknowledged.
References
Ampazis, N. and Perantonis, S. J. (2002). Two highly efficient second-order
algorithms for training feedforward networks. IEEE-EC, 13:1064–1074.
Boyd, Stephen and Vandenberghe, Lieven (2004). Convex Optimization. Cam-
bridge University Press, New York, NY, USA.
Castillo, Flor, Kordon, Arthur, and Villa, Carlos (2010). Genetic program-
ming transforms in linear regression situations. In Riolo, Rick, McConaghy,
Trent, and Vladislavleva, Ekaterina, editors, Genetic Programming Theory
and Practice VIII, volume 8 of Genetic and Evolutionary Computation,
chapter 11, pages 175–194. Springer, Ann Arbor, USA.
Daems, Walter, Gielen, Georges G. E., and Sansen, Willy M. C. (2003).
Simulation-based generation of posynomial performance models for the siz-
ing of analog integrated circuits. IEEE Trans. on CAD of Integrated Circuits
and Systems, 22(5):517–534.
Deb, Kalyanmoy, Pratap, Amrit, Agarwal, Sameer, and Meyarivan, T. (2002). A
fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions
on Evolutionary Computation, 6:182–197.
Fonlupt, Cyril and Robilliard, Denis (2011). A continuous approach to genetic
programming. In Silva, Sara, Foster, James A., Nicolau, Miguel, Giacobini,
Mario, and Machado, Penousal, editors, Proceedings of the 14th European
Conference on Genetic Programming, EuroGP 2011, volume 6621 of LNCS,
pages 335–346, Turin, Italy. Springer Verlag.
Friedman, J. H. (1991). Multivariate adaptive regression splines. Annals of
Statistics, 19(1):1–141.
Friedman, Jerome H., Hastie, Trevor, and Tibshirani, Rob (2010). Regulariza-
tion paths for generalized linear models via coordinate descent. Journal of
Statistical Software, 33(1):1–22.
Hansen, N. and Ostermeier, A. (2001). Completely derandomized self-adaptation
in evolution strategies. Evolutionary Computation, 9(2):159–195.
Hastie, Trevor, Tibshirani, Robert, and Friedman, Jerome (2008). The elements
of statistical learning: data mining, inference and prediction. Springer, 2
edition.
Kim, Minkyu, Becker, Ying L., Fei, Peng, and O’Reilly, Una-May (2008).
Constrained genetic programming to minimize overfitting in stock selec-
tion. In Riolo, Rick L., Soule, Terence, and Worzel, Bill, editors, Genetic
Programming Theory and Practice VI, Genetic and Evolutionary Computa-
tion, chapter 12, pages 179–195. Springer, Ann Arbor.
Korns, Michael F. (2010). Abstract expression grammar symbolic regression.
In Riolo, Rick, McConaghy, Trent, and Vladislavleva, Ekaterina, editors,
Genetic Programming Theory and Practice VIII, volume 8 of Genetic and
Evolutionary Computation, chapter 7, pages 109–128. Springer, Ann Arbor,
USA.
Koza, John R. (1992). Genetic Programming: On the Programming of Com-
puters by Means of Natural Selection. MIT Press, Cambridge, MA, USA.
Langley, Pat, Simon, Herbert A., Bradshaw, Gary L., and Zytkow, Jan M.
(1987). Scientific discovery: computational explorations of the creative pro-
cess. MIT Press, Cambridge, MA, USA.
Leung, Henry and Haykin, Simon (1993). Rational function neural network.
Neural Comput., 5:928–938.
Looks, Moshe (2006). Competent Program Evolution. Doctor of science, Wash-
ington University, St. Louis, USA.
McConaghy, Trent, Eeckelaert, Tom, and Gielen, Georges (2005). CAFFEINE:
Template-free symbolic model generation of analog circuits via canonical
form functions and genetic programming. In Proceedings of the Design
Automation and Test Europe (DATE) Conference, volume 2, pages 1082–
1087, Munich.
McConaghy, Trent and Gielen, Georges (2005). Analysis of simulation-driven
numerical performance modeling techniques for application to analog cir-
cuit optimization. In Proceedings of the IEEE International Symposium on
Circuits and Systems (ISCAS). IEEE Press.
McConaghy, Trent and Gielen, Georges (2006). Double-strength caffeine: fast
template-free symbolic modeling of analog circuits via implicit canonical
form functions and explicit introns. In Proceedings of the conference on
Design, automation and test in Europe: Proceedings, DATE ’06, pages 269–
274, 3001 Leuven, Belgium, Belgium. European Design and Automation
Association.
McConaghy, Trent and Gielen, Georges G. E. (2009). Template-free symbolic
performance modeling of analog circuits via canonical-form functions and
genetic programming. IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, 28(8):1162–1175.
McConaghy, Trent, Vladislavleva, Ekaterina, and Riolo, Rick (2010). Genetic
programming theory and practice 2010: An introduction. In Riolo, Rick,
McConaghy, Trent, and Vladislavleva, Ekaterina, editors, Genetic Program-
ming Theory and Practice VIII, volume 8 of Genetic and Evolutionary
Computation, pages xvii–xxviii. Springer, Ann Arbor, USA.
Montgomery, Douglas C. (2009). Design and analysis of experiments. Wiley,
Hoboken, NJ, 7. ed., international student version edition.
Nelder, J. A. and Wedderburn, R. W. M. (1972). Generalized linear models.
Journal of the Royal Statistical Society, Series A, General, 135:370–384.
Nikolaev, Nikolay Y. and Iba, Hitoshi (2001). Regularization approach to induc-
tive genetic programming. IEEE Transactions on Evolutionary Computation, 5(4):359–375.
O’Neill, Michael and Brabazon, Anthony (2006). Grammatical differential evo-
lution. In Arabnia, Hamid R., editor, Proceedings of the 2006 International
Conference on Artificial Intelligence, ICAI 2006, volume 1, pages 231–236,
Las Vegas, Nevada, USA. CSREA Press.
O’Neill, Michael and Ryan, Conor (2003). Grammatical Evolution: Evolution-
ary Automatic Programming in an Arbitrary Language, volume 4 of Genetic
programming. Kluwer Academic Publishers.