Nonlinear Regression Using EXCEL Solver
m = [nΣxy − ΣxΣy] / [nΣ(x²) − (Σx)²]   (3)

and

c = [ΣyΣ(x²) − ΣxΣxy] / [nΣ(x²) − (Σx)²]   (4)
respectively. The r² value, also known as the correlation index or coefficient of determination, is a value between 0 and 1. It expresses the proportion of variance in the dependent variable explained by the independent variable. An r² value of 0 means that knowing x does not help to predict y; as the r² value increases towards 1, the function fits the data more and more accurately. (N.B. By convention, in linear regression the value is written in lower case, r², and in non-linear regression in upper case, R².)
r² = 1 − Σ(y − y_fit)² / [Σ(y²) − (Σy)²/n]   (5)

where y is a data point and y_fit is the corresponding value of the linear fit; the denominator, Σ(y²) − (Σy)²/n, is equal to Σ(y − y_mean)², where y_mean is the average value of the y data. This method of least squares fitting can be used only with data in which the dependent variable is a linear function of the independent variable.
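As a concrete illustration, the slope, intercept and r² of Eqs. (3)–(5) can be computed directly from the raw sums. The following Python sketch is an illustration only (function and variable names are ours, not from the paper):

```python
# Closed-form least-squares fit of y = m*x + c, following Eqs. (3)-(5).
# Names are illustrative, not from the paper.

def linear_fit(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    m = (n * sxy - sx * sy) / (n * sxx - sx * sx)      # Eq. (3): slope
    c = (sy * sxx - sx * sxy) / (n * sxx - sx * sx)    # Eq. (4): intercept
    ss_res = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum(y * y for y in ys) - sy * sy / n      # equals sum((y - y_mean)^2)
    r2 = 1 - ss_res / ss_tot                           # Eq. (5)
    return m, c, r2

# Perfectly linear data should give r2 = 1.
m, c, r2 = linear_fit([0, 1, 2, 3], [1, 3, 5, 7])
print(m, c, r2)  # 2.0 1.0 1.0
```

Because the data here lie exactly on a line, the residual sum of squares is zero and r² is exactly 1.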
2.2. Non-linear regression
Prior to the advent of personal computers and specialist curve fitting programmes, non-linear data would be transformed into a linear form and subsequently analyzed by linear regression (e.g. the Lineweaver–Burk method or Scatchard plots). These transformations could yield inaccurate analyses, as the linear regression was carried out on transformed data, which may distort the experimental error or alter the relationship between the x and y values. This method is outdated and inaccurate and should not be used. Instead, for data that are not described by a linear function, it is necessary to implement a protocol that will fit a non-linear function to the data. A method that is suitable for this procedure is called iterative non-linear least squares fitting. This process uses the
same goal as described for linear regression, i.e. minimize the sum of the squared differences between the data and the fit.

Fig. 1. Linear regression. A: An X-Y Scatter plot illustrating the difference between the data points and the linear fit. B: A residual plot illustrating the difference between data points and the fit. C: The residual is squared to eliminate the effect of positive or negative deviations from the fit. This value is used to calculate the sum of the squares.

A.M. Brown / Computer Methods and Programs in Biomedicine 65 (2001) 191–200

However it differs
from linear regression in that it is an iterative, or
cyclical process. This involves making an initial
estimate of the parameter values. The initial
parameter estimates should be based on prior
experience of the data or a sensible guess based on
knowledge of the function used to fit the data. The first iteration involves computing the sum of squares (SS) based on the initial parameter values. The second
iteration involves changing the parameter values
by a small amount and recalculating the SS. This
process is repeated many times to ensure that
changes in the parameter values result in the
smallest possible value of SS. For linear regres-
sion only a single calculation is required to
provide the lowest value of the SS, because the
second and higher derivatives of the function are
zero. Therefore, the algorithm requires only a
single iteration. However, for non-linear regres-
sion the second and higher derivatives are not
zero, and thus an iterative process is required to
calculate the optimal parameter values. Several
different algorithms can be used in non-linear
regression, including the Gauss–Newton, the Marquardt–Levenberg, the Nelder–Mead and the
steepest descent methods [2]. SOLVER, however,
uses another iteration protocol, which is based on
the robust and reliable generalized reduced gradi-
ent (GRG) method. A detailed description of the
evolution and implementation of this code can be
found elsewhere [3,4]. All of these algorithms have
similar properties. They all require the user to
input initial parameter values and use these values
to provide a better estimate of the parameters
employing an iterative process. With the same set
of data all of these methods should yield the same
parameter values.
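The iterative logic described above can be sketched in a few lines of Python. The crude accept-if-better search below is purely illustrative and is not the GRG algorithm SOLVER uses (nor Gauss–Newton or Marquardt–Levenberg); all names are invented:

```python
# A deliberately simple illustration of iterative least-squares fitting:
# perturb each parameter by a small step and keep the change only when the
# sum of squares (SS) decreases. Real algorithms choose their steps far
# more cleverly, but the cycle of "change parameters, recompute SS" is the same.

def sum_of_squares(params, xs, ys, model):
    return sum((y - model(x, params)) ** 2 for x, y in zip(xs, ys))

def iterate_fit(params, xs, ys, model, step=0.1, n_iter=500):
    params = list(params)
    best = sum_of_squares(params, xs, ys, model)
    for _ in range(n_iter):
        for i in range(len(params)):
            for delta in (step, -step):
                trial = list(params)
                trial[i] += delta
                ss = sum_of_squares(trial, xs, ys, model)
                if ss < best:            # keep only improvements
                    params, best = trial, ss
    return params, best

# Recover the slope and intercept of a straight line, starting from (0, 0).
line = lambda x, p: p[0] * x + p[1]
params, ss = iterate_fit([0.0, 0.0], [0, 1, 2, 3], [1, 3, 5, 7], line)
print(params, ss)  # close to [2.0, 1.0] with SS near 0
```

With a fixed step size the search stalls within about one step of the optimum, which is enough to show the principle.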
The following example illustrates how to use
the SOLVER function in Excel to fit data with user-input non-linear functions. The process by which the curve fit proceeds is called iterative non-linear least squares regression. The example
used is a sigmoidal function (the Boltzmann equa-
tion), which describes the probability that an ion
channel will be open relative to voltage. This
example is used purely for illustrative purposes
and it is not necessary that the reader understand
anything about ion channels.
The Boltzmann function is described by the
following function:
y = 1 / (1 + exp((V − E)/Slope))   (6)
where y is the dependent variable, E is the
independent variable (Voltage), and V and Slope
are the parameter values. V is the half activation
voltage, which describes the voltage at which half
of the ion channels are open (i.e. where y=0.5).
Slope describes the slope at the point V and
indicates the steepness of the curve, or sensitivity
to voltage of the ion channel.
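For readers who prefer code to equations, Eq. (6) translates directly into a short function. The Python sketch below is illustrative; the parameter names follow the text:

```python
from math import exp

# Eq. (6) as a plain Python function. V is the half-activation voltage,
# slope the steepness parameter, and E the membrane voltage.

def boltzmann(E, V, slope):
    return 1.0 / (1.0 + exp((V - E) / slope))

# By definition y = 0.5 when E equals V, whatever the slope.
print(boltzmann(-10.0, -10.0, 12.0))  # 0.5
```

A quick check of the definition of V: at E = V the exponent is zero, so y = 1/(1 + 1) = 0.5.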
2.3. Configuring the spreadsheet for non-linear
regression
In order to perform non-linear regression anal-
ysis using the Boltzmann function, the following
procedure must be carried out:
1. Input onto a spreadsheet the raw data in two
columns, the X column containing the indepen-
dent variable (Voltage), and the Y column con-
taining the dependent variable (Data). This is
illustrated as Columns A and B (Voltage and
Data, respectively) of Fig. 2A, where Voltage is
the independent variable and Data is the depen-
dent variable.
2. Graph the data contained in cells A2 to B20
in a Scatter plot. The data points are displayed as filled squares.
3. Enter labels in cells G1 to G8 to describe the
contents of the adjacent cells. In cell G1 enter V,
which will describe the parameter in cell H1. For
cell H1, select the Insert menu, choose Name, then Define. Name the cell V. Similarly, for cells G2 to G8 enter Slope, Mean of y, df, S.E. of y, R2, Critical t and CI, respectively. Name cells H2 to H8 Slope, Mean_of_y, df, S.E._of_y, RSQ, Critical_t and CI, respectively.

Fig. 2. Spreadsheet template for non-linear regression. A: The data are entered into Columns A and B, with Column C used to generate the fit based on the parameters in Cells H1 and H2. Columns D and E calculate the 95% confidence interval around the fit. Cell H6 is used to calculate R². B: The solution of the fit calculated by SOLVER.

4. In cell C2 enter the Boltzmann function (Eq. (6)) in a form that Excel recognizes:

=1/(1+EXP((V-A2)/Slope))

where V and Slope refer to the parameter values in cells H1 and H2.
5. Copy the equation from cell C2 down to and including C20. Note that A2 is a relative reference, which specifies the location of a cell relative to the cell in which the calculation will be carried out, in this case cell C2. Thus copying from Rows 2 to 20 changes the value of A2 to reflect the appropriate Row.
6. The mean of the y values is calculated by
entering the following formula in H3.
=AVERAGE(B2:B20)
7. The degrees of freedom is defined as the number of data points minus the number of parameters in the function. It is calculated by entering the following formula in H4.
=COUNT(B2:B20)-COUNT(H1:H2)
8. The standard error of the y values is defined as

S.E. = sqrt[ Σ(y − y_fit)² / df ]   (7)

and is calculated by entering the following formula in H5

=SQRT(SUM((B2:B20-C2:C20)^2)/df)
However as this formula must be expressed as an
array formula, press Ctrl +Shift +Enter. This en-
closes the whole formula within a pair of curly
brackets ({}), denoting it as an array formula.
9. The R² value, the correlation index or coefficient of determination, is defined as

R² = 1 − Σ(y − y_fit)² / Σ(y − y_mean)²   (8)

and is calculated by entering the following formula in H6, expressing it as an array formula as described above

=1-SUM((B2:B20-C2:C20)^2)/SUM((B2:B20-Mean_of_y)^2)
10. In order for the confidence interval of the fit to be calculated, the critical t value at a significance level of 95% is calculated by entering the following formula in H7.

=tinv(0.05,df)

The confidence interval is defined as

y_fit ± Critical_t*S.E._of_y

Thus in H8 enter

=Critical_t*S.E._of_y

Enter the following formula in D2

=C2+CI

and copy it down to D20. Similarly enter =C2-CI in E2 and copy down to E20. This calculates the upper and lower confidence limits (95%) of the fit.
11. The S.E. of the y values, R² and CI are
automatically calculated: 0.134, 0.872 and 0.283,
respectively.
12. Insert initial estimates of the parameters V and Slope into cells H1 and H2, respectively. Approximate estimates are -20 and 10, respectively. Fig. 2A illustrates the spreadsheet template with the formulas used in the fitting protocol displayed.
13. Graph Columns C, D and E versus Column
A such that they are displayed as continuous lines
on the graph, as illustrated in Fig. 3A. It can be seen that the initial estimate (thick line) is not a good fit of the data, with large confidence limits (thin lines). The following section describes manipulations that allow SOLVER to improve the fit.
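Steps 1–13 can be mirrored outside Excel. The Python sketch below reproduces the template's calculations; because the paper's raw data appear only in its Fig. 2, synthetic values generated from an assumed Boltzmann curve stand in for Columns A and B:

```python
from math import exp, sqrt

# Mirror of the spreadsheet template (steps 1-13). Synthetic data generated
# with V = -10, Slope = 12 (assumed values, for illustration only).

def boltzmann(E, V, slope):
    return 1.0 / (1.0 + exp((V - E) / slope))

voltages = [-80, -70, -60, -50, -40, -30, -20, -10, 0, 10, 20, 30, 40]  # Column A
data = [boltzmann(E, -10.0, 12.0) for E in voltages]                    # Column B

V, slope = -20.0, 10.0                              # initial estimates (H1, H2)
fit = [boltzmann(E, V, slope) for E in voltages]    # Column C

mean_y = sum(data) / len(data)                      # H3: =AVERAGE(B2:B20)
df = len(data) - 2                                  # H4: data points - parameters
se_y = sqrt(sum((y - f) ** 2 for y, f in zip(data, fit)) / df)   # H5, Eq. (7)
ss_res = sum((y - f) ** 2 for y, f in zip(data, fit))
ss_tot = sum((y - mean_y) ** 2 for y in data)
r_squared = 1 - ss_res / ss_tot                     # H6, Eq. (8)
critical_t = 2.201   # H7: tinv(0.05, df) for df = 11, from standard t tables
ci = critical_t * se_y                              # H8
upper = [f + ci for f in fit]                       # Column D
lower = [f - ci for f in fit]                       # Column E
print(round(r_squared, 3), round(ci, 3))
```

Note that the hard-coded critical t value replaces Excel's tinv() and is correct only for the 11 degrees of freedom of this synthetic data set.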
2.4. Implementation of SOLVER
The above protocol sets up the spreadsheet template that SOLVER requires in order to fit a curve to the data. This method can be used to fit data with any user-input non-linear function. Simply enter the appropriate parameter values in Column H and the function, in a form that Excel recognizes, in Column C. Carry out the following:
14. Open the SOLVER function, which can be
found under the Tools menu. The Dialogue box
illustrated in Fig. 4A appears. If SOLVER is not
in this menu it should be installed. See Excel
documentation for installation procedure.
Fig. 3. Boltzmann fit of electrophysiological data. A: This graph displays the experimental data points (filled squares), the fit based on the initial parameter estimates (thick line), and the 95% confidence intervals (thin lines) around the fit. B: The fit as calculated by SOLVER. Note how the fit more accurately overlies the data than the initial estimates, and the CI are closer to the fit.
15. In the Set Target Cell box enter RSQ.
16. Set the Equal To option to Max. SOLVER tries to maximize the value of R².
17. In the By Changing Cells box enter V, Slope.
18. In the Subject to the Constraints box enter

V <= 0
V >= -20

This determines the range over which SOLVER will find the best fitting value of V. It can be seen from Fig. 2 that the value of V at y=0.5 lies between 0 and -20. Constraints are used to impose limits over the range of values used to define the parameters. Although it is intuitive that the Slope is positive at y=0.5, it is difficult to estimate the value, so no constraints are applied to Slope.
19. Choose Solve to perform the fit. The programme will iteratively cycle through the fitting routine, changing the parameter values of V and Slope until the largest value of R² is calculated.

These changes will be displayed on the spreadsheet template, as illustrated in Fig. 2B. The optimal values of V and Slope are -10.317 and 12.194, respectively, and the maximal value of R² is 0.997. The continuous thick line in Fig. 3B illustrates the best fit and it is clear that it is an improvement over the fit provided by the initial parameter values. Additionally, the confidence intervals around the fit have been reduced.

2.5. Controlling advanced SOLVER features

The default SOLVER settings can be changed by opening the Solver Options Dialogue box (Fig. 4B). Each option has a default setting that is appropriate for most situations but that can be changed. The options most relevant to the protocol described in this paper are described below.

Fig. 4. The built-in SOLVER function. A: The SOLVER Dialogue box used as an interface between the SOLVER function and data on the spreadsheet. B: The fit can be fine-tuned using the Options Dialogue box.
Max time: Specifies the amount of time in seconds that SOLVER will be allowed to run before stopping. The default value is 100 s.
Iterations: Specifies the number of iterations that SOLVER will carry out before stopping. The default value is 100. If SOLVER finds the optimal solution before either of these limits is reached it will present the results.
Precision: Controls the precision of solutions by using the number entered to determine whether the value of a constraint meets a target or satisfies a lower or upper bound. The default value is 1×10⁻⁶. The higher the precision, the more time taken to reach a solution.
Tolerance: The percentage by which the target cell (RSQ in the example described here) of a solution satisfying the integer constraints can differ from the true optimal value and still be considered acceptable. This option applies only to problems with integer constraints. A higher tolerance tends to speed up the solution process. The default value is 5.
Convergence: This value tells SOLVER when to stop the iterative process. When the relative change in the target cell value is less than the number in the Convergence box for the last five iterations, SOLVER stops. The smaller the convergence value, the more time SOLVER takes to reach a solution. The default value is 0.001.
Assume Linear Model: This box should be checked only if the model to be solved is linear; otherwise, as in the case of the non-linear regression described here, leave the box unchecked.
Use Automatic Scaling: Select to use automatic scaling when inputs and outputs have large differences in magnitude. For example, if values such as 1×10⁻¹⁴ are entered, rounding off errors can be large. It is advised to keep this box checked for all SOLVER models.
Assume Non-Negative: Causes SOLVER to assume a lower limit of 0 for all adjustable cells for which no constraints have been set.
Show Iteration Results: Select to have SOLVER pause to show the results of each iteration.
Estimates: Determines the approach used to obtain subsequent estimates of the basic variable values at the outset of each one-dimensional search. Tangent uses linear extrapolation from a tangent vector; Quadratic extrapolates the minimum (or maximum) of a quadratic fitted to the function at its current point. The Tangent choice is slower but more accurate.
Derivatives: Specifies the differencing used to estimate partial derivatives of the objective and constraint functions. Forward: the point from the previous iteration is used in conjunction with the current point. This reduces the recalculation time required for finite differencing, which can account for up to half of the total solution time. Central: central differencing relies only on the current point and perturbs the decision variables in opposite directions from that point. Although this involves more recalculation time, it may result in a better choice of search direction when the derivatives are rapidly changing, and hence fewer total iterations.
Search: Specifies the algorithm used at each iteration to determine the direction to search. Newton: the default choice, requiring more memory but fewer iterations than the Conjugate gradient method. Conjugate: requires less memory than the Newton method but typically needs more iterations to reach a particular level of accuracy.
Load Model: Loads a previously saved fitting routine.
Save Model: Allows the user to save the current fitting routine for future use.
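The overall task SOLVER performs in steps 14–19, maximizing RSQ by adjusting V and Slope subject to the constraints on V, can be imitated with a simple search loop. The Python sketch below is illustrative only: the data are synthetic and the naive accept-if-better search stands in for SOLVER's far more sophisticated GRG method:

```python
from math import exp

# Imitation of the SOLVER run of steps 14-19: maximize R-squared over V and
# Slope, with V constrained to lie between -20 and 0. Synthetic data are
# generated with V = -10, Slope = 12 (assumed values).

def boltzmann(E, V, slope):
    return 1.0 / (1.0 + exp((V - E) / slope))

def r_squared(params, voltages, data):
    V, slope = params
    fit = [boltzmann(E, V, slope) for E in voltages]
    mean_y = sum(data) / len(data)
    ss_res = sum((y - f) ** 2 for y, f in zip(data, fit))
    ss_tot = sum((y - mean_y) ** 2 for y in data)
    return 1.0 - ss_res / ss_tot

voltages = list(range(-80, 50, 10))
data = [boltzmann(E, -10.0, 12.0) for E in voltages]

params = [-20.0, 10.0]                    # initial estimates of V and Slope
best = r_squared(params, voltages, data)
for step in (1.0, 0.1, 0.01):             # progressively finer search steps
    improved = True
    while improved:
        improved = False
        for i in (0, 1):
            for delta in (step, -step):
                trial = list(params)
                trial[i] += delta
                trial[0] = min(0.0, max(-20.0, trial[0]))   # constraint on V
                r2 = r_squared(trial, voltages, data)
                if r2 > best:             # accept only improvements
                    params, best = trial, r2
                    improved = True
print(params, best)  # converges near V = -10, Slope = 12, R-squared near 1
```

Clamping V to [-20, 0] plays the role of the two constraint lines entered in step 18; because the synthetic data were generated by the model itself, the search recovers the generating parameters almost exactly.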
3. Conclusion
Non-linear regression is a powerful technique
for standardizing data analysis. The advent of
personal computers has rendered linear transfor-
mation of data obsolete, allowing non-linear re-
gression to be carried out quickly and reliably by
non-specialist users. While the method described in this paper requires that the user have a basic knowledge of spreadsheets, it is not required that the user has an intimate understanding of the mathematics behind the processes involved in curve fitting. This subject is beyond the knowledge of most biologists, involving calculus, matrices and statistics. What is important, however, is that the user understands enough about the data to be fit to use the correct type of analysis, and to judge goodness of fit from calculated estimates.
This paper does not address the issue of which functions are suitable to describe individual data, but this topic is discussed in detail elsewhere, where excellent guides to determining goodness of fit of a function using residual plots are described [2,5,6].
3.1. Assessment of goodness of fit

The R² value calculated in this paper is designed to give the user an estimate of goodness of fit of the function to the data, i.e. we assume that we are using an appropriate function to describe the data, but we want to know how accurately the function describes or fits the data. The R² value is called the coefficient of determination and its value represents the fraction of the overall variance of the dependent variable that is explained by the independent variable. It is calculated from the sum of the squares of the residuals and the sum of the squares of regression. The sum of the squares of the residuals captures the error between the estimate and the actual data and is analogous to the sum of the squares (within) in ANOVA (see the numerator of Eq. (8)). The sum of the squares of the residuals is used in linear regression to calculate the best fit (see earlier). The sum of the squares of regression calculates how far the predicted values differ from the overall mean, and is analogous to the sum of the squares (between) in ANOVA (the denominator of Eq. (8) is the total sum of squares, the sum of these two components). In the example in this paper the R² value was 0.997, which means that 99.7% of the variation of the dependent variable can be explained by the variation of the independent variable.
After using SOLVER to calculate the converged values of the parameters one would like to know the reliability of those values. Some curve fitting programmes display the standard error of the parameters. However, care should be taken in interpreting these values. As stated by Motulsky and Ransnas [6]: "Non-linear regression programs generally print out estimates of the standard error of (the) parameters, but these values should not be taken too seriously. In non-linear functions, errors are neither additive nor symmetrical, and exact confidence limits cannot be calculated. The reported standard error values are based on linearizing assumptions and will always underestimate the true uncertainty of any non-linear equation ... it is not appropriate to use the standard error values printed by a non-linear regression program in further formal statistical calculations." A method for calculating the asymptotic standard errors of the parameters has been devised, but it involves evaluating a Hessian matrix, an approach that is significantly more complex, requires significantly more computer time, and needs a considerably more complex computer program [2]. Thus the approach taken in this paper is to calculate the standard error of the data around the regression line, also known as the standard error of the residuals. This is calculated by dividing the sum of the squares of the residuals by the degrees of freedom, which gives the variance of the data about the regression line; taking the square root of this value gives the standard error of the residuals. The standard error of the residuals can be used to calculate the confidence interval. The confidence interval is an indicator of the probability that the true value lies within the specified range. It is common to use a 95% confidence interval, which means that there is a 95% probability that the true value lies within the interval. In order to calculate the confidence interval the Critical t-value must be calculated. This value depends on the confidence level and the degrees of freedom. Fortunately Excel has a built-in function (tinv) which allows calculation of the Critical t-value, thus bypassing the need to look up tables of t values. The formula in cell H7, tinv(1 − confidence level, degrees of freedom), calculates this value for our desired confidence interval and degrees of freedom. Once this value has been calculated, the confidence interval is simply the best fit at all data points ± the Critical t-value*S.E. of residuals.
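The confidence-interval arithmetic just described can be checked with a short sketch. The residual values are invented for illustration, and Excel's tinv() is replaced by a hard-coded critical t value from standard tables:

```python
from math import sqrt

# Confidence interval from the standard error of the residuals. The critical
# t value is hard-coded for df = 3 (5 points minus 2 parameters).

residuals = [0.01, -0.02, 0.015, -0.005, 0.02]   # y - y_fit, invented values
df = len(residuals) - 2                          # 5 data points, 2 parameters
variance = sum(r ** 2 for r in residuals) / df   # variance about the regression line
se_residuals = sqrt(variance)                    # standard error of the residuals
critical_t = 3.182                               # two-tailed t, alpha = 0.05, df = 3
ci = critical_t * se_residuals
print(se_residuals, ci)                          # the 95% band is y_fit +/- ci
```

In a real fit the residuals column would come from the spreadsheet (B2:B20 minus C2:C20) and df from cell H4, so the critical t would be looked up for that df instead.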
3.2. Advantages and limitations

While this protocol is regarded as robust and reliable, a few points should be borne in mind.
First, the greater the number of parameters in the function, the longer SOLVER will take. Additionally, the more the user customizes the fitting protocol with additional constraints or increased tolerance or precision, the longer SOLVER will take. Second, if initial parameter estimates are inappropriate, the iteration process can proceed in the wrong direction and a solution is not found. Thus it is important that sensible initial parameter estimates are input. Poor estimates may also lead to the wrong solution being found. This paper demonstrates an easily understood method for rapid fitting of data with non-linear functions.
References
[1] W.P. Bowen, J.C. Jerman, Nonlinear regression using spreadsheets, TiPS 16 (1995) 413–417.
[2] M.L. Johnson, Why, when, and how biochemists should use least squares, Anal. Biochem. 206 (1992) 215–225.
[3] L.S. Lasdon, A.D. Waren, A. Jain, M. Ratner, Design and testing of a generalized reduced gradient code for nonlinear programming, ACM Trans. Mathematical Software 4 (1978) 34–50.
[4] S. Smith, L. Lasdon, Solving large sparse nonlinear programs using GRG, ORSA J. Comput. 4 (1992) 2–15.
[5] J. Dempster, Computer Analysis of Electrophysiological Signals, Academic Press, London, 1993.
[6] H.J. Motulsky, L.A. Ransnas, Fitting curves to data using nonlinear regression: a practical and nonmathematical review, FASEB J. 1 (1987) 365–374.