Appendix D

Nonlinear Least Squares Data Fitting

D.1 Introduction
A nonlinear least squares problem is an unconstrained minimization problem of the form

    \min_x \; f(x) = \sum_{i=1}^{m} f_i(x)^2,

where the objective function is defined in terms of auxiliary functions \{ f_i \}. It is called "least squares" because we are minimizing the sum of squares of these functions. Looked at in this way, it is just another example of unconstrained minimization, leading one to ask why it should be studied as a separate topic. There are several reasons.
In the context of data fitting, the auxiliary functions \{ f_i \} are not arbitrary nonlinear functions. They correspond to the residuals in a data-fitting problem (see Chapter 1). For example, suppose that we had collected data \{ (t_i, y_i) \}_{i=1}^{m} consisting of the size of a population of antelope at various times. Here t_i corresponds to the time at which the population y_i was counted. Suppose we had the data

    t_i : 1  2  4  5  8
    y_i : 3  4  6 11 20

where the times are measured in years and the populations are measured in hundreds. It is common to model populations using exponential models, and so we might hope that

    y_i \approx x_1 e^{x_2 t_i}

for appropriate choices of the parameters x_1 and x_2. A model of this type is illustrated in Figure D.1.
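
For readers who want to reproduce the calculations in this appendix numerically, the data and model above can be written down in a few lines. The sketch below uses Python with NumPy; the helper names `model` and `residuals` are ours, not part of the text.

    import numpy as np

    # Antelope data from the table above (times in years, populations in hundreds).
    t = np.array([1.0, 2.0, 4.0, 5.0, 8.0])
    y = np.array([3.0, 4.0, 6.0, 11.0, 20.0])

    def model(x, t):
        """Exponential model y ~ x1 * exp(x2 * t)."""
        return x[0] * np.exp(x[1] * t)

    def residuals(x, t, y):
        """Residuals f_i(x) = x1 * exp(x2 * t_i) - y_i."""
        return model(x, t) - y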
If least squares were used to select the parameters (see Section 1.5) then we would solve

    \min_{x_1, x_2} \; f(x_1, x_2) = \sum_{i=1}^{5} \left( x_1 e^{x_2 t_i} - y_i \right)^2,
[Figure D.1. Exponential Model of Antelope Population]
so that the i-th function would be

    f_i(x_1, x_2) = x_1 e^{x_2 t_i} - y_i,

that is, it would be the residual for the i-th data point. Most least squares problems are of this form, where the functions f_i(x) are residuals and where the index i indicates the particular data point. This is one way in which least squares problems are distinctive.
Least-squares problems are also distinctive in the way that the solution is interpreted. Least squares problems usually incorporate some assumptions about the errors in the model. For example, we might have

    y_i = x_1 e^{x_2 t_i} + \epsilon_i,

where the errors \{ \epsilon_i \} are assumed to arise from a single probability distribution, often the normal distribution. Associated with our model are the "true" parameters x_1 and x_2, but each time we collect data and solve the least-squares problem we only obtain estimates of these true parameters. After computing these estimates, it is common to ask questions about the model such as: What bounds can we place on the values of the true parameters? Does the model adequately fit the data? How sensitive are the parameters to changes in the data? And so on.
Algorithms for least-squares problems are also distinctive. This is a consequence of the special structure of the Hessian matrix for the least-squares objective function. The Hessian in this case is the sum of two terms. The first only involves the gradients of the functions \{ f_i \} and so is easier to compute. The second involves the second derivatives, but is zero if the errors \{ \epsilon_i \} are all zero (that is, if the model fits the data perfectly). It is tempting to approximate the second term in the Hessian, and many algorithms for least squares do this. Additional techniques are used to deal with the first term in a computationally sensible manner.

If least-squares problems were uncommon then even these justifications would not be enough to justify our discussion here. But they are not uncommon. They are one of the most widely encountered unconstrained optimization problems, and amply justify the attention given them.
D.2 Nonlinear Least-Squares Data Fitting
Let us first examine the special form of the derivatives in a least-squares problem. We will write the problem as

    \min_x \; f(x) = \frac{1}{2} \sum_{i=1}^{m} f_i(x)^2 \equiv \frac{1}{2} F(x)^T F(x),

where F is the vector-valued function

    F(x) = ( f_1(x) \; f_2(x) \; \cdots \; f_m(x) )^T.

We have scaled the problem by \frac{1}{2} to make the derivatives less cluttered. The components of \nabla f(x) can be derived using the chain rule:

    \nabla f(x) = \nabla F(x) F(x).

\nabla^2 f(x) can be derived by differentiating this formula with respect to x_j:

    \nabla^2 f(x) = \nabla F(x) \nabla F(x)^T + \sum_{i=1}^{m} f_i(x) \nabla^2 f_i(x).

These are the formulas for the gradient and Hessian of f.
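
As an illustration of how these two formulas are used in computation, the following sketch (Python/NumPy) assembles the gradient and Hessian from user-supplied callables. The names are assumptions of the sketch: `F(x)` returns the residual vector, `jac(x)` returns the m-by-n Jacobian of F (that is, \nabla F(x)^T in the notation above), and `hess_fi(x, i)` returns the Hessian of the i-th residual.

    import numpy as np

    def grad_and_hessian(x, F, jac, hess_fi):
        """Return (grad f, Hessian f) for f(x) = 0.5 * F(x)^T F(x)."""
        r = F(x)            # residual vector, shape (m,)
        J = jac(x)          # Jacobian of F, shape (m, n); equals grad F(x)^T in the text
        g = J.T @ r         # grad f(x) = grad F(x) F(x)
        H = J.T @ J         # first (Gauss-Newton) term of the Hessian
        for i, ri in enumerate(r):
            H = H + ri * hess_fi(x, i)   # second-derivative term
        return g, H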
Let x_* be the solution of the least-squares problem, and suppose that at the solution f(x_*) = 0. Then f_i(x_*) = 0 for all i, indicating that all the residuals are zero and that the model fits the data with no error. As a result, F(x_*) = 0 and hence \nabla f(x_*) = 0, confirming that the first-order necessary condition is satisfied. It also follows that

    \nabla^2 f(x_*) = \nabla F(x_*) \nabla F(x_*)^T,

so that the Hessian at the solution is positive semi-definite, as expected. If \nabla F(x_*) is a matrix of full rank then \nabla^2 f(x_*) is positive definite.
Example D.1 Gradient and Hessian. For the antelope data and model in Section D.1,

    F(x) = \begin{pmatrix}
      x_1 e^{x_2 t_1} - y_1 \\
      x_1 e^{x_2 t_2} - y_2 \\
      x_1 e^{x_2 t_3} - y_3 \\
      x_1 e^{x_2 t_4} - y_4 \\
      x_1 e^{x_2 t_5} - y_5
    \end{pmatrix}
    =
    \begin{pmatrix}
      x_1 e^{x_2} - 3 \\
      x_1 e^{2 x_2} - 4 \\
      x_1 e^{4 x_2} - 6 \\
      x_1 e^{5 x_2} - 11 \\
      x_1 e^{8 x_2} - 20
    \end{pmatrix}.
The formula for the least-squares objective function is

    f(x_1, x_2) = \frac{1}{2} \sum_{i=1}^{5} \left( x_1 e^{x_2 t_i} - y_i \right)^2 = \frac{1}{2} F(x)^T F(x).
The gradient of f is

    \nabla f(x_1, x_2) = \begin{pmatrix}
      \sum_{i=1}^{5} (x_1 e^{x_2 t_i} - y_i) e^{x_2 t_i} \\
      \sum_{i=1}^{5} (x_1 e^{x_2 t_i} - y_i) x_1 t_i e^{x_2 t_i}
    \end{pmatrix}.
This can be rewritten as

    \nabla f(x_1, x_2) =
    \begin{pmatrix}
      e^{x_2 t_1} & e^{x_2 t_2} & e^{x_2 t_3} & e^{x_2 t_4} & e^{x_2 t_5} \\
      x_1 t_1 e^{x_2 t_1} & x_1 t_2 e^{x_2 t_2} & x_1 t_3 e^{x_2 t_3} & x_1 t_4 e^{x_2 t_4} & x_1 t_5 e^{x_2 t_5}
    \end{pmatrix}
    \begin{pmatrix}
      x_1 e^{x_2 t_1} - y_1 \\
      x_1 e^{x_2 t_2} - y_2 \\
      x_1 e^{x_2 t_3} - y_3 \\
      x_1 e^{x_2 t_4} - y_4 \\
      x_1 e^{x_2 t_5} - y_5
    \end{pmatrix}

so that \nabla f(x_1, x_2) = \nabla F(x) F(x). The Hessian matrix is

    \nabla^2 f(x) = \nabla F(x) \nabla F(x)^T + \sum_{i=1}^{m} f_i(x) \nabla^2 f_i(x)

    = \begin{pmatrix}
      e^{x_2 t_1} & e^{x_2 t_2} & e^{x_2 t_3} & e^{x_2 t_4} & e^{x_2 t_5} \\
      x_1 t_1 e^{x_2 t_1} & x_1 t_2 e^{x_2 t_2} & x_1 t_3 e^{x_2 t_3} & x_1 t_4 e^{x_2 t_4} & x_1 t_5 e^{x_2 t_5}
    \end{pmatrix}
    \begin{pmatrix}
      e^{x_2 t_1} & x_1 t_1 e^{x_2 t_1} \\
      e^{x_2 t_2} & x_1 t_2 e^{x_2 t_2} \\
      e^{x_2 t_3} & x_1 t_3 e^{x_2 t_3} \\
      e^{x_2 t_4} & x_1 t_4 e^{x_2 t_4} \\
      e^{x_2 t_5} & x_1 t_5 e^{x_2 t_5}
    \end{pmatrix}

    + \sum_{i=1}^{5} (x_1 e^{x_2 t_i} - y_i)
      \begin{pmatrix}
        0 & t_i e^{x_2 t_i} \\
        t_i e^{x_2 t_i} & x_1 t_i^2 e^{x_2 t_i}
      \end{pmatrix}.
Note that \{ t_i \} and \{ y_i \} are the data values for the model, while x_1 and x_2 are the variables in the model.
If F(x_*) = 0 then it is reasonable to expect that F(x) \approx 0 for x \approx x_*, implying that

    \nabla^2 f(x) = \nabla F(x) \nabla F(x)^T + \sum_{i=1}^{m} f_i(x) \nabla^2 f_i(x) \approx \nabla F(x) \nabla F(x)^T.

This final formula only involves the first derivatives of the functions \{ f_i \} and suggests that an approximation to the Hessian matrix can be found using only first derivatives, at least in cases where the model is a good fit to the data. This idea is the basis for a number of specialized methods for nonlinear least squares data fitting.
The simplest of these methods, called the Gauss-Newton method, uses this approximation directly. It computes a search direction using the formula for Newton's method

    \nabla^2 f(x) p = -\nabla f(x)

but replaces the Hessian with this approximation:

    \nabla F(x) \nabla F(x)^T p = -\nabla F(x) F(x).

In cases where F(x_*) = 0 and \nabla F(x_*) is of full rank, the Gauss-Newton method behaves like Newton's method near the solution, but without the costs associated with computing second derivatives.

The Gauss-Newton method can perform poorly when the residuals at the solution are not small (that is, when the model does not fit the data well), or if the Jacobian of F is not of full rank at the solution. Loosely speaking, in these cases the Gauss-Newton method is using a poor approximation to the Hessian of f.
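
In code, the basic iteration is short. The sketch below is a bare Gauss-Newton loop with no line search or other safeguards, mirroring the simplified version used in the examples that follow; `residuals` and `jac` are assumed to return F(x) and the m-by-n Jacobian, and the normal-equations solve is used only for brevity (see the discussion of the normal equations later in this section).

    import numpy as np

    def gauss_newton(x0, residuals, jac, tol=1e-6, max_iter=50):
        """Plain Gauss-Newton iteration for min 0.5 * ||F(x)||^2 (no globalization)."""
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            r = residuals(x)
            J = jac(x)
            g = J.T @ r                      # gradient of f
            if np.linalg.norm(g) < tol:
                break
            # Solve (J^T J) p = -J^T r for the Gauss-Newton direction.
            p = np.linalg.solve(J.T @ J, -g)
            x = x + p
        return x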
Example D.2 Gauss-Newton Method. We apply the Gauss-Newton method to an exponential model of the form

    y_i \approx x_1 e^{x_2 t_i}

with data

    t = ( 1 \; 2 \; 4 \; 5 \; 8 )^T,
    y = ( 3.2939 \; 4.2699 \; 7.1749 \; 9.3008 \; 20.259 )^T.

For this example, the vector y was chosen so that the model would be a good fit to the data, and hence we would expect the Gauss-Newton method to perform much like Newton's method. (In general y will not be chosen, but will be part of the given data for a problem.) We apply the Gauss-Newton method without a line search, using an initial guess that is close to the solution:

    x = \begin{pmatrix} 2.50 \\ 0.25 \end{pmatrix}.
At this point

    F(x) = \begin{pmatrix} -0.0838 \\ -0.1481 \\ -0.3792 \\ -0.5749 \\ -1.7864 \end{pmatrix}
    \quad\text{and}\quad
    \nabla F(x)^T = \begin{pmatrix}
      1.2840 & 3.2101 \\
      1.6487 & 8.2436 \\
      2.7183 & 27.1828 \\
      3.4903 & 43.6293 \\
      7.3891 & 147.7811
    \end{pmatrix}.

Hence

    \nabla f(x) = \nabla F(x) F(x) = \begin{pmatrix} -16.5888 \\ -300.8722 \end{pmatrix},
    \qquad
    \nabla F(x) \nabla F(x)^T = \begin{pmatrix} 78.5367 & 1335.8479 \\ 1335.8479 & 24559.9419 \end{pmatrix}.

The Gauss-Newton search direction is obtained by solving the linear system

    \nabla F(x) \nabla F(x)^T p = -\nabla F(x) F(x)

and so

    p = \begin{pmatrix} 0.0381 \\ 0.0102 \end{pmatrix}

and the new estimate of the solution is

    x \leftarrow x + p = \begin{pmatrix} 2.5381 \\ 0.2602 \end{pmatrix}.

(For simplicity, we do not use a line search here, although a practical method would require such a globalization strategy.) The complete iteration is given in Table D.1. At the solution,

    x = \begin{pmatrix} 2.5411 \\ 0.2595 \end{pmatrix}.
Table D.1. Gauss-Newton Iteration (Ideal Data)

    k    f(x_k)       ‖∇f(x_k)‖
    0    2 × 10^0     3 × 10^2
    1    4 × 10^-3    2 × 10^1
    2    2 × 10^-8    3 × 10^-2
    3    3 × 10^-9    4 × 10^-8
    4    3 × 10^-9    3 × 10^-13
Since f(x) \approx 0, an approximate global solution has been found to the least-squares problem. (The least-squares objective function cannot be negative.) In general, the Gauss-Newton method is only guaranteed to find a local solution.
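
The step just computed can be checked directly. The sketch below is our own verification code (using the sign convention f_i(x) = x_1 e^{x_2 t_i} - y_i) and reproduces, to roughly the printed precision, the quantities reported above at the initial guess x = (2.50, 0.25)^T.

    import numpy as np

    t = np.array([1.0, 2.0, 4.0, 5.0, 8.0])
    y = np.array([3.2939, 4.2699, 7.1749, 9.3008, 20.259])
    x = np.array([2.50, 0.25])

    r = x[0] * np.exp(x[1] * t) - y                     # F(x): approx (-0.0838, ..., -1.7864)
    J = np.column_stack([np.exp(x[1] * t),              # dF_i/dx1
                         x[0] * t * np.exp(x[1] * t)])  # dF_i/dx2
    g = J.T @ r                                         # approx (-16.5888, -300.8722)
    p = np.linalg.solve(J.T @ J, -g)                    # approx (0.0381, 0.0102)
    print(x + p)                                        # approx (2.5381, 0.2602)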
For comparison, we now apply Newton's method to the same problem using the same initial guess

    x = \begin{pmatrix} 2.50 \\ 0.25 \end{pmatrix}.

At this point

    \nabla f(x) = \begin{pmatrix} -16.5888 \\ -300.8722 \end{pmatrix}
    \quad\text{and}\quad
    \nabla^2 f(x) = \begin{pmatrix} 78.5367 & 1215.4991 \\ 1215.4991 & 22278.6570 \end{pmatrix}.

(This matrix is similar to the matrix used in the Gauss-Newton method.) The search direction is the solution of

    \nabla^2 f(x) p = -\nabla f(x)

so that

    p = \begin{pmatrix} 0.0142 \\ 0.0127 \end{pmatrix}
    \quad\text{and}\quad
    x \leftarrow x + p = \begin{pmatrix} 2.5142 \\ 0.2627 \end{pmatrix}.

The complete iteration is in Table D.2. The solution obtained is almost identical to that obtained by the Gauss-Newton method.
Table D.2. Newton Iteration (Ideal Data)

    k    f(x_k)       ‖∇f(x_k)‖
    0    2 × 10^0     3 × 10^2
    1    1 × 10^-1    5 × 10^1
    2    2 × 10^-4    9 × 10^-1
    3    5 × 10^-9    6 × 10^-3
    4    6 × 10^-9    8 × 10^-8
    5    3 × 10^-9    1 × 10^-12

We now consider the same model

    y_i \approx x_1 e^{x_2 t_i}

but with the data

    t = ( 1 \; 2 \; 4 \; 5 \; 8 \; 4.1 )^T,
    y = ( 3 \; 4 \; 6 \; 11 \; 20 \; 46 )^T.

This corresponds to the antelope data of Section D.1, but with an extraneous data point added. (This point is called an outlier, since it is inconsistent with the other data points for this model.) In this case the exponential model will not be a good fit to the data, so we would expect the performance of the Gauss-Newton method to deteriorate. The runs corresponding to the initial guess

    x = \begin{pmatrix} 10 \\ 0.1 \end{pmatrix}

are given in Table D.3. As expected, the Gauss-Newton method converges slowly. Both methods find the solution

    x = \begin{pmatrix} 9.0189 \\ 0.1206 \end{pmatrix}.

The initial guess was close to the solution, so that the slow convergence of the Gauss-Newton method was not due to a poor initial guess. Also, the final function value is large, indicating that the model cannot fit the data well in this case. This is to be expected given that an outlier is present.
Many other methods for nonlinear least-squares can be interpreted as using some approximation to the second term in the formula for the Hessian matrix,

    \sum_{i=1}^{m} f_i(x) \nabla^2 f_i(x).

The oldest and simplest of these approximations is

    \sum_{i=1}^{m} f_i(x) \nabla^2 f_i(x) \approx \lambda I,

where \lambda \ge 0 is some scalar. Then the search direction is obtained by solving the linear system

    \left( \nabla F(x) \nabla F(x)^T + \lambda I \right) p = -\nabla F(x) F(x).

This is referred to as the Levenberg-Marquardt method.
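
A single Levenberg-Marquardt direction therefore differs from a Gauss-Newton direction only through the shifted coefficient matrix. A minimal sketch (Python/NumPy, with the same assumed `residuals` and `jac` helpers as before, and \lambda supplied by the caller):

    import numpy as np

    def levenberg_marquardt_step(x, residuals, jac, lam):
        """One Levenberg-Marquardt direction: solve (J^T J + lam*I) p = -J^T r."""
        r = residuals(x)
        J = jac(x)
        n = J.shape[1]
        return np.linalg.solve(J.T @ J + lam * np.eye(n), -(J.T @ r))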
Table D.3. Gauss-Newton (left) and Newton Iterations (right); Data Set with Outlier

    Gauss-Newton                        Newton
    k    f(x_k)    ‖∇f(x_k)‖            k    f(x_k)    ‖∇f(x_k)‖
    0    601.90    2 × 10^2             0    601.90    2 × 10^2
    1    599.90    8 × 10^0             1    599.64    1 × 10^1
    2    599.67    3 × 10^1             2    599.64    2 × 10^-2
    3    599.65    6 × 10^0             3    599.64    6 × 10^-8
    4    599.64    2 × 10^0             4    599.64    5 × 10^-13
    5    599.64    6 × 10^-1
    ...
    16   599.64    1 × 10^-6
    17   599.64    4 × 10^-7
    18   599.64    1 × 10^-7
    19   599.64    4 × 10^-8
    20   599.64    1 × 10^-8
    21   599.64    3 × 10^-9
The Levenberg-Marquardt method is often implemented in the context of a trust-region strategy (see Section 11.6). If this is done then the search direction is obtained by minimizing a quadratic model of the objective function (based on the Gauss-Newton approximation to the Hessian),

    \min_p \; Q(p) = f(x) + p^T \nabla F(x) F(x) + \frac{1}{2} p^T \nabla F(x) \nabla F(x)^T p,

subject to the constraint

    \| p \| \le \Delta

for some scalar \Delta > 0. This gives a step p that satisfies the Levenberg-Marquardt formula for an appropriate \lambda \ge 0. The scalar \lambda is determined indirectly by picking a value of \Delta, as is described in Section 11.6. The scalar \Delta can be chosen based on the effectiveness of the Gauss-Newton approximation to the Hessian, and this can be easier than choosing \lambda directly. An example illustrating a trust-region approach can be found in the same Section.
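
One crude way to realize the correspondence between \lambda and \Delta numerically is to increase \lambda until the step just fits inside the trust region. The sketch below does this by bisection purely for illustration; practical implementations, such as the one described by Moré (1977), use more careful root finding and safeguards. It assumes the same `residuals` and `jac` helpers as before and a nonsingular \nabla F(x) \nabla F(x)^T.

    import numpy as np

    def lm_step_for_radius(x, residuals, jac, delta, lam_hi=1e8, iters=60):
        """Find a Levenberg-Marquardt step with ||p|| <= delta by bisection on lambda."""
        r, J = residuals(x), jac(x)
        g, H = J.T @ r, J.T @ J
        n = H.shape[0]
        step = lambda lam: np.linalg.solve(H + lam * np.eye(n), -g)
        p = step(0.0)
        if np.linalg.norm(p) <= delta:      # Gauss-Newton step already inside the region
            return p
        lo, hi = 0.0, lam_hi
        for _ in range(iters):              # ||p(lambda)|| decreases as lambda increases
            lam = 0.5 * (lo + hi)
            p = step(lam)
            if np.linalg.norm(p) > delta:
                lo = lam
            else:
                hi = lam
        return p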
Both the Gauss-Newton and Levenberg-Marquardt methods use an approximation to the Hessian \nabla^2 f(x). If this approximation is not accurate then the methods will converge more slowly than Newton's method; in fact, they will converge at a linear rate.

Other approximations to the Hessian of f(x) are also possible. For example, a quasi-Newton approximation to

    \sum_{i=1}^{m} f_i(x) \nabla^2 f_i(x)

could be used.
There is one other computational detail associated with the Gauss-Newton method that should be mentioned. The formula for the search direction in a Gauss-Newton method

    \nabla F(x) \nabla F(x)^T p = -\nabla F(x) F(x)

is equivalent to the solution of a linear least-squares problem^19

    \min_p \; \left\| \nabla F(x)^T p + F(x) \right\|_2^2 = \left( \nabla F(x)^T p + F(x) \right)^T \left( \nabla F(x)^T p + F(x) \right).

If we set the gradient of this function with respect to p equal to zero, we obtain the Gauss-Newton formula. The Gauss-Newton formula corresponds to a system of linear equations called the normal equations for the linear least-squares problem. If the normal equations are solved on a computer then the computed search direction will have an error bound proportional to

    \mathrm{cond}(\nabla F(x) \nabla F(x)^T) = \mathrm{cond}(\nabla F(x))^2

if the 2-norm is used to define the condition number (see Section A.8). However if the search direction is computed directly as the solution to the linear least-squares problem, without explicitly forming the normal equations, then in many cases a better error bound can be derived.

Working directly with the linear least-squares problem is especially important in cases where \nabla F(x) is not of full rank, or is close (in the sense of rounding error) to a matrix that is not of full rank. In this case the matrix

    \nabla F(x) \nabla F(x)^T

will be singular or nearly singular, causing difficulties in solving

    \nabla F(x) \nabla F(x)^T p = -\nabla F(x) F(x).

The corresponding linear least-squares problem is well defined, however, even when \nabla F(x) is not of full rank (although the solution may not be unique). A similar approach can be used in the computation of the search direction in the Levenberg-Marquardt method (see the Exercises).

^19 A least-squares model is "linear" if the variables x appear linearly. Thus the model y \approx x_1 + x_2 t^2 is linear even though it includes a nonlinear function of the independent variable t.
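
In code, the difference is whether the step comes from the normal equations or from a least-squares solve applied to the Jacobian itself. Both variants are sketched below (Python/NumPy); `rcond=None` merely selects NumPy's default rank cutoff.

    import numpy as np

    def gn_step_normal_equations(J, r):
        """Form and solve the normal equations (squares the condition number of J)."""
        return np.linalg.solve(J.T @ J, -(J.T @ r))

    def gn_step_least_squares(J, r):
        """Solve min_p ||J p + r||_2 directly (QR/SVD based, better conditioned)."""
        p, *_ = np.linalg.lstsq(J, -r, rcond=None)
        return p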
Exercises
2.1. Let x = (2, 1)^T. Calculate F(x), f(x), \nabla F(x), \nabla f(x), \nabla F(x) \nabla F(x)^T, and \nabla^2 f(x) for the antelope model.
2.2. Consider the least-squares model

    y \approx x_1 e^{x_2 t} + x_3 + x_4 t.

Determine the formulas for F(x), f(x), \nabla F(x), \nabla f(x), \nabla F(x) \nabla F(x)^T, and \nabla^2 f(x) for a general data set \{ (t_i, y_i) \}_{i=1}^{m}.
2.3. Verify the results in Example D.2 for the ideal data set.
2.4. Verify the results in Example D.2 for the data set containing the outlier.
2.5. Modify the least-squares model used in Example D.2 for the data set containing the outlier. Use the model

    y \approx x_1 e^{x_2 t} + x_3 (t - 4.1)^{10}.

Apply the Gauss-Newton method with no line search to this problem. Use the initial guess (2.5, 0.25, 0.1)^T. Does the method converge rapidly? Does this model fit the data well?
2.6. Apply Newton's method and the Gauss-Newton method to the antelope model, using the initial guess x = (2.5, 0.25)^T. Do not use a line search. Terminate the algorithm when the norm of the gradient is less than 10^{-6}. Compute the difference between the Hessian and the Gauss-Newton approximation at the initial and final points.
2.7. Repeat the previous exercise, but use the back-tracking line search described in Section 11.5 with \mu = 0.1.
2.8. Prove that the Gauss-Newton method is the same as Newton's method for a linear least-squares problem.
2.9. Consider the formula for the Levenberg-Marquardt direction:

    \left( \nabla F(x) \nabla F(x)^T + \lambda I \right) p = -\nabla F(x) F(x).

Show that p can be computed as the solution to a linear least-squares problem with coefficient matrix

    \begin{pmatrix} \nabla F(x)^T \\ \sqrt{\lambda}\, I \end{pmatrix}.

Why might it be preferable to compute p this way?
2.10. Consider a nonlinear least-squares problem

    \min_x \; f(x) = F(x)^T F(x)

with n variables and m > n nonlinear functions f_i(x), and assume that \nabla F(x) is a full-rank matrix for all values of x. Let p_{GN} be the Gauss-Newton search direction, let p_{LM}(\lambda) be the Levenberg-Marquardt search direction for a particular value of \lambda, and let p_{SD} be the steepest-descent direction. Prove that

    \lim_{\lambda \to 0} p_{LM}(\lambda) = p_{GN}
    \quad\text{and}\quad
    \lim_{\lambda \to \infty} \frac{p_{LM}(\lambda)}{\| p_{LM}(\lambda) \|} = \frac{p_{SD}}{\| p_{SD} \|}.
D.3 Statistical Tests^20

^20 This Section assumes knowledge of statistics.

Let us return to the exponential model discussed in Section D.1, and assume that the functions \{ f_i \} are residuals of the form

    f_i(x_1, x_2) = x_1 e^{x_2 t_i} - y_i.

We are assuming that

    y_i = x_1 e^{x_2 t_i} + \epsilon_i,

or in more general terms that

    [Observation] = [Model] + [Error].
If assumptions are made about the behavior of the errors, then it is possible to draw conclusions about the fit. If the model is linear, that is, if the variables x appear linearly in the functions f_i(x), then the statistical conclusions are precise. If the model is nonlinear, however, standard techniques only produce linear approximations to exact results.

The assumptions about the errors are not based on the particular set of errors that correspond to the given data set, but rather on the errors that would be obtained if a very large, or even infinite, data set had been collected. For example, it is common to assume that the expected value of the errors is zero, that is, that the data are unbiased. Also, it is common to assume that the errors all have the same variance (perhaps unknown), and that the errors are independent of one another. These assumptions can sometimes be guaranteed by careful data collection, or by transforming the data set in a straightforward manner (see the Exercises).
More is required, however. If we interpret the errors to be random, then we would like to know their underlying probability distribution. We will assume that the errors follow a normal distribution with mean 0 and known variance \sigma^2. The normal distribution is appropriate in many cases where the data come from measurements (such as measurements made with a ruler or some sort of gauge). In addition, the central limit theorem indicates that, at least as the sample size increases, many other probability distributions can be approximated by a normal distribution.

If the errors are assumed to be normally distributed, then least squares minimization is an appropriate technique for choosing the parameters x. Use of least squares corresponds to maximizing the "likelihood" or probability that the parameters have been chosen correctly, based on the given data. If a different probability distribution were given, then maximizing the likelihood would give rise to a different optimization problem, one involving some other function of the residuals.
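
To make this connection concrete, suppose the errors \epsilon_i are independent and normally distributed with mean 0 and variance \sigma^2. The following derivation is the standard one (it is not part of the original text) and shows why maximizing the likelihood reduces to least squares for the exponential model:

    L(x) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma}
           \exp\!\left( -\frac{(y_i - x_1 e^{x_2 t_i})^2}{2\sigma^2} \right),
    \qquad
    -\log L(x) = \frac{m}{2}\log(2\pi\sigma^2)
               + \frac{1}{2\sigma^2} \sum_{i=1}^{m} \left( y_i - x_1 e^{x_2 t_i} \right)^2.

Since the first term of -\log L(x) does not depend on x, maximizing the likelihood is equivalent to minimizing \sum_i (y_i - x_1 e^{x_2 t_i})^2, which is exactly the least-squares objective; a different error distribution would lead to a different function of the residuals.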
If least squares is applied to a linear model, then many properties of the parameters and the resulting fit can be analyzed. For example, one could ask for a confidence interval that contained the true value of the parameter x_1, with probability 95%. (Since the errors are random and they are only known in a statistical sense, all of the conclusions that can be drawn will be probabilistic.) Or one could ask if the true value of the parameter x_2 were nonzero, with 99% probability. This is referred to as a hypothesis test. (If the true value of this parameter were zero, then the parameter could be removed from the model.) Or, for some given value of the independent variable t, one could ask for a confidence interval that contained the true value of the model y, again with some probability.
The answers to all of these questions depend on the variance-covariance matrix. In the case of a linear model it is equal to

    \sigma^2 [\nabla^2 f(x)]^{-1} = \sigma^2 [\nabla F(x) \nabla F(x)^T]^{-1},

where \sigma^2 is the variance of the errors \{ \epsilon_i \}. In this case, it is a constant matrix. For a nonlinear model, \nabla^2 f(x) is not constant, and as a result the calculations required to determine confidence intervals or to do hypothesis testing are more difficult and more computationally expensive. It is possible to apply the same formulas that are used for linear models, using either

    [\nabla^2 f(x_*)]^{-1} \quad\text{or}\quad [\nabla F(x_*) \nabla F(x_*)^T]^{-1}

in place of [\nabla^2 f(x)]^{-1}, but the resulting intervals and tests are then only approximations, and in some cases the approximations can be poor. Using additional computations, it is possible to analyze a model to determine how nonlinear it is, and hence detect if these linear approximations are effective. If they are not, then alternative techniques can be applied, but again at a cost of additional computations.
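
As a concrete sketch of how the linear-model formulas are applied in practice, the code below forms the approximate variance-covariance matrix [\nabla F(x_*) \nabla F(x_*)^T]^{-1} scaled by an estimate of \sigma^2, and from it approximate 95% confidence half-widths for the parameters. The estimate of \sigma^2 from the residual sum of squares divided by (m - n), and the normal-distribution factor 1.96, are standard choices not taken from the text; everything here is only the linear approximation discussed above.

    import numpy as np

    def approximate_confidence_intervals(J, r, level_z=1.96):
        """Linearized confidence information for a nonlinear least-squares fit.

        J : Jacobian of F at the computed solution, shape (m, n)
        r : residual vector F(x*) at the computed solution, shape (m,)
        """
        m, n = J.shape
        sigma2 = (r @ r) / (m - n)                 # usual estimate of the error variance
        cov = sigma2 * np.linalg.inv(J.T @ J)      # approximate variance-covariance matrix
        half_width = level_z * np.sqrt(np.diag(cov))
        return cov, half_width                     # interval k: x_k +/- half_width[k], approximately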
Exercises
3.1. Determine the matrices

    [\nabla^2 f(x_*)]^{-1} \quad\text{and}\quad [\nabla F(x_*) \nabla F(x_*)^T]^{-1}

for the two data sets in Example D.2. Are the two matrices close to each other?

3.2. Prove that the matrices

    [\nabla^2 f(x_*)]^{-1} \quad\text{and}\quad [\nabla F(x_*) \nabla F(x_*)^T]^{-1}

are the same for a linear least-squares problem.

3.3. Consider the nonlinear least-squares problem

    \min \; \sum_{i=1}^{m} f_i(x)^2

where each f_i represents a residual with error \epsilon_i. Assume that all of the errors are independent, with mean zero. Also assume that the i-th error is normally distributed with known variance \sigma_i^2, but do not assume that all of these variances are equal. Show how to transform this least-squares problem to one where all the errors have the same variance.
D.4 Orthogonal Distance Regression
So far, we have assumed that the data-fitting errors were of the form

    y = x_1 e^{x_2 t} + \epsilon.

That is, either the model is incomplete (there are additional terms or parameters that are being ignored) or the observations contain errors, perhaps due to inaccurate measurements. But it is also possible that there are errors in the independent variable t. For example, the measurements might be recorded as having been taken once an hour at exactly thirty minutes past the hour, but in fact were taken sometime between twenty-five and thirty-five minutes past the hour.

If this is true, that is, if we believe that the independent variable is subject to error, then we should use a model of the form

    y = x_1 e^{x_2 (t + \delta)} + \epsilon.
If we assume that the errors are normally distributed with mean 0 and constant variance, then it is appropriate to use a least-squares technique to estimate the parameters x and \delta:

    \min_{x, \delta} \; f(x_1, x_2) = \sum_{i=1}^{5} \left[ \left( x_1 e^{x_2 (t_i + \delta_i)} - y_i \right)^2 + \delta_i^2 \right] = \sum_{i=1}^{5} \left[ \epsilon_i^2 + \delta_i^2 \right].
More generally, if our original least-squares problem were of the form

    \min_x \; f(x) = \sum_{i=1}^{m} f_i(x; t_i)^2,

where t_i is a vector of independent variables that is subject to error, then the revised least-squares problem could be written as

    \min_{x, \delta} \; f(x) = \sum_{i=1}^{m} f_i(x; t_i + \delta_i)^2 + \| \delta \|_2^2.

One might also consider scaling \| \delta \|_2^2 by some constant to reflect any difference in the variances of the two types of errors.
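
As a sketch of what this augmented problem looks like for the exponential example, the code below packs the per-point shifts \delta_i into the variable vector alongside x_1 and x_2; the weight `w` reflects the scaling comment above. Any general-purpose unconstrained minimizer could then be applied to `odr_objective`. The function name and the packing convention are ours, not part of the text.

    import numpy as np

    def odr_objective(z, t, y, w=1.0):
        """Orthogonal-distance objective: z packs (x1, x2, delta_1, ..., delta_m)."""
        x1, x2 = z[0], z[1]
        delta = z[2:]
        eps = x1 * np.exp(x2 * (t + delta)) - y     # vertical errors at the shifted times
        return np.sum(eps**2) + w * np.sum(delta**2)

    # Example starting point: the fitted parameters with all shifts set to zero.
    # z0 = np.concatenate(([2.5, 0.25], np.zeros(len(t))))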
Problems of this type are sometimes called "errors in variables" models, for obvious reasons, but we will use the term orthogonal distance regression. To explain this, consider again the graph of our antelope data (see Figure D.2). If we assume that all the error is in the model or the observation, then the residual measures the vertical distance between the data point and the curve. However if there is also error in the independent variable then geometrically we are minimizing the orthogonal distance between the data point and the curve.

If the model is changing rapidly, as in an exponential model when the exponent is large, or near a singularity in a model, the vertical distance can be large even though the orthogonal distance is small. A data point in such a region can easily have a large vertical residual and thus can exert extraordinary influence in the ordinary least-squares model, perhaps to the extent that the parameter estimate is strongly influenced by that single data point. Using orthogonal distance regression can alleviate this form of difficulty.
[Figure D.2. Orthogonal and Vertical Distances]
Exercises
4.1. Consider the function f(x) = 1/x. Let \epsilon = 10^{-2} and x = \frac{3}{2}. Determine the vertical and orthogonal distance from the point (x - \epsilon, 1/x)^T to the graph of f.

4.2. Consider the ideal data set in Example D.2. Apply Newton's method (or any other minimization method) to solve

    \min_{x, \delta} \; f(x) = \sum_{i=1}^{m} f_i(x; t_i + \delta_i)^2 + \| \delta \|_2^2.

Compare your solution to that obtained in Example D.2.

4.3. Repeat the previous exercise for the data set in Example D.2 containing an outlier.
D.5 Notes
Gauss-Newton Method. We believe that the Gauss-Newton method was invented by Gauss; it is described in his 1809 book on planetary motion. A more modern discussion can be found, for example, in the paper by Dennis (1977). The Levenberg-Marquardt method was first described in the papers of Levenberg (1944) and Marquardt (1963). Its implementation in terms of a trust-region method is described in the paper of Moré (1977). Other approximations to the Hessian of the least-squares problem are described in the papers by Ruhe (1980), Gill and Murray (1978), and Dennis, Gay, and Welsch (1981), and in the book by Bates and Watts (1988). The computational difficulties associated with the normal equations are described in the book by Golub and Van Loan (1996).

In many applications, only a subset of the parameters will occur nonlinearly. In such cases, efficient algorithms are available that treat the linear parameters in a special way. For further information, see the papers by Golub and Pereyra (1973) and Kaufman (1975). Software implementing these techniques is available from Netlib.

Statistical Tests. A brief introduction to statistical techniques in nonlinear least-squares data fitting can be found in the article by Watts and Bates (1985). For a more extensive discussion see the book by Bard (1974). A comparison of various techniques for computing confidence intervals can be found in the paper by Donaldson and Schnabel (1987).

Orthogonal Distance Regression. An extensive discussion of orthogonal distance regression can be found in the book by Fuller (1987). Algorithms and software for orthogonal distance regression are closely related to those used for ordinary least-squares regression. For a discussion of these techniques, see the paper by Boggs and Rogers (1990). In the linear case, orthogonal distance regression is often referred to as total least squares. For information about this case, see the book by Van Huffel and Vandewalle (1991).
D.5.1 References

- Y. Bard, Nonlinear Parameter Estimation, Academic Press, New York, 1974.
- D.M. Bates and D.G. Watts, Nonlinear Regression Analysis and its Applications, Wiley, New York, 1988.
- Paul T. Boggs and Janet E. Rogers, "Orthogonal distance regression," Contemporary Mathematics, 112 (1990), pp. 183-194.
- J.E. Dennis, Jr., "Nonlinear least squares," in The State of the Art in Numerical Analysis, D. Jacobs, editor, Academic Press (New York), pp. 269-312, 1977.
- J.E. Dennis, Jr., D.M. Gay, and R.E. Welsch, "An adaptive nonlinear least-squares algorithm," ACM Transactions on Mathematical Software, 7 (1981), pp. 348-368.
- Janet R. Donaldson and Robert B. Schnabel, "Computational experience with confidence regions and confidence intervals for nonlinear least squares," Technometrics, 29 (1987), pp. 67-82.
- W.A. Fuller, Measurement Error Models, John Wiley and Sons, New York, 1987.
- C.F. Gauss, Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium, Dover, New York, 1809.
- P.E. Gill and W. Murray, "Algorithms for the solution of the nonlinear least-squares problem," SIAM Journal on Numerical Analysis, 15 (1978), pp. 977-992.
- Gene H. Golub and Victor Pereyra, "The differentiation of pseudo-inverses and nonlinear least-squares problems whose variables separate," SIAM Journal on Numerical Analysis, 10 (1973), pp. 413-432.
- Gene H. Golub and C. Van Loan, Matrix Computations (third edition), The Johns Hopkins University Press, Baltimore, 1996.
- Linda Kaufman, "A variable projection method for solving separable nonlinear least squares problems," BIT, 15 (1975), pp. 49-57.
- K. Levenberg, "A method for the solution of certain problems in least squares," Quarterly of Applied Mathematics, 2 (1944), pp. 164-168.
- D. Marquardt, "An algorithm for least-squares estimation of nonlinear parameters," SIAM Journal on Applied Mathematics, 11 (1963), pp. 431-441.
- J.J. Moré, "The Levenberg-Marquardt algorithm: implementation and theory," in Numerical Analysis, G.A. Watson, editor, Lecture Notes in Mathematics 630, Springer-Verlag (Berlin), pp. 105-116, 1977.
- A. Ruhe, "Accelerated Gauss-Newton algorithms for nonlinear least-squares problems," SIAM Review, 22 (1980), pp. 318-337.
- S. Van Huffel and J. Vandewalle, The Total Least Squares Problem: Computational Aspects and Analysis, SIAM, Philadelphia, 1991.
- D.G. Watts and D.M. Bates, "Nonlinear regression," in the Encyclopedia of Statistical Sciences, Vol. 6, Samuel Kotz and Norman L. Johnson, editors, John Wiley and Sons (New York), pp. 306-312, 1985.