The Complex Gradient Operator and the CR-Calculus
In particular, when optimizing a real-valued function of a complex variable z = x + j y one can work with the equivalent real gradient of the function viewed as a mapping from R^2 to R in lieu of a nonexistent complex derivative [14]. However, because the real gradient perspective arises within a complex variables framework, a direct reformulation of the problem to the real domain is awkward. Instead, it greatly simplifies derivations if one can represent the real gradient as a redefined, new complex gradient operator. As we shall see below, the complex gradient is an extension of the standard complex derivative to nonanalytic functions.
Confusing the issue is the fact that there is no one unique way to consistently define a complex gradient which applies to nonanalytic real-valued functions of a complex variable, and authors do not uniformly adhere to the same definition. Thus it is often difficult to resolve questions about the nature or derivation of the complex gradient by comparing authors. Given the additional fact that typographical errors seem to be rampant these days, it is therefore reasonable to be skeptical of the algorithms provided in many textbooks, especially if one is a novice in these matters.
An additional source of confusion arises from the fact that the derivative of a function with respect to a vector can be alternatively represented as a row vector or as a column vector when a space is Cartesian,^1 and both representations can be found in the literature. As done for the development of the real gradient given in [25], in this note we continue to carefully distinguish between the complex cogradient operator, which is a row-vector operator, and the associated complex gradient operator, which is a vector operator giving the direction of steepest ascent of a real scalar-valued function.
Because of the constant back-and-forth shift between a real function (R-calculus) perspective and a complex function (C-calculus) perspective which a careful analysis of nonanalytic complex functions requires [12], we refer to the mathematical framework underlying the derivatives given in this note as a CR-calculus. In the following, we start by reviewing some of the properties of standard univariate analytic functions, describe the CR-calculus for univariate nonanalytic functions, and then develop a multivariate CR-calculus appropriate for optimizing scalar real-valued cost functions of a complex parameter vector. We end the note with some application examples.
^1 I.e., is Euclidean with identity metric tensor.
2 The Derivative of a Holomorphic Function
Let z = x + j y, for x, y real, denote a complex number and let

    f(z) = u(x, y) + j v(x, y)

be a general complex-valued function of the complex number z.^2 In standard complex variables courses it is emphasized that for the complex derivative,

    f′(z) = lim_{Δz → 0} [ f(z + Δz) − f(z) ] / Δz ,

to exist in a meaningful way it must be independent of the direction with which Δz approaches zero in the complex plane. This is a very strong condition to be placed on the function f(z). As noted in an introductory comment from the textbook by Flanigan [6]:
"You will learn to appreciate the difference between a complex analytic function (roughly a complex-valued function f(z) having a complex derivative f′(z)) [...] That f′(z) exists forces a great deal of structure on f(z); moreover, this structure mirrors the structure of the harmonic u(x, y) and v(x, y), functions of two real variables."^3
In particular the following conditions are equivalent statements about a complex function f(z)
on an open set containing z in the complex plane [6]:
^2 Later, in Section 3, we will interchangeably alternate between this notation and the more informative notation f(z, z̄). Other useful representations are f(u, v) and f(x, y). In this section we look for the (strong) conditions under which f : z ↦ f(z) ∈ C is differentiable as a mapping C → C (in which case we say that f is C-differentiable), but in subsequent sections we will admit the weaker condition that f : (x, y) ↦ (u, v) be differentiable as a mapping R^2 → R^2 (in which case we say that f is R-differentiable); see Remmert [12] for a discussion of these different types of differentiability.
^3 Quoted from page 2 of reference [6]. Note that in the quote i = √−1, whereas in this note we take j = √−1 following standard electrical engineering practice.
  • The derivative f′(z) exists.

  • The function f(z) is holomorphic (i.e., complex analytic) in z.^4

  • The functions u(x, y) and v(x, y) satisfy the Cauchy-Riemann conditions

        ∂u/∂x = ∂v/∂y   and   ∂u/∂y = −∂v/∂x .    (1)

Furthermore, when f(z) satisfies the Cauchy-Riemann conditions, the functions u(x, y) and v(x, y) each satisfy Laplace's equation,

    ∂²u(x, y)/∂x² + ∂²u(x, y)/∂y² = 0   and   ∂²v(x, y)/∂x² + ∂²v(x, y)/∂y² = 0 .

Such functions are known as harmonic functions. Thus if either u(x, y) or v(x, y) fails to be harmonic, the function f(z) is not differentiable.^5
Although many important complex functions are holomorphic, including the functions z^n, e^z, ln(z), sin(z), and cos(z), and hence differentiable in the standard complex variables sense, there are commonly encountered useful functions which are not:

  • The function f(z) = z̄, where z̄ denotes complex conjugation, fails to satisfy the Cauchy-Riemann conditions.

  • The functions f(z) = Re(z) = (z + z̄)/2 = x and g(z) = Im(z) = (z − z̄)/2j = y fail the Cauchy-Riemann conditions.

  • The function f(z) = |z|^2 = z z̄ = x^2 + y^2 is not harmonic.

  • Any nonconstant purely real-valued function f(z) (for which it must be the case that v(x, y) ≡ 0) fails the Cauchy-Riemann conditions. In particular the real function f(z) = |z| = √(z z̄) = √(x^2 + y^2) is not differentiable.^6
^4 A function is analytic on some domain if it can be expanded in a convergent power series on that domain. Although this condition implies that the function has derivatives of all orders, analyticity is a stronger condition than infinite differentiability, as there exist functions which have derivatives of all orders but which cannot be expressed as a power series. For a complex-valued function of a complex variable, the term analytic has been replaced in modern mathematics by the entirely synonymous term holomorphic. Thus real-valued power-series-representable functions of a real variable are analytic (real analytic), while complex-valued power-series-representable functions of a complex variable are holomorphic (complex analytic).
^5 Because a harmonic function on R^2 satisfies the partial differential equation known as Laplace's equation, by existence and uniqueness of the solution to this partial differential equation its value is completely determined at a point in the interior of any simply connected region which contains that point once the values on the boundary (boundary conditions) of that region are specified. This is the reason that contour integration of an analytic complex function works and that we have the freedom to select that contour to make the integration as easy as possible. On the other hand, there is, in general, no equivalent to contour integration for an arbitrary function on R^2. See the excellent discussion in Flanigan [6].
^6 Thus we have the classic result that the only holomorphic real-valued functions are the constant real-valued functions.
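The failure of a function such as f(z) = z̄ to possess a complex derivative can be checked directly from the limit definition: the difference quotient depends on the direction along which Δz approaches zero. The short numerical sketch below (Python/NumPy is assumed here purely for illustration; it is not part of the original note) compares difference quotients along the real and imaginary axes for the holomorphic function z^2 and the nonholomorphic function z̄.

    import numpy as np

    z0 = 1.0 + 2.0j          # point at which to test differentiability
    h = 1e-6                 # small step size

    def quotient(f, z, dz):
        """Difference quotient (f(z + dz) - f(z)) / dz."""
        return (f(z + dz) - f(z)) / dz

    for name, f in [("z**2 (holomorphic)", lambda z: z**2),
                    ("conj(z) (nonholomorphic)", np.conj)]:
        q_real = quotient(f, z0, h)        # approach along the real axis
        q_imag = quotient(f, z0, 1j * h)   # approach along the imaginary axis
        print(name, q_real, q_imag)
    # For z**2 both quotients agree (approx. 2*z0); for conj(z) they differ
    # (+1 vs -1), so the limit defining f'(z) does not exist.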
Note in particular the implication of the above for the problem of minimizing the real-valued squared-error loss functional

    ℓ(a) = E{ |η_k − a ξ_k|^2 } = E{ (η_k − a ξ_k)(η̄_k − ā ξ̄_k) } ≜ E{ e_k ē_k }    (2)

for finite second-order moment stationary scalar complex random variables η_k and ξ_k, and unknown complex constant a = a_x + j a_y. Using the theory of optimization in Hilbert spaces, the minimization can be done by invoking the projection theorem (which is equivalent to the orthogonality principle) [24]. Alternatively, the minimization can be performed by completing the square. Either procedure will result in the Wiener-Hopf equations, which can then be solved for the optimal complex coefficient a.
However, if a gradient procedure for determining the optimum is desired, we are immediately stymied by the fact that the purely real nonconstant function ℓ(a) is not analytic and therefore its derivative with respect to a does not exist in the conventional sense of a complex derivative [3]-[6], which applies only to holomorphic functions of a. A way to break this impasse will be discussed in the next section. Meanwhile, note that all of the real-valued nonholomorphic functions shown above can be viewed as functions of both z and its complex conjugate z̄; this fact will be of significance in the following discussion.
3 Extensions of the Complex Derivative: The CR-Calculus
In this section we continue to focus on functions of a single complex variable z. The primary
references for the material developed here are Nehari [11], Remmert [12], and Brandwood [14].
3.1 A Possible Extension of the Complex Derivative.
As we have seen, in order for the complex derivative of a function of z = x + j y,

    f(z) = u(x, y) + j v(x, y),

to exist in the standard holomorphic sense, the real partial derivatives of u(x, y) and v(x, y) must not only exist, they must also satisfy the Cauchy-Riemann conditions (1). As noted by Flanigan [6]: "This is much stronger than the mere existence of the partial derivatives." However, the mere existence of the (real) partial derivatives is necessary and sufficient for a stationary point of a (necessarily nonholomorphic) nonconstant real-valued functional f(z) to exist when f(z) is viewed as a differentiable function of the real and imaginary parts of z, i.e., as a function over R^2,

    f(z) = f(x, y) : R^2 → R .    (3)
Thus the trick is to exploit the real R^2 vector space structure which underlies C when performing gradient-based optimization. In essence, the remainder of this note is concerned with a thorough discussion of this trick.
Towards this end, it is convenient to define a generalization or extension of the standard partial derivative to nonholomorphic functions of z = x + j y that are nonetheless differentiable with respect to x and y, and which incorporates the real gradient information directly within the complex variables framework. After Remmert [12], we will call this the real-derivative, or R-derivative, of a possibly nonholomorphic function in order to avoid confusion with the standard complex-derivative, or C-derivative, of a holomorphic function which was presented and discussed in the previous section. Furthermore, we would like the real-derivative to reduce to the standard complex derivative when applied to holomorphic functions.
Note that if one rewrites the real-valued loss function (2) in terms of purely real quantities, one obtains (temporarily suppressing the time dependence, k)

    ℓ(a) = ℓ(a_x, a_y) = E{ e_x^2 + e_y^2 } = E{ (η_x − a_x ξ_x + a_y ξ_y)^2 + (η_y − a_y ξ_x − a_x ξ_y)^2 } .    (4)
From this we can easily determine that

    ∂ℓ(a_x, a_y)/∂a_x = −2 E{ e_x ξ_x + e_y ξ_y } ,

and

    ∂ℓ(a_x, a_y)/∂a_y = 2 E{ e_x ξ_y − e_y ξ_x } .
Together these can be written as

    ( ∂/∂a_x + j ∂/∂a_y ) ℓ(a) = ∂ℓ(a_x, a_y)/∂a_x + j ∂ℓ(a_x, a_y)/∂a_y = −2 E{ ξ̄_k e_k } ,    (5)

which looks very similar to the standard result for the real case.
Indeed, equation (5) is the definition of the generalized complex partial derivative often given in engineering textbooks, including references [7]-[9]. However, this is not the definition used in this note, which instead follows the formulation presented in [10]-[20]. We do not use the definition (5) because it does not reduce to the standard C-derivative for the case when a function f(a) is a holomorphic function of the complex variable a. For example, take the simplest case of f(a) = a, for which the standard derivative yields (d/da) f(a) = 1. In this case, the definition (5) applied to f(a) unfortunately results in the value 0. Thus we will not view the definition (5) as an admissible generalization of the standard complex partial derivative, although it does allow the determination of the stationary points of ℓ(a).^7
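To see the distinction concretely, the following minimal sketch (NumPy assumed; an illustration only, not part of the original note) applies both the textbook definition (5), ∂/∂a_x + j ∂/∂a_y, and the R-derivative ½(∂/∂a_x − j ∂/∂a_y) introduced in the next subsection to the holomorphic test function f(a) = a, using central finite differences.

    import numpy as np

    f = lambda a: a          # the holomorphic test function f(a) = a
    a0, h = 0.3 - 0.7j, 1e-6

    # real partial derivatives with respect to a_x and a_y (central differences)
    df_dax = (f(a0 + h) - f(a0 - h)) / (2 * h)
    df_day = (f(a0 + 1j * h) - f(a0 - 1j * h)) / (2 * h)

    print(df_dax + 1j * df_day)          # definition (5): gives 0, not f'(a) = 1
    print(0.5 * (df_dax - 1j * df_day))  # R-derivative of Section 3.2: gives 1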
3.2 The R-Derivative and Conjugate R-Derivative.
There are a variety of ways to develop the formalism discussed below (see [11]-[14]). Here, we roughly follow the development given in Remmert [12] with additional material drawn from Brandwood [14] and Nehari [11].
^7 In fact, it is a scaled version of the conjugate R-derivative discussed in the next subsection.
Note that the nonholomorphic (nonanalytic in z)^8 functions given as examples in the previous section can all be written in the form f(z, z̄), where they are holomorphic in z = x + j y for fixed z̄ and holomorphic in z̄ = x − j y for fixed z.^9 It can be shown that this fact is true in general for any complex- or real-valued function

    f(z) = f(z, z̄) = f(x, y) = u(x, y) + j v(x, y)    (6)

of a complex variable for which the real-valued functions u and v are differentiable as functions of the real variables x and y. This fact underlies the development of the so-called Wirtinger calculus [12] (or, as we shall refer to it later, the CR-calculus). In essence, the so-called conjugate coordinates,

    Conjugate Coordinates:   c ≜ (z, z̄)^T ∈ C × C ,   z = x + j y and z̄ = x − j y ,    (7)

can serve as a formal substitute for the real representation r = (x, y)^T of the point z = x + j y ∈ C [12].^10
According to Remmert [12], the calculus of complex variables utilizing this perspective was initiated by Henri Poincaré (over 100 years ago!) and further developed by Wilhelm Wirtinger in the 1920s [10]. Although this methodology has been fruitfully exploited by the German-speaking engineering community (see, e.g., references [13] or [33]), it has not generally been appreciated by the English-speaking engineering community until relatively recently.^11
For a general complex- or real-valued function f(c) = f(z, z̄), consider the pair of partial derivatives of f(c) formally^12 defined by

    R-Derivative of f(c):             ∂f(z, z̄)/∂z |_{z̄ = const.}
    Conjugate R-Derivative of f(c):   ∂f(z, z̄)/∂z̄ |_{z = const.}    (8)
^8 Perhaps now we can better appreciate the merit of distinguishing between holomorphic and analytic functions. A function can be nonholomorphic (i.e., nonanalytic) in the complex variable z = x + j y yet still be analytic in the real variables x and y.
^9 That is, if we make the substitution w = z̄, they are analytic in w for fixed z, and analytic in z for fixed w. This simple insight underlies the development given in Brandwood [14] and Remmert [12].
^10 Warning! The interchangeable use of the various notational forms of f implicit in the statement f(z) = f(z, z̄) can lead to confusion. To minimize this possibility we define the term f(z) (z-only) to mean that f(z) is independent of z̄ (and hence is holomorphic) and the term f(z̄) (z̄-only) to mean that f(z) is a function of z̄ only. Otherwise there are no restrictions on f(z) = f(z, z̄).
^11 An important exception is Brandwood [14] and the work that it has recently influenced, such as [1, 15, 16]. However, these latter references do not seem to fully appreciate the clarity and ease of computation that the Wirtinger calculus (CR-calculus) can provide to the problem of differentiating nonholomorphic functions and optimizing real-valued functions of complex variables. Perhaps this is due to the fact that [14] did not reference the Wirtinger calculus as such, nor cite the rich body of work which had already existed in the mathematics community ([11, 18, 12]).
^12 These statements are formal because one cannot truly vary z = x + j y while keeping z̄ = x − j y constant, and vice versa.
where the formal partial derivatives are taken to be standard complex partial derivatives (C-derivatives), taken with respect to z in the first case and with respect to z̄ in the second.^13 For example, with f(z, z̄) = z^2 z̄ we have

    ∂f/∂z = 2 z z̄   and   ∂f/∂z̄ = z^2 .
As denoted in (8), we call the first expression the R-derivative (the real-derivative) and the second expression the conjugate R-derivative (or R̄-derivative).
It is proved in [11, 14, 12] that the R-derivative and R̄-derivative formally defined by (8) can be equivalently written as^14

    ∂f/∂z = (1/2) ( ∂f/∂x − j ∂f/∂y )   and   ∂f/∂z̄ = (1/2) ( ∂f/∂x + j ∂f/∂y )    (9)

where the partial derivatives with respect to x and y are true (i.e., non-formal) partial derivatives of the function f(z) = f(x, y), which is always assumed in this note to be differentiable with respect to x and y (i.e., to be R-differentiable). Thus it is the right-hand sides of the expressions given in (9) which make rigorous the formal definitions of (8).
Note from equation (9) that we immediately have the properties

    ∂z/∂z = ∂z̄/∂z̄ = 1   and   ∂z̄/∂z = ∂z/∂z̄ = 0 .    (10)
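As a quick sanity check of (8) and (9), the following sketch (NumPy assumed; an illustration, not part of the original note) evaluates the right-hand sides of (9) by central finite differences for f(z, z̄) = z^2 z̄ and compares them with the formal results ∂f/∂z = 2 z z̄ and ∂f/∂z̄ = z^2 obtained above.

    import numpy as np

    f = lambda z: z**2 * np.conj(z)     # f(z, zbar) = z^2 zbar
    z0, h = 1.5 - 0.5j, 1e-6

    # true partial derivatives with respect to x and y (central differences)
    df_dx = (f(z0 + h) - f(z0 - h)) / (2 * h)
    df_dy = (f(z0 + 1j * h) - f(z0 - 1j * h)) / (2 * h)

    # right-hand sides of (9)
    df_dz    = 0.5 * (df_dx - 1j * df_dy)   # R-derivative
    df_dzbar = 0.5 * (df_dx + 1j * df_dy)   # conjugate R-derivative

    print(df_dz,    2 * z0 * np.conj(z0))   # both approx. 2 z zbar
    print(df_dzbar, z0**2)                  # both approx. z^2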
Comments:
1. The condition ∂f/∂z̄ = 0 is true for an R-differentiable function f if and only if the Cauchy-Riemann conditions are satisfied (see [11, 14, 12]). Thus a function f is holomorphic (analytic in z) if and only if it does not depend on the complex conjugated variable z̄, f(z) = f(z) (z only).^15
2. The R-derivative, ∂f/∂z, of an R-differentiable function f is equal to the standard C-derivative, f′(z), when f is holomorphic. Recall that f(z) can be expanded about a point z_0 in the convergent power series

    f(z) = f(z_0) + f′(z_0)(z − z_0) + (1/2) f″(z_0)(z − z_0)^2 + ⋯ + (1/n!) f^(n)(z_0)(z − z_0)^n + ⋯ ,

where the complex coefficient f^(n)(z_0) denotes an n-times C-derivative of f(z) evaluated at the point z_0, if and only if it is holomorphic. If the function f(z) is not holomorphic over C, so that the above expansion does not exist, but is nonetheless still R-analytic as a mapping from R^2 to R^2, then the real and imaginary parts of f(z) = u(x, y) + j v(x, y), z = x + j y, can be expanded in terms of the real variables r = (x, y)^T,

    u(r) = u(r_0) + [∂u(r_0)/∂r](r − r_0) + (1/2)(r − r_0)^T [∂/∂r (∂u(r_0)/∂r)^T](r − r_0) + ⋯
    v(r) = v(r_0) + [∂v(r_0)/∂r](r − r_0) + (1/2)(r − r_0)^T [∂/∂r (∂v(r_0)/∂r)^T](r − r_0) + ⋯

Note that if the R-analytic function is purely real, then f(z) = u(x, y) and we have

    f(r) = f(r_0) + [∂f(r_0)/∂r](r − r_0) + (1/2)(r − r_0)^T [∂/∂r (∂f(r_0)/∂r)^T](r − r_0) + ⋯
Properties of the R- and R̄-Derivatives. The R-derivative and R̄-derivative are both linear operators which obey the product rule of differentiation. The following important and useful properties also hold (see references [11, 12]).^16
Complex Derivative Identities:

    ∂f̄/∂z = \overline{∂f/∂z̄}    (12)

    ∂f̄/∂z̄ = \overline{∂f/∂z}    (13)

    df = (∂f/∂z) dz + (∂f/∂z̄) dz̄     Differential Rule    (14)

    ∂h(g)/∂z = (∂h/∂g)(∂g/∂z) + (∂h/∂ḡ)(∂ḡ/∂z)     Chain Rule    (15)

    ∂h(g)/∂z̄ = (∂h/∂g)(∂g/∂z̄) + (∂h/∂ḡ)(∂ḡ/∂z̄)     Chain Rule    (16)

As a simple consequence of the above, note that if f(z) is real-valued then f̄(z) = f(z), so that we have the additional very important identity

    f(z) ∈ R   ⟹   ∂f/∂z̄ = \overline{∂f/∂z}    (17)
As a simple first application of the above, note that the conjugate R-derivative (R̄-derivative) of ℓ(a) can be easily computed from the definition (2) and the above properties to be

    ∂ℓ(a)/∂ā = ∂/∂ā E{ e_k ē_k } = E{ (∂e_k/∂ā) ē_k + e_k (∂ē_k/∂ā) } = E{ 0 · ē_k − e_k ξ̄_k } = −E{ ξ̄_k e_k } ,    (18)

which is the same result obtained from the brute-force method based on expanding the loss function in terms of the real and imaginary parts of a, followed by computing (5) and then using the result (9). Similarly, it can be easily shown that the R-derivative of ℓ(a) is given by

    ∂ℓ(a)/∂a = −E{ ξ_k ē_k } .    (19)

Note that the results (18) and (19) are the complex conjugates of each other, which is consistent with the identity (17).
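A quick Monte Carlo check of (18) is sketched below (NumPy and the particular signal model are assumptions for illustration, not part of the original note): the expectations are replaced by sample averages and the Wirtinger result −E{ξ̄_k e_k} is compared against central finite differences of the sample loss, using ∂ℓ/∂ā = ½(∂ℓ/∂a_x + j ∂ℓ/∂a_y).

    import numpy as np

    rng = np.random.default_rng(0)
    N = 200_000
    xi  = rng.standard_normal(N) + 1j * rng.standard_normal(N)       # data xi_k
    eta = (0.8 - 0.3j) * xi + 0.1 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))

    def loss(a):
        """Sample version of ell(a) = E|eta_k - a xi_k|^2."""
        e = eta - a * xi
        return np.mean(np.abs(e) ** 2)

    a0, h = 0.2 + 0.5j, 1e-5
    dl_dax = (loss(a0 + h) - loss(a0 - h)) / (2 * h)
    dl_day = (loss(a0 + 1j * h) - loss(a0 - 1j * h)) / (2 * h)

    e0 = eta - a0 * xi
    print(0.5 * (dl_dax + 1j * dl_day))      # numerical conjugate R-derivative of ell
    print(-np.mean(np.conj(xi) * e0))        # Wirtinger result (18): -E{ conj(xi_k) e_k }

    # Setting (18) to zero gives a = E{ conj(xi) eta } / E{ |xi|^2 } (Wiener solution)
    a_opt = np.mean(np.conj(xi) * eta) / np.mean(np.abs(xi) ** 2)
    print(abs(-np.mean(np.conj(xi) * (eta - a_opt * xi))))   # approx. 0 at the stationary point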
We view the pair of formal partial derivatives for a possibly nonholomorphic function defined by (8) as the natural generalization of the single complex derivative (C-derivative) of a holomorphic function. The fact that there are two derivatives under general consideration does not need to be developed in elementary standard complex analysis courses, where it is usually assumed that f is always holomorphic (analytic in z). In the case when f is holomorphic, f is independent of z̄ and the conjugate partial derivative is zero, while the extended derivative reduces to the standard complex derivative.

^16 In the following, for z = x + j y we define dz = dx + j dy and dz̄ = dx − j dy, while h(g) = h ∘ g denotes the composition of the two functions h and g.
First-Order Optimality Conditions. As mentioned in the introduction, we are often interested in optimizing a scalar function with respect to the real and imaginary parts r = (x, y)^T of a complex number z = x + j y. It is a standard result from elementary calculus that a first-order necessary condition for a point r_0 = (x_0, y_0)^T to be an optimum is that this point be a stationary point of the loss function. Assuming differentiability, stationarity is equivalent to the condition that the partial derivatives of the loss function with respect to the parameters r = (x, y)^T vanish at the point r_0 = (x_0, y_0)^T. The following fact is an easy consequence of the definitions (8) and is discussed in [14]:
A necessary and sufficient condition for a real-valued function, f(z) = f(x, y), z = x + j y, to have a stationary point with respect to the real parameters r = (x, y)^T ∈ R^2 is that its R-derivative vanishes. Equivalently, a necessary and sufficient condition for f(z) = f(x, y) to have a stationary point with respect to r = (x, y)^T ∈ R^2 is that its R̄-derivative vanishes.
For example, setting either of the derivatives (18) or (19) to zero results in the so-called Wiener-
Hopf equations for the optimal MMSE estimate of a. This result can be readily extended to the
multivariate case, as will be discussed later in this note.
The Univariate CR-Calculus. As noted in [12], the approach we have been describing is known as the Wirtinger calculus in the German-speaking countries, after the pioneering work of Wilhelm Wirtinger in the 1920s [10]. Because this approach is based on being able to apply the calculus of real variables to make statements about functions of complex variables, in this note we use the term CR-calculus interchangeably with Wirtinger calculus.
Despite the important insights and ease of computation that it can provide, it is the case that the use of conjugate coordinates z and z̄ (which underlies the CR-calculus) is not needed when developing the classical univariate theory of holomorphic (analytic in z) functions.^17 It is only in the multivariate and/or nonholomorphic case that the tools of the CR-calculus begin to be indispensable. Therefore it is not developed in the standard courses taught to undergraduate engineering and science students in this country [3]-[6], which have changed little in mode of presentation from the earliest textbooks.^18
^17 "The differential calculus of these operations ... [is] ... largely irrelevant for classical function theory ..." R. Remmert [12], page 66.
^18 For instance, the widely used textbook by Churchill [3] adheres closely to the format and topics of its first edition, which was published in 1948. The latest edition (the 7th at the time of this writing) does appear to have one brief homework problem on the extended and conjugate derivatives.
Ironically, the elementary textbook by Nehari [11] was an attempt made in 1961 (over 40 years ago!) to integrate at least some aspects of the CR-calculus into the elementary treatment of functions of a single complex variable.^19 However, because the vast majority of textbooks treat the univariate case, as long as the mathematics community, and most of the engineering community, was able to avoid dealing with nonholomorphic functions, there was no real need to bring the ideas of the CR-calculus into the mainstream univariate textbooks.
Fortunately, an excellent sophisticated and extensive introduction to univariate complex vari-
ables theory and the CR-calculus is available in the textbook by Remmert [12], which is a transla-
tion from the 1989 German edition. This book also details the historical development of complex
analysis. The highly recommended Remmert and Nehari texts have been used as primary refer-
ences for this note (in addition to the papers by Brandwood [14] and Van den Bos [27]).
The Multivariate CR-Calculus. Although one can forgo the tools of the CR-calculus in the case of univariate holomorphic functions, this is not the situation in the multivariate holomorphic case, where researchers have long utilized these tools [17]-[20].^20 Unfortunately, multivariate complex analysis is highly specialized and technically abstruse, and therefore virtually all of the standard textbooks are accessible only to the specialist or to the aspiring specialist. It is commonly assumed in these textbooks that the reader has great facility with differential geometry, topology, calculus on manifolds, and differential forms, in addition to a good grasp of advanced univariate complex variables theory. However, because the focus of the theory of multivariate complex functions is on holomorphic functions, whereas our concern is the essentially ignored (in this literature) case of nonholomorphic real-valued functionals, it appears to be true that only a very small part of the material presented in these references is useful, primarily for creating a rigorous and self-consistent multivariate CR-calculus framework based on the results given in the papers by Brandwood [14] and Van den Bos [27].
The clear presentation by Brandwood [14] provides a highly accessible treatment of the multivariate CR-calculus as applied to the problem of optimizing real-valued functionals of complex variables.^21 As this is the primary interest of many engineers, this pithy paper is a very useful presentation of just those very few theoretical and practical issues which are needed to get a clear grasp of the problem. Unfortunately, even twenty years after its publication, this paper is still not as widely known as it should be. However, the recent utilization of the Brandwood results in [1, 13, 15, 16] seems to indicate a standardization of the Brandwood presentation of the complex gradient in the mainstream textbooks. The results given in the Brandwood paper [14] are particularly useful when coupled with the significant extension of Brandwood's results to the problem of computing complex Hessians which has been provided by Van den Bos's paper [27].
^19 This is still an excellent textbook that is highly recommended for an accessible introduction to the use of derivatives based on the conjugate coordinates z and z̄.
^20 "[The CR-calculus] is quite indispensable in the function theory of several variables." R. Remmert [12], page 67.
^21 Although, as mentioned in an earlier footnote, Brandwood for some reason did not cite or mention any prior work relating to the use of conjugate coordinates or the Wirtinger calculus.
At this still relatively early stage in the development of a widely accepted framework for dealing with real-valued (nonholomorphic) functions of several complex variables, presumably even the increasingly widely used formalism of Brandwood [14] and Van den Bos [27] potentially has some room for improvement and/or clarification (though this is admittedly a matter of taste). In this spirit, and mindful of the increasing acceptance of the approach in [14] and [27], in the remainder of this note we develop a multivariate CR-calculus framework that is only slightly different from that of [14] and [27], incorporating insights available from the literature on the calculus of multivariate complex functions and complex differential manifolds [17]-[20].^22
4 Multivariate CR-Calculus
The remaining sections of this note will provide an expanded discussion and generalized presen-
tation of the multivariate CR-calculus as presented in Brandwood [14] and Van den Bos [27], and
it is assumed that the reader has read these papers, as well as reference [25]. The discussion given
below utilizes insights gained from references [17, 18, 19, 20, 21, 22] and adapts notation and
concepts presented for the real case in [25].
4.1 The Space Z = C^n.
We define the n-dimensional column vector z by

    z = ( z_1 ⋯ z_n )^T ∈ Z = C^n ,   z_i = x_i + j y_i ,   i = 1, ⋯, n,

or, equivalently,

    z = x + j y

with x = (x_1 ⋯ x_n)^T and y = (y_1 ⋯ y_n)^T. The space Z = C^n is a vector space over the field of complex numbers with the standard component-wise definitions of vector addition and scalar multiplication. Noting the one-to-one correspondence

    z ∈ C^n   ⟷   r = col(x, y) ∈ ℛ ≜ R^{2n} = R^n × R^n ,

it is evident that there exists a natural isomorphism between Z = C^n and ℛ = R^{2n}.
The conjugate coordinates of z ∈ C^n are defined by

    z̄ = ( z̄_1 ⋯ z̄_n )^T ∈ Z̄ = C^n .
^22 Realistically, one must admit that many, and likely most, engineers will be unlikely to make the move from the perspective and tools provided by [14] and [27] (which already enable the engineer to solve most problems of practical interest) to that developed in this note, primarily because of the requirement of some familiarity with (or willingness to learn) concepts of differential geometry at the level presented in [25] (which is at the level of the earlier chapters of [21] and [22]).
We denote the pair of conjugate coordinate vectors (z, z̄) by

    c ≜ col(z, z̄) ∈ C^{2n} = C^n × C^n .
Noting that c, (z, z̄), z, (x, y), and r are alternative ways to denote the same point z = x + j y in Z = C^n, for a function

    f : C^n → C^m

throughout this note we will use the convenient (albeit abusive) notation

    f(c) = f(z, z̄) = f(z) = f(x, y) = f(r) ∈ C^m

where z = x + j y ∈ Z = C^n. We will have more to say about the relationships between these representations later on in Section 6 below.
We further assume that Z = C^n is a Riemannian manifold with a hermitian, positive-definite n × n metric tensor Ω_z = Ω_z^H > 0. This assumption makes every tangent space^23 T_z Z = C^n_z a Hilbert space with inner product

    ⟨v_1, v_2⟩ = v_1^H Ω_z v_2 ,   v_1, v_2 ∈ C^n_z .
4.2 The Cogradient Operator and the Jacobian Matrix
The Cogradient and Conjugate Cogradient. Define the cogradient and conjugate cogradient operators respectively as the row operators^24

    Cogradient Operator:             ∂/∂z ≜ ( ∂/∂z_1  ⋯  ∂/∂z_n )    (20)

    Conjugate Cogradient Operator:   ∂/∂z̄ ≜ ( ∂/∂z̄_1  ⋯  ∂/∂z̄_n )    (21)

where (z_i, z̄_i), i = 1, ⋯, n, are conjugate coordinates as discussed earlier and the component operators are R-derivatives and R̄-derivatives defined according to equations (8) and (9),

    ∂/∂z_i = (1/2) ( ∂/∂x_i − j ∂/∂y_i )   and   ∂/∂z̄_i = (1/2) ( ∂/∂x_i + j ∂/∂y_i ) ,    (22)

for i = 1, ⋯, n.^25
Equivalently, we have

    ∂/∂z = (1/2) ( ∂/∂x − j ∂/∂y )   and   ∂/∂z̄ = (1/2) ( ∂/∂x + j ∂/∂y ) .    (23)
^23 A tangent space at the point z is the space of all differential displacements, dz, at the point z or, alternatively, the space of all velocity vectors v = dz/dt at the point z. These are equivalent statements because dz and v are scaled versions of each other, dz = v dt. The tangent space T_z Z = C^n_z is a linear variety in the space Z = C^n. Specifically, it is a copy of C^n affinely translated to the point z, C^n_z = z + C^n.
^24 The rationale behind the terminology "cogradient" is explained in [25].
^25 As before, the left-hand sides of (23) are formal partial derivatives, while the right-hand sides are actual partial derivatives.
When applying the cogradient operator ∂/∂z, the variable z̄ is formally treated as a constant, and when applying the conjugate cogradient operator ∂/∂z̄, the variable z is formally treated as a constant. For example, consider the scalar-valued function

    f(c) = f(z, z̄) = z_1 z̄_2 + z̄_1 z_2 .

For this function we can readily determine by partial differentiation on the z_i and z̄_i components that

    ∂f(c)/∂z = ( z̄_2   z̄_1 )   and   ∂f(c)/∂z̄ = ( z_2   z_1 ) .
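The sketch below (NumPy assumed; for illustration only) checks this example numerically by applying the component formulas (22) with central finite differences in x_i and y_i.

    import numpy as np

    def f(z):
        # f(z, zbar) = z1*conj(z2) + conj(z1)*z2  (a real-valued function)
        return z[0] * np.conj(z[1]) + np.conj(z[0]) * z[1]

    def cogradients(f, z, h=1e-6):
        """Rows (df/dz, df/dzbar) evaluated component-wise via (22)."""
        n = len(z)
        dz, dzbar = np.zeros(n, complex), np.zeros(n, complex)
        for i in range(n):
            e = np.zeros(n, complex); e[i] = 1.0
            df_dx = (f(z + h * e) - f(z - h * e)) / (2 * h)
            df_dy = (f(z + 1j * h * e) - f(z - 1j * h * e)) / (2 * h)
            dz[i]    = 0.5 * (df_dx - 1j * df_dy)   # d f / d z_i
            dzbar[i] = 0.5 * (df_dx + 1j * df_dy)   # d f / d zbar_i
        return dz, dzbar

    z0 = np.array([1.0 + 2.0j, -0.5 + 0.3j])
    dz, dzbar = cogradients(f, z0)
    print(dz,    np.array([np.conj(z0[1]), np.conj(z0[0])]))   # approx. ( zbar_2, zbar_1 )
    print(dzbar, np.array([z0[1], z0[0]]))                     # approx. ( z_2, z_1 )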
The Jacobian Matrix. Let f(c) = f(z, z̄) ∈ C^m be a mapping^26

    f : Z = C^n → C^m .

The generalization of the identity (14) yields the vector form of the differential rule,^27

    df(c) = [∂f(c)/∂c] dc = [∂f(c)/∂z] dz + [∂f(c)/∂z̄] dz̄ ,    Differential Rule    (24)
where the m × n matrix ∂f/∂z is called the Jacobian, or Jacobian matrix, of the mapping f, and the m × n matrix ∂f/∂z̄ the conjugate Jacobian of f. The Jacobian of f, often denoted by J_f, is computed by applying the cogradient operator component-wise to f,

    J_f(c) ≜ ∂f(c)/∂z = ( ∂f_1(c)/∂z )   ( ∂f_1(c)/∂z_1  ⋯  ∂f_1(c)/∂z_n )
                        (      ⋮      ) = (      ⋮        ⋱       ⋮       ) ∈ C^{m×n} ,    (25)
                        ( ∂f_m(c)/∂z )   ( ∂f_m(c)/∂z_1  ⋯  ∂f_m(c)/∂z_n )
and similarly the conjugate Jacobian, denoted by J^c_f, is computed by applying the conjugate cogradient operator component-wise to f,

    J^c_f(c) ≜ ∂f(c)/∂z̄ = ( ∂f_1(c)/∂z̄ )   ( ∂f_1(c)/∂z̄_1  ⋯  ∂f_1(c)/∂z̄_n )
                          (      ⋮       ) = (      ⋮         ⋱       ⋮        ) ∈ C^{m×n} .    (26)
                          ( ∂f_m(c)/∂z̄ )   ( ∂f_m(c)/∂z̄_1  ⋯  ∂f_m(c)/∂z̄_n )
With this notation we can write the differential rule as

    df(c) = J_f(c) dz + J^c_f(c) dz̄ .    Differential Rule    (27)
^26 It will always be assumed that the components of vector-valued functions are R-differentiable as discussed in footnotes 2 and 13.
^27 At this point in our development, the expression [∂f(c)/∂c] dc only has meaning as a shorthand for [∂f(c)/∂z] dz + [∂f(c)/∂z̄] dz̄, each term of which must be interpreted formally as z and z̄ cannot be varied independently of each other. (Later, we will examine the very special sense in which the derivative with respect to c itself can make sense.) Also note that, unlike the real case discussed in [25], the mapping dz ↦ df(c) is not linear in dz. Even when interpreted formally, the mapping is affine in dz, not linear.
Applying properties (12) and (13) component-wise yields the identities

    ∂f̄(c)/∂z = \overline{∂f(c)/∂z̄} = \overline{J^c_f(c)}   and   ∂f̄(c)/∂z̄ = \overline{∂f(c)/∂z} = \overline{J_f(c)} .    (28)
Note from (28) that, in general,

    J_f(c) = ∂f(c)/∂z ≠ ∂f̄(c)/∂z = \overline{J^c_f(c)} = \overline{∂f(c)/∂z̄} .    (29)
However, in the important special case that f(c) is real-valued, in which case f̄(c) = f(c), we have

    f(c) ∈ R^m   ⟹   J_f(c) = ∂f(c)/∂z = \overline{∂f(c)/∂z̄} = \overline{J^c_f(c)} .    (30)
With (27) this yields the following important fact, which holds for real-valued functions f(c):^28

    f(c) ∈ R^m   ⟹   df(c) = J_f(c) dz + \overline{J_f(c)} dz̄ = 2 Re{ J_f(c) dz } .    (31)
Consider the composition of two mappings h : C^m → C^r and g : C^n → C^m,

    h ∘ g = h(g) : C^n → C^r .
The vector extensions of the chain rule identities (15) and (16) to h ∘ g are

    ∂h(g)/∂z = (∂h/∂g)(∂g/∂z) + (∂h/∂ḡ)(∂ḡ/∂z)     Chain Rule    (32)

    ∂h(g)/∂z̄ = (∂h/∂g)(∂g/∂z̄) + (∂h/∂ḡ)(∂ḡ/∂z̄)     Chain Rule    (33)
which can be written as

    J_{h∘g} = J_h J_g + J^c_h \overline{J^c_g}    (34)

    J^c_{h∘g} = J_h J^c_g + J^c_h \overline{J_g}    (35)
Holomorphic Vector-valued Functions. By definition the vector-valued function f(z) is holomorphic (analytic in the complex vector z) if and only if each of its components

    f_i(c) = f_i(z, z̄) = f_i(z_1, ⋯, z_n, z̄_1, ⋯, z̄_n) ,   i = 1, ⋯, m ,

is holomorphic separately with respect to each of the components z_j, j = 1, ⋯, n. In the references [17, 18, 19, 20] it is shown that f(z) is holomorphic on a domain if and only if it satisfies a matrix Cauchy-Riemann condition everywhere on the domain:

    Cauchy-Riemann Condition:   J^c_f = ∂f/∂z̄ = 0    (36)

This shows that a vector-valued function which is holomorphic on C^n must be a function of z only,

    f(c) = f(z, z̄) = f(z)   (z only).
^28 The real part of a vector (or matrix) is the vector (or matrix) of the real parts. Note that the mapping dz ↦ df(c) is not linear.
Stationary Points of Real-Valued Functionals. Suppose that f is a scalar real-valued function from C^n to R,^29

    f : C^n → R ;   z ↦ f(z) .

As discussed in [14], the first-order differential condition for a real-valued functional f to be optimized with respect to the real and imaginary parts of z at the point z_0 is
    Condition I for a Stationary Point:   ∂f(z_0, z̄_0)/∂z = 0    (37)
That this fact is true is straightforward to ascertain from equations (20) and (23). An equivalent first-order condition for a real-valued functional f to be stationary at the point z_0 is given by

    Condition II for a Stationary Point:   ∂f(z_0, z̄_0)/∂z̄ = 0    (38)

The equivalence of the two conditions (37) and (38) is a direct consequence of (28) and the fact that f is real-valued.
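As a concrete illustration (a sketch under assumed notation, not an example from the original note), consider the real-valued functional f(z) = ‖b − Az‖^2 = (b − Az)^H (b − Az) with A ∈ C^{m×n} and b ∈ C^m. Treating z as a constant and differentiating with respect to z̄, the conjugate cogradient is ∂f/∂z̄ = (A^H A z − A^H b)^T, so Condition II gives the normal equations A^H A z_0 = A^H b. The NumPy sketch below checks the stationarity condition numerically at a least-squares solution.

    import numpy as np

    rng = np.random.default_rng(1)
    m, n = 6, 3
    A = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))
    b = rng.standard_normal(m) + 1j * rng.standard_normal(m)

    def f(z):
        r = b - A @ z
        return np.real(np.vdot(r, r))       # ||b - A z||^2 (real-valued)

    # Least-squares solution, i.e., the solution of the normal equations A^H A z = A^H b
    z0 = np.linalg.lstsq(A, b, rcond=None)[0]

    # Conjugate cogradient df/dzbar at z0, evaluated component-wise via (22)
    h = 1e-6
    grad = np.zeros(n, complex)
    for i in range(n):
        e = np.zeros(n, complex); e[i] = 1.0
        df_dx = (f(z0 + h * e) - f(z0 - h * e)) / (2 * h)
        df_dy = (f(z0 + 1j * h * e) - f(z0 - 1j * h * e)) / (2 * h)
        grad[i] = 0.5 * (df_dx + 1j * df_dy)

    print(np.max(np.abs(grad)))                          # approx. 0: Condition II holds at z0
    print(np.max(np.abs(A.conj().T @ (A @ z0 - b))))     # normal-equations residual approx. 0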
Differentiation of Conjugate Coordinates? Note that the use of the notation f(c) as shorthand for f(z, z̄) appears to suggest that it is permissible to take the complex cogradient of f(c) with respect to the conjugate coordinates vector c by treating the complex vector c itself as the variable of differentiation. This is not correct. Only complex differentiation with respect to the complex vectors z and z̄ is well-defined. Thus, from the definition c ≜ col(z, z̄) ∈ C^{2n}, for c viewed as a complex 2n-dimensional vector, the correct interpretation of ∂f(c)/∂c is given by

    ∂f(c)/∂c ≜ ( ∂f(z, z̄)/∂z ,  ∂f(z, z̄)/∂z̄ ) .

Note, for example, that under this interpretation one does not obtain ∂(c^H c)/∂c = c^H, which would be true if it were permissible to take the complex cogradient with respect to the complex vector c (which it isn't).
Remarkably, however, below we will show that the 2n-dimensional complex vector c is an element of a 2n-dimensional real vector space and that, as a consequence, it is permissible to take the real cogradient with respect to the real vector c!
Comments. With the machinery developed up to this point, one can solve optimization problems which have closed-form solutions to the first-order stationarity conditions. However, to solve general nonlinear problems one must often resort to gradient-based iterative methods. Furthermore, to verify that the solutions are optimal, one needs to check second-order conditions, which require the construction of the Hessian matrix. Therefore, the remainder of this note is primarily concerned with the development of the machinery required to construct the gradient and Hessian of a scalar-valued functional of complex parameters.
^29 The function f is unbolded to indicate its scalar-valued status.
4.3 Biholomorphic Mappings and Change of Coordinates.
Holomorphic and Biholomorphic Mappings. A vector-valued function f is holomorphic (analytic) if its components are holomorphic. In this case the function does not depend on the conjugate coordinate z̄, f(c) = f(z), and satisfies the Cauchy-Riemann condition,

    J^c_f = ∂f/∂z̄ = 0 .
As a consequence (see (27)),

    f(z) holomorphic   ⟹   df(z) = J_f(z) dz = [∂f(z)/∂z] dz .    (39)

Note that when f is holomorphic, the mapping dz ↦ df(z) is linear, exactly as in the real case.
Consider the composition of two mappings h : C^m → C^r and g : C^n → C^m,

    h ∘ g = h(g) : C^n → C^r ,
which are both holomorphic. In this case, as a consequence of the Cauchy-Riemann condition (36), the second chain rule condition (35) vanishes, J^c_{h∘g} = 0, and the first chain rule condition (34) simplifies to

    h and g holomorphic   ⟹   J_{h∘g} = J_h J_g .    (40)
Now consider the holomorphic mapping ξ = f(z),

    dξ = df(z) = J_f(z) dz    (41)

and assume that it is invertible,

    z = g(ξ) = f^{-1}(ξ) .    (42)

If the invertible function f and its inverse g = f^{-1} are both holomorphic, then f (equivalently, g) is said to be biholomorphic. In this case, we have that

    dz = [∂g(ξ)/∂ξ] dξ = J_g(ξ) dξ = J_f^{-1}(z) dξ ,   ξ = f(z) ,    (43)
showing that

    J_g(ξ) = J_f^{-1}(z) ,   ξ = f(z) .    (44)
Coordinate Transformations. Admissible coordinates on a space defined over a space of complex numbers are related via biholomorphic transformations [17, 18, 19, 20]. Thus if z and ξ are admissible coordinates on Z = C^n, there must exist a biholomorphic mapping relating the two coordinates, ξ = f(z). This relationship is often denoted in the following (potentially confusing) manner,

    ξ = ξ(z) ,   dξ = [∂ξ(z)/∂z] dz = J_ξ(z) dz ,   ∂ξ(z)/∂z = J_ξ(z) = J_z^{-1}(ξ) = [ ∂z(ξ)/∂ξ ]^{-1} ,    (45)
    z = z(ξ) ,   dz = [∂z(ξ)/∂ξ] dξ = J_z(ξ) dξ ,   ∂z(ξ)/∂ξ = J_z(ξ) = J_ξ^{-1}(z) = [ ∂ξ(z)/∂z ]^{-1} .    (46)
These equations tell us how vectors (elements of any particular tangent space C^n_z) properly transform under a change of coordinates.
In particular, under the change of coordinates ξ = ξ(z), a vector v ∈ C^n_z must transform to its new representation w ∈ C^n_{ξ(z)} according to the

    Vector Transformation Law:   w = [∂ξ/∂z] v = J_ξ v    (47)
For the composite coordinate transformation ζ(ξ(z)), the chain rule yields the

    Transformation Chain Rule:   ∂ζ/∂z = (∂ζ/∂ξ)(∂ξ/∂z)   or   J_{ζ∘ξ} = J_ζ J_ξ    (48)
Finally, applying the chain rule to the cogradient, ∂f/∂z, of an arbitrary holomorphic function f we obtain

    ∂f/∂ξ = (∂f/∂z)(∂z/∂ξ)   for   ξ = ξ(z) .
This shows that the cogradient, as an operator on holomorphic functions, transforms like the

    Cogradient Transformation Law:   ∂(·)/∂ξ = [∂(·)/∂z] (∂z/∂ξ) = [∂(·)/∂z] J_z = [∂(·)/∂z] J_ξ^{-1}    (49)
Note that generally the cogradient transforms quite differently than does a vector.
Finally, the transformation law for the metric tensor under a change of coordinates can be determined from the requirement that the inner product must be invariant under a change of coordinates. For arbitrary vectors v_1, v_2 ∈ C^n_z transformed as

    w_i = J_ξ v_i ∈ C^n_{ξ(z)} ,   i = 1, 2 ,
we have

    ⟨w_1, w_2⟩ = w_1^H Ω_ξ w_2 = v_1^H J_ξ^H Ω_ξ J_ξ v_2 = v_1^H Ω_z v_2 = ⟨v_1, v_2⟩ .

This results in the

    Metric Tensor Transformation Law:   Ω_ξ = J_ξ^{-H} Ω_z J_ξ^{-1} = J_z^H Ω_z J_z    (50)
5 The Gradient Operator ∇_z
1st-Order Approximation of a Real-Valued Function. Let f(c) be a real-valued^30 functional to be optimized with respect to the real and imaginary parts of the vector z ∈ Z = C^n,

    f : C^n → R .

As a real-valued function, f(c) does not satisfy the Cauchy-Riemann condition (36) and is therefore not holomorphic.
From (31) we have (with f(z) = f(z, z̄) = f(c)) that

    df(z) = 2 Re{ J_f(z) dz } = 2 Re{ [∂f(z)/∂z] dz } .    (51)
This yields the first-order relationship

    f(z + dz) = f(z) + 2 Re{ [∂f(z)/∂z] dz }    (52)

and the corresponding first-order power series approximation

    f(z + Δz) ≈ f(z) + 2 Re{ [∂f(z)/∂z] Δz } ,    (53)

which will be rederived by other means in Section 6 below.
The Complex Gradient of a Real-Valued Function. The relationship (51) defines a nonlinear functional, df_c(·), on the tangent space C^n_z,^31

    df_c(v) = 2 Re{ [∂f(c)/∂z] v } ,   v ∈ C^n_z ,   c = (z, z̄) .    (54)
Assuming the existence of a metric tensor Ω_z we can write

    [∂f/∂z] v = ( Ω_z^{-1} [∂f/∂z]^H )^H Ω_z v = (∇_z f)^H Ω_z v = ⟨∇_z f, v⟩ ,    (55)

where ∇_z f is the gradient of f, defined as the

    Gradient of f:   ∇_z f ≜ Ω_z^{-1} [ ∂f/∂z ]^H    (56)
^30 And therefore unbolded.
^31 Because this operator is nonlinear in dz, unlike the real vector-space case (see the discussion of the real vector-space case given in [25]), we will avoid calling it a differential operator.
Consistent with this definition, the gradient operator is defined as the

    Gradient Operator:   ∇_z (·) ≜ Ω_z^{-1} [ ∂(·)/∂z ]^H    (57)
One can show from the coordinate transformation laws for cogradients and metric tensors that the gradient ∇_z f transforms like a vector and therefore is a vector, ∇_z f ∈ C^n_z.
Equations (54) and (55) yield

    df_c(v) = 2 Re ⟨∇_z f, v⟩ .
Keeping ‖v‖ = 1, we want to find the directions v of steepest increase in the value of |df_c(v)|. We have as a consequence of the Cauchy-Schwarz inequality that, for all unit vectors v ∈ C^n_z,

    |df_c(v)| = 2 |Re ⟨∇_z f, v⟩| ≤ 2 |⟨∇_z f, v⟩| ≤ 2 ‖∇_z f‖ ‖v‖ = 2 ‖∇_z f‖ .
This upper bound is attained if and only if v ∝ ∇_z f, showing that the gradient gives the directions of steepest increase, with +∇_z f giving the direction of steepest ascent and −∇_z f giving the direction of steepest descent. The result (57) is derived in [14] for the special case that the metric is Euclidean, Ω_z = I.^32
Note that the first-order necessary condition for a stationary point is given by ∇_z f = 0, but that it is much easier to apply the simpler condition ∂f/∂z = 0, which does not require knowledge of the metric tensor. Of course this distinction vanishes when Ω_z = I, as is the case in [14].
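As an illustration of these definitions (a sketch using assumed data, not an example from the original note), take the Euclidean metric Ω_z = I, so that ∇_z f = (∂f/∂z)^H, and consider the real-valued quadratic f(z) = z^H Q z − 2 Re{b^H z} with Q Hermitian positive-definite. Here ∂f/∂z = z^H Q − b^H, so ∇_z f = Q z − b and steepest descent reads z ← z − μ (Q z − b), converging to the stationary point Q z = b.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 4
    B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    Q = B.conj().T @ B + n * np.eye(n)        # Hermitian positive-definite
    b = rng.standard_normal(n) + 1j * rng.standard_normal(n)

    def f(z):
        return np.real(z.conj() @ Q @ z - 2 * np.real(b.conj() @ z))

    z = np.zeros(n, complex)
    mu = 1.0 / np.linalg.norm(Q, 2)           # conservative step size
    for _ in range(500):
        grad = Q @ z - b                      # gradient (57) with Omega_z = I
        z = z - mu * grad                     # steepest-descent update

    print(np.linalg.norm(Q @ z - b))          # approx. 0: stationary point reached
    print(f(z), f(np.linalg.solve(Q, b)))     # cost at iterate vs at exact minimizer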
Comments on Applying the Multivariate CR-Calculus. Because the components of the cogradient and conjugate cogradient operators (20) and (21) formally behave like partial derivatives of functions over real vectors, using them does not require the development of additional vector partial-derivative identities over and above those that already exist for the real vector space case. The real vector space identities and procedures for vector partial-differentiation (as developed, e.g., in [25]) carry over without change, provided one first carefully distinguishes between those variables which are to be treated like constants and those variables which are to be formally differentiated.
Thus, although a variety of complex derivative identities are given in various references [14, 15, 16], there is actually no need to memorize or look up additional complex derivative identities if one already knows the real derivative identities. In particular, the derivation of the complex derivative identities given in references [14, 15, 16] is trivial if one already knows the standard real-vector derivative identities. For example, it is obviously the case that

    ∂(a^H z̄)/∂z = a^H [∂z̄/∂z] = 0 ,

as z̄ is to be treated as a constant when taking partial derivatives with respect to z, so the fact that ∂(a^H z̄)/∂z = 0 does not have to be memorized as a special complex derivative identity. To reiterate, if one already knows the standard gradient identities for real-valued functions of real variables, there is no need to memorize additional complex derivative identities.^33 Instead, one can merely use the regular real derivative identities while keeping track of which complex variables are to be treated as constants.^34 This is the approach used to easily derive the complex LMS algorithm in the applications section at the end of this note.

^32 Therefore one must be careful to ascertain when a result derived in [14] holds in the general case. Also note the notational difference between this note and [14]. We have ∇_z denoting the gradient operator while [14] denotes the gradient operator as ∇_z̄ for Ω_z = I. This difference is purely notational.
To implement a true gradient descent algorithm, one needs to know the metric tensor. The correct gradient, which depends on the metric tensor, is called the natural gradient in [26], where it is argued that superior performance of gradient descent algorithms in certain statistical parameter estimation problems occurs when the natural gradient is used in lieu of the standard naive gradient usually used in such algorithms (see also the discussion in [25]). However, the determination of the metric tensor for a specific application can be highly nontrivial and the resulting algorithms significantly more complex, as discussed in [26], although there are cases where the application of the natural gradient methodology is surprisingly straightforward.

To close this section, we mention that interesting and useful applications of the CR-calculus as developed in [14] and [27] can be found in references [13], [28]-[35], and [38], in addition to the plentiful material to be found in the textbooks [1], [15], [16], and [23].
6 2nd-Order Expansions of a Real-Valued Function on C^n
It is common to numerically optimize cost functionals using iterative gradient descent-like techniques [25]. Determination of the gradient of a real-valued loss function via equation (56) allows the use of elementary gradient descent optimization, while the linear approximation of a biholomorphic mapping g(ξ) via (43) enables optimization of the nonlinear least-squares problem using the Gauss-Newton algorithm.^35
Another commonly used iterative algorithm is the Newton method, which is based on the repeated computation and optimization of the quadratic approximation to the loss function as given by a power series expansion to second order [25]. Although the first-order approximation to the loss function given by (53) was relatively straightforward to derive, it is somewhat more work to determine the second-order approximation, which is the focus of this section and which will be attacked using the elegant approach of Van den Bos [27].^36 Along the way we will rederive the first-order approximation (53) and the Hessian matrix of second partial derivatives of a real scalar-valued function, which is needed to verify the optimality of a solution solving the first-order necessary conditions.

^33 This extra emphasis is made because virtually all of the textbooks (even the exemplary text [15]) provide such extended derivative identities and use them to derive results. This sends the message that unless such identities are at hand, one cannot solve problems. Also, it places one at the mercy of typographical errors which may occur when identities are printed in the textbooks.

^34 Thus, in the real case, x is the variable to be differentiated in x^T x and we have ∂(x^T x)/∂x = 2x^T, while in the complex case, if we take z̄ to be treated as constant and z to be the differentiated variable, we have ∂(z^H z)/∂z = z^H [∂z/∂z] = z^H. Note that in both cases we use the differentiation rules for vector differentiation which are developed initially for the purely real case, once we have decided which variables are to be treated as constant.

^35 Recall that the Gauss-Newton algorithm is based on iterative re-linearization of a nonlinear model z ≈ g(ξ).
6.1 Alternative Coordinate Representations of Z = C^n.
Conjugate Coordinate Vectors c ∈ 𝒞 Form a Real Vector Space. The complex space, C^n, of dimension n naturally has the structure of a real space, R^{2n}, of dimension 2n, C^n ≈ R^{2n}, as a consequence of the equivalence

    z = x + j y ∈ Z = C^n   ⟷   r = col(x, y) ∈ ℛ = R^{2n} .
Furthermore, as noted earlier, an alternative representation is given by the set of conjugate coordinate vectors

    c = col(z, z̄) ∈ 𝒞 ⊂ C^{2n} ≈ R^{4n} ,

where 𝒞 is defined to be the collection of all such vectors c. Note that the set 𝒞 is obviously a subset (and not a vector subspace)^37 of the complex vector space C^{2n}. Remarkably, it is also a 2n-dimensional vector space over the field of real numbers!
This is straightforward to show. First, in the obvious manner, one can define vector addition of any two elements of 𝒞. To show closure under scalar multiplication by a real number α is also straightforward,

    c = col(z, z̄) ∈ 𝒞   ⟹   α c = col(α z, α z̄) ∈ 𝒞   (α real).

Note that this homogeneity property obviously fails when α is complex.
To demonstrate that 𝒞 is 2n-dimensional, we will construct below the one-to-one transformation, J, which maps 𝒞 onto ℛ, and vice versa, thereby showing that 𝒞 and ℛ are isomorphic, 𝒞 ≈ ℛ. In this manner 𝒞 and ℛ are shown to be alternative, but entirely equivalent (including their dimensions), real coordinate representations for Z = C^n. The coordinate transformation J is a linear mapping, and therefore also corresponds to the Jacobian of the transformation between the coordinate system ℛ and the coordinate system 𝒞.
^36 A detailed exposition of the second-order case is given by Abatzoglou, Mendel, & Harada in [38]. See also [34]. The references [38], [27] and [34] all develop the complex Newton algorithm, although with somewhat different notation.
^37 It is, in fact, a 2n-dimensional submanifold of the space C^{2n} ≈ R^{4n}.
In summary, we have available three vector space coordinate representations for representing complex vectors z = x + j y. The first is the canonical n-dimensional vector space of complex vectors z ∈ Z = C^n itself. The second is the canonical 2n-dimensional real vector space of vectors r = col(x, y) ∈ ℛ = R^{2n}, which arises from the natural correspondence C^n ≈ R^{2n}. The third is the 2n-dimensional real vector space of vectors c ∈ 𝒞 ⊂ C^{2n}, 𝒞 ≈ R^{2n}.
Because 𝒞 can be alternatively viewed as a complex subset of C^{2n} or as a real vector space isomorphic to R^{2n}, we actually have a fourth representation; namely, the non-vector-space complex-vector perspective of elements of 𝒞 as elements of the space C^{2n}, c = col(z, z̄).^38 This perspective is just the (z, z̄) perspective used above to analyze general, possibly nonholomorphic, functions f(z) = f(z, z̄).
In order to avoid confusion, we will refer to these two alternative interpretations of c ∈ 𝒞 ⊂ C^{2n} as the c-real case (respectively, the 𝒞-real case) when we consider the vector c ∈ 𝒞 ≈ R^{2n} (respectively, the real vector space 𝒞 ≈ R^{2n}), and the c-complex case (respectively, the 𝒞-complex case) when we consider a vector c ∈ 𝒞 ⊂ C^{2n} (respectively, the complex subset 𝒞 ⊂ C^{2n}).^39 These two different perspectives of 𝒞 are used throughout the remainder of this note.
Coordinate Transformations and Jacobians. From the fact that

    z = x + j y   and   z̄ = x − j y ,

it is easily shown that the matrix

    J ≜ ( I    j I )
        ( I   −j I )    (58)

results in the mapping

    c = c(r) = J r .    (59)
It is easily determined that

    J^{-1} = (1/2) J^H .    (60)
^38 Since, when viewed as a subset of C^{2n}, the set 𝒞 is not a subspace, this view of 𝒞 does not result in a true coordinate representation.
^39 In the latter case c = col(z, z̄) is understood in terms of the behavior and properties of its components, especially for differentiation purposes because, as mentioned earlier, in the complex case the derivative ∂/∂c is not well-defined in itself, but is defined in terms of the formal derivatives with respect to z and z̄. As we shall discover below, in the c-real case the derivative ∂/∂c is a true real derivative which is well understood in terms of the behavior of the derivative ∂/∂r.
^40 Except for a trivial reordering of the elements of r = (x^T y^T)^T, this is the transformation proposed and utilized by Van den Bos [27], who claims in [31] to have been inspired to do so by Remmert. (See, e.g., the discussion on page 87 of [12].)
so that we have the inverse mapping

    r = r(c) = J^{-1} c = (1/2) J^H c .    (61)
Because the mapping between ℛ and 𝒞 is linear, one-to-one, and onto, both of these spaces are obviously isomorphic real vector spaces of dimension 2n. The mappings (59) and (61) therefore correspond to an admissible coordinate transformation between the c and r representations of z ∈ Z. Consistent with this fact, we henceforth assume that the vector calculus (including all of the vector derivative identities) developed in [25] applies to functions over 𝒞.
Note that for the coordinate transformation c = c(r) = J r we have the Jacobian

    J_c ≜ ∂c(r)/∂r = ∂(J r)/∂r = J ,    (62)

showing that J is also the Jacobian of the coordinate transformation from ℛ to 𝒞.^41
The Jacobian of the inverse transformation r = r(c) is given by

    J_r = J_c^{-1} = J^{-1} = (1/2) J^H .    (63)
Of course, then, we have the differential relationships

    dc = [∂c/∂r] dr = J_c dr = J dr   and   dr = [∂r/∂c] dc = J_r dc = (1/2) J^H dc ,    (64)
which correspond to the first-order relationships^42

    1st-Order Relationships:   Δc = J_c Δr = J Δr   and   Δr = J_r Δc = (1/2) J^H Δc    (65)

where the Jacobian J is given by (60) and

    Δc = ( Δz )   and   Δr = ( Δx )
         ( Δz̄ )              ( Δy ) .    (66)
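The following sketch (NumPy assumed; illustration only) builds the block matrix J of (58) for n = 3 and checks numerically that c = J r reproduces col(z, z̄) as in (59), that J^{-1} = (1/2) J^H as in (60), and the inverse map (61).

    import numpy as np

    n = 3
    I = np.eye(n)
    J = np.block([[I, 1j * I], [I, -1j * I]])       # Jacobian of the map r -> c, eq. (58)

    rng = np.random.default_rng(3)
    x, y = rng.standard_normal(n), rng.standard_normal(n)
    z = x + 1j * y
    r = np.concatenate([x, y])                      # r = col(x, y)
    c = np.concatenate([z, np.conj(z)])             # c = col(z, zbar)

    print(np.allclose(J @ r, c))                              # (59): c = J r
    print(np.allclose(np.linalg.inv(J), 0.5 * J.conj().T))    # (60): J^{-1} = (1/2) J^H
    print(np.allclose(0.5 * J.conj().T @ c, r))               # (61): r = (1/2) J^H c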
The Cogradient with respect to the Real Conjugate Coordinates Vector c. The reader might well wonder why we didn't just point out that (64) and (65) are merely simple consequences of the linear nature of the coordinate transformations (59) and (61), and thereby skip the intermediate steps given above. The point is, as discussed in [25],^43 that once we have identified the Jacobian of a coordinate transformation over a real manifold, we can readily transform between different coordinate representations of all vector-like (contravariant) objects, such as the gradient of a functional, and between all covector-like (covariant) objects, such as the cogradient of a functional, over that manifold. Indeed, as a consequence of this fact we immediately have the important cogradient operator transformations

    Cogradient Transformations:   ∂(·)/∂c = [∂(·)/∂r] J_r = (1/2) [∂(·)/∂r] J^H   and   ∂(·)/∂r = [∂(·)/∂c] J_c = [∂(·)/∂c] J    (67)

^41 We have just proved, of course, the general property of linear operators that they are their own Jacobians.

^42 For a general, nonlinear, coordinate transformation these first-order relationships would be approximate. However, because the coordinate transformation considered here happens to be linear, the relationships are exact.

^43 See the discussion surrounding equations (8) and (11) of [25].
with the Jacobian J given by (58) and J_r = J_c^{-1}.
Equation (67) is very important as it allows us to easily, yet rigorously, define the cogradient taken with respect to c as a true (nonformal) differential operator, provided that we view c as an element of the real coordinate representation space 𝒞. The cogradient ∂(·)/∂c is well-defined in terms of the cogradient ∂(·)/∂r and the pullback transformation

    ∂(·)/∂c = (1/2) [∂(·)/∂r] J^H .
This shows that ∂(·)/∂c, which was originally defined in terms of the cogradient and conjugate cogradient taken with respect to z (the c-complex interpretation of ∂(·)/∂c), can be treated as a real differential operator with respect to the real vector c (the c-real interpretation of ∂(·)/∂c).^44
Complex Conjugation. It is easily determined that the operation of complex conjugation, z ↦ z̄, is a nonlinear mapping on Z = C^n. Consider an element θ ∈ C^{2n} written as

    θ = ( θ_top    )  ∈ C^{2n} = C^n × C^n   with   θ_top ∈ C^n and θ_bottom ∈ C^n .
        ( θ_bottom )

Of course the operation of complex conjugation on C^{2n}, θ ↦ θ̄, is, in general, a nonlinear mapping.
Now consider the linear operation of swapping the top and bottom elements of ,
,
dened as
=
top
bottom
bottom
top
0 I
I 0
top
bottom
= S
where
S
0 I
I 0
A
11
A
12
A
21
A
22
.
Then premultiplication by S results in a block swap of the top n rows en masse with the bottom n
rows,
46
SA =
A
21
A
22
A
11
A
12
.
Alternatively, postmultiplication by S results in a block swap of the rst n columns with the last n
columns,
47
AS =
A
12
A
11
A
22
A
21
.
It is also useful to note the result of a sandwiching by S,
SAS = A =
A
22
A
21
A
12
A
11
.
Because S permutes n rows (or columns), it is a product of n elementary permutation matrices,
each of which is known to have a determinant which evaluates to 1. As an easy consequence of
this, we have
det S = (1)
n
.
Other important properties of the swap operator S will be developed as we proceed.
Now note that the subset ( C
2n
contains precisely those elements of C
2n
for which the
operations of swapping and complex conjugation coincide,
( =
C
2n
C
2n
,
and thus it is true by construction that c ( obeys c = c, even though swapping and complex
conjugation are different operations on C
2n
. Now although ( is not a subspace of the complex
vector space C
2n
, it is a real vector space in its own right. We see that the linear operation of
component swapping on the (-space coordinate representation of Z = C
n
is exactly equivalent
45
Permutation is just a fancy term for swapping.
46
Matrix premultiplication of A by any matrix always yields a row operation.
47
Matrix postmultiplication of A by any matrix always yields a column operation. The fact that pre- and post-
multiplication yield different actions on A is an interesting and illuminating way to interpret the fact that matrix
multiplication is noncommutative, MA = AM.
K. Kreutz-Delgado Copyright c (2003-2007, All Rights Reserved Version ECE275CG-F05v1.3d 28
to the nonlinear operation of complex conjugation on Z. It is important to note that complex
conjugation and coordinate swapping represent different operations on a vector c when c is viewed
as an element of C
2n
.
48
We can view the linear swap mapping S : ( ( as a coordinate transformation (a coordinate
reparameterization), c = c = Sc, on (. Because S is linear, the Jacobian of this transformation
is just S itself. Thus from the cogradient transformation property we obtain the useful identity
()
c
S =
()
c
S =
()
c
(68)
It is also straightforward to show that
I =
1
2
J
T
SJ (69)
for J given by (58)
Let us now turn to the alternative coordinate representation given by vectors r in the space 1 =
R
2n
. Specically, consider the 1 coordinate vector r corresponding to the change of coordinates
r =
1
2
J
H
c. Since the vector r is real, it is its own complex conjugate, r = r.
49
Complex conjugation
of z is the nonlinear mapping in C
n
z = x + j y z = x + j (y) ,
and corresponds in the representation space 1to the linear mapping
50
r =
x
y
x
y
I 0
0 I
x
y
= Cr
where C is the conjugation matrix
C
I 0
0 I
. (70)
Note that
C = C
T
= C
1
,
i.e., that C is symmetric, C = C
T
, and its own inverse, C
2
= I. It is straightforward to show that
C =
1
2
J
H
SJ (71)
48
As mentioned earlier, c, in a sense, does double duty as a representation for z; once as a (true coordinate)
representation of z in the real vector space (, and alternatively as a representation of z in the doubled up complex
space C
2n
= C
n
C
n
. In the development given below, we will switch between these two perspectives of c.
49
Note that our theoretical developments are consistent with this requirement, as
r =
1
2
(J
H
c) =
1
2
J
T
c =
1
2
J
T
c =
1
2
J
T
Sc =
1
2
J
T
SJr = Ir = r .
50
We refer to r as r-check.
K. Kreutz-Delgado Copyright c (2003-2007, All Rights Reserved Version ECE275CG-F05v1.3d 29
which can be compared to (69). Finally, it is straightforward to show that
c = Jr c = c = Jr . (72)
To summarize, we can represent the complex vector z by either c or r, where c has two inter-
pretations (as a complex vector, c-complex, in C
2n
, or as an element, c-real, of the real vector
space ( R
2n
), and we can represent the complex conjugate z by c, c, or r. And complex conju-
gation, which is a nonlinear operation in C
n
, corresponds to linear operators in the 2n-dimensional
isomorphic real vector spaces ( and 1.
6.2 Low Order Series Expansions of a Real-Valued Scalar Function.
By noting that a real-valued scalar function of complex variables can be viewed as a function of
either r or c-real or c-complex or z,
f(r) = f(c) = f(z) ,
it is evident that one should be able to represent f as a power series in any of these representations.
Following the line of attack pursued by [27], by exploiting the relationships (65) and (67) we will
readily show the equivalence up to second order in a power series expansion of f.
Up to second order, the multivariate power series expansion of the real-valued function f
viewed as an analytic function of vector r 1is given as [25]
2nd-Order Expansion in r: f(r + r) = f(r) +
f(r)
r
r +
1
2
r
T
H
rr
(r) r + h.o.t. (73)
where
51
H
rr
()
r
f()
r
T
for , r 1 (74)
is the real r-Hessian matrix of second partial derivatives of the real-valued function f(r) with
respect to the components of r. It is well known that a real Hessian is symmetric,
H
rr
= H
T
rr
.
However, there is no general guarantee that the Hessian will be a positive denite or positive
semidenite matrix.
It is assumed that the terms f(r) and f(r +r) be readily expressed in terms of c and c +c
or z and z + z. Our goal is to determine the proper expression of the linear and quadratic terms
of (73) in terms of c and c or z and z.
51
When no confusion can arise, one usually drops the subscripts on the Hessian and uses the simpler notation
H() = H
rr
(). (As is done, for example, in [25].) Note that the Hessian is the matrix of second partial derivatives
of a real-valued scalar function.
K. Kreutz-Delgado Copyright c (2003-2007, All Rights Reserved Version ECE275CG-F05v1.3d 30
Scalar Products and Quadratic Forms on the Real Vector Space C. Consider two vectors
c = col(z, z) ( and s = col(,
= c
H
s = z
H
+z
H
= 2 Re z
H
.
The row vector c
T
S = c
H
is a linear functional which maps the elements of (-real into the real
numbers. The set of all such linear functionals is a vector space itself and is known as the dual
space, (
in (
dened by
52
c
c
T
S = c
H
.
Henceforth it is understood that scalar-product expressions like
a
H
s or c
H
b
where s ( and c ( are known to be elements of ( are only meaningful if a and b are also
elements of (. Thus, it must be the case that both vectors in a scalar product must belong to ( if it
is the case that one of them does, otherwise we will view the resulting scalar as nonsensical.
Thus, for a real-valued function of up to quadratic order in a vector c (,
f(c) = a +b
H
c +
1
2
c
H
Mc = a +b
H
c +
1
2
c
H
s, s = Mc, (75)
to be well-posed, it must be the case that a R, b (,
53
and s = Mc (.
54
Thus, as we
proceed to derive various rst and second order functions of the form (75), we will need to check
for these conditions. If the conditions are met, we will say that the terms, and the quadratic form,
are admissible or meaningful.
To test whether a vector b C
2n
belongs to ( is straightforward:
b (
b = Sb. (76)
It is rather more work to develop a test to determine if a matrix M C
2n2n
has the property
that it is a linear mapping from ( to (,
M L((, () = M[ Mc (, c ( and M is linear L(C
2n
, C
2n
) = C
2n2n
.
52
Warning! Do not confuse the dual vector (linear functional) c
.
54
I.e., because c
H
= c
M
11
M
12
M
21
M
22
z
z
.
The rst block row of this matrix equation yields the conditions
= M
11
z + M
12
z
while the complex conjugate of the second block row yields
=
M
22
z +
M
21
z
and subtracting these two sets of equations results in the following condition on the block elements
of M,
(M
11
M
22
)z + (M
12
M
21
)z = 0 .
With z = x + j y, this splits into the two sets of conditions,
[(M
11
M
22
) + (M
12
M
21
)]x = 0
and
[(M
11
M
22
) (M
12
M
21
)]y = 0.
Since these equations must hold for any x and y, they are equivalent to
(M
11
M
22
) + (M
12
M
21
) = 0
and
(M
11
M
22
) (M
12
M
21
) = 0.
Finally, adding and subtracting these two equations yields the necessary and sufcient conditions
for M to be admissible (i.e., to be a mapping from ( to (),
M =
M
11
M
12
M
21
M
22
C
2n2n
is an element of L((, () iff M
11
=
M
22
and M
12
=
M
21
. (77)
55
I.e., a vector space over the eld of real numbers.
56
I.e., a vector space over the eld of complex numbers.
K. Kreutz-Delgado Copyright c (2003-2007, All Rights Reserved Version ECE275CG-F05v1.3d 32
This necessary and sufcient condition is more conveniently expressed in the following equivalent
form,
M L((, () M = S
MS
M = SMS (78)
which is straightforward to verify.
Given an arbitrary matrix M C
2n2n
, we can dene a natural mapping of M into L((, ()
C
2n2n
by
P(M)
M + S
MS
2
L((, () , (79)
in which case the condition (78) has an equivalent restatement as
M L((, () P(M) = M . (80)
It is straightforward to demonstrate that
P(M) (, M C
2n2n
and P(P(M)) = P(M) (81)
i.e., that P is an idempotent mapping of C
2n2n
onto L((, (), P
2
= P. However, it is important
to note that P is not a linear operator (the action of complex conjugation precludes this) nor a
projection operator in the conventional sense of projecting onto a lower dimensional subspace as
its range space is not a subspace of its domain space. However, it is reasonable to interpret P as a
projector of the manifold C
2n
onto the submanifold ( C
2n
.
57
A nal important fact is that if M C
2n2n
is invertible, then M L((, () if and only if
M
1
L((, (), which we state formally as
Let M be invertible, then P(M) = M iff P(M
1
) = M
1
. (82)
I.e., if an invertible matrix M is admissible, then M
1
is admissible. The proof is straightforward:
M = S
MS and M invertible
M
1
=
S
MS
1
= S(
M)
1
S
= SM
1
S .
57
With C
2n2n
R
4n4n
R
16n
2
and L((, () L(R
2n
, R
2n
) R
2n2n
R
4n
2
, it is reasonable to view P
as a linear projection operator from the vector space R
16n
2
onto the vector subspace R
4n
2
R
16n
2
. This allows us
to interpret P as a projection operator from the manifold C
2n
onto the submanifold ( C
2n
. Once we know that
P is a linear mapping from C
2n
into C
2n
, we can then compute its adjoint operator, P
f(c)
c
H
=
f(c)
c
H
= S
f(c)
c
H
,
which is the necessary and sufcient condition given in (76) that
f(c)
c
H
( .
Thus
f(c)
c
(
f(c)
c
T
(,
which is true if and only if
f(c)
c
T
(.
This shows a simple inspection of
f(c)
c
itself can be performed to test for admissibility of the linear
term.
58
58
In this note, the rst order expansion (84) is doing double duty in that it is simultaneously standing for the c-real
expansion and the c-complex expansion. A more careful development would make this distinction explicit, in which
case one would more carefully explore the distinction between
f(c)
c
T
versus
f(c)
c
H
in the linear term. Because
this note has already become rather notationally tedious, this option for greater precision has been declined. However,
greater care must therefore be made when switching between the (-real and (-complex perspectives.
K. Kreutz-Delgado Copyright c (2003-2007, All Rights Reserved Version ECE275CG-F05v1.3d 34
As discussed above, to be meaningful as a true derivative, the derivative with respect to c
has to be interpreted as a real derivative. This is the c-real interpretation of (84). In addition,
(84) has a c-complex interpretation for which the partial derivative with respect to c is not well-
dened as a complex derivative as it stands, but rather only makes sense as a shorthand notation
for simultaneously taking the complex derivatives with respect to z and z,
c
=
z
,
z
.
Thus, to work in the domain of complex derivatives, we must move to the c-complex perspective
c = col(z, z), and then break c apart so that we can work with expressions explicitly involving z
and z, exploiting the fact that the formal partial derivatives with respect to z and z are well dened.
Noting that
c
=
and c =
z
z
we obtain
f(c)
c
c =
f
z
z +
f
z
z
=
f
z
z +
f
z
z (f is real-valued)
= 2 Re
f
z
z
f
z
z
+ h.o.t. (85)
This is the rederivation of (53) promised earlier. Note that (85) makes explicit the relationship
which is implied in the c-complex interpretation of (84).
We also summarize our intermediate results concerning the linear term in a power series ex-
pansion using the r, c or z representations,
Linear-Term Relationships:
f
r
r =
f
c
c = 2 Re
f
z
z
(86)
The derivative in the rst expression is a real derivative. The derivative in the second expression can
be interpreted as a real derivative (the c-real interpretation). The derivative in the last expression
is a complex derivative; it corresponds to the c-complex interpretation of the second term in (86).
Note that all of the linear terms are real valued.
We now have determined the rst-order expansion of f in terms of r, c, and z. To construct
the second-order expansion it remains to examine the second-order term in (73) and some of the
properties of the real Hessian matrix (74) which completely species that term.
K. Kreutz-Delgado Copyright c (2003-2007, All Rights Reserved Version ECE275CG-F05v1.3d 35
Second Order Expansions. Note from (73) that knowledge of the real Hessian matrix H
rr
com-
pletely species the second order term in the real power series expansion of f with respect to r.
The goal which naturally presents itself to us at this point is now to reexpress this quadratic-order
term in terms of c, which we indeed proceed to do. However, because the canonical coordinates
vector c has two interpretations, one as a shorthand for the pair (z, z (the c-complex perspective)
and the other as an element of a real vector space (the c-real perspective), we will rewrite the sec-
ond order term in two different forms, one (the c-complex form) involving the c-complex Hessian
matrix
H
C
cc
()
c
f()
c
H
for , c ( C
2n
(87)
and the other (the c-real form) involving the c-real Hessian matrix
H
R
cc
()
c
f()
c
T
for , c ( R
2n
. (88)
In (87), the derivative with respect to c only has meaning as a short-hand for
z
,
z
. In (88), the
derivative with respect to c is well-dened via the c-real interpretation.
It is straightforward to show a relationship between the real Hessian H
rr
and the c-complex
Hessian H
C
cc
,
H
rr
r
f
r
T
=
r
f
r
H
=
r
f
c
J
H
(from equation (67))
=
r
J
H
f
c
=
c
J
H
f
c
f
c
H
J (From equation (32) of [25])
= J
H
H
C
cc
J .
The resulting important relationship
H
rr
= J
H
H
C
cc
J (89)
between the real and c-complex Hessians was derived in [27] based on the there unjustied (but
true) assumption that the second order terms of the powers series expansions of f in terms of r
K. Kreutz-Delgado Copyright c (2003-2007, All Rights Reserved Version ECE275CG-F05v1.3d 36
and c-complex must be equal. Here, we reverse this order of reasoning, and will show below the
equality of the second order terms in the c-complex and r expansions as a consequence of (89).
Note from (60) that
H
C
cc
=
1
4
J H
rr
J
H
. (90)
Recalling that the Hessian H
rr
is a symmetric matrix,
59
it is evident from (90) that H
C
cc
is Hermi-
tian
60
H
C
cc
= (H
C
cc
)
H
(and hence, like H
rr
, has real eigenvalues), and positive denite (semidenite) if and only H
rr
is
positive denite (semidenite).
As noted by Van den Bos [27], one can now readily relate the values of the eigenvalues of H
C
cc
and H
rr
from the fact, which follows from (60) and (90), that
H
C
cc
I =
1
4
J H
rr
J
H
2
JJ
H
=
1
4
J (H
rr
2I) J
H
.
This shows that the eigenvalues of the real Hessian matrix are twice the size of the eigenvalues of
the complex Hessian matrix (and, as a consequence, must share the same condition number).
61
Focussing our attention now on the second order term of (73), we have
1
2
r
T
H
rr
r =
1
2
r
H
H
rr
r
=
1
2
r
H
J
H
H
C
cc
J r (From equation (89))
=
1
2
c
H
H
C
cc
c , (From equation (65))
thereby showing the equality of the second order terms in an expansion of a real-valued function f
either in terms of r or c-complex,
62
1
2
r
T
H
rr
r =
1
2
c
H
H
C
cc
c . (91)
Note that both of these terms are real valued.
With the proof of the equalities 86 and 91, we have (almost) completed a derivation of the
2nd-Order Expansion in c-Complex: f(c + c) = f(c) +
f(c)
c
c +
1
2
c
H
H
C
cc
(c) c + h.o.t. (92)
59
In the real case, this is a general property of the matrix of second partial derivatives of a scalar function.
60
As expected, as this is a general property of the matrix of partial derivatives
z
f(z)
z
H
of any real-valued
function f(z).
61
For a Hermitian matrix, the singular values are the absolute values of the (real) eigenvalues. Therefore the condi-
tion number, which is the ratio of the largest to the smallest eigenvalue (assuming a full rank matrix) is given by the
ratio of the largest to smallest eigenvalue magnitude.
62
And thereby providing a proof of this assumed equality in [27].
K. Kreutz-Delgado Copyright c (2003-2007, All Rights Reserved Version ECE275CG-F05v1.3d 37
where the c-complex Hessian H
C
cc
is given by equation (87) and is related to the real hessian H
rr
by equations (89) and (90). Note that all of the terms in (92) are real valued. The derivation has not
been fully completed because we have not veried that c
H
H
C
cc
(c) c is admissible in the sense
dened above. The derivation will be fully completed once we have veried that H
C
cc
L((, (),
which we will do below.
The c-complex expansion (92) is not differentiable with respect to c-complex itself, which is
not well dened, but, if differentiation is required, should be instead interpeted has a short-hand,
or implicit, statement involving z and z, for which derivatives are well dened. To explicitly show
the the second order expansion of the real-valued function f in terms of the complex vectors z and
z, it is convenient to dene the quantities
H
zz
z
f
z
H
, H
zz
z
f
z
H
, H
zz
z
f
z
H
, and H
zz
z
f
z
H
. (93)
With
c
= (
z
,
z
), we also have from (87) and the denitions (93) that
H
C
cc
=
H
zz
H
zz
H
zz
H
zz
. (94)
Thus, using the earlier proven property that H
C
cc
is Hermitian, H
C
cc
= (H
C
cc
)
H
, we immediately
have from (94) the Hermitian conjugate conditions
H
zz
= H
H
zz
and H
zz
= H
H
zz
(95)
which also hold for z and z replaced by z and z respectively.
Some additional useful properties can be shown to be true for the block components of (94) de-
ned in (93). First note that as a consequence of f being a real-valued function, it is straightforward
to show the validity of the conjugation conditions
H
C
cc
= H
C
cc
or, equivalently,
H
zz
= H
zz
and H
zz
= H
zz
, (96)
which also hold for z and z replaced by z and z respectively. It is also straightforward to show that
H
C
cc
= SH
C
cc
S = S H
C
cc
S ,
for S = S
T
= S
1
(showing that H
C
cc
and H
C
cc
are related by a similarity transformation and
therefore share the same eigenvalues
63
), which is precisely the necessary and sufcient condition
(78) that the matrix H
C
cc
L((, (). This veries that the term c
H
H
C
cc
c is admissible and
63
Their eigenvectors are complex conjugates of each other, as reected in the similarity transformation being given
by the swap operator S
K. Kreutz-Delgado Copyright c (2003-2007, All Rights Reserved Version ECE275CG-F05v1.3d 38
provides the completion of the proof of the validity of (92) promised earlier. Finally, note that
properties (96) and (95) yield the conjugate symmetry conditions,
H
zz
= H
T
zz
and H
zz
= H
T
zz
, (97)
which also hold for z and z replaced by z and z respectively.
From equations (66), (91), and (94) we can now expand the second order term in (73) as
follows
1
2
r
T
H
rr
r =
1
2
c
H
H
C
cc
c
=
1
2
z
H
H
zz
z + z
H
H
zz
z + z
H
H
zz
z + z
H
H
zz
z
= Re
z
H
H
zz
z + z
H
H
zz
z
z
H
H
zz
z + z
H
H
zz
z
. (98)
Combining the results given in (73), (86), and (98) yields the desired expression for the second
order expansion of f in terms of z,
2
nd
-Order Exp. in z: f(z + z) = f(z) + 2 Re
f
z
z
+ Re
z
H
H
zz
z + z
H
H
zz
z
+ h.o.t.
(99)
We note in passing that Equation (99) is exactly the same expression given as Equation (A.7)
of reference [38] and Equation (8) of reference [34], which were both derived via an alternative
procedure.
The c-complex expansion shown in Equation (92) is one of two possible alternative second-
order representations in c for f(c) (the other being the c-real expansion), and was used as the
starting point of the theoretical developments leading to the z-expansion (99). We now turn to the
development of the c-real expansion of f(c), which will be accomplished by writing the second
order term of the quadratic expansion in terms of the c-real Hessian H
R
cc
.
From the denitions (88), (87), and (93), and using the fact that
c
= (
z
,
z
), it is straight-
forward to show that
H
R
cc
=
H
zz
H
zz
H
zz
H
zz
= S
H
zz
H
zz
H
zz
H
zz
(100)
or
65
H
R
cc
= H
C
cc
= SH
C
cc
= H
C
cc
S. (101)
64
Alternatively, the last step also follows as a consequence of (95).
65
Alternative derivations are possible. For example, H
C
cc
=
c
f
c
H
=
c
f
c
T
=
c
f
c
S
T
=
c
S
f
c
T
= S
c
f
c
T
= SH
R
cc
H
R
cc
= SH
C
cc
, noting that S = S
T
= S
1
.
K. Kreutz-Delgado Copyright c (2003-2007, All Rights Reserved Version ECE275CG-F05v1.3d 39
Note from the rst equality in (100) and the conjugate symmetry conditions (97) that the c-real
Hessian is symmetric
H
R
cc
= (H
R
cc
)
T
. (102)
Let the SVD of H
C
cc
be
H
C
cc
= UV
H
then from (101) the SVD of H
R
cc
is given by
H
R
cc
= U
V
H
, U
= SU
showing that H
C
cc
and H
R
cc
share the same singular values, and hence the same condition number
(which is given by the ratio of the largest to smallest singular value). The three Hessian matrices
H
rr
, H
R
cc
, and H
C
cc
are essentially equivalent for investigating numerical issues and for testing
whether a proposed minimizer of the second order expansion of f(r) = f(c) is a local (or even
global) minimum. Thus, one can choose to work with the Hessian matrix which is easiest to
compute and analyze. This is usually the c-complex Hessian H
C
cc
, and it is often most convenient to
determine numerical stability and optimality using H
C
cc
even when the algorithmis being developed
from one of the alternative perspectives (i.e., the real r or the c-real second order expansion).
Now note that from (101) we immediately and easily have
1
2
c
T
H
R
cc
c =
1
2
c
T
S H
C
cc
c =
1
2
(Sc)
T
H
C
cc
c =
1
2
(c)
T
H
C
cc
c =
1
2
c
H
H
C
cc
c
showing the equivalence of the c-real and c-complex second order terms in the expansion of f(c).
66
Combining this result with (98), we have shown the following equivalences between the second
order terms in the various expansions of f under consideration in this note:
2nd-Order Terms:
1
2
r
T
H
rr
r =
1
2
c
T
H
R
cc
c =
1
2
c
H
H
C
cc
c = Re
z
H
H
zz
z + z
H
H
zz
z
(103)
where the second order expansion in r is given by (73), the c-complex expansion by (92), the
expansion in terms of z by (99), and the c-real expansion by
2nd-Order Expansion in c-Real: f(c + c) = f(c) +
f(c)
c
c +
1
2
c
T
H
R
cc
(c) c + h.o.t.
(104)
Note that all of the terms in (103) and (104) are real valued.
The expansion in of f(c) in terms of c-complex shown in (92) is not differentiable with respect
to c (this is only true for the c-real expansion). However, (92) is differentiable with respect to z
and z and can be viewed as a short-hand equivalent to the full (z, z) expansion provided by (99).
Therefore, it is Equation (99) which is the natural form for optimization with respect to c-complex
66
One can show that the term c
T
H
R
cc
c is admissible if and only if H
R
cc
= SM for M L((, (), which is the
case here.
K. Kreutz-Delgado Copyright c (2003-2007, All Rights Reserved Version ECE275CG-F05v1.3d 40
via a derivative-based approach, because only differentiation with respect to the components (z, z)
of c-complex is well-posed. On the other hand, differentiation with respect to c-real is well-posed,
so that one can optimize (104) by taking derivatives of (104) with respect to c-real itself.
Note that (73), (92), and (104) are the natural forms to use for optimization via completing the
square(see below). This is because the expansions in terms of r, c-complex, and c-real are less
awkward for completing-the-square purposes than the expansion in z provided by (99).
67
Note that
the expansions (73) and (92) are both differentiable with respect to the expansion variable itself
and both have a form amenable to optimization by completing the square.
The various second order expansions developed above can be found in references [38], [27]
and [34]. In [27], Van den Bos shows the equality of the rst, second, and third second-order terms
shown in equation (98) but does not mention the fourth (which, anyway, naturally follows from the
third term in (98) via a simple further expansion in terms of z and z). The approach used in this
note is a more detailed elaboration of the derivations presented in [27]. In reference [34] Yan and
Fan show the equality of the rst and last terms in (98), but, while they cite the results of Van den
Bos [27] regarding the middle terms in (98), do not appear to have appreciated that the fourth term
in (98) is an immediate consequence of the second or third terms, and derive it from scratch using
an alternative, brute force approach.
Quadratic Minimization and the Newton Algorithm. The Newton algorithm for minimizing a
scalar function f(z) exploits the fact that it is generally straightforward to minimize the quadratic
approximations provided by second order expansions such as (73), (92), (99), and (104). The
Newton method starts with an initial estimate of the optimal solution, say c, then expands f(c)
about the estimate c to second order in c = c c, and then minimizes the resulting second
order approximation of f(c) with respect to c. Having determined an estimated update
c in
this manner, one updates the original estimate c c +
r = (H
rr
)
1
f(r)
r
T
(from the r expansion (73)) (105)
c
C
= (H
C
cc
)
1
f(c)
c
H
(from the c-complex expansion (92)) (106)
c
R
= (H
R
cc
)
1
f(c)
c
T
(from the c-real expansion (104)) . (107)
67
Although (99) can also be optimized by completing the square.
K. Kreutz-Delgado Copyright c (2003-2007, All Rights Reserved Version ECE275CG-F05v1.3d 41
Solutions (105) and (106) can also be found in Van den Bos [27]. Note that
c
C
is an admissible
solution, i.e., that
c
C
(
as required for self-consistency of our theory, as a consequence of the fact that
f(c)
c
H
and
(H
C
cc
)
1
satisfy
f(c)
c
H
( and (H
C
cc
)
1
L((, () ,
with the latter condition a consequence of property (82) and the fact that H
C
cc
L((, (). If this
were not the case, then we generally would have the meaningless answer that
c
C
/ (.
The admissibility of the solution (107) follows from the admissibility of (106). This will be
evident from the fact, as we shall show, that all of the solutions (105)-(107) must all correspond to
the same update,
c
C
=
c
R
= J
r .
Note that
c
C
= (H
C
cc
)
1
f(c)
c
H
=
1
4
JH
rr
J
H
1
2
f(r)
r
J
H
H
(from (67) and (90))
=
=
JH
rr
J
1
1
J
f(r)
r
T
(from (63))
= J(H
rr
)
1
f(r)
r
T
= J
r
as required. On the other hand,
c
R
= (H
R
cc
)
1
f(c)
c
T
= (SH
C
cc
)
1
f(c)
c
T
(from (101))
= (H
C
cc
)
1
f(c)
c
S
T
= (H
C
cc
)
1
f(c)
c
T
= (H
C
cc
)
1
f(c)
c
H
=
c
C
.
K. Kreutz-Delgado Copyright c (2003-2007, All Rights Reserved Version ECE275CG-F05v1.3d 42
Thus, the updates (105)-(107) are indeed equivalent.
The updates (105) and (107), determined via a completing the square argument, can alterna-
tively be obtained by setting the (real) derivatives of their respective quadratically-approximated
loss functions to zero, and solving the necessary condition for an optimum. Note that if we attempt
to (erroneously) take the (complex) derivative of (92) with respect to c-complex and then set this
expression to zero, the resulting solution will be off by a factor of two. In the latter case, we
must instead take the derivatives of (99) with respect to z and z and set the resulting expressions
to zero in order to obtain the optimal solution.
68
At convergence, the Newton algorithm will produce a solution to the necessary rst-order con-
dition
f(c)
c
= 0 ,
and this point will be a local minimum of f() if the Hessians are strictly positive denite at this
point. Typically, one would verify positive deniteness of the c-complex Hessian at the solution
point c,
H
C
cc
(c) =
H
zz
(c) H
zz
(c)
H
zz
(c) H
zz
(c)
> 0 .
As done in [38] and [34], the solution to the quadratic minimization problemprovided by (105)-
(107) can be expressed in a closed form expression which directly produces the solution z C
n
.
To do so, we rewrite the solution (106) for the Newton update
c as
H
C
cc
c =
f(c)
c
H
which we then write in expanded form in terms of z and z
H
zz
H
zz
H
zz
H
zz
f
z
f
z
. (108)
Assuming that H
C
cc
is positive denite, then H
zz
is invertible and the second block row in (108)
results
z = H
1
zz
H
zz
z H
1
zz
f
z
H
.
Plugging this into the rst block row of (108) then yields the Newton algorithm update equation
H
zz
z =
f
z
H
+H
zz
H
1
zz
f
z
H
, (109)
where
H
zz
H
zz
H
zz
H
1
zz
H
zz
68
This is the procedure used in [38] and [34].
K. Kreutz-Delgado Copyright c (2003-2007, All Rights Reserved Version ECE275CG-F05v1.3d 43
is the Schur complement of H
zz
in H
C
cc
. Equation (109) is equivalent to the solution given as
Equation (A.12) in [38]. Invertibility of the Schur complement
H
zz
follows from our assumption
that H
C
cc
is positive denite, and the Newton update is therefore given by
z =
H
zz
H
zz
H
1
zz
H
zz
H
zz
H
1
zz
f
z
f
z
. (110)
The matrices H
zz
and
H
zz
=
H
zz
H
zz
H
1
zz
H
zz
H
zz
H
zz
H
1
zz
H
zz
z H
1
zz
f
z
H
. (111)
The argument given by Yan and Fan supporting the use of the approximation H
zz
0 is that as the
Newton algorithm converges to the optimal solution z = z
0
, setting H
zz
to zero implies that we
will use a quadratic function to approximate the cost near z
0
[34]. However Yan and Fan do not
give a formal denition of a quadratic function and this statement is not generally true as there
is no a priori reason why the off-diagonal block matrix elements of the Newton Hessian should be
zero, or approach zero, as we demonstrate in Example 2 of the Applications section below.
However, as we shall discuss later below, setting the block off-diagonal elements to zero is
justiable, but not necessarily as an approximation to the Newton algorithm. Setting the block
off-diagonal elements in the Newton Hessian to zero, results in an alternative, quasi-Newton
algorithm which can be studied in its own right as a competitor algorithm to the Newton algorithm,
the Gauss-Newton algorithm, or the gradient descent algorithm.
69
Nonlinear Least-Squares: Gauss vs. Newton. In this section we are interested in nding an
approximate solution, z, to the nonlinear inverse problem
g(z) y
69
That is not to say that there cant be conditions under which the quasi-Newton algorithm does converge to the
Newton algorithm. Just as one can give conditions for which the Gauss-Newton algorithm converges to the Newton
algorithm, one should be able to do the same for the quasi-Newton algorithm.
K. Kreutz-Delgado Copyright c (2003-2007, All Rights Reserved Version ECE275CG-F05v1.3d 44
for known y C
m
and known real-analytic function g : C
n
C
m
. We desire a least-squares
solution, which is a solution that minimizes the weighted least-squares loss function
70
(z) =
1
2
|y g(z)|
2
W
=
1
2
(y g(z))
H
W (y g(z))
where W is a Hermitian positive-denite weighting matrix. Although the nonlinear function g is
assumed to be real-analytic, in general it is assumed to be not holomorphic (i.e., g is not analytic
in z).
In the subsequent development we will analyze the problem using the c-real perspective devel-
oped in the preceding discussions. Thus, the loss function is assumed to be re-expressible in terms
of c,
71
(c) =
1
2
|y g(c)|
2
W
=
1
2
(y g(c))
H
W (y g(c)) . (112)
Quantities produced from this perspective
72
may have a different functional form than those pro-
duced purely within the z Z perspective, but the end results will be the same.
We will consider two iterative algorithms for minimizing the loss function (112): The Newton
algorithm, discussed above, and the Gauss-Newton algorithmwhich is usually a somewhat simpler,
yet related, method for iteratively nding a solution which minimizes a least-squares function of
the form (112).
73
As discussed earlier, the Newton method is based on an iterative quadratic expansion and min-
imization of the loss function (z) about a current solution estimation, z. Specically the Newton
method minimizes an approximation to (c) = (z) based on the second order expansion of (c)
in c about a current solution estimate c = col(z,
z),
(c + c)
(c)
Newton
where
(c)
Newton
= (c) +
(c)
c
c +
1
2
c
H
H
C
cc
(c) c. (113)
Minimizing the Newton loss function
(c)
Newton
then results in a correction
c
Newton
which is then
used to update the estimate c c +
c
Newton
for some stepsize > 0. The algorithm then
70
The factor of
1
2
has been included for notational convenience in the ensuing derivations. If it is removed, some
of the intermediate quantities derived subsequently (such as Hessians, etc.) will differ by a factor of 2, although the
ultimate answer is independent of any overall constant factor of the loss function. If in your own problem solving
ventures, your intermediate quantities appear to be off by a factor of 2 relative to the results given in this note, you
should check whether your loss function does or does not have this factor.
71
Quantities produced from this perspectivesuch as the Gauss-Newton Hessian to be discussed belowmay have a
different functional form than those produced purely within the z Z perspective, but the nal answers are the same.
72
Such as the Gauss-Newton Hessian to be discussed below.
73
Thus the Newton algorithm is a general method that can be used to minimize a variety of different loss functions,
while the Gauss-Newton algorithmis a least-squares estimation method which is specic to the problemof minimizing
the least-squares loss function (112).
K. Kreutz-Delgado Copyright c (2003-2007, All Rights Reserved Version ECE275CG-F05v1.3d 45
starts all over again. As discussed earlier in this note, a completing-the-square argument can be
invoked to readily show that the correction which minimizes the quadratic Newton loss function is
given by
c
Newton
= H
C
cc
(c)
1
(c)
c
H
(114)
provided that the c-complex Hessian H
C
cc
(c) is invertible. Because it denes the second-order
term in the Newton loss function and directly enters into the Newton correction, we will often
refer to H
C
cc
(c) as the Newton Hessian. If we block partition the Newton Hessian and solve for
the correction
z
Newton
, we obtain the solution (110) which we earlier derived for a more general
(possibly non-quadratic).
We nowdetermine the form of the cogradient
(c)
c
of the least-squares loss function (112). This
is done by utilizing the c-real perspective which allows us to take (real) cogradients with respect
to c-real. First, however, it is convenient to dene the compound Jacobian of g(c) as
G(c)
g(c)
c
g(z)
z
g(z)
z
J
g
(c) J
c
g
(c)
C
m2n
. (115)
Setting e = y g(c), we have
74
c
=
1
2
c
e
H
We
=
1
2
e
H
W
c
e +
1
2
e
T
W
T
c
e
=
1
2
e
H
W
g
c
1
2
e
T
W
T
g
c
=
1
2
e
H
W G
1
2
e
T
W
T
g
c
S
=
1
2
e
H
W G
1
2
e
T
W
T
GS
or
c
=
1
2
e
H
W G
1
2
e
H
W GS. (116)
This expression for
c
is admissible, as required, as it is readily veried that
H
= S
H
as per the requirement given in (76).
74
Remember that
c
is only well-dened as a derivative within the c-real framework.
K. Kreutz-Delgado Copyright c (2003-2007, All Rights Reserved Version ECE275CG-F05v1.3d 46
The linear term in the Newton loss function
Newton
is therefore given by
c
c =
1
2
e
H
W Gc
1
2
e
H
W GS c
=
1
2
e
H
W Gc
1
2
e
H
W Gc
= Re
e
H
W Gc
.
Thus
c
c = Re
e
H
W Gc
= Re
(y g(c))
H
W Gc
. (117)
If the reader has any doubts as to the validity or correctness of this derivation, she/he is invited to
show that the left-hand side of (117) is equal to 2 Re
f
z
z
g(z) +
g(z)
z
z +
g(z)
z
z
where z = z z and z = z = z
z = z
(c)
Gauss
=
1
2
|y Gc|
2
W
=
1
2
(y Gc)
H
W (y Gc)
=
1
2
|y|
2
Re
y
H
W Gc
+
1
2
c
H
G
H
WGc
= (c) +
(c)
c
c +
1
2
c
H
G
H
WGc. (from (117)
Unfortunately, the resulting quadratic form
(c)
Gauss
= (c) +
(c)
c
c +
1
2
c
H
G
H
WGc (119)
is not admissible as it stands.
75
This is because the matrix G
H
WG is not admissible,
G
H
WG =
g
c
H
W
g
c
/ L((, ().
This can be seen by showing that the condition (78) is violated:
S G
H
WGS = S
g
c
H
W
g
c
S
=
g
c
H
W
g
c
g
c
g
c
g
c
H
W
g
c
.
Fortunately, we can rewrite the quadratic form (119) as an equivalent form which is admissible
on (. To do this note that G
H
WG is Hermitian, so that
c
H
G
H
WGc = c
H
G
H
WGc R.
75
And thus the complex Gauss-Newton algorithm is more complicated in form than the real Gauss-Newton algo-
rithm for which the quadratic form (119) is acceptable [25].
K. Kreutz-Delgado Copyright c (2003-2007, All Rights Reserved Version ECE275CG-F05v1.3d 48
Also recall from Equation (79) that P(G
H
WG) L((, () and c ( Sc = c. We have
76
c
H
G
H
WGc = c
H
P(G
H
WG)c + c
H
G
H
WGP(G
H
WG)
c
= c
H
P(G
H
WG)c +
1
2
c
H
G
H
WGSG
H
WGS
c
= c
H
P(G
H
WG)c +
1
2
c
H
G
H
WGc c
H
G
H
WGc
= c
H
P(G
H
WG)c + 0
= c
H
P(G
H
WG)c .
Thus we have shown that on the space of admissible variations, c (, the inadmissible
quadratic form (119) is equivalent to the admissible quadratic form
(c)
Gauss
= (c) +
(c)
c
c +
1
2
c
H
H
Gauss
cc
(c) c (120)
where
H
Gauss
cc
(c) P
G
H
(c)WG(c)
(121)
denotes the Gauss-Newton Hessian.
Note that the Gauss-Newton Hessian H
Gauss
cc
(c) is Hermitian and always guaranteed to be at least
positive semi-denite, and guaranteed to be positive denite if g is assumed to be one-to-one (and
thereby ensuring that the compound Jacobian matrix G has full column rank). This is in contrast
to the Newton (i.e., the c-complex) Hessian H
C
cc
(c) which, unfortunately, can be indenite or rank
decient even though it is Hermitian and even if g is one-to-one.
Assuming that H
Gauss
cc
(c) is invertible, the correction which minimizes the Gauss-Newton loss
function (120) is given by
c
Gauss
= H
Gauss
cc
(c)
1
(c)
c
H
. (122)
Because of the admissibility of H
Gauss
cc
and
(c)
c
H
, the resulting solution is admissible
c
Gauss
(.
Comparing Equations (114) and (122), it is evident that the difference between the two al-
gorithms resides in the difference between the Newton Hessian, H
C
cc
(c), which is the actual c-
complex Hessian of the least-squares loss function (c), and the Gauss-Newton Hessian H
Gauss
cc
(c)
which has an unclear relationship to (c).
77
For this reason, we now turn to a discussion of the
relationship between H
C
cc
(c) and H
Gauss
cc
(c).
76
Note that the following derivation does not imply that G
H
WG = P(G
H
WG), a fact which would contradict our
claim that G
H
WG is not admissible. This is because in the derivation we are not allowing arbitrary vectors in C
2n
but are only admitting vectors c constrained to lie in (, c ( C
2n
.
77
Note that, by construction, H
Gauss
cc
(c) is the Hessian matrix of the Gauss-Newton loss function. The question is:
what is its relationship to the least-squares loss function or the Newton loss function?
K. Kreutz-Delgado Copyright c (2003-2007, All Rights Reserved Version ECE275CG-F05v1.3d 49
We can compute the Newton Hessian H
C
cc
from the relationship (see Equation (101))
H
C
cc
= S H
R
cc
= S
c
T
where
c
is taken to be a c-real cogradient operator. Note from (116) that,
H
=
1
2
G
H
We
1
2
SG
H
We =
1
2
B + SB
, (123)
where
B G
H
We (124)
with e = y g(c). This results in
T
=
H
=
1
2
B + SB
,
Also note that
B
c
=
B
c
=
B
c
S
=
B
c
S.
We have
H
R
cc
=
c
T
=
1
2
S
B
c
+
B
c
or
H
R
cc
=
c
T
=
1
2
S
B
c
+
B
c
S
. (125)
This yields
H
C
cc
= S H
R
cc
=
1
2
B
c
+ S
B
c
S
(126)
with B given by (124), which we can write as
H
C
cc
= S H
R
cc
= P
B
c
. (127)
Recall that H
C
cc
must be admissible. The function P() produces admissible matrices which map
from( to (, and thereby ensures that the right-hand side of equation (127) is indeed an admissible
matrix, as required for self-consistency of our development. The presence of the operator P does
not show up in the real case (which is the standard development given in textbooks) as
B
c
is
automatically symmetric as required for admissibility in the real case [25].
Note that B can be written as
B =
g
c
H
W (y g) =
m
i=1
g
i
c
H
[W (y g) ]
i
K. Kreutz-Delgado Copyright c (2003-2007, All Rights Reserved Version ECE275CG-F05v1.3d 50
where g
i
and [W (y g) ]
i
denote the i-th components of g and We = W(y g) respectively.
We can then compute
B
c
as
B
c
=
g
c
H
W
g
c
i=1
g
i
c
H
[W (y g) ]
i
= G
H
WG
m
i=1
g
i
c
H
[W (y g) ]
i
or
B
c
= G
H
WG
m
i=1
g
i
c
H
[We ]
i
. (128)
Equations (127) and (128) result in
H
C
cc
= H
Gauss
cc
m
i=1
H
(i)
cc
. (129)
where
H
(i)
cc
P
g
i
c
H
[We ]
i
, i = 1, , m. (130)
Note that Equation (129), which is our nal result for the structural form of the Newton Hessian
H
C
cc
, looks very much like the result for the real case [25].
78
The rst term on the right-hand-side
of (129) is the Gauss-Newton Hessian H
Gauss
cc
, which is admissible, Hermitian and at least positive
semidenite (under the standard assumption that W is Hermitian positive denite). Below, we will
show that the matrices H
(i)
cc
, i = 1, , m, are also admissible and Hermitian. While the Gauss-
Newton Hessian is always positive semidenite (and always positive denite if g is one-to-one),
the presence of the second term on the right-hand-side of (129) can cause the Newton Hessian to
become indenite, or even negative denite.
We can now understand the relationship between the Gauss-Newton method and the Newton
method when applied to the problem of minizing the least-squares loss function. The Gauss-
Newton method is an approximation to the Newton method which arises from ignoring the second
term on the right-hand-side of (129). This approximation is not only easier to implement, it will
generally have superior numerical properties as a consequence of the deniteness of the Gauss-
Newton Hessian. Indeed, if the mapping g is onto, via the Gauss-Newton algorithm one can
produce a sequence of estimates c
k
, k = 1, 2, 3, , which drives e(c
k
) = y g(c
k
), and hence
the second term on the right-hand-side of (129), to zero as k . In which case, asymptotically
there will be little difference in the convergence properties between the Newton and Gauss-Newton
methods. This property is well known in the classical optimization literature, which suggests that
by working within the c-real perspective, we may be able to utilize a variety of insights that have
78
The primary difference is due to the presence of the projector P in the complex Newton algorithm. Despite the
similarity, note that it takes much more work to rigorously derive the complex Newton-Algorithm!
K. Kreutz-Delgado Copyright c (2003-2007, All Rights Reserved Version ECE275CG-F05v1.3d 51
been developed for the Newton and Gauss-Newton methods when optimizing over real vector
spaces.
To complete the proof of the derivation of (129), it remains to demonstrate that H
(i)
cc
, i =
1, , m, are admissible and Hermitian. Note that the raw matrix
[We ]
i
g
i
c
H
is neither Hermitian nor admissible because of the presence of the complex scalar factor [We ]
i
.
Fortunately, the processing of the second matrix of partial derivatives by the operator P to form
the matrix H
(i)
cc
via
H
(i)
cc
= P(A
cc
(g
i
))
creates a matrix which is both admissible and Hermitian. The fact that H
(i)
cc
is admissible is obvious,
as the projector P is idempotent. We will now prove that H
(i)
cc
is Hermitian.
Dene the matrix
A
cc
(g
i
)
c
g
i
c
H
, (131)
and note that
g
i
c
H
=
g
i
c
T
=
g
i
c
=
c
g
i
c
H
,
which shows that A
cc
(g
i
) has the property that
A
cc
(g
i
)
H
= A
cc
( g
i
) . (132)
Now note that
S
c
g
i
c
H
S = S
c
g
i
c
H
=
c
g
i
c
=
c
g
i
c
S
H
=
c
g
i
c
H
,
which establishes the second property that
SA
cc
(g
i
)S = A
cc
(g
i
) . (133)
K. Kreutz-Delgado Copyright c (2003-2007, All Rights Reserved Version ECE275CG-F05v1.3d 52
Finally note that properties (132) and (133) together yield the property
A
cc
(g
i
)
H
= A
cc
( g
i
) = SA
cc
( g
i
)S = SA
cc
(g
i
)S .
Setting a
i
= [We ]
i
, we have
H
(i)
cc
= P(a
i
A
cc
(g
i
)) =
a
i
A
cc
(g
i
) + S a
i
A
cc
(g
i
) S
2
=
a
i
A
cc
(g
i
) + a
i
S A
cc
(g
i
) S
2
=
a
i
A
cc
(g
i
) + a
i
A
cc
(g
i
)
H
2
which is obviously Hermitian. Note that the action of the projector P on the raw matrix
a
i
A
cc
(g
i
), is equal to the action of Hermitian symmetrizing the matrix a
i
A
cc
(g
i
).
Below, we will examine the least-squares algorithms at the block-component level, and will
show that signicant simplications occur when g(z) is holomorphic.
Generalized Gradient Descent Algorithms. As in the real case [25], the Newton and Gauss-
Newton algorithms can be viewed as special instances of a family of generalized gradient descent
algorithms. Given a general real-valued loss function (c) which we wish to minimize
79
and a
current estimate, c of optimal solution, we can determine an update of our estimate to a new value
c
new
which will decrease the loss function as follows.
For the loss function (c), with c = c + dc, we have
d(c) = (c + dc) (c) =
(c)
c
dc
which is just the differential limit of the rst order expansion
(c; ) = (c + c) (c)
(c)
c
c .
The stepsize > 0 is a control parameter which regulates the accuracy of the rst order approxi-
mation assuming that
0 c dc and (c; ) d(c) .
If we assume that ( is a Cartesian space,
80
then the gradient of (c) is given by
81
c
(c) =
(c)
c
H
.
79
The loss function does not have to be restricted to the least-squares loss considered above.
80
I.e., We assume that ( has identity metric tensor. In [25] we call the resulting gradient a Cartesian gradient (if the
metric tensor assumption is true for the space of intertest) or a naive gradient (if the metric tensor assumption is false,
but made anyway for convenience).
81
Note for future reference that the gradient has been specically computed in Equation (123) for the special case
when (c) is the least-squares loss function (112).
K. Kreutz-Delgado Copyright c (2003-2007, All Rights Reserved Version ECE275CG-F05v1.3d 53
Take the update to be the generalized gradient descent correction
c = Q(c)
(c)
c
H
= Q(c)
c
(c) (134)
where Q(c) is a Hermitian matrix function of c which is assumed to be positive denite when
evaluated at the value c.
82
This then yields the key stability condition
83
(c; ) |
c
(c)|
2
Q
c
(c)
H
Q
c
(c) 0, (135)
where the right-hand-side is equal to zero if and only if
c
(c) = 0 .
Thus if the stepsize parameter is chosen small enough, making the update
c
new
= c + c = c Q
c
(c)
results in
(c
new
) = (c + c) = (c) + (c; ) (c) |
c
(c)|
2
Q
(c)
showing that we either have a nontrivial update of the value of c which results in a strict decrease
in the value of the loss function, or we have no update of c nor decrease of the loss function
because c is a stationary point. If the loss function (c) is bounded from below, iterating on this
procedure starting from a estimate c
1
will produce a sequence of estimates c
i
, i = 1, 2, 3, ,
which will converge to a local minimum of the loss function. This simple procedure is the basis
for all generalized gradient descent algorithms.
Assuming that we begin with an admissible estimate, c
1
, for this procedure to be valid, we
require that the sequence of estimates c
i
, i = 1, 2, 3, , be admissible, which is true if the corre-
sponding updates c are admissible,
c = Q(c
i
)
c
i
(c
i
) = Q(c
i
)
(c
i
)
c
i
H
( , i = 1, 2, .
We have established the admissibility of
c
(c) =
(c)
c
H
( above. It is evident that in order
for a generalized gradient descent algorithm (GDA) to be admissible it must be the case that Q be
admissible,
Generalized GDA is Admissible Generalized Gradient Q-Matrix is Admissible, Q L((, () .
82
The fact that Q is otherwise arbitrary (except for the admissibility criterion discussed below) is what makes the
resulting algorithm a generalized gradient descent algorithm in the parlance of [25]. When Q = I, we obtain the
standard gradient descent algorithm.
83
We interpret the stability condition to mean that for a small enough stepsize > 0, we will have (c; ) 0.
K. Kreutz-Delgado Copyright c (2003-2007, All Rights Reserved Version ECE275CG-F05v1.3d 54
Furthermore, a sufcient condition that the resulting algorithm be stable
84
is that Q be Hermitian
and positive denite. Note that given a candidate Hermitian positive denite matrix, Q
, which is
not admissible,
Q
/ L((, () ,
we can transform it into an admissible Hermitian positive denite matrix via the projection
Q = P(Q
) L((, () .
It can be much trickier to ensure that Q remain positive denite.
If we set
Q
Newton
(c) = H
Newton
cc
(c)
1
with
H
Newton
cc
H
C
cc
then we obtain the Newton algorithm (114). If we take the loss function to be the least-squares loss
function (112) and set
Q
Gauss
(c) = H
Gauss
cc
(c)
1
we obtain the Gauss-Newton algorithm (122). Whereas the Gauss-Newton algorithm generally
has a positive denite Q-matrix (assuming that g(c) is one-to-one), the Newton algorithm can
have convergence problems due to the Newton Hessian H
Newton
cc
= H
C
cc
becoming indenite. Note
that taking
Q
Simple
= I ,
which we refer to as the simple choice, results in the standard gradient descent algorithm which
is always stable (for a small enough stepsize so that the stability condition (135) holds).
The important issue being raised here is the problem of stability versus speed of convergence.
It is well-known that the Newton algorithm tends to have a very fast rate of convergence, but at the
cost of constructing and inverting the Newton Hessian H
Newton
cc
= H
C
cc
and potentially encountering
more difcult algorithminstability problems. On the other hand, standard gradient descent (Q = I)
tends to be very stable and much cheaper to implement, but can have very long convergence times.
The Gauss-Newton algorithm, which is an option available when the loss function (c) is the
least-squares loss function (112), is considered an excellent trade-off between the Newton algo-
rithmand standard gradient descent. The Gauss-Newton Hessian H
Gauss
cc
is generally simpler in form
and, if g(c) is one-to-one, is always positive denite. Furthermore, if g(c) is also onto, assuming
the algorithm converges, the Gauss-Newton and Newton algorithms are asymptotically equivalent.
We can also begin to gain some insight into the proposal by Yan and Fan [34] to ignore the
block off-diagonal elements of the Newton Hessian,
85
H
Newton
cc
= H
C
cc
=
H
zz
H
zz
H
zz
H
zz
.
84
Assuming a small enough step size to ensure that the stability condition (135) is satised.
85
The values of the block elements of H
Newton
cc
will be computed for the special case of the least-squares loss function
(112) later below.
K. Kreutz-Delgado Copyright c (2003-2007, All Rights Reserved Version ECE275CG-F05v1.3d 55
As mentioned earlier, Yan and Fan make the claim in [34] that the block off-diagonal elements
vanish for a quadratic loss function. As noted above, and shown in an example below, this is
not generally true.
86
However, it is reasonable to ask what harm (if any), or what benet (if any)
can accrue by constructing a new
87
generalized gradient descent algorithm as a modication to
the Newton algorithm created by simply ignoring the block off-diagonal elements in the Newton
Hessian and working instead with the simplied quasi-Newton Hessian,
H
quasi-Newton
cc
H
C
cc
H
zz
0
0 H
zz
.
This results in a new generalized gradient descent algorithm, which we call the quasi-Newton
algorithm, which is somewhere in complexity between the Newton algorithmand standard gradient
descent. Note that the hermitian matrix H
zz
is positive denite if and only if H
zz
is positive
denite. Thus invertibility and positive-deniteness of the quasi-Newton Hessian H
quasi-Newton
cc
=
H
C
cc
is equivalent to invertibility and positive deniteness of the block element H
zz
.
On the other hand, invertibility and positive deniteness of H
zz
is only a necessary condition
for invertibility and positive deniteness of the complete Newton Hessian H
Newton
cc
= H
C
cc
. Assuming
that H
C
cc
is positive denite, we have the well-known factorization
I 0
H
zz
H
1
zz
I
H
C
cc
I H
zz
H
1
zz
0 I
H
zz
0
0
H
zz
(136)
where
H
zz
= H
zz
H
zz
H
1
zz
H
zz
is the Schur complement of H
zz
in H
C
cc
. From the factorization (136) we immediately obtain the
useful condition
rank (H
C
cc
) = rank (H
zz
) + rank
H
zz
. (137)
Note from condition (137) that the Newton Hessian H
Newton
cc
= H
C
cc
is positive denite if and
only if H
zz
and its Schur complement
H
zz
are both positive denite. Thus it is obviously a more
difcult matter to ascertain and ensure the stability of the Newton Hessian than to do the same for
the quasi-Newton Hessian.
The quasi-Newton algorithm is constructed by forming the Q matrix from the quasi-Newton
Hessian H
quasi-Newton
cc
=
H
C
cc
,
Q
Pseudo-Newton
= (H
quasi-Newton
cc
)
1
=
H
C
cc
1
=
H
1
zz
0
0 H
1
zz
f
z
H
(138)
which is just the simplication shown earlier in Equation (111) and proposed by Yan and Fan in
[34]. However, unlike Yan and Fan, we do not present the quasi-Newton algorithm as an approx-
imation to the Newton algorithm, but rather as one more algorithm in the family of generalized
Newton algorithms indexed by the choice of the matrix Q.
Indeed, recognizing that the Gauss-Newton algorithm potentially has better stability properties
than the Newton algorithm, naturally leads us to propose a quasi-Gauss-Newton algorithm for
minimizing the least-squares lose function (112) as follows. Because the hermitian Gauss-Newton
Hessian is admissible, it can be partitioned as
H
Gauss
cc
=
U
zz
U
zz
U
zz
U
zz
with U
zz
= U
T
zz
.
89
The Gauss-Newton Hessian is positive-denite if and only if U
zz
(equivalently
U
zz
) and its Schur complement
U
zz
= U
zz
U
zz
U
zz
1
U
zz
are invertible.
On the other hand the quasi-Gauss-Newton Hessian,
H
quasi-Gauss
cc
U
zz
0
0 U
zz
U
1
zz
0
0 U
zz
1
f
z
H
(139)
which is guaranteed to be stable (for a small enough stepsize so that the stability condition (135)
is satised) if U
zz
is positive denite.
Note that H
zz
can become indenite even while U
zz
remains positive denite. Thus, the quasi-
Gauss-Newton algorithm appears to be generally easier to stabilize than the quasi-Newton algo-
rithm. Furthermore, if g is onto, we expect that asymptotically the quasi-Gauss-Newton and quasi-
Newton algorithm become equivalent. Thus the quasi-Gauss-Newton algorithm is seen to stand in
88
We can ignore the remaining update equation as it is just the complex conjugate of the shown update equation.
89
The values of these block components will be computed below.
K. Kreutz-Delgado Copyright c (2003-2007, All Rights Reserved Version ECE275CG-F05v1.3d 57
the same relationship to the quasi-Newton algorithm as the Gauss-Newton algorithm does to the
Newton algorithm.
Without too much effort, we can construct the block matrix components needed to implement
the Newton and Gauss-Newton algorithms developed above in order to minimize the least-squares
loss function (112).
90
Let us rst look at the elements needed to implement the Gauss-Newton algorithm. From
Equation (121) and the derivations following Equation (119) one obtains
U
zz
=
1
2
g
z
H
W
g
z
g
z
H
W
g
z
(140)
which is positive denite, assuming that W is positive denite and that g is one-to-one. Similarly,
one nds that
U
zz
=
1
2
g
z
H
W
g
z
g
z
H
W
g
z
. (141)
Also U
zz
= U
zz
and U
zz
= U
zz
. We have now completely specied the Gauss-Newton Hessian
H
Gauss
cc
and the quasi-Gauss-Newton Hessian at the block components level,
H
Gauss
cc
=
U
zz
U
zz
U
zz
U
zz
H
quasi-Gauss
cc
U
zz
0
0 U
zz
g
z
H
W
g
z
=
1
2
J
H
g
WJ
g
, (142)
where J
g
is the Jacobian matrix of g.
Now let us turn to the issue of computing the elements need to implement the Newton Algo-
rithm, recalling that the Newton Hessian is block partitioned as
H
Newton
cc
= H
C
cc
=
H
zz
H
zz
H
zz
H
zz
.
One can readily relate the block components H
zz
and H
zz
to the matrices U
zz
and U
zz
used in the
Gauss-Newton and quasi-Gauss-Newton algorithms by use of Equation (129). We nd that
H
zz
= U
zz
i=1
V
(i)
zz
90
This, of course, results in only a special case application of the Newton and quasi-Newton algorithms, both of
which can be applied to more general loss functions.
91
Recall that g(z) is holomorphic (analytic in z) if and only if the Cauchy-Riemann condition
g(z)
z
= 0 is satised.
K. Kreutz-Delgado Copyright c (2003-2007, All Rights Reserved Version ECE275CG-F05v1.3d 58
and
V
(i)
zz
=
1
2
g
i
(z)
z
H
[We ]
i
g
i
(z)
z
H
[We ]
i
(143)
where e = y g(z). Similarly, we nd that
H
zz
= U
zz
i=1
V
(i)
zz
and
V
(i)
zz
=
1
2
g
i
(z)
z
H
[We ]
i
g
i
(z)
z
H
[We ]
i
(144)
Furthermore, V
zz
= V
zz
and V
zz
= V
zz
. Note that neither V
zz
nor V
zz
vanish when g is
holomorphic, but instead simplify to
V
(i)
zz
=
1
2
g
i
(z)
z
H
[We ]
i
and V
(i)
zz
=
1
2
g
i
(z)
z
H
[We ]
i
. (145)
We have shown that the relationship between the Newton Hessian and Gauss-Newton Hessian
is given by
H
zz
H
zz
H
zz
H
zz
. .. .
H
Newton
cc
=
U
zz
U
zz
U
zz
U
zz
. .. .
H
Gauss
cc
i=1
V
(i)
zz
V
(i)
zz
V
(i)
zz
V
(i)
zz
H
zz
H
zz
H
zz
H
zz
. .. .
H
Newton
cc
=
U
zz
0
0 U
zz
. .. .
H
Gauss
cc
1
2
m
i=1
g
i
(z)
z
H
[We ]
i
g
i
(z)
z
H
[We ]
i
g
i
(z)
z
H
[We ]
i
g
i
(z)
z
H
[We ]
i
.
This shows that if g(z) is holomorphic, so that the block off-diagonal elements of the Gauss-
Newton Hessian vanish, and g(z) is also onto, so that asymptotically we expect that e 0, then
the claim of Yan and Fan in [34] that setting the block off-diagonal elements of the Hessian matrix
can proved a a good approximation to the Hessian matrix is reasonable, at least when optimizing
the least-squares loss function. However, when e 0 the Newton least-squares loss function (113)
reduces to the Gauss-Newton loss function (120), so that in the least-squares case one may as
well make the move immediately to the even simpler Gauss-Newton algorithm (which in this case
coincides with the quasi-Gauss-Newton algorithm).
However, the real point to be made is that any generalized gradient descent algorithm is worthy
of consideration,
92
provided that it is admissible, provably stable, and (at least locally) convergent
92
I.e., we dont have to necessarily invoke an approximation argument.
K. Kreutz-Delgado Copyright c (2003-2007, All Rights Reserved Version ECE275CG-F05v1.3d 59
to the desired optimal solution. After all the standard gradient descent algorithm corresponds to
the cheapest approximation of all, namely that
H
Newton
cc
I
and very few will deny the utility of this algorithm, even though as an approximation to the
Newton algorithm it might be far from correct. The resulting algorithm has intrinsic merit as an
algorithm in its own right, namely as the member of the family of gradient descent algorithms
corresponding to the simplest choice of the Q-matrix,
Q = Q
Simple
= I .
In the end, if the algorithm works, its ok. As it is said, the proof is in the pudding.
93
We see then that we have a variety of algorithms at hand which t within the framework of
generalized gradient descent algorithms. These algorithms are characterized by the specic choice
of the Q-matrix in the gradient descent algorithm, and include (roughly in the expected order
of decreasing complexity, decreasing ideal performance, and increasing stability when applied to
the least-squares loss function): 1) the Newton algorithm, 2) the quasi-Newton algorithm, 3) the
Gauss-Newton algorithm, 4) the quasi-Gauss-Newton algorithm, and 5) standard gradient descent.
Note that the Newton, quasi-Newton, and standard gradient descent algorithms are general algo-
rithms, while the Gauss-Newton and quasi-Gauss-Newton algorithms are methods for minimizing
the least-squares loss function (112).
For convenience, we will now summarize the generalized gradient descent algorithms that we
have developed in this note. In all of the algorithms, the update step is given by
c c + c
or, equivalently,
z z + z
for a specic choice of the stepsize > 0. The stability claims made are based on the assumption
that has been chosen small enough to ensure that the stability condition (135) is valid. Further-
more, we use the shorthand notation
G(c) =
g(c)
c
and
e(c) = y g(c) .
93
Of course, we are allowed to ask what the performance of the Q
Simple
algorithm is relative to the Q
Newton
algorithm.
K. Kreutz-Delgado Copyright c (2003-2007, All Rights Reserved Version ECE275CG-F05v1.3d 60
1. Standard (Simple) Gradient Descent.
Applies to any smooth loss function which is bounded from below.
H
simple
cc
(c) = I
Q
simple
(c) = (H
simple
cc
(c))
1
= I
c
simple
=
z
(c) =
(c)
c
H
z
simple
=
z
(z) =
(z)
z
H
Application to Least-Squares Loss Function (112):
H
=
1
2
G
H
We
1
2
SG
H
We =
1
2
B(c) + SB(c)
B(c) + SB(c)
H
=
1
2
g(z)
z
H
We(z) +
g(z)
z
H
We(z)
z
simple
=
1
2
g(z)
z
H
We(z) +
g(z)
z
H
We(z)
g(z) holomorphic:
H
=
1
2
g(z)
z
H
We(z)
z
simple
=
1
2
g(z)
z
H
We(z)
Generally stable but slow.
2. Gauss-Newton Algorithm.
Applies to the least-squares loss function (112).
H
Gauss
cc
(c) =
U
zz
U
zz
U
zz
U
zz
where U
zz
is given by (140), U
zz
= U
zz
, U
zz
is given by (141), and U
zz
= U
zz
.
Q
Gauss
(c) = H
Gauss
cc
(c)
1
c
Gauss
= Q
Gauss
(c)
(c)
c
H
where
H
=
1
2
G
H
We
1
2
SG
H
We =
1
2
B(c) + SB(c)
U
zz
U
zz
U
1
zz
U
zz
U
zz
U
1
zz
where
H
=
1
2
g(z)
z
H
We(z) +
g(z)
z
H
We(z)
H
=
H
K. Kreutz-Delgado Copyright c (2003-2007, All Rights Reserved Version ECE275CG-F05v1.3d 61
g(z) holomorphic:
U
zz
takes the simpler form (142), U
zz
= U
zz
, and U
zz
= U
zz
= 0.
H
Gauss
cc
(c) =
U
zz
0
0 U
zz
=
1
2
g
z
H
W
g
z
0
0
g
z
H
W
g
z
H
=
1
2
g(z)
z
H
We(z)
z
Gauss
= U
1
zz
H
=
g(z)
z
H
W
g(z)
z
1
g(z)
z
H
We(z)
Stability generally requires positive deniteness of both U
zz
and its Schur complement:
U
zz
= U
zz
U
zz
U
1
zz
U
zz
. The need to step for positive-deniteness of the Schur complement
can signicantly increase the complexity of an on-line adaptive ltering algorithm.
If g(z) is holomorphic, then stability only requires positive deniteness of the matrix U
zz
=
g(z)
z
H
W
g(z)
z
, which will be the case if g(z) is one-to-one. Thus, the algorithm may
be easier to stabilize when g(z) is holomorphic.
Convergence tends to be fast.
3. Pseudo-Gauss-Newton Algorithm.
Applies to the least-squares loss function (112).
H
Gauss
cc
(c) =
U
zz
0
0 U
zz
where U
zz
is given by (140) and U
zz
= U
zz
.
Q
pseudo-Gauss
(c) = [H
pseudo-Gauss
cc
(c)]
1
=
U
1
zz
0
0 U
zz
1
c
pseudo-Gauss
= Q
pseudo-Gauss
(c)
(c)
c
H
where
H
=
1
2
G
H
We
1
2
SG
H
We =
1
2
B(c) + SB(c)
(z)
z
H
=
g
z
H
W
g
z
g
z
H
W
g
z
1
(z)
z
H
where
H
=
1
2
g(z)
z
H
We(z) +
g(z)
z
H
We(z)
g(z) holomorphic:
U
zz
takes the simpler form of (142) , and U
zz
= U
zz
.
H
pseudo-Gauss
cc
(c) =
U
zz
0
0 U
zz
=
1
2
g
z
H
W
g
z
0
0
g
z
H
W
g
z
H
=
1
2
g(z)
z
H
We(z)
z
pseudo-Gauss
=
g(z)
z
H
W
g(z)
z
1
g(z)
z
H
We(z)
Stability requires positive deniteness of U
zz
(z) =
g(z)
z
H
W
g(z)
z
H
zz
(c) H
zz
(c)
H
zz
(c) H
zz
(c)
Q
Newton
(c) = [H
Newton
cc
(c)]
1
c
Newton
= Q
Newton
(c)
(c)
c
H
z
Newton
=
H
zz
H
zz
H
1
zz
H
zz
H
zz
H
1
zz
H
zz
H
zz
H
zz
H
zz
U
zz
U
zz
U
zz
U
zz
m
i=1
V
(i)
zz
V
(i)
zz
V
(i)
zz
V
(i)
zz
= H
Gauss
cc
(c)
m
i=1
V
(i)
zz
V
(i)
zz
V
(i)
zz
V
(i)
zz
U
zz
is given by (140), U
zz
= U
zz
, U
zz
is given by (141), U
zz
= U
zz
V
(i)
zz
is given by (143), V
(i)
zz
= V
(i)
zz
, V
(i)
zz
is given by (144), V
(i)
zz
= V
(i)
zz
.
c
Newton
= Q
Newton
(c)
(c)
c
H
where
H
=
1
2
G
H
We
1
2
SG
H
We =
1
2
B(c) + SB(c)
H
zz
H
zz
H
1
zz
H
zz
H
zz
H
1
zz
where
H
=
1
2
g(z)
z
H
We(z) +
g(z)
z
H
We(z)
H
=
H
g(z) holomorphic:
H
Newton
cc
=
U
zz
0
0 U
zz
m
i=1
V
(i)
zz
V
(i)
zz
V
(i)
zz
V
(i)
zz
= H
pseudo-Gauss
cc
(c)
m
i=1
V
(i)
zz
V
(i)
zz
V
(i)
zz
V
(i)
zz
H
zz
H
zz
H
1
zz
H
zz
H
zz
H
1
zz
where
H
=
1
2
g(z)
z
H
We(z);
H
=
H
Stability generally requires positive deniteness of both H
zz
and its Schur complement
H
zz
=
H
zz
H
zz
H
1
zz
H
zz
H
zz
(c) 0
0 H
zz
(c)
Q
pseudo-Newton
(c) = [H
pseudo-Newton
cc
(c)]
1
c
psedudo-Newton
= Q
pseudo-Newton
(c)
(c)
c
H
z
pseudo-Newton
= [H
zz
(z)]
1
(z)
z
H
Application to the Least-Squares Loss Function (112):
H
pseudo-Newton
cc
=
H
zz
(c) 0
0 H
zz
(c)
U
zz
i=1
V
(i)
zz
0
0 U
zz
i=1
V
(i)
zz
= H
pseudo-Gauss
cc
(c)
i=1
V
(i)
zz
0
0
m
i=1
V
(i)
zz
V
(i)
zz
is given by (143) and V
(i)
zz
= V
(i)
zz
. U
zz
is given by (140) and U
zz
= U
zz
c
pseudo-Newton
= Q
pseudo-Newton
(c)
(c)
c
H
where
H
=
1
2
G
H
We
1
2
SG
H
We =
1
2
B(c) + SB(c)
(z)
z
H
=
U
zz
i=1
V
(i)
zz
1
(z)
z
H
where
H
=
1
2
g(z)
z
H
We(z) +
g(z)
z
H
We(z)
g(z) holomorphic
U
zz
takes the simpler form of (142), U
zz
= U
zz
.
V
(i)
zz
takes the simpler form (145), V
(i)
zz
= V
(i)
zz
(z)
z
H
=
1
2
g(z)
z
H
We(z)
z
pseudo-Newton
=
1
2
U
zz
i=1
V
(i)
zz
1
g(z)
z
H
We(z)
=
g
z
H
W
g
z
i=1
g
i
(z)
z
H
[We ]
i
1
g(z)
z
H
We(z)
Stability generally requires positive deniteness of H
zz
.
The pseudo-Newton is expected to be fast, but have a loss of efciency relative to the Newton
algorithm. When g(z) is holomorphic and onto, we expect good performance as asymptoti-
cally a stabilized pseudo-Newton algorithmwill coincide with the Newton algorithm. If g(z)
is nonholomorphic, the pseudo-Newton and Newton algorithms will not coincide asymptot-
ically, so the speed of the pseudo-Newton algorithm is expected to always lag the Newton
algorithm.
The algorithm suggested by Yan and Fan in [34] corresponds in the above taxonomy to the
pseudo-Newton algorithm. We see that for obtaining a least-squares solution to the nonlinear
inverse problem y = g(z), if g is holomorphic, then the Yan and Fan suggestion can result in a
good approximation to the Newton algorithm. However, for nonholomorphic least-squares inverse
problems and for other types of optimization problems (including the problem considered by Yan
and Fan in [34]), the approximation suggested by Yan and Fan is not guaranteed to provide a good
approximation to the Newton algorithm.
94
However, as we have discussed, it does result in an
admissible generalized gradient descent method in its own right, and, as such, one can judge the
resulting algorithm on its own merits and in comparison with other competitor algorithms.
Equality Constraints. The classical approach to incorporating equality constraints into the prob-
lem of optimizing a scalar cost function is via the method of Lagrange multipliers. The theory of
Lagrange multipliers is well-posed when the objective function and constraints are real-valued
functions of real unknown variables. Note that a vector of p complex equality constraint condi-
tions,
g(z) = 0 C
p
94
Such a a claim might be true. However, it would have to be justied.
K. Kreutz-Delgado Copyright c (2003-2007, All Rights Reserved Version ECE275CG-F05v1.3d 65
is equivalent to 2p real equality constraints corresponding to the conditions
Re g(z) = 0 R
p
and Img(z) = 0 R
p
.
Thus, given the problem of optimizing a real scalar-valued loss function (z) subject to a vector
of p complex equality constraints constraints h(z) = 0, one can construct a well-dened lagrangian
as
L = (z) +
T
R
Re g(z) +
T
I
Img(z) , (146)
for real-valued p-dimensional lagrange multiplier vectors
R
and
I
.
If we dene the complex lagrange multiplier vector by
=
R
+ j
I
C
p
it is straightforward to show that the lagrangian (146) can be equivalently written as
L = (z) + Re
H
g(z) . (147)
One can now apply the multivariate CR-Calculus developed in this note to nd a stationary
solution to the Lagrangian (147). Of course, subtle issues involving the application of the z, c-
complex, and c-real perspectives to the problem will likely arise on a case-by-case basis.
Final Comments on the 2nd Order Analysis. It is evident that the analysis of second-order
properties of a real-valued function on C
n
is much more complicated than in the purely real case
[25], perhaps dauntingly so. Thus, it is perhaps not surprising that very little analysis of these
properties can be found in the literature.
95
By far, the most illuminating is the paper by Van den
Bos [27], which, unfortunately, is very sparse in its explanation.
96
A careful reading of Van den
Bos indicates that he is fully aware that there are two interpretations of c, the real interpretation
and the complex interpretation. This is a key insight. As we have seen above, it provides a very
powerful analysis and algorithm development tool which allows us to switch between the c-real
interpretation (which enables us to use the tools and insights of real analysis) and the c-complex
perspective (which is shorthand for working at the algorithm implementation level of z and z). The
now-classic paper by Brandwood [14] presents a development of the complex vector calculus using
the c-complex perspective which, although adequate for the development of rst-order algorithms,
presents greater difculties when used as a tool for second order algorithm development. In this
note, weve exploited the insight provided by Van den Bos [27] to perform a more careful, yet still
preliminary, analysis of second-order Newton and Gauss-Newton algorithms. Much work remains
to explore the analytical, structural, numerical, and implementation properties of these, and other
second order, algorithms.
95
That I could nd. Please alert me to any relevant references that I am ignorant of!
96
Likely a result of page limitations.
K. Kreutz-Delgado Copyright c (2003-2007, All Rights Reserved Version ECE275CG-F05v1.3d 66
7 Applications
1. A Simple Nonlinear Least Squares Problem - I. This is a simple, but interesting, problem
which is nonlinear in z C yet linear in c ( C
2
.
Let z C be an unknown scalar complex quantity we wish to estimate from multiple iid noisy
measurements,
y
k
= s + n
k
,
k = 1, , n, of a scalar signal s C which is related to z via
s = g(z), g(z) = z + z.
where Cand Care known complex numbers. It is assumed that the measurement noise n
k
is iid and (complex) Gaussian, n
k
N(0,
2
I), with
2
known. Note that the function g(z) is both
nonlinear in z (because complex conjugation is a nonlinear operation on z) and nonholomorphic
(nonanalytic in z). However, because the problem must be linear in the underlying real space
1 = R
2
(a fact which shows up in the obvious fact that the function g is linear in c), we expect
that this problem should be exactly solvable, as will be shown to indeed be the case.
Under the above assumptions the maximumlikelihood estimate (MLE) is found by minimizing
the loss function [15]
97
(z) =
1
2n
n
k=1
|y
k
g(z)|
2
=
1
n
n
k=1
|y
k
z z|
2
=
1
2n
n
k=1
(y
k
z z)(y
k
z z)
=
1
2n
n
k=1
( y
k
z
z)(y
k
z z).
Note that this is a nonlinear least-squares problem as the function g(z) is nonlinear in z.
98
Further-
more, g(z) is nonholomorphic (nonanalytic in z). Note, however, that although g(z) is nonlinear
in z, it is linear in c = (z, z)
T
, and that as a consequence the loss function (z) = (c) has an exact
second order expansion in c of the form (92), which can be veried by a simple expansion of (z)
in terms of z and z (see below). The corresponding c-complex Hessian matrix (to be computed
below) does not have zero off-diagonal entries, which shows that a loss function being quadratic
does not alone ensure that H
zz
= 0, a fact which contradicts the claim made in [34].
97
The additional overall factor of
1
n
has been added for convenience.
98
Recall that complex conjugation is a nonlinear operation.
K. Kreutz-Delgado Copyright c (2003-2007, All Rights Reserved Version ECE275CG-F05v1.3d 67
Dening the sample average of n samples
1
, ,
k
by
'`
1
n
n
k=1
k
the loss function (z) can be expanded and rewritten as
2 (z) =
[y[
2
z
2
' y` +
'y`
z +
[[
2
+[[
2
[y[
2
1
2
' y` +
'y` 'y` + ' y`
z
z
+
1
4
z
z
[[
2
+[[
2
2
2
[[
2
+[[
2
z
z
.
Since this expansion is done using the z-perspective, we expect that it corresponds to a second
order expansion about the value z = 0,
(z) = (0) +
(0)
c
c +
1
2
c
H
H
C
cc
(0)c (149)
with
(0)
c
=
(0)
z
(0)
z
=
1
2
' y` +
'y` 'y` + ' y`
and
H
C
cc
(0) =
1
2
[[
2
+[[
2
2
2
[[
2
+[[
2
.
And indeed this turns out to be the case. Simple differentiation of (148) yields,
(z)
z
=
z +
1
2
[[
2
+[[
2
z
1
2
' y` +
'y`
(z)
z
= z +
1
2
[[
2
+[[
2
z
1
2
( 'y` + ' y`)
which evaluated at zero give the linear term in the quadratic loss function, and further differentia-
tions yield,
H
C
cc
(z) =
H
zz
H
zz
H
z z
H
z z
=
1
2
[[
2
+[[
2
2
2
[[
2
+[[
2
(0)
c
H
An obvious necessary condition for the least-squares solution to exist is that
[[
2
= [[
2
.
The solution will be a global
100
minimum if the Hessian matrix is positive denite. This will be
true if the two leading principal minors are strictly positive, which is true if and only if, again,
[[
2
= [[
2
. Thus, if [[
2
= [[
2
the solution given above is a global minimum to the least squares
problem.
The condition [[
2
= [[
2
corresponds to loss of identiability of the model
g(z) = z + z .
To see this, rst note that to identify a complex number is equivalent to identifying both the real
and imaginary parts of the number. If either of them is unidentiable, then so is the number.
Now note that the condition [[
2
= [[
2
says that and have the same magnitude, but, in
general, a different phase. If we call the phase difference , then the condition [[
2
= [[
2
is
equivalent to the condition
= e
j
,
which yields
g(z) = e
j
z + z = e
j
e
j
2
z + e
j
2
z
= e
j
e
j
2
z + e
j
2
z
= e
j
2
Re
e
j
2
z
.
Thus, it is evident that the imaginary part of e
j
2
z is unidentiable, and thus the complex number
e
j
2
z itself is unidentiable. And, since
z = e
j
e
j
2
z
= e
j
Re
e
j
2
z
+ j Im
e
j
2
z
,
it is obvious that z is unidentiable.
Note for the simplest case of = ( = 0), we have
g(z) = z + z = Re z
in which case Imz, and hence z, is unidentiable.
100
Because the Hessian is independent of z.
K. Kreutz-Delgado Copyright c (2003-2007, All Rights Reserved Version ECE275CG-F05v1.3d 69
2. A Simple Nonlinear Least Squares Problem - II. The nonlinearity encountered in the
previous example, is in a sense bogus and is not a nonlinearity at all, at least when viewed from
the c-real perspective.
101
Not surprisingly, then, we were able to compute an exact solution. Here,
we will briey look at the Newton and Gauss-Newton algorithms applied to the simple problem of
Example 1.
In the previous example, we computed the Newton Hessian of the least-squares loss function
(148). The difference between the Newton and Gauss-Newton algorithm resides in the difference
between the Newton Hessian and the Gauss-Newton Hessian. To compute the Gauss-Newton
Hessian, note that
y = g(c) = ( )
z
z
= Gc
and therefore (since the problem is linear in c) we have the not surprising result that
Gc =
g(c)
c
c
with
G = ( ) .
In this example, the least-squares weighting matrix is W = I and we have
G
H
WG = G
H
G =
( ) =
[[
2
[[
2
G
H
G
[[
2
[[
2
+ S
[[
2
[[
2
S
2
=
1
2
[[
2
+[[
2
2
2
[[
2
+[[
2
= H
C
cc
showing that for this simple example the Newton and Gauss-Newton Hessians are the same, and
therefore the Newton and Gauss-Newton algorithms are identical. As seen from Equations (129)
and (131), this is a consequence of the fact that g(c) is linear in c as then the matrix of second
partial derivatives of g required to compute the difference between the Newton and Gauss-Newton
algorithms vanishes
A
cc
(g)
c
g
c
H
= 0.
From the derivatives computed in the previous example, we can compute
(c)
c
H
as
(c)
c
H
=
(c)
z
(c)
z
(0)
z
(0)
z
+
1
2
[[
2
+[[
2
2
2
[[
2
+[[
2
101
This problem was designed to have the interesting feature that it is both nonlinear and non (complex) analytic in
z C, but both linear and (real) analytic when viewed in terms of the corresponding real parameterization r R
2
or
c R
2
.
K. Kreutz-Delgado Copyright c (2003-2007, All Rights Reserved Version ECE275CG-F05v1.3d 70
or
(c)
c
H
=
(0)
c
H
+H
C
cc
c.
The optimal update in the Newton algorithm is therefore given by
c = (H
C
cc
)
1
(c)
c
H
= (H
C
cc
)
1
(0)
c
H
c = c
opt
c .
The update step in the Newton algorithm is given by
c
new
= c +
c .
If we take the Newton stepsize = 1, we obtain
c
new
= c +
c = c +c
opt
c = c
opt
showing that we can attain the optimal solution in only one update step. For the real case, it
is well-known that the Newton algorithm attains the optimum in one step for a quadratic loss
function. Thus our result is not surprising given that the problem is a linear least-squares problem
in c.
Note that the off-diagonal elements of the constant-valued Hessian H
C
cc
are never zero and
generally are not small relative to the size of the diagonal elements of H
C
cc
. This contradicts the
statement made in [34] that for a quadratic loss function, the diagonal elements must be zero.
102
However, the pseudo-Newton algorithmproposed in [34] will converge to the correct solution when
applied to our problem, but at a slower convergent rate than the full Newton algorithm, which is
seen to be capable of providing one-step convergence. We have a trade off between complexity
(the less complex pseudo-Newton algorithm versus the more complex Newton algorithm) versus
speed of convergence (the slower converging pseudo-Newton algorithm versus the fast Newton
algorithm).
3. The Complex LMS Algorithm. Consider the problem of determining the complex vector
parameter a C
n
which minimizes the following generalization of the loss function (2) to the
vector parameter case,
(a) = E
[e
k
[
2
, e
k
=
k
a
H
k
, (150)
for
k
C and
k
C
n
. We will assume throughout that the parameter space is Euclidean so that
a
= I. The cogradient of (a) with respect to the unknown parameter vector a is given by
a
(a) = E
a
[e[
2
.
102
It is true, as we noted above, that for the quadratic loss function associated with a holomorphic nonlinear inverse
problem the off-diagonal elements of the Hessian are zero. However, the statement is not true in general.
K. Kreutz-Delgado Copyright c (2003-2007, All Rights Reserved Version ECE275CG-F05v1.3d 71
To determine the cogradient of
[e
k
[
2
= e
k
e
k
= e
k
e
k
= (
k
a
H
k
)(
k
a
H
k
)
note that
e
k
= (
k
a
H
k
) = (
k
H
k
a)
and that e
k
= (
k
a
H
k
) is independent of a. Then we have
a
e
k
e
k
= e
k
a
(
k
H
k
a)
= e
k
a
H
k
a
= e
k
H
k
.
The gradient of [e
k
[
2
= e
k
e
k
is given by
a
e
k
e
k
=
a
e
k
e
k
H
=
e
k
H
k
H
=
k
e
k
.
Thus, we readily have that the gradient (direction of steepest ascent) of the loss function (a) =
E
[e
k
[
2
is
a
(a) = E
k
e
k
= E
k
(
k
H
k
a)
.
If we set this (or the cogradient) equal to zero to determine a stationary point of the loss function
we obtain the standard Wiener-Hopf equations for the MMSE estimate of a.
103
Alternatively, if we make the instantaneous stochastic-gradient approximation,
a
(a)
a
(a
k
)
a
[e
k
[
2
=
k
e
k
=
k
H
k
a
k
,
where a
k
is a current estimate of the MMSE value of a and
a
(a) gives the direction of steepest
descent of (a), we obtain the standard LMS on-line stochastic gradient-descent algorithm for
learning an estimate of the complex vector a,
a
k+1
= a
k
a
(a
k
)
= a
k
+
k
k
e
k
= a
k
+
k
H
k
a
k
I
k
H
k
a
k
+
k
k
k
.
Thus, we have easily derived the complex LMS algorithm,
Complex LMS Algorithm: a
k+1
=
I
k
H
k
a
k
+
k
k
k
. (151)
103
Which, as mentioned earlier, can also be obtained from the orthogonality principle or completing the square.
Thus, if the Wiener-Hopf equations are our only goal there is no need to discuss complex derivatives at all. It is only
when a direction of steepest descent is needed in order to implement an on-line adaptive descent-like algorithm that
the need for the extended or conjugate derivative arises.
K. Kreutz-Delgado Copyright c (2003-2007, All Rights Reserved Version ECE275CG-F05v1.3d 72
References
[1] Optimum Array Processing, H.L. Van Trees, 2002, Wiley Interscience.
[2] Elements of Signal Detection & Estimation, C.W. Helstrom, 1995, Prentice Hall.
[3] Complex Variables and Applications, 6nd Edition, J. Brown & R. Churchill, 1996 McGraw-
Hill, New York.
[4] Complex Variables, 2nd Edition, S. Fisher, 1990/1999, Dover Publications, New York.
[5] Complex Variables and the Laplace Transform for Engineers, W. LePage, 1980/1961, Dover
Publications, New York.
[6] Complex Variables: Harmonic and Analytic Functions, F. Flanigan, 1972/1983, Dover Pub-
lications, New York.
[7] Principles of Mobile Communication, 2nd Edition, G.L. Stuber, 2001, Kluwer, Boston.
[8] Digital Communication, E. Lee & D. Messerchmitt, 1988, Kluwer, Boston.
[9] Introduction to Adaptive Arrays, R. Monzingo & T. Miller, 1980, Wiley, New York.
[10] Zur Formalen Theorie der Funktionen von mehr Complexen Ver anderlichen, W. Wirtinger,
Math. Ann., 97: 357-75, 1927
[11] Introduction to Complex Analysis, Z. Nehari, 1961, Allyn & Bacon, Inc.
[12] Theory of Complex Functions, R. Remmert, 1991, Springer-Verlag.
[13] Precoding and Signal Shaping for Digital Transmission, Robert Fischer, 2002, Wiley-
Interscience. Especially Appendix A: Wirtinger Calculus.
[14] A Complex Gradient Operator and its Application in Adaptive Array Theory, D.H. Brand-
wood, IEE Proceedings H (Microwaves, Optics, and Antennas) (British), Vol. 130, No. 1,
pp. 11-16, Feb. 1983.
[15] Fundamentals of Statistical Signal Processing: Estimation Theory, S.M. Kay, 1993, Prentice-
Hall, New Jersey. Chapter 15: Extensions for Complex Data and Parameters.
[16] Adaptive Filter Theory, 3rd Edition, S. Haykin, 1991, Prentice-Hall, New Jersey. Especially
Appendix B: Differentiation with Respect to a Vector.
[17] Theory of Functions on Complex Manifolds, G.M. Henkin & J. Leiterer, 1984, Birk auser.
[18] An Introduction to Complex Analysis in Several Variables, Revised 3rd Edition, Lars
H ormander, 1990, North-Holland. (First Edition was published in 1966.)
K. Kreutz-Delgado Copyright c (2003-2007, All Rights Reserved Version ECE275CG-F05v1.3d 73
[19] Several Complex Variables, Corrected 2nd Edition, 2001, S.G. Krantz, AMS Chelsea Pub-
lishing.
[20] Introduction to Complex Analysis, Part II: Functions of Several Variables, B.V. Shabat,
American Mathematical Society, 1992.
[21] Geometrical Methods of Mathematical Physics, Bernard Schutz,1980, Cambridge University
Press
[22] The Geometry of Physics, An Introduction, (with corrections and additions), Theodore
Frankel, 2001, Cambridge University Press.
[23] Fundamentals of Adaptive Filtering, A.H. Sayed, 2003, Wiley.
[24] Finite Dimensional Hilbert Spaces and Linear Inverse Problems, ECE275A Lecture Sup-
plement No. ECE275LS1-F05v1.1, K. Kreutz-Delgado, UCSD ECE Department, October
2005.
[25] Real Vector Derivatives and Gradient, ECE275A Lecture Supplement No. ECE275LS2-
F05v1.0, K. Kreutz-Delgado, UCSD ECE Department, October 2005.
[26] Natural Gradient Works Efciently in Learning, Shun-ichi Amari, Neural Computation,
10:251-76, 1998
[27] Complex Gradient and Hessian, A. van den Bos, IEE Proc.-Vis. Image Signal Processing
(British), 141(6):380-82, December 1994.
[28] A Cram er-Rao Lower Bound for Complex Parameters, A. van den Bos, IEEE Transactions
on Signal Processing, 42(10):2859, October 1994.
[29] Estimation of Complex Fourier Coefcients, A. van den Bos, IEE Proc.-Control Theory
Appl. (British), 142(3):253-6, May 1995.
[30] The Multivariate Complex Normal Distribution-A Generalization, A. van den Bos, IEEE
Transactions on Information Theory, 41(2):537-39, March 1995.
[31] Prices Theorem for Complex Variates, A. van den Bos, IEEE Transactions on Information
Theory, 42(1):286-7, January 1996.
[32] The Real-Complex Normal Distribution, A. van den Bos, IEEE Transactions on Informa-
tion Theory, 44(4):537-39, July 1998.
[33] Complex Digital Networks: A Sensitivity Analysis Based on the Wirtinger Calculus,
D. Franken, IEEE Transactions on Circuits and Systems-I: Fundamental Theory and Ap-
plications, 44(9):839-43, November 1997.
[34] A Newton-Like Algorithm for Complex Variables with Applications in Blind Equalization,
G. Yan & H. Fan, IEEE Proc. Signal Processing, 48(2):553-6, February 2000.
K. Kreutz-Delgado Copyright c (2003-2007, All Rights Reserved Version ECE275CG-F05v1.3d 74
[35] Cram er-Rao Lower Bound for Constrained Complex Parameters, A.K. Jagannatham &
B.D. Rao, IEEE Signal Proc. Letters, 11(11):875-8, November 2004.
[36] Optimization by Vector Space Methods, D.G. Luenberger, Wiley, 1969.
[37] Linear Operator Theory in Engineering and Science, A.W. Naylor & G.R. Sell, Springer-
Verlag, 1982.
[38] The Constrained Total Least Squares Technique and its Application to Harmonic Super-
resolution, T.J. Abatzoglou, J.M. Mendel, & G.A. Harada, IEEE Transactions on Signal
Processing, 39(5):1070-86, May 1991.