Maximum-Entropy and Bayesian Spectral Analysis and Estimation Problems_ Proceedings of the Third Workshop on Maximum Entropy and Bayesian Methods in Applied Statistics, Wyoming, U.S.a., August 1–4, 1983 ( PDFDrive )
Maximum-Entropy and Bayesian Spectral Analysis and Estimation Problems_ Proceedings of the Third Workshop on Maximum Entropy and Bayesian Methods in Applied Statistics, Wyoming, U.S.a., August 1–4, 1983 ( PDFDrive )
edited by
c. Ray Smith
u.s. Army Missile Command,
Redstone Arsenal, Alabama, U.S.A.
and
Gary J. Erickson
Department a/Electrical Engineering,
Seattle University, Seattle, Washington. U.S.A.
Preface •• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •• ix
ON ENTROPY RATE
Athanasios Papoulis •••••••••••••••••••••••••••••••• 39
......................
PRIOR KNOWLEDGE MUST BE USED
John Skilling and Stephen F. Gull 161
ix
BAYESIAN SPECTRUM AND CHIRP ANALYSIS
E. T. Jaynes
1. Introduction
The maximum entropy solution found by Burg [1967, 1975] has been
shown to give the optimal spectrum estimate-by a rather basic, inescapable
criterion of optimality-in one well defined problem [Jaynes, 1982]. In that
problem we estimate the spectrum of a time series {Yl ••• YN}, from in-
complete data consisting of a few autocovariances {R o ••• Rm }, m < N,
measured from the entire time series, and there is no noise.
This is the first example in spectrum analysis of an exact solution, which
follows directly from first principles without ad hoc intuitive assumptions
and devices. In particular, we found that there was no need to assume that
the time series was a realization of a "stationary Gaussian process." The
maximum entropy principle automatically created the Gaussian form for us,
out of the data. This indicated something that could not have been learned
by assuming a distribution, namely, that the Gaussian distribution is the one
that can be realized by Nature in more ways than can any other that agrees
with the given autocovariance data. This classic solution will go down in
history as the "hydrogen atom" of spectrum analysis theory.
But a much more common problem, also considered by Burg, is the one
where our data consist, not of autocovariances, but of the actual values of
{Y1 ••• YN}, a subset of the full time series, contaminated with noise.
Experience has shown Burg's method to be very successful here also, if we
first estimate m autocovariances from the data and then use them in the
maximum entropy calculation. The choice of m represents our judgment
about the noise magnitude, values too large introducing noise artifacts,
values too small losing resolution. For any m, the estimate we get would be
the optimal one if (a) the estimated autocovariances were known to be the
exact values and (b) we had no other data beyond those m autocovariances.
Although the success of the method just described indicates that it is
probably not far from optimal when used with good judgment about m, we
have as yet no analytical theory proving this or indicating any preferred dif-
ferent procedure. One would think that a true optimal solution should
(1) use all the information the data can give; i.e., estimate not just m < N
autocovariances from the data, but find our "best" estimate of all N of them
and their probable errors; (2) then make allowance for the uncertainty of
these estimates by progressively de-emphasizing the unreliable ones. There
should not be any sharp break as in the procedure used now, which amounts
to giving full credence to all autocovariance estimates up to lag m, zero
credence to all beyond m.
In Jaynes [1982] we surveyed these matters very generally and con-
cluded that much more analytical work needs to be done before we can
know how close the present partly ad hoc methods are to optimal in prob-
lems with noisy and/or incomplete data. The following is a sequel, reporting
the first stage of an attempt to understand the theoretical situation better,
by a direct Bayesian analysis of the noisy data problem. In effect, we are
trying to advance from the "hydrogen atom" to the "helium atom" of spec-
trum analysis theory.
SPECTRUM AND CHIRP 3
One might think that this had been done already, in the many papers
that study autoregressive (AR) models for this problem. However, as we
have noted before [Jaynes, 1982], introducing an AR model is not a step
toward solving a spectrum analysis problem, only a detour through an alter-
native way of formulating the problem. An AR connection can always be
made if one wishes to do so, for any power spectrum determines a covari-
ance function, which in turn determines a Wiener prediction filter, whose
coefficients can always be interpreted as the coefficients of an AR model.
But while this is always possible, it may not be appropriate (just as repre-
senting the function f(x) = exp( -x 2 ) by an infinite series of Bessel functions
is always possible but not always appropriate).
Indeed, learning that spectrum analysis problems can be formulated in
AR terms amounts to little more than discovering the Mittag-Leffler theorem
of complex variable theory (under rather general conditions an analytic
funct ion is determined by its poles and residues).
In this field there has been some contention over the relative merits of
AR and other models such as the MA (moving average) one. Mathematicians
never had theological disputes over the relative merits of the Mittag-Leffler
expansion and the Taylor series expansion. We expect that the AR repre-
sentation will be appropriate (i.e., conveniently parsimonious) when all the
poles happen to be close to the unit circle; it may be very inappropriate
otherwise.
Better understanding should come from an approach that emphasizes
logical economy by going directly to the question of interest. Instead of in-
voking an AR model at the beginning (which might bring in a lot of inappro-
priate and unnecessary detail, and also limits the scope of what can be done
thereafter), let us start with a simpler, more flexible model that contains
only the facts of data and noise, the specific quantities we want to esti-
mate, and no other formal apparatus. If AR relations-or any other kind-
are appropriate, then they ought to appear automatically, as a consequence
of our analysis, rather than as initial assumptions.
This is what did happen in Burg's problem; maximum entropy based on
autocovariance data led automatically to a spectrum estimator that could be
expressed most concisely (and beautifully) in AR form, the Lagrange multi-
pliers being convolutions of the AR coefficients: An = Lkakan-k. The first
reaction of some was to dismiss the whole maximum entropy principle as
'nothing but AR,' thereby missing the point of Burg's result. What was
important was not the particular analytical form of the solution, but rather
the logic and generality of his method of finding it.
The reasoning will apply equally well, generating solutions of different
analytical form, in other problems far beyond what any AR model could cope
with. Indeed, Burg's method of extrapolating the autocovariance beyond
the data was identical in rationale, formal relations, and technique, with the
means by which modern statistical mechanics predicts the course of an irre-
versible process from incomplete macroscopic data.
This demonstration of the power and logical unity of a way of thinking,
across the gulf of what appeared to be entirely different fields, was of
4 E. T. Jaynes
vastly greater, and more permanent, scientific value than merely finding the
solution to one particular technical problem.
Quickly, the point was made again just as strongly by applying the same
reasoning to problems of image reconstruction, of which the work of Gull
and Daniell [1978] is an oustandingly concise, readable example.
I think that, 200 years from now, scholars will still be reading these
works, no longer for technical enlightenment-for by then this method of
reasoning will be part of the familiar cultural background of everybody-but
as classics of the History of Science, which opened up a new era in how sci-
entists think.
The actual reasoning had, in fact, been given by Boltzmann and Gibbs
long before; but it required explicit, successful applications outside the field
of thermodynamics before either physicists or statisticians could perceive its
power or its generality.
In the study reported here we have tried to profit by these lessons in
logical economy; at the beginning it was decided not to put in that fancy
entropy stuff until we had done an absolutely conventional, plain vanilla
Bayesian analysis-just to see what was in it, unobscured by all the details
that appear in AR analyses. It turned out that so much surprising new stuff
was in it that we are still exploring the plain vanilla Bayesian solution and
have not yet reached the entropic phase of the theoryl
The new stuff reported here includes what we think is the first deriva-
tion of the Schuster periodogram directly from the principles of probability
theory, and an extension of spectrum analysis to include chirp analysis (rate
of change of frequency). Before we had progressed very far it became evi-
dent that estimation of chirp is theoretically no more difficult than 'sta-
tionary' spectrum estimation. Of course, this takes us beyond the domain of
AR models, and could never have been found within the confines of an AR
analysis.
Our calculations and results are straightforward and elementary; in a
communication between experienced Bayesians it could all be reported in
five pages and the readers would understand perfectly well what we had
done, why we had done it, and what the results mean for data processing.
Such readers will doubtless find our style maddeningly verbose.
However, hoping that this work might also serve a tutorial function, we
have scattered throughout the text and appendices many pages of detailed
explanation of the reasons for what we do and the meaning of each new
equation as it appears.
Also, since consideration of chirp has not figured very much in past
spectrum analysis, and Bayesian analysis has not been very prominent either,
the next three sections survey briefly the history and nature of the chirp
problem and review the Bayesian reasoning format. Our new calculation
begins in Section 5.
SPECTRUM AND CHIRP 5
2. Chirp Analysis
poseful signals that it would be ludicrous to call "random," and that those
signals are being corrupted by additive noise that we are unable to control
or predict. Yet we do have some cogent prior information about both the
signals and the noise, and so our job is not to ask, "What seems to go on
here?" but rather to set up a model that expresses that prior information.
3. Spectral Snapshots
In this section which is possibly the first direct Bayesian analysis of the
problem, attempts at conceptual innovation are out of order, and we wish
8 E. T. Jaynes
reason that it is of interest). We should take into account all the prior in-
formation we have about its structure (functional form), which affords us
our best means of finding it in the noise.
To represent the signal as a ·sample from a Gaussian random process·
would be, in effect, to change the problem into that of estimating the pa-
rameters in an imaginary Gaussian distribution of all possible Signals. That
is not our aim here; we want to estimate a property of the real world, the
spectrum of the specific real signal that generated our data.
Our only seemingly drastic simplification is that for the time being we
suppose it known in advance that the signal contains only a single term, of
the form
In practice we shall have these data only at discrete times, which we sup-
pose for the moment to be equally spaced at integer values of t, and over a
finite interval n. Thus our data consist of N = (n + 1) values
which is our sampling distribution. Conversely, given a and the data 0, the
joint likelihood of the unknown parameters is
2: r
T
L(A,w,a,e) a: ~xp {- 2 [y(t) - A cos(wt + at 2 + e )]2} • (6)
t=-T
Yt = y(t) = 0, It I z T • (7)
Then all our sums of functions K(yt> over time indices can take the form
EK t , understood to run over (-CJ) < t < CJ». In this notation, which we use
henceforth, all those little details are present automatically, but kept out of
sight.
Usually, the absolute phase e is of no interest to us and we have no
prior information about it; that is, it is a "nuisance parameter" that we want
to eliminate. We may integrate it out with respect to a uniform prior prob-
ability density, getting a marginal quasi-likelihood
L(A,w,a) = 2w
1 Jfh L(A,w,a, e) de (8)
o
that represents the contribution from the data to the joint marginal poste-
rior distribution of (A,w,a). This is the course we shall pursue in the present
work.
But an important exception occurs if our ultimate objective is not to
estimate the parameters (A,w,a) of the "regular" signal f(t) but rather to
SPECTRUM AN 0 CHI RP 11
L cos (wt 2 +
2T + 1
at 2 + a) " ' - 2 - = 2 " -
N
(9)
t
A NA2
L(A,w,a) = exp { 01 Yt cost wt + at 2 + a) -
40 2
}, (10)
t
(11 )
The form of Eq. (11) already provides some (at least to the writer)
unexpected insight. Given any sampling distribution, a likelihood or quasi-
likelihood function, in its dependence on the parameters, contains all the
information in the data that is relevant for any inference about those pa-
12 E. T. Jaynes
= w11 r
t
Yt e iwt l2
R(t) = N- 1 r
s
YsYs+t • (14 )
P(W)BT = r
m
t=-m
W(t) R(t) coswt , (15)
were dozens of other ad hoc procedures, which would have corrected the
failure of the B-T method to deal with sharp lines without venturing into the
forbidden realm of Bayesianity.
But history proceeded otherwise; and now finally, the first step of a
Bayesian analysis has told us what a century of intuitive ad hockery did not.
The periodogram was introduced previously only as an intuitive spectrum
estimate; but now that it has been derived from the principles of probability
theory we see it in a very different light. Schuster's periodogram is indeed
fundamental to spectrum analysis; but not because it is itself a satisfactory
spectrum estimator, nor because any linear smoothing can convert it into
one in our problem.
The importance of the periodogram lies rather in its information con-
tent; in the presence of white Gaussian noise, it conveys all the information
the data have to offer about the spectrum of f(t). As noted, the chirpogram
has the same property in our more general problem.
It will follow from Eq. (11) [see Eq. (28) below], that the proper algo-
rithm to convert C( w,O) into a power spectrum estimate is a complicated
nonlinear operation much like exponentiation followed by renormalization, a
crude approximation being
(16)
This will suppress those spurious wiggles at the bottom of the periodogram as
well as did the B-T linear smoothing; but it will do it by attenuation rather
than smearing, and will therefore not lose any resolution. The Bayesian
nonlinear processing of C(w,O) will also yield, when the data give evidence
for them, arbitrarily sharp spectral line peaks from the top of the periodo-
gram that linear smoothing cannot give.
It is clear from Eq. (11) why a nonlinear processing of C is needed.
The likelihood involves not just C, but C in comparison with the noise level
0 2• The B-T procedure, Eq. (15), smears out all parts of C equally, without
considering where they stand relative to any noise. The Bayesian nonlinear
processing takes the noise level into account; wiggles below the noise level
are almost certainly artifacts of the noise and are suppressed, while peaks
that rise above the noise level are believed and emphasized.
It may seem at this point surprising that intuition did not see the need
for this long ago. Note, however, that Blackman and Tukey had in mind a
very different problem than ours. For them the whole data y(t) were a
sample from a "stochastic process" with a multivariate Gaussian distribu-
tion. From the standpoint of our present problem, we might interpret the
B-T work as a preliminary study of the noise spectrum before the purposeful
signal was added. So for them the notion of "C in comparison with the
noise level" did not exist.
To emphasize this, note that B-T considered the periodogram to have a
sampling distribution that was chi-squared with two degrees of freedom, in-
dependently of the sample size. That would not be the case in the problem
we are studying unless the signal f(t) were absent.
SPECTRUM AND CHIRP 15
From this observation there follows a point that has not been suffi-
ciently stressed-or even noticed-in the literature: the B-T efforts were
not directed at all toward the present problem of estimating the power
spectrum of a signal f(t) from data that are contaminated with noise.
Confusion over 'What is the problem?' has been rampant here. We
conceded, after a theoretical study [Jaynes, 1981], that pure maximum
entropy is not optimal for estimating the spectrum of a signal in the pres-
ence of noise; but we failed to see the point just noted. Immediately, Tukey
and Brillinger [1982] proceeded to stress the extreme importance of noise in
real problems and the necessity of taking it into account. But they failed to
note the robustness of maximum entropy with respect to noise (that is, its
practical success in problems where noise is present), or that a given proce-
dure may solve more than one problem, and in fact maximum entropy is also
the optimal solution to a Gaussian problem.
Although we agree with the need to take noise into account (as the
present work demonstrates), we can hardly see that as an argument in favor
of B-T methods in preference to maximum entropy in any problem. For we
must distinguish between the B-T problem (spectrum of Gaussian noise) and
the B-T procedure, Eq. (15), which has not been derived from, or shown to
have any logical connection at all to, that problem.
Indeed, Burg's original derivation of the maximum entropy algorithm
started from just the B-T assumption that the data are Gaussian noise; and
we showed [Jaynes, 1982] that pure maximum entropy from autocovariance
data leads automatically to a Gaussian predictive distribution for future
data. Thus it appears that the' best" solution to the B-T problem is not the
B-T tapering procedure, Eq. (15), but the Burg procedure.
But strangely enough, this enables us to take a more kindly view toward
B-T methods. The procedure in Eq. (15) cannot be claimed as the 'best"
one for any spectrum analysis problem; yet it has a place in our toolbox. As
a procedure it is applicable to any data, and is dependent on no hypotheses
of Gaussianity. Whatever the phenomenon, if nothing is known in advance
about its spectrum, the tapering Eq. (15) is a quick and easy way to wash
out the wiggles enough to get a preliminary view of the broad features of
the spectrum, helpful in deciding whether a more sophisticated data analysis
is called for. The most skilled precision machinist still has frequent use for
a jackknife.
Any tapering clearly does falsify the data, throwing away usable infor-
mation. But data contaminated with noise are themselves in part • false.'
The valid criticism from the standpoint of our present problem is that when
the noise goes away, the falsification in Eq. (15) remains; that is the case
where Burg's pure maximum entropy solution was clearly the optimal one.
But in the different problem envisaged by B-T, which led to Eq. (15), the
noise cannot go away because the noise and the data are identical.
Thus from inspection of Eq. (11), we can see already some clarification
of these muddy waters. The Bayesian method is going to give us, for spec-
trum estimation of a signal in the presence of noise, the same kind of
improvement in resolution and removal of spurious features (relative to the
16 E. T. Jaynes
L b
P(w) dw (17)
is the expectation, over the joint posterior distribution of all the unknown
parameters, of the energy carried by the signal f(t), not the noise, in the
frequency band (a < w < b), in the observation time N = 2T + 1. The true
total energy is NA2/2, and given data D we can write its expectation as
To define a power spectrum only over posltive fre~encies (as would be done
experimentally), one should take instead P+(W) = 2P(w), 0 ~ w < 'If.
In Eqs. (18) and (19), p(A,W,<l1 D,I) is the joint posterior distribution
r
....
P(w) = (21 )
[~ro dA l(A,ro)
18 E. T. Jaynes
or
,.
P(III) = (N/2) E(A21I110I) p(IIIID1) • (23)
,.
In words: P(III) = (conditional expectation of energy given Ill) x (posterior
density of III given the data). Both factors may be of special interest, so we
evaluate them separately. In Appendix C we derive
(24)
where
q(lII) _ C(1II,O)/2a 2 • (25)
I
This has two limiting forms:
2C(III,O),
(26)
a2 + C(III,O),
The second case is unlikely to arise, because it would mean that the signal
and noise have, by chance, nearly canceled each other out. If this happens,
Bayes' theorem tells us to hedge our bets a little. Something strange is
afoot.
From Appendix C, the second factor in Eq. (22) is
exp(q)lo(q)
p(1II101) = (27)
[:xP(q) ',(q) d.
This answers the question: ·Conditional on the data, what is the probability
that the signal frequency lies in the range (Ill, lII+dlll) irrespective of its
SPECTRUM AN 0 CHI RP 19
amplitude ~. In many situations it is this question, and not the power spec-
trum density, that matters, for the signal amplitude may have been cor-
rupted en route by all kinds of irrelevant circumstances and the frequency
alone may be of interest.
Combining Eqs. (24) and (27), we have the explicit Bayesian power
spectrum estimate
f.
-----------
shows it close to the asymptotic form (8/'II'q)1/2 exp(2q) over most of the
range likely to appear; in most cases we would not make a bad approxima-
tion if we replaced Eq. (28) by
f. exp(2q) d.
Whenever the data give evidence for a signal well above the noise level
(that is, C(oo,O) reaches a global maximum Cmax C(v,O) =2 ), then»(
=
qmax Cmax/ a2 »1 and most of the contribution to the integral in the
denominator of Eq. (28) will come from the neighborhood of this greatest
peak. Expanding
the factor of 2 coming from the equal peak at (-v). Near the greatest
peak, the positive frequency estimator reduces to
20 E. T. Jaynes
(33)
where I5w = (q")-lj2 is the accuracy with which the frequency can be esti-
mated. Comparing with the factorization Eq. (23), we see that the Gaus-
sian is the posterior distribution of w, while the first factor, the estimated
total energy in the line, is
Yt = A COS\lt + et , A» CJ • (35)
( 37)
so 2C max = NA2/2 is indeed the correct total energy carried by the signal.
likewise,
9. Extension to Chirp
f(q)
'"P(w,a) = (40)
As in Eq. (28), any peaks of C(w,a) that rise above the noise level will be
strongly emphasized, indicating high probability of a signal.
If we have indeed no prior information about the frequencies and chirp
rates to be expected, but need to be ready for all contingencies, then there
seems to be no way of avoiding the computation in ~etermining C!w,o) over
the entire plane (-,.. < w,a < ,..). Note that, while P( w,O) equals P( -w,O) by
symmetry, when chirp is present we have inversion symmetry, P(w,a) =
P( -w, -a), so half of the (w - a) plane needs to be searched if no prior infor-
mation about the signal location is at hand.
But noting this shows how much reduction in computation can be had
with suitable prior information. That bat, knowing in advance that it has
emitted a signal of parameters (JIo,ao, and knowing also what frequency in-
terval (determined by its flight speed v and the range of important targets)
is of interest, does not need to scan the whole plane. It need scan only a
portion of the line a "'ao extending from about wo(1 - vic) to wo(1 + 4v/c),
where c = velocity of sound, to cover the important contingencies (a target
that is approaching is a potential collision; one dead ahead approaching at
the flight speed is a stationary object, a potential landing site; one moving
away rapidly is uninteresting; one moving away slowly may be a moth, a
potential meal).
In the above we made what seems a strong assumption, that only one
signal f(t) = A cas ( wt + at 2 + e) is present, and all our results were infer-
ences as to where in the parameter space of (A,w,a) that one signal might
be. This is realistic in some problems (that is, oceanographic chirp, or only
one bat is in our cage, etc.), but not in most. In what way has this assump-
tion affected our final result?
22 E. T. Jaynes
Suppose for the moment that the chirp rate ex equals O. Then the power
spectrum estimate 'P(w) dw in Eq. (28) represents, as we noted, the answer
to:
Question A: What is your "best" estimate (expectation) of the
energy carried by the signal f(t) in the frequency band dw, in the
interval of observation?
But had we asked a different question,
Question B: What is your estimate of the product of the energy
transports in the two nonoverlapping frequency bands (a < w < b)
and (c < w < d)?
the answer would be zero; a single fixed frequency w cannot be in two dif-
ferent bands simultaneously. If our prior information I tells us that only one
frequency can be present, then the joint posterior probability p(Eab,EcdIO,I)
of the events
n
f(t) = r A j cos(Wjt + 8j) (43)
j=1
s:
sampled at instants t m, 1 ~ m N, which need not be uniformly spaced (our
previous equally spaced version is recovered if we make the particular
choice tm =-
T + m - 1). Use the notation
fm = f(tm). (44)
The joint likelihood of the parameters is now
l(Aiwi8ia) = a- N exp [- ~ r
N
m=1
(Ym - fm)2]
(46)
(47)
(48)
(49)
r
N
x(w) = N- 1 Ym exp(iwt m) , (50)
m=1
24 E. T. Jaynes
N
M _ N- l , "
jk L 1 ~ j, k ~ n. (52)
m=1
Then
n
yf = ~ djAj (53 )
j=1
n
fT = L MjkAjAk (54)
jk=1
The likelihood Eq. (45) will factor in the desired way if we complete the
square in Q. Define the quantities Ak by
r
n
dj = Mjk~, 1 ~j~n, (56)
k=1
(57)
jk kj
SPECTRUM AND CHIRP 25
Then
Q = r
jk
Mjk(Aj - ~)(Ak - Ak) - r Mjk~Ak
jk
(58)
L = Ll L2 L3 (59)
with
L3 = exp{+
N
iCJ2 r
jk
" "
MjkAjAk)} • (62)
r
rameters becomes
but when N is reasonably large (that is, when we have enough data to
permit a reasonably good estimate of CJ), this is nearly the same as
exp(-NQ/2r) • (64)
In its dependence on {AiWie i} this is just Eq. (45) with CJ2 replaced by
yr. Thus, in effect, if we tell Bayes' theorem: "I'm sorry, I don't know
what CJ2 is'" it replies to us, "That's all right, don't worry. We'll just
replace CJ2 by the best estimate of CJ2 that we can make from the data,
namely yr. " After doing this, we shall have the same quadratic form Q as
before, and its minimization will locate the same "best" estimates of the
other parameters as before. The only difference is that for small N the
peak of Eq. (63) will not be as sharp as that of the Gaussian (45), so we are
not quite so sure of the accuracy of our estimates; but that is the only price
we paid for our ignorance of CJ.
Therefore, if N is reasonably large it hardly matters whether CJ is known
or unknown. Supposing for simplicity, as we did before, that CJ is known,
the joint posterior density of the other parameters {Aiwie i} factors:
(65)
and the joint marginal posterior density of the frequencies and phases:
Equation (66) says that '"Aj is the estimate of the amplitude Aj that we
should make, given the frequencies and phases {Wjej}, and Eq. (67) says that
the most probable values of the frequencies and phases are those for which
the estimated amplitudes are large.
The above relations are, within the context of our model, exact and
quite general. The number n of possible signals and the sampling times tm
may be chosen arbitrarily. Partly for that reason, to explore all the results
that are in Eqs. (65) to (67) would require far more space than we have
here. We shall forego working out the interesting details of what happens
to our conclusions when the sampling times are not equally spaced, what the
answers to those more complicated questions like (B) and (C) are, and what
happens to the matrix M in the limit when two frequencies coincide.
SPECTRUM AND CHIRP 27
tm = -(T + 1) + m, 1 ~ m ~ N, 2T + 1 =N , (68)
then Mj k reduces to
T
M
jk = N -l 1::: cos(Wjt + 6j) cos(wkt + 6k) • (69)
t=-T
1 sinNwj
Mjj = -2 + 2N·
smwj
cos26j, 1 ~ j ~ n , (70)
(71)
where u :: (Wj - wk)/2. This becomes of order unity only when the two fre-
quencies are too close to resolve and that merging phenomenon begins.
Thus as long as the frequencies {Wj} are so well separated that
(72)
(73 )
28 E. T. Jaynes
and the above relations simplify drastically. The amplitude estimates reduce
to, from Eq. (51),
= L
exp[cr- 2 C(Wj,O) cos 2(aj + "'j)] (75)
j
n
= exp {cr-2~ [Aj /NC(wj,O) cos(aj + "'j) - NAj2/4)} , (76)
j=1
n
p({Wj} IDI) II: Tr exp(qj} lo(qj) (77)
j=1
and
n
p({AjWj}IDI) II: Tr exp(-NAj2/4cr 2) lo[Aj INC(wj,O)/cr 2 ] •
j=1
(78)
But these are just products of independent distributions identical with our
previous single-signal results, Eqs. (11) and (27). We leave it as an exercise
for the reader to show from this that our previous power spectrum esti-
mate (28) will follow. Thus as long as we ask only question (A) above, our
single-signal assumption was not a restriction after all.
At this point it is clear also that if our n signals are chirped, we need
only replace C(Wj'O) by C(Wj,Qj) in these results, and we shall get the same
answers as before to any question about frequency and chirp that involves
individual signals, but not correlations between different signals.
Although the theory presented here is only the first step of the devel-
opment that is visualized, we have thought it useful to give an extensive
exposition of the Bayesian part of the theory.
No connection with AR models has yet appeared; but we expect this to
happen when additional prior information is put in by entropy factors. In a
full theory of spectrum estimation in the presence of noise, in the limit as
the noise goes to zero the solution should reduce to something like the origi-
nal Burg pure maximum entropy solution (it will not be exactly the same,
because we are assuming a different kind of data).
For understanding and appreciating Bayesian inference, no theorems
proving its secure theoretical foundations can be quite as effective as seeing
it in operation on a real problem. Every new Bayesian solution like the
present one gives us a new appreciation of the power and sophistication of
Bayes' theorem as the true logic of science. It seeks out every factor in
the model that has any relevance to the question being asked, tells us quan-
titatively how relevant it is-and relentlessly exposes how crude and primi-
tive other methods were.
We could expand the example studied here to a large volume without
exhausting all the interesting and useful detail contained in the general
solution-almost none of which was anticipated by sampling theory or
intuition.
30 E. T. Jaynes
Yt = Acos(vt+6), (A2)
then the periodogram reaches its peak value at or very near the true
frequency,
X(v) (A3)
Yt = A cos(vt + at 2 + 6) , (A4)
then the periodogram (A1) is reduced, broadened, and distorted. Its value
at the center frequency is only about
It is clear from (AS) that the reduction is not severe if aT2 ~ 1, but (A6)
shows that it can essentially wipe out the signal if a T2 » 1.
SPECTRUM AND CHIRP 31
Barber and Ursell observed a signal whose period fell from 17.4 sec to 12.9
sec in 36 hours. These data give a = 5 x 10- 7 sec- 2 , placing the source
about 3000 miles away, which was verified by weather records. In May
1946 a signal appeared, whose period fell from 21.0 sec to 13.9 sec in
4 days, and came from a source 6000 miles away.
Up to 1947 Barber and Ursell had analyzed some 40 instances like this,
without using anybody's recommended methods of spectrum analysis to
measure those periods. Instead they needed only a home-made analog com-
puter, putting a picture of their data on a rotating drum and noting how it
excited a resonant galvanometer as the rotation speed variedl
Munk and Snodgrass surely knew about the phase cancellation effect,
and were not claiming to have discovered the phenomenon, for they make
reference to Barber and Ursell. It appears to us that they did not choose
their record lengths with these chirped signals in mind, simply because they
had intended to study other phenomena. But the signals were so strong that
they saw them anyway-not as a result of using a data analysis method
appropriate to find them, but in spite of an inappropriate method.
Today, it would be interesting to re-analyze their original data by the
method suggested here. If there is a constant amplitude and chirp rate
across the data record, the chirpogram should reach the full maximum (A3),
without amplitude degradation or broadening, at the true center frequency
and chirp rate, and shoud therefore provide a much more sensitive and accu-
rate data analysis procedure.
applying the procedure itself to any set of data whatsoever, whether or not
"their hypotheses hold." And indeed, there are "circumstances where it
does better and others where it does worse."
But we believe also that probability theory incorporating Bayesian max-
imum entropy principles is the proper tool-and a very powerful one-for
(a) determining those circumstances for a given procedure, and (b) deter-
mining the optimal procedure, given what we know about the circumstances.
This belief is supported by decades of theoretical and practical demonstra-
tions of that power.
Clearly, while striving to avoid gratuitous assumption of information
that we do not have, we ought at the same time to use all the relevant in-
formation that we actually do have; and so Tukey has also wisely advised us
to think very hard about the real phenomenon being observed, so that we
can recognize those special circumstances that matter and take them into
account. As a general statement of policy, we could ask for nothing better;
so our question is: How can we implement that policy. in practice?
The original motivation for the principle of maximum entropy [Jaynes,
1957] was precisely the avoidance of gratuitous hypotheses, while taking
account of what is known. It appears to us that in many real problems the
procedure of the maximum entropy principle meets both of these require-
ments, and thus represents the explicit realization of Tukey's goal.
If we are so fortunate as to have additional information about the noise
beyond the mean-square value supposed in the text, we can exploit this to
make the signal more visible, because it reduces the measure W, or
"volume," of the support set of possible noise variations that we have to
allow for. For example, if we learn of some respect in which the noise is
not white, then it becomes in part predictable and some signals that were
previously indistinguishable from the noise can now be separated.
The effectiveness of new information in thus increasing signal visibility
is determined by the reduction it achieves in the entropy of the joint distri-
bution of noise values-essentially, the logarithm of the ratio W'/W by which
that measure is reduced. The maximum entropy formalism is the mathemati-
cal tool that enables us to locate the new contracted support set on which
the likely noise vectors lie.
It appears to us that the evidence for the superior power of Bayesian
maximum entropy methods over both intuition and "orthodox" methods is
now so overwhelming that nobody who is concerned with data analysis-in
any field-can afford to ignore it. In our opinion these methods, far from
conflicting with the goals and prinCiples expounded by Tukey, represent their
explicit quantitative realization, which intuition could only approximate in a
crude way.
SPECTRUM AND CHIRP 35
r t
Yt cos(wt + ae + e) = p cose - 0 sine
where
o- r t
Yt sin(wt + ae) (C3)
(C5)
and substituting Eqs. (C1) and (C4) into Eq. (10), the integral (8) over e is
the standard integral representation of the Bessel function:
f 211"
= (211")-1) eX cose de (C6)
o
Z(a,b) = )
'" -ax2
o
e lo(bx) dx = (n /4a)l/2 exp(b2/Ba) lo(b2/Ba) (C7)
we obtain
15. References
Nikolaus, B., and D. Grischkowsky (1983), "90 fsec tunable optical pulses
obtained by two-stage pulse compression," Appl. Phys. Lett. 43, pp.
228-230.
Gull, S. F., and G. J. Daniell (1978), " Image reconstruction from incomplete
and noisy data," Nature 272, pp. 686-690.
Savage, L. J. (1954), The Foundations of Statistics, Wiley & Sons, New York.
Athanasios Papoulis
39
1. Introduction
of these samples is the Nth order entropy of the process x [n], and the ratio
of H(x 1, ••• , xN)/N is the average uncertainty per sample in a block of N
samples. The limit of this ratio is the entropy rate H(x) of the process
x[n]. Thus,
is the uncertainty about the presence of x [n] under the assumption that its
N most recent past values have been observed. The above conditional
entropy is a nonincreasing sequence of N, and its limit
Indeed, since we know that the limit of the sequence in Eq. (5) can be writ-
ten as the limit of its Cesaro sum
N
Hc<x) = lim ~ ~ H(x[n]lx[n-1], ••• , x[n-m]) (7)
N+a> m=1
ON ENTROPY RATE 41
·
I 1m AN+1
_ _ = lim PN = P. (10)
N+CD AN N+CD
The constant P is the mean square value of the estimate of x [n] in terms of
its entire past. Hence (Kolmogoroff-Sz~go formula) [Papoulis, 1984]:
'Ir
1
P = exp [21r J .
In S(eJW) dw ] (11 )
-'Ir
(12)
mED, (17)
~ -jmw
= L cm e • (18)
ImlED
To complete the determination of S(ejW), it suffices, therefore, to
express the above coefficients c m in terms of the given data am. We shall
do so using a steepest ascent method that utilizes the numerical simplicity
of Levinson's algorithm. As a preparation, we review, briefly, the well
known solution for consecutive data.
1 P
= (19)
M
'L" cme -jrnw
m=-M
M
11 - ~>m e-jrnw 12
m=O
P c m = -am + r
M-m
i=1
aiai+m , (20)
1 ~ m ~ N-1 (21a)
(21b)
N-1
PN-1KN = R[N] - L R[N-k]a~-l (21c)
m=1
(22)
(23)
(24)
(26)
(27)
is the linear mean square estimate of x[n] in terms of its entire past and P
is the resulting mean square error. Hence [see Eqs. (11) and (16)]
cm = 0, mED, (29)
where D is the set of integers in the interval (O,M) that are not in D. To
determine S(ejW), it suffices, therefore, to search only among all admissible
AR spectra, that is, spectra of the form (19) satisfying the constraints (15).
The class of these spectra and of the corresponding autocorrelations R[m]
will be denoted by eM. Clearly, the ME spectrum is a member of this class.
To determine it, we shall use a steepest ascent method based on Eqs. (17)
and (18).
The entropy rate H(x) of the process x[n] is a function of the unspeci-
fied values of Rem], and, in the Cm space, it is maximum at the origin [see
Eq. (29)]. Furthermore,
mED. (30)
This shows that, in the space with coordinates R em], the gradient of the
hyper-surface H = constant is a vector whose coordinates are proportional
to cm. This leads to the following iterative method for determining the ME
spectrum.
Rh[m], mED
Ri[m] =
1 Rh[m] - ISicm,i-1I mED
(31)
M-m
Pi cm , i = -am ,i + ~
L ak I i ak+m I i • (32)
k=1
The choice of the constant lSi is dictated by the following two require-
ments: (a) the sequence Ri[m] obtained from Eq. (31) must be in the class
Cm; (b) the corresponding entropy rate Hi(X) must exceed the entropy rate
Hi-dx).
Both requirements are satisfied if lSi is sufficiently small. In fact, we
have a simple way of checking them without any additional computation.
Indeed, as we noted, Ri [m] is in the class CM iff
(33)
(34)
L cfu,i. (35)
mED
Km = 0, mED. (36)
We have, thus, specified R[m] and Km for every m in the interval (O,M). If
the coefficients Km are such that
46 A. Papoulis
(37)
then the sequence R[m] is admissible and can, therefore, be used to start
the iteration.
Modified data. Suppose, however, that Eq. (37) is not true. In this
case, IKml > 1 for some mE D; hence the sequence R[m] is no longer in the
class eM. To find a member of the class eM, we modify the given data,
adding a constant r to the given value R[O] = Q o of the sequence ~[O]
obtained as above. If r is sufficiently small, then the resulting sequence
R[O] + r , m=O
R[m] = (38)
R[m] ,
Q o+ r, m=O
R[m] = (39)
O~mED.
m= 0
(40)
If the numbers llk are sufficiently small, then each starting sequence is
in the class eM. The corresponding ME spectrum Sk(ejW) is then determined
with the steepest ascent iteration discussed earlier. The above process ter-
minates with Rk[O] = Qo, that is, when the original data are restored.
We have, thus, developed a steepest ascent method for obtaining the ME
solution of the spectral estimation problem with nonconsecutive constraints.
The method converges rapidly, and it utilizes the computational simplicity of
Levinson's algorithm.
To prove Eq. (41), we assume, first, that x[n] is a normal process. In this
case, y[n] is also normal; hence [see Eqs. (11) and (42)],
(43)
Ry[m] = Bm , (46)
lows from Eq. (41) that, to maximize H(x), it suffices to maximize H(y) sub-
ject, again, to the constraints (46).
Using Eq. (41), we have, thus, reduced the above problem to the famil-
iar problem of determining the ME spectrum Sy(ejW) of y[n]. As we see
from Eq. (19), this spectrum is the all-pole function
p
(47)
M
11 - L am e - jmW12
m=1
(48)
r
M
11 - am e- jmw l2
m=1
(49)
then
m=1
1 r bm
M
e- jmw 12
Sx(ejW) = P - - - - - - - - (50)
M
11 - Lam e- jmW l 2
m=1
4. Acknowledgment
This work was supported by the Joint Services Technical Advising Com-
mittee under Contract F49620-80-C-0077.
ON ENTROPY RATE 49
5. References
John E. Shore--
to find q(x) when the sr(x) are known functions and the sr are known
values. In general, Eq. (1) has no unique solution for q(x). The constraints
(1) and
J dx q(x) = 1 (2)
-The full version of this paper has been submitted to the I EEE Transactions
on Information Theory.
--Present address: Entropic Processing, Inc., Washington Research Labora-
tory, 600 Pennsylvania Ave. S. E., Suite 202, Washington, DC 20003.
51
C. R. Smith and G. J. Erickson (eds.),
Maximum-Entropy and Bayesian Spectral Analysis and Estimation Problems, 51-56.
© 1987 by D. Reidel Publishing Company.
52 John E. Shore
ME and MRE have been applied to the power spectrum analysis of sta-
tionary time series and to the enhancement of two-dimensional images.
These applications proceed by expressing the spectrum estimation and image
enhancement problems in terms that require inversion of the equation
(r = O, ••• ,M) , (5 )
Dr =
image intensity at point r
Ok = object intensity at point k
=
srk point spread function
When M = N, Eq. (5) can be written in matrix form as s-O = D, which can in
principle be solved by
(6)
region M < r ~ N, and then solving for the power spectrum by Eq. (6). Per-
haps the best known extrapolation method is Burg's maximum entropy spec-
tral analysis (MESA) [Burg, 1%7, 1975; Shore, 1981], in which the power
spectrum S(f) is estimated by maximizing
N
~ log Ok (7)
k=1
where the Sr are chosen so that Eq. (8) satisfies Eq. (5). For equally
spaced autocorrelation lags, tr = rlit, Eq. (8) takes on the familiar form
OJ< = (9)
subject to Eq. (5) [Gordon and Herman, 1971; Frieden, 1972; Gull and Dan-
iell, 1978; Skilling and Gull, 1985]. The result has the form
M
Ok = ex p [-1 - ~ ar Srk] • (11 )
r=O
As was the case with Eq. (8), the ar are chosen so that Eq. (11) satisfies
Eq. (5). Spectrum estimates based on the maximization of Eq. (10) have
also been studied for ARMA, meteorological, and speech time series [Nadeu
et al., 1981; Ortigueira et al., 1981; Shore and johnson, 1984].
In previous work [Shore and johnson, 1984], we showed that the results
of maximizing both Eqs. (7) and (10) subject to Eq. (5) are equivalent to
relative-entropy minimization with a uniform initial estimate of 0, the
difference stemming from a different choice for the underlying state space.
In particular, in one case we treated Ok as the expected value of the power
at frequency fk; in the other, we normalized the Ok and treated them as
probabilities. In one sense these results were satisfying, since they showed
that the two estimators arise from the same underlying principle: relative-
entropy minimization. But we didn't answer the question of whether Eq. (7)
or Eq. (10) is the correct expression for entropy; we replaced the question
with another one, namely, which state space is correct?
Here we treat the question from a different viewpoint. We derive both
estimators by treating Ok as an expected value. For the spectrum estima-
tor, Ok is the expected power at fk (as in Shore and johnson [1984]); for
the image estimator, Ok is the expected number of photons received at point
k. We show that the difference between the estimators (8) and (11) corre-
sponds to different choices for the initial estimate-a Gaussian initial esti-
mate, which is appropriate for time series, leads to Eq. (8), while a Poisson
initial estimate, which is appropriate for images, leads to Eq. (11). This
unified derivation of spectrum and image estimators is also shown to extend
to cases involving multiple signals (for example, additive noise), weighted
initial estimates, and uncertain measurements.
These results show that the spectrum and image estimators arise from
MRE analyses in the same state space, but with different initial estimates.
But the results leave unclear the relationship between initial estimates and
state-space choices. To clarify this relationship, we show that, if a nonuni-
form initial estimate for MRE is interpreted as the result of aggregating a
uniform initial estimate from some "larger" state space, then MRE with the
nonuniform initial estimate is equivalent to ME in that larger space. This
result sheds further light on the relationship between the spectrum and
image estimators, namely that both can be viewed as arising from ME in
appropriate spaces. If the amplitudes in a time series are thought of as
arising from the sum of many uniformly distributed increments, then the
Gaussian initial estimate can be thought of as resulting from an aggregation
transformation. Similarly, the Poisson initial estimate can be thought of as
the aggregate probability of x total photons arriving during a standard time
STATE SPACES AND INITIAL ESTIMATES 55
interval, given (in the limit) a uniform probability for the time instant at
which any individual photon might arrive.
The paper closes with a general discussion of choosing initial estimates
in MRE problems.
Acknowledgment
References
Kay, S. M., and S. L. Marple, Jr. (1981), Spectrum analysis-a modern per-
I
Rodney W. Johnson*
57
C. R. Smith and G. J. Erickson (eds.),
Maximum-Entropy and Bayesian Spectral Analysis and Estimation Problems, 57-73
© 1987 by D. Reidel Publishing Company.
58 Rodney W. Johnson
1. Introduction
for known fr(x) and Tr, r = 0,1, ••• ,M. The principle states that one should
choose the final estimate q that satisfies
and where q' varies over the set of densities that satisfy the constraints.
When these are linear-equality constraints [Eq. (1)], the final estimate has
the form
where the ar and a are Lagrangian multipliers determined by Eq. (1) (with
qt replaced by q) and by the normalization constraint
J q(x) dx = 1. (5)
Properties of REP solutions and conditions for their existence are discussed
by Csiszar [1975] and by Shore and Johnson [1981]. Expressed in terms of
the expected values and the Lagrangian multipliers, the relative entropy at
the minimum is given by
MINIMIZATION WITH UNCERTAIN CONSTRAINTS 59
H( q,p) = -a - ~ 6 r fr • (6)
aCt
- -36 -- fr (8)
r
(9)
However, not all components vr may have equal uncertainty, and different
components may be correlated. We therefore replace Eq. (10) with the
more general constraint
(11 )
(12 )
Next we show how to reduce the more general constraint (11) to this case.
We conclude this section with a remark on the relation between the result
with E > 0 and that for • exact constraints· (E = 0).
Our problem is to minimize the relative entropy H(q,p) subject to the
constraint (13) (with q in place of qt) and the normalization constraint (5).
I f the initial estimate satisfies the constraint (that is, if Eq. (13) holds with
p in place of q), then setting q = p gives the minimum. Otherwise, equality
holds in Eq. (13) and the criterion for a minimLITl is that the variation of
MINIMIZATION WITH UNCERTAIN CONSTRAINTS 61
J
q(x) log ~~:l dx + A ~ [J I,(x) q(x) dx ], + (.-1) Jq(x) dx (14)
with respect to q(x) is zero for some lagrange multipliers A > 0, corre-
sponding to Eq. (13), and a-1, corresponding to Eq. (5). (We write a-1 in-
stead of a for later convenience.) With A > 0, the criterion intuitively im-
plies that a small change t'iq in q that leaves Jq(x) dx fixed and decreases
H(q,p) must increase the error term ErU fr(x) q(x) dx)2.
Equating the variation of expression (14) to zero gives
Therefore q satisfies
where
Conversely, if q has the form (16), and if a, A, and the Sr are chosen so that
Eq. (17), the constraint (13), and the normalization condition (5) hold, then
q is a solution to the minimization problem. But if Eq. (17) holds, the con-
straint with equality is equivalent to
(18)
or to
(19)
e: w =J
Sr fr(x) q(x) dx (20)
62 Rodney W. Johnson
hold, then the constraint (13) will be satisfied, and we can ensure that
Eq. (17) holds by the choice of x.
Next, assume a constraint of the general form (11), (9), with a symmet-
ric, positive-definite matrix M. Then there is a matrix A, not in general
unique, such that AtA = M. Now
(22)
where
ur = LArs Vs • (23)
s
and obtain
ur = j~ s
Ars[fs(x) - fsl q(x) dx • (25)
Defining
gr(x) = rs
Ars[fs(x) - fsl , (26)
we obtain
(27)
from Eq. (22). Thus, constraints of the general form (11) can be trans-
formed to (27), which is of the same form as (13).
MINIMIZATION WITH UNCERTAIN CONSTRAINTS 63
Note that Eq. (16) is identical to Eq. (4): that is, the functional form of
the solution with uncertain constraints is the same as that for exact con-
straints. The difference is that, for uncertain constraints, the conditions
a
that determine the r have the general form (20). These conditions reduce
to the exact-constraint case for E =
O. One way of viewing this identity of
form for the solutions of the two problems is to note that every solution q of
an uncertain-constraint problem is simultaneously a solution of an exact-
constraint problem with the same functions fk and appropriately modified
values for the Tk.
The relative entropy at the minimum may be computed by substituting
Eq. (16) into Eq. (3), which leads to
E lTiIT
ar = J f r(x) q(x) dx - -fr • (29)
aar =
- aa Ifr(x) q(x) dx , (31)
(32)
Note that Eqs. (30) and (32) reduce respectively to Eqs. (6) and (8) when
E = O.
64 Rodney W. Johnson
where
Cr(f) = cos 21rtrf • (34 )
Given initial estimates Pi(f) of the power spectrum of each Signal Si, and
autocorrelation measurements on the sum of the signals, MRESA provides
final estimates for the Si. In particular, if the measurements R}ot satisfy
1
(36)
where the Br are chosen so that the Qi satisfy the autocorrelation con-
straints (35) [Johnson and Shore, 1983]. Since some initial estimates may
be more reliable than others, these results have been extended recently to
include a frequency-dependent weight wi(f) for each initial estimate Pi(f)
[Johnson, Shore, and Burg, 1984]. The larger the value of Wi(f), the more
reliable the initial estimate Pi(f) is considered to be. With the weights in-
cluded, result (36) becomes
1
(37)
1 1 ~
Pi(f) + wi(f) L
r
N
si(t) = (aik cos 21ffkt + bik sin 21ffkt), i = 1, ••• ,L, (38)
k=1
with nonzero frequencies fk, not necessarily uniformly spaced. The aik and
bik were random variables with independent, zero-mean, Gaussian initial
distributions. We defined random variables
(39)
(40)
L N
p(x) = n n Pik(Xik) (41 )
i=1 k=1
where
1 -xik
Pik(Xik) = -p exp - p • (42)
ik ik
The assumed Gaussian form of the initial distribution of aik and bik is equiv-
alent to this exponential form for Pik(xik); the coefficients were chosen to
make the expectation of xik equal to Pik. Using Eq. (40), we wrote a dis-
crete-frequency form of Eq. (35) as linear constraints
R~ot = tfj
i=1 k=1
crk xik qt(x) dx (43)
(44)
subject to the constraints [Eq. (43) with q in place of qt) and the normali-
zation condition
j q(x) dx = 1 ; (46)
L N
q(x) = n n qik(xik), (47)
i=1 k=1
1 -xik
q"lk(X"lk) = -0 exp -0 • (49)
ik ik
1
Oik = (50)
M
[ Sr crk
r=O
l N
~ ~ crk Oik = R~ot (51 )
i=1 k=1
was satisfied.
To handle uncertain constraints, we first replace Eq. (35) with a bound
This has the form (36); by Eq. (16), minimizing relative entropy subject to
these constraints gives
E II~II = tf
i=1 k=1
J crk xik q(x) dx - R}ot (56)
[compare Eq. (20)]. Using Eq. (42), we find that q has the form (47),
where qik(xik) is proportional to
exp [
-xik
Q- -
Ik
L LL
M
r=O
ar
L
i=1 k=1
N
crk xik
]
• (57)
Consequently qik is given by Eq. (49), where Qik is given by Eq. (50).
Rewriting Eq. (56) in terms of Qik and passing from discrete to continuous
frequencies gives
1
(58)
_1_ + ~
Pi(f) L
!J
where the ar are to be determined so that
The functional form (58) of the solution with uncertain constraints is the
same as the form (36) for exact constraints; the difference is in the condi-
tions that determine the a r : Eq. (35) for exact constraints and Eq. (59) for
uncertain constraints. This is a consequence of the analogous result for
probability-density estimation, noted in Section 2.
In the case of the more general constraint form
(60)
with the error vector y as in Eq. (53), it is convenient to carry the matrix
through the derivation rather than transforming the constraint functions as
in Eq. (26). The result is that the final estimates again have the form (36),
while the conditions (59) on the ar are replaced by
MINIMIZATION WITH UNCERTAIN CONSTRAINTS 69
13'r
= t, J C,(I) Qi(/) dl - _lot, (61 )
where
(62)
4. Example
We shall use a numerical example from Johnson and Shore (1983) and
Johnson, Shore, and Burg (1984). We define a pair of spectra, SB and SS,
which we think of as a known "background" component and an unknown
"signal" component of a total spectrum. Both are symmetric and defined in
the frequency band from -0.5 to +0.5, though we plot only their positive-
frequency parts. SB is the sum of white noise with total power 5 and a
peak at frequency 0.215 corresponding to a single sinusoid with total power
2. Ss consists of a peak at frequency 0.165 corresponding to a sinusoid of
total power 2. Figure 1 shows a discrete-frequency approximation to the
sum 'SB+SS, using 100 equi-spaced frequencies. From the sum, six auto-
correlations were computed exactly. As the initial estimate PB of SB, we
used SB itself; that is, PB was Fig. 1 without the left-hand peak. For Ps we
used a uniform (flat) spectrum with the same total power as PB.
Figure 2 shows unweighted MRESA final estimates QB and Qs (Johnson
and Shore, 1983). The signal peak shows up primarily in QS, but some evi-
dence of it is in QB as well. This is reasonable since PB, although exactly
correct, is treated as an initial estimate subject to change by the data. The
signal peak can be suppressed from QB and enhanced in QS by weighting the
background estimate PB heavily [Johnson, Shore, and Burg, 1984).
Figure 3 shows final estimates for uncertain constraints with an error
bound of £ = 1. The Euclidean distance [a constraint of the form (52»),
was used. The estimates were obtained with Newton-Raphson algorithms
similar to those developed by Johnson (1983). Both final estimates in Fig. 3
are closer to the corresponding initial estimates than is the case in Fig. 2,
since the sum of the final estimates is no longer constrained to satisfy the
autocorrelations.
Figure 4 shows results for £ = 3; these final estimates are even closer
to the initial estimates. Because the example was constructed with exactly
known autocorrelations, it is not surprising that the exactly constrained final
estimates are better than those in Figs. 3 and 4, which illustrate the more
conservative deviation from initial estimates that results from incorporating
the uncertain constraints.
70 Rodney W. Johnson
1.2~ r-
IOOr
...
0.7~
II:
~
0
Go
o~o
0.2~ -
~
o.~
".J
.0 0.1 0.2 0.3 0.4 O.~
FREOUENCY
1.2~
1.00
0.7~
ffi~
~
o~o
0.2~
1.2~
1.00
a:: 0.7~
IIJ
~
0
CL
0.50
0.2~
0.00
0.0 0.1 0.3 0.4 o.~
FREOUENCY
1.2~
1.00
0.7~
a::
IIJ
~
0
CL
o.~o
0.2~
0.00
0.0 0.1 0.2 0.3 0.4 o.~
FREOUENCY
5. Discussion
6. References
911-934.
B. S. Choi
Department of Applied Statistics, Yonsei University, Seoul,
Korea
Thomas M: Cover
Departments of Statistics and Electrical Engineering, Stanford
University, Stanford, CA 94305
There are now many proofs that the maximum entropy stationary sto-
chastic process, subject to a finite number of autocorrelation constraints, is
the Gauss Markov process of appropriate order. The associated spectrum is
Burg's maximum entropy spectral density. We pose a somewhat broader
entropy maximization problem, in which stationarity, for example, is not
assumed, and shift the burden of proof from the previous focus on the cal-
culus of variations and time series techniques to a string of information the-
oretic inequalities. This results in a simple proof.
1. Preliminaries
The differential entropy of X_1, …, X_n with joint density f is

h(X_1, …, X_n) = −∫ f(x) ln f(x) dx = h(f).   (1)

The differential entropy rate of the process is

h = lim_{n→∞} h(X_1, X_2, …, X_n) / n   (2)
if the limit exists. It is known that the limit always exists for stationary
processes.
2. The Proof
Theorem 1: The stochastic process {X_i}_{i=1}^{∞} that maximizes the differ-
ential entropy rate h subject to the autocorrelation constraints

E[X_i X_{i+k}] = α_k,   k = 0, 1, …, p,   i = 1, 2, …,   (3)

is the pth-order Gauss Markov process with these covariances. It suffices to maximize

h(X_1, X_2, …, X_n) / n,   n = 1, 2, … .   (4)
Let Z_1, Z_2, …, Z_n be a zero-mean Gaussian process with the same covariances as
X_1, X_2, …, X_n, and let Z*_1, Z*_2, …, Z*_n be the pth-order Gauss Markov process with
covariances α_0, α_1, …, α_p. Then

h(X_1, X_2, …, X_n)
  ≤ h(Z_1, Z_2, …, Z_n)   (5a)
  = h(Z_1, …, Z_p) + Σ_{k=p+1}^{n} h(Z_k | Z_{k−1}, …, Z_1)   (5b)
  ≤ h(Z_1, …, Z_p) + Σ_{k=p+1}^{n} h(Z_k | Z_{k−1}, Z_{k−2}, …, Z_{k−p})   (5c)
  = h(Z*_1, …, Z*_p) + Σ_{k=p+1}^{n} h(Z*_k | Z*_{k−1}, …, Z*_{k−p}).   (5d)
Here inequality (b) is the chain rule for entropy, and inequality (c) follows
from h(A | B,C) ≤ h(A | B). [See standard texts like Ash, 1965, and Gallager,
1968.] Inequality (a) follows from the information inequality, as shown in
Section 3. Thus the pth-order Gauss Markov process Z*_1, Z*_2, …, Z*_n with co-
variances α_0, α_1, …, α_p has higher entropy h(Z*_1, Z*_2, …, Z*_n) than any other
process satisfying the autocorrelation constraints α_0, α_1, …, α_p. Conse-
quently,
Let

φ(x) = (2π)^{−n/2} |K|^{−1/2} exp(−x^t K^{−1} x / 2)   (7)

be the n-variate normal probability density with covariance matrix

K = ∫ x x^t f(x) dx.   (8)
0 ≤ D(f‖φ) ≜ ∫ f(x) ln [f(x)/φ(x)] dx = ∫ f ln f − ∫ f ln φ.   (10)
But
∫ f ln φ = ∫ φ ln φ,   (11)

since ln φ(x) is a quadratic form in x and f and φ have the same second moments K. Hence
0 ≤ −h(f) − ∫ f ln φ = −h(f) − ∫ φ ln φ = −h(f) + h(φ),   (12)
and
h(f) ≤ h(φ),   (13)
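As a quick numerical illustration of inequality (13), one can compare the differential entropy of any convenient non-Gaussian density with that of the Gaussian having the same variance; the uniform density used below is an arbitrary choice, not part of the paper.

```python
import numpy as np

# h(f) <= h(phi): uniform density on [0, a] versus the Gaussian with the same variance.
a = 2.0
var = a**2 / 12.0                                # variance of the uniform density
h_f = np.log(a)                                  # differential entropy of the uniform: ln a
h_phi = 0.5 * np.log(2 * np.pi * np.e * var)     # Gaussian entropy for the same variance
print(h_f, h_phi, h_f <= h_phi)                  # the Gaussian entropy is larger by 0.5*ln(2*pi*e/12)
```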
where a_0 = 1. And then it can be proved [Choi, 1983] that Σ_{i=0}^{p} a_i α_i is
positive. Thus, we can define

σ² = Σ_{i=0}^{p} a_i α_i,   (15)
X_n = −Σ_{i=1}^{p} a_i X_{n−i} + Z_n,   (16)

α_ℓ = −Σ_{j=1}^{p} a_j α_{ℓ−j},   ℓ ≥ p+1.   (17)
S(λ) = (1/2π) Σ_{ℓ=−∞}^{∞} α_ℓ e^{−iλℓ} = (σ²/2π) · 1 / | Σ_{j=0}^{p} a_j e^{iλj} |².   (18)
h = lim_{n→∞} (1/n) [ h(X_1, …, X_p) + Σ_{ℓ=p+1}^{n} h(X_ℓ | X_{ℓ−1}, …, X_{ℓ−p}) ] = ½ ln(2πeσ²).   (19)
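The chain from given autocorrelations to the maximum entropy spectrum and entropy rate can be sketched numerically as follows; the example lag values are assumptions chosen only to give a positive definite Toeplitz matrix.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

# From autocovariances alpha_0..alpha_p, get AR coefficients a_j (a_0 = 1), sigma^2 of Eq. (15),
# the spectral density of Eq. (18), and the entropy rate of Eq. (19).
acov = np.array([1.0, 0.6, 0.2])                 # alpha_0, alpha_1, alpha_2 (assumed example values)
p = len(acov) - 1

# Yule-Walker: sum_{j=1..p} a_j alpha_{k-j} = -alpha_k for k = 1..p.
a_tail = solve_toeplitz(acov[:p], -acov[1:p + 1])
a = np.concatenate(([1.0], a_tail))              # (a_0, a_1, ..., a_p)

sigma2 = float(np.dot(a, acov))                  # Eq. (15)

lam = np.linspace(-np.pi, np.pi, 1001)
denom = np.abs(np.exp(1j * np.outer(lam, np.arange(p + 1))) @ a) ** 2
S = sigma2 / (2 * np.pi * denom)                 # Eq. (18): the maximum entropy spectral density

h = 0.5 * np.log(2 * np.pi * np.e * sigma2)      # Eq. (19): entropy rate of the Gauss Markov process
print(sigma2, h, S.max())
```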
5. History
where

S(λ) = (1/2π) Σ_{ℓ=−∞}^{∞} α(ℓ) e^{−iλℓ}.   (21)
Proof that the pth order Gaussian autoregressive process spectral den-
sity is the maximum entropy spectral density has been established by varia-
tional methods by Smylie, Clarke, and Ulrych [1973, pp. 402-419], using the
Lagrange multiplier method, and independently by Edward and Fitelson
[1973]. Burg [1975], Ulrych and Bishop [1975], Haykin and Kesler [1979,
pp. 16-21], and Robinson [1982] follow Smylie's method. Ulrych and Ooe
[1979] and McDonough [1979] use Edward's method. See also Grandell et
al. [1980].
The calculus of variations necessary to show that

S*(λ) = (σ²/2π) · 1 / | Σ_{j=0}^{p} a_j e^{iλj} |²   (23)

is the solution to Eq. (20) is tricky. Smylie et al. [1973] show that the
first variation about S*(λ) is zero. Further considerations establish S* as a
maximum.
Van den Bos [1971] maximizes the entropy h(X_1, X_2, …, X_{p+1}) subject to
the constraints (22) by differential calculus, but further argument is
required to extend his solution to the maximization of h(X_1, …, X_n), n > p+1.
Feder and Weinstein [1984] have carried this out.
Akaike [1977] maximizes another form of the entropy rate h, that is,
Using prediction theory, one can show that Var(q) has its maximum if
(26)
where a 1,a 2, ••• ,ap are given in Eq. (14). For details, see Priestley [1981,
pp. 604-606].
More details of proofs in this section can be found in Choi [1983].
With hindsight, we see that all of the maximization can be captured in
the information theoretic string of inequalities in Eq. (5) of Theorem 1, and
that the global maximality of S*(λ) follows automatically from verifying
that S*(λ) is the spectrum of the process specified by the theorem.
6. Conclusions
A bare bones summary of the proof is that the entropy of a finite seg-
ment of a stochastic process is bounded above by the entropy of a segment
of a Gaussian random process with the same covariance structure. This
entropy is in turn bounded above by the entropy of the minimal order Gauss
Markov process satisfying the given covariance constraints. Such a process
exists and has a convenient characterization by means of the Yule-Walker
equations. Thus the maximum entropy stochastic process is obtained.
We mention that the maximum entropy spectrum actually arises as the
answer to a certain "physical" question. Suppose X_1, X_2, … are independent
identically distributed uniform random variables. Suppose also that the fol-
lowing empirical covariance constraints are observed:
7. Acknowledgments
8. References
Feder, M., and E. Weinstein (1984), 'On the finite maximum entropy extrap-
olation,' Proc. IEEE 72, pp. 1660-1662.
Grandell, J., M. Hamrud, and P. Toll (1980), 'A remark on the correspond-
ence between the maximum entropy method and the autoregressive
model,' IEEE Trans. Inf. Theory IT-26, pp. 750-751.
Ulrych, T., and T. Bishop (1975), 'Maximum entropy spectral analysis and
autoregressive decomposition,' Rev. Geophys. and Space Phys. 13, pp.
183-200; reprinted in D. G. Childers, ed. (1978), Modern Spectrum
Analysis, IEEE Press, pp. 54-71.
ROBUST LOCAL FACET ESTIMATION

Robert M. Haralick
1. Introduction
The facet model for image processing takes the observed pixel values to
be a noisy discretized sampling of an underlying gray tone intensity surface
that in each neighborhood of the image is simple. To process the image
requires the estimation of this simple underlying gray tone intensity surface
in each neighborhood of the image. Prewitt (1970), Haralick and Watson
(1981), and Haralick (1980, 1982, 1983, 1984) all use a least squares estima-
tion procedure. In this note we discuss a Bayesian approach to this estima-
tion problem. The method makes full use of prior probabilities. In addition,
it is robust in the sense that it is less sensitive to small numbers of pixel
values that might deviate highly from the character of the other pixels in
the neighborhood.
Two probability distributions define the model. The first distribution
specifies the conditional probability density of observing a pixel value, given
the true underlying gray tone intensity surface. The second distribution
specifies the conditional probability density of observing a neighborhood
having a given underlying gray tone intensity surface.
To motivate the representations we choose, and to help make clear
what underlying gray tone intensity surface means, consider the following
thought experiment. Suppose we have a noiseless image that is digitized to
some arbitrary precision. Suppose, for the moment, we take simple underly-
ing gray tone intensity surface to mean a constant surface in each neighbor-
hood. Now begin moving a fixed and reasonable sized neighborhood window
through the image. Most neighborhoods (probably all of them) will not have
constant values. Many would be constant except for illumination shading or
texture effects; those neighborhoods are nearly constant. Some have an
object edge passing through; these are not constant.
The nearly constant neighborhoods can be thought of as having arisen
from small perturbations of a constant neighborhood. The perturbation is
due, not to sensor noise, but to the difference between the idealization of
the model (perfectly constant neighborhoods) and the observed perfect real-
ity. In this case, we take the underlying gray tone intensity surface to be a
constant, the value of which is representative of the values in the observed
nearly constant neighborhood.
What does it mean to determine a value that is representative of the
values in the neighborhood? Does it mean an equally weighted average, for
example? To answer this question, fix attention on the center pixel of the
neighborhood. We expect that the neighbors of the center pixel would have
a value close to the value of the center pixel. The neighbors of these
neighbors, the second neighbors, would have values that could deviate more
from the value of the center pixel than the first neighbors. This expecta-
tion-that the closer a pixel is to the center pixel, the less the deviation is
likely to be from the center pixel-should find a way to be incorporated into
the model explicitly. Under these conditions, the representative gray tone
intensity of the underlying gray tone intensity surface in the neighborhood
can be estimated as an unequally weighted average of the pixel values in the
neighborhood, those pixels farther away from the center pixel getting less
weight.
We have neglected the neighborhoods having an edge or a line passing
through them. These neighborhoods do not satisfy the spirit of a model that
is 'constant in each neighborhood.' This suggests that we need to be exam-
ining models in which the spatial distribution of gray tones in the neighbor-
hood is more complex than constant. An appropriate model, for example,
may be one in which the ideal gray tone intensity surface is a low order
polynomial of row and column positions.
Now suppose that our model is that the underlying gray tone intensity
surface in each neighborhood is a bivariate cubic polynomial. Again take
our hypothetical noiseless perfect image and pass a neighborhood window
through it. As before, there probably will be no neighborhoods that fit a
cubic precisely, but this time most neighborhoods will nearly or almost
nearly fit. The cubic model can represent constants, slopes, edges, and
lines.
Fix attention on one of the neighborhoods. Suppose it is mostly con-
stant, especially near its center, with a small portion of a line or edge at its
boundary. Instead of thinking of the polynomial underlying gray tone inten-
sity surface as representative, in the sense of fitting, of the entire neigh-
borhood, think of it as containing the values of all partial derivatives of
order 3 or less evaluated at the center pixel. Since the area around the
center pixel is nearly constant, we should expect all partial derivatives of
order 1 to order 3 to be small or zero, despite some significant disturbance
at the boundary of the neighborhood and despite the fact that a least
squares fit of the pixel values in the neighborhood would certainly not pro-
duce near-zero partial derivatives.
At this point we begin to uncover a few concepts about which deeper
understanding is needed. The first is the difference between estimating the
derivatives at the center pixel of a neighborhood and least squares fitting an
entire neighborhood. The second is the notion of neighborhood size. The
larger the neighborhood, the more different things are likely to happen near
and around its boundary and the more we will want to ignore the values
around the boundary in estimating the partial derivatives at the neighbor-
hood center. At the same time, should the pixel values near and around the
boundary of the neighborhood fit in with the spatial distribution at the
center of the neighborhood, we would definitely want to have the estimation
procedure utilize these neighborhood boundary pixels in a supportive way.
The conclusion we can draw from this perspective is that we can expect
the underlying polynomial gray tone intensity surface to be more represen-
tative of what is happening at the center of the neighborhood than at its
periphery. That is, the observed values at the periphery of the neighbor-
hood are likely to deviate more from the corresponding values of the under-
lying gray tone intensity surface than are the observed values at the center
of the neighborhood. Furthermore, we need to pay careful attention to the
similarity or dissimilarity of pixels at the periphery of the neighborhood so
that their values can be used in a supportive way.
2. The Model
Let f_1(r,c), …, f_N(r,c) be a set of basis functions defined over a neighborhood, and let J(r,c) represent the observed gray tone values in the
neighborhood. At each pixel (r,c) the squared deviation between the repre-
sentative underlying gray tone intensity surface and the observed image is
given by [ Σ_{i+j≤3} a_ij r^i c^j − J(r,c) ]². The expected value of this squared
P(a) = h[ (a − a₀)' Σ_a^{−1} (a − a₀) ].   (3)

h'[ (J − Fa)' Σ_J^{−1} (J − Fa) ] / h[ (J − Fa)' Σ_J^{−1} (J − Fa) ]   (7)
Hence,

−2 h'(μ)/h(μ) = −2 · [A e^{−μ/2} (−1/2)] / [A e^{−μ/2}] = 1.   (8)
(9)
or
(10)
The slash distribution has the form

s(x) = (1 − e^{−x²/2}) / (√(2π) x²),   (11)

or, with μ = x²,

s(μ) = (1 − e^{−μ/2}) / (√(2π) μ),   μ ≥ 0.   (12)
Thus the ratio −s'(μ)/s(μ) is a function that is always positive, having largest magnitude for small μ and a
monotonically decreasing magnitude for larger μ.
The Cauchy distribution has the form

c(x) = 1 / [π(1 + x²)],   (14)

or, with μ = x²,

c(μ) = 1 / [π(1 + μ)],   μ ≥ 0.   (15)

Thus,

−c'(μ)/c(μ) = 1 / (1 + μ),   (16)

a function that is always positive, having largest magnitude for small μ and a
monotonically decreasing magnitude for larger μ.
On the basis of the behavior of h'/h for slash and Cauchy distributions,
we can discuss the meanings of h'/h in Eq. (5). Simply, if the fit Fa to J is
relatively good compared to our prior uncertainty about a, then the esti-
mated a is determined mostly by the least squares fit and hardly at all by
the prior information we have about a. If the fit Fa to J is comparable in
uncertainty to our prior uncertainty about a, then the estimated a is deter-
mined in equal measure by the least squares fit and by the prior information.
If the fit Fa to J has more error than our prior uncertainty about a, then
the estimated a is determined more by the prior information than by the fit.
To see how this works more precisely, let
Λ_J(a) = h'[ (J − Fa)' Σ_J^{−1} (J − Fa) ] / h[ (J − Fa)' Σ_J^{−1} (J − Fa) ]   (17)
and
Λ_a(a) = h'[ (a − a₀)' Σ_a^{−1} (a − a₀) ] / h[ (a − a₀)' Σ_a^{−1} (a − a₀) ].   (18)
= (19)
We can solve this equation iteratively. Let a^(n) be the value of the estimated
a at the nth iteration. Take the initial a^(1) to satisfy Eq. (10). Suppose
a^(n) has been determined. Substitute a^(n) into Eqs. (17) and (18) to obtain
Λ_J(a^(n)) and Λ_a(a^(n)). Then substitute these values for Λ_J(a^(n)) and Λ_a(a^(n))
into Eq. (19) to determine a^(n+1).
P(J | a) = Π_{(r,c)} h[ ( (J(r,c) − Σ_{n=1}^{N} a_n f_n(r,c)) / σ_J(r,c) )² ]   (20)
and
P(a) = Π_{n=1}^{N} p_n(a_n | a_n0) = Π_{n=1}^{N} h[ ( (a_n − a_n0) / σ_an )² ],   (21)

where a' = (a_1, …, a_N) and a₀' = (a_10, …, a_N0).
Proceeding as before, we obtain that the maximizing a must satisfy
where Σ_J, Σ_a, Λ_J, and Λ_a are diagonal matrices,

Σ_J = diag( σ_J(r,c)² ),   (23)
Σ_a = diag( σ_an² ),   (24)
Λ_J = diag( Λ_J(r,c) ),   (25)
Λ_a = diag( Λ_a(n) ),   (26)
Λ_J(r,c) = h'[ ( (J(r,c) − Σ_{n=1}^{N} a_n f_n(r,c)) / σ_J(r,c) )² ] / h[ ( (J(r,c) − Σ_{n=1}^{N} a_n f_n(r,c)) / σ_J(r,c) )² ]   (27)

and

Λ_a(n) = h'[ ( (a_n − a_n0) / σ_an )² ] / h[ ( (a_n − a_n0) / σ_an )² ].   (28)
The solution for a can be obtained iteratively. Take the first Λ_J and Λ_a
to be the corresponding identity matrices. Solve Eq. (22) for a. Then sub-
stitute into Eqs. (27) and (28) for the next Λ_J and Λ_a.
Because the solution for a is iterative, it is not necessary to take the
required NK(1+N+K) + 2N + N 3 operations to solve Eq. (22) exactly. (The
vector J is Kx1 and the vector a is Nx1.) There is a quicker computation
procedure. Suppose the basis that is the columns of F satisfies
F' Σ_J^{−1} F = I.   (29)

This means that the basis vectors are discretely orthonormal with respect to
the weights that are the diagonal entries of the diagonal matrix Σ_J^{−1}. In
this case, Eq. (22) holds if and only if
F' Λ_J Σ_J^{−1} F a + Λ_a Σ_a^{−1} a = F' Λ_J Σ_J^{−1} J + Λ_a Σ_a^{−1} a₀.   (30)
This equation suggests the following iterative procedure for the determina-
tion of a:
Take

a^(n+1) = (I + Λ_a Σ_a^{−1})^{−1} { F' Σ_J^{−1} [ Λ_J J + (I − Λ_J) F a^(n) ] + Λ_a Σ_a^{−1} a₀ }.   (33)
Each iteration of Eq. (33) requires 3KN +4K+3N operations, and only two to
four iterations are necessary to get a reasonably close answer.
4. Robustness
If [ J(r,c) − Σ_{n=1}^{N} a_n f_n(r,c) ]² / σ_J(r,c)² is large
and −h'/h is small for large arguments, Λ_J(r,c) will be small. To understand
the effect of a small Λ_J(r,c), examine Eq. (33). On the right-hand side of
that equation is the expression Λ_J J + (I − Λ_J)Fa, which consists of a general-
ized convex combination of J, a term depending on the observed data, and
Fa, a term depending on the fit to the data. In those components where
Λ_J(r,c) is small, the generalized convex combination tends to ignore J(r,c)
and, in effect, to substitute for it the fit Σ_{n=1}^{N} a_n f_n(r,c). Thus small values
of Λ_J(r,c) substitute the fitted values for the observed values. Values of
the weight Λ_J(r,c) close to 1 tend to make the procedure ignore the fitted
values and use only the observed values.
The technique is inherently robust. Any observed value that deviates
greatly from the fitted value is in a sense ignored and replaced with a fitted
value interpolated on the basis of the other pixel values.
To do the Λ_J computation, a density function h is required. As we have
seen, a normal distributional assumption leads to each Λ_J being the same
constant. Distributional assumptions such as slash or Cauchy lead
to Λ_J being some monotonically decreasing function of the squared differ-
ence between the observed and fitted values. The monotonically decreasing
function depends on the distributional assumption being made.
One way to avoid the distributional assumption is to use a form of Λ_J that
has proved to work well over several different kinds of distributions. One
such form is Tukey's bisquare function, used in computing the biweight:
Λ_J(r,c) = (1 − u)²   if u ≤ 1;   0   otherwise,   (34)
where
u = [ J(r,c) − Σ_{n=1}^{N} a_n f_n(r,c) ]² / [ C σ_J(r,c) ]²,   (35)
and C is a constant with value between 6 and 9. In this case, the estimated
coefficients al, ••• ,aN are generalizations of the biweight, and the computa-
tional procedure discussed in section 2.1 corresponds to Tukey's iterative
reweighted least squares regression procedure [Mosteller and Tukey, 1977].
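A minimal sketch of the resulting procedure is given below: a bivariate cubic facet is fitted by iteratively reweighted least squares with the bisquare weight of Eqs. (34)-(35). The neighborhood size, the scalar value used for σ_J, the choice C = 9, and the fixed number of iterations are assumptions made for illustration, not prescriptions from the text.

```python
import numpy as np

def robust_facet_fit(J, sigma_J=1.0, C=9.0, iters=4):
    """Iteratively reweighted bivariate-cubic facet fit with bisquare weights (a sketch)."""
    half = J.shape[0] // 2
    r, c = np.meshgrid(np.arange(-half, half + 1), np.arange(-half, half + 1), indexing="ij")
    # Basis f_n(r, c): the ten monomials r^i c^j with i + j <= 3.
    F = np.column_stack([(r**i * c**j).ravel() for i in range(4) for j in range(4 - i)])
    y = J.ravel().astype(float)
    w = np.ones_like(y)                                  # unit weights: ordinary least squares first
    for _ in range(iters):
        a, *_ = np.linalg.lstsq(np.sqrt(w)[:, None] * F, np.sqrt(w) * y, rcond=None)
        u = (y - F @ a) ** 2 / (C * sigma_J) ** 2        # Eq. (35)
        w = np.where(u <= 1.0, (1.0 - u) ** 2, 0.0)      # Eq. (34): bisquare weight
    return a, w

# A nearly constant 5x5 neighborhood with one strongly deviating pixel on its boundary.
J = np.full((5, 5), 10.0)
J[0, 4] = 40.0
a, w = robust_facet_fit(J)
print(a[0], w.reshape(5, 5)[0, 4])    # constant term stays near 10; the outlying pixel gets weight ~0
```

In this toy case the deviating boundary pixel ends up with essentially zero weight, so it is in effect replaced by the fit, which is the behavior described in section 4.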
5. References
Haralick, R. M. (1980), "Edge and region analysis for digital image data,"
Comput. Graphics and Image Processing 12, pp. 113-129.
Haralick, R. M., and Layne Watson (1981), "A facet model for image data,"
Comput. Graphics and Image Processing 15, pp. 113-129.
Mosteller, Frederick, and John Tukey (1977), Data Analysis and Regression,
Addison-Wesley, Reading, Mass., pp. 356-358.
THE PROBLEM OF MISSING DATA

William I. Newman
The maximum entropy method is reviewed and adapted to treat the problems
posed by noisy and by missing data. Examples of the use of the extended
maximum entropy method are presented.
1. Introduction
S(ν) = 0,   |ν| > ν_N,   (1)
S(ν) = Δt Σ_{n=−∞}^{∞} ρ_n exp(2πinνΔt)   for |ν| ≤ ν_N,   and   S(ν) = 0   for |ν| > ν_N,   (3)
where we define the time interval Δt according to
(4)
We observe that we can write an expression of the form (3) if and only if
Δt ≤ (2ν_N)^{−1}. Since the coefficients ρ_n have yet to be determined, we in-
troduce this latter expression into Eq. (2) for the correlation function and
obtain the so-called Whittaker interpolation formula, that is,
ρ(t) = Σ_{n=−∞}^{∞} ρ_n · sin[2πν_N(nΔt − t)] / [2πν_N(nΔt − t)].   (5)
Owing to the presence of the sinc function [the sin(x)/x term] in this
expression, we observe that, when t is an integral multiple of the time inter-
val Δt, we can make the identification

ρ_n ≡ ρ(nΔt).   (6)
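A small sketch of Eqs. (5)-(6): the sampled lags are sinc-interpolated to an off-sample time. The band limit and the test correlation function are assumptions for illustration (the test function is only approximately band limited, so the reconstruction is approximate).

```python
import numpy as np

nu_N = 0.5                                   # assumed band limit
dt = 1.0 / (2.0 * nu_N)                      # time interval in the sense dt = (2 nu_N)^(-1)
n = np.arange(-50, 51)
rho_n = np.exp(-0.1 * np.abs(n * dt)) * np.cos(2 * np.pi * 0.2 * n * dt)   # assumed sample lags

t = 3.3                                      # an off-sample time
kernel = np.sinc(2.0 * nu_N * (n * dt - t))  # np.sinc(x) = sin(pi x)/(pi x), i.e. the kernel of Eq. (5)
rho_t = float(np.sum(rho_n * kernel))        # Whittaker interpolation of Eq. (5)
print(rho_t)
```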
For example, if P(x) described a Gaussian process with zero mean and vari-
ance σ², the entropy for the process would be H = ln[2πeσ²]^{1/2}. Suppose
that the correlation function and the spectrum are the outcome of a sta-
tionary, Gaussian process. It is convenient to think of this process as the
outcome of an ensemble of independent sinusoidal oscillators, each at a dif-
ferent frequency. The variance σ² of the process at a given frequency ν
can be associated with the spectrum S(ν), and in loose terms we can think
of ln[2πeS(ν)]^{1/2} as a measure of the entropy associated with the sinusoi-
dal oscillator at that frequency [Papoulis, 1981]. The entropy for the proc-
ess can be shown to be [Bartlett, 1955]
H = (4ν_N)^{−1} ∫_{−ν_N}^{ν_N} ln[S(ν)] dν + ½ ln[2πe].   (8)
We therefore maximize the entropy

H = (4ν_N)^{−1} ∫_{−ν_N}^{ν_N} ln[S(ν)] dν,   (9)
where we have dropped the additive constant term contained in the second
of Eqs. (8), subject to the constraints imposed by the correlation function
lag data
ρ_n = ∫_{−ν_N}^{ν_N} S(ν) exp(−2πinνΔt) dν,   n = −N, …, 0, …, N,   (10)
where the ρ_n, n = −N, …, 0, …, N, are known. We vary the unmeasured ρ_n so
that ∂H/∂ρ_n = 0 for |n| > N. In this way, we are implicitly preserving the
measured values of ρ_n. As a consequence, we obtain
This implies that the power spectrum is the reciprocal of a truncated Fourier
series. In other words, the Fourier series representation for S^{−1}(ν) has
coefficients that vanish if the corresponding correlation function lag van-
ishes. Thus, we can write
S(ν) = { Σ_{n=−∞}^{∞} b_n exp(−2πinνΔt) }^{−1},   (13)
L = −Σ_{n=−N}^{N} b_n { ∫_{−ν_N}^{ν_N} S(ν) exp(−2πinνΔt) dν − ρ_n } / 4ν_N.   (14)
(15)
δ(H + L) = 0,   (16)

and, at the same time, require that each term in braces in Eq. (14) vanish.
Therefore, we obtain, as before,
S(ν) = { Σ_{n=−N}^{N} b_n exp(−2πinνΔt) }^{−1}.   (17)

Suppose now that the measured correlation lags are contaminated by noise, and require only that

Σ_{n=−N}^{N} W_n | ∫_{−ν_N}^{ν_N} S(ν) exp(−2πinνΔt) dν − ρ_n |² ≤ σ².   (18)
(In this approach, we also may consider cases where the noise between the
different lags is correlated.) In this case, the constraint (18) can be
adapted to a form appropriate to a Lagrangian function, namely,
Again, we perform a variation in the power spectrum S(v) so that Eq. (16)
holds, and we seek an extremum in the entropy, with L vanishing. Newman
[1977] gives a more complete description of this problem. We shall show
that this approach can accommodate unmeasured data as well.
S(ν) = { Σ_{n=−N}^{N} b_n exp(−2πinνΔt) }^{−1} ≥ 0.   (20)
[ ρ_0     ρ_1     ρ_2   …   ρ_N     ] [ γ_0 ]   [ 1/γ_0* ]
[ ρ_{−1}  ρ_0     ρ_1   …   ρ_{N−1} ] [ γ_1 ]   [   0    ]
[ ρ_{−2}  ρ_{−1}  ρ_0   …           ] [ γ_2 ] = [   0    ]   (23)
[   ⋮                        ⋮      ] [  ⋮  ]   [   ⋮    ]
[ ρ_{−N}                …   ρ_0     ] [ γ_N ]   [   0    ]
Van den Bos [1971] observed that the γ_n coefficients, when properly nor-
malized, corresponded directly to the N+1 point noise prediction filter or
prediction error filter obtained when autoregressive methods are used to
estimate the power spectrum. This establishes the correspondence between
the spectrum estimated by the MEM and by prediction or autoregressive
methods.
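The correspondence can be checked directly: solving the Toeplitz system (23) and normalizing the solution yields the prediction error filter and error power of the equivalent autoregressive model. The correlation lags below (those of a first-order autoregression) are assumed example values.

```python
import numpy as np
from scipy.linalg import toeplitz

rho = 0.7 ** np.arange(4)                    # rho_0, ..., rho_N for an assumed AR(1)-like sequence
T = toeplitz(rho)                            # the matrix of Eq. (23) (real, symmetric case)
e1 = np.zeros(len(rho)); e1[0] = 1.0
u = np.linalg.solve(T, e1)                   # T u = (1, 0, ..., 0)'
pef = u / u[0]                               # normalized prediction error filter (1, a_1, ..., a_N)
P_err = 1.0 / u[0]                           # corresponding prediction error power
gamma = pef / np.sqrt(P_err)                 # the gamma_n of Eq. (23): T gamma = (1/gamma_0) e1
print(pef, P_err)                            # approximately (1, -0.7, 0, 0) and 1 - 0.7**2
```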
(25)
This expression implies that the noise effects in different correlation func-
tion lags are uncorrelated and that the measured lags are confined within a
hyperellipsoid centered around the true correlation function lags. It can be
shown, in the limit that σ → 0 and M = 0, that maximizing the entropy (9)
subject to Eq. (25) reproduces the standard maximum entropy result for the
spectrum.
We can be assured that a solution to the problem of maximizing the
entropy subject to the latter constraint exists if the weights and confidence
thresholds are properly chosen and if the instrumentation that provides the
δ(H + L) = 0.   (27)
where
The first of Eqs. (29) corresponds to the unmeasured lag problem described
in the implicit constraint approach [Newman, 1979]; the second of Eqs. (29)
corresponds to the noisy lag problem described in the explicit constraint
approach [Newman, 1977]. The proportionality factor that emerges in the
latter is an algebraic combination of σ and of the weight factors W_n.
Σ_{n=0}^{N} γ_n { ρ̄_{n−k} + σ (b_{n−k} / W_{n−k}) [ Σ_{m=−N}^{N} b_m b_m* / W_m ]^{−1/2} } = [γ_0*]^{−1} δ_{k,0},   k = 0, 1, …, N,   (30)

where

ρ_n ≡ ρ̄_n + σ (b_n / W_n) [ Σ_{m=−N}^{N} b_m b_m* / W_m ]^{−1/2} = ∫_{−ν_N}^{ν_N} S(ν) exp(−2πinνΔt) dν,   n = 0, 1, …, N,   (31)
Σ_{n=0}^{N} γ_n ρ_{n−k} = [γ_0*]^{−1} δ_{k,0},   k = 0, 1, …, N,   (32)
a form that exactly parallels the usual form, that is, Eq. (23). In some
sense, we can regard the derived correlation function ρ_n as a form of maxi-
mum entropy method "fit" in the presence of contaminating noise [for n =
p_ℓ, ℓ = 0, ±1, ±2, …, ±(N−M)], as a maximum entropy method interpolant for
missing intermediate lags [for n = q_k, k = ±1, ±2, …, ±M], and as a maximum
entropy method extrapolation for the unmeasured lags [n > N and n < −N].
We wish to develop a practical computational procedure for solving this
problem. To do this, we perform a set of iterations that increase the
entropy consistent with the constraint at each step. Since the maximum
entropy state or solution is unique, we can invoke the uniqueness of the
solution and recognize that with each iteration we are 'closer' to the
desired result. Each iteration consists of two sets of steps:
• 'Fill holes.' We vary the ρ_{q_k}, for k = ±1, ±2, …, ±M, employing a multi-
dimensional Newton method. Newman [1979] showed that this step
was absolutely convergent and was asymptotically quadratic.
• 'Correct for noise.' We vary the γ_n's, employing an iterative relaxa-
tion or invariant imbedding procedure. Newman [1977] showed that
this step generally converges, although special care is sometimes needed
in difficult problems.
In each iteration, these two steps are performed in succession. We stop the
iterations once self-consistency is achieved. When the initial estimates of
the correlation function are not Toeplitz, we employ an artificially elevated
value of ρ_0 until convergence of the above iterative steps emerges. Then
ρ_0 is relaxed and the steps are repeated until convergence is obtained. In
especially difficult cases, σ is also varied in a relaxation scheme until con-
vergence occurs. When variation of ρ_0 and σ in this manner does not permit
their relaxation to their known values, we can be assured that no solution
exists, owing to instrument failure or faulty weight or variance estimates.
Applications of this extended maximum entropy technique abound, par-
ticularly in problems associated with radar and sonar. This approach avoids
line splitting or broadening and, because it is consistent with all available
information, it provides an unbiased estimate of the solution. As an illustra-
tion, we consider the ionospheric backscatter studies performed at Arecibo
Observatory, which sample data at rates too high and for times too long to
permit the storage of a time series and the application of the Burg algo-
rithm. Instead, researchers there have constructed a digital correlator that
produces single-bit lag products in 1008 correlator registers or channels.
The correlation function estimates that emerge are unbiased but are con-
taminated by the effect of single-bit digitization and often suffer from
improperly functioning channels (usually providing no direct indication of
which channels are providing misleading information). As a simple represen-
tation for the kind of data provided by this device, we construct a time
series x_n and an approximate correlation function ρ_n defined by

ρ_n = K^{−1} Σ_{m=1}^{K} x_{m+n} x_m*.   (33)
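The lag-product estimate of Eq. (33) can be sketched as follows for a synthetic complex series; the signal model and the values of K and N are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 1000, 6
m = np.arange(K + N)
x = np.exp(2j * np.pi * 0.17 * m) + 0.5 * (rng.standard_normal(K + N) + 1j * rng.standard_normal(K + N))

# Eq. (33): rho_n = K^{-1} sum_{m=1}^{K} x_{m+n} x_m^*
rho = np.array([np.sum(x[n:n + K] * np.conj(x[:K])) / K for n in range(N + 1)])
print(np.round(rho, 3))
```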
[Figure plots: spectral power versus frequency (−0.50 to 0.50), panels (a)-(d).]
features. There we observe that the peaks for the noise-corrected MEM
spectra are not as tall as those of the standard MEM, but not significantly
broader. The error in the noise-corrected MEM estimate of the peak fre-
quencies is only 0.2% of the bandwidth. The overcorrected spectrum has
very poor resolution. This gives rise to the ad hoc rule that, as the entropy
(a monotonically nondecreasing function of σ) increases, the associated in-
formation content of the data decreases and the resolution of the spectrum
is reduced. The relative amplitudes of the two peaks and the relative areas
under them are not correctly estimated by the standard MEM. (We note
that, if σ were larger, the correlation function estimates would not be
Toeplitz and the standard MEM could not be used.) The noise-corrected
spectrum rectifies this situation, while the overcorrected spectrum inverts
the relative heights of the peaks obtained by the application of the standard
MEM.
In Fig. 3 we consider the case of noisy measured correlation data as
well as missing intermediate lags, namely M = 3, q_1 = 3, q_2 = 6, and q_3 = 7,
as in Newman [1981]. The computational procedure described above, where
we iteratively "fill holes" and "correct for noise," was employed, including
the variation of trial values for ρ_0. The four curves represent—
Figure 3. Comparison of spectral techniques for noisy lags with some un-
measured lags. (a) Full passband. (b) Limited passband.
Corrected MEM spectrum (--). Overcorrected MEM spectrum (- - -).
Truncated correlation-function-derived spectrum with zeroes in missing lags
(-0-0-) and with MEM estimates in missing lags (-.-.-).
peak frequencies (they would ultimately merge as σ was increased), and in-
troduced a bias into the relative amplitude of the peaks. Added insight into
the influence of contaminating noise and missing data may be inferred from
our previous case studies, where we considered these effects individually.
Finally, in comparing the above extensions of the MEM with other tech-
niques of spectral analysis, it is useful to think of the MEM estimate of 5(\1)
as the Fourier transform of the derived correlation function. The derived
correlation function provides no information that is not already included in
the given correlation function estimate. Indeed, the derived correlation
function fits the imprecisely measured lags, provides an interpolatory esti-
mate of the missing data, provides a stable extrapolation for the correlation
function in the time domain, and provides a power spectral density that nei-
ther loses nor assumes any added information. However, in accomplishing
these ends we have inadvertently introduced a firm constraint into the prob-
lem: that the derived correlation function lies on a hyperellipsoid instead of
being characterized by some distribution function. We discuss this problem
next.
Recall that the MEM yields a spectral estimate that is 'closest' in some
sense to the greatest concentration of solutions and that this estimate is the
'statistically most probable solution' for the power spectrum. From the
asymptotic equipartition theorem, we have that e^{−H} is the probability of the
process x, where H = H(x). A nonrigorous way of seeing how this result
emerges is to recall [Eq. (7)] that H ≡ −⟨ln P⟩. In an ad hoc way, we can
regard the probability of obtaining the derived correlation function ρ_n, for
n = 0, 1, 2, …, N, as being approximately exp[−H(ρ)]. We call this probability
f(ρ) and regard the ρ_n as the 'true' correlation function.
At the same time, we have measurements of the correlation function,
namely ρ̄_n, for n = 0, 1, 2, …, N. We wish to obtain the Bayesian probability
that we have measured the ρ̄_n, given the 'true' correlation function ρ_n,
and we denote this probability as f_B(ρ̄ | ρ). The desired result is given by
the Wishart distribution, which we can approximate by

(34)

We now define f(ρ̄) as the probability that a correlation function ρ̄_n, for n =
0, 1, …, N, is measured. Although we have no direct way of calculating this
probability, knowledge of the precise nature of f(ρ̄) is not essential to our
purpose. Further application of Bayesian arguments yields, therefore,

f_B(ρ | ρ̄) = f(ρ, ρ̄) / f(ρ̄) = f_B(ρ̄ | ρ) f(ρ) / f(ρ̄).   (37)
to the cross entropy problem to show that the constrained cross entropy
calculation is a useful approximation to the maximum likelihood estimate of
the power spectrum.
Further, we assume that the noise at time interval i can be calculated from
a linear combination of earlier signals. In this way, we define a forward
(causal) filter with M points r_j^(M), for j = 0, 1, …, M−1, which we refer to as
an M-point noise prediction filter or prediction error filter. From Eq. (39),
it is appropriate to normalize r_0^(M) = 1. We assume, therefore, that our
estimate for the noise obtained using the forward filter, namely n_i^(f), can be
expressed as
Similarly, Burg [1975] showed that we can estimate the noise by using the
same filter in the backward direction. In analogy to Eq. (40), we write
n_i^(b) = Σ_{j=0}^{M−1} x_{i+j} r_j^(M),   i = M, …, N+1−M.   (41)
(42)
for i ≠ j.   (43)
In a data set with finite numbers of samples, Burg [1968] proposed that the
expectation values employed by Wiener be replaced by arithmetic averages.
Moreover, as we have a choice of whether to employ ni(b) or ni(f), Burg
proposed that we use both, to minimize the effect of running off the 'end'
of our data. As a practical measure following Burg, we replace Eq. (42)
with
P_M = [2(N+1−M)]^{−1} [ Σ_{i=M}^{N} n_i^(f) n_i^(f)* + Σ_{i=1}^{N+1−M} n_i^(b) n_i^(b)* ].   (44)
This measure of the noise power is positive-definite and has no built-in bias
toward the beginning or end of the data set. Minimization of this latter
estimate of the noise power proceeds using the Burg-Levinson recursion
scheme [Burg, 1968]. This scheme produces a filter that is minimum phase
(that is, whose z-transform has all of its zeroes confined to within the unit
circle) and is equivalent to a correlation function estimate that is Toeplitz.
The Burg-Levinson scheme, however, does not necessarily produce a filter of
minimum phase that at the same time minimizes the noise power in Eq. (44).
Although computationally expedient, the Burg-Levinson procedure sometimes
results in spectral features or lines that are artificially split. Fougere et al.
[1976] developed a scheme that provides a minimum phase filter that min-
imizes the noise power PM and overcomes the problem of line splitting,
albeit at the cost of significantly more exhaustive computations. Our first
objective here is to adapt Burg's approach, as epitomized in Eq. (44), to the
problem of gaps.
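For reference, a compact sketch of the Burg recursion that minimizes the two-sided error power of Eq. (44) is given below (real-valued data; the synthetic autoregressive test signal is an assumption for illustration).

```python
import numpy as np

def burg(x, M):
    """Burg recursion (sketch): prediction error filter (1, a_1, ..., a_M) and error power for real data."""
    x = np.asarray(x, dtype=float)
    a = np.array([1.0])
    P = float(np.mean(x ** 2))                 # zeroth-order error power
    f, b = x[1:].copy(), x[:-1].copy()         # aligned forward and backward errors
    for _ in range(M):
        k = -2.0 * np.dot(f, b) / (np.dot(f, f) + np.dot(b, b))   # reflection coefficient
        a_ext = np.concatenate([a, [0.0]])
        a = a_ext + k * a_ext[::-1]            # Levinson update of the filter
        f, b = f[1:] + k * b[1:], b[:-1] + k * f[:-1]
        P *= (1.0 - k * k)
    return a, P

rng = np.random.default_rng(1)
x = np.zeros(2000)
for n in range(2, 2000):                       # synthetic AR(2) test signal
    x[n] = 1.3 * x[n - 1] - 0.6 * x[n - 2] + rng.standard_normal()
a, P = burg(x[200:], 2)
print(a, P)                                    # a is close to (1, -1.3, 0.6); P is near 1
```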
The simplest approach to the problem can be met by modifying the
range of summation for PM to include only ·meaningful· estimates of ni(f)
and ni(b), that is, including in PM estimates of the noise power coming from
all contiguous segments of data. This approach was first suggested by Burg
[1975]. Although a useful first estimate for the filter coefficients can be
obtained this way, this approach ignores the information present in the con-
tinuity of the data and our knowing the lengths of the gaps. To see how in-
formation can be resident in the gaps, consider the paradigm of measuring
the local gravitational acceleration by obtaining an accurate representation
for the period of a pendulum that oscillates in a vacuum. Suppose we note
the time that the pendulum passes over a well defined point (with an abso-
lute error of ±0.1 sec) along its arc and we initiate a clock at 0 sec. We
observe the first five swings to be at 1.1, 2.1, 3.0, 4.0, and 4.9 sec, respec-
tively, and observe later swings at 7.1, 8.0, 10.1, 12.1, 15.0, 20.0, 25.1,
29.9, 38.1, 50.1, 59.9, 70.1, 85.1, 99.9 sec. From the first five swings, we
conclude that the period of the swing is 0.98 ± 0.08 sec. Although we have
obviously missed the sixth swing, we can readily associate the next three
recorded measurements with the seventh, eighth, and tenth swings. Al-
though we are now missing swings more and more frequently, our absolute
error of measurements translates into a lower and lower relative error in the
frequency of the pendulum. With the last recorded swing of the pendulum,
we know we have witnessed the execution of its hundredth oscillation and
that the period of the pendulum is 0.999 ± 0.001 sec. In time series anal-
ysis, we are often concerned with the problem of identifying the frequencies
that characterize a given phenomenon. The example above shows that
"gaps" in the data can be employed to help reduce the uncertainty in these
frequencies when one extrapolates across the gaps. This opens up the possi-
bility of obtaining spectral resolution characteristic of the overall length of
the time series and also of estimating the missing or unmeasured data.
There is, however,. a limitation to this idea. Suppose in the above
"experiment" that the measurements at 7.1 and 8.0 sec were not available.
Then we could not conclude with absolute certainty that the measurement at
10.1 sec corresponded to the tenth swing. Indeed, without the measure-
ments at 7.1 and 8.0 sec, we could equally well conclude that either 9 or 11
swings had taken place. This uncertainty would then propagate through the
rest of the data set, and we would obtain several distinct estimates for the
frequency of the pendulum. From the above example, we see that spectral
peaks must be separated by approximately (N_max Δt)^{−1}, where N_max is the
length of the longest contiguous set of data, if they are to be resolved.
Assuming that all pairs of spectral features can be resolved, the smallest
uncertainty with which we can identify the frequencies of the spectral
peaks is approximately (NΔt)^{−1}, where NΔt is the time elapsed from the
first to last sample (including intervening gaps). Moreover, we see that, by
using an appropriate sampling strategy, we can sample data less and less
frequently. In particular, we can introduce gaps whose durations are pro-
portional to the time elapsed since the first datum was measured. Hence, it
follows that the effective sampling rate, as time passes, can decrease as
quickly as 1/N_elapsed (where N_elapsed is the number of time intervals Δt
that have passed since the first datum was recorded). Therefore, we need
only obtain the order of ln(N) data to obtain frequency estimates with
accuracy of order approximately (NΔt)^{−1}. Thus, this particular sampling
strategy is an example of time decimation.
These comments on indeterminacy are also appropriate to situations
where the gaps are not organized in some fashion. The variational problem
for the prediction error filter without gaps implicit to Eqs. (42) and (44) has
a unique solution. When gaps are present in practical problems, the order of
the prediction filter should not be taken to be longer than Nmax since the
P_M = [2(N+1−M)]^{−1} [ Σ_{i=M}^{N} | Σ_{j=0}^{M−1} x_{i−j} r_j^(M) |² + Σ_{i=1}^{N+1−M} | Σ_{j=0}^{M−1} x_{i+j} r_j^(M) |² ],   (45)
where r_0^(M) = 1 and where we vary the r_j^(M), for j = 1, …, M−1, and the
unknown Xi in order to minimize PM. As pointed out earlier, the solution is
not necessarily unique since PM will be a concave functional only if suffi-
cient quantities of data have been sampled. As a computational approach to
solving this variational problem, we propose the following procedure, which
reduces PM at each stage:
Repeat steps (1) and (2) until the missing data and the filter coefficients
converge to some values. If PM has a unique minimum, this result will cor-
respond to that minimum. Otherwise, this method will provide one of the
minima, albeit not necessarily the global minimum. Upon reflection, we can
show that filling the gaps in this variational scheme ties the various data
segments together and can improve spectral resolution significantly (partic-
ularly if PM has a unique minimum) since the poles of the filter will migrate
toward the unit circle.
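One way to realize this alternating scheme in practice is sketched below. It is only an illustration of the idea, not the author's algorithm: a least-squares (covariance-method) autoregressive fit stands in for the Burg recursion, and the filter order, gap location, and test signal are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
N, order = 200, 4
t = np.arange(N)
x_true = np.sin(2 * np.pi * 0.05 * t) + 0.1 * rng.standard_normal(N)
gap = np.arange(90, 110)                       # indices of the missing samples
x = x_true.copy()
x[gap] = 0.0                                   # crude initial fill

for _ in range(10):
    # Step 1: fit AR coefficients c (x[n] ~ sum_j c_j x[n-j]) to the currently filled series.
    rows = np.array([x[n - order:n][::-1] for n in range(order, N)])
    c, *_ = np.linalg.lstsq(rows, x[order:], rcond=None)
    pef = np.concatenate(([1.0], -c))          # prediction error filter (1, -c_1, ..., -c_p)

    # Step 2: re-estimate the gap samples by making every prediction error that touches the
    # gap as small as possible (a linear least-squares problem in the missing values).
    windows = [n for n in range(order, N) if np.any(np.isin(n - np.arange(order + 1), gap))]
    A = np.zeros((len(windows), len(gap)))
    rhs = np.zeros(len(windows))
    for i, n in enumerate(windows):
        for j in range(order + 1):
            idx = n - j
            if idx in gap:
                A[i, idx - gap[0]] += pef[j]
            else:
                rhs[i] -= pef[j] * x[idx]
    x[gap], *_ = np.linalg.lstsq(A, rhs, rcond=None)

print(np.max(np.abs(x[gap] - x_true[gap])))    # reconstruction error across the gap stays small
```

Because each of the two steps can only decrease the summed squared prediction error, the alternation converges, which is the behavior the variational argument above leads one to expect.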
The concept of employing the prediction error filter to estimate the
unmeasured time series data was first advanced by Wiggins and Miller
[1972]. However, they made no use of the information resident in the gaps
to improve upon their filter estimate. Fahlman and Ulrych [1982], in an
astronomical application to determining the periodicities of the variable star
1\ Delphini, recognized that they could improve upon their filter estimate by
using the information present in the gap and then use the improved filter
estimate to better estimate the missing datum. Their approach was an iter-
ative one, in contrast to a variational principle, although they recognized its
Figure 4. MEM interpolant for (a) time series with one spectral peak, and
(b) time series with two spectral peaks where second peak is at fourth har-
monic of first peak, (c) where second peak is barely resolved from first
peak, and (d) where second peak is not resolved from first peak. In each
case, the solid horizontal line indicates position of missing data.
-5/64 and -6/64, and Fig. 4(c) illustrates the excellent interpolant ob-
tained. Figure 5(a) is the spectrum obtained for this case. The relative
amplitude of the spectral peaks is in rough agreement with our expectations,
and in excellent agreement with the original, ungapped data. Curiously, the
spectral peaks associated with the interpolant are better resolved than the
spectral peaks associated with the original data. Indeed, we must expect
this (sometimes) desirable bias since the variational principle for PM seeks to
minimize the prediction error and therefore can produce an interpolated
estimate for the gapped data that is 'smoother' than the true data, result-
ing in better separation in the associated spectral peaks. In Figs. 4(d) and
5(b) we consider a case immediately below the resolution threshold where
the spectral peaks are at -5/64 and -6/64. The interpolated time series,
Fig. 4(d), is in excellent agreement with the original data, as is the spec-
trum, Fig. 5(b). However, this spectral estimate is incapable of resolving
the two features and, instead, reproduces a relatively broad feature whose
width is an approximate measure of the spectral peak separation.
[Figure 5: spectral power versus frequency, panels (a) and (b).]
7. Acknowledgments
8. References
SPECTRUM ANALYSIS OF RED NOISE PROCESSES

Paul F. Fougere
1. Introduction
smoothing will not be done. In fact, it will be pointed out that to yield sta-
tistically consistent spectra, 100 periodograms would be needed to match
the low variance of a single maximum entropy spectrum. However, if only
the spectral index is required, fewer will do. In any case, a rational use of
averaging of periodograms is recommended. Greatly superior results will be
produced with averaging than without.
The above-mentioned authors estimate the power spectrum by a method
based on the finite Fourier transform using the fast Fourier transform (FFT)
algorithm. The finite Fourier transform of a set of real numbers is a com-
plex sequence. When this complex sequence has its magnitude squared, the
resulting sequence is called a periodogram, a term coined by Schuster
[1898]. The periodogram is used as an estimator for the power spectrum; it
has the disturbing property that the variance of the estimates increases as
the number of original time series observations increases. Statistically, the
periodogram provides what is called an 'inconsistent' estimate of the power
spectrum. A statistical estimator is called inconsistent if in the limit, as the
sample size goes to infinity, the variance does not approach zero. For the
periodogram of a white Gaussian noise process, the variance approaches the
square of the power spectral density. Averaging is required to produce a
consistent estimate of the power spectrum [Jenkins and Watts, 1968]. In an
attempt to further reduce the variance of the periodogram, Welch [1967]
suggests averaging many distinct periodograms obtained from overlapped
data sections. These procedures reduce the variance of the spectral esti-
mates by a factor proportional to the number of sections.
About 15 years ago, at two international meetings, Burg [1981 a,b] in-
troduced his new technique, which is related to the work of Shannon [Shan-
non and Weaver, 1949] and Jaynes [1982] on information theory. (The two
Burg papers, as well as many others on modern spectrum estimation, are
reprinted in the work of Childers [1981]; see also Burg [1975].) The
method, called the maximum entropy method (MEM), has revitalized the sub-
ject of power spectral estimation. Hundreds of papers have appeared on the
subject, but MEM is still not widely accepted and practiced throughout the
geophysical community despite the fact that it has been shown to produce a
smoother spectrum with higher resolution than the FFT-based techniques, in-
cluding the Cooley-Tukey method and the slightly older Blackman-Tukey
method (see, for example, Radoski et al. [1975, 1976]). For an excellent
tutorial review on many of the modern spectral analysis techniques, includ-
ing MEM and periodogram, see Kay and Marple [1981].
The MEM estimate will always be a smooth estimate, but as expected, its
resolution decreases as the signal-to-noise ratio decreases. The MEM, as
applied here, consists of two independent steps. The first step is the so-
called Burg technique, which determines a prediction error filter with m
weights. When run in both time directions over the n data points (n > m),
this technique minimizes the mean square prediction error. The m weights
are the coefficients in an mth-order autoregressive (AR) model of a linear
process that could have operated on a white noise input to produce an
output with the statistical properties of the observations. The second step
For any linear, time-invariant filter, the output power spectrum is equal
to the input power spectrum times the square of the magnitude of the fre-
quency response of the filter (see, for example, Rabiner and Gold [1975,
p.414]). If the input is Gaussian white noise, its power spectrum is a con-
stant, the variance of the noise; the output power spectrum is simply a
constant times the magnitude-squared frequency response of the filter. If a
filter can be designed to have a frequency response in the form f^{−p}, then its
output power spectrum when excited by white noise with unit variance will
be

P(f) = f^{−2p}.   (1)
numbers, many different realizations of the colored noise (power law) proc-
ess can be produced easily.
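The sketch below realizes such a process by shaping the Fourier transform of white noise; this FFT-domain shaping is a stand-in for the FIR filters used in the paper, and the parameter values are assumptions.

```python
import numpy as np

def red_noise(N, m, rng):
    """Return N samples whose power spectral density falls off approximately as f**(-m)."""
    w = rng.standard_normal(N)                   # unit-variance white noise
    W = np.fft.rfft(w)
    f = np.fft.rfftfreq(N, d=1.0)
    H = np.zeros_like(f)
    H[1:] = f[1:] ** (-m / 2.0)                  # amplitude response f^(-m/2) gives power response f^(-m)
    return np.fft.irfft(W * H, n=N)              # the zero-frequency term is suppressed (H[0] = 0)

rng = np.random.default_rng(3)
x = red_noise(4096, 2.0, rng)                    # a realization with spectral index 2
```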
3. Computer Experiments
[Figure 1: panels (a)-(d); the fitted slope in the spectrum panel is −2.0000; axes TIME and FREQUENCY (Hz).]
where m is the desired spectral index, using the method of least squares.
The MEM slope is -1.9576, and the periodogram slope is -1.9397. Thus the
smoothed behavior of the periodogram in this case is acceptable.
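The slope fit itself amounts to a straight-line regression of log PSD on log frequency; the sketch below applies it to a random walk, whose spectral index is close to 2. The test series, the unit sampling interval, and the low-frequency fitting band are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.cumsum(rng.standard_normal(8192))         # random walk: power-law index close to 2

X = np.fft.rfft(x - x.mean())
psd = np.abs(X) ** 2 / len(x)                    # raw periodogram
f = np.fft.rfftfreq(len(x), d=1.0)
band = (f > 0) & (f <= 0.1)                      # fit over a low-frequency band, as in the figures
slope, _ = np.polyfit(np.log10(f[band]), np.log10(psd[band]), 1)
print(-slope)                                    # estimated spectral index, close to 2
```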
Figure 2. (a) Maximum entropy spectrum of the signal from Fig. 1(d). The
ordinate is 10 log10(PSD). (b) Periodogram of the same signal.
[Figure 3: panels (a)-(d), as in Fig. 1 but for a process of spectral index 4 (fitted slope −4.0009); followed by the plot panels of Figure 4, with fitted slopes −4.1284 (MEM) and −1.9580 (periodogram).]
Figure 4. (a) Maximum entropy spectrum of the signal from Fig. 3(d). The
ordinate is 10 log10(PSD). (b) Periodogram of the same signal.
In this case the MEM spectrum is still quite acceptable, but the periodo-
gram is not! There is essentially no relationship between the periodogram
and the true spectrum. The true slope is -4, and the periodogram slope is
only -1.9580.
An explanation of the periodogram difficulty derives from the fact that
a finite data set is equivalent to an infinite data set multiplied by a rectan-
gular window that is unity inside the measurement interval and zero outside.
Multiplication of the actual time series by a window function implies that
the overall transform is a convolution of the desired transform with the
transform of the window function. The Fourier transform of the rectangular
window has the form sin( 'lff)/'lff; this has a relatively narrow central lobe
and very high side lobes. If the window were infinitely wide, the side lobes
would disappear, and the central lobe would become a Dirac delta function;
convolution with such a function would simply reproduce the correct spec-
trum. But for a finite-sized window the convolution produces some delete-
rious effects. The first is that the true spectrum is smeared out or defo-
cused; spectral resolution is limited to the width of the main lobe, (NΔt)^{−1}
Hz. The second, and more important for our discussion, is that the high side
lobes increase the apparent power at points away from the central lobe. In
effect, the decrease of power spectral density with increasing frequency,
instead of characterizing the physical process, depends upon the window
alone. Thus spectral indices greater than about 2 cannot be obtained using
a rectangular window. Many other windows have been designed to offset
this high side lobe behavior. The rectangular window gives the narrowest
main lobe; other windows designed to reduce side lobes do so at the expense
of increasing main lobe width. Nevertheless, some such window is essential:
spectral estimation of red noise processes using periodograms requires the
use of some nonrectangular window.
For systematic comparison of the two methods, each filter was used to
produce 100 independent red noise realizations for each power law index.
Four sets of experiments were run, characterized by differing treatments of
the red noise before the spectral estimates. In all cases the order of the
prediction error filter in MEM was set to 6.
Figure 5. Observed FFT (periodogram) index versus the MEM index (index is
the negative of the slope). There are 100 independent realizations of the
power law process for each of the 10 indices 0.5, 1.0, ••• , 5.0. In every
case the same time series was used as input to MEM and to FFT
(periodogram). (a) Raw data. (b) End-matched data. (c) Windowed data.
(d) End-matched and windowed data.
sample. An inverse Fourier analysis, back into the time domain, produces a
periodic function of time, whose period is equal to the duration of the origi-
nal data. If the original data set does not look periodic, that is, if the first
and last points are not equal, then the resulting discontinuity in value of the
periodic function produces a distorted power spectrum estimate. If the dis-
continuity is large, the resulting spectrum distortion can be fatal. For an
illuminating discussion of this problem, which is called 'spectral leakage,'
see the paper by Harris [1978] on the use of windows.
Note that this phenomenon does not have any effect on the MEM spec-
trum, which is specifically designed to operate on the given data and only
the given data. No periodic extension of the data is required for MEM as it
is for the periodogram technique.
It is in the treatment of missing data that the sharpest and most
obvious difference between the periodogram technique and MEM occurs.
Jaynes' [1982] maximum entropy principle says that in inductive reasoning
our general result should make use of all available information and be maxi-
mally noncommittal about missing information by maximizing the entropy of
an underlying probability distribution. The MEM, developed by Burg in 1967,
begins by assuming that the first few lags of an autocorrelation function are
given exactly. Writing the power spectral density as the Fourier transform
of the entire infinite autocorrelation function (ACF) and using an expression
for the entropy of a Gaussian process in terms of its PSD, Burg solved the
constrained maximization problem: Find the PSD whose entropy is maximum
and whose inverse Fourier transform yields the given ACF. Maximization is
with respect to the missing (infinite in number) ACF values. The result of
all this is an extrapolation formula for the ACF. The MEM power spectrum is
the exact Fourier transform of the infinitely extended ACF.
The discontinuity in value of the periodogram can be removed by "end-
matching," in which a straight line is fit to the first and last data points and
then this straight line is subtracted from the data. End-matching may be
used to remove the low-frequency components, whose side lobes cause
spectral distortion.
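End-matching itself is a one-line operation; a sketch (with an arbitrary test segment) follows.

```python
import numpy as np

def end_match(x):
    """Subtract the straight line through the first and last points, so the segment begins and ends at zero."""
    n = len(x)
    return x - (x[0] + (x[-1] - x[0]) * np.arange(n) / (n - 1))

y = end_match(np.array([3.0, 4.5, 5.0, 7.0, 9.0]))
print(y[0], y[-1])     # both zero: no discontinuity when the segment is treated as periodic
```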
Experiment B: end-matching. Now the same 10 filters as in experi-
ment A are used, but each red noise realization (time series) is end-matched
before spectral estimation. Once again, the same data set is used as input
to both MEM and the periodograms. Figure 5(b) shows the results, with a
dramatic improvement in the periodogram indices up to a true index of 4.0.
The results at 4.5 and 5.0 are again unacceptable. Note also that the MEM
results are slightly but definitely worse. There is more scatter, and at an
index of 0.5 the MEM results are biased upward.
Clearly then, end-matching is required for spectral estimation using
periodograms: without it most of the spectrum estimates are unreliable.
Except in the low index range, for true indices between 0.5 and 2.0, the
periodograms are distorted. A spectral index of 2.0 could have arisen from
a true index anywhere between 2.0 and 5.0.
Just as clearly, end-matching should be avoided in the MEM calcula-
tions, where it does not help but degrades the spectral estimates.
Experiment C: windowing. The difficulty with the periodogram re-
sults at true indices of 4.5 and 5.0 may perhaps be explained by discontinui-
ties in slope of the original data set at the beginning and end. The sug-
gested cure is "tapering," or looking at the data through a window that
deemphasizes the data at both ends by multiplying the data by a function
that is near zero at the end points and higher in the center. This tapering
does indeed reduce the tendency to discontinuity in slope.
w_j = 1 − [ (2j − L − 1) / (L + 1) ]²,   j = 1, 2, …, L.   (3)
Each entry is based upon a set of 100 realizations of the red noise process. FFT stands for FFT-based
periodogram.
For both MEM and FFT, however, the bias is quite small and should not prove
troublesome.
It is noted that by averaging the spectral index estimates obtained from
four periodograms the confidence bounds can be reduced to less than the
bounds for a single Burg-MEM estimate. Using overlapping spectra as rec-
ommended [Nuttall and Carter, 1982], equivalent results could be obtained
from averaged periodograms by using a data set 3 times the length of that
required for the Burg-MEM analysis. The tradeoff between the use of the
two techniques is evident. The nonparametric method employing windowed
and averaged periodograms requires more data to produce the same result as
can be obtained from the Burg-MEM algorithm. The parametric Burg-MEM
algorithm, however, requires the selection of the correct order to produce
unbiased results, but the order is not known a priori. When faced with a
time series from an unknown process, both techniques should be applied, and
the parameters of the models (such as order of the process) should be
adjusted to provide consistent results [Jenkins and Watts, 1968].
3. Slight biases can result from the use of MEM unless care is taken in
determining the order of the process to be analyzed.
mean and standard deviation are approximately constant during the three
separate time intervals 0 to 2-1/4 min, 2-1/4 to 3-1/4 min, and 3-1/4 to
Figure 6. (a) Amplitude scintillation data taken from the MARISAT satellite
in January 1981. Sampling rate is 36 Hz. There is background noise from 0
to 2-1/4 min, moderate scintillation from 2-1/4 to 3-1/4 min, and fully
saturated scintillation from 3-1/4 to 5 min. (b) Maximum (top curve),
standard deviation (middle), and minimum (lower) of each batch of 361
observations of data from (a). Batches are overlapped by 181 points.
Figure 7. (a) The 60 overlapped data batches. (b) Smooth curves: Burg-
MEM spectra of order 5 of the time series shown in (a). Noisy curves:
periodograms of the same data after end-matching and windowing.
averaging is used. If one had only the periodogram results, one might be
tempted to "see" significant spectral peaks, and these peaks in any case
would tend to obscure real changes in slope in the underlying process. Such
important changes in slope (and indeed changes in the character of the
spectrum) are easy to identify in the MEM spectra and can be seen to cor-
respond to changes in the character of the time series as seen in Fig. 6.
The smoothness of the MEM spectrum is affected by the choice of the order
of the spectrum. Since the process is nearly stationary over a number of
spectra, additional information can be obtained by averaging the spectral
estimates obtained from the periodograms. Notice that the spectra num-
bered 27-37 are all extremely close to pure power law spectra with a single
spectral index over the entire frequency band of interest 0.1 to 18 Hz. This
fact should serve to validate and motivate the simulation results presented
earlier. Real geophysical data sets sometimes can be accurately repre-
sented as realizations of pure power law processes.
Notice that the character of the individual data sets changes quite
sharply from batch 25 to batch 29. The Burg-MEM spectra similarly show
changes of character from batch 25 to batch 29. Once again the character
of the signal and of the associated spectra changes abruptly from pure
[Figure: power spectral density (dB) versus frequency (Hz), 10^-1 to 10^1 Hz, for batches 38-41; panels (a) and (b), with successive curves offset in 20 dB steps.]
5. Conclusions
gram. Note that very little change in the spectrum is shown in the top four
curves, with 2, 4, 8, and 16 weights, but below that, for 32, 64, ... weights,
more and more meaningless detail is displayed.
A complete FORTRAN package that finds Burg-MEM spectra has been
prepared. Seriously interested scientists are invited to write to the author
for a copy of the program and its documentation. Please indicate preferred
tape density (800 or 1600 BPI) and code (EBCDIC or ASCII).
6. Acknowledgments
7. References
Stephen F. Gull
Mullard Radio Astronomy Observatory, Cavendish Laboratory,
University of Cambridge, Madingley Road,Cambridge CB3 OHE,
England
John Skilling
Department of Applied Mathematics and Theoretical Physics,
University of Cambridge, Silver Street, Cambridge CB3 9EW,
England
1. Introduction
S(f) = -\sum_i f_i \log f_i \qquad (1)
[Figure: examples of maximum entropy reconstructions. ME x-ray tomography (skull in perspex, EMI Ltd.); millimeter-wave Michelson interferometer spectrum of cyclotron emission from the DITE tokamak (Culham Laboratory), 0 to 200 GHz; 1024² maximum entropy image of SNR Cas A at 5 GHz (5-km telescope, MRAO, Cambridge); 'blind' deconvolution of unknown blurring, with the true image and blurring at left, the data as given to the ME program in the middle, and the reconstructions at right (T. J. Newton).]
where h is normalized to Σh = 1. To perform blind deconvolution, we pro-
pose that the combined entropy be maximized:
Suppose (as has often happened) that a radio astronomer has collected
two data sets related to the same celestial object, but at different frequen-
cies, e.g.:
4. Polarization Images
If the emission is polarized, we have to take into account the fact that
this "next photon" can fall into any of the available eigenstates of polariza-
tion. The generalization of the pattern of proportions of flux density is the
quantum-mechanical density matrix. For pixel i, in the circularly polarized
representation, namely,
we have
where I, V, U, and Q are the usual Stokes parameters. This density matrix
is the quantum generalization of the probability distribution of the next
photon, satisfying Trace( p) = 1, and having entropy S = - Trace( p log p).
The maximum entropy principle is again appropriate for the determination of
this matrix, given incomplete information. It simplifies nicely:
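A standard textbook form of this density matrix in the circular representation, given here only for orientation (the authors' sign conventions may differ), is

\rho_i = \frac{1}{2 I_i}
\begin{pmatrix} I_i + V_i & Q_i - iU_i \\ Q_i + iU_i & I_i - V_i \end{pmatrix},
\qquad \operatorname{Tr}\rho_i = 1,

with eigenvalues \lambda_\pm = \tfrac12 (1 \pm p_i), where p_i = \sqrt{Q_i^2 + U_i^2 + V_i^2}\,/\,I_i is the fractional polarization, so that the entropy reduces to
S_i = -\operatorname{Tr}(\rho_i \log \rho_i) = -\lambda_+ \log \lambda_+ - \lambda_- \log \lambda_-.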
5. Probabilities or Proportions?
For many years one subject has caused more discussion in our group
than any other: the meaning of N, that is, the infamous N in degeneracy =
exp( NS). A related question we are often asked in image reconstruction is:
'How much more likely is your maximum entropy image than any other
possible one?' We know of (and have at various times given) two distinct
answers:
(1) Use the entropy concentration theorem (ECT): pr({p}) ∝ exp(NS).
(2) 'It is no more likely than any other.'
For image reconstruction and in most other circumstances, we believe that
this second response is more nearly correct, but despite this, the maximum
entropy image is still the best one to choose [Gull and Skilling, 1984].
pr(n) = 2^{-N}\,\frac{N!}{n!\,(N-n)!}\,. \qquad (11)
Suppose now that we are told there is an underlying similarity between the
repetitions of the event: it's the same coin. There is then a well defined
ratio q for the proportion of Heads, which is the probability of 'Heads' for
any one toss. This information dramatically changes our state of knowledge,
and we can no longer tolerate the heaping up of probability near n =
N/2
that is predicted by the degeneracy (monkey) factors. Since we don't know
the ratio q (except that it lies between 0 and 1), we must take something
like
pr(n) = \frac{1}{N+1}\,, \qquad n = 0, \ldots, N. \qquad (12)
This implies that the probabilities of sequences of results are no longer all
equal:
The prior is presumably constant (0.125), and the likelihood should, as above,
be taken as exp[ -N S( {q})], where N = 4096 (the number of bytes in the
string) and S({q}) = -Σ_i q_i log q_i, the "entropy" of the string (q_i = n_i/N).
In other words, the byte boundary is at the position that puts the most
structure into the string: the minimum entropy. Like many Bayesian solu-
tions, this answer is so staggeringly obvious when you have seen it that
there seems little point in testing it-but we did, and found that for 32k bits
of ASCII the byte boundary is determined to a quite astronomical signifi-
cance level. (The reader may like to contemplate why the position is still
ambiguous to one of two positions.)
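A minimal Python sketch of this byte-boundary test, assuming the bit stream is scored by the entropy of the empirical byte distribution at each of the eight possible offsets (function and variable names are hypothetical), might look as follows.

import math
from collections import Counter

def byte_boundary(bits):
    # Find the bit offset (0-7) at which grouping `bits` into bytes gives the
    # lowest entropy S({q}) = -sum q_i log q_i of the empirical byte counts.
    # `bits` is assumed to be a long sequence of 0/1 integers.
    best_offset, best_S = None, None
    for offset in range(8):
        usable = (len(bits) - offset) // 8 * 8
        chunk = bits[offset:offset + usable]
        counts = Counter(tuple(chunk[i:i + 8]) for i in range(0, usable, 8))
        n = usable // 8
        S = -sum((c / n) * math.log(c / n) for c in counts.values())
        if best_S is None or S < best_S:
            best_offset, best_S = offset, S
    return best_offset, best_S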
Other applications of this procedure abound-for example, tests of pat-
tern length in strings and other tests of randomness.
other hand, cannot be interacting directly with each other, but rather with
something much bigger, which can well afford to lose (or acquire) any
amount of energy that the individual particles possess.
The reason there is a constant spectrum (actually the Lagrange multi-
plier in the maximum-entropy principle) is now clearer: we think we can
identify these 'bigger partners' [Bell, 1978]. They are interstellar shock
fronts. These shock fronts originate usually as the blast waves of super-
novae, and propagate at anything up to several thousand kilometers per
second. They are discontinuities in velocity maintained by plasma processes
in the thermal electron plasma. Relativistic electrons, having greater mo-
mentum, are likely to pass through these shocks undisturbed, though they are
eventually scattered and their directions become randomized. Because they
move so fast compared to the thermal plasma, they must on average make
many crossings of the shock before they are lost downstream. However, the
velocity discontinuity at the shock means that the electrons receive a jolt
every time they make a double crossing, playing "ping-pong" with the
shock. This acceleration is indeed a multiplicative Fermi process, and it
depends only on the compression ratio of the shock. But the probability of
being lost downstream also depends only on the same compression ratio.
This combination of constraints leads to a power-law spectrum [Heavens,
1983]. Further, the compression ratios for most shocks are the same; they
are 'strong' shocks, for which the compression in density is a factor of 4.
The nice thing is that the predicted spectrum agrees with what is
observed.
9. Acknowledgments
10. References
Gull, S. F., and J. Skilling (1984), 'The maximum entropy method,' in Indirect Imaging, J. A. Roberts, ed., Cambridge University Press.
John Skilling
Department of Applied Mathematics and Theoretical Physics,
University of Cambridge, Silver Street, Cambridge CB3 9EW,
England
Stephen F. Gull
Mullard Radio Astronomy Observatory, Cavendish Laboratory,
University of Cambridge, Madingley Road, Cambridge CB3 OHE,
England
C(p) \le 1, \qquad (2)
we may be faced with the difficulty that very many proportions obey the
condition. Practical necessity then forces us to make a selection from the
feasible set which we must present as the "result" of the experiment that
gave us the data.
Of course, this selection is not a matter of pure deduction, for we are
logically free to make any selection we like. Nevertheless, we are led to
prefer those particular proportions p that have maximum entropy. Several
arguments lead to this selection. Perhaps the most appealing is the abstract
argument of Shore and Johnson [1980; see also Johnson and Shore, 1983],
based on simple and general consistency requirements.
However, we ought to supply some initial model m before we use MEM.
For many practical cases, it is sufficient to ignore the problem and just let
mi be a constant, independent of i, in which case the model cancels out of
the analysis and can be quietly forgotten. Although this is certainly the
simplest way of dealing with the model, it is equally certainly not the best.
On the other hand, we must be careful when we introduce initial models,
precisely because they influence the final result. Indeed, any feasible pro-
portions p can be obtained by suitable choice of initial model m. This gives
MEM enormous flexibility but also introduces some danger. We are reminded
of the moral that Jaynes [1978] draws from carpentry, that the introduction
of more powerful tools brings with it the obligation to exercise a higher
level of understanding and judgment in using them. If you give a carpenter
a fancy new power tool, he may use it to turn out more precise work in
greater quantity; or he may just cut off his thumb with it. It depends on
the carpenter.
2. Astronomical Photographs
C(p) = \frac{1}{M} \sum_{k=1}^{M} \frac{(P_k - D_k)^2}{\sigma_k^2} \qquad (M = \text{number of data}), \qquad (4)
and we reject all images p for which C(p) significantly exceeds 1. The max-
imum entropy image is obtained by maximizing S(p;m) over C(p) and obeys
the variational equation
\partial S/\partial p_i = \lambda\, \partial C/\partial p_i\,. \qquad (5)
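A crude numerical sketch of this stationarity condition, assuming a linear response matrix A, Gaussian noise of standard deviation sigma, and a hand-tuned multiplier lam (in practice λ is adjusted until C(p) ≈ 1, and far more robust algorithms are used), is given below; all names are illustrative.

import numpy as np

def maxent_image(A, D, sigma, m, lam, n_iter=200):
    # Fixed-point iteration on dS/dp_i = lam * dC/dp_i, with
    # S(p;m) = -sum_i p_i log(p_i/m_i) and C(p) = (1/M) sum_k ((A@p - D)_k/sigma_k)^2.
    # Convergence is not guaranteed; this is only a sketch.
    M = len(D)
    p = np.asarray(m, dtype=float).copy()
    for _ in range(n_iter):
        resid = (A @ p - D) / sigma ** 2
        dC = (2.0 / M) * (A.T @ resid)          # gradient of the misfit C(p)
        p = m * np.exp(-1.0 - lam * dC)         # from -log(p_i/m_i) - 1 = lam*dC_i
    return p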
Thus most of the misfit statistic C(p) may well become assigned to
those small patches of the photograph that contain stars, so the stars are
allowed to relax relatively far from the data and are reconstructed system-
atically too faint. Conversely, extended objects of lower surface brightness
will be reconstructed too bright (to compensate for the stars' being too
faint) and too noisy (because too much of the misfit, which allows for noise,
has been assigned to the stars). The effect can be very significant, and
although it can sometimes be masked by tinkering with the form of C(p)
[Bryan and Skilling, 1980], the theoretical difficulty remains.
Somehow, the astronomer must tell the entropy not to discriminate
against the stars, even though his state of knowledge is translation
invariant.
A practical astronomer, uninhibited by theoretical dogma, would doubt-
less proceed as follows, or similarly. He would look at his reconstructed
(maximum entropy) image, and see that it contained a sprinkling of bright
points. He would be prepared to accept them as stars. He would then know
how many bright stars there were, and roughly how bright they were, and
where they were. He could remove them from his data, perhaps avoiding
bias by using Bayesian parameter-fitting for the positions and brightnesses.
Lastly, he could use his revised data to give him an uncorrupted maximum
entropy image of the fainter objects such as nebulae and galaxies. (At
Cambridge, we adopt this procedure routinely in radio astronomy.)
The crucial step occurs when the practical astronomer uses his first
reconstruction to learn about the locations of the stars, and revises his data
accordingly. Exactly the same effect can be accomplished entirely within
the maximum entropy formalism by allowing the model m to be adjusted in
the light of the reconstruction p. This change also lets us deal with non-
linear data, from which the stars could not be directly subtracted. We con-
clude that the astronomer's prior knowledge about stars must be encoded
into a result-dependent model m(p).
3. Generalities
and final images, which runs counter to the normal philosophy of MEM. It
is, however, quite clear that we must allow m to depend on p, and the phi-
losophy will just have to accommodate the demand.
We have argued that to obtain the best reconstructions, which use prior
knowledge to the full, we must let m depend on p, and maximize
S(p) = -\sum_{i=1}^{n} p_i \log[\,p_i/m_i(p)\,]\,. \qquad (6)
Clearly this procedure must be treated with great care. For example, the
uniqueness theorem for linear data no longer holds. We now develop some
preliminary thoughts on how to encode m(p) for certain particularly simple
types of prior knowledge. To expound these ideas, we use the "monkey"
approach, in which p is identified with a probability distribution. Specifi-
cally, we think of Pi as the probability that a sample chosen at random from
the distribution would come from cell i.
The simplest case to consider is that of point stars, for which the prior
knowledge was translation invariant. A single sample from the image could
come from anywhere, but subsequent samples are more likely to come from
exactly the same place. The space S of single samples is too small and
impoverished to contain our prior knowledge, so we move to the larger space
SN of multiple samples, from which we can always recover distributions on S
by marginalization [Jaynes, 1978].
Suppose we have prior knowledge that it is reasonable to assume that a
fraction q of the image comes from a single one of the n cells, leaving the
remaining probability to be distributed equally as r = (1-q)/(n-1) among the
other n-1 cells (Fig. 1). From this prior knowledge we can develop the
joint distribution of the first two samples,

P_{ij} = \sum_{k=1}^{n} \Pr(\text{1st in } i \text{ and 2nd in } j \mid \text{star in } k)\,\Pr(\text{star in } k)
       = r^2 + \frac{2(q-r)\,r}{n} + \frac{(q-r)^2\,\delta_{ij}}{n}\,. \qquad (11)
This shows that prior knowledge of the desired type does show up as a non-
uniform initial model

m_{ij} = r^2 + \frac{2(q-r)\,r}{n} + \frac{(q-r)^2\,\delta_{ij}}{n}\,. \qquad (12)
(13)
and the entropy gradient, which defines the reconstruction p through the
variational Eq. (5), is
Comparing this with the ordinary entropy gradient, Eq. (3), we see that this
gradient mimics the effect of an ordinary single-space entropy, Eq. (1), rel-
ative to a fixed initial model defined by the values
u_i = \exp\Bigl(\sum_j p_j \log m_{ij}\Bigr) \qquad (19)
This is the property we seek. Although the actual prior knowledge can be
coded only in terms of a nonuniform model on a product space, it behaves
like a result-dependent model in the single space. Furthermore, if the
knowledge is translation-invariant, the evaluation of Σ_j p_j log m_ij is merely
an n-cell convolution, which is entirely practical computationally.
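For a translation-invariant model m_ij = m(i - j), this effective model can be evaluated by the FFT; a minimal Python sketch (periodic boundaries assumed for simplicity, and the example numbers are hypothetical) is:

import numpy as np

def effective_model(p, log_m_kernel):
    # u_i = exp(sum_j p_j log m_{ij}) of Eq. (19), for a translation-invariant
    # model m_{ij} = m(i - j), computed as a circular convolution via the FFT.
    conv = np.real(np.fft.ifft(np.fft.fft(p) * np.fft.fft(log_m_kernel)))
    return np.exp(conv)

# Hypothetical example: eight cells with a uniform (hence uninformative) kernel.
p = np.array([0.05, 0.05, 0.60, 0.10, 0.05, 0.05, 0.05, 0.05])
print(effective_model(p, np.log(np.full(8, 1.0 / 8.0))))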
It is interesting to compare the behavior of the ordinary entropy,
Eq. (1), with the double-space entropy, Eq. (17), for a three-cell image
(p_1, p_2, p_3). Taking q = 0.95 (95% of the proportion may come from one of
the three cells), the two entropies are plotted in Figs. 2 and 3 respectively.
The ordinary single-space entropy has the usual single maximum at
(1/3, 1/3, 1/3). The double-space entropy, on the other hand, has four
local maxima, and shows a pronounced tendency to favor images with one of
the proportions larger than the other two.
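A Python sketch of this comparison, assuming the double-space entropy takes the product-space form S^(2)(p) = -Σ_ij p_i p_j log(p_i p_j / m_ij) with m_ij from Eq. (12) (an assumption consistent with the gradient relation and Eq. (19)), is:

import numpy as np

def double_space_entropy(p, q):
    # Assumed product-space form: S2(p) = -sum_ij p_i p_j log(p_i p_j / m_ij),
    # with m_ij = r^2 + 2(q-r)r/n + (q-r)^2 delta_ij / n from Eq. (12).
    n = len(p)
    r = (1.0 - q) / (n - 1)
    m = np.full((n, n), r ** 2 + 2.0 * (q - r) * r / n) + np.eye(n) * (q - r) ** 2 / n
    P = np.outer(p, p)
    return -np.sum(P * np.log(P / m))

# Coarse scan of the three-cell simplex with q = 0.95 (edges excluded so the
# logarithm stays finite); the best image has one dominant proportion.
best_S, best_p = -np.inf, None
for p1 in np.linspace(0.01, 0.97, 97):
    for p2 in np.linspace(0.01, 0.98 - p1, 60):
        p = np.array([p1, p2, 1.0 - p1 - p2])
        S2 = double_space_entropy(p, q=0.95)
        if S2 > best_S:
            best_S, best_p = S2, p
print(best_p, best_S)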
We can extend the development to higher product spaces. The single-
space model
mi = Pr(i) (20)
Figure 3. Contours of double-space entropy S^(2) from Eq. (17) for a three-
cell image with q = 0.95.
u_i = \exp\Bigl(\sum_{jk} p_j p_k \log m_{ijk}\Bigr)\,. \qquad (24)
Quadruple and higher models can allow for yet more detailed prior knowl-
edge, and an N-space treatment mimics an ordinary entropy with model

u = exp(arbitrary function of p)
  = arbitrary positive function of p. \qquad (26)
(27)
will have an associated degeneracy
g = N!\,\frac{m_1^{g_1} m_2^{g_2} \cdots m_n^{g_n}}{g_1!\, g_2! \cdots g_n!}\,. \qquad (28)
Clearly this is a very natural way of working in the space SN for large N.
Suppose the star is in cell k, so that
m_i^{(k)} = \begin{cases} q/h, & i = k \\ r/h, & i \neq k \end{cases} \qquad (30)

(31)
The total degeneracy corresponding to our state of knowledge that the star
could be in any of the cells may be obtained by adding the individual degen-
eracies to reach
(32)
This is precisely what we want since the best estimate of the position of the
star is the position of the brightest cell. Figure 4 shows contours of S^(∞)
Figure 4. Contours of S^(∞) from Eq. (33) for a three-cell image with
q = 0.95.
for a three-cell image with q = 0.95, for comparison with Figs. 2 and 3.
The uniform image has become a local minimum, so there are now only three
local maxima at positions Pm = q = 0.95, which is entirely reasonable.
The price to be paid for this neat adherence to the given state of prior
knowledge is a loss of uniqueness. There may be several local maxima of
entropy over the given constraint region C(p) < 1. If the brightest cell is
only weakly indicated by the data, then there will be serious danger of error
if we select just the highest overall maximum of entropy. A wiser course
might be to display as many of the local maxima as are reasonably accessible
to our computer programs, writing our preference for particular images in
terms of their numerical values of entropy.
David Hestenes
In spite of the enormous complexity of the human brain, there are good
reasons to believe that only a few basic principles will be needed to under-
stand how it processes sensory input and controls motor output. In fact, the
most important principles may be known already! These principles pro-
vide the basis for a definite mathematical theory of learning, memory, and
behavior.
1. Introduction
I am here to tell you that another major scientific revolution is well
under way, though few scientists are aware of it even in fields where the
revolution is taking place. In the past decade, a mathematical theory has
emerged that bridges the gap between neurophysiology and psychology, pro-
viding penetrating insights into brain mechanisms for learning, memory, mo-
tivation, and the organization of behavior. It promises a formulation of
fundamental principles of psychology in terms of mathematical laws as pre-
cise and potent as Newton's laws in physics. If the current theory is on the
right track, then we can expect it to develop at an accelerating pace, and
the revolution may be over by the turn of the century. We will then have a
coherent mathematical theory of brain mechanisms that can explain a great
range of phenomena in psychology, psychophysics, and psychophysiology.
To say that this conceptual revolution in psychology will be over is not
to imply that all problems in psychology will be solved. It is merely to
assert that the fundamental laws and principles of explanation in psychology
will be established. To work out all their implications will be an endless
task. So has it been in physics, where the laws of classical and quantum
mechanics have been well established for some time, but even the classical
theory continues to produce surprises. So has it been with the recent revo-
lution in biology brought about by breaking the genetic code; though some
principles of genetic coding are undoubtedly still unknown, the available
principles are sufficient to provide the field with a unified theoretical per-
spective and determine the modes of acceptable explanation. Biology is
now regarded as a more mature science than psychology, but we shall see
that it may be easier to give psychology a mathematical formulation.
If indeed a conceptual revolution is under way in psychology and the
brain sciences, you may wonder why you haven't heard about it before.
Why hasn't it been bannered by Psychology Today or proclaimed by some
expert on the Johnny Carson show? Why is it announced here for the first
time to an audience of mathematicians, physicists, and engineers? Before
these questions can be answered, we need to consider the status and inter-
relations of the relevant scientific disciplines.
with their 'mental' processing capabilities. At this level the fine physiolog-
ical details of single neuron dynamics become unimportant.
Network modeling is the theoretical bridge between the microscopic
and the macroscopic levels of brain activity, between neurophysiology and
psychology. Consequently it is open to objections by empiricists at both
ends who care little about connecting the antipodes. Little wonder that
network modeling is mostly ignored by the neuroscience establishment!
Little wonder that the establishment is unaware of the theoretical revolution
taking place in its own fields!
I am especially pleased to tell this audience about the exciting develop-
ments in network modeling, because the field is wide open for theoretical
exploration, and I doubt that I could find an audience more qualified to con-
tribute. Neural modelers have yet to employ statistical concepts such as
entropy with the skill and sophistication of the scientists gathered here.
Indeed, I believe that maximum-entropy theory will play its greatest role
yet in the neural network theory of the future.
3. Introduction to Grossberg
4. Empirical Background
Although the human brain is the most complex of all known systems, we
need only a few facts about neurons and neuroanatomy to establish an
empirical base for neural network theory. Only a brief summary of those
facts can be given here. For more extensive background, see Kandel and
Schwartz [1981].
The signal processing unit of the nervous system is the nerve cell or
neuron. There are, perhaps, a thousand types of neurons, but most of them
have in common certain general signal processing characteristics, which we
represent as properties of an ideal neuron model. To have a specific exam-
ple in mind, let us consider the characteristics of an important type of long
range signaling neuron called a pyramid cell (schematized in Fig. 1).
(1) The internal state of a neuron is characterized by an electrical po-
tential difference across the cell membrane at the axon hillock (Fig. 1).
This potential difference is called the generating potential. External inputs
produce deviations in this potential from a baseline resting potential (typi-
cally between 70 and 100 mV). When the generating potential exceeds a
certain threshold potential, a spike (or action) potential is generated at the
hillock and propagates away from the hillock along the axon.
[Figure 1: schematic of the ideal neuron, showing a dendrite, a synapse and its synaptic gap, the axon hillock (site of the generating potential), the axon, a collateral, and the terminal arborization.]
(3) Synaptic inputs and outputs: The flow of signals in and out of a
neuron is unidirectional. A neuron receives signals from other neurons at
points of contact on its dendrites or cell body known as synapses. A typical
pyramid cell in the cerebral cortex (see below) receives inputs from about
10^5 different synapses. When an incoming axonal signal reaches the synap-
tic knob it induces the release of a substance called a neurotransmitter from
small storage vesicles. The released transmitter diffuses across the small
synaptic gap to the postsynaptic cell, where it alters the local receptor po-
tential across the cell membrane. The synaptic inputs have several impor-
tant properties:
(a) Quantized transmission: Each spike potential releases (approxi-
mately) the same amount of transmitter when it arrives at the synaptic
knob.
(b) Temporal summation: Changes in receptor potential induced by suc-
cessive spike potentials in a burst are additive. Consequently, deviations of
the receptor potential from the resting potential depend on the pulse fre-
quency of incoming bursts.
(c) Synaptic inputs are either excitatory or inhibitory, depending on the
type of interaction between neurotransmitter and receptor cell membrane.
An input is excitatory if it increases the receptor potential or inhibitory if it
decreases the receptor potential.
(d) Weighted spatial summation: Input-induced changes in the various
receptor potentials of a neuron combine additively to drive a change in the
generating potential.
Now let us identify some general anatomical characteristics of the
brain to guide our constructions of networks composed of ideal neurons. We
can do that by examining a major visual pathway called the geniculostriate
system (Fig. 2). Light detected by photoreceptors in the retina drives the
production of signals that are transmitted by retinal ganglion cells from the
retina to the lateral geniculate nucleus (LGN), which (after some process-
ing) relays the signal to the visual cortex (also known as the striate cortex
or Area 17). From Area 17, signals are transmitted to Areas 18, 19, and
other parts of the brain for additional processing.
In Fig. 3 the geniculostriate system is represented as a sequence of
three layers of slabs connected by neurons with inputs in one slab and out-
puts in another slab. There are about 10^6 ganglion cells connecting the
retina to the LGN, so we may picture the retina as a screen with 10^6 pixels.
Let Ik(t) be the light intensity input detected at the kth pixel at time t.
Then the input intensity pattern or image displayed on the retina can be
represented as an image vector with n = 10^6 components:

I(t) = [I_1(t), I_2(t), \ldots, I_n(t)]\,. \qquad (1)
This vector is filtered (transformed) into a new vector input to the LGN by
several mechanisms: (a) the receptive fields (pixel inputs of the ganglion
cells) overlap; (b) nearby ganglion cells interact, and (c) the output from
each ganglion cell is distributed over the LGN slab by the arborization
(branching) of its axon. Later we will see how to model and understand
these mechanisms.
Actually, each of the three main slabs in the geniculostriate system is
itself composed of several identifiable layers containing a matrix of inter-
neurons (neurons with relatively short range interactions). We will take
these complexities into account to some extent by generalizing our models
5. Neural Variables
[Table: network components (node v_i, directed pathway e_ij, synaptic knob N_ij, node v_j), listing for each its name/variable together with its physiological and psychological interpretations.]
S_{ij}(t) = b_{ij}\, f\bigl(x_i(t - \tau_{ij})\bigr) \qquad (2)
where Tij is a 'time delay constant' and bij is a 'path strength constant'
determined by physical properties of the pathway. In general, f(Xi) is a
sigmoid function, but for many purposes it can be approximated by the
'threshold-linear function'
(3)
\dot{x}_i = -A_i x_i + \sum_{k=1}^{n} S_{ki} z_{ki} - \sum_{k=1}^{n} C_{ki} + I_i(t) \qquad (4)

(5)
where the overdot denotes a time derivative and i,j = 1,2, ••• ,n.
Grossberg's equations are generic laws of neural network theory in the
same sense that F = ma is a generic law of Newtonian mechanics. To con-
struct a mathematical model in mechanics from F = ma, one must introduce
a force law that specifies the function form of F. Similarly, to construct a
definite network model from Grossberg's equations, one must introduce laws
of interaction that specify the functional dependence of the quantities Ai,
Ski, Cki, Bij, and Si on the network variables Xi and Zij. To investigate
these laws of interaction is a major research program within the context of
(6)
This is an integral equation rather than a solution of Eq. (5) because x_j(τ)
depends on z_ij(τ) in the activity equation (4). However, it shows that in
the long run the initial LTM trace z_ij(0) is forgotten and z_ij(t) is given by a
time correlation function of the presynaptic signal S_ij with the postsynaptic
activity x_j.
Special cases and variants of Grossberg's equations have been formu-
lated and employed independently by many neural modelers. But Grossberg
has gone far beyond anyone else in systematically analyzing the implications
of such equations. In doing so, he has transformed the study of isolated
ad hoc neural models into a systematic theory of neural networks. But it
will be helpful to know more about the empirical status of the learning
equation before we survey the main results of Grossberg's theory.
7. Hebb's Law
From the viewpoint of neural network theory, Hebb's law is the funda-
mental psychophysical law of associative learning. It should be regarded,
therefore, as a basic law of psychology. However, most psychologists have
never heard of it because they do not concern themselves with the neural
substrates of learning.
Hebb [1949] first formulated the law as follows: "If neuron A repeat-
edly contributes to the firing of neuron B, then A's efficiency in firing B
increases." The psychological import of this law comes from interpreting it
as the neural basis for Pavlovian (classical) learning. To see how, recall
Pavlov's famous conditioning experiment. When a dog is presented with
food, it salivates. When the dog hears a bell, it does not salivate initially.
But after hearing the bell simultaneously with the presentation of food on
several consecutive occasions, the dog is subsequently found to salivate
when it hears the bell alone. To describe the experiment in more general
terms, when a conditioned stimulus (CS) (such as a bell) is repeatedly paired
with an unconditioned stimulus (UCS) (such as food) that evokes an uncon-
ditioned response (UCR) (such as salivation), the CS gradually acquires the
ability to evoke the UCR.
To interpret this in the simplest possible neural terms, consider Fig. 4.
Suppose the firing of neuron B produces the UCR output, and suppose the
UCS input fires neuron C, which is coupled to B with sufficient strength to
make B fire. Now if a CS stimulates neuron A to fire simultaneously with
neuron B, then, in accordance with Hebb's law, the coupling strength zAB
between neurons A and B increases to the point where A has the capacity to
fire B without the help of C. In actuality, of course, there must be many
neurons of types A, B, and C involved in the learning and controlling of a
molar behavioral response to a molar stimulus, but our reduction to the in-
teraction of just three neurons assumes that the learning actually takes
place at the synaptic level.
[Figure 4: minimal conditioning circuit. The CS (bell) drives neuron A and the UCS (food) drives neuron C; both synapse on neuron B, whose firing produces the UCR (salivation).]
Thus, the molar association strength between stimulus and response that
psychologists infer from their experiments is a crude measure of the synap-
tic coupling strength between neurons in the central nervous system (CNS).
The same can be said about all associations among ideas and actions. Thus,
the full import of Hebb's law is this: All associative (long term) memory
resides in synaptic connections of the CNS, and all learning consists of
changes in synaptic coupling strengths.
This is a strong statement indeed! Although it is far from a proven
fact, it is certainly an exciting working hypothesis, and it provides a central
theme for research in all the neurosciences. It tells us where to look for
explanations of learning and memory. It invites us to do neural modeling.
Ironically, cognitive psychologists frequently dismiss Pavlovian learning
as too trivial to be significant in human learning. But Hebb's law tells us
that Pavlovian learning is simply an amplified form of the basic neural proc-
ess underlying all learning. Neural modeling has already advanced far
enough to give us good reasons for believing that the most complex cogni-
tive processes will ultimately be explained in terms of the simple neural
mechanisms operating in the Pavlovian case.
Direct physiological verification of the synaptic plasticity required by
Hebb's law has been slow in coming because the experimental difficulties are
extremely subtle and complex. Peripheral connections in the CNS are most
easily studied, but they appear to be hardwired as one would expect because
the delicate plastic synapses must be protected from destructive external
fluctuations. Though some limited experimental evidence for synaptic plas-
ticity exists, there are still considerable uncertainties about the underlying
physiological mechanism. There are still doubts as to whether plasticity is
due to a pre- or postsynaptic process, though a postsynaptic process is most
likely [Stent, 1973].
Considering the experimental uncertainties, many neuroscientists are
reluctant to take Hebb's law seriously. They fail to realize that the best
evidence for Hebb's law is indirect and theory dependent. Hebb's law should
be taken seriously because it is the only available theoretical construct that
provides plausible, coherent explanations for psychological facts about
learning and memory. Indeed, that is what led Hebb to the idea in the first
place. Empiricists may regard such inverse arguments from evidence to
theory as unconvincing or even inadmissible, but history shows that inverse
arguments have produced the most profound advances in physics, beginning,
perhaps, with Newton's law of gravitation. As an example with many paral-
lels to the present case, recall that the modern concept of an atom was
developed by a series of inverse arguments to explain observable macro-
scopic properties of matter. From macroscopic evidence alone, remarkably
detailed models of atoms were constructed before they could be tested in
experiments at the atomic level. Similarly, we should be able to infer a lot
about neural networks from the rich and disorderly store of macroscopic
data in psychology. Of course, we should not fail to take into account the
available microscopic data about neural structures.
Hebb's original formulation of the associative learning law is too general
to have detailed macroscopic implications until it is incorporated in a defi-
nite network theory. Grossberg has given Hebb's law a mathematical for-
mulation in his learning equation (5). Hebb himself was not in a position to do
that, if only because the necessary information about axonal signals was not
There is one obvious property of neurons that we have not yet incorpo-
rated into the network theory, and that is the treelike structure of an axon.
What are its implications for information processing? Grossberg's "outstar
theorem" shows that the implications are as profound as they are simple.
Consider a slab of noninteracting nodes V = {v_1, v_2, ..., v_n} with a time
varying input image I(t) = [I_1(t), I_2(t), ..., I_n(t)] that drives the nodes above
signal threshold. The total intensity of the input is I = Σ_k I_k, so we can
write I_k = θ_k I where Σ_k θ_k = 1. Thus, the input has a reflectance pattern
θ(t) = [θ_1(t), θ_2(t), ..., θ_n(t)].
Now consider a node Vo with pathways to the slab as shown in Fig. 5.
This configuration is called an outstar because it can be redrawn in the sym-
metrical form of Fig. 6. When an "event" lo(t) drives Vo above threshold, a
learning signal Sok is sent to each of the synaptic knobs to trigger a sam-
pling of the activity pattern 'displayed' on the slab by driving changes in
the synaptic strengths z_ok.
Figure 6. Symmetry of the outstar anatomy.
z_{ok}(t) \;\to\; \theta_k \quad \text{as } t \to \infty \qquad (7)
where
(8)
To see how the outstar theorem follows from Grossberg's equations, let
us consider the simplest case, where the signals S_ok = S_o are the same for
all pathways, and all nodes have identical constant self-interaction coeffi-
cients. Then the outstar network equations are
(9a)
(9b)
(9c)
(10)
Thus, the activity pattern across the slab is proportional to the reflectance
pattern.
Suppose now that the same reflectance pattern θ is repeatedly pre-
sented to the slab, so the θ_k are constant but the intensity I(t) may vary
wildly in time. For the sake of simplicity, suppose also that the signal So
does not significantly perturb the activity pattern across the slab. Then,
the integral of the learning equation (9c) has the form of Eq. (6), and gives
the asymptotic result
z_{ok}(t) = N(t)\,\theta_k \qquad (11a)
where
According to Eq. (11a), the outstar learns the reflectance pattern exactly.
The same result is obtained with more mathematical effort even when per-
turbations of the slab activity pattern are taken into account.
Note that, according to Eq. (11b), the magnitude of N(t), and therefore
the rate of learning, is controlled by the magnitudes of the sampling signal
S_o(t) and the total input intensity I(t). Stronger signals, faster learning!
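A toy Python simulation of this behavior, under assumed minimal forms of the activity and learning equations (the constants, the sampling signal, and the gated-decay learning rule are all illustrative choices, not the exact Eqs. (9a)-(9c)), shows the relative LTM traces settling onto the reflectance pattern even though the total intensity varies wildly.

import numpy as np

def simulate_outstar(theta, steps=20000, dt=0.01, A=1.0, c=0.05, seed=0):
    # Toy outstar: slab activities x_k track the inputs I_k(t) = I(t)*theta_k,
    # while the LTM traces z_k follow an assumed gated-decay Hebbian rule
    # zdot_k = S0*(x_k - c*z_k) whenever the sampling signal S0 is on.
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)
    x = np.zeros_like(theta)
    z = rng.uniform(0.0, 1.0, theta.size)    # arbitrary initial LTM traces
    for t in range(steps):
        I = 5.0 + 4.0 * np.sin(0.05 * t)     # wildly varying total intensity
        x += dt * (-A * x + I * theta)       # slab activity equation
        S0 = 1.0                             # sampling signal (source node on)
        z += dt * S0 * (x - c * z)           # correlate sampling with activity
    return z / z.sum()                       # relative LTM traces

print(simulate_outstar([0.5, 0.3, 0.2]))     # approaches the reflectance pattern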
If the θ_k are not constant, the learning equation will still give an
asymptotic result of the form (11a) if θ_k is replaced by a suitably defined
average θ̄_k. To see this in the simplest way, suppose that two different but
constant patterns θ^(1) and θ^(2) are sampled by the outstar at different
times. Then the additivity property of the integral in Eq. (6) allows us to
write
(12)
where N_1 and N_2 have the same form as N in Eq. (11b), except that the in-
tegration is only over the time intervals when θ^(1) (or θ^(2)) is displayed on
the slab. Thus, the pattern θ̄ stored in the outstar LTM is a weighted aver-
age of sampled patterns.
Hebb's law specifically as a correlation function.
Outstar learning has a number of familiar characteristics of human
learning. If the outstar is supposed to learn pattern θ^(1), and θ^(2) is an
error, then according to Eq. (12), repeated sampling of θ^(1) will increase
the weight (N_1/N) of θ^(1), and θ̄ → θ^(1). Thus outstar learning is 'error
correcting' and 'practice makes perfect.' Or if θ^(1) is presented and sam-
pled with sufficient intensity, the outstar can learn θ^(1) on a single trial.
Thus, 'memory without practice' is possible. On the other hand, if an
outstar repeatedly samples new patterns, then the weight of the original
pattern decreases. So we have an 'interference theory of forgetting.'
Forgetting is not merely the result of passive decay of LTM traces.
Grossberg has proved the outstar theorem for much more general func-
tional forms of the network equations than we have considered. For pur-
poses of the general theory, it is of great importance to determine the vari-
ations in physical parameters and network design that are compatible with
the pattern learning of the outstar theorem.
The outstar theorem is the most fundamental result of network theory
because it tells us precisely what kind of information is encoded in LTM,
namely, reflectance patterns. It goes well beyond Hebb's original insight in
telling us that a single synaptic strength zok has no definite information
content. All LTM information is in the pattern of relative synaptic strengths
zok defined by Eq. (8).
To sum up, we have learned that the outstar network has the following
fundamental properties:
o Coding: The functional unit of LTM is a spatial pattern.
o Learning: A pattern is encoded by a stimulus sampling signal.
o Recall: Probed read-out of LTM into STM by a performance signal.
o Factorization: A pattern is factored into a reflectance pattern, which is
stored, and the total intensity I, which controls the rate of learning.
The outstar is a universal learning device. Grossberg has shown how to
use this one device to model every kind of learning in the CNS. The kinds
of learning are distinguished by the different interpretations given to the
components of the outstar network, which, in turn, depend on how the com-
ponents fit into the CNS as a whole. To appreciate the versatility of the
outstar, let us briefly consider three examples of major importance:
Top-down expectancy learning: Suppose the slab V = {v_1, v_2, ..., v_n}
represents a system of sensory feature detectors in the cerebral cortex. A
visual (or auditory) event is encoded as an activity pattern x = (x_1, x_2, ..., x_n)
across the slab, where x_k represents the relative importance of the kth fea-
ture. The outstar node Vo can learn this pattern, and when the LTM pattern
is played back on V, it represents a prior expectancy of a sensory event.
where i = 1, 2, ..., n, and A, B are positive constants while C may be zero or a
positive constant. It is easy to see that the form of Eq. (13) limits solutions
to the finite range B ≥ x_i(t) ≥ -C, whatever the values of the inputs I_k(t).
Two things should be noted about the form of Eq. (13). First, the exci-
tatory input Ii to each node Vi is also fed to the other nodes as an inhibitory
input as indicated in Fig. 7. A slab with this kind of external interaction is
called an on-center off-surround network. Second, both excitatory and in-
hibitory interactions include terms of two types, one of the form B I_i and the
other of the form x_i I_k. Grossberg calls the first type an additive
interaction and the second type a shunting interaction. Accordingly, he
calls a network characterized by Eq. (13) a shunting on-center off-surround
network.
shoe crab. The nodes in our model correspond roughly to the retinal gang-
lion cells with outputs in the LGN as we noted in Sec. 4. The distribution of
inhibitory inputs among the nodes is biologically accomplished by a layer of
interneurons in the retina called horizontal cells. Our model lumps these
interneurons together with the ganglion cells in the nodes. It describes the
signal processing function of these interneurons without including unneces-
sary details about how this function is biologically realized.
It is well known among neuroscientists that intensity boundaries in the
image input to an on-center off-surround network are contrast-enhanced in
the output. So it is often concluded that the biological function of such a
network is contrast enhancement of images. But Grossberg has identified a
more fundamental biological function, namely, to solve the noise-saturation
dilemma. However, shunting interactions are also an essential ingredient of
the solution. Many neural modelers still do not recognize this crucial role of
shunting interactions and deal exclusively with additive interactions. It
should be mentioned that the combination of shunting and additive interac-
tions appears in cell membrane equations that have considerable empirical
support.
To see how Eq. (13) solves the noise saturation dilemma, we look at the
steady-state solution, which can be put into the form
x_i = \frac{(B + C)\,I}{A + I}\left(\theta_i - \frac{C}{B + C}\right). \qquad (14)
Here, as before, the θ_i are the reflectances and I is the total intensity of
the input. This solution shows that the slab has the following important
image processing capabilities:
(1) Factorization of pattern and intensity. Hence, for any variations in
the intensity I, the activities x_i remain proportional to the reflectances θ_i
[displaced by the small constant C(B + C)^{-1}]. Thus, the slab possesses
automatic gain control, and the x_i are never saturated by large inputs.
This factorization matches perfectly with the outstar factorization
property, though they have completely different physical origins. Since an
outstar learns a reflectance pattern only, it needs to sample from an image
slab that displays reflectance patterns with fidelity. Thus, we have here a
minimal solution to the pattern registration problem.
(2) Weber law modulation by the factor I(A + I)^{-1}. Weber's law is an
empirical law of psychophysics that has been found to hold in a wide variety
of sensory phenomena. It is usually given in the form ΔI/I = constant,
where ΔI is the "just noticeable difference" in intensity observed against a
background intensity I. This form of Weber's law can be derived from
Eq. (14) with a reasonable approximation.
(3) Featural noise suppression. The constant C(B + C)^{-1} in Eq. (14)
describes an adaptation level, and B ≫ C in vivo. The adaptation level
C(B + C)^{-1} = n^{-1} is especially significant, for then if θ_i = n^{-1}, one gets
x_i = 0. In other words, the response to a perfectly uniform input is com-
x_i^0 = \frac{(B + C)(I + J)}{A + I + J}\left(\theta_i^0 - \frac{C}{B + C}\right), \qquad (15)
where I and J are the intensities of the two patterns. The slab output,
quenched or amplified, amounts to a decision as to whether the patterns are
the same or different. This raises questions about the criteria for pattern
equivalence that can be answered by more elaborate network designs. Note
that this competitive pattern matching device compares reflectance patterns
rather than intensity patterns.
From his analysis of the minimal solution to the noise-saturation
dilemma, Grossberg identifies the use of shunting on-center off-surround
interactions as a general design principle solving the problem of accurate
pattern registration. He then employs this principle to construct more gen-
eral solutions with additional pattern processing capabilities. He introduces
recurrent (feedback) interactions to give the slab an STM storage capability.
By introducing distance-dependent interactions he gives the network edge-
enhancement capabilities. Beyond this, he develops general theorems about
the form of interactions that produce stable solutions of the network
equations.
With the general network design principle for accurate pattern registra-
tion in hand, we are prepared to appreciate how Grossberg uses it to solve
the STM storage problem. The main idea is to introduce interactions among
the nodes in a slab in a way that is compatible with accurate pattern regis-
tration. Accordingly, we introduce an excitatory self-interaction (on-
center) for each node and inhibitory lateral interactions (off-surround), as
indicated in Fig. 8. Such a network is said to be recurrent because some of
its output is fed back as input.
Figure 8. A recurrent on-center off-surround anatomy.
The minimal solution to the STM storage problem is given by the net-
work equations
Now that we know how to hardwire a slab for accurate pattern regis-
tration and STM storage, we are ready to connect slabs into larger networks
capable of global pattern processing. Here we shall see how to design a
two-slab network that can learn to recognize similar patterns and distin-
guish between different patterns. This network is a self-organizing system
capable of extracting common elements from the time-varying sequence of
input patterns and developing its own pattern classification code. Thus, the
198 David Hestenes
[Figure: two-slab adaptive classifier. Image slab activities x_i: 2. contrast enhance; 3. STM. LTM in plastic synaptic strengths: 1. compute the time-average of the product of presynaptic signal and postsynaptic STM trace; 2. multiplicatively gate signals.]
T_k = \sum_{i=1}^{n} S_{ik} z_{ik} = \mathbf{S}_k \cdot \mathbf{z}_k\,, \qquad (17)
where the sum is over all nodes in the image slab, and S_k = (S_1k, S_2k, ..., S_nk),
z_k = (z_1k, z_2k, ..., z_nk).
(18)
(19)
ing equation (4). Noting that normalization will keep the feature detector
activity at a fixed value x_k = μ^{-1} c_k and assuming, for simplicity, that the
learning signal has the linear form S_ik = I θ_i, we can put the instar learning
equations into the vectorial form
(20)
which holds no matter how wildly the input intensity I(t) fluctuates. For
t ≫ a^{-1}, this reduces to z_k = a^{-1} c_k θ. Thus, the classification vector z_k
aligns itself with the input reflectance vector θ. More generally, it can be
shown that the classification vector z_k aligns itself asymptotically with a
weighted average θ̄ of reflectance vectors that activate the feature detec-
tor v_k. This is the instar code development theorem. It is dual to the out-
star learning theorem. Like the outstar theorem, it can be proved rigorously
under more general assumptions than we have considered here. Note that
the instar factorizes pattern and intensity just like the outstar, but the
physical mechanism producing the factorization is quite different in each
case. For the instar, factorization is produced by lateral interactions in the
image slab.
The code development theorem tells us that an adaptive classifier tunes
itself to the patterns it 'experiences' most often. When a single classifying
vector is tuned by experience, it shifts the boundaries of all the categories
Pk defined by Eq. (19). 'Dominant features' will be excited most often, so
they will eventually overwhelm less salient features. Thus, the adaptive
[Figure 12: map of orientation preferences produced by the computer experiment, drawn as short oriented line segments; neighboring points have similar orientations, giving a 'swirling' short-range order.]
The striking thing about Fig. 12 is its "swirling" short range order. The
points tend to be organized in curves with gradual changes in the direction
of the "tangent vector' as one moves from one point to the next. This is
striking because it is qualitatively the same as experimental results for
which Hubel and Wiesel received a Nobel prize in 1981. Into the striate
cortexes (Fig. 2) of cats and monkeys, Hubel and Wiesel inserted microelec-
trodes that can detect the firing of single neurons called 'simple cells.'
Simple cells are selectively responsive to bars of light with particular orien-
tations that are moved across an animal's retina. Moreover, they found that
the responses of neighboring simple cells are related as in Fig. 12. Mals-
burg's computer experiment suggests an explanation for their results, and so
provides one among many pieces of evidence that brains actually process
patterns in accordance with the principles we have discussed.
to tell someday whether the idea is true or false. For example, adaptive
resonances must be accompanied by distinctive electric fields, and the
experimental study of such fields is beginning to show promising results. In
the meantime, the adaptive resonance idea raises plenty of questions to keep
theorists busy.
Adaptive resonances process expected events (that is, recognized pat-
terns). An unexpected event is characterized by a mismatch between tem-
plate and image. In that case, the mismatch feature detectors must be shut
off immediately to avoid inappropriate recoding. Here we have a new
design problem for which Grossberg has developed a brilliant solution involv-
ing two new neural mechanisms that play significant roles in many other
network designs.
The first of these new mechanisms Grossberg calls a gated dipole. The
gated dipole is a rapid on-off switch with some surprising properties. There
is already substantial indirect evidence for the existence of gated dipoles,
but Grossberg's most incisive predictions are yet to be verified. In my
opinion, direct observations of gated dipole properties would be as profound
and significant scientifically as the recent "direct" detection of electroweak
intermediate bosons in physics.
The second new mechanism Grossberg calls nonspecific arousal. It is a
special type of signal that Grossberg uses to modulate the quenching
threshold of a slab. It is nonspecific in the sense that it acts on the slab as
a whole; it carries no specific information about patterns. As Grossberg
develops the theory further, he is led to distinguish several types of arousal
signals, and their similarities to what we call emotions in people become in-
creasingly apparent. Grossberg shows that arousal is essential for pattern
processing. This has the implication that emotions play an essential role in
the rational thinking of human beings. Reason and emotion are not so inde-
pendent as commonly believed.
The processing of unexpected events requires more than refinements in
the design of adaptive classifiers. It requires additional network compo-
nents to decide which events are worth remembering and how they should be
encoded. To see the rich possibilities opened up by Grossberg's attack on
this problem, I refer you to his collected works.
11. References
Richard K. Bryan
1. Introduction
2. Fiber Diffraction
where u_r is termed the unit rise and u_t the unit twist. The unique line
specified by the z axis is called the helix axis or fiber axis. Using capitals
for coordinates in reciprocal space, the Fourier transform of p is
F(R, \psi, Z) = \int_{-\infty}^{\infty}\!\!\int_0^{2\pi}\!\!\int_0^{\infty} p(r, \phi, z)\, e^{\,2\pi i [zZ + rR\cos(\phi - \psi)]}\, r\, dr\, d\phi\, dz. \qquad (2)
e^{iu\cos v} = \sum_{n=-\infty}^{\infty} i^{\,n} J_n(u)\, e^{inv} \qquad (3)
F(R, \psi, Z) = \sum_{n=-\infty}^{\infty} \sum_{j=-\infty}^{\infty} (\cdots) \qquad (4)
where
G_n(R, Z) = \int_0^{u_r}\!\!\int_0^{2\pi}\!\!\int_0^{\infty} p(r, \phi, z)\, e^{\,i(2\pi zZ - n\phi)}\, J_n(2\pi R r)\, r\, dr\, d\phi\, dz \qquad (5)
(usually called Bessel function terms). The second sum can be evaluated as
\sum_{m=-\infty}^{\infty} \delta(n u_t - Z u_r + m)\,, \qquad (6)
giving
F(R, \psi, Z) = \sum_{n=-\infty}^{\infty} \sum_{m=-\infty}^{\infty} (\cdots) \qquad (7)
Thus the transform is nonzero only on layer planes perpendicular to the helix
axis, with nth-order Bessel function terms falling at a Z position given by
the selection rule

Z = \frac{n u_t + m}{u_r}\,, \qquad m \text{ an integer}. \qquad (8)
(that is, reciprocal space radius), only a few of the possible Bessel function
terms are nonzero on each layer line.
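As an illustration of the selection rule, the following Python sketch lists the Bessel orders contributing to a given layer line for a hypothetical discontinuous helix specified by the number of units and turns in one repeat; the 27/5 figures below are purely illustrative and are not claimed to be the symmetry of Pf1.

def bessel_orders(l, units_per_repeat, turns_per_repeat, n_max=30):
    # Selection rule Z = (n*u_t + m)/u_r of Eq. (8).  Writing the layer line
    # index as l = Z*c for a repeat c containing `units_per_repeat` subunits in
    # `turns_per_repeat` turns gives l = n*turns_per_repeat + m*units_per_repeat.
    allowed = []
    for n in range(-n_max, n_max + 1):
        num = l - n * turns_per_repeat
        if num % units_per_repeat == 0:
            allowed.append((n, num // units_per_repeat))
    return allowed

# Hypothetical 27/5 helix (27 units in 5 turns per repeat): within |n| <= 30,
# only n = -16 and n = 11 contribute to the first layer line.
print(bessel_orders(1, 27, 5))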
However, real fibers are imperfect in that the individual particles may
not be exactly oriented and in that there is a limited number of coherently
diffracting units in each particle. These effects cause layer line spreading
in the directions of the polar angle and meridian. Examples of diffraction
patterns from magnetically oriented fibers of bacteriophage Pf1 [Nave et
al., 1981] are shown in Fig. 1. These are extremely good fiber patterns, but
they still show some disorientation, particularly visible at large radius. The
form of the point-spread function has been calculated in various approxima-
tions by Deas [1952], Holmes and Barrington Leigh [1974], Stubbs [1974],
and Fraser et al. [1976], and a fast algorithm, accurate over the required
domain, has been programmed by Provencher and Gloeckner [1982]. The
recorded diffraction pattern is also degraded by film background noise and
air scatter of the x-ray beam. For the well oriented patterns used here,
this background is easily estimated by fitting a smoothly varying function of
reciprocal space to the film density in the parts not occupied by layer line
intensity, and interpolating between.
Figure 1. Diffraction patterns of (a) native Pf1 and (b) iodine derivative,
from Nave et al. [1981]. The fiber axis is vertical but tilted slightly from
the film plane so that the pattern is asymmetric. This enables meridional
data to be collected in the 5 " region.
Figure 2. The Harker construction in the complex plane for a single recipro-
cal space point. H is the complex heavy atom amplitude. Circles of radius
N, the native amplitude, and D, the derivative amplitude, are drawn cen-
tered at the origin and at -H respectively. The points of intersection A and
B, where the derivative is the sum of the native and heavy atom vectors,
give the possible phase solutions.
the native data often extend to higher resolution than the derivatives, so
cannot be used until an atomic model has been built and is being refined
against the entire native data set. Despite these criticisms, around 200
protein structures have been solved so far by this or closely related
methods.
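The Harker construction of Fig. 2 amounts to intersecting two circles in the complex plane; a small Python sketch of the two resulting phase choices (the numbers and names are illustrative) is:

import numpy as np

def harker_phases(N_amp, D_amp, H):
    # Intersect the circle |F| = N_amp (centered at the origin) with the circle
    # |F + H| = D_amp (centered at -H): the two crossings give the possible
    # native phases, as in the Harker construction of Fig. 2.
    cos_dphi = (D_amp ** 2 - N_amp ** 2 - abs(H) ** 2) / (2.0 * N_amp * abs(H))
    if abs(cos_dphi) > 1.0:
        raise ValueError("circles do not intersect: amplitudes are inconsistent")
    dphi = np.arccos(cos_dphi)
    phi_H = np.angle(H)
    return phi_H + dphi, phi_H - dphi

# Hypothetical numbers: native amplitude 10, derivative amplitude 12,
# heavy-atom contribution 3*exp(0.4i).
print(harker_phases(10.0, 12.0, 3.0 * np.exp(0.4j)))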
The application to fiber diffraction is very similar, but there are signifi-
cant differences [Marvin and Nave, 1982]. Since a continuous distribution
of intensities is measured, instead of the integrated intensities over distinct
diffraction spots, the noise on fiber data is usually considerably higher than
for crystalline diffraction, and the phases are consequently less well deter-
mined. The finite radius of the structure implies continuity in R of the
Bessel function terms. Point-by-point construction of phases and weighting
of amplitudes does not necessarily ensure this continuity of the complex
amplitudes. Even seeking a Bessel function term with minimum curvature in
an Argand diagram plot [Stubbs and Diamond, 1975] does not mean that it is
D = T(p) + \epsilon\,. \qquad (11)
In principle, one could calculate the electron density directly from one
or more of the background-corrected diffraction patterns, using maximum
entropy with the forward transform
p \;\xrightarrow{\text{Fourier transform}}\; G \;\xrightarrow{\;|\,\cdot\,|^2\;}\; \text{layer lines} \;\xrightarrow{\text{convolution}}\; \text{diffraction pattern}. \qquad (13)
The main part of this discussion will therefore be concerned with the
more challenging problem of calculating the electron density from the layer
line intensities.
The data constraint requires a comparison of the data that would be
observed from the estimated structure p with the actual data. The data are
sampled in R, so R will now be used as a subscript to denote a discrete point
in reciprocal space. As described above, the native layer line intensities are
\sum_n |G_{n\ell R}|^2. The derivative structure is simply the sum of the native density and the density of the heavy atoms at the appropriate coordinates, whose transform will be denoted by H_{n\ell R}, giving predicted derivative data

\sum_n |G_{n\ell R} + H_{n\ell R}|^2 \,.   (14)
Any data that are already phased, such as the equator, are included in a
separate term. These are compared with the measured data by means of the
traditional \chi^2 test [Ables, 1974; Gull and Daniell, 1978]:
\chi^2 = \chi_P^2 + \chi_N^2 + \chi_D^2 \,,   (15)

\chi_P^2 = \sum w^P_{n\ell R}\,(G_{n\ell R} - E_{n\ell R})^2 \,,

\chi_N^2 = \sum w^N_{\ell R}\Big(\sum_n |G_{n\ell R}|^2 - I^N_{\ell R}\Big)^2 \,,   (16)

\chi_D^2 = \sum w^D_{\ell R}\Big(\sum_n |G_{n\ell R} + H_{n\ell R}|^2 - I^D_{\ell R}\Big)^2 \,,

where the E's are the phased data amplitudes, the I's are the measured intensities,
the w's are the weights to be given to the respective measurements, taken
as the inverse variances, and the summations are over the data points only.
A further term like \chi_D^2 is included for each extra heavy atom derivative.
statistic comparing the predicted and measured amplitudes has also been
suggested for the phase problem [Gull and Daniell, 1978; Wilkins et al.,
1983]. A statistic defined on the actual measured quantities, in this case
the intensities, is obviously to be preferred. The amplitude statistic also has
the disadvantage of being nondifferentiable if any amplitude is zero, so it is
impossible to start the iterative algorithm from a flat map, which is desir-
able if the phases are not to be biased to those of a particular model start-
ing map.
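A minimal sketch of how the intensity terms of Eqs. (15)-(16) might be evaluated numerically is given below; the array layout, the restriction to the native and derivative terms, and the variable names are illustrative assumptions.

    import numpy as np

    def chi_squared(G, H, I_native, I_deriv, w_native, w_deriv):
        """Intensity part of the data constraint, Eqs. (15)-(16).

        G, H : complex arrays of shape (n_bessel, n_points) holding the
               trial transform G and the fixed heavy-atom transform H
        I_*  : measured layer-line intensities at the data points
        w_*  : weights, taken as inverse variances
        """
        pred_native = np.sum(np.abs(G) ** 2, axis=0)        # sum_n |G|^2
        pred_deriv  = np.sum(np.abs(G + H) ** 2, axis=0)    # sum_n |G + H|^2
        chi2_N = np.sum(w_native * (pred_native - I_native) ** 2)
        chi2_D = np.sum(w_deriv  * (pred_deriv  - I_deriv) ** 2)
        return chi2_N + chi2_D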
(17)
Figure 3. Bifurcation of the maximum entropy solution. The solid lines are
contours of constant \chi^2, the dashed lines are contours of S, and the dotted
lines are trajectories of local stationary S over constant \chi^2. Branch 1 is the
line of local entropy maxima when \chi^2 is convex. When this constraint
becomes nonconvex, the line of maxima bifurcates into two branches, 1a
and 1b, and a line of local S minima arises, branch 2.
various maps, other than the flat map, is one way of checking whether there
are other possible solutions, and for the Pf1 study this was done to a limited
extent. A program to investigate systematically all the branches, specifi-
cally for the crystallographic problem, has been proposed [Bricogne, 1984].
To attempt to find the directions of negative curvature, the search
direction set was supplemented as follows:
(a) A direction that has a reciprocal space "phase" component. This
can be constructed in various ways, for example, specific reciprocal space
components, random search directions, and asymmetrizing the map both sys-
tematically and randomly. Bryan [1980] describes many attempts at such
constructions. The efficacy varies slightly, but so far no "magic" direction
has been found. Although simple constructions like these usually have only
a small component in a negative curvature direction, they are essential as
"seed" directions for (b).
(b) Contrary to the recommendations made for the convex case [Skilling
and Bryan, 1984], it is here essential to carry forward from iteration to
iteration some information on the constraint curvature, namely the direc-
tions associated with negative constraint curvature. This is done by calcu-
lating the k eigenvectors in the search subspace with the k lowest eigen-
values, and using these as additional search directions at the next iteration.
The direction from (a) provides an initial perturbation, and, in practice, the
negative curvature directions are built up over several iterations. There is,
With regard to grid size, the aim is for a maximum spacing in the recon-
struction of 0.25 times the resolution length. Since the resolution of the
data is 4 Å, the maximum spacing is \Delta = 1 Å. The minimum number of angu-
lar samples N_\phi can then be calculated from N_\phi \geq 2\pi r_{max}/\Delta. Using
r_{max} = 35 Å gives N_\phi \geq 220, conveniently rounded to 256 for the angular
structure. However, with the selection rule 71/13, a Bessel function term n
is on the same layer line as one of order 71-n, so separation of terms with n
in the range 30 to 40 is particularly difficult. In the Pf1 data, such layer
lines happened to be very weak and were omitted from the data. This
causes a slight drop in resolution at large radius. Also omitted were meridi-
onal parts of layer lines (for example, \ell = 45) that showed particularly
strong crystal sampling. (The patterns in Fig. 1 were not the ones used for
data collection. They were made from drier and hence more closely packed
fibers, which show stronger intensities and are thus more convenient for the
purposes of illustration, but also show more crystal sampling.) Three z-sec-
tions are required to represent the unit rise of 3.049 Å. In view of the
polar grid used, the entropy has to be weighted by the pixel size. Instead
of working directly with the density \rho_i, one can use n_i = q_i \rho_i, proportional
to the integrated density in the pixel, enabling the entropy to be written as
S = -\sum_i \frac{n_i}{E} \log\frac{n_i}{q_i E} \,,   (19)

where

E = \sum_i n_i \,.   (20)
This also enables the factor r in the Hankel transform kernel to be absorbed into
the stored density value.
As applied to the Pf1 data, this algorithm was spectacularly unsuccess-
ful, and produced a solution with a single blob of density near the heavy
atom coordinates. Clearly, the phases of the solution map were still lined
up on the heavy atom phases. The peak density exceeded any possible pro-
tein density, and it seemed obvious to apply a maximum density limit to the
reconstruction. Although such a limit could be imposed by individual con-
straints on each pixel, such an approach is computationally intractable, and
it is more in keeping with the philosophy of the method to modify the
entropy expression (which describes our real-space knowledge) to a 'Fermi-Dirac' form [Frieden, 1973]
S_{FD} = -\sum_i \left[ \frac{n_i}{E}\,\log\frac{n_i}{q_i E} + \frac{\tilde n_i - n_i}{E}\,\log\frac{\tilde n_i - n_i}{q_i E} \right] ,   (21)

where
thus putting the upper bound on the same status as the positivity constraint
implicit in the ordinary entropy expression. The entropy metric in the algo-
rithm is changed to the second derivative of S_{FD}, proportional to

\frac{1}{n_i} + \frac{1}{\tilde n_i - n_i} \,.   (23)
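A sketch of this bounded entropy and of the resulting diagonal metric is given below; it assumes the 'Fermi-Dirac' form of Eq. (21) with a per-pixel upper bound, and the normalization conventions are illustrative.

    import numpy as np

    def fermi_dirac_entropy(n, n_max, q):
        """Bounded ('Fermi-Dirac') entropy in the spirit of Eq. (21).
        Both n and n_max - n must stay strictly positive; the exact
        normalization conventions are an assumption here."""
        E = n.sum()
        return -np.sum((n / E) * np.log(n / (q * E))
                       + ((n_max - n) / E) * np.log((n_max - n) / (q * E)))

    def metric_diagonal(n, n_max):
        """Diagonal of (minus) the second derivative of the entropy above,
        proportional to 1/n_i + 1/(n_max_i - n_i); its reciprocal vanishes
        at both bounds, so the search metric keeps the iterates away from
        zero and from the upper limit."""
        return 1.0 / n + 1.0 / (n_max - n)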
Figure 6. View of a 3-D contour map of the electron density, with the
backbone and β-carbons of a mostly α-helical atomic model of the coat
protein built in. Parts of three adjacent units are shown, looking from the
outside of the virus, with the z axis vertical.
Figure 7. Diffraction amplitudes on layer lines 1, 6, 7, and 39. The solid
curves are the observed amplitudes, the dashed are the transform of the
maximum entropy map, and the dotted are the transform of the atomic
model of Fig. 6. (a) Native amplitudes. (b) Iodine derivative amplitudes.
Note the large differences on layer line 6.
6. Acknowledgments
7. References
Peter Cheeseman
This paper presents a new method for calculating the conditional proba-
bility of any attribute value, given particular information about the individ-
ual case. The calculation is based on the principle of maximum entropy and
yields the most unbiased probability estimate, given the available evidence.
Previous methods for computing maximum entropy values are either very
restrictive in the probabilistic information (constraints) they can use or are
combinatorially explosive. The computational complexity of the new proce-
dure depends on the interconnectedness of the constraints, but in practical
cases it is small.
1. Introduction
The main purpose of this paper is to introduce a new method for com-
puting the maximum entropy probability of a predicate or attribute of inter-
est, given specific evidence about related attributes, and subject to any
linear probability constraints. This method avoids the combinatorial explo-
sion inherent in previous methods without imposing strong limitations on the
constraints that can be used, and it is therefore useful for computer-based
expert systems.
The method of maximum entropy was first applied by Jaynes to the sta-
tistical mechanics problem of predicting the most likely state of a system
given the physical constraints (for example, conservation of energy). In
Jaynes [1968] the maximum entropy method was used to provide prior prob-
abilities for a Bayesian analysis. Lewis [1959] applied the method of least
information (an equivalent method) to the problem of finding the best
approximation to a given probability distribution based on knowledge of
some of the joint probabilities (that is, constraints on the possible distribu-
tions). Ireland and Kullback [1968] applied the minimum discrimination in-
formation measure (yet another method equivalent in this case to maximum
entropy) to find the closest approximating probability distribution consistent
with the known marginals in contingency table analysis. Konolige [1979]
applied the least information method to expert systems, and this analysis has
been extended by Lemmar and Barth [1982].
The mathematical framework used in this paper is defined below.
Although the definitions are for a space of four parameterized attributes,
the framework applies to any number of attributes. The attributes are A, B,
C, and D, where
and Pijkl is the probability that A has value Ai, B has value Bj, C has value
Ck, and D has value Dl. For example, A might be the attribute (function
name) 'soil-type,' where A1 has the value 'clay,' A2 is 'silt,' and so on.
Each value (category) of an attribute is assumed to be mutually exclu-
sive and exhaustive of the other categories. Any attribute that is not cur-
rently exhaustive can be made so by adding an extra category, 'other,' for
anything that does not fit the existing categories. In terms of these attri-
butes, the entropy function, H, is defined as
H = -\sum_{ijkl} p_{ijkl} \log p_{ijkl} \,.   (1)
\sum_{ijkl} p_{ijkl} = 1 \,,   (2)

\sum_{jkl} p_{ijkl} = p^A_i \,,   (3)

\sum_{ij} p_{ijkl} = p^{CD}_{kl} \,,   (4)

\sum_{ik} p_{ijkl} = p^{BD}_{jl} \,,   (5)

p(A_2 \mid B_3) = \frac{p^{AB}_{23}}{p^B_3} = \frac{\sum_{kl} p_{23kl}}{\sum_{ikl} p_{i3kl}} \,,   (6)

implying

\sum_{j \neq 3} p^{AB}_{2j} = 1 - x \,.   (8)
The next step is to equate the derivative of Eq. (10) (with respect to each
variable) to zero, giving:
\frac{\partial H'}{\partial p_{ijkl}} = -\log p_{ijkl} - 1 - \lambda - \lambda_i - \cdots - \lambda^{AB}_{ij} - \cdots - x_{23}\,[1 - p(A_2 \mid B_3)] - \cdots = 0 \,,   (11)

implying

p_{ijkl} = \exp\!\big[-(1 + \lambda + \lambda_i + \cdots + \lambda^{AB}_{ij} + \cdots)\big] \,,

where \lambda_0 = \lambda + 1, or

p_{ijkl} = \exp\!\big[-(\lambda_0 + \lambda_i + \cdots + \lambda^{AB}_{ij} + \cdots)\big] \,,

and
\frac{\partial H'}{\partial \lambda_0} = 0 \;\Rightarrow\; \sum_{ijkl} p_{ijkl} = 1 \,,   (14)

\frac{\partial H'}{\partial \lambda_i} = 0 \;\Rightarrow\; \sum_{jkl} p_{ijkl} = p^A_i \,,   (15)

\frac{\partial H'}{\partial x_{23}} = 0 \;\Rightarrow\; \sum_{kl} p_{23kl} = p(A_2 \mid B_3) \sum_{ikl} p_{i3kl} \,,   (16)

etc.
(17)

a_0 \sum_{ijkl} a_i a_j \cdots a_{ij} a_{ik} \cdots a_{ijk} \cdots = 1 \,,   (18)

(19)
4. Probabilistic Inference
If the prior probability of, say, a particular Aj is required (that is, it is not
one of the given priors), then the value is given by
(21)
Here, the full summation has been recursively decomposed into its compo-
nent partial sums, allowing each partial sum to be computed as soon as pos-
sible; the resulting matrix then becomes a term in the next outermost sum.
In the above example, this summation method reduces the cost of evaluating
\Sigma_{jklm} from O(J*K*L*M) (where J, ..., M are the ranges of j, ..., m respec-
tively) to O(J*L*M), that is, the cost of evaluating the innermost sum.
Note that a different order of decomposition can produce higher costs; that
is, the cost of the evaluation of sums is dependent on the evaluation order
and is a minimum when the sum of the sizes of the intermediate matrices is
a minimum. When there are a large number of attributes, the total compu-
tational cost of evaluating a sum is usually dominated by the largest inter-
mediate matrix, whose size is partly dependent on the degree of intercon-
nectedness of the attribute being summed over and the order of evaluation.
The above summation procedure is also necessary for updating the previous
a's when given new prior probabilities. In the above, a0 is a normalization
constant that can be determined, once all values of P(Ai) have been evalu-
ated, from the requirement that \Sigma_i P(Ai) = 1. Such a normalization makes
prior evaluation of a0 unnecessary.
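The following sketch illustrates the innermost-first summation on a hypothetical chain of factor tables (the a's of the product form above); the particular A-B-C-D chain structure and the table sizes are assumed only for illustration.

    import numpy as np

    # Illustrative factor tables; shapes are hypothetical.
    a_AB = np.random.rand(3, 4)   # factor over A (3 values) and B (4 values)
    a_BC = np.random.rand(4, 5)   # factor over B and C
    a_CD = np.random.rand(5, 2)   # factor over C and D

    # Naive evaluation of P(A_i) ~ sum_{j,k,l} a_ij a_jk a_kl costs O(I*J*K*L).
    p_naive = np.einsum('ij,jk,kl->i', a_AB, a_BC, a_CD)

    # Innermost-first decomposition: each partial sum becomes a term in the
    # next outermost sum, so no intermediate exceeds two indices.
    t_C = a_CD.sum(axis=1)        # sum over l: vector indexed by k
    t_B = a_BC @ t_C              # sum over k: vector indexed by j
    p = a_AB @ t_B                # sum over j: vector indexed by i
    p /= p.sum()                  # normalization fixes a_0 implicitly

    assert np.allclose(p, p_naive / p_naive.sum())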
To find the conditional probability of an attribute (or joint conditional
probability of a set of attributes), all that is needed is to sum over the
values of all attributes that are not given as evidence or the target attri-
bute. For example, to find the conditional probability of Ai given that D2
and E3 are true, the correct formula is
(22)
where β is a normalization constant and the summations are over the re-
maining attributes (B and C). Note that the more evidence there is con-
cerning a particular case, the smaller the resulting sum. Also, the condi-
tional probability evaluation procedure is nondirectional because, unlike
other expert systems, this procedure allows the conditional probability of
any attribute to be found for any combination of evidence. That is, it has
no specially designated evidence and hypothesis attributes. Also note that
in this case the initial sum has been factored into two independent sums.
The above probability evaluation method can be extended to include the
case where the evidence in a particular case is in the form of a probability
distribution over the values of an attribute that is different from the prior
distribution, rather than being informed that a particular value is true. In
this case, it is necessary to compute new a's that correspond to the given
distribution and use these new a's in place of the corresponding prior a's in
The revised a values above are equivalent to the multiplicative factors used
by Lemmar and Barth [1982], who showed that the minimum discrimination
information update of a particular probability subspace is given by introduc-
ing multiplicative factors equivalent to the above. The major difference is
that the summation procedure described in this paper will work even when
the space cannot be partitioned. The relationship between minimum dis-
crimination information update and conditional probability update where an
"external" evidence node is given as true, thus inducing a probability distri-
bution on a known node, is being investigated. In the extreme case where
the minimum discrimination update is based on information that a particular
value is true, the two approaches give the same results.
The above conditional probability evaluation procedure (a type of
expert system inference engine) has been implemented in LISP and has been
tested on many well known ME examples. In ME conditional probability cal-
culations when specific evidence is given, it has been found that only short,
strong chains of prior joint or conditional probabilities can significantly
change the probability of an attribute of interest from its prior value.
When a joint probability value is computed by the proposed method, it is
useful to estimate its accuracy as well. There are two sources of uncer-
tainty in a computed ME value. One is the possibility that the known con-
straints used are not the only ones operating in the domain. This type of
uncertainty is hard to quantify and depends on the methods used to find the
known constraints. If a constraint search is systematic (over the known
data), then we can be confident that we know all the dependencies that can
contribute to a specific ME value. If a constraint search is ad hoc, it is
always possible that a major contributing factor has been overlooked. If
any important factors are missing, the calculated ME probability values will
differ significantly from the observed values in many trials. If such system-
atic deviations are found, it indicates that constraints are missing, and an
analysis of the deviations often gives a clue to these missing factors
[Jaynes, 1979].
The other source of uncertainty is in the accuracy with which the con-
straints are known. This accuracy depends on the size of the sample from
which the constraints were extracted or the accuracy of the expert's esti-
mates. This uncertainty is also hard to quantify when an expert's estimates are
used, but when the constraints are induced from data, the accuracy is de-
pendent on the sample size and randomness. In the analysis given here, the
constraints were assumed to be known with complete accuracy. When par-
ticular conditional probability values are computed, the uncertainty of the
conditioning information can introduce additional uncertainty.
5. Summary
6. References
Yair Censor
Department of Mathematics, University of Haifa, Mt. Carmel,
Haifa 31999, Israel
Tommy Elfving
National Defence Research Institute, Box 1165, S-581 11 Linköping, Sweden
Gabor T. Herman
Department of Radiology, Hospital of the University of Pennsyl-
vania, Philadelphia, Pennsylvania 19104
\mathrm{ent}\, x = -\sum_{j=1}^{n} x_j \log x_j \,.   (1)
By definition, 0 log 0 = O.
The entropy optimization problems that will be addressed here are of
the form
Maximize ent x subject to x \in Q and \sum_{j=1}^{n} x_j = 1 and x \geq 0 ,   (2)

Q_1 = \{x \in R^n \mid Ax = b\} \,,   (3)

Q_2 = \{x \in R^n \mid Ax \leq b\} \,,   (4)

Q_3 = \{x \in R^n \mid c \leq Ax \leq b\} \,.   (5)
determines a point y' \in R^n and a real number B. It follows from the general
theory presented in Censor and Lent [1981], particularly from Lemma 3.1
there, that these y' and B are uniquely determined, and therefore we call
them the entropy projection of y onto H and the entropy projection coeffi-
cient associated with entropy projecting y onto H, respectively.
The entropy projection onto a hyperplane is a concrete realization of
Bregman's notion of D-projection [Bregman, 1967a; Censor and Lent, 1981].
Since D-projections include as a special case also common orthogonal pro-
jections, some of the properties of the latter carry over to entropy projec-
tions. For example, if
are two parallel hyperplanes in R^n, and y \in R^n is a given point, then
(9)
obviously holds, where PH(y) stands for the orthogonal projection of a point
y onto a hyperplane H. In this case
(10)
holds, where d_i, i = 1,2, is the distance from y to H_i and d_2' is the distance
from P_{H_1}(y) to H_2.
It turns out that Eq. (9) continues to hold if P is replaced throughout by
P̃, where P̃_H(y) is the entropy projection of y onto H, and if y \in R^n_+. In that
case, Eq. (10) is replaced by
(11 )
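A small sketch of the entropy projection onto a single hyperplane is given below. It assumes the standard multiplicative form x_j = y_j exp(beta a_j), which is consistent with the exponential iterative steps of the row-action algorithms in the next section; the root bracketing is illustrative and presumes a well-posed problem.

    import numpy as np
    from scipy.optimize import brentq

    def entropy_projection(y, a, b):
        """Entropy (D-)projection of y > 0 onto H = {x : <a, x> = b}.
        Assumed form: x_j = y_j * exp(beta * a_j), with beta fixed by
        <a, x> = b.  Returns the projected point and beta."""
        def constraint(beta):
            return np.dot(a, y * np.exp(beta * a)) - b
        lo, hi = -1.0, 1.0
        while constraint(lo) > 0:      # crude bracketing of the root
            lo *= 2.0
        while constraint(hi) < 0:
            hi *= 2.0
        beta = brentq(constraint, lo, hi)
        return y * np.exp(beta * a), beta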
3. Row-Action Algorithms
k+1, the i(k)-th row (abbreviated by i wherever possible) is used in the iter-
ation process. Several such control sequences were given by Censor
[1981 a], but for the sake of simplicity we henceforth assume that all the
algorithms used a cyclic control, that is,
Algorithm 1: MART
x_j^{k+1} = x_j^k \left( \frac{b_i}{\langle a^i, x^k\rangle} \right)^{a^i_j} , \quad j = 1,2,\ldots,n \,.   (13)

Here a^i = (a^i_j) is the ith column of A^T (the transpose of A), and the ith equa-
tion of Eq. (3) reads \langle a^i, x\rangle = b_i. Also, in Eq. (13), i = i(k), which is a
cyclic control. Convergence of MART was proved by Lent [1977] in the
following theorem:
Theorem 1
(i) Q_1 \cap R^n_+ \neq \emptyset.
(ii) All entries of A satisfy 0 \leq a^i_j \leq 1, and b_i > 0, i = 1,2,\ldots,m.
(iii) The iteration is initialized at x^1_j = \exp(-1), j = 1,2,\ldots,n.
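Under these conditions MART converges (Lent [1977]). The sketch below implements the iteration of Eq. (13) with cyclic control; the fixed number of sweeps is an illustrative stopping rule, not part of the theorem.

    import numpy as np

    def mart(A, b, n_iter=100):
        """Sketch of the MART iteration of Eq. (13), cyclic control.
        A is assumed to have entries in [0, 1] and b > 0, as in Theorem 1;
        the start x_j = exp(-1) follows condition (iii)."""
        m, n = A.shape
        x = np.full(n, np.exp(-1.0))
        for k in range(n_iter):
            i = k % m                          # cyclic control i(k)
            ratio = b[i] / np.dot(A[i], x)     # b_i / <a^i, x^k>
            x *= ratio ** A[i]                 # x_j <- x_j * ratio^{a^i_j}
        return x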
(16)
(17)
then we have the well known algorithm for norm-minimization over linear
equalities due to Kaczmarz [see for example Censor, 1981a]. The similarity
in structure between Eqs. (16) and (17) is more than coincidental. In fact,
both algorithms are special cases of a general scheme of Bregman [1967a,
Theorem 3]. From Bregman's theorem the convergence of Algorithm 2 fol-
lows under the additional assumption that the iterates x^k all stay in R^n_+.
It is well known and easy to check that if matrix A in Eq. (3) has only 0
and 1 entries, then Algorithms 1 and 2 produce precisely the same sequence
{xk} of iterates, provided they are initialized at the same point and con-
trolled by the same {i(k)} control sequence.
The relationship between these two algorithms for a general A matrix
eluded Lamond and Stewart [1981, p. 248] but has recently been investi-
gated by Censor, De Pierro, Elfving, Herman, and Iusem [1986].
b_i = \sum_{j=1}^{n} a^i_j\, x_j^{k+1} \,,   (18)

M_k = \log\frac{b_i}{\langle a^i, x^k\rangle} \,,   (19)
which then goes into the iterative step
For problem (2) with Q = Q_2 of Eq. (4), one may use the following algo-
rithm due to Bregman:
Algorithm 3
x_j^{k+1} = x_j^k \exp(c_k a^i_j) \,, \quad j = 1,2,\ldots,n \,,
                                                          (21)
z^{k+1} = z^k - c_k e^i \,,

where

c_k = \min\{z^k_i,\, B_k\} \,.   (22)
If, however, c_k = z^k_i, then Eq. (21) means that x^k is entropy-projected onto
a hyperplane H̃_i that is parallel to H_i [see, for example, Fig. 6 in Censor,
1981a]. Without writing down H̃_i explicitly, we rewrite Eq. (21) as

x^{k+1} = \begin{cases} \tilde P_{H_i}(x^k) \,, & \text{if } c_k = B_k \,, \\ \tilde P_{\tilde H_i}(x^k) \,, & \text{if } c_k = z^k_i \,, \end{cases}   (25)
for the primal iteration, where P̃ stands, as before, for entropy projection.
In this notation, if we replace P̃ by orthogonal projection P in Eq. (25) and
B_k by the distance d_k of x^k from the hyperplane H_i in Eq. (22), we obtain
the Hildreth algorithm. This is a row-action method for norm-minimization
over linear inequalities. Introduced by Hildreth [1957] and studied further
by Lent and Censor [1980], this method also follows from Algorithm 4.1 and
Theorem 4.1 in Censor and Lent [1981], if we choose f(x) = ½‖x‖² there.
Algorithm 3 again requires at each iterative step an inner-loop calcula-
tion to find the approximate value of B_k. Only with B_k at hand can the
choice of c_k in Eq. (22) be made, and the iteration proceed according to
Eq. (21). Motivated by the resemblance of MART [Eqs. (19) and (20)] and
Algorithm 2 [Eq. (14)] for the entropy optimization problem over linear
equations, we propose a new algorithm for entropy optimization over linear
inequalities. This algorithm preserves the structure of Algorithm 3 but calls
for the replacement of B_k by M_k.
Algorithm 4

x_j^{k+1} = x_j^k \exp(g_k a^i_j) \,, \quad j = 1,2,\ldots,n \,,
                                                          (26)
z^{k+1} = z^k - g_k e^i \,,

where

g_k = \min\{z^k_i,\, M_k\} \,.   (27)
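A sketch of this scheme follows; Eq. (27) is taken here as g_k = min{z_i^k, M_k}, mirroring Eq. (22) with M_k in place of B_k as the text indicates, and the initialization of x and z is an assumption.

    import numpy as np

    def algorithm4(A, b, n_iter=200):
        """Sketch of the proposed row-action scheme for entropy
        optimization over the inequalities Ax <= b.  A is assumed to have
        entries in [0, 1] with b > 0, as for MART; x starts at exp(-1)
        and the dual variables z at zero (assumptions)."""
        m, n = A.shape
        x = np.full(n, np.exp(-1.0))   # primal variables, kept positive
        z = np.zeros(m)                # one dual variable per inequality
        for k in range(n_iter):
            i = k % m                              # cyclic control
            M_k = np.log(b[i] / np.dot(A[i], x))   # Eq. (19)
            g_k = min(z[i], M_k)                   # Eq. (27), as assumed
            x *= np.exp(g_k * A[i])                # Eq. (26)
            z[i] -= g_k
        return x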
Further details about this approach may be found in Herman [1975], Herman
and Lent [1978], or Censor [1981a].
The next algorithm is capable of solving the entropy maximization prob-
lem (2) with Q = Q_3 as in Eq. (5). It follows from the development in
Censor and Lent [1981], which was inspired by the algorithm for norm-min-
imization over interval inequalities proposed by Herman and Lent [1978].
Algorithm 5

x_j^{k+1} = x_j^k \exp(h_k a^i_j) \,, \quad j = 1,2,\ldots,n \,,
                                                          (29)
z^{k+1} = z^k - h_k e^i \,,

where
                                                          (30)
and \mu_k and \rho_k are the entropy projection coefficients associated with the
entropy projection of x^k onto the hyperplanes
Algorithm 6

x_j^{k+1} = x_j^k \exp(\ell_k a^i_j) \,, \quad j = 1,2,\ldots,n \,,
                                                          (33)
z^{k+1} = z^k - \ell_k e^i \,,

where
                                                          (34)
and

F_k = \log\frac{b_i + \epsilon_i}{\langle a^i, x^k\rangle} \,,   (35)

G_k = \log\frac{b_i - \epsilon_i}{\langle a^i, x^k\rangle} \,.   (36)
4. Concluding Remarks
5. Acknowledgments
Part of this work was done while the first two authors were visiting the
Medical Image Processing Group, Department of Radiology, Hospital of the
University of Pennsylvania, Philadelphia, and was supported by NSF Grant
ECS-8117908 and NIH Grant HL 28438. Further progress was made during a
visit of the first author to the Department of Mathematics at the University
of Linköping, Sweden; we are grateful to Professors Åke Björck and Kurt
Jörnsten and to Dr. Torleiv Orhaug for making this visit possible. We thank
Mrs. Anna Cogan for her excellent work in typing the manuscript.
6. References
Frieden, B. R. (1980), "Statistical models for the image restoration problem,"
Computer Graphics and Image Processing 12, pp. 40-59.
Herman, G. T., and A. Lent (1978), "A family of iterative quadratic optimi-
zation algorithms for pairs of inequalities, with application in diagnostic
radiology," Mathemat. Programming Study 9, pp. 15-29.
Lent, A. (1977), "A convergent algorithm for maximum entropy image resto-
ration with a medical x-ray application," in Image Analysis and Evalua-
tion, R. Shaw, ed., Society of Photographic Scientists and Engineers
(SPSE), Washington, D.C., pp. 238-243.
*Reprinted from Journal of the Optical Society of America 73, pp. 1501-1509 (November 1983).
1. Introduction
g_i = \iint h_i(x,y)\, f(x,y)\, dx\, dy \,,   (1)
where the hi are the weighting functions and i = 1,2, ••• ,N for N individual
measurements. We refer to the hi as response functions. In the CT problem
the hi typically have large values within a narrow strip and small or zero
values outside the strip. If the hi are unity within a strip and zero outside,
Eq. (1) becomes a strip integral. For zero strip width, it becomes a line in-
tegral. These last two cases are recognized as idealizations of the usual
physical situation. The generality of Eq. (1) allows it to closely represent
actual physical measurements since it can take into account response func-
tions that vary with position. Note that Eq. (1) is applicable to any dis-
cretely sampled, linear-imaging system. Thus the concept of null space and
the Bayesian methods proposed for overcoming its limitations are relevant to
a large variety of image-restoration problems.
The unknown function f(x,y) is usually restricted to a certain class, for
example, the class of all integrable functions with compact support. Con-
sider the Hilbert space of all acceptable functions and assume that all the hi
belong to that space. Equation (1) is an inner product of hi with f. Thus
the measurements gi may be thought of as a projection of the unknown
vector f onto the response vector hi. Only those components of f that lie on
the subspace spanned by the set of all hi contribute to the measurements.
We call this subspace the measurement space. The components of f in the
remaining orthogonal subspace, the null space, do not contribute to the
measurements. Hence the null-space contribution to f cannot be determined
from the measurements alone. Since the deterministic (measurement) sub-
space of f is spanned by the response functions, it is natural to expand the
estimate of f in terms of them:

\hat f(x,y) = \sum_{i=1}^{N} a_i\, h_i(x,y) \,.   (2)
null space associated with limited data and introduced a method to eliminate
null-space ghosts through the application of known constraints on the re-
constructed image. Further references on the limited-angle CT problem may
be found in Ref. 1.
The restriction of deterministic solutions to the measurement space
should not be viewed as a negative conclusion. Rather, it is simply a state-
ment of what is possible for a given set of measurements in the absence of
further information. It allows one to state formally the goal in an improved
limited-angle CT reconstruction as that of estimating the null-space contri-
bution through the use of further information about the function to be
reconstructed.
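The decomposition into measurement space and null space can be made concrete with the following sketch, which uses a random response matrix of illustrative size.

    import numpy as np

    # Discretized version of Eq. (1): g = H f, with far fewer measurements
    # than pixels, so H has a large null space.  Sizes are illustrative.
    rng = np.random.default_rng(0)
    n_meas, n_pix = 20, 100
    H = rng.random((n_meas, n_pix))      # response functions as rows
    f_true = rng.random(n_pix)

    # Orthonormal bases of the measurement space (row space of H) and of
    # the null space, from the SVD.
    _, _, Vt = np.linalg.svd(H, full_matrices=True)
    meas_basis = Vt[:n_meas]
    null_basis = Vt[n_meas:]

    f_meas = meas_basis.T @ (meas_basis @ f_true)   # component seen by the data
    f_null = null_basis.T @ (null_basis @ f_true)   # invisible component

    # The null-space component contributes nothing to the measurements.
    assert np.allclose(H @ f_null, 0.0, atol=1e-10)
    assert np.allclose(H @ f_true, H @ f_meas)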
3. Bayesian Solution
(4)
f^0 = T \,,   (5a)

f^{n+1} = f^n + c_n r^n \,,   (5b)

c_n = \frac{r^{nT} s^n}{s^{nT} s^n} \,,   (5d)

                                                          (5e)
where vector rn is the residual of Eq. (4) (multiplied by Rf) and the scalar
cn is chosen to minimize the norm of rn+l. This iterative scheme is similar
to the one proposed by Hunt [22] for nonlinear MAP-image restoration. We
have found that this technique works well, although convergence typically
requires 10 to 20 iterations.
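A sketch of this iteration is given below. Following the description above, the residual is taken as the residual of the MAP equation multiplied by R_f; the definition of s^n [Eq. (5e)] is not reproduced here, so it is assumed to be the corresponding change in the residual, which makes c_n of Eq. (5d) the norm-minimizing step. T is written as f_bar.

    import numpy as np

    def map_iteration(H, g, f_bar, R_f, R_n_inv, n_iter=20):
        """Sketch of the iterative scheme of Eqs. (5); the forms of the
        residual and of s^n are assumptions as noted above."""
        def weighted_residual(f):
            # Residual of the MAP equation, multiplied by R_f.
            return R_f @ (H.T @ (R_n_inv @ (g - H @ f))) - (f - f_bar)
        f = f_bar.copy()                      # Eq. (5a)
        for _ in range(n_iter):
            r = weighted_residual(f)          # assumed content of Eq. (5c)
            # Change in the weighted residual produced by a unit step along r.
            s = R_f @ (H.T @ (R_n_inv @ (H @ r))) + r
            c = (r @ s) / (s @ s)             # Eq. (5d)
            f = f + c * r                     # Eq. (5b)
        return f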
It is easy to see from the form of this iterative procedure that signifi-
cant null-space contributions to f can arise from the a priori information.
First, the zero-order estimate is T, which can contain null-space contribu-
tions. Second, in Eq. (5c), Rf can generate null-space contributions when it
operates on the results of the backprojection (H^T) process, which lies wholly
in the measurement space. Rf, in effect, weights the backprojection of the
measurement residuals. If Rf is chosen as zero in certain regions of the
reconstruction, these regions will not be changed throughout the iteration
scheme. It must be emphasized that the choices for T and Rf are exceed-
ingly important since it is only through them that a nonzero null-space con-
tribution to the reconstruction arises. As was stated earlier, this is the
major advantage of the Bayesian approach over deterministic algorithms.
Trivial choices, such as using for T a constant or a filtered backprojection
reconstruction based on the same projections or assuming Rf to be propor-
tional to the identity matrix [22,23], are not helpful for reducing artifacts.
The iteration scheme given by Eqs. (5) is SIRT-like [11] in that the
reconstruction update [Eq. (5b)] is accomplished only after all the ray sums
(Hfn) have been calculated. It is known that ART-like algorithms converge
much faster than SIRT-like ones [11]. Herman et al. [23] have proposed
an ART-like reconstruction algorithm that converges to the solution of the
MAP equation [Eq. (4)] under the assumption that Rf and Rn are propor-
tional to the identity matrix. This algorithm is worth exploring, as it is
likely to converge much more rapidly than the one used here. However, as
was stated above, their algorithm should be extended to include nontrivial
choices for Rf. The iterative scheme used here, although it may be slower
than necessary, does provide a solution to the MAP equation, which is the
important thing. We have found its convergence to be such that the norm
of the residual [Eq. (5c)] behaves as the iteration number raised to the -1.5
to -2.0 power.
past, the ART algorithm [8] was used in the second iterative step of FAIR,
other iterative algorithms, such as the MAP algorithm above, can be used
advantageously (see Section 6). The iterative reconstruction procedure
forces the result to agree with the measurements through alteration of its
measurement-space contribution. The null-space contribution to the FAIR
reconstruction arises solely from the parametric model fitted in the first
step and hence from the a priori information used in specifying the model.
5. Results
Figure 1. (a) Source distribution used for the first example. (b) ART and
(c) MENT reconstructions obtained using 11 views covering 90° in projection
angle. Unconstrained ART was used, whereas MENT has an implicit nonneg-
ativity constraint.
the annulus and small (0.2) inside and outside [Fig. 2( a)]. Since noiseless
projections were used, the measurement noise was assumed to be uncorre-
lated, constant, and low in value. The resulting MAP reconstruction
[Fig.2(b)] is vastly superior to the ART and MENT results, eliminating
Figure 3. Angular dependence of the maximum values along various radii for
the ART, MAP, and FAIR reconstructions in Figs. 1 and 2 compared with that
for the original function [Fig. 1 (a)]. This graph quantitatively demonstrates
the improvement afforded by MAP and FAIR.
the source function and its actual distribution. This source function is the
same as Fig. 1(a) with a narrow, 0.6-amplitude, 2-D Gaussian added outside
the annulus at 330° and a broad, 0.1-amplitude Gaussian added underneath
the annulus at 162°. The reconstructions obtained using the same assump-
tions as above are shown in Figs. 4(b) through 4(d). Both MAP and FAIR
handle the inconsistencies similarly. The angular dependence of the maxi-
Figure 4. (a) Source distribution that does not conform to the annular
assumption. (b) ART, (c) MAP, and (d) FAIR reconstructions obtained from
11 views subtending 90°. Both MAP and FAIR tend to move the added
source outside the annulus onto the annulus. However, they provide indica-
tions in the reconstructions that there is some exterior activity.
Figure 5. Angular dependence of the maximum values in the MAP and the
FAIR reconstructions of Fig. 4.
The final example is the reconstruction of Fig. 1(a) from noisy data.
The same 11 projections were used as before but with random noise added
with an rms deviation of 10% relative to the maximum projection value. The
reconstructions in Fig. 6 demonstrate that both MAP and FAIR simply yield
noisy versions of those obtained from noiseless projections. There is no
disastrous degradation, as would be expected for algorithms based on analytic
continuation [34,35]. Although the FAIR result appears to be much noisier
than the MAP reconstruction, they both possess nearly identical noise in the
6. Discussion
7. Acknowledgments
This work was supported by the U.S. Department of Energy under con-
tract W-7405-ENG-36. The authors wish to acknowledge helpful and inter-
esting discussions with James J. Walker, William Rowan, Michael Kemp,
Gerald Minerbo, Michael Buonocore, Barry Medoff, and Jorge Llacer.
8. References
Paul B. Kantor
This problem was suggested by Cooper [1978] and Cooper and Huizinga
[1982]. Their work is not very technical and, in elaborating the mathe-
matics [Kantor, 1983a, b, c], I thought, 'Boy, I am really discovering the
wheel!' Sitting at this conference for three and a half days, I am now sure
that everything I have to say will be well known to many of you, or wrong,
or both.
The problem: There is a collection of retrievable items-meaning 'all
the books in the library,' 'all the articles in some literature,' or 'all the
facts in some intelligence data base.' The number of items is large. A real
human being, which in the library business we call a 'patron (P),' submits a
query (Q) and we assume that this pair, (P,Q), defines a point function on
all these retrievable items x belonging to a set X, which might be called the
value of the item x, as an answer to that query, for that person.
We assume that the pair (P, Q) defines a point function v (value; utility;
chance of relevance) with values in the closed interval [0,1]:
The items have been examined, and are equipped with 'descriptor vectors'
that indicate whether they have or lack various properties a,b,.... The
subset of elements having properties a and b but not c is denoted by abc'.
The entire 'reservoir' of elements that we denoted by X is the union (we
use u to represent logical union); for example, for three properties,
You can of course argue that this isn't true. For example, maybe the value
does not reside in individual items, but in suitable subsets of items. But
perhaps it's true enough to take as a starting point. The value can also be
called a 'utility,' 'probability or chance of relevance,' and so forth. I am
going to work in a framework where we assume these have been normalized
so that they lie between zero and one (the usual arguments for scaling util-
ity into a compact interval apply here). From now on, I'll just talk about
v(x), the value of an item, and suppress P and Q.
A retrieval system, if I may speak anthropomorphically, 'wants' to pro-
vide most valuable items first because there are so many. In other words, it
wants to find, among all the 2N components of the reservoir, the ones for
which the values are highest.
The items have previously been examined as they were installed in the
system. In principle they might be undergoing continual re-examination by
some computer program. In any case, they are equipped with descriptor
vectors. They might have a property a or b or c, etc. In conventional bib-
liographic data bases, like the on-line retrieval systems, properties are the
presence or absence of a particular keyword. You can do more, though.
You can talk about the number of times a keyword occurs, or you can define
measures of relatedness between documents and use them as descriptors. In
any case, a document will or won't have a property a, b, or c, and so forth.
The general row is called alpha, and by introspection the system knows
exactly how many items there are in each row. What it doesn't know is how
many of them have a particular value. Whatever the complexity, we con-
ceive of a list as shown in the figure. Somewhat dangerously, we use
lower-case n to represent the fraction of the reservoir having each descrip-
tor; n is not an integer:
\sum_a n(a) = 1 \,.   (3)
Constraints K_1, K_2, \ldots apply to rows R_1, R_2, \ldots
The sum of the number of items in the row times the average value of the
items in the row summed over all the rows that the constraint applies to is
the total value that we expect to find associated with that condition. A
typical example would be to say that all the items that have descriptor a
These can be given (terribly) as a priori estimates (that is, the patron P
says, "I think that items reported since 1982 have a 90% chance of being
relevant to my search"). More persuasively (I think this point of view has
been pioneered by Salton), this kind of information should be derived by a
feedback process: As you pull things out, you show them to the ultimate
evaluator and take note of whether he or she says they are good or not.
That's conceptually compelling. I don't know whether it's really practical
because it takes most of us a little while to decide on relevance. In any
case, let K(a) be the set of constraints that apply to the row a:
The nice suggestion that Cooper and Huizinga made was: Given these
kinds of clues and the need to have a definite answer, postulate that the
distribution of v, the value (I talk of it as a distribution of v although it's
really a distribution of items into cells with corresponding values), is pre-
cisely the one that maximizes the entropy.
I'll talk a little bit at the end of the talk about why this might plausibly
be true, but I'm not a convert, and I really feel, in this particular situa-
that can easily be grabbed by the blade. It is a way of putting some fact in
and pretending you really firmly believe in it, therefore muddling it up with
constraints. Certainly at this point there isn't anything that I believe that
firmly.
The steps of the mathematics will be very clear. The entropy is sensi-
bly broken down into a part over which we have no control, that is, the dis-
tribution of items among the components, and the part that is the expected
value over all the components of the entropy of the individual components:
S = -\sum_{\text{cells } c} q(c)\,\ln q(c)
  = -\sum_{\text{rows } a} n(a)\,\ln n(a) + \sum_a n(a) \sum_{c \in a} [-p(c)\,\ln p(c)] \,,   (6)

where the inner sum is the entropy S(a) of row a.
For v \in \{0,1\},

p(c) = \left\{ \frac{1}{1 + e^{r(a)}},\; \frac{e^{r(a)}}{1 + e^{r(a)}} \right\} ,   (7)

\bar v(r(a)) = \bar v(a) = \frac{e^{r(a)}}{1 + e^{r(a)}} \,,   (8)

p(a;c) = \frac{e^{r(a)\, v(c)}}{\int dv\; e^{r(a)\, v}} \,.   (9)
The "value function" for each component is a function that occurred very
long ago in statistical mechanics of a dipole in a magnetic field:
\bar v(r) = \frac{1}{2}\left[1 + \coth\frac{r}{2}\right] - \frac{1}{r} \,.   (10)
Lee Schick discusses a more general case in his paper in this volume. But
whatever the form of the function v, the well known result is that maximiz-
ing the entropy subject to the constraints leads to the result that the "rich-
ness" of the row a is just the sum of the Lagrange multipliers (with their
signs changed) corresponding to all the constraints that apply to the row
r(a) = -\sum_{k \in K(a)} \lambda_k \,.   (11)
I think that any of the methods we've heard described could be applied
to it. I differ from everyone else in not having done a single computer run.
I currently believe that something called the homotopy method will be very
effective [Zangwill and Garcia, 1981]. In this technique we take a problem
to be solved (here the problem is completely controlled by the vector of
constraints, V) and try to continuously deform it into a problem that we can
solve on the back of an envelope. When we have that one solved, we just
re-deform it back in the other direction, keeping careful track of the solu-
tion. The idea is that, step by step, we will have a very good starting point
for whatever technique we apply to solve these rather horribly nonlinear
equations.
In this particular case it is easy to find a starting point. If all the λ's
are 0, then it is easy to calculate what all the Y_k are. And so we know the
solution for those particular constraints:

Y_k^0 = \sum_{a \in R(k)} n(a) \cdot \tfrac{1}{2} \,.   (13)

Solution: all \lambda_k = 0.
We then deform all the Y_k(t), varying t from zero to one.
Homotopy method: the deformation of the left-hand side of the equations is

Y_k(t) = (1-t)\, Y_k^0 + t\, Y_k \,.   (14)
I am hopeful that this will provide really quick solutions. There is a real-
time constraint on this because, if it is to work, it has to work while the
patron is at the keyboard expecting the computer to respond instantly. We
can put a little dancing demon on the screen to hold his attention for a
while, but even early results show that although people are tickled pink with
using computers for information retrieval, they are very impatient. To
really be successful it has to work in the order of two or three minutes.
[Note added March 1985: We have a Pascal program running on an 8088 PC
with an 8087 math chip that traces solutions for five constraints in a few
seconds.]
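A minimal sketch of this continuation idea (not the Pascal program mentioned in the note) is given below, using the value function of Eq. (10), the richness of Eq. (11), and the linear deformation of Eq. (14); scipy's fsolve stands in for whatever root finder one prefers, and the constraint values are assumed to be attainable.

    import numpy as np
    from scipy.optimize import fsolve

    def vbar(r):
        """Mean cell value for richness r, Eq. (10); equals 1/2 at r = 0."""
        r = np.asarray(r, dtype=float)
        safe = np.where(np.abs(r) < 1e-6, 1.0, r)
        val = 0.5 * (1.0 + 1.0 / np.tanh(safe / 2.0)) - 1.0 / safe
        return np.where(np.abs(r) < 1e-6, 0.5, val)

    def solve_by_homotopy(n_a, member, Y_target, steps=10):
        """Trace the multipliers from the trivial problem (all lambda = 0,
        Eq. (13)) to the given constraint values, deforming as in Eq. (14).
        n_a      : fractions n(a) of the reservoir in each row
        member   : member[k, a] = 1 if constraint k applies to row a
        Y_target : the given constraint values Y_k
        Sign convention r(a) = -sum_k lambda_k follows Eq. (11)."""
        M = np.asarray(member, dtype=float)
        lam = np.zeros(M.shape[0])
        Y0 = M @ (n_a * 0.5)                                 # Eq. (13)
        def residual(lam, Y):
            r = -(M.T @ lam)                                 # Eq. (11)
            return M @ (n_a * vbar(r)) - Y
        for t in np.linspace(0.0, 1.0, steps + 1)[1:]:
            Yt = (1.0 - t) * Y0 + t * np.asarray(Y_target)   # Eq. (14)
            lam = fsolve(residual, lam, args=(Yt,))          # warm start
        return lam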
There are some real objections to applying the maximum entropy princi-
ple to an information reservoir, which are immediately raised by any librari-
ans who understand what the maximum entropy principle means. They say:
"We're not dumb, and we've worked really hard to put all these things in
there right. How can you possibly argue that you are going to do better by
using some technique that is used with molecules scattered around the
room?" This is disturbing. In the case of molecules, dynamical theory has
advanced to a point where even those of us who are determinists can be
shown that maximum entropy will hold. Furthermore, I have never inter-
viewed a molecule, and it doesn't exhibit any volition.
But how can we feel this way about activities of human beings, and the
classification of items put into the reservoir? There are three different
arguments that encourage me.
The first goes back to Emile Durkheim who, I think, is regarded as the
founder of sociology. He pointed out that human beings, with all of their
volition and intelligence, behave in horrifyingly predictable ways. The
example he gave was the number of suicides in France. He argued very
convincingly that these are not the same people prone, year after year, to
plunge into the Seine. Instead, there are underlying regularities that govern
the behavior of the French. I believe it is the most elegant argument that
has ever been given in the social sciences.
The second argument is rather technical. The terminology, descriptors,
and what constitutes a sensible query change fairly rapidly. Think of the
descriptor "solid-state," which might now be used to retrieve the catalog of
an expensive audio house but, 15 years ago, would only have been used to
retrieve things from technical journals. Because descriptors and queries are
"moving" all the time, it may well be that the kind of shaking up, the mixing
that goes on in a dynamical system, is actually going on in the very well in-
tentioned efforts to assign exactly the right descriptors to the retrievable
items.
The third argument is the result proved by Van Campenhout and Cover
[1981], which shows that without "believing" anything about minimum
cross-entropy one can show that if the average of independent identically
distributed (i.i.d.) variables is constrained, then the marginal distribution of
any one of them turns out to be the same as that given by the entropy prin-
ciple. Since I am willing to believe that the values in the cells are i.i.d.
and they seem to have a constraint on their sum, this argument makes it
easier for me to behave like a believer. [Note added in proof: The result
can be derived, at least heuristically, from the following line of argument:
If there are N i.i.d. variables with a constrained sum, then the distribution
of any one of them, say the first, is the same as the distribution of the con-
straint minus the sum of all the others. But the sum of all the others is
approximately governed by the normal distribution (by the central limit the-
orem), and when you do the arithmetic you find the maximum entropy
result. ]
I always talk fast and apparently discourage questions, so I have 3 min-
utes to speculate. I don't really have 3 minutes worth of speculation, but I
have a lot of hopes. In the course of doing this, and doing it with my back-
ground as a physicist, I couldn't help noticing that after pages of arithmetic
I was rediscovering relations of the form
dV = TdS - dW (15)
Acknowledgment
References
Cooper, William S., and P. Huizinga (1982), "The maximum entropy principle
and its application to the design of probabilistic retrieval systems," Inf.
Technol. Res. Devel. 1, pp. 99-112.
Kantor, Paul B. (1983a), "Maximum entropy and the optimal design of auto-
mated information retrieval systems," Inf. Technol. 2(2), pp. 88-94.
Van Campenhout, Jan M., and Thomas M. Cover (1981), "Maximum entropy
and conditional probability," IEEE Trans. Inf. Theory IT-27(4), pp.
483-489.
Lee H. Schick
1. Introduction
We have heard in the past few days the details of many elegant theo-
ries. In this section, I shall not bother with such intricacies, but will con-
tent myself with a rather broad outline. In the words of that great philoso-
pher John Madden, "Don't worry about the horse being blind. Just load up
the wagon." Well, just substitute "theory" for "horse" and "computer" for
"wagon" and you have the idea.
One may divide all problems into two types, direct and inverse. A
direct problem is "easy," is a problem in deduction, and is thus solved first.
To each such problem there corresponds an inverse problem that is "hard," is
a problem of inference, and is thus solved later, if at all. Two examples of
these types of problems are as follows:
Direct Problem #1
Q: What are the first four integers in the sequence 2 + 3n,
n = O,1,2, ••• ?
A: 2,5,8,11
Inverse Problem #1
Q: If "2, 5, 8, 11" is the answer, what is the question?
A: What are the first four stops on the Market Street subway?
Direct Problem #2
Q: What did Mr. Stanley say to Dr. Livingston?
A: "Dr. Livingston, I presume."
Inverse Problem #2
Q: If "Dr. Livingston, I presume" is the answer, what is the
question?
A: What is your full name, Dr. Presume?
As amusing as these examples may, or may not, be, they do illustrate three
important properties of inverse problems: (1) In both inverse problems, the
answer was certainly not unique. This is the case in many such problems.
(2) The first inverse problem might have been easily solved by a Philadel-
phian used to riding that city's subway system. In other words, prior infor-
mation is important in solving the inverse problem. (3) In the second
inverse problem, the correct answer is made even more difficult to infer
than it might otherwise have been because there is misinformation in the in-
formation given to us; that is, the answer for which we were asked to infer
the question should have been written: "Dr. Livingston I. Presume." In
other words, the presence of noise in the data can influence the answer to
the inverse problem.
I shall now consider a simplified quantum mechanical inverse scattering
problem that embodies considerations (2) and (3).
We are given the prior information-that is, information made available
before the experiment in question is performed-that we are dealing with
the quantum problem of a particle of mass m being scattered by a central
potential V( r) that satisfies V( r) = 0 for all r > 8 fm and V( r) > 0 for all r <
8 fm.
In addition, an experiment has been performed, the results of which are
a set of noisy Born scattering amplitudes for a set of values of momentum
transfer q. Thus the data D(q) are given by
D(q) = -\frac{2m}{\hbar^2 q} \int_0^{\infty} V(r)\, \sin(qr)\, dr + e(q) \,,   (1)

D_i = \sum_{j=1}^{M} F_{ij}\, V_j + e_i \,.   (2)
Finally, since the only values of Vi that contribute to the sum are (from
our prior information) all positive, we may consider the distribution of Vj's
to be a distribution of probabilities Pi. This reduces the problem to the type
of image restoration problem analyzed by Gull and Daniell [1978] using a
MEM applied to the Pi. The prior distribution of the Pi's is assumed to be
uniform for r < 8 fm, zero for r > 8 fm, and of course normalized to 1, con-
sistent with the prior information on the V( r). The entropy of the distribu-
tion of the Pi's is maximized subject to the constraint that the χ² per degree
of freedom, given by χ²/M, with
(3)
(4 )
could be varied by varying the center and width of the distribution of stand-
ard deviations from which the σ_i were chosen.
Figures 1 through 4 are the results of sample calculations. These cal-
culations were carried out for synthetic data derived from an exponential
potential, exp(-r/r_0), with r_0 = 1 fm. The momentum of the particle was
taken to be 200 MeV/c. Ninety values of q_i were chosen by varying the
scattering angle from 20° to 160°. Ninety values of the r variable from 0
to 8 fm were used to evaluate the radial integrals.
Even for the rather poor data of Fig. 3(a), the reconstruction of the
exponential potential is quite good. There is little improvement in the re-
construction in going from Fig. 3 to Fig. 4, even though the data, in the
sense of SNR, are much better. The reason is that the maximum value of
momentum transfer used puts a natural limit on the accuracy of the recon-
struction for r < 1 fm.
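The following sketch sets up the discretized kernel of Eq. (2) and synthetic data of the kind described above; the constant 2m/ħ² is dropped (arbitrary units), and the noise level is illustrative. The maximum-entropy reconstruction step itself is not shown.

    import numpy as np

    hbar_c = 197.327            # MeV fm
    momentum = 200.0            # MeV/c
    angles = np.linspace(20.0, 160.0, 90) * np.pi / 180.0
    q = 2.0 * momentum * np.sin(angles / 2.0) / hbar_c   # momentum transfer, fm^-1
    r = np.linspace(0.0, 8.0, 90)
    dr = r[1] - r[0]

    # Discretized Born kernel F_ij (constant factor omitted).
    F = -np.sin(np.outer(q, r)) / q[:, None] * dr
    V = np.exp(-r / 1.0)        # exponential potential, r_0 = 1 fm
    noise = 0.001 * np.random.default_rng(1).standard_normal(q.size)
    D = F @ V + noise           # noisy Born amplitudes, Eq. (2)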
[Figures 1 through 4: plots of the scattering amplitudes (fm) versus momentum transfer q (MeV/c), and of the reconstructed potentials versus r (fm).]
Here, Pr(i) is the probability that at the ith time-point the trace amplitude
has the value q_0 r. It is important to realize that Pr(i) is not a joint proba-
bility, but rather for each time-point i there is a separate normalization
condition
\sum_{r=-R}^{R} P_r(i) = 1 \,.   (6)
ti = qi + ni , (8)
\chi^2 = \sum_{i=1}^{N} \frac{(t_i - q_i)^2}{\sigma_i^2} = \chi_c^2 \,,   (9)
where χ_c² is the value of χ² that is assumed known from whatever our confi-
dence level in the data is. More generally, for some other type of noise dis-
tribution, Eq. (9) will be replaced by
(10)
q_i = q_0 \sum_{s=1}^{R} (s - d_i)\, P_{si} \,,   (11)

where s is an integer, P_{si} is the probability that q_i has the value q_0(s - d_i),
and the d_i are a known set of constants. (For example, M = 2R+1, d_i = M-1,
for all i, yields the exact form of Eq. (5) with the r = 0 term included in the
right-hand side.) The result of the MEM with Eq. (10) as an additional con-
straint is
straint is
q_i - q_0 d_i = \frac{q_0}{2}\,\coth\!\left(\frac{a_i q_0}{2}\right) - \frac{M q_0}{2}\,\coth\!\left(\frac{a_i M q_0}{2}\right) \,,   (12)

where
                                                          (13)
qi = qo (2Pi - 1) (14 )
Except for the constant term, this is just the form of regulator that has been
used many times in the literature [Lindley and Smith, 1972; Jaynes, 1974].
But it must be noted, in contradistinction to this earlier work, that the q_i's
in Eq. (16) are not probabilities; that is, they do not have to be positive and
they do not satisfy a normalization condition.
4. References
Gary Erickson
Department of Electrical Engineering, Seattle University, Seattle, WA 98122
1. Introduction
2. Formulation
H = \frac{1}{2m}\sum_{i=1}^{N} p_i^2 + \sum_{i<j=1}^{N} V(q_{ij}) = K + V \,.   (1)
The equilibrium phase-space probability distribution ρ(r_1, ..., r_N; p_1, ..., p_N)
describes the state of the system and is the main quantity of interest. With
the normalization
(2)
(3)
(4)
VARIATIONAL METHOD FOR CLASSICAL FLUIDS 297
where κ is the Boltzmann constant. If Eqs. (2) and (4) are the only con-
straints on ρ, then maximization of the entropy in Eq. (3) subject to these
constraints leads to the familiar Gibbs canonical distribution for ρ. This is
not the problem of interest in this paper.
The form of the Hamiltonian in Eq. (1) suggests the following decompo-
sition of the phase-space distribution function [March, 1968]:
(5 )
where
\rho_K = \prod_{i=1}^{N} a(p_i)   (6)
and
\rho_V = \frac{1}{Z} \prod_{i<j}^{N} f(r_{ij}) \,,   (7)

\int d^{3N}r\; \rho_V = 1 \,.   (9)
The constant Z in Eq. (7) will be determined later; in turn, the fij = f(rij)
must satisfy Eq. (9).
We wish to express S and \langle E\rangle in Eqs. (3) and (4) in terms of a(p) and
f( r); it will turn out that we shall also need the radial distribution function.
We start with the k-particle reduced distribution function, defined [Balescu,
1975] by
\int d^{3N}r \prod_{i<j} f_{ij}
Finally, using the above results and definitions, we obtain for the entropy
and average energy
\langle E\rangle = \frac{N}{2m} \int d^3p\; p^2\, a(p) + \frac{N n}{2} \int d^3r\; g_2(r)\, V(r) \,.   (15)
(16)
(17)
Substitution of Eq. (17) into Eq. (16) breaks the hierarchy and leads to the
following Born-Green equations as a formal solution for g2(r) [March,
1968] :
( 18)
where
(20)
where we used
(21 )
Finally, we replace g2(S) in the integral of Eq. (20) by its asymptotic value
in Eq. (21) to obtain
where
y(r) = \ln\!\left[\frac{g_2(r)}{f(r)}\right] .   (23)
Using these results in Eq. (18), we obtain the HNC approximation for g(r):
y(r) = n \int d^3r'\, [g_2(r') - 1]\, [g_2(|r - r'|) - 1 - y(|r - r'|)] \,.   (24)
There are three quantities, a(p), f(r), and g_2(r), that we wish to deter-
mine in some optimal fashion. As mentioned earlier, we wish to make use of
the principle of maximum entropy. If we maximize the entropy in Eq. (14)
subject to the constraints in Eqs. (8), (9), and (15), we obtain the standard
canonical distribution for p in Eqs. (5) to (7) with
The radial distribution function g_2(r) is still given by Eqs. (10) and (12) with
k = 2 where f( r) is obtained from Eq. (25). In the following we wish to
consider the entropy maximization considered above with the additional
constraint provided by the integral equation in Eq. (24). Using Eq. (24) as a
constraint allows us to maximize the entropy by varying independently not
only a(p) but also f(r) and g_2(r). Maximization of the entropy in Eq. (14)
subject to the constraints in Eqs. (8), (9), (15), and (24) can be achieved by
considering the unrestricted variation of the functional defined by
where \alpha_K, \alpha_V, \beta [cf. Eq. (25)], and \lambda(r) are Lagrange multipliers, and
By setting the variations of t with respect to a(p), U( r), and gz( r), respec-
tively, equal to zero, we obtain Eq. (26) for a(p) and
-U(r) + \alpha_V - \frac{n}{2}\ln Z + \frac{n\beta\kappa}{2}\, y(r) + \frac{\lambda(r)}{2\, g_2(r)}
We can solve for the Lagrange multiplier \lambda(r) by introducing the following
Fourier transforms:

y(k) = \int d^3r\; y(r)\, e^{i k\cdot r} = \frac{[S(k) - 1]^2}{n\, S(k)} \,.   (33)
Using the normalization in Eq. (7) to determine \alpha_V and with these results,
we can solve Eq. (30) for f(r). The result is

where

U(r) = V(r) - \frac{2}{(2\pi)^3 n} \int d^3k\; e^{-i k\cdot r}\, W(k)   (36)

and

W(k) = \frac{[S(k) - 1]\,[S(k) + 1]\,[S(0) - S(k)]}{[S(k)]^2} \,.   (37)
From the well known properties of the structure factor S(k) [Balescu,
1975], we observe that Eqs. (35) through (37) imply

f(r) \longrightarrow 1 \quad \text{as } r \to \infty \,,   (38)

as they should.
Next, we eliminate f(r) from Eqs. (34) through (36) to obtain the fol-
lowing integral equation for g2( r):
where
and y(k) is given by Eq. (33). We can cast Eq. (39) in the alternative form
where y(k) is given by Eq. (33) and W(k) is given by Eq. (37).
T^* = \kappa T/\epsilon \,, \qquad n^* = n a^3 \,, \qquad r^* = r/a \,.   (43)
Figures 1 and 2 show typical solutions of Eq. (39) for g2( r) and the cor-
responding structure factor [see Eq. (31)] for n* = 0.4, T* = 1.25. These
solutions were obtained by integration using a method by Krylov and Skoblya
[Davis and Rabinowitz, 1975]. These curves are similar in form to curves
obtained earlier [see Temperly et al., 1968]. For densities n* > 0.4 and
temperatures T* < 1.1, we are unable to obtain convergent solutions to
Eq. (39), even though the HNC equations possess solutions outside these
ranges. In our view, this indicates that the HNC approximation should not
be used for n* > 0.4 and T* < 1.1.
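Because Eq. (39) itself is not reproduced above, the sketch below illustrates only the general kind of fixed-point iteration such an integral equation requires, using the standard HNC closure together with the Ornstein-Zernike relation for a Lennard-Jones fluid. It is not the Krylov-Skoblya quadrature cited in the text, and the supercritical state point (n* = 0.4, T* = 1.5), grids, and mixing parameter are assumptions chosen only so that this crude Picard scheme converges.

    # Picard iteration for the standard HNC / Ornstein-Zernike equations of a
    # Lennard-Jones fluid.  Illustrative only: state point, grids, mixing, and
    # the simple quadrature transforms are assumptions, and this is not the
    # Krylov-Skoblya method used by the authors.
    import numpy as np

    n_star, T_star = 0.4, 1.5
    beta = 1.0 / T_star

    Nr = 1024
    r = np.linspace(1e-3, 12.0, Nr)
    k = np.linspace(1e-3, 40.0, Nr)
    dr, dk = r[1] - r[0], k[1] - k[0]

    V = 4.0 * ((1.0 / r)**12 - (1.0 / r)**6)       # LJ potential, sigma = eps = 1
    sin_kr = np.sin(np.outer(k, r))                # shared kernel for both transforms

    def ft3d(f):        # radial 3-D Fourier transform, f(r) -> f(k)
        return 4.0 * np.pi / k * (sin_kr @ (r * f)) * dr

    def ift3d(fk):      # inverse transform, f(k) -> f(r)
        return (sin_kr.T @ (k * fk)) * dk / (2.0 * np.pi**2 * r)

    c = np.exp(-beta * V) - 1.0                    # start from the Mayer function
    for _ in range(2000):
        c_k = ft3d(c)
        gamma_k = n_star * c_k**2 / (1.0 - n_star * c_k)   # OZ: gamma = h - c
        gamma = ift3d(gamma_k)
        c_new = np.exp(-beta * V + gamma) - 1.0 - gamma    # HNC closure
        if np.max(np.abs(c_new - c)) < 1e-8:
            break
        c = 0.2 * c_new + 0.8 * c                          # mixing for stability

    g2 = np.exp(-beta * V + gamma)                 # radial distribution function
    S = 1.0 + n_star * ft3d(g2 - 1.0)              # structure factor, cf. Eq. (31)
    print("first peak of g2:", float(np.max(g2)))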
To broaden our understanding of this method and this problem, we have
undertaken similar studies for the Percus-Yevick equation. Detailed calcu-
lations for thermodynamic quantities arising from the optimal correlation
functions of both the HNC equation and the Percus-Yevick equation [see
also Levesque, 1966] are also in progress.
Figure 1. The radial distribution function g₂(r) as a function of r. [Plot not reproduced.]
Figure 2. The structure factor as a function of wave number (in Å⁻¹) determined by solving Eqs. (39) and (31) for n* = 0.4 and T* = 1.25 (solid line). The HNC solution (dashed line) is shown for comparison. [Plot not reproduced.]
4. References
N. C. Dalkey
2. Inductive Inference
Σ_E P(e) S(R,e) ≤ Σ_E P(e) S(P,e) .   (1)
That is, the expectation of the score is a maximum when the correct proba-
bility distribution is asserted. In the context of personalistic, or subjective,
theories of probability, where the notion of correct probability distribution
is not always clear, Eq. (1) is frequently introduced in terms of an honesty
promoting score. That is, if an estimator believes P, then his subjective
expected score is a maximum if he asserts P [Savage, 1971].
Condition (1) has been proposed on a variety of grounds. However, it
has a particularly clear significance for inductive logic. Suppose K is a unit
class (specifying P uniquely). Then clearly the conclusion should be P.
However, if S is not reproducing, a higher expected score would be obtained
by asserting some distribution other than P.
Proper scoring rules, in effect, establish verifiability conditions for
probability statements. In particular, they permit verification of assertions
of the probability of single events. The assigned score S(P,e) is a function
both of the event e that occurs-giving the required tie to reality-and of
the asserted distribution P-giving the required dependence on the content
of the statement.
It is convenient to define some auxiliary notions:

H(P) = Σ_E P(e) S(P,e) ,   (2a)

G(P,R) = Σ_E P(e) S(R,e) ,   (2b)

N(P,R) = H(P) − G(P,R) .   (2c)
H(P) is the expected score given that P is asserted and that P is also the
correct distribution. G(P,R) is the expected score given that P is the cor-
rect distribution but R is asserted. N(P,R) is the net score, and measures
the loss if P is correct and R is asserted. (Conversely, N(P,R) can be
viewed as the gain resulting from changing an estimate from the distribution
R to the correct distribution P.)
In the new notation, Eq. (1) can be written G(P,R) ≤ H(P), or, equivalently, N(P,R) ≥ 0.
There is an infinite family of score rules that fulfill condition (1). Only two will be explicitly referred to here: (1) the logarithmic score, S(P,e) = log P(e), and (2) the quadratic score, S(P,e) = 2P(e) − Σ_E P(e)². H(P) for the logarithmic score is Σ_E P(e) log P(e), that is, the negentropy. N(P,R) for the log score is Σ_E P(e) log[P(e)/R(e)], the cross-entropy. H(P) for the quadratic score is Σ_E P(e)². N(P,R) for the quadratic score is Σ_E [P(e) − R(e)]², that is, the squared distance between P and R.
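These closed forms are easy to verify numerically. The sketch below (illustrative only; the distributions P and R are arbitrary) computes H, G, and N of Eqs. (2) for the logarithmic and quadratic scores, checks the reproducing property (1) in the form G(P,R) ≤ H(P), and confirms the expressions for N and H just quoted.

    # Verify G(P,R) <= H(P) and the closed forms of H and N quoted in the text
    # for the logarithmic and quadratic scores.  P and R are arbitrary choices.
    import numpy as np

    P = np.array([0.5, 0.3, 0.2])
    R = np.array([0.2, 0.5, 0.3])

    def log_score(Q, e):         # S(Q,e) = log Q(e)
        return np.log(Q[e])

    def quad_score(Q, e):        # S(Q,e) = 2 Q(e) - sum_E Q(e')^2
        return 2.0 * Q[e] - np.sum(Q**2)

    def H(S, P):                 # Eq. (2a): expected score when the truth is asserted
        return sum(P[e] * S(P, e) for e in range(len(P)))

    def G(S, P, R):              # Eq. (2b): P correct, R asserted
        return sum(P[e] * S(R, e) for e in range(len(P)))

    def N(S, P, R):              # Eq. (2c): net score
        return H(S, P) - G(S, P, R)

    for name, S in [("log", log_score), ("quadratic", quad_score)]:
        print(name, "reproducing:", G(S, P, R) <= H(S, P))

    print(np.isclose(N(log_score, P, R), np.sum(P * np.log(P / R))))    # cross-entropy
    print(np.isclose(N(quad_score, P, R), np.sum((P - R)**2)))          # squared distance
    print(np.isclose(H(quad_score, P), np.sum(P**2)))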
Given a set K of potential distributions and a proper score rule S, the inductive rule is: Select the distribution P⁰ such that H(P⁰) = min_K H(P). In the general case, the relevant operator may be inf rather than min, and to achieve the guarantee described below, the analyst may have to adopt a mixed posit, that is, select a posit according to a probability distribution on K. These extensions are important for a general theory but are not directly relevant to the logical structure of the inductive procedure.
The inductive rule can be called the min-score rule. The rule is a
direct generalization of the maximum entropy rule proposed by Jaynes
[1968]. The entropy of a distribution P is the negative of the expected log-
arithmic score. Thus, maximizing the entropy is equivalent to minimizing
the expected logarithmic score. The min-score rule extends the prescription
to any proper score rule.
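As a concrete illustration of the min-score rule (not taken from the paper), the sketch below takes K to be the distributions on four events with a prescribed mean and minimizes H(P) numerically for both scores; for the logarithmic score the posit agrees with the maximum-entropy distribution, as the equivalence noted above requires. The event values, the constraint, and the use of SLSQP are assumptions.

    # Min-score posit on K = {P : sum P(e) = 1, E_P[x] = mu} for the logarithmic
    # and quadratic scores.  The values x, the mean mu, and the solver are
    # illustrative assumptions.
    import numpy as np
    from scipy.optimize import minimize

    x = np.array([0.0, 1.0, 2.0, 3.0])     # values attached to four events
    mu = 1.0                               # prescribed mean defining K

    def H_log(P):                          # expected log score (= minus the entropy)
        P = np.clip(P, 1e-12, None)
        return np.sum(P * np.log(P))

    def H_quad(P):                         # expected quadratic score
        return np.sum(P**2)

    cons = [{"type": "eq", "fun": lambda P: np.sum(P) - 1.0},
            {"type": "eq", "fun": lambda P: np.dot(P, x) - mu}]
    bounds = [(0.0, 1.0)] * len(x)
    start = np.full(len(x), 0.25)

    for name, Hfun in [("log", H_log), ("quadratic", H_quad)]:
        res = minimize(Hfun, start, method="SLSQP", bounds=bounds, constraints=cons)
        print(name, "posit:", np.round(res.x, 4), " H =", round(res.fun, 4))

    # For the log score the posit has the exponential (maximum-entropy) form
    # P0(e) proportional to exp(-lam * x_e) for some multiplier lam.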
Although the formalism of the min-score rule is analogous to the maximum entropy principle, the justification for the rule is much stronger. The justification stems from two basic properties of the min-score distribution: (1) It guarantees H(P⁰); that is, for every P ∈ K, G(P,P⁰) ≥ H(P⁰). (2) It is the only rule that fulfills the positive value of information (PVI) principle; that is, if K ⊂ K' (K contains more information than K') and P⁰ is the min-score posit for K and Q⁰ the min-score posit for K', then H(P⁰) ≥ H(Q⁰). These two properties are derived by Dalkey [1985]. The two properties give a clear content to the notion of validity for induction.
3. Min-Net-Score Updating

Given a prior estimate P and a knowledge set K, the min-net-score (MNS) rule selects the distribution Q in K that minimizes the net score with respect to P; that is, N(Q,P) = min_{R ∈ K} N(R,P). As men-
tioned above, the rule has been investigated only for the logarithmic score.
However, it clearly is extendable to any proper score.
Figure 1 illustrates the rule for the quadratic score for an E consisting
of three events, where the triangle is the simplex Z of all probability distri-
butions on three events. For the quadratic score, if K is convex and closed,
then Q is the point such that a line joining it and P is perpendicular to the
hyperplane supporting K at Q. The actual distribution P̄ is assumed to be in K, but the fact that it is unknown is indicated by a fuzzy point.
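A small numerical illustration of the MNS rule for the quadratic score (not from the paper; the knowledge set, the prior, and the solver are arbitrary assumptions): the posit Q is just the Euclidean projection of the prior P onto K, and sampling K confirms the "better guess" property established below.

    # MNS posit for the quadratic score: Q minimizes N(R,P) = ||R - P||^2 over a
    # convex K inside the simplex, i.e. Q is the Euclidean projection of the
    # prior P onto K.  K, P, and the solver are illustrative assumptions.
    import numpy as np
    from scipy.optimize import minimize

    x = np.array([0.0, 1.0, 2.0])              # three events, as in Fig. 1
    P = np.array([0.7, 0.2, 0.1])              # prior estimate, outside K

    cons = [{"type": "eq",   "fun": lambda R: np.sum(R) - 1.0},
            {"type": "ineq", "fun": lambda R: np.dot(R, x) - 1.2}]   # K: mean >= 1.2
    res = minimize(lambda R: np.sum((R - P)**2), P, method="SLSQP",
                   bounds=[(0.0, 1.0)] * 3, constraints=cons)
    Q = res.x
    print("MNS posit Q:", np.round(Q, 4))

    # "better guess": every member of K is at least as close to Q as to P
    rng = np.random.default_rng(0)
    draws = rng.dirichlet(np.ones(3), size=2000)
    in_K = draws[draws @ x >= 1.2]
    ok = np.sum((in_K - Q)**2, axis=1) <= np.sum((in_K - P)**2, axis=1) + 1e-6
    print("better guess holds for all sampled members of K:", bool(np.all(ok)))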
The most persuasive property of the MNS rule has been demonstrated by
Shore and Johnson [1980]. By assumption, the actual distribution P̄ is in K. Shore and Johnson show, for the logarithmic score, that N(P̄,Q) ≤ N(P̄,P). In words, the actual distribution is closer to Q than it is to the prior P (in the sense of cross-entropy). Thus, if it is felt that P embodies relevant information about P̄ (for example, that P is not a "bad guess"), then Q is a better guess than P; the analyst suffers a smaller loss believing Q than he would staying with P.
Call the fact that N(P̄,Q) ≤ N(P̄,P) the "better guess" property. It can
be shown that the better guess property holds for the MNS rule and any
proper score.
Theorem: If K is convex and closed, and G(P,R) is bounded on K, then, for any P in Z, there exists a distribution Q⁰ in K such that N(P̄,Q⁰) ≤ N(P̄,P) for every P̄ in K.
Proof: Let J(R,Q;P) = N(R,P) − N(R,Q) = G(R,Q) − G(R,P). J can be interpreted as a game in which "nature," as the minimizing player, selects R, and the analyst, as the maximizing player, selects Q. From Eqs. (2), J is linear in R, and therefore convex in R. Hence, there exists a value for the game, and a pure strategy Q⁰ for nature [Blackwell and Girshick, 1954,
[Figure (not reproduced): the supporting hyperplane construction.]
Now, consider the situation in Fig. 4. Here the sets K and K' do not in-
tersect, and thus are incompatible: one or the other must be incorrect.
Yet, if only the 'bare' prior P is known, the MNS procedure allows the in-
ference to proceed, and winds up with Q as the conclusion. In short, the
MNS procedure has no provision for identifying incompatibilities between a
prior P and a current knowledge set K.
With regard to the PVI principle, consider a case like that in Fig. 5.
The circle is one of the family of iso-expected score surfaces (in this case
for the quadratic score). The expected score increases outward from the
centroid of the triangle. For the set K (the upper triangle) and the prior P,
the MNS rule selects Q as a posit. Now suppose the knowledge set is
reduced to K' (the shaded quadrangle). Since K' is totally included in K, we
can assert that K' is more informative than K. The MNS rule selects Q' as
the posit given K' and P. From the illustration, it is clear that H(Q) > H(Q');
restricting the knowledge set to K' leads to a smaller expected score. At
the same time, suppose that P is at the indicated location. Again, it is clear
from the construction that N(P,Q) < N(P,Q'). Thus, increasing information
does not guarantee that the resulting posit is closer to the actual
distribution.
4. Min-Score Updating
We restrict the discussion to the case that the prior P may be assumed to stem
from an inductive inference based on a knowledge set K' that, for whatever
reason, is no longer available. In this case, if it can be assumed that the
unknown K' is convex, it is feasible to determine the weakest knowledge set
that would support the prior estimate P.
Let K⁺(P) denote the half space that supports the iso-H surface at P; in other words, K⁺(P) is the half space that is bounded by the hyperplane supporting the convex set H(P)⁻ = {Q | H(Q) ≤ H(P)} at P, and lies on the opposite side from H(P)⁻. Assuming that the unknown K' is convex, it must be contained in K⁺(P), since H(P) is a minimum in K', and the hyperplane supporting H(P)⁻ at P is thus a separating hyperplane for K' and H(P)⁻. K⁺(P) is thus the desired 'weakest K that could generate P as a posit.' The construction of K⁺ is illustrated in Fig. 6.
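For the quadratic score the construction of K⁺(P) is explicit: the supporting hyperplane of H(P)⁻ at P has normal ∇H(P) = 2P, so K⁺(P) = {Q : ⟨P, Q − P⟩ ≥ 0}. The sketch below (illustrative; the prior P is an arbitrary assumption) samples the simplex and confirms that every member of K⁺(P) has expected self-score at least H(P), so that P is a min-score posit for K⁺(P).

    # K+(P) for the quadratic score: the half space {Q : <P, Q - P> >= 0} bounded
    # by the hyperplane supporting H(P)- = {Q : H(Q) <= H(P)} at P.
    # The prior P is an arbitrary illustrative choice.
    import numpy as np

    P = np.array([0.5, 0.3, 0.2])

    def H(Q):                                  # expected quadratic self-score
        return np.sum(Q**2)

    rng = np.random.default_rng(1)
    Q = rng.dirichlet(np.ones(3), size=20000)  # sample the simplex Z
    in_K_plus = Q @ P >= np.sum(P**2)          # <P, Q - P> >= 0

    print(bool(np.all(np.sum(Q[in_K_plus]**2, axis=1) >= H(P) - 1e-12)))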
5. Discussion
6. Acknowledgment
This work was supported in part by National Science Foundation Grant
IST 8201556.
7. References
Blackwell, D., and M. A. Girshick (1954), Theory of Games and Statistical
Decisions, Wiley, New York.
Dalkey, N. C. (1985), 'Inductive inference and the maximum entropy prin-
ciple,' in Maximum-Entropy and Bayesian Methods in Inverse Problems,
C. Ray Smith and W. T. Grandy, Jr., eds., D. Reidel, Dordrecht, pp.
351-364.
Good, I. J. (1963), 'Maximum entropy for hypothesis formulation, especially
for multidimensional contingency tables,' Ann. Math. Stat. 34, pp.
911-934.
Jaynes, E. T. (1968), 'Prior probabilities,' IEEE Trans. Syst. Sci. Cybernet.
SSC-4, pp. 227-241.
Kullback, S. (1959), Information Theory and Statistics, Wiley, New York.
Pfaffelhuber, E. (1975), 'Minimax information gain and minimum discrimina-
tion principle,' in Colloquia Mathematica Societatis János Bolyai, vol. 16,
Topics in Information Theory, Keszthely (Hungary).
Stuart Geman
SUBJECT INDEX