Bayesian Analysis of Binary Sequences
Dedicated to Professor Micah Dembo, Boston University, as it traces its origins to his clockwork exposition of Bayesian theory
Abstract
This manuscript details Bayesian methodology for “learning by example”, with binary n-sequences encoding the
objects under consideration. Priors prove influential; conformable priors are described. Laplace approximation of
Bayes integrals yields posterior likelihoods for all n-sequences. This involves the optimization of a definite function
over a convex domain—efficiently effectuated by the sequential application of the quadratic program.
© 2004 Elsevier B.V. All rights reserved.
MSC: primary: 62F15; secondary: 60C05; 65K05
Keywords: Concave; Convex; Cut polytope; Geometric probability; Laplace approximation; Machine learning; Moments;
Nonlinear optimization; Polytope; Posterior likelihoods; Probability monomials; Quadratic program; Semidefinite
1. Introduction
This manuscript self-containedly describes new methodology for the well-rooted desideratum of “learn-
ing by example”. To precise this aim, let binary n-sequences represent the elements of the universe of
possible examples. Then, the desideratum may be paraphrased as “characterizing a distribution on binary
n-sequences, based upon a sample from this distribution” because, for instance, the sine qua non of pre-
diction is making intelligent use of the examples for the classification of supplementary elements. This
aim compasses many applications, e.g., the screening of digital images or of biological sequences—to
find elements resembling those from a data set.
E-mail address: [email protected] (D.C. Torney).
0377-0427/$ - see front matter © 2004 Elsevier B.V. All rights reserved.
doi:10.1016/j.cam.2004.05.010
232 D.C. Torney / Journal of Computational and Applied Mathematics 175 (2005) 231 – 243
Bayesian methods constitute a peerless framework for data-based inference [4]. They engender a
posterior distribution (what is inferred, based on the examples) from a prior distribution (what is as-
sumed, at the outset, before the examination of the examples). As the desideratum is a distribution,
the respective posterior distribution is a distribution on distributions – accentuating those distributions
in greatest accordance with the examples, as described in Section 2. The reader who seeks an intro-
duction to Bayesian statistics cannot do better than to consult [4], and, alternatively, the uninitiated
may take its present instantiation (cf. (2)) as a postulate. The burgeoning of this application of Bayes’
rule bided the impediment of a decided, unwelcome dependence of the posterior distribution upon the
prior distribution. As will be seen in Section 2, a uniform (unprejudiced) prior is unavailing, and the
new methodology comprises the detailed specification of priors. Insights sufficing for the selection of
nonuniform priors may originate, for instance, from sample estimates of key parameters of probability
distributions.
Parameterizations for distributions on binary n-sequences usually base their parameters on marginal
distributions for subsets of digits [2,5,17,24], and sampling of these distributions is feasible using Markov
chain Monte Carlo methods. The moment parameterization, reviewed in Section 3, is used to engineer the
new methodology. Prior distributions are, herein, taken to be uniform over all distributions having specified
vanishing moments (although fixing moments at nonzero values is also accommodated in the present
framework). Section 4 introduces this prior, gives consideration to distributions exhibiting vanishing
moments and digresses upon dialectics. Applications corroborate the felicitousness of this prior (viz.
Section 11). The linearity of the moment parameterization is its greatest boon; whence, for example, each
fixed-moment prior educes a convex polytope: an intersection of half spaces [11] constitutes the domain of
admissible collections of (nonfixed) moment parameters. In detail, there is a half-space pertaining to each
n-sequence, ensuring the nonnegativity of the respective linear combination of the moments expressing
its probability.
As described in Section 4, according to Bayes’ rule, given a prior uniform over admissible distributions,
the sought posterior distributions result from integrals of a probability monomial over a respective poly-
tope, with this monomial denoting the product of positive powers of sequences’ probabilities, with the
powers being the respective sequence multiplicities in the collection of examples. Probability monomials
are multilinear functions of the (nonfixed) moments.
Explicit restriction to these polytopes, in theory, quells the “moment problem”. However, as illustrated
in Section 5, perplexity may persist through their intricacy, which may retard the implementation of this
restriction. Section 6 is the starting point for the derivation of the main results, which are based upon a
more detailed formulation given therein.
The logarithm of probability monomials is shown to be negative semidefinite in Section 7. Orthogonal
projections may be employed, as necessary, to educe definiteness and, hence, a function whose unique
local optimum is also its global maximum, as described in Section 8. The towering of this maximum, for
samples of respectable size, triggers Laplace approximations for the Bayes integrals, described in Section
9. The integrand’s maximum point parameterizes a distribution comprising the posterior likelihoods: the
expected posterior probabilities (viz. Section 2). As a consequence of definiteness, convergence thither
is efficiently effected by maximizing each of a sequence of quadratic approximations to the logarithm
of a probability monomial, the latter by means of a quadratic program comprising the aforementioned
half-space constraints—whose number is typically exponential in n; as detailed in Section 10 [19,22].
Implementations of the quadratic program exhibit low-degree-polynomial complexity in the number of
optimized parameters.
The methods developed herein afford an advantageous alternative to neural nets (and the like) and hid-
den Markov models, each approach with its strengths and limitations. Strengths of the present method-
ology include its up-front “analyticity” and its generality. These advantages encourage the embedded
determination of posterior likelihoods within an overarching optimization problem: for instance, as part
of classification, where discrimination could be combinatorially ameliorated by means of the judicious
selection of the nonfixed moments. Although the present methodology is limited to small n by its ver-
nal numerical underpinnings, it surely prefigures approximate methods useful for arbitrarily large n.
This manuscript concerns the solution of analogues of the “Basic Problems” of hidden Markov models
[20, p. 261]: Problem 1, computing a probability for an arbitrary sequence, and Problem 3, adjusting the
parameters to maximize the probability of a data set. Advanced problems, such as the analogue of Problem
2, evoking hidden moment models, await investigation.
2. Bayesian elementals
In this section, the Bayesian framework is presented, requiring integrations which may be viewed as an exercise
in geometric probability. Herein, U denotes the set {1, 2, . . . , n}. The following notation for probability
distributions on (−1, +1)-binary n-sequences is adopted [5].
Definition 1. Index the digits of (−1, +1)-binary n-sequences from 1 to n. Let S, with S ⊆ U , denote
the n-sequence with digit i equal 1 if i ∈ S and equal −1 otherwise (i ∈ U − S).
Consider a collection of sequences—derived by sampling an unknown distribution P def= (P(∅), P({1}), . . . , P(U)), with P(S) denoting the probability of S, and, therefore, 1 = Σ_{S⊆U} P(S). Such collections will be referred to as samples. For all S ⊆ U, let m_S ∈ {0} ∪ ℕ denote the multiplicity of sequence S in a given sample. If the sequences of the sample are generated by independent sampling from the distribution, which is assumed herein, then, by reference to the multinomial distribution, the probability of the sample is proportional to a probability monomial

∏_{S⊆U} P(S)^{m_S},   (1)
maximum of zero and) one minus the sum of the integration variables occurring in integrals to the left, yielding the meaning of the denotatum ∗. Therefore, (2) is a ratio of the probability of the data, assuming one distribution P, to the probability of the data for all allowed distributions: effecting a reweighting of the latter [4]. Such domains of integration—regular simplices in R^{2^n−1}; n = 1, 2, . . . , comprising all admissible P's [8, p. 122]—are subsequently denoted A. With π(P) equal a constant, π, the denominator of (2) equals [1, p. 258]

( ∏_{S⊆U} m_S! ) / ( Σ_{T⊆U} m_T + 2^n − 1 )!.   (3)
That is, L(S) equals the expectation of P(S) over the posterior distribution on distributions. Thus, with constant π(P), L(R); R ⊆ U, may be obtained, in the foregoing formulation, by substituting m_R + 1 for m_R in (3) and dividing by (3), yielding

L(R) = (m_R + 1) / ( Σ_{T⊆U} m_T + 2^n ),  R ⊆ U.   (4)
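Eq. (4) is elementary to evaluate; a minimal Python sketch, where each n-sequence S is encoded as an n-bit mask (an encoding assumed here purely for illustration):

```python
def uniform_prior_likelihoods(n, multiplicities):
    """Posterior likelihoods under a uniform prior, Eq. (4):
    L(R) = (m_R + 1) / (sum_T m_T + 2**n)."""
    denom = sum(multiplicities.values()) + 2 ** n
    return {R: (multiplicities.get(R, 0) + 1) / denom
            for R in range(2 ** n)}

# the sample of Example 2 below, subsets of U = {1, 2, 3} as bitmasks
m = {0b011: 1, 0b010: 1, 0b001: 2}
L = uniform_prior_likelihoods(3, m)
assert abs(sum(L.values()) - 1) < 1e-12   # likelihoods form a distribution
assert L[0b001] == 3 / 12                 # (m_S + 1)/12, as in Example 2
```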
But what has been wrought? Of particular note, if mR = mS , then L(R) = L(S), regardless of the
congruousness of either R or S to the sequences of the sample and the exceptionableness, in general,
of approximating a distribution based solely upon sample multiplicities: (4) bespeaks “nothing ventured
nothing gained”. And what’s the remedy? Within the Bayesian framework, there is only one recourse:
This manuscript illustrates previously unsuspected, practicable resolution via concinnous restrictions on
the prior distribution—based upon the moment parameterization.
3. Moment parameterization
Definition 3. Index the digits of (−1, +1)-binary n-sequences from 1 to n. The moment μ(T) equals the expectation of the product of the digits indexed by the elements of T; T ⊆ U.

Thus, μ(T) is the sum of the probabilities of (−1, +1)-binary n-sequences for which the respective product equals +1 minus the sum of the probabilities of the n-sequences for which the respective product equals −1. Note that μ(∅) = 1 because it is conventional for an "empty product" to equal unity. As illustration of Definition 3, the probability of "even parity" is implicit in μ(U): the sum of the probabilities of sequences containing an even number of −1's minus the sum of the probabilities of sequences containing an odd number of −1's.
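Definition 3 can be computed directly; a Python sketch, assuming sequences are encoded as bitmasks (bit i−1 set meaning digit i equals +1), so that the product of the digits indexed by T equals (−1)^{|T\S|}:

```python
def moment(P, T, n):
    """mu(T) of Definition 3: the expectation, under distribution P,
    of the product of the digits indexed by the elements of T.
    For sequence S, that product equals (-1) ** |T \\ S|."""
    return sum(p * (-1) ** bin(T & ~S).count("1") for S, p in P.items())

# uniform distribution on n = 2: mu(emptyset) = 1, all other moments vanish
P = {S: 0.25 for S in range(4)}
assert moment(P, 0b00, 2) == 1.0   # the empty product
assert moment(P, 0b11, 2) == 0.0   # the "even parity" moment mu(U)
```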
P(S) = 2^{−n} Σ_{T⊆U} (−1)^{|T\S|} μ(T),  S ⊆ U.   (5)

Here, |T\S| denotes the cardinality of the relative complement: the set of elements in T but not in S, and T ranges over all subsets of U. Note the linearity of P(S) in the moments. Not all values of the moment parameters correspond to a distribution: 0 ≤ P(S) is required for all S ⊆ U; the moment problem seeks insights into whether or not given values engender a distribution. It is also worthy of note that, with an appropriate normalization, the ±1 coefficients in (5) constitute self-inverse Hadamard matrices because, for instance, the inverse of (5) leads directly back to Definition 3 [5]. Note that there are 2^{|U|} − 1 variable moment parameters in (5)—all save μ(∅)—and the same number of independently specifiable P(S)'s. Eq. (5) is illustrated by the following example.
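The self-inverse Hadamard structure of the ±1 coefficients can be checked numerically; a Python sketch (subsets again encoded as bitmasks, an assumption of this illustration), applying the same coefficient matrix in both directions:

```python
def probs_from_moments(mu, n):
    """Eq. (5): P(S) = 2**-n * sum_{T subset U} (-1)**|T\\S| * mu(T)."""
    return [sum((-1) ** bin(T & ~S).count("1") * mu[T]
                for T in range(2 ** n)) / 2 ** n
            for S in range(2 ** n)]

def moments_from_probs(P, n):
    """Definition 3: mu(T) = sum_S (-1)**|T\\S| * P(S) -- the same +-1
    coefficient matrix, up to the 2**-n normalization."""
    return [sum((-1) ** bin(T & ~S).count("1") * P[S]
                for S in range(2 ** n))
            for T in range(2 ** n)]

P = [0.125, 0.25, 0.25, 0.375]            # a distribution on 2-sequences
mu = moments_from_probs(P, 2)
assert mu[0] == 1.0                        # mu(emptyset) = 1
back = probs_from_moments(mu, 2)
assert all(abs(a - b) < 1e-12 for a, b in zip(P, back))
```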
4. Fixed-moment-distribution priors
To bypass (4), the prior distribution on distributions, π(P), must be nonuniform. Although it may not be
immediately obvious how to effect this, with (5) in mind, it is natural to entertain fixing some moments as
the means of restricting prior distributions. Thus, priors are taken to be uniform on all distributions having
specified moments taking given values, assigning probability zero to all other distributions. Primarily
to simplify computations, fixed moments are taken to vanish (although closely analogous results also
obtain with arbitrarily fixed moments). Such distributions always exist; in these, some probabilities may
be thought of as dependent upon the rest, as there plainly cannot be more independently specifiable
probabilities than the number of nonfixed moment parameters.
Thus, in (5), trivially, instead of summing over all of the subsets of U , one need only sum over the
subsets of U whose moments are not zeroed. This collection of subsets is denoted M ⊆ 2^U, the latter connoting the power set of U. Thus,

P(S) = 2^{−n} Σ_{T∈M} (−1)^{|T\S|} μ(T),  S ⊆ U.   (6)
Note that ∅ is always an element of M because μ(∅) = 1.

For every M ⊆ 2^U (with ∅ ∈ M) the domain of admissible nonfixed moments, corresponding to a probability distribution, is a convex polytope P: the intersection of the closed half-spaces in which the right-hand sides of (6) are nonnegative for all S ⊆ U [11] (viz. Definition 6). For example, were M = 2^U, as in (5), then P would be a regular simplex ∼ A (with n = |U|), entailing (4). Note that P always contains at least one point: derivable from μ(T) = 0, for all T ∈ M − ∅. In Section 6, the distribution P and the corresponding P are generalized and precised.
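Membership in P is a finite list of half-space tests, one per sequence; a Python sketch (bitmask subsets; the particular M below is only an illustrative choice):

```python
def admissible(mu, M, n, tol=1e-12):
    """Test membership in the polytope P: each right-hand side of Eq. (6)
    -- one closed half-space per sequence S -- must be nonnegative.
    mu maps each mask T in M to its moment; mask 0 (empty set) maps to 1.
    The positive factor 2**-n of Eq. (6) is immaterial to the sign."""
    return all(
        sum((-1) ** bin(T & ~S).count("1") * mu[T] for T in M) >= -tol
        for S in range(2 ** n))

# n = 3 with only one-digit moments nonfixed; the eight constraints
# reduce to |mu({1})| + |mu({2})| + |mu({3})| <= 1
M = [0b000, 0b001, 0b010, 0b100]
assert admissible({0b000: 1, 0b001: 0.5, 0b010: 0.25, 0b100: -0.125}, M, 3)
assert not admissible({0b000: 1, 0b001: 0.9, 0b010: 0.9, 0b100: 0.0}, M, 3)
```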
Table 1
Posterior likelihoods
Letting π(P) equal a constant inside P and equal zero outside, the multiple integral of the denominator of (2) is found to be proportional to

∫_P dμ ∏_{S⊆U} P(S)^{m_S}.   (7)

Here, (6) parameterizes the P(S)'s, and dμ denotes the volume element of the Euclidean space with Cartesian coordinates μ(T); T ∈ M − ∅. In addition, from Definition 2 and (6),

L(R) = ∫_P dμ P(R) ∏_{S⊆U} P(S)^{m_S} / ∫_P dμ ∏_{S′⊆U} P(S′)^{m_{S′}},  R ⊆ U.   (8)
Example 2. Let n = 3, and let the sample multiplicities m_{(1,1,−1)} = m_{(−1,1,−1)} = 1, m_{(1,−1,−1)} = 2, with the remaining m's equal zero. Under the assumption of a uniform prior on all possible distributions, from (4), the likelihoods for the eight binary sequences of length three equal (m_S + 1)/12, respectively: viz. the last row of Table 1. Were, on the other hand, the prior distribution uniform on distributions with vanishing moments of two and three digit positions, alternative posterior likelihoods would obtain. The integrals occurring in the calculation of the respective posterior likelihoods are performed via straightforward recurrences, yielding the penultimate row of Table 1. Note that different sequences have maximum likelihood for the different priors and that the posterior likelihoods are uncoupled from the sample multiplicities (Table 1's middle row).
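The paper evaluates these integrals exactly by recurrence; as a cruder stand-in, the ratio (8) can be estimated by Monte Carlo over the polytope of admissible one-digit moments, which for this prior reduces to |μ({1})| + |μ({2})| + |μ({3})| ≤ 1. A Python sketch (the bitmask encoding and rejection sampler are illustrative assumptions, not the paper's method):

```python
import random

def mc_likelihoods(samples=100_000, seed=0):
    """Monte Carlo estimate of Eq. (8) for Example 2's restricted prior:
    uniform on distributions over length-3 sequences whose two- and
    three-digit moments vanish."""
    rng = random.Random(seed)
    m = {0b011: 1, 0b010: 1, 0b001: 2}      # sample multiplicities
    num, den = [0.0] * 8, 0.0
    for _ in range(samples):
        mu = [rng.uniform(-1, 1) for _ in range(3)]
        if sum(abs(x) for x in mu) > 1:
            continue                         # outside the polytope
        # P(S) = (1 + sum_i s_i mu_i)/8, s_i = +1 iff digit i of S is +1
        P = [(1 + sum(mu[i] * (1 if S >> i & 1 else -1)
                      for i in range(3))) / 8 for S in range(8)]
        w = 1.0                              # the probability monomial (1)
        for S, mult in m.items():
            w *= P[S] ** mult
        den += w
        for S in range(8):
            num[S] += w * P[S]
    return [x / den for x in num]

L = mc_likelihoods()
assert abs(sum(L) - 1) < 1e-9   # the posterior likelihoods form a distribution
assert min(L) > 0
assert L[0b001] > L[0b100]      # a sampled sequence beats an unsampled one
```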
It is a legitimate concern that distributions (6) must often, in practice, depend upon substantially fewer
than 2n nonzeroed moments, and might not this deficit entail some “strangeness in the proportion”? This
may be assuaged by noting that it would be counterproductive to employ “an excess” of parameters
for a given sample, as this would ensure overfitting—whereas extrapolation and generalization are the
desiderata. Measures of overfitting remain to be identified and implemented, but parsimony is likely to
be uniformly advantageous, i.e. the sought distribution should be reasonably well-approximable using a
relatively small number of parameters. The moment parameterization is especially adapted to distributions
whose marginal distributions may be well approximated using few moments—a frequent attribute of
distributions derived for empirical data sets, e.g., biological sequences. The moments which are fixed at
zero are, ideally, ones whose values are small in the sought distribution, as attested by the sample moments.
Provided one is willing to verify feasibility [14] to furnish interior-point algorithms with an initial point, the
present methodology also readily accommodates fixing moments at values other than zero. Application-
dependent measures of the aptness of a given collection of fixed moment parameters remain, however,
to be developed. One could, for instance, aim to classify based upon posterior likelihoods derived from
several samples, possibly engendering novel prescriptions for fixing moments, but such optimizations
fall outside the scope of the present manuscript. If moments are unsuited to distributions arising from a
given application, then one might consider developing analogues of the present methodology based upon
an alternative parameterization [5,17,24].
5. Geometric probability
Consider the evaluation of Bayesian integrals (7), with P given by (6). Because of the complexity of
P these integrals are, in general, hopelessly difficult to evaluate exactly [3,15,16]. For instance, were all moments of more than two digits to vanish, then the respective P would be the dual of a cut polytope [9]. The volumes of cut polytopes remain unknown for n ≥ 4, however, and determining their duals'
volumes could prove even more daunting because the number of vertices, each corresponding to a facet
of the respective cut polytope, empirically increases faster than exp{exp{0.41n}} [3,6]. Thus, the state of
the art does not lend itself to exact evaluation of (7).
6. Bayesian infrastructure
Excellent approximations for integrals (7) are achievable, in general, as shall shortly be established by
the following means. The logarithm of the probability monomial (7) is shown to be concave in Section 7,
which, essentially, implies the uniqueness of a local optimum, where the global maximum also, in fact,
occurs, motivating and facilitating Laplace approximations [7, Chapter 5]. The topic of Section 8 is the
discernment of definiteness (versus proper semidefiniteness); there it is established that optimization may
always be effected in a subspace where definiteness obtains. In Section 9, the probabilities parameterized
by the maximum-point moments, are noted as canonical Laplace approximations for the posterior likeli-
hoods. Practicable optimization methods are described in Section 10 and exemplified in Section 11. The
following infrastructure is facilitatory.
A special case of distributions on (−1, +1)-binary n-sequences is as follows. Let some sequences be
known a priori to have probability zero, anticipating, for instance, the binary encoding of letters from an
alphabet of size not equal to a power of two. For convenience, let V ⊆ 2U denote those sequences whose
probabilities might not vanish. It facilitates some applications to explicitly annul P(S) for S ∉ V, i.e., the domain of (6) becomes S ∈ V. Such distributions are denoted P_V and their individual probabilities are denoted P_V(S); S ∈ V. From (6), the following definition obtains.
Definition 4.

P_V(S) = (1/N) Σ_{T∈M} (−1)^{|T\S|} μ(T), S ∈ V;  P_V(S) = 0, S ∉ V,   (9)

with S ⊆ U; with ∅ ∈ M ⊆ 2^U; and with N normalizing the sum of the P_V(S)'s over S ∈ V to unity.
Therefore,

N def= Σ_{S∈V} Σ_{T∈M} (−1)^{|T\S|} μ(T).   (10)

Note that N = 2^n for V = 2^U and that P_{2^U} is P (cf. (6)). The following comprises a generalization of (1).
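Eq. (10) is a finite double sum; a Python sketch (bitmask subsets assumed), also confirming that with V = 2^U the cross terms cancel and N = 2^n:

```python
def normalizer(mu, M, V):
    """N of Eq. (10): sum over S in V and T in M of (-1)**|T\\S| * mu(T)."""
    return sum(sum((-1) ** bin(T & ~S).count("1") * mu[T] for T in M)
               for S in V)

n, M = 2, [0b00, 0b01]
mu = {0b00: 1.0, 0b01: 0.25}
# with V = 2^U, N = 2**n independently of the admissible mu
assert normalizer(mu, M, range(2 ** n)) == 2 ** n
# restricting V changes N: here V omits the sequences with digit 1 = +1
assert normalizer(mu, M, [0b00, 0b10]) == 2 - 2 * 0.25
```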
The equality constraints have been included for the sake of generality. Equality constraints could, of course, be implemented by two, opposing inequalities, but it is best to differentiate these. From (10), b_1 = Σ_{T∈M} a_{1T} μ(T), with a_{1T} def= Σ_{S∈V} (−1)^{|T\S|}, would fix N (equal to b_1). Note that N = |V| whenever a_{1∅} (= |V|) is the only nonvanishing a_{1T}; T ∈ M. Thus, the Bayesian integral (7) may be written

∫_P dμ ∏_{S∈V} P_V(S)^{z_S}.   (7′)
7. Semidefiniteness
Definition 7 (Gantmacher [10, Vol. 1, p. 304]). Let f be a twice differentiable function of the μ(R)'s; R ∈ M − ∅. f is negative (semi)definite if and only if its Hessian matrix

∂²f / ∂μ(Q) ∂μ(R)  (Q, R ∈ M − ∅),

is negative (semi)definite, i.e., the associated quadratic form is negative (nonpositive) for all nonzero arguments.
Definition 8. Let Λ def= Σ_{S∈V} z_S log(N P_V(S)).
Theorem 1. With fixed N, the logarithm of any probability monomial (1) is a negative semidefinite
function of the moments in the proper interior of the respective P.
Proof. The theorem is first established for the case h = 0, i.e. with a P which requires no equality constraints. From Definition 8, the logarithm of a probability monomial (1),

Σ_{S∈V} z_S log P_V(S) = Λ − (log N) Σ_{S∈V} z_S

(evidently real valued inside P). Thus, with fixed N and fixed z_S's, Λ contains the functional dependencies of interest. The Hessian of Λ,

H = − [ Σ_{S∈V} z_S (−1)^{|Q\S|+|R\S|} / ( Σ_{T∈M} (−1)^{|T\S|} μ(T) )² ]  (Q, R ∈ M − ∅).   (13)

Note that H's entries are finite in the proper interior of P. Thus, let v_R denote a vector whose Sth component

v_R(S) = √z_S (−1)^{|R\S|} / Σ_{T∈M} (−1)^{|T\S|} μ(T),   (14)

for R ∈ M − ∅ and S ∈ V. Here, the positive root obtains. Thus, H is the negative of a Gram matrix, whose (Q, R)th entry equals the negative of the scalar product v_Q · v_R. Negative semidefiniteness of Λ in P follows immediately (cf. [10,13]). Turning to the case 0 < h, in which P involves equality constraints—such as one used to explicitly fix N—it is evident that a negative semidefinite function defined on a domain is negative semidefinite on the subdomain arising from any orthogonal projection [8, pp. 243–262], here, onto the orthogonal complement of the linear span of the equality constraints.
Fixing N seems a small price to pay to be able to evaluate (7′), for which semidefiniteness is essential.
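Theorem 1 lends itself to a numerical spot check: assemble the Hessian (13) directly and confirm that its quadratic form is nonpositive. A Python sketch (bitmask subsets; the particular M, V, μ and z are arbitrary admissible choices made for this illustration):

```python
import random

def hessian(mu, z, M, V):
    """Hessian (13) of Lambda, indexed by Q, R in M - {emptyset}."""
    Mv = [T for T in M if T != 0]
    def d(S):   # sum_T (-1)**|T\S| mu(T); positive in the interior of P
        return sum((-1) ** bin(T & ~S).count("1") * mu[T] for T in M)
    return [[-sum(z[S] * (-1) ** (bin(Q & ~S).count("1")
                                  + bin(R & ~S).count("1")) / d(S) ** 2
                  for S in V)
             for R in Mv] for Q in Mv]

M, V = [0b00, 0b01, 0b10], range(4)
mu = {0b00: 1.0, 0b01: 0.2, 0b10: -0.1}     # an interior point of P
z = {S: 1.0 for S in V}
H = hessian(mu, z, M, V)
rng = random.Random(1)
for _ in range(200):                         # x . H . x <= 0 for every x
    x = [rng.uniform(-1, 1) for _ in H]
    q = sum(x[i] * H[i][j] * x[j]
            for i in range(len(H)) for j in range(len(H)))
    assert q <= 1e-12
```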
8. Educing definiteness
For completeness, this section addresses proper semidefiniteness, which may also be circumvented
by other means (viz. Section 11). Orthogonal transformations of Euclidean space may yield subspaces
(spanned by a subset of the transformed variables) (cf. [8, pp. 243–262, 23, Proposition 13.15 and Theorem
13.15]). The following hierarchical decomposition of “moment space” facilitates disquisition.
Definition 9. (i) Let Ξ denote E^{|M|−1}: Euclidean space coordinatized by the μ(R)'s, with R ∈ M − ∅; (ii) Let Ω ⊆ Ξ denote the linear span of the equality constraints (12); (iii) Let Ω⊥ denote the orthogonal complement of Ω in Ξ; (iv) Let Υ ⊆ Ω⊥ denote the intersection of the null space of H with Ω⊥, and (v) Let Ψ denote the orthogonal complement of Υ in Ω⊥.
Proof. Let u_R denote a vector whose Sth component equals (−1)^{|R\S|}; R ∈ M − ∅, S ∈ V₊ def= {T ∈ V : 0 ≠ z_T}. Denote the columns (or rows) of H (the Gram matrix composed from the v_R's) by the vectors h_R; R ∈ M − ∅. For nonzero real coefficients c₁, c₂, . . . , c_k; 1 ≤ k ≤ |M| − 1 and {i₁, i₂, . . . , i_k} ⊆ M − ∅, plainly,

c₁ h_{i₁} + c₂ h_{i₂} + · · · + c_k h_{i_k} = 0 ⇔ c₁ v_{i₁} + c₂ v_{i₂} + · · · + c_k v_{i_k} = 0 ⇔ c₁ u_{i₁} + c₂ u_{i₂} + · · · + c_k u_{i_k} = 0.   (15)

Therefore, as the rightmost dependencies are independent of the μ(R)'s, so are the linear dependencies among the columns of the Gram matrix and, hence, so is their span and its complement, the null space of H. As Ω⊥ is plainly invariant, the lemma follows.
Proof. Because, by definition, for any vector κ lying in Υ, with components κ(R); R ∈ M − ∅, reference to (13) yields

√z_S Σ_{R∈M−∅} (−1)^{|R\S|} κ(R) / Σ_{T∈M} (−1)^{|T\S|} μ(T) = 0  for all S ∈ V

(viz. (14)). Because the Rth component of ∇Λ equals Σ_{S∈V} z_S (−1)^{|R\S|} / ( Σ_{T∈M} (−1)^{|T\S|} μ(T) ); R ∈ M − ∅, it follows that ∇Λ is perpendicular to Υ. Therefore, inside P, Λ may be taken to be a function of those linear combinations of moments constituting any basis of Ψ.
Λ is, therefore, negative definite in the proper interior of P if and only if Υ is null. Next, a fundamental
theorem of convex optimization is recalled.
Theorem 2. There is a unique point of a convex domain where a function f—negative definite over the domain—takes a larger value than at all other ε-neighboring points of the domain. The global maximum value of f over the domain occurs at this point (cf. [21, Section 27]).
Proof. For it to be definite, the function f must be continuous. It follows from Weierstrass' compact-domain theorem that the function achieves a maximum value over the polytope. Because it is definite, f takes a larger value at this point than at all other ε-neighboring points of the domain.

Assume there are two points of the domain, a and b, at which the function takes a larger value than at all respective ε-neighboring points of the domain. Without loss of generality, let f(a) ≥ f(b) and a < b, as can easily be arranged (e.g. by reflecting the coordinate, if necessary). From the second-order, mean-value theorem [12, Chapter VII],

f(a) = f(b) + (a − b) f′(b) + (1/2)(a − b)² f″(ξ),

where a < ξ < b, and f′(b) denotes a directional derivative, taken, as necessary, infinitesimally interior to b. Because f′(b) ≥ 0, the middle term on the right is nonpositive and, because the last term on the right is negative by hypothesis, f(a) < f(b), contradicting what was assumed. Therefore f has precisely one local optimum.
Therefore, when Υ is nonnull, the maximization of Λ may be facilitated by first "projecting away" Υ. Fourier–Motzkin elimination may be used to orthogonally project P onto Ψ (one dimension at a time) [25, Theorem 1.4], yielding a polytope denoted P′. A potential drawback of this approach is that it may
engender a huge number of linear-inequality constraints. These could be culled by testing the superflu-
ousness of the individual inequalities: verifying feasibility with this inequality satisfied as an equality
[14]. A less exacting, rough-and-ready alternative is illustrated in Section 11.
9. Laplace approximation
The substitution, into P_V(R) (viz. (9)), of the parameters maximizing Λ, within P, is seen to constitute the simplest Laplace approximation for L(R); R ∈ V [7, Chapter 5]. In detail, recalling (8), the numerator's integrand is proportional to P_V(R) exp Λ, and the denominator's integrand is proportional to exp Λ. Thus, although the canonical "large parameter" is absent, Laplace approximation is congruous when Λ is sharply peaked, a measure of congruousness being, for instance, the smallness of ∫_P dμ exp Λ / ( exp{Λ̂} ∫_P dμ ), with Λ̂ denoting the maximum value of Λ. Thus, Laplace approximations for such ratios of integrals, each constructed about the maximum of exp Λ, allow cancellation of coefficients—the square root of the absolute value of det(H), the constants of proportionality and the factors of √(2π)—leaving only the numerator's factor of P_V(R), evaluated at the point ŝ maximizing Λ. These approximations for the posterior likelihoods plainly constitute a distribution. Corrections for situations with ŝ near the surface of P are, therefore, deferred.
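The cancellation just described can be written out explicitly; a sketch, with d the dimension of the space optimized over, both Gaussian factors evaluated about the common maximizer ŝ of Λ, and the notation assembled here for illustration:

```latex
L(R) \;=\; \frac{\int_{\mathcal P} P_V(R)\, e^{\Lambda}\, \mathrm d\mu}
                {\int_{\mathcal P} e^{\Lambda}\, \mathrm d\mu}
\;\approx\;
\frac{P_V(R)\big|_{\hat s}\; e^{\hat\Lambda}\,(2\pi)^{d/2}\,
      |\det H|^{-1/2}}
     {e^{\hat\Lambda}\,(2\pi)^{d/2}\,|\det H|^{-1/2}}
\;=\; P_V(R)\big|_{\hat s}.
```

Every factor except the slowly varying P_V(R) is common to numerator and denominator, which is why only the maximizing parameters survive.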
To obtain a unique maximum point ŝ for Laplace approximation may involve restricting Λ to Ψ and P′, as described. A P_V(R)'s vector of ±1 coefficients for the μ(T); T ∈ M − ∅, could, however, have a nonzero projection on Υ. In this situation, one might seek P̄_V(R): the average of P_V(R) over P ∩ (ŝ ⊕ Υ).

10. Optimization

Educing definite Λ is discussed in Section 8. Thus, the optimization at hand is the maximization of negative definite Λ, under linear constraints (11) and (12), with possibly one equality constraint used to fix N.
The most efficient numerical methodology available for this maximization involves sequential appli-
cation of the quadratic program [18]. In theory, implementations of the quadratic program efficiently
find the maximum of a linearly (equality and inequality) constrained, definite quadratic function, with
computational work approximately cubic in the number of variables and with memory requirements
linear in the number of constraints [19,22]. An admissible starting point derives from μ(T) = 0, for all T ∈ M − ∅—by projection thereof onto Ψ. Then, a given initial point and the maximum, found by the
quadratic program, define a step direction for maximization (provided that the two do not coincide: a
characteristic of the optimum). A sufficient-increase condition and an effective “line-search” yield an
effective step length [18]. The resulting point is taken to be the starting point for the next application
of the quadratic program, iterating until a convergence criterion is met. Advantages of this approach
include (i) guaranteed convergence to the optimum and (ii) a “quadratic convergence” in the vicinity of
the optimum [18]. Implementations of the quadratic program for “strictly convex” functions [22] are,
however, unsuited to nearly singular H’s.
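For a feel of the optimization, here is a deliberately simple substitute for the sequential quadratic program: projected-gradient ascent of Λ on Example 2's polytope, which (for that choice of M) is the ℓ1 ball |μ({1})| + |μ({2})| + |μ({3})| ≤ 1. This is not the paper's method, only a runnable sketch of maximizing the concave Λ under the half-space constraints:

```python
def project_l1(v, radius=1.0):
    """Euclidean projection onto the l1 ball -- Example 2's polytope."""
    if sum(abs(x) for x in v) <= radius:
        return list(v)
    u = sorted((abs(x) for x in v), reverse=True)
    css, theta = 0.0, 0.0
    for k, x in enumerate(u, 1):       # largest k with u_k > (css_k - r)/k
        css += x
        t = (css - radius) / k
        if x > t:
            theta = t
    return [max(abs(x) - theta, 0.0) * (1.0 if x >= 0 else -1.0) for x in v]

# Example 2's sample: m_{(1,1,-1)} = m_{(-1,1,-1)} = 1, m_{(1,-1,-1)} = 2
sample = [((1, 1, -1), 1), ((-1, 1, -1), 1), ((1, -1, -1), 2)]

def grad(mu):
    """Gradient of Lambda = sum_S m_S log P(S), P(S) = (1 + <s, mu>)/8."""
    g = [0.0, 0.0, 0.0]
    for s, mult in sample:
        d = 1 + sum(si * mi for si, mi in zip(s, mu))
        for i in range(3):
            g[i] += mult * s[i] / d
    return g

mu = [0.0, 0.0, 0.0]
for _ in range(4000):                  # small fixed step; crude but stable
    g = grad(mu)
    mu = project_l1([mi + 0.02 * gi for mi, gi in zip(mu, g)])
# the constrained maximizer is the vertex mu = (0, 0, -1)
assert abs(mu[0]) < 0.05 and abs(mu[1]) < 0.05 and mu[2] < -0.9
```

A production implementation would instead solve the quadratic subproblems over the explicit half-space list, as the text describes; the projection trick above works only because this particular polytope happens to be an ℓ1 ball.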
Table 2
Moments of one digit precede the moments of two digit positions, and within these groups, the optimal μ's are ordered lexicographically (wrapping across rows): e.g., μ({1}), μ({2}), . . . , μ({11}), μ({1, 2}), μ({1, 3}), . . . , μ({10, 11})
11. Example
Example 3. Here, μ(T) = 0 if 2 < |T|, with T ⊆ U. Thus, the moments which do not vanish are of zero, one or two digit positions. The distribution of interest is on (−1, +1)-binary 11-sequences. The sample contains binary representations of the smallest 100 natural numbers, commencing from zero. The standard binary representation of integers is used, subsequently mapping 0 → −1. To ensure definiteness of Λ, z_S is taken to equal 1 + 2^{−12} for the smallest 100 natural numbers and to equal 2^{−12} otherwise. In this example, Υ is null and N (= 2^{11}) is automatically constant. Table 2 contains the computed optimal-moment parameters.

At the global maximum, Λ is larger by approximately 177.3 than at the point where all μ's (except μ(∅)) equal zero. It may be noted that, at the global maximum, the eigenvalues of H are between −1581 and −44281.
As described in Section 9, the respective parameters may be substituted in (6) to obtain approximate posterior likelihoods for all sequences. Then, the likelihoods pertaining to the smallest 128 natural numbers are found to be comparable. Give that computer credit for gumption! There are also 512 likelihoods about 1/3 as large: the corresponding sequences have either precisely one 1 or all 1's at the digit positions 8–11 (where the sample had only zeroes), indexing the digit positions from 1 to 11. The likelihoods for the remaining sequences are much smaller. These results are potentially more advantageous than the sequelae of adopting a uniform, unrestricted prior (4): assigning likelihood 2/2148 to the 100 sequences of the sample and 1/2148 to the rest.
The optimum of this example was obtained using several methods, including steepest descent. The hands-down winner was the sequential application of the QLD algorithm to Λ [22]. This yields a point very close to the maximum in nine CPU seconds (after 10 steps); at this point, the magnitude of the gradient of Λ equals 4 × 10⁻⁹.
Acknowledgements
The author is indebted to Mr. Gary L. Sandine, formerly of Los Alamos National Laboratory, for
implementing the Torczon–Dennis simplex method, for the example of Section 10, and, also, for providing
comments on the manuscript. Dr. James M. (Mac) Hyman, of the Applied Mathematics group of Los
Alamos National Laboratory, was instrumental in the evaluation of suitable numerical methods. Dr.
Timothy L.Wallstrom, also of Los Alamos National Laboratory, contributed the essence of the proof of
Theorem 2. Dr. Winston A. Hide, of SANBI and the University of the Western Cape, helped initiate
this project, which was also benefited by the insights and enthusiasm of the late Dr. George I. Bell. The
introductory course on asymptotic analysis taught by emeritus Professor Harold L. Levine, of Stanford
University, provided essential foundations for these investigations. Thoughtful comments were generously
provided by Dr. Fern Y. Hunt, of NIST. This work was supported by the USDOE, under Contract W-
7405-ENG-36, through its Functional Genomics Initiative.
References
[1] M. Abramowitz, I.A. Stegun, Handbook of Mathematical Functions, US Government Printing Office, Washington, 1964.
[2] R.R. Bahadur, A representation of the joint distribution of responses to n dichotomous items, in: H. Solomon (Ed.), Studies
in Item Analysis and Prediction, Stanford University Press, Stanford, 1961, pp. 158–168.
[3] A.I. Barvinok, Computing the volume, counting integral points, and exponential sums, Discrete Comput. Geom. 10 (1993)
123–141.
[4] J.O. Berger, D.A. Berry, Statistical analysis and the illusion of objectivity, Amer. Sci. 76 (1988) 159–165.
[5] W.J. Bruno, G.-C. Rota, D.C. Torney, Probability set functions, Ann. Combinatorics 3 (1999) 13–26.
[6] T. Christof, http://www.iwr.uni-heidelberg.de/iwr/comopt/SMAPO/cut/cut.html, 1999.
[7] E.T. Copson, Asymptotic Expansions, Cambridge University Press, Cambridge, 1965.
[8] H.S.M. Coxeter, Regular Polytopes, Dover Publications, Inc., New York, 1973.
[9] M. Deza, M. Laurent, The even and odd cut polytopes, Discrete Math. 119 (1993) 49–66.
[10] F.R. Gantmacher, The Theory of Matrices, Chelsea Publishing Co., New York, 1977.
[11] B. Grünbaum, Convex Polytopes, Wiley, London, 1967.
[12] G.H. Hardy, Pure Mathematics, 10th Edition, Cambridge University Press, Cambridge, 1952, p. 285.
[13] M. Hazewinkel, Encyclopaedia of Mathematics, Vol. 4, Kluwer, Dordrecht, 1989, p. 293.
[14] K.L. Hiebert, Solving systems of linear equations and inequalities, SIAM J. Numer. Anal. 17 (1980) 447–464.
[15] J.B. Lasserre, Integration on a convex polytope, Proc. Am. Math. Soc. 126 (1998) 2433–2441.
[16] J. Lawrence, Polytope volume computation, Math. Comp. 57 (1991) 259–271.
[17] P.F. Lazarsfeld, The algebra of dichotomous systems, in: H. Solomon (Ed.), Studies in Item Analysis and Prediction,
Stanford University Press, Stanford, 1961, pp. 111–157.
[18] E. Polak, Optimization: Algorithms and Consistent Approximations, Applied Mathematical Sciences, Vol. 124, Springer, Berlin, 1997, p. 30.
[19] M.J.D. Powell, ZQPCVX, A FORTRAN subroutine for convex quadratic programming, University of Cambridge Technical
Report DAMTP/1983/NA17, Cambridge, 1983.
[20] L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE 77 (1989)
257–286.
[21] R.T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, 1970.
[22] K. Schittkowski, FORTRAN computer algorithm QLD, University of Bayreuth, Germany, 1991.
[23] K. Spindler, Abstract Algebra with Applications, Marcel Dekker, Inc., New York, 1994.
[24] D.C. Torney, Binary cumulants, Adv. Appl. Math. 25 (2000) 34–40.
[25] G.M. Ziegler, Lectures on Polytopes, Graduate Texts in Mathematics, Vol. 152, Springer, Berlin, 1995.