Adams, Freden, Mishna - 2013 - From Indexed Grammars To Generating Functions
1. Introduction
1.1. Indexed grammars. Indexed grammars were introduced in the thesis of Aho
in the late 1960s to model a natural subclass of context-sensitive languages, more
expressive than context-free grammars with interesting closure properties [1, 16].
The original reference for basic results on indexed grammars is [1]. The complete
definition of these grammars is equivalent to the following reduced form.
Definition 1. A reduced indexed grammar is a 5-tuple (N, T, I, P, S), such that
(1) N, T and I are three mutually disjoint finite sets of symbols: the set N of
non-terminals (also called variables), T is the set of terminals and I is the
set of indices (also called flags);
(2) S ∈ N is the start symbol;
(3) P is a finite set of productions, each having the form of one of the following:
(a) A → α
(b) A → Bf (push)
(c) Af → β (pop)
where A, B ∈ N, f ∈ I and α, β ∈ (N ∪ T)∗ .
Observe the similarity to context-free grammars which are only defined by produc-
tion rules of type (3a). The language defined by an indexed grammar is the set of all
strings of terminals that can be obtained by successively applying production rules
beginning with the rule that involves the start symbol S. A key distinction from
context-free grammars is that rather than expand non-terminals, we expand non-
terminal/stack pairs: (A, ι), ι ∈ I∗ , A ∈ N. Here, the start symbol S is shorthand
for the pair (S, ε), where ε denotes the empty stack.
Production rules in P are interpreted as follows. The stack is implicit, and is
copied when the production is applied. For example, the type (3a) production rule
A → aBC is shorthand for (A, ι) → a(B, ι)(C, ι), for A, B, C ∈ N, a ∈ T and
ι ∈ I∗ .
A production rule of form (3b) encodes a push onto the stack, and a production
rule of the form (3c) encodes a pop off of the stack. For example, the production
rule A → Bf applied to (A, ι) expands to (B, ι′ ) where ι′ is the stack ι with the
character f pushed on. Likewise, Af → β can only be applied to (A, ι) if the top
of the stack string ι is f . The result is β such that any nonterminal B ∈ β is of the
form (B, ι′′ ), where ι′′ is the stack ι with the top character popped off.
To lighten the notation the stack is traditionally written as a subscript. Note
the difference: the presence of a subscript in a production rule is shorthand for an
infinite collection of production rules, whereas in a derivation the stack is viewed
as part of the symbol. Furthermore, it is also useful to introduce an end of stack
symbol, which we write $. This symbol is reserved strictly for the last position in
the stack. This permits us to expand a non-terminal into a terminal only when the
stack is empty. These subtleties are best made clear through an example.
Example 1.1. The class of indexed languages is strictly larger than the class of
context-free languages since it contains the language L = {an bn cn : n > 0}. This
language is generated by the indexed grammar ({S, T, A, B, C}, {a, b, c}, {f }, P, S)
with
P = {S → T$ , T → Tf , T → ABC,
Af → aA, A$ → a, Bf → bB, B$ → b, Cf → cC, C$ → c} .
A typical derivation is as follows. We begin with S and derive aaabbbccc:

S → T$ → Tf$ → Tff$ → Aff$ Bff$ Cff$ → aAf$ Bff$ Cff$ → aaA$ Bff$ Cff$
→ aaa Bff$ Cff$ → aaab Bf$ Cff$ → aaabb B$ Cff$ → aaabbb Cff$ → · · · → aaabbbccc.
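As a quick sanity check (not part of the original paper), the derivation can be replayed mechanically in Python; the helper name derive_word is ours. It follows the productions of Example 1.1: load n − 1 copies of f onto T, copy the stack to A, B, C, then pop.

```python
def derive_word(n):
    """Derive a^n b^n c^n (n >= 1) by following the productions of Example 1.1."""
    # S -> T$, then T -> Tf applied n-1 times: the stack is f^{n-1}$
    stack = "f" * (n - 1) + "$"
    # T -> ABC: the stack is copied to each non-terminal
    sentential = [("A", stack), ("B", stack), ("C", stack)]
    out = []
    for nt, st in sentential:
        while st != "$":
            out.append(nt.lower())   # pop rules Af -> aA, Bf -> bB, Cf -> cC
            st = st[1:]              # remove the top index f
        out.append(nt.lower())       # end-of-stack rules A$ -> a, B$ -> b, C$ -> c
    return "".join(out)

print(derive_word(3))                # aaabbbccc
```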
1.2. The set of indexed languages. The set of all languages generated by in-
dexed grammars forms the set of indexed languages. As alluded to above, this is a
full abstract family of languages which is closed under union, concatenation, Kleene
closure, homomorphism, inverse homomorphism and intersection with regular sets.
The set of indexed languages, however, is not closed under intersection or
complement. The standard machine type that accepts the class of indexed languages is
the nested stack automaton.
This class of languages properly includes all context-free languages. These are
generated by grammars such that I is empty. One way to view indexed grammars is
as an extension of context-free grammars with an infinite number of non-terminals,
however the allowable productions are quite structured. Furthermore, it is a proper
subset of the class of context-sensitive languages. For instance {(abn )n : n ≥ 0} is
context-sensitive but not indexed [12].
Formal language theory in general and indexed languages in particular have ap-
plications to group theory. Two good survey articles are [25] and [13]. Bridson and
Gilman [3] have exhibited indexed grammar combings for fundamental 3-manifold
groups based on Nil and Sol geometries (see Example 4.3 below). More recently [15]
showed that the language of words in the standard generating set of the Grigorchuk
group that do not represent the identity (the so-called co-word problem) forms an
indexed language. The original DSV method (attributed to Delest, Schützenberger,
and Viennot [6]) of computing the growth of a context-free language was success-
fully exploited [11] to compute the algebraic but non-rational growth series of a
family of groups attributed to Higman. One of our goals is to extend this method
to indexed grammars to deduce results on growth series.
We prove that neither of these classes captures indexed grammars. In fact, many
of our examples of indexed grammars have lacunary generating functions, with a
natural boundary at the unit circle, because they are so sparse. This is perhaps
unsatisfying, but it also illustrates a key difference between computational com-
plexity and analytic complexity; a distinction which is not evident after studying
only context-free and regular languages.
That said, the expressive power of growth series derived from indexed languages
has been broached by previous authors. In [17], the authors consider a limitation
on possible productions, which is close to one of the restrictions we consider below.
They are able to describe the recurrence type satisfied by all sequences u(n) for
which the corresponding language {au(n) } is generated by this restricted set of
indexed grammars.
Furthermore, we mention that other characterizations of indexed languages are
equally amenable to analysis. In particular, indexed languages are equivalent to
sequences of level 2 in the sense of [26], and have several different descriptions.
Notably, they satisfy particular systems of catenative recurrent relations, to which
methods comparable to what we present here may apply. By taking combinations of
such sequences, Fratani and Senizergues [10] can give a characterization of D-finite
functions with rational coefficients.
A second motivation for the present work is to verify that various growth rates
are achievable with an indexed grammar. To that end, we often start with a
language where the enumeration is trivial, but the challenge is to provide an indexed
grammar that generates it. Our techniques verify that the desired growth rate has
been achieved.
1.4. Summary. Ideally, we would like to describe an efficient algorithm to deter-
mine the generating function of an indexed language given only a specification of its
indexed grammar. Towards this goal we first describe a process in the next section
that works under some conditions including only one stack symbol (excluding the
end of stack symbol). Proposition 2.5 summarizes the conditions, and the results.
We have several examples to illustrate the procedure. In Section 3 this is gener-
alized to multiple stack symbols that are pushed in order. In this section we also
illustrate the inherent obstacles in the case of multiple stack symbols. This is fol-
lowed by some further examples from number theory in Section 4 and a discussion
in Section 5 on inherent ambiguity in indexed grammars.
1.5. Notation. Throughout, we use standard terminology with respect to formal
language theory. The expression x|y denotes “x exclusive-or y”. We use epsilon
“ε” to denote the empty word. The Kleene star operator applied to x, written x∗
means make zero or more copies of x. A related notation is x+ which means make
one or more copies of x. The word reversal of w is indicated by wR . The length of
the string x is denoted |x|. We print grammar variables in upper case bold letters.
Grammar terminals are in lower case italic. We use the symbol →∗ to indicate the
composition of two or more grammar productions.
grammar. The enumeration for this example is trivial but the example serves a
pedagogical purpose, as it sets up the process in the case of one index symbol,
and provides an example of an indexed language whose generating function is not
differentiably algebraic.
As usual, we use $ to indicate the bottom-most index symbol. Disregarding this,
there is only one index symbol actually used, and so in the translation process we
note only the size of the stack, not its contents. Furthermore we identify S0 (z) =
S(z).
S → T$    T → Tf |D    Df → DD    D$ → a
S0(z) = T0(z)    Tn(z) = Tn+1(z) + Dn(z)    Dn+1(z) = Dn(z)^2    D0(z) = z.
Observe that indices are loaded onto T then transferred to D which is then
repeatedly doubled (D is a mnemonic for “duplicator”). After all doubling, each
instance of D$ becomes an a.
Immediately we solve Dn(z) = D0(z)^{2^n} = z^{2^n} , and the system of grammar
equations becomes

S0(z) = T0(z) = T1(z) + D0(z) = T2 + D1 + D0 = · · · = Σ_{n≥0} Dn(z) = Σ_{n≥0} z^{2^n} .
We observe that the sequence of partial sums converge as a power series inside
the unit circle, and that the Tn (z) are incrementally eliminated. We refer to this
process, summarized in Proposition 2.2 below, as pushing the Tn (z) off to infinity.
The function S(z) satisfies the functional equation S(z) = z + S(z 2 ). The series
diverges at z = 1, and hence S(z) is singular at z = 1. However, by the functional
equation it also diverges at z = −1. By repeated application of this argument, we
can show that S(z) is singular at every 2^n-th root of unity. Thus it has an infinite
number of singularities, and it cannot be D-finite. In fact, S(z) satisfies no algebraic
differential equation [19]. Consequently, the class of generating functions for
indexed languages is not contained in the class of differentiably algebraic functions.
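The functional equation can also be checked coefficient-by-coefficient over a finite range. A small sketch (ours, plain Python): build the coefficients of S(z) = Σ z^{2^n} up to degree 64 and compare them with those of z + S(z^2).

```python
N = 64
S = [0] * (N + 1)
k = 1
while k <= N:            # S(z) = sum_{n>=0} z^(2^n): coefficient 1 at powers of two
    S[k] = 1
    k *= 2

rhs = [0] * (N + 1)      # coefficients of z + S(z^2)
rhs[1] = 1
for n in range(N // 2 + 1):
    rhs[2 * n] += S[n]   # substituting z -> z^2 doubles every exponent

assert S == rhs          # S(z) = z + S(z^2) holds up to degree N
```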
2.2. A straightforward case: Balanced indexed grammars. Next, we de-
scribe a condition that allows us to guarantee that this process will result in a
simplified generating function expression for S(z).
Definition 2. An indexed grammar is balanced provided there are constants C, K ≥
0, depending only on the grammar, such that the longest string of indices associated
to any non-terminal in any sentential form W has length at most C|w| + K where
w is any terminal word produced from W. (Note: in all our balanced examples we
can take C = 1 and K ∈ {0, 1}.)
Proposition 2.2. Let G = (N, T, I, P, S) be an unambiguous, balanced, indexed
grammar in strongly reduced form for some non-empty language L with I = {f }.
Furthermore, suppose that V ∈ N is the only non-terminal that loads f and that the
only allowable production in which V appears on the right side is S → V$ . Then in
the generalized DSV equations for G, the sequence of functions Vn (z) ≡ Vf n (z) can
be eliminated (pushed to infinity). Under these hypotheses, the system of equations
defining S(z) reduces to finitely many bounded recurrences with initial conditions
whose solution is the growth function for L.
Proof. By hypothesis, there is only one load production and it has form V → Vf .
Without loss of generality we may assume there is no type (3c) rule Vf → β (in
fact, such a rule can be eliminated by the creation of a new variable U and adding
new productions V → U and Uf → β ). Necessarily there will be at least one
context-free type rule V → β (else V is a useless symbol).
Consider all productions in G that have V on the left side. Converting these pro-
ductions into the usual functional equations and solving for Vn (z) gives an equation
of form
Vn (z) = Vn+1 (z) + Wn±e (z)
where Wn±e (z) denotes an expression that represents all other grammar produc-
tions having V on the left side and e ∈ {0, 1, −1}.
We make the simplifying assumption that e = 0 for the remainder of this
paragraph. Starting with n = 0 and iterating N ≫ 0 times yields V0 (z) =
VN (z) + W0 (z) + W1 (z) + · · · + WN (z). By the balanced hypothesis, there exist
constants C, K ≥ 0 such that all terminal words produced from Vf N have
length at least N/C − K ≫ 0. This means that the first N/C − K terms in the ordi-
nary generating function for V0 (z) are unaffected by the contributions from VN (z)
and depend only on the fixed sum W0 (z) + W1 (z) + · · · + WN (z). Therefore the
(N/C − K)th partial sum defining the generating function for V0 (z) is stabilized as
soon as the iteration above reaches VN (z). This is true for all big N, so we may
take the limit as N → ∞ and express V0 (z) = Σ_{n≥0} Wn (z).
Allowing e = ±1 in the previous paragraph merely shifts indices in the sum
and does not affect the logic of the argument. Therefore, in all cases the variables
Vn (z) for each n > 0 are eliminated from the system of equations induced by the
grammar G. We assumed that V was the only variable loading indices, so all other
grammar variables either unload/pop indices or are terminals. Consequently, the
remaining functions describe a finite triangular system, with finitely many finite
recurrences of bounded depth and known initial conditions. The solution of this
simplified system is S(z).
We observe that the expression for V0 (z) derived above actually converges as an
analytic function in a neighborhood of zero (see section 1.3 above). It turns out
that the balanced hypothesis used above is already satisfied.
indices, the equality of sentential forms F ′′ = F ′ is not possible because this implies
ambiguity.
We claim that either F ′′ is longer than F ′ or that F ′′ has more terminals than F ′ .
If not, then F ′′ has exactly the same terminals as F ′ , and each has the same quantity
of variables. There are only finitely many arrangements of terminals and variables
for this length and for large N we may loop stepwise through the production cycle
arbitrarily often and thus repeat exactly a sentential form (discounting indices).
This implies our grammar is ambiguous contrary to hypothesis.
Thus after C steps the sentential forms gain at least one new terminal or one
new non-terminal. In the latter case, variables must convert into terminals on $ (via
the reduced, unambiguous, ε-free hypotheses). There will be at least one terminal
per step in the final output word w. We obtain the inequality (N − K)/C ≤ |w|
which establishes the lemma.
2.3. A collection of examples. We illustrate the method, its power and its lim-
itations with three examples. The first two examples exhibit languages with inter-
mediate growth and show that some of the hypotheses of Proposition 2.2 can be
relaxed. We are unable to resolve the third example to our satisfaction.
The first example is originally due to [14] and features the indexed language

LG/M = {ab^{i1} ab^{i2} · · · ab^{ik} : 0 ≤ i1 ≤ i2 ≤ · · · ≤ ik }

with intermediate growth (meaning that the number of words of length n ultimately
grows faster than any polynomial in n but more slowly than 2^{kn} for any constant k >
0). The question of whether a context-free language could have this property was
asked in [8] and answered in the negative [18, 4]. Grigorchuk and Machı́ constructed
their language based on the generating function of Euler’s partition function. A
word of length n encodes a partition sum of n. For instance, the partitions of n = 5
are 1+1+1+1+1, 1+1+1+2, 1+2+2, 1+1+3, 2+3, 1+4, 5. The corresponding
words in LG/M are aaaaa, aaaab, aabab, aaabb, ababb, aabbb, abbbb, respectively.
The derivation below is ours.
Example 2.4. An unambiguous grammar for LG/M is
S → T$ T → Tf |GT|G Gf → Gb G$ → a
The latter two productions imply that Gf^m$ →∗ ab^m , or in terms of functions,
Gm(z) = z^{m+1} . A typical parse tree is illustrated in Figure 2.1.
The second grammar production group transforms to
Tm (z) = Tm+1 (z) + Gm (z)Tm (z) + Gm (z) .
Substitution and solving for Tm gives

Tm(z) = (z^{m+1} + Tm+1(z)) / (1 − z^{m+1}) .
Iterating this recurrence yields a kind of inverted continued fraction:

S(z) = T0(z) = (z + T1(z))/(1 − z)
             = (z + (z^2 + T2(z))/(1 − z^2))/(1 − z)
             = (z + (z^2 + (z^3 + T3(z))/(1 − z^3))/(1 − z^2))/(1 − z) = · · · .

Equivalently, this recurrence can be represented as

(z + T1(z))/(1 − z) = (z(1 − z^2) + z^2 + T2(z))/((1 − z)(1 − z^2))
                    = (z(1 − z^2)(1 − z^3) + z^2(1 − z^3) + z^3 + T3(z))/((1 − z)(1 − z^2)(1 − z^3))
[Figure 2.1. Parse tree: S →∗ Tf^{i1}$; each Tf^{ij}$ expands to Gf^{ij}$ Tf^{ij+1}$, the G branch deriving ab^{ij}; the final Tf^{ik}$ derives ab^{ik}.]
or

S(z) = z/(1 − z) + z^2/((1 − z)(1 − z^2)) + z^3/((1 − z)(1 − z^2)(1 − z^3)) + · · ·
       + (z^k + Tk(z)) / ∏_{n=1}^{k} (1 − z^n) .
Even though this grammar allows the index loading variable T to appear on the
right side of the production T → GT (contrary to one of the hypotheses in Propo-
sition 2.2) the convergence proof of Proposition 2.2 is applicable to the expression
above allowing us to push Tk (z) off to infinity:
S(z) = Σ_{j≥1} z^j / ((1 − z)(1 − z^2) · · · (1 − z^j)) .
It is true that S(z) has a dense set of singularities on the unit circle and is not
D-finite.
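Since words of length n in LG/M encode partitions of n, the coefficients of this sum should be the partition numbers p(n); in particular p(5) = 7 matches the seven words listed earlier. A numerical check (ours, plain polynomial arithmetic truncated at degree N):

```python
N = 20
coeff = [0] * (N + 1)
for j in range(1, N + 1):
    term = [0] * (N + 1)
    term[j] = 1                      # numerator z^j
    for i in range(1, j + 1):
        for n in range(i, N + 1):    # multiply by 1/(1 - z^i) in place
            term[n] += term[n - i]
    for n in range(N + 1):
        coeff[n] += term[n]

# partition numbers p(1), ..., p(10)
assert coeff[1:11] == [1, 2, 3, 5, 7, 11, 15, 22, 30, 42]
```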
In general, allowing an index loading variable V to appear on both sides of a
context-free rule implies that the corresponding DSV system of equations will have
an algebraic (but not necessarily linear) equation expressing Vn (z) in terms of z and
Vn+1 (z). However, multiple occurrences of V on the right side of such a production
[Parse tree: T$ →∗ Tf^n$ → Wf^n$ →∗ a^{2(n−1)} W$, with W$ → a|b.]
= 1 + z + z^2 + z^3 + z^4 + z^5 + z^6 + z^7 + z^8 + z^9 + · · ·
    + z + z^2 + z^3 + z^4 + z^5 + z^6 + z^7 + z^8 + z^9 + · · ·
                  + 2z^4 + 2z^5 + 2z^6 + 2z^7 + 2z^8 + 2z^9 + · · ·
                                                      + 4z^9 + · · ·
  · · ·
and so forth. Sum the columns and observe that the coefficient of each z n is a
power of 2, with new increments occurring when n is a perfect square. Thus
S(z) = Σ_{n≥0} 2^{⌊√n⌋} z^n
and the coefficient of z n grows faster than any polynomial (as n → ∞) but is
sub-exponential.
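Reading the rows above as: row 0 contributes 1 to every coefficient, and row k ≥ 1 contributes 2^{k−1} to every z^n with n ≥ k², the column sums telescope to 2^{⌊√n⌋}. A short check of that reading (ours):

```python
from math import isqrt

def column_sum(n):
    """1 from row 0, plus 2^(k-1) from each row k >= 1 with k*k <= n."""
    return 1 + sum(2 ** (k - 1) for k in range(1, isqrt(n) + 1))

# the column sums collapse to 2^floor(sqrt(n)) for every n
assert all(column_sum(n) == 2 ** isqrt(n) for n in range(500))
```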
The indexed grammars used in applications (combings of groups, combinatorial
descriptions, etc) tend to be reasonably simple and most use one index symbol.
Despite the success of our many examples, Propositions 2.2 / 2.5 do not guarantee
that an explicit closed formula for S(z) can always be found.
does not appear to be a bargain (it is possible that a multivariate generating func-
tion as per Example 4.3 may be helpful).
Vm,n(z) = z^{2m+n+1} and Rm,n = Vm,n / (1 − Vm,n) = z^{2m+n+1} / (1 − z^{2m+n+1}) .
The grammar production U → Uf |VR|V implies for fixed n > 0 that
U1,n (z) = U2,n + (V1,n R1,n + V1,n ) = U3,n + (V2,n R2,n + V2,n ) + (V1,n R1,n + V1,n ) .
The hypotheses of Proposition 2.5 are satisfied in that we are dealing with a balanced
grammar where currently only one index symbol is being loaded onto one variable.
Therefore for fixed n we can push Um,n (z) off to infinity and obtain
[Parse tree: S →∗ Uf^m g^n$ → Vf^m g^n$ Rf^m g^n$; each R-node regenerates Vf^m g^n$ Rf^m g^n$, and each V branch derives ab^n c^{n+m}.]
U1,n(z) = Σ_{m≥1} (Vm,n Rm,n + Vm,n) = Σ_{m≥1} Rm,n(z) = Σ_{m≥1} z^{2m+n+1} / (1 − z^{2m+n+1}) .
with the latter two summations realized by expanding geometric series and/or
changing the order of summation in the double sum. In any event, S(z) has infin-
itely many singularities on the unit circle and is not D-finite.
3.1. “Encode first, then copy”. It is worth noting the reason why the copying
schema used above succeeds in indexed grammars but fails for context-free gram-
mars. The word abi cj is first encoded as an index string attached to V and only
then copied to VR (the grammar symbol R is a mnemonic for “replicator”). This
ensures that abi cj is faithfully copied. Slogan: “encode first, then copy”. Context-
free grammars are limited to “copy first, then express” which does not allow for
fidelity in copying.
We would like to generalize the previous example. The key notion was the
manner in which the indices were loaded.
S → T$ T → Tα |Tβ |N Nα → aN Nβ → bNbNb N$ → ε
When applying the DSV transformations to this grammar we would like to write
N1,1 (z) as the formal power series corresponding to the grammar variable N with
any index string having one α index and one β index, followed by the end of stack
marker $. Note the derivations S →∗ Nαβ$ →∗ abbb and S →∗ Nβα$ →∗ babab. Even
though both intermediate sentential forms have one of each index symbol, followed
by the end of stack marker $, they produce distinct words of differing length. Thus
using subscripts to indicate the quantity of stack indices cannot work in general
without some consideration of index ordering.
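The order-sensitivity can be replayed directly from the productions Nα → aN, Nβ → bNbNb, N$ → ε. In the sketch below (ours), the characters 'a' and 'b' inside the stack string stand for the indices α and β:

```python
def derive(stack):
    """Expand N carrying the given index string down to a terminal word."""
    if stack == "$":
        return ""                    # N$ -> epsilon
    top, rest = stack[0], stack[1:]
    if top == "a":                   # alpha on top: N -> aN, pop alpha
        return "a" + derive(rest)
    w = derive(rest)                 # beta on top: N -> bNbNb, pop beta
    return "b" + w + "b" + w + "b"

assert derive("ab$") == "abbb"       # N_{alpha beta $} derives abbb
assert derive("ba$") == "babab"      # N_{beta alpha $} derives babab
```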
We note that the grammar is reduced and balanced. It is also unambiguous,
∗
which can be verified by induction. In fact, if σ ∈ (α|β) $ is an index string
∗ ∗
such that Nσ → w where w is a terminal word of length n, then Nασ → aw and
∗
Nβσ → bwbwb where |aw| = n + 1 and |bwbwb| = 2n + 3. Suppose that all words
w ∈ Lord of length n or less are produced unambiguously. Consider a word v of
length n + 1. Either v = aw or v = bw′ bw′ b for some shorter words w, w′ ∈ Lord
where we can push the Tn (z) to infinity as per the proof of Proposition 2.5. Therefore

S(z) = Σ_{n≥0} 2^n Rn(z)^2 = Σ_{n≥0} 2^n z^{2n} = 1 / (1 − 2z^2) .
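The closed form is a geometric series in 2z²; multiplying the coefficient sequence by 1 − 2z² should leave exactly the constant 1. A quick check (ours):

```python
N = 40
S = [0] * (N + 1)
for n in range(0, N + 1, 2):
    S[n] = 2 ** (n // 2)             # coefficients of sum_{n>=0} 2^n z^{2n}

# multiply by (1 - 2z^2) and compare with the constant series 1
prod = [S[n] - (2 * S[n - 2] if n >= 2 else 0) for n in range(N + 1)]
assert prod[0] == 1 and not any(prod[1:])
```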
S(z) = T0 = T1 + A1 R1 = T2 + A1 R1 + A2 R2 = · · · = Σ_{n≥1} An Rn = Σ_{n≥1} z^n / (1 − z^n) .
Expand each rational summand into a geometric series and collect terms
S(z) = z/(1 − z) + z^2/(1 − z^2) + z^3/(1 − z^3) + z^4/(1 − z^4) + · · ·
     = z + z^2 + z^3 + z^4 + z^5 + z^6 + z^7 + z^8 + z^9 + z^10 + · · ·
         + z^2     + z^4     + z^6     + z^8      + z^10 + · · ·
             + z^3         + z^6             + z^9 + · · ·
                 + z^4             + z^8 + · · ·
                     + z^5                      + z^10 + · · ·
       · · ·
     = Σ_{n≥1} τ(n) z^n ,
where τ (n) is the number of positive divisors of n. Again, S(z) has infinitely many
singularities on the unit circle and is not D-finite.
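The divisor-counting in the table can be confirmed directly: the row for z^k/(1 − z^k) contributes 1 to every coefficient of z^{mk}, so z^n collects one contribution per divisor of n. A check (ours):

```python
N = 100
coeff = [0] * (N + 1)
for k in range(1, N + 1):            # row z^k/(1 - z^k) = z^k + z^{2k} + z^{3k} + ...
    for n in range(k, N + 1, k):
        coeff[n] += 1

def tau(n):
    """Number of positive divisors of n, by brute force."""
    return sum(1 for d in range(1, n + 1) if n % d == 0)

assert all(coeff[n] == tau(n) for n in range(1, N + 1))
```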
Example 4.2. Let Lcomp = {a^c : c is composite} denote the composite numbers
written in unary. A generative grammar is
S → Tf $ T → Tf |R R → RA|AA
Af → aA A$ → a
with sample derivation

S →∗ Rf^n$ →∗ Rf^n$ (Af^n$)^m → (Af^n$)^{m+1} →∗ a^{(n+1)(m+1)} .
This is certainly an ambiguous grammar because there is a separate, distinct
production of a^c for each nontrivial factorization of c. (Note: one can tweak the
grammar to allow the trivial factorizations 1 · c and c · 1. The resulting language
becomes the semigroup a+ isomorphic to Z+, but the generating function of all
grammar productions is the familiar Σ τ(n) z^n which we saw in Example 4.1.)
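That the derivable lengths (n + 1)(m + 1), with n, m ≥ 1, are exactly the composite numbers is easy to confirm over an initial range (a check of ours):

```python
N = 200
lengths = {(n + 1) * (m + 1)
           for n in range(1, N) for m in range(1, N)
           if (n + 1) * (m + 1) <= N}     # lengths of derivable words, n, m >= 1

composites = {c for c in range(2, N + 1)
              if any(c % d == 0 for d in range(2, c))}

assert lengths == composites
```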
1 As written, this grammar is not reduced because the rule T → Af Rf loads two indices
simultaneously. However, by replacing that production by the pair T → Uf , Uf → AR we
obtain an equivalent grammar in reduced form. We use the former rule for brevity.
Suppose we want the generating function for the sum of positive divisors, Σ σ(n)z^n ?
Then our table expansion above would look like

S(z) = z + 2z^2 + 3z^3 + 4z^4 + 5z^5 + 6z^6 + · · ·
         + 2z^2       + 4z^4       + 6z^6 + · · ·
               + 3z^3                   + 6z^6 + · · ·
                     + 4z^4 + · · ·
       · · ·
Recovering the original grammar by restoring the U productions allows for the
construction of repeated cutting sequences wt , w being the word associated to a
coprime lattice point (i, j). This serves to add the lattice points (ti, tj). Here we
may assume i and j are relatively prime and t ≥ 2 which makes these additions
unique. (In fact, if (ti, tj) = (sk, sl) for another coprime pair (k, l), then both s
and t are the greatest common divisor of the ordered pair and hence s = t.) The
full grammar is in bijective correspondence with the integer lattice strictly inside
quadrant one. Words represent geodesics in the taxicab metric. Simple observation
shows that the full growth series is represented by the rational function
Σ_{n≥2} (n − 1) z^n = z^2 / (1 − z)^2 .
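The rational form can be verified as formal power series: multiplying Σ_{n≥2} (n − 1)z^n by (1 − z)² should leave exactly z². A quick check (ours):

```python
N = 30
S = [0] * (N + 1)
for n in range(2, N + 1):
    S[n] = n - 1                     # coefficients of sum_{n>=2} (n-1) z^n

# multiply by (1 - z)^2 = 1 - 2z + z^2 and compare with z^2
prod = [S[n]
        - 2 * (S[n - 1] if n >= 1 else 0)
        + (S[n - 2] if n >= 2 else 0)
        for n in range(N + 1)]
assert prod == [0, 0, 1] + [0] * (N - 2)
```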
5. Ambiguity
We begin with the first published example of an inherently ambiguous context-
free language [16, 22]. It has an unambiguous indexed grammar.
Example 5.1. Define Lamb = {a^i b^j a^k b^l : i, j, k, l ≥ 1; i = k or j = l}. The idea
with no restrictions on i, k other than that all exponents are at least one. An
indexed grammar is
S → Tg$ T → Tg |Uf |Z U → Uf |X|Y
Af → aA Ag → ε Bf → B Bg → bB B$ → ε
Our examples raise the question: are there inherently ambiguous indexed lan-
guages? Consider Crestin’s language of palindrome pairs defined by LCrestin =
{vw : v, w ∈ (a|b)∗ , v = v^R , w = w^R }. It is a “worst case” example of an
inherently ambiguous context-free language (see [8] and its references). We conjecture
that LCrestin remains inherently ambiguous as an indexed language. What about
inherently ambiguous languages that are not context-free?
Consider the composite numbers written in unary as per Example 4.2. What
would an unambiguous grammar for Lcomp look like? We would need a unique
factorization for each composite c. Since the arithmetic that indexed grammars
can simulate on unary output is discrete math (like addition and multiplication,
no division or roots, etc), we need the Fundamental Theorem of Arithmetic. In
fact, suppose there is a different unique factorization scheme for the composites,
that doesn’t involve a certain prime p. Then composite c2 = p^2 has only the
factorization 1 · c2, and similarly c3 = p^3 has unique factorization 1 · c3 since p · c2 is
disallowed. But then p^6 = c2 · c2 · c2 = c3 · c3 has no unique factorization. Therefore
all primes p are needed for any unique factorization of the set of composites. Adding
any other building blocks to the set of primes ruins unique factorization.
Suppose we have an unambiguous indexed grammar for Lcomp . It would be able
to generate a^{p^k} for any prime p and all k > 1. This requires a copying mechanism (in
the manner of R in Examples 3.1 and 4.2) and an encoding of p into an index string
(recall our slogan “encode first, then copy” from Section 3.1). In other words, our
supposed grammar for Lcomp must be able to first produce its complement Lprime
and encode these primes into index strings. However, [21] show that the set of
index strings associated to a non-terminal in an indexed grammar is necessarily a
regular language. On the other hand [2] shows that the set of primes expressed in
any base m ≥ 1 does not form a regular language. We find it highly unlikely that
an indexed grammar can decode all the primes from a regular set of index strings.
We conjecture that Lcomp = {a^c : c is composite} is inherently ambiguous as an
indexed language.
Recall that a word is primitive if it is not a power of another word. In the copious
literature on the subject it is customary to let Q denote the language of primitive
words over a two letter alphabet. It is known that Q is not unambiguously context-
free (see [23, 24], which exploit the original Chomsky–Schützenberger theorem listed
in Section 2 above). It is a widely believed conjecture that Q is not context-free at
all (see [7]).
L′ = {w^k : w ∈ (a|b)∗ , k > 1} defines the complement of Q with respect to
the free monoid (a|b)∗ . It is not difficult to construct an ambiguous balanced
grammar for L′ (a simple modification of Example 3.5 will suffice). What about
an unambiguous grammar? Recall from [20] that w1^n = w2^m implies that each wi
is a power of a common word v. Thus to avoid ambiguity, each building block w
used to construct L′ needs to be primitive. This means we must not only be able
to recreate Q in order to generate L′ unambiguously, we must be able to encode
each word w ∈ Q as a string of index symbols, as per the language of composites.
We refer again to Section 3.1. We find this highly unlikely and we conjecture that
L′ = {w^k : w ∈ (a|b)∗ , k > 1} is inherently ambiguous as an indexed language.
6. Open questions
We observed that in many cases the generating function S(z) of an indexed
language is an infinite sum (or multiple sums) of a family of functions related by a
finite depth recursion (or products/sums of the same). As we mentioned earlier, [17]
give explicit sufficient conditions for growth series of indexed languages on a unary
alphabet. They also show that such growth series are defined in terms of the
recursions we mentioned above.
Into what class do the generating functions of indexed languages fit? Can we
characterize the types of productions that lead to an infinite number of singularities
in the generating function? Ultimately, this was a common property of many of the
examples, and possibly one could generalize the conditions that lead to an infinite
number of singularities in Example 2.1 to a general rule on production types in the
grammar. It seems that the foundation laid by Fratani and Sénizergues in their
work on catenative grammars is a very natural starting point for such a study.
Can we characterize the expressive power of the grammars which satisfy the
hypotheses of Proposition 2.5? The alternate characterizations of level 2 sequences
in [26] might be useful. Can we show that {a^{p(n)} : p(n) is the nth prime} is a level
3 (or higher?) language? Perhaps this should precede a search for an indexed
grammar.
Is Crestin’s language inherently ambiguous as an indexed language? What about
the composite numbers in unary or the complement of the primitive words?
In step with much modern automatic combinatorics, we would like to build
automated tools to handle these, and other questions related to indexed grammars.
Towards this goal the first author has written a parser/generator that inputs an
indexed grammar and outputs words in the language. It is licensed under GPLv3
and is available from the authors. Is there a way to automate the process of pushing
a set of terminal variables off to infinity?
Finally, we end where we began with the non-context-free language {an bn cn :
n > 0}. It has a context-sensitive grammar
S → abc|aBSc Ba → aB bB → bb
for which the original DSV method works perfectly. The method fails for sev-
eral other grammars generating this same language. What are the necessary and
sufficient conditions to extend the method to generic context-sensitive grammars?
To what extent can we see these conditions on the system of catenative recurrent
relations?
Acknowledgements
Daniel Jorgensen contributed ideas to the first two authors. The third author
offers thanks to Mike Zabrocki and the participants of the Algebraic Combinatorics
seminar at the Fields Institute for Research in Mathematical Science (Toronto,
Canada) for early discussions on this theme. We are also grateful to the referees
for pointing us to several references.
References
[1] A. V. Aho. Indexed grammars—an extension of context-free grammars. J. Assoc. Comput.
Mach., 15:647–671, 1968.
[2] D. Allen, Jr. On a Characterization of the Nonregular Set of Primes. J. Comput. System Sci.,
2:464–467, 1968.
[3] M. R. Bridson and R. H. Gilman. Formal language theory and the geometry of 3-manifolds.
Comment. Math. Helv., 71 (4):525–555, 1996.
[4] M. R. Bridson and R. H. Gilman. Context-free languages of sub-exponential growth. J. Com-
put. System Sci., 64 (2):308–310, 2002.
[5] N. Chomsky and M. P. Schützenberger. The algebraic theory of context-free languages. In
Computer programming and formal systems, pages 118–161. North-Holland, Amsterdam,
1963.
[6] M. Delest. Algebraic languages: a bridge between combinatorics and computer science. In
Formal power series and algebraic combinatorics (New Brunswick, NJ, 1994), pages 71–87.
DIMACS Ser. Discrete Math. Theoret. Comput. Sci., 24, Amer. Math. Soc., Providence, RI,
1996.
[7] Pál Dömösi, Sándor Horváth, Masami Ito, László Kászonyi, and Masashi Katsura. Some
combinatorial properties of words, and the Chomsky-hierarchy. In Words, languages and
combinatorics, II (Kyoto, 1992), pages 105–123. World Sci. Publ., River Edge, NJ, 1994.
[8] Ph. Flajolet. Analytic models and ambiguity of context-free languages. Theoret. Comput.
Sci., 49 (2-3):283–309, 1987. Twelfth international colloquium on automata, languages and
programming (Nafplion, 1985).
[9] Ph. Flajolet and R. Sedgewick. Analytic combinatorics. Cambridge University Press, Cam-
bridge, 2009.
[10] S. Fratani and G. Sénizergues. Iterated Pushdown Automata and Sequences of Rational
Numbers. Ann. Pure Appl. Logic, 141:363–411, 2005.
[11] E. M. Freden and J. Schofield. The growth series for Higman 3. J. Group Theory, 11 (2):277–
298, 2008.
[12] R. H. Gilman. A shrinking lemma for indexed languages. Theoret. Comput. Sci., 163 (1-
2):277–281, 1996.
[13] R. H. Gilman. Formal languages and their application to combinatorial group theory. In
Groups, languages, algorithms, volume 378 of Contemp. Math., pages 1–36. Amer. Math.
Soc., Providence, RI, 2005.
[14] R. I. Grigorchuk and A. Machı̀. An example of an indexed language of intermediate growth.
Theoret. Comput. Sci., 215 (1-2):325–327, 1999.
[15] D. F. Holt and C. E. Röver. Groups with indexed co-word problem. Internat. J. Algebra
Comput., 16 (5):985–1014, 2006.
[16] John E. Hopcroft and Jeffrey D. Ullman. Introduction to automata theory, languages, and
computation. Addison-Wesley Publishing Co., Reading, Mass., 1979. Addison-Wesley Series
in Computer Science.
[17] L. P. Lisovik and T. A. Karnaukh. A class of functions computable by index grammars.
Cybernetics and Systems Analysis, 39(1):91–96, 2003.
[18] R. Incitti. The growth function of context-free languages. Theoret. Comput. Sci., 255 (1-
2):601–605, 2001.
[19] L. Lipschitz and L. A. Rubel. A gap theorem for power series solutions of algebraic differential
equations. Amer. J. Math., 108 (5):1193–1213, 1986.
[20] R. C. Lyndon and M. P. Schützenberger. The equation aM = bN cP in a free group. Michigan
Math. J., 9:289–298, 1962.
[21] R. Parchmann and J. Duske. The structure of index sets and reduced indexed grammars.
RAIRO Inform. Théor. Appl., 24 (1):89–104, 1990.
[22] R. J. Parikh. On context-free languages. J. of the ACM, 13:570–581, 1966.
[23] H. Petersen. The ambiguity of primitive words. In STACS 94 (Caen, 1994), volume 775 of
Lecture Notes in Comput. Sci., pages 679–690. Springer, Berlin, 1994.
[24] H. Petersen. On the language of primitive words. Theoret. Comput. Sci., 161 (1-2):141–156,
1996.
[25] S. Rees. A language theoretic analysis of combings. In Groups, languages and geometry
(South Hadley, MA, 1998), volume 250 of Contemp. Math., pages 117–136. Amer. Math.
Soc., Providence, RI, 1999.
[26] G. Sénizergues. Sequences of Level 1, 2, 3, . . . , k, . . . . In Computer Science Theory and
Applications, volume 4649 of Lecture Notes in Comput. Sci., pages 24–32. Springer, Berlin,
2007.
Department of Mathematics, Southern Utah University, Cedar City, UT, USA 84720
Department of Mathematics, Southern Utah University, Cedar City, UT, USA 84720
Department of Mathematics, Simon Fraser University, Burnaby BC, Canada V5A 1S6