Berwick Complexity
ROBERT C. BERWICK
Massachusetts Institute of Technology
A child who is capable of language learning must have … (v) a method for selecting one of
the (presumably infinitely many) hypotheses that are allowed by [the theory of grammar] and
are compatible with the primary linguistic data. Correspondingly, a theory of linguistic
structure that aims for explanatory adequacy must contain … (v) a way of evaluating
alternative proposed grammars… [a] specification of a function m such that m(i) is an integer
associated with the grammar Gi as its value (with, let us say, lower value indicated by higher
number).
Aspects, pp. 30-31
But what is this “way of evaluating alternative proposed grammars”? By way of illustration
Chomsky contrasts two possible “grammars” for a dataset containing the possible sequences of
English auxiliary verbal elements. There are eight such sequences: first an obligatory Tense
element, then a Modal form or not (will, etc.); followed by a Perfective form (have) or not; then a
Progressive (be) or not. Each binary choice allows for 2 possibilities, so there are 2^3 = 8 auxiliary sequences in all, including the sequence where none of the latter three elements is present. These eight sequences, the “data,” can be generated by a grammar G1 with eight rewrite rules, writing Aux for the auxiliary sequence, as follows:
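(1) Aux → Tense
    Aux → Tense Modal
    Aux → Tense Perfective
    Aux → Tense Progressive
    Aux → Tense Modal Perfective
    Aux → Tense Modal Progressive
    Aux → Tense Perfective Progressive
    Aux → Tense Modal Perfective Progressive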
However, as Chomsky then describes, if our notational system can use parentheses to denote
optionality, all eight data patterns can also be generated a second way, as grammar G2 with just a
single rule, (2):
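(2) Aux → Tense (Modal) (Perfective) (Progressive)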
Since the three items in parentheses can either be present or not, it is straightforward to see
that this one-rule grammar captures the same 2^3 = 8 examples as the eight-rule grammar. The two
grammars are thus both weakly and strongly equivalent in terms of descriptive adequacy. What
about explanatory adequacy? G1 contains roughly the same number of rules as would be
required to describe any random sequence of three binary-valued elements.1 Consequently, this
is the worst that a grammar can ever do in terms of explanation, since it fails to capture any
regularity in the data. It has simply memorized the data as a list. In contrast, G2 is exponentially
more succinct than G1. It is in this sense that G2 has “compressed” the original dataset into a
smaller rule system, while G1 has not. This kind of compression may be taken as the hallmark of
explanatory adequacy as set out in Aspects. In the remainder of this chapter we show that this
kind of exponential reduction in grammar size can indeed serve as a litmus test for
discriminating among otherwise descriptively equivalent grammars. Such gaps apparently arise
whenever one grammar fails to capture generalizations and then must resort to memorizing –
explicitly listing – the data as if it were a random string, while the alternative grammar
“compresses” the data set exponentially. It is in this sense that “exponential gaps” indicate that
overly large grammars have missed generalizations in the linguistic data. They lack explanatory
adequacy.
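To make the comparison concrete, the expansion can be carried out mechanically. The following short Python sketch – purely illustrative, with symbol names chosen to match the discussion above – expands the single parenthesized rule (2) into its 2^3 = 8 sequences and compares the number of rules each grammar needs:

from itertools import product

# Expand Aux -> Tense (Modal) (Perfective) (Progressive) by taking every
# combination of the three optional elements (present or absent).
optional = ["Modal", "Perfective", "Progressive"]
sequences = [["Tense"] + [e for e, keep in zip(optional, choice) if keep]
             for choice in product([True, False], repeat=len(optional))]

g1_rule_count = len(sequences)   # G1 lists one rewrite rule per sequence: 8
g2_rule_count = 1                # G2 states the same facts with a single rule
print(len(sequences), g1_rule_count, g2_rule_count)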
It is also clear that the ability to compress data depends on the machinery the representational
system provides, an empirical matter, again as Chomsky notes. Rule (2) depends on the
availability of something like the parenthesis notation to denote optionality. If the auxiliary verb
sequences had formed some other set of patterns, for instance, as Aspects notes, the cyclic
permutation of {Modal, Perfective, Progressive}, then the parentheses notation would not yield a
one-rule description of this new dataset. Rather, a system that included different representational
machinery, cyclic permutation, would be able to compress these eight new examples down to
one rule. In this way, explanatory adequacy depends entirely on the available representational
armamentarium. Consequently, Chomsky rightly emphasizes that what does the heavy lifting in
theory-building here is not the particular ‘calculus’ involved – counting symbols in this
particular case – but rather the substantive, empirical constraints on the machinery of the
mind/brain.
1. Actually ternary-valued, since there are three possible elements in each position, giving 3^3 = 27 possible random strings of length 3 rather than just 2^3 = 8; this corresponds to the number of bits required to encode any string over an alphabet of size 3. While one actually needs to count bits to account more properly for the coding of the notational system’s alphabet, this detail does not change the comparative advantage of G2 as compared to G1. See Berwick (1982), chapter 4, for further discussion.
2. The restriction to a finite language is crucial here. If we move to infinite regular languages, then the succinctness gap between finite-state grammars and context-free grammars generating the same language becomes even more enormous – it need not be bounded by any recursive function (Meyer and Fischer, 1971). And in fact, the PDA and grammar in Figure 1(b) will recognize exactly the palindromic sentences of any length. To ensure that sentences longer than n are not recognized, the CFG would really have to use, e.g., nonterminals S1, S2, etc. to ‘count’ the
length of the generated sentences. Note however that this only increases the size of the CFG (or PDA) by a linear
amount. The grammar or PDA would still be exponentially more succinct than the equivalent RG. To simplify the
presentation here, as with the FSA we have not included a special failure state; this would absorb any sentences
longer than four or any non-palindromic sequences of length four or less. We have also glossed over details about
measuring the RG or PDA/CFG size that are not relevant here. In reality, as noted in the main text, we must count
the number of bits encoding the PDA, including its alphabet symbols, transitions, and pushdown stack moves. A
more exact accounting in information theoretic terms would also include the probability of each automaton move;
see Li and Vitányi (1997) for more details.
While both theories are descriptively adequate, it is not hard to show that as we consider palindrome languages of increasing length, the PDA becomes exponentially more succinct than the corresponding FSA; similarly for their corresponding grammars. Further, it is easy to see by inspection of the grammar in Figure 1(a) that the finite-state representation has in effect simply listed all the possible strings in LPAL-4, with |LPAL-4| ≈ |RGPAL-4|. In contrast, the PDA/CFG does capture the palindrome generalization, because the PDA or CFG for this language is smaller than the set of sentences it describes. The reason why this is so is straightforward.
Each time n increases, say from 4 to 6, the FSA or grammar must be able to distinguish among
all the possible new ‘left-hand’ parts of each palindrome sentence before the middle, in order to
ensure that it can correctly decide whether the following ‘right-hand’ part is the same as the left.
For the change from n = 4 to 6, this means being able to distinguish among all the distinct strings of length 3 – the first halves of all the new palindromes of length 6 – and there are 2^3 = 8 of these (aaa, aab, abb, etc.). The FSA must be able to do this in order to recognize that the second half of a length-6 sentence is properly paired with the first half. But the only way an FSA can do this is to have distinct states for each one of these 2^3 possibilities. Therefore, in general, the palindrome-recognizing FSA will have to distinguish among 2^(n/2) strings. Thus as n increases, the number of states in the corresponding FSA or finite-state grammar must increase in proportion to 2^(n/2) – an
exponential rate of increase in n. Put another way, the FSA must memorize all strings of length
n/2 or less by using its states.
In contrast, for the palindrome recognizing PDA, increasing n does not demand a
corresponding increase in its size, since it can use the same set of fixed general rules that “match
up” corresponding pairs of a’s and b’s and these do not need to be altered at all: the size of the
PDA grows linearly with respect to n. In this sense, the PDA has captured the essence of the
palindrome-type language pattern, matching up paired symbols. In contrast, a similar-sized FSA
could have just as well described any set of random strings up to length n/2, since the number of
states in an FSA is the same for both palindromic strings and any random set of sentences of
length n/2. In linguistic terms, we say that the FSA or corresponding grammar has “missed the
generalization” for the palindrome pattern, because its description – the FSA or grammar size –
is almost exactly the same size as the data to be described – that is, proportional to 2^(n/2).
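The counting behind this argument can be checked with a small, purely illustrative Python sketch: for each even n it tallies the distinct left halves over {a, b} that a palindrome-recognizing FSA must keep apart, against the fixed three rules of the context-free grammar S → aSa | bSb | λ given in Figure 1(b):

# For even n, the FSA needs a distinct state for every possible left half,
# while the CFG uses the same three rules no matter how large n becomes.
CFG_RULES = 3
for n in range(2, 13, 2):
    left_halves = 2 ** (n // 2)  # distinct strings of length n/2 over {a, b}
    print(f"n = {n:2d}: FSA must distinguish {left_halves:4d} left halves; CFG rules: {CFG_RULES}")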
Q0 → a Q1 | Q0 → b Q2
Q1 → a | Q1 → a Q3 | Q1 → b Q4
Q2 → b | Q2 → a Q5 | Q2 → b Q6
Q3 → a Q7 | Q4 → b Q7
Q5 → a Q8 | Q6 → b Q8
Q7 → a | Q8 → b
[FSA transition diagram omitted; states q0–q9.]
Figure 1(a) A right-linear (finite-state) grammar and an equivalent FSA for LPAL-4. The FSA uses distinct states to keep track of all possible left-hand parts of the palindrome sentences of length up to 2; there are 2^2 = 4 of these.
Start → S
S → a S a | λ
S → b S b | λ
[PDA transition diagram omitted; states q0–q3, arc operations: a,Z;aZ  b,Z;bZ  a,a;aa  b,b;bb  a,b;ab  b,a;ba  a,a:λ  b,b:λ  λ,λ:λ  λ,Z:λ.]
Figure 1(b) A context-free grammar and an equivalent push-down automaton for LPAL-4 with 4 states and 10 push/pop transition operations that move the PDA between states. The PDA starts in state q0 and its final state is q3. The symbol λ denotes the empty string and Z a special bottom-of-stack element. The operations on each arc are of the form input, top-of-stack symbol popped; new stack contents. For example, the operation b, Z; bZ says that if b is the current input symbol, and the result of popping the top of the stack is Z, then the new stack configuration is bZ, i.e., b is pushed on top of Z, and the PDA moves to state q1.3
3. To simplify the presentation, as in the FSA we have not included a special failure state; this would absorb any
sentences longer than four or any non-palindromic sequences of length four or less. The PDA as presented will of
course accept all and only the palindromes of any length.
4. An earlier version of this material first appeared in Berwick (1982), chapter 4; details of the results described in this section may be found there. The main result, that GPSGs grow exponentially in size with the number of ‘filler’ items that may be displaced, is established as Theorem 4.1 of that chapter.
To demonstrate the exponential blow-up associated with GPSG and filler-gap dependencies,
we first recall how GPSG encodes a single filler-gap dependency through an expanded set of
nonterminals. We begin with a sentence that has all its ‘fillers’ in canonical argument positions,
e.g.,
As usual, an object wh-question version of this sentence has a ‘filler,’ e.g., who, at the front of
the sentence, linked to the ‘gap’ position as the argument of kissed:
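(5) Who did John kiss?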
The GPSG analysis of this sentence links the filler who to the (unpronounced) gap via an
analysis where ordinary nonterminals like S, VP, or NP have ‘slashed’ counterparts S/NP,
VP/NP, and NP/NP, where XP/YP is to be interpreted as a constituent of type XP that dominates (and is missing) a constituent of type YP, expanded as an unpronounced element
somewhere below (Gazdar, 1981). ‘Slash’ rules are introduced by context-free rules of the usual
sort, but using the ‘slashed’ nonterminal names, e.g.:
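CP → wh-NP S/NP; S/NP → NP VP/NP; VP/NP → V NP/NP; NP/NP → ε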
In this way, one can arrange for a sequence of ‘slash’ rules to serve as a way to ‘store’ the
information that a wh-NP is at the front of the sentence, as in (5), and pass that information down
a (c-commanding) chain of nonterminal expansions, terminating with the rule NP/NP → ε, where
ε is the empty string (the gap). In the terminology of contemporary transformational grammar,
the GPSG rules construct something like a chain, with its head positioned at the filler, and then
the encoding of that filler as intermediate slashed nonterminals, ending in a rule that expands as
the empty string in the ‘gap’ position. The resulting syntactic tree would look something like the
following, shown in Figure 3, where irrelevant syntactic details have been omitted:
[CP [NP+wh Who] [S/NP did John [VP/NP [V kiss] [NP/NP ε] ] ] ]
Figure 3: Slashed nonterminals for the sentence Who did John kiss link a wh-NP filler to its ‘gap’
position.
The question of succinctness of this representation comes into play when there are multiple
‘displacements’ from lower clauses to higher ones, in examples like the following, where a wh-
NP filler, which violins, fills a gap position after on, while a second NP filler, these sonatas, is the
argument of to play:
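Which violins are these sonatas easy to play on?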
How are we to deal with this kind of example, with two active ‘fillers’ via the slash-nonterminal
representation? We can extend the ‘slash’ notation to explicitly carry these along via the
nonterminal names, e.g.:
(7) CP → wh-NP1 (which violins) S/NP1; S/NP1 → NP2 (these sonatas) S/NP1NP2; S/NP1NP2 → (pro) VP/NP1NP2; VP/NP1NP2 → V NP2/NP2 PP/NP1; NP2/NP2 → ε; PP/NP1 → NP1/NP1; NP1/NP1 → ε
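The arithmetic behind the resulting blow-up can be sketched directly. The toy Python fragment below is only a counting illustration (the category and filler names are invented, not GPSG’s actual feature system): if any subset of k distinct fillers may be ‘in transit’ through a node, each base category needs a slashed variant for every such subset, so the number of variants doubles with each additional filler:

from itertools import combinations

# Count the slashed variants of a single base category when any subset of the
# k fillers may be pending somewhere below it: one variant per subset, 2**k in all.
def slashed_variants(base, fillers):
    variants = []
    for r in range(len(fillers) + 1):
        for subset in combinations(fillers, r):
            variants.append(base if not subset else base + "/" + "".join(subset))
    return variants

for k in range(1, 6):
    fillers = [f"NP{i}" for i in range(1, k + 1)]
    print(k, len(slashed_variants("VP", fillers)))  # prints 2, 4, 8, 16, 32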
More recently, an exponential succinctness gap has also been demonstrated by Stabler
(2013), when comparing the relative grammar sizes of a formalization of ‘minimalist’ grammars
(MGs) to an extension of context free grammars known as multiple context-free grammars
(MCFGs). Stabler demonstrates that two otherwise strongly equivalent MGs and MCFGs can
differ exponentially in size: the MCFG contains exponentially more rules than the equivalent
MG – and for the same reason as with the GPSG/GB contrast in Section 2. In an MG, the role of
‘movement’ is taken over by an operation called ‘internal merge’ – that is, the operation of
merge between two syntactic (set-formulated) objects where one, the constituent that would be
the target of move alpha in GB theory and the “moved constituent,” is a subset of the other. The
output is a new syntactic object where the moved constituent is copied to its new ‘filler’ position
– there is no ‘gap’ left behind which would be occupied by a phonologically empty element, as in older GB theory, but simply the original constituent. (8a–b) illustrate the syntactic structure
before and after an internal merge operation (irrelevant details suppressed); note that the second
argument to the merge operator, [DP what], is contained in (is a subset of) the first argument.
(8a) Merge( [CP people [VP eat [DP what] ] ], [DP what ] ) →
(8b) [CP [DP what] [people [VP eat [DP what] ] ] ]
Aside from the initial position of the constituent what and its ‘landing site’, this movement
(actually copying) does not need to refer to any intermediate syntax in between (such as VP). If
there is more than one ‘mover’, then this simply adds a single new lexical entry, as described in
more detail in Stabler (2013). In contrast, in MCFGs, such effects are handled by the use of
variables associated with nonterminals that can hold strings such as what. For example, the
nonterminal VP(x1, x2) stands for an ordinary VP that has two variables that can hold the values
of two strings, x1 and x2. The first variable can correspond to the verb eat, while the second can
correspond to what. The following MCFG rule indicates that the value of the first variable – the
verb – can be associated with the V nonterminal, while the second variable holding what can be
associated with the DP:
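VP(x1, x2) → V(x1) DP(x2)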
This wh element variable can then be ‘passed up’ through each higher nonterminal until
what reaches its landing site in the CP, as per the following MCFG rule. Note in particular that in
the MCFG nonterminal CP, the order of the two variables is reversed and their values are concatenated together into one string, x2 x1, which places what before the verb:
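CP(x2 x1) → S(x1, x2)    (S here stands in for whatever nonterminal spans people eat)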
It should be apparent that the MCFG proceeds just as GPSG does when it encodes movement, in
that all intervening nonterminals are modified along the chain from the initial position of what
all the way to its landing site – as before, affecting intervening nonterminals by introducing
variables even though those nonterminals are not really implicated in the ‘movement.’ As a
result, if there are multiple possible ‘movers,’ each corresponding to a different kind of lexical
item or feature, then the MCFG must multiply out all these possibilities as with GPSG. This
amounts to an exponential number of choices since the possible movements can be any of the
subsets of the movers, and each might move or not, as Stabler notes. Stabler concludes: “MGs
can be exponentially smaller than their strongly equivalent MCFGs because MCFGs explicitly
code each movement possibility into the category system, while MGs can, in effect, quantify
over all categories with a given feature.” This exponential succinctness gap between MGs and
MCFGs again serves to signal that MGs can capture a generalization that the MCFGs cannot, in
this case, a lexical generalization that is quantified over all nonterminals with a particular feature.
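The contrast can be miniaturized in code as well. In the following toy sketch (the S rule and all of the names are simplifying assumptions, not Stabler’s formalization), each MCFG nonterminal is treated as a function from string tuples to string tuples, mirroring how the what component is carried upward and finally concatenated in reversed order at CP:

# Each 'nonterminal' returns a tuple of strings; the wh element rides along in
# the second component until CP reverses the order and concatenates the pieces.
def V():            return ("eat",)                        # V(x1)
def DP():           return ("what",)                       # DP(x2)
def VP(v, dp):      return (v[0], dp[0])                   # VP(x1, x2) from V(x1), DP(x2)
def S(subject, vp): return (subject + " " + vp[0], vp[1])  # S(y x1, x2): carry x2 upward
def CP(s):          return (s[1] + " " + s[0],)            # CP(x2 x1): wh lands up front

print(CP(S("people", VP(V(), DP()))))                      # ('what people eat',)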
Without working through all the details, this MDL approach will yield the same “exponential
gap” litmus test as described in Sections 2 and 3: exponential gaps in grammar size will show up
in the first |G| factor, while explicit listing of data – if one is not using an appropriate notational
framework – will show up in exponential gaps in the second factor. For one way of using this
MDL framework in a concrete linguistic application, the inference of morphological and
syntactic regularities starting from strings of phonemes and proceeding through to syntax, see de
Marcken (1996). More recently, MDL has been explicitly incorporated in several other models
of language acquisition, e.g., using categorial grammars (Villavicencio, 2002) or slightly
augmented context-free grammars (Hsu and Chater, 2010). A related approach to MDL uses the
7. One might rightly enquire as to whether the number of sentences in D is infinite or not (see Note 2). For our
purposes here this does not matter. For one way of systematically handling this question, using the notion of
“uniform computability” as in the case of a sequence of fixed Boolean circuits, see Berwick (1982).
notion of program size complexity: it attempts to find the length of the shortest program that can
compute or generate a particular language. Chater and Vitányi (2003) and Hsu et al. (2013) discuss this
approach in the context of language learnability; more on its relationship to linguistic theories
can also be found in Berwick (1982).8
Further, the MDL approach itself can be closely identified with Bayesian evaluation methods
as first pioneered by Horning (1969), who used Bayesian inference to select grammars within an
acquisition-by-enumeration approach. In a Bayesian framework, we fix some set of (observed)
data D and some description of D via a class of (stochastic) context-free grammars, 𝒢, that can
generate D, with some prior probability. A Bayesian inference approach would then attempt to
find the particular grammar G ∈ 𝒢 that maximizes the posterior probability of G given the data
D, i.e., argmaxG∈𝒢 p(G|D). Letting p(G) denote the prior probability assigned to the (stochastic) context-free grammar G, we can use Bayes’ rule in the usual way to find this maximum by computing p(G|D) = p(D|G) × p(G)/p(D). Since D is fixed, p(D) is a constant that can be ignored in finding the maximizing G, and we have the usual computation that
attempts to find the grammar G that maximizes the product of the prior probability of G times
the likelihood of the data given some particular grammar, p(D|G). In other words, Bayesian
inference attempts to find the G that satisfies the following formula:
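argmaxG∈𝒢 p(D|G) × p(G)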
To pass from this formulation to the MDL version one way to proceed is as follows. The MDL
principle as applied to stochastic context-free grammars says that the ‘best’ grammar G
minimizes the sum of the description length of the grammar and the description length of the data
given G. More precisely, if |G| is the length of the shortest encoding of grammar G and |D_G| is the
length of the shortest encoding of the data D given the grammar G, then MDL attempts to find:
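argminG∈𝒢 [|G| + |D_G|]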
Using near-optimal coding schemes, Shannon’s source coding theorem (1948) implies that the
description length of D with respect to a particular grammar G, that is |D_G|, can be made to closely approach the value –log2 p(D|G). We can further assume, as is standard, that one way to define the prior probability of a stochastic context-free grammar p(G) is as 2^(–|G|). Larger grammars are penalized in the same sense that we have used throughout this chapter and so have lower probability. Taking log2 of the Bayesian formulation above, we get:
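argmaxG∈𝒢 [log2 p(D|G) + log2 p(G)]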
Substituting –|D_G| for log2 p(D|G) and –|G| for log2 p(G), and pulling out the negative sign, we get:
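argmaxG∈𝒢 –(|D_G| + |G|), i.e., argminG∈𝒢 [|D_G| + |G|]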
8. The notion of program size complexity was developed independently by Solomonoff (1960, 1964) and
Kolmogorov (1965); for a recent comprehensive survey of this field, see Li and Vitányi (1997). In this framework
one can show that there is an ‘optimal’ universal programming system (a representation language or grammar) in the
sense that it is within a constant factor of any other optimal programming language. A detailed analysis of this quite
useful approach lies beyond the scope of this chapter; for a concrete example in the context of language acquisition,
see Hsu and Chater (2010).
In this particular case, then, the MDL formulation and the Bayesian formulation
coincide.9
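A toy numerical illustration (the two grammars and all of the numbers below are invented for the example) makes the coincidence concrete: selecting a grammar by maximizing the posterior under the prior p(G) = 2^(–|G|) and selecting it by minimizing |G| + |D_G| pick out the same grammar:

import math

grammars = {                      # grammar name: (|G| in bits, p(D | G))
    "G_list":    (80, 2 ** -10),  # large grammar that fits the data tightly
    "G_compact": (12, 2 ** -14),  # small grammar with a slightly looser fit
}

def log_posterior(size, likelihood):       # log2 p(D|G) + log2 p(G), dropping log2 p(D)
    return math.log2(likelihood) - size

def description_length(size, likelihood):  # |G| + |D_G|, with |D_G| = -log2 p(D|G)
    return size - math.log2(likelihood)

best_map = max(grammars, key=lambda g: log_posterior(*grammars[g]))
best_mdl = min(grammars, key=lambda g: description_length(*grammars[g]))
print(best_map, best_mdl)                  # both criteria select the same grammar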
It should be emphasized that this is not the only notion of “explanatory adequacy” that might
prove valuable in choosing among otherwise descriptively adequate linguistic theories. Other
approaches might stress the computational complexity of acquisition – the sample complexity or
the number of examples required to acquire language (Berwick, 1982, 1985); or the
computational complexity associated with the use of language, that is, parsing or production.
Summarizing, we have found that three different ways to formulate grammar evaluation in
light of explanatory adequacy all amount to the same kind of calculation, ultimately grounded on
the notion of the size of a grammar plus the size of the linguistic data as encoded by that
grammar. Further, this measure can be applied to actual alternative theoretical proposals in
linguistics, distinguishing between proposals that capture generalizations in the data and
those that do not. Finally, this analysis shows that currently popular approaches to learning in
cognitive science, such as Bayesian methods, turn out to be worked-out versions of explanatory
adequacy as discussed in Aspects. In this respect, contrary to what is sometimes thought, the
informal notion of size and simplicity as litmus tests for linguistic theories can be placed within a
coherent framework. We take all this as lending support to the view, first stated in Aspects, that
explanatory adequacy has an important role in evaluating linguistic theories, incorporating a
view that one must attend to how grammars are acquired or inferred from data.
References
Berwick, R.C. 1982. Locality Principles and the Acquisition of Syntactic Knowledge. Ph.D.
thesis, Department of Electrical Engineering and Computer Science, MIT.
Berwick, R.C. 1985. The Acquisition of Syntactic Knowledge. Cambridge, MA: MIT Press.
Chater, N. and Vitányi, P. 2003. Simplicity: a unifying principle in cognitive science? Trends in
Cognitive Sciences 7(1), 19–22.
Chomsky, N. 1951. Morphophonemics of Modern Hebrew. Master’s thesis, University of Pennsylvania. New York: Garland Publishing/Taylor and Francis.
Chomsky, N. 1965. Aspects of the Theory of Syntax. Cambridge, MA: MIT Press.
De Marcken, C. 1996. Unsupervised Language Acquisition. Ph.D. thesis, Department of Electrical Engineering and Computer Science, MIT, Cambridge, MA.
Fodor, J.D. 1978. Parsing strategies and constraints on transformations. Linguistic Inquiry 9(3),
427-473.
Gazdar, G. 1981. Unbounded dependencies and coordinate structure. Linguistic Inquiry 12, 155-
184.
Horning, J. 1969. A Study of Grammatical Inference. Ph.D. thesis, Department of Computer
Science, Stanford University, Stanford, CA.
9. This Bayesian estimation is usually called a ‘maximum a posteriori’ or MAP estimate. Note that a
Bayesian account would give us much more than this, not only just the maximum a posteriori estimate, but also the
entire posterior distribution – the whole point of Bayesian analysis, on some accounts.
Hsu, A., Chater, N., Vitányi, P. 2013. Language learning from positive evidence, reconsidered:
A simplicity-based approach. Topics in Cognitive Science 5(1), 35-55.
Hsu, A., Chater, N. 2010. The logical problem of language acquisition: a probabilistic
perspective. Cognitive Science 34, 972-1016.
Kolmogorov, A.N. 1965. Three approaches to the quantitative definition of information. Problems of Information Transmission 1(1), 1–7.
Li, M. and Vitányi, P. 1997. An Introduction to Kolmogorov Complexity and Its Applications. New York: Springer-Verlag.
Meyer, A. and Fischer, M. 1971. Economy of description by automata, grammars, and formal
systems. IEEE 12th Annual Symposium on Switching and Automata Theory, 125-129.
Rissanen, J. 1978. Modeling by shortest data description. Automatica 14(5), 465–471.
Shannon, C. 1948. A mathematical theory of communication. The Bell System Technical Journal
27, 379–423, 623–656.
Solomonoff, R. 1960. A preliminary report on a general theory of inductive inference. Report V-
131. Cambridge, MA: Zator Company.
Solomonoff, R. 1964. A formal theory of inductive inference part I. Information and Control
7(1), 1–22.
Stabler, E. 2013. Two models of minimalist, incremental syntactic analysis. Topics in Cognitive
Science 5(3), 611-633.
Villavicencio, A. 2002. The Acquisition of a Unification-Based Generalised Categorial Grammar. Ph.D. thesis, University of Cambridge; Computer Laboratory Technical Report UCAM-CL-TR-533.