
An introduction to multiple context free grammars

for linguists
Alexander Clark
September 20, 2014
This is a gentle introduction to Multiple Context Free Grammars (MCFGs), intended for linguists who are familiar with context free grammars and movement-based analyses of displaced constituents, but unfamiliar with Minimalist Grammars or other mildly context-sensitive formalisms.[1]

[1] I originally wrote this for LOT 2012. This is a lightly edited version with a few corrections.
Introduction

So to start, a very small bit of historical background, not by any means a proper survey. We are all familiar with the Chomsky hierarchy.[2] This defined a collection of different formalisms that have various types of power. This was an incredible intellectual accomplishment, but Chomsky (unsurprisingly) didn't get it absolutely right the first time. Before Chomsky's pioneering work, there were three very natural classes of languages: the class of finite languages, the class of regular languages, and the class of computable languages. (We blur here some technical distinctions between recursive, primitive recursive and recursively enumerable languages that are not relevant to the discussion.) Chomsky identified the class of context free grammars (CFGs), and he recognised (correctly, but on the basis of incorrect arguments) that this was insufficient for describing natural languages.

In the context of linguistics, it is clear that Chomsky's initial attempt to formalise a richer model than context free grammars ended up with a class that is much too powerful to be useful: the context sensitive grammars are equivalent to Turing machines with a certain bound, and have a completely intractable parsing problem. More formally, determining whether a string is generated by a context sensitive grammar is decidable, in that it can be determined by a computer in a well defined way, but it would require far too many computational steps in general. If we are interested in cognitively real representations, then given that the human brain has a limited computational capacity, we can safely assume that the representations that are used can be efficiently processed. When it finally became clear that natural languages were not weakly context free,[3] interest thus shifted to classes that are slightly more powerful than context free grammars while still having a tractable parsing problem: these are the mildly context-sensitive formalisms.[4] There has been a great deal of work on this, which we won't try to summarise completely here; we describe what we think of as the simplest and most natural extension of CFGs, the class of Multiple Context Free Grammars introduced by Seki et al.[5] These turned out to be equivalent to Minimalist Grammars[6] under certain assumptions. In this note we will try to give a non-technical explanation of MCFGs in a way that is accessible to linguists, and in particular to explain how these can give a natural treatment of the types of syntactic phenomena that in mainstream generative grammar have been treated using movement.

[2] N. Chomsky. Three models for the description of language. IEEE Transactions on Information Theory, 2(3):113-124, 1956.
[3] S. Shieber. Evidence against the context-freeness of natural language. Linguistics and Philosophy, 8:333-343, 1985.
[4] K. Vijay-Shanker and David J. Weir. The equivalence of four extensions of context-free grammars. Mathematical Systems Theory, 27(6):511-546, 1994.
[5] H. Seki, T. Matsumura, M. Fujii, and T. Kasami. On multiple context-free grammars. Theoretical Computer Science, 88(2):191-229, 1991.
[6] E. Stabler. Derivational minimalism. In C. Retoré, editor, Logical aspects of computational linguistics (LACL 1996), pages 68-95. Springer, 1997.
One important property of MCFGs, which they share with CFGs, is that they define a natural notion of hierarchical structure through their derivation trees. Chomsky and many others often emphasize the importance of considering structural descriptions rather than just surface strings: of strong generative capacity rather than just the weak generative capacity considered in formal language theory. MCFGs, just as with CFGs, have derivational structures that naturally give a hierarchical decomposition of the surface string into constituents; in the case of MCFGs, as we shall see, these constituents can be discontinuous or contain displaced elements.

However, MCFGs also inherit some of the weaknesses of CFGs; in particular, they lack an appropriate notion of a feature. Just as with CFGs, modeling something as simple as subject-verb agreement in English requires an expansion of the nonterminal symbols: replacing symbols like NP with sets of more refined symbols like NP-3pers-sing and NP-1pers-plur that contain person and number features. In MCFGs, the nonterminals are again atomic symbols, and thus a linguistically adequate system must augment the symbols using some appropriate feature calculus. We don't consider this interesting and important problem here, but focus on the more primitive issue of the underlying derivation operations, rather than on how these operations are controlled by sets of features.

Context free grammars

We will start by looking at context free grammars, reconceiving them as defining a bottom-up rather than a top-down derivation, and we will switch to a new notation. The switch from a top-down derivation to a bottom-up one is in line with current thinking in the Minimalist Program, which hypothesizes a primitive operation called merge that combines two objects to form another.

Consider a simple CFG that has one nonterminal S, and productions S → ab and S → aSb. This generates the language L which consists of ab, aabb, aaabbb, ..., indeed all strings consisting of a nonzero sequence of as followed by an equal number of bs. (In maths we would say L = { aⁿbⁿ | n > 0 }.) Here S is a nonterminal, and a and b are terminal symbols. If we consider the string aaabbb, we would say there is a valid derivation which goes something like this:

1. We start with the symbol S.

2. We rewrite this using the production S → aSb to get aSb.

3. We rewrite the symbol S in the middle of aSb using the production S → aSb to get aaSbb.

4. We rewrite this using the production S → ab to get aaabbb. We don't have any nonterminal symbols left, so we stop there.
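The four rewriting steps can be replayed mechanically. The following sketch (our own illustration, not code from the text) applies the productions S → aSb and S → ab by string replacement:

```python
# A minimal sketch of the top-down derivation of aaabbb with the
# productions S -> aSb and S -> ab, replaying the steps in the text.

def derive(steps):
    """Apply a sequence of right-hand sides to the start symbol S."""
    s = "S"
    for rhs in steps:
        # Rewrite the (single) occurrence of the nonterminal S.
        s = s.replace("S", rhs, 1)
    return s

# S => aSb => aaSbb => aaabbb
assert derive(["aSb", "aSb", "ab"]) == "aaabbb"
```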

It is much better to think of this as a tree.


S

a S b

a S b

a b
We can view the CF production S → aSb as an instruction to build the tree starting from the top and going down: that is to say, if we have a partial tree like this:
S

a S b
We can view the productions as saying that we can replace the
boxed S node in the tree with a subtree either of the same type or of
the type:
S

a b
An alternative way is to view it as a bottom-up construction. We can think of the production S → ab as saying, not that we can rewrite S as ab, but rather that we can take an a and a b and stick them together to form an S. (The switch to bottom-up derivations is also a part of the Minimalist Program.) That is to say, the production S → aSb says that we can take the three chunks

a and w and b

and combine them to form a new unit like this:

S

a S b

w

This suggests a change in notation: instead of writing S → aSb, we will write S(awb) ← S(w). Let us do something a little less trivial.

We have the standard notation N ⇒* w, which means that N can be rewritten through zero or more derivation steps into the string w. Informally this means that w is an N: S ⇒* aabb means that aabb is an S. We can write this more directly just as a predicate:

S(aabb).   (1)

This says directly that aabb satisfies the predicate S. In an English example, we might say NP(the cat), or VP(died).
A production rule of the form N → PQ then means, viewed bottom up, that if we have some string u which is a P, and another string v which is a Q, we can concatenate them and the result will be an N. The result of the concatenation is the string uv, so we can write this rule as an implication:

If u is a P and v is a Q, then uv is an N.

Writing this using the predicate notation, this just says that P(u) and Q(v) implies N(uv). We will therefore write it as the following production:

N(uv) ← P(u) Q(v)   (2)

Here N, P, Q are nonterminals and u and v are variables. We will use letters from the end of the alphabet for variables in what follows. We don't have any terminal symbols in this rule. (This notation is due to Makoto Kanazawa and is based on a logic programming notation; originally, in Prolog, this would be written with :- instead of ←. These are sometimes called Horn clauses.)

Productions like S → ab turn into rules like S(ab) ← where there is nothing on the right hand side, because there are no nonterminals on the right hand side of the rule. So we will just write these as S(ab), and suppress the arrow. We can view this as a straightforward assertion that the string ab is in the set of strings derived from S.
Let us go back to the trivial example before (S → aSb). If w is an S, then awb is also an S; we write this as

S(awb) ← S(w)   (3)

Note that the terminal symbols a and b end up here on the left hand side of the rule. This is because we know that an a is an a!

We could define predicates like A, which is true only of a, and then have the rule S(uwv) ← A(u), S(w), B(v), together with the rules with an empty right hand side, A(a) and B(b). Here A and B would be like preterminals in a CFG, or lexical categories in some sense:

A S B

a b
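Read bottom-up, rules like S(ab) and S(awb) ← S(w) can be run as a forward-chaining inference that accumulates everything derivable as an S. A minimal sketch of our own:

```python
# A sketch of bottom-up (forward-chaining) derivation with the rules
# S(ab) and S(awb) <- S(w), collecting all strings derivable as S
# up to a length bound.

def derive_S(max_len):
    derived = {"ab"}               # the axiom S(ab)
    changed = True
    while changed:
        changed = False
        for w in list(derived):
            new = "a" + w + "b"    # the rule S(awb) <- S(w)
            if len(new) <= max_len and new not in derived:
                derived.add(new)
                changed = True
    return derived

assert derive_S(6) == {"ab", "aabb", "aaabbb"}
```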
Let us consider a point which is kind of trivial here but becomes nontrivial later. Compare the two following rules:

N(uv) ← P(u) Q(v)   (4)

N(uv) ← Q(v) P(u)   (5)

If you think about it for a moment, they say exactly the same thing. Two context free rules N → PQ and N → QP are different, because we express the two different concatenations, uv versus vu, by the order of the nonterminals on the right hand side of the rule. In this new format, we express the concatenation explicitly using variables. This means we have a little bit of spurious ambiguity in the rules. We can stipulate that we only want the first one, for example by requiring that the order of the variables on the left hand side must match the order of the variables on the right hand side.

Think a little more here: if we have two strings u and v, there are basically two ways to stick them together, uv and vu. We can express this easily using the standard CFG rule, just by using the two orders PQ and QP. But are there only two ways to stick them together? What about uuv? Or uuuu? Or uvuvvuuu? Actually there are, once we broaden our horizons, a whole load of different ways of combining u and v beyond the two trivial ones uv and vu. But somehow the first two ways are more natural than the others, because they satisfy two conditions: first, we don't copy either of them, and secondly we don't discard either of them. That is to say, the total length of uv and vu is going to be just the sum of the lengths of u and v, no matter how long u and v are. Each variable on the right hand side of the rule occurs exactly once on the left hand side of the rule, and there aren't any other variables. uuuv is bad because we copy u several times, and uu is bad because we discard v (and copy u). We will say that concatenation rules that satisfy these two conditions are linear.

So the rule N(uv) ← P(u) Q(v) is linear, but N(uvu) ← P(u) Q(v) is not.
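The linearity condition is purely mechanical, so it can be checked by a small function. In this sketch (ours, with a rule represented by its variable sequences):

```python
# Sketch: checking linearity of a concatenation rule. lhs_vars is the
# variable sequence in the left-hand-side term, e.g. ['u','v'] for N(uv);
# rhs_vars lists the variables introduced on the right, e.g. ['u','v']
# for P(u) Q(v). Linear: every right-hand-side variable occurs exactly
# once on the left, and no other variables appear.
from collections import Counter

def is_linear(lhs_vars, rhs_vars):
    counts = Counter(lhs_vars)
    return set(counts) == set(rhs_vars) and all(c == 1 for c in counts.values())

assert is_linear(list("uv"), list("uv"))       # N(uv)  <- P(u) Q(v): linear
assert not is_linear(list("uvu"), list("uv"))  # N(uvu) <- P(u) Q(v): copies u
assert not is_linear(list("uu"), list("uv"))   # N(uu)  <- P(u) Q(v): discards v
```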
Summarising, the grammar for aⁿbⁿ above just has two productions, which we write as S(ab) and S(awb) ← S(w).

When we have a derivation tree for a string like aabb, we can write it with each node of the tree labeled by a nonterminal, as we did earlier (this is the standard way); or we can write it with each node labeled by the string that its subtree derives; or we can label nodes with both.
S aabb S(aabb)

a S b a ab b a S(ab) b

a b a b a b
The last of these three trees has the most information, but if we
have long strings, then the last two types of tree will get long and
unwieldy.
Finally, when we look at the last two trees, we can see that we have a string attached to each node of the tree. At the leaves of the tree we just have individual letters, but at the nonleaf nodes we have sequences that may be longer. Each of these sequences corresponds to an identifiable subsequence of the surface string. That is to say, when we consider the node labeled S(ab) in the final tree, the ab is exactly the ab in the middle of the string.
S(aabb)

a S(ab) b

a b

a a b b

This diagram illustrates this obvious point: we have drawn the surface string aabb below the tree. Each leaf node in the tree is linked by a dotted line to the corresponding symbol in the surface string. We have boxed in red the nonleaf node labeled S(ab), and boxed in blue the corresponding subsequence of the surface string, ab. In a CFG we can always do this, and the subsequence will always be a contiguous subsequence, i.e. one without gaps. This is very easy to understand if you are familiar with CFGs: the point here is to introduce the diagram/notation in a familiar context, so that when we reuse it for MCFGs it is easy to understand there too. Thus the derivation tree of the CFG defines a hierarchical decomposition of the surface string into contiguous subsequences that do not overlap, but may include one another. The root of the tree will always correspond to the whole string, and the leaves will always correspond to individual symbols in the string.
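This decomposition can be made concrete: walking a CFG derivation tree and recording each node's yield produces exactly these contiguous, nested spans. A small sketch of our own, with trees as nested tuples:

```python
# Sketch: each node of a CFG derivation tree yields a contiguous
# substring of the surface string. Trees are (label, children) for
# internal nodes, or a terminal string at the leaves.

def spans(tree, start=0, out=None):
    """Return (end, list of (label, start, end)) for each internal node."""
    if out is None:
        out = []
    if isinstance(tree, str):            # a terminal symbol: one position
        return start + len(tree), out
    label, children = tree
    pos = start
    for child in children:
        pos, _ = spans(child, pos, out)
    out.append((label, start, pos))
    return pos, out

# Derivation tree of aabb with S -> aSb and S -> ab:
tree = ("S", ["a", ("S", ["a", "b"]), "b"])
end, node_spans = spans(tree)
assert end == 4
assert ("S", 1, 3) in node_spans         # the inner S covers the middle ab
assert ("S", 0, 4) in node_spans         # the root covers the whole string
```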

Multiple Context Free Grammars

There are many linguistic phenomena where it is impossible or difficult to find an adequate analysis using just the machinery of context free grammars and simple trees of this type; as a result, almost immediately after CFGs were invented by Chomsky, he augmented them with some additional machinery to handle such phenomena. The particular augmentations he considered, transformations, have gone through various iterations over the years. Here we will take a different direction that is more computationally tractable, and discuss its relationship to the movement based approach. We will discuss some examples, like Swiss German cross-serial dependencies, in a bit of detail below, but even a very simple example like a relative clause in English causes some problems.

(6) This is the book that I read.

(7) This is the book that I told you to read.

The problem is that it is difficult to find a construction that supports a compositional semantics for this sentence: book must be simultaneously the head of the NP (or something similar in a DP analysis), and, in some underlying sense, the object of the verb read. So in a sense it needs to be in two places at once.

Traditionally we have some idea of movement, where the noun book starts off in some underlying position as an object of read and then moves to its final position, as shown in this figure.

Computationally, it turned out that this movement operation, in its unconstrained form, violated some fundamental properties of efficient computation.[7] There is an alternative way of modeling this data that turns out to be exactly equivalent to a restricted form of the movement analysis, which is also computationally efficient, and thus cognitively plausible in the way that the unconstrained movement approaches are not.

[7] P. S. Peters and R. W. Ritchie. On the generative power of transformational grammars. Information Sciences, 6:49-83, 1973.
We will now define the extension from CFGs to MCFGs, and then we will discuss the linguistic motivation in some depth. The key element is to extend the predicates that we have in our CFG, so that instead of applying merely to strings, they apply to pairs of strings. (In full generality, MCFGs form a hierarchy where nonterminals can derive not just single strings or pairs of strings, but tuples of arbitrary arity. Here, for presentational purposes, we will restrict ourselves to the simpler case of pairs of strings, which may in any event be sufficient for natural language description.) These pairs of strings have a number of different linguistic interpretations. In a CFG, which generates individual strings, we take these strings to be constituents. In an MCFG, one interpretation of the pair is as a discontinuous constituent: that is to say, we have something that we want to be a constituent, but which has some material inserted in the middle that we want not to be part of that constituent. We can consider this as a constituent with a gap in it, and the pair

Figure 1: Tree diagram with movement. The labels on the nodes and the specific analysis are not meant to be taken seriously; there are a number of competing analyses for this construction.

NP

D N

the N RP

book R S

that NP VP

I V t

read e

of strings will correspond to the two parts of this: the first element of the pair will correspond to the segment of the constituent to the left of the gap (before the gap), and the second element of the pair will correspond to the segment after the gap. This is not the only interpretation: we might also have some construction that would be modeled as movement; in that case we might have one part of the pair representing the moving constituent, and the other part representing the constituent that it is moving out of.

So in our toy examples above we had a nonterminal N, and we would write N(w) to mean that w is an N. We will now consider a two-place predicate on strings, one that applies not just to one string but to two strings. So we will have some predicates, say M, which are two-place predicates, and we will write M(u, v) to mean that the ordered pair of strings ⟨u, v⟩ satisfies the predicate M. Rather than having a nonterminal generating a single string, S ⇒* aabb, we will have a nonterminal that might derive a pair of strings, M ⇒* ⟨aa, bb⟩. Since we want to define a formal language, which is a set of strings, we will always want to have some normal predicates which just generate strings, and these will include S. Terminologically, we will say that all

of these nonterminals have a dimension, which is one if the nonterminal generates a single string and two if it generates a pair of strings. (Unsurprisingly, we can also define MCFGs which have symbols of dimension greater than 2.)

To avoid confusion at this early stage, we will nonstandardly put a subscript on all of our nonterminal symbols/predicates to mark which ones generate strings and which ones generate pairs of strings: that is to say, we will subscript each symbol with the dimension of that symbol. Thus S, the start symbol, will always have dimension 1, and thus will be written as S1. If we have a symbol N that derives pairs of strings, we will write it as N2. So we might write S1(u) and N2(u, v). In a CFG, of course, all of the symbols have dimension 1. (A CFG indeed is just the special case of an MCFG of dimension 1.)

Let us consider a grammar that has a nonterminal symbol of dimension two, say N2, in addition to a nonterminal S1 of dimension 1, and let's consider what the rules that use this symbol might look like. Here are some examples:

S1(uv) ← N2(u, v)   (8)

This rule simply says that if N2 derives the pair of strings ⟨u, v⟩, then S1 derives the string uv. Note the difference: ⟨u, v⟩ is an ordered pair of two strings, whereas uv is the single string formed by concatenating them. Thus N2 has a bit more structure: it knows not just the total string uv, but also where the gap in the middle is.

N2(au, bv) ← N2(u, v)   (9)

This one is more interesting. Informally it says that if N2 derives ⟨u, v⟩ then it also derives the pair ⟨au, bv⟩. This rule is clearly doing something more powerful than a CFG rule: it is working on two different chunks of the string at the same time. In the same derivational step, we are adding an a in one place and a b in another place.

N2(a, b).   (10)

Finally we have a trivial rule: this just asserts that ⟨a, b⟩ is generated by N2. This is just the analog of the way that S → ab becomes S(ab), but in this case we have an ordered pair of strings rather than a single string.
An MCFG consists, just like a CFG, of a collection of nonterminals and some productions. Let's look at the grammar that has just these three rules. What language does it define? Just as with a CFG, we consider the language defined by an MCFG to be the set of strings generated by the symbol S1: since this has dimension 1 by definition, we know that it generates a set of strings and not a set of pairs of strings. N2, on the other hand, generates a set of pairs of strings. Let us think about what this set is. To start off with, we know that it contains ⟨a, b⟩, by the third production. If it contains ⟨a, b⟩, then it also contains ⟨aa, bb⟩, by the second rule. More slowly: the second rule says that if N2 derives ⟨u, v⟩ then it derives ⟨au, bv⟩. Since N2 derives ⟨a, b⟩, setting u = a and v = b, this means that au = aa and bv = bb, so N2 derives ⟨aa, bb⟩. Repeating this process, we can see that N2 derives ⟨aaa, bbb⟩ and ⟨aaaa, bbbb⟩, and indeed all pairs of the form ⟨aⁿ, bⁿ⟩ where n is a finite positive number. So what does S1 derive? Well, it just concatenates the two elements generated by N2, so it generates the language ab, aabb, aaabbb, ..., which is just the same as the CFG we defined before.
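The reasoning in this paragraph can be replayed step by step. This sketch (our own) iterates rule (9) from the axiom N2(a, b), and applies rule (8) to read off the strings S1 derives:

```python
# Sketch: enumerate pairs derived by N2 with the rules
#   N2(a, b).              (axiom)
#   N2(au, bv) <- N2(u, v)
# and the strings derived by S1 via S1(uv) <- N2(u, v).

def n2_pairs(n):
    pairs = [("a", "b")]                      # N2(a, b)
    for _ in range(n - 1):
        u, v = pairs[-1]
        pairs.append(("a" + u, "b" + v))      # N2(au, bv) <- N2(u, v)
    return pairs

def s1_strings(n):
    return [u + v for (u, v) in n2_pairs(n)]  # S1(uv) <- N2(u, v)

assert n2_pairs(3) == [("a", "b"), ("aa", "bb"), ("aaa", "bbb")]
assert s1_strings(3) == ["ab", "aabb", "aaabbb"]
```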
This might seem a letdown, but let's look a little more closely at the trees that are generated by this grammar. In particular, let's look at the derivation tree for aaabbb.

When we try to write down a tree, we immediately have a problem: what do the local trees look like? In a CFG it is easy, because the order of the symbols on the right hand side of the productions tells you what order to write the symbols in. For example, we have the rule S → aSb, and so we write a tree like

S

a S b

where the children of the tree occur in exactly the order of the symbols on the right hand side, aSb. But as we saw earlier, we don't have such a rigid notion of order, as the order is moved onto the specific function that we put on the left hand side of the rule. So we could write it in a number of ways:
S1 S1 S1

N2 N2 N2

a b N2 a N2 b N2 a b

a b N2 a N2 b N2 a b

a b a b a b
This difference doesn't matter. What matters crucially is this: while we have defined exactly the same language as the CFG earlier, the structures we have defined are completely different. In each local tree we have an a and a b, just as with the CFG, but think about which pairs of as and bs are derived simultaneously. To see this easily, let's just label the tokens as follows:

S1

N2

a1 b1 N2

a2 b2 N2

a3 b3
Working bottom up in this tree, we see that the bottommost N2 derives the pair ⟨a3, b3⟩, the next one up derives ⟨a2 a3, b2 b3⟩, and the top one derives ⟨a1 a2 a3, b1 b2 b3⟩; thus S derives a1 a2 a3 b1 b2 b3. Thus we have cross-serial dependencies in this case, as opposed to hierarchical dependencies.

We can write this into the tree, as what is sometimes called a value tree:

S(aaabbb)            a1 a2 a3 b1 b2 b3

N2(aaa, bbb)         ⟨a1 a2 a3, b1 b2 b3⟩

a b N2(aa, bb)       a1 b1 ⟨a2 a3, b2 b3⟩

a b N2(a, b)         a2 b2 ⟨a3, b3⟩

a b                  a3 b3
Let us now draw the links from the nodes in the derivation trees to
the symbols in the surface string as before.
S( aaabbb)

N2 ( aaa, bbb)

a b N2 ( aa, bb)

a b N2 ( a, b)

a b

a a a b b b

Note that the red boxed symbol derives a discontinuous constituent, which we have boxed in blue. As before, since we have restricted the rules in some way, we can identify for each node in the derivation tree a subsequence (which might be discontinuous) of symbols in the surface string.
Clearly, this gives us some additional descriptive power. Let's look at a larger class of productions now. Suppose we have three nonterminals of dimension 2: N2, P2 and Q2, and consider productions that form an N2 from a P2 and a Q2.
A simple example looks like this:

N2(ux, vy) ← P2(u, v), Q2(x, y)   (11)

Here we have four variables: u, v, x, y. u and v correspond to the two parts of the pair of strings derived from the nonterminal symbol of arity 2, P2; x and y correspond to the two parts derived by the symbol Q2. We can see now why we need this switch in notation: if we have two single strings, then there are only two ways of concatenating them, and so we can comfortably represent the choice with the order of the symbols on the right hand side of the production. If we have two pairs of strings, say ⟨u, v⟩ and ⟨p, q⟩, there are many more ways of concatenating them together to form a single pair: ⟨up, vq⟩, ⟨vq, up⟩, ⟨uq, vp⟩, ⟨upq, v⟩ and so on. We also have a load of other ways, like ⟨uuuu, pvqu⟩ and ⟨uuuu, vvv⟩, that we might want to rule out for the same reasons that we ruled out such rules in CFGs above. So we will just consider rules which are linear in the sense we defined above, namely that each variable on the right hand side occurs exactly once on the left hand side. We also want to consider another condition, which rules out productions like:

N2(vu, xy) ← P2(u, v), Q2(x, y)   (12)

Here we have vu on the left hand side, and P2(u, v) on the right hand side. This has the property that the order of the variables u, v in the term vu is different from their order in P2(u, v). This is called a permuting rule, and some of the time we want to rule it out. If a rule is not permuting, we say it is non-permuting. So for example:

N2(xuv, y) ← P2(u, v), Q2(x, y)   (13)

This rule is non-permuting: u, v occur in the right order, and so too do x, y. We do have x occurring before u, but that is OK, since the condition only applies to variables that are introduced by the same nonterminal: u has to occur before v, and x has to occur before y. The effect of this constraint, together with the linearity constraint, is that in a derivation, if we have a nonterminal deriving a pair of strings ⟨u, v⟩, we can work out exactly which bits of the final string correspond to u and to v, and moreover we will know that the string u occurs before the string v.
That is to say, if we have an MCFG derivation tree that looks like this:

S

X N Y

⟨u, v⟩

then we know that the final string will look like lumvr: that is to say, there will be a substring u and a substring v occurring in that order, together with some other material before, between, and after them. This is the crucial point from a linguistic point of view: ⟨u, v⟩ forms a discontinuous constituent. We can simultaneously construct the u and the v even if they are far apart from each other. From a semantic viewpoint, this gives us a domain of locality which is not local with respect to the surface order of the string; this means that for semantic interpretation, a word can be local at two different points in the string at the same time.
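The non-permuting condition, like linearity, can be checked from the variable orders alone. A sketch of our own, with a rule represented by its left-hand-side variable sequence and one variable sequence per right-hand-side nonterminal:

```python
# Sketch: checking the non-permuting condition on a linear rule.
# lhs: the variable sequence on the left-hand side, e.g. list("xuvy")
# for N2(xuv, y); rhs_groups: one variable sequence per right-hand-side
# nonterminal, e.g. [list("uv"), list("xy")] for P2(u,v), Q2(x,y).

def is_non_permuting(lhs, rhs_groups):
    """Variables introduced by the same nonterminal must appear on the
    left-hand side in the same relative order."""
    for group in rhs_groups:
        positions = [lhs.index(v) for v in group]
        if positions != sorted(positions):
            return False
    return True

# Rule (11): N2(ux, vy) <- P2(u,v) Q2(x,y): non-permuting
assert is_non_permuting(list("uxvy"), [list("uv"), list("xy")])
# Rule (12): N2(vu, xy): permuting, since v precedes u
assert not is_non_permuting(list("vuxy"), [list("uv"), list("xy")])
# Rule (13): N2(xuv, y): non-permuting, even though x precedes u
assert is_non_permuting(list("xuvy"), [list("uv"), list("xy")])
```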
Crucially for the computational complexity of this, the formalism maintains the context free property at a more abstract level. That is to say, the validity of a step in the derivation does not depend on the context that the symbol occurs in, only on the symbol that is being processed. Thinking of it in terms of trees, this means that if we have a valid derivation with a subtree headed by some symbol, whether it is of dimension one or two, we can freely swap that subtree for any other subtree that has the same symbol as its root, and the result will also be a valid derivation.
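This substitution property can be illustrated with the N2 rules from the aⁿbⁿ grammar above; the sketch below (ours) represents derivation trees as nested tuples and swaps a subtree without checking any context:

```python
# Sketch: the "context free" property of MCFG derivations. Derivation
# trees for the rules N2(a, b) and N2(au, bv) <- N2(u, v) are nested
# tuples (label, child); any N2-rooted subtree can be swapped for any
# other N2-rooted subtree, and the result is still a valid derivation.

def value(tree):
    """Compute the pair of strings that an N2 derivation tree derives."""
    label, child = tree
    if child is None:                    # the axiom N2(a, b)
        return ("a", "b")
    u, v = value(child)                  # N2(au, bv) <- N2(u, v)
    return ("a" + u, "b" + v)

deep = ("N2", ("N2", ("N2", None)))      # derives the pair (aaa, bbb)
assert value(deep) == ("aaa", "bbb")

# Swap the subtree below the root for a smaller N2 subtree: no context
# needs to be checked, and the result is another valid derivation.
swapped = ("N2", ("N2", None))
assert value(swapped) == ("aa", "bb")
```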

Noncontext-free languages
The language we defined before was a context-free language and could be defined by a context free grammar, so in spite of the fact that the structures we got were not those that could be defined by a context free grammar, it wasn't really exploiting the power of the formalism. Let's look at a classic example, indeed the classic example, of a linguistic construction that requires more than context free power: the famous case of cross-serial dependencies in Swiss German.[8]

[8] S. Shieber. Evidence against the context-freeness of natural language. Linguistics and Philosophy, 8:333-343, 1985.

The figure shows the dependencies in the standard example. The crucial point is that we have a sequence of noun phrases that are marked for either accusative or dative case, and a sequence of verbs that require arguments marked for either accusative or dative case, and, finally, the dependencies overlap in the way shown in the diagram. To get this right requires the sort of MCFG apparatus that we have developed.

Figure 2: Swiss German cross-serial dependencies, taken from Shieber (1985). These clauses would be preceded by something like "Jan säit" (Jan says). The figure gives the examples:

(6) ... das mer em Hans es huus hälfed aastriiche
... that we Hans-DAT house-ACC helped paint
... that we helped Hans paint the house

(7) ... das mer d'chind em Hans es huus lönd hälfe aastriiche
... that we the children-ACC Hans-DAT house-ACC let help paint
... that we let the children help Hans paint the house

(8) ... das mer d'chind em Hans es huus haend wele laa hälfe aastriiche
... that we the children-ACC Hans-DAT house-ACC have wanted let help paint
... that we have wanted to let the children help Hans paint the house

In these constructions the number of accusative NPs (d'chind) must equal the number of verbs selecting for an accusative (here laa), and the number of dative NPs (em Hans) must equal the number of verbs selecting for a dative object (here hälfe). Furthermore, the order must be the same, in the sense that if all accusative NPs precede all dative NPs, then all verbs selecting an accusative must precede all verbs selecting a dative. Shieber's argument that this language is not context-free runs as follows: assume it is context-free; the image of a context-free language under a homomorphism, intersected with a regular language, must be context-free as well; but a suitable homomorphism and regular language yield a language that is not context-free. Contradiction.

Let's abstract this a little bit and consider a formal language for this non context free fragment of Swiss German. We consider that we have the following words or word types: Na and Nd, which are respectively accusative and dative noun phrases; Va and Vd, which are verbs that require accusative and dative noun phrases respectively; and finally C, which is a complementizer that appears at the beginning of the clause. Thus the language we are looking at consists of sequences like

(14) C Na Va

(15) C Nd Vd

(16) C Na Na Nd Va Va Vd

but crucially does not contain examples where the sequence of accusative/dative markings on the noun sequence is different from the sequence of requirements on the verbs. So it does not contain C Nd Va, because the verb requires an accusative and it only has a dative, nor

does it include CNa Nd Vd Va , because though there are the right num-
ber of accusative and dative arguments (one each) they are in the
wrong order the reverse order. Actually this is incorrect: according to
We can write down a simple grammar with just two nonterminals: Shieber the nested orders are acceptable
as well. We neglect this for ease of
we will have one nonterminal which corresponds to the whole clause exposition.
S1 , amd one nonterminal that sticks together the nouns and the verbs
which we will call T2 . No linguistic interpretation is intended for
these two labels they are just arbitrary symbols. As the subscripts
indicate, the S1 is of dimension 1 and just derives whole strings, and
the T2 has dimension 2, and derives pairs of strings. The first of the
pair will be a sequence of Ns and the second of the pair will be a
matching sequence of Vs.

We will have the following rules:

S1(Cuv) ← T2(u, v)          (17)
T2(Na, Va) ←                (18)
T2(Nd, Vd) ←                (19)
T2(Na u, Va v) ← T2(u, v)   (20)
T2(Nd u, Vd v) ← T2(u, v)   (21)

The first rule directly introduces the
terminal symbol C and concatenates the two separate parts of the T2.
The next two rules just introduce the simplest matching of noun and
verb, and the final two give the recursive rules that allow a poten-
tially unbounded sequence of nouns and verbs to occur. Note that
in the final two rules we add the nouns and verbs to the left of the
strings u and v. This is important because, as you can see by looking
at the original Swiss German example, the topmost noun and verb
occur on the left of the string.
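As a sanity check, the rules above can be transcribed directly into a tiny recognizer. This is an illustrative sketch, not part of the text: the function names are invented, dimension-2 derivability is tested by derives_T2, and rule (17) simply tries every way of splitting the material after C into a noun part and a verb part.

```python
def derives_T2(nouns, verbs):
    """Can T2 derive the pair (nouns, verbs)?  Mirrors rules (18)-(21)."""
    if (nouns, verbs) in ((("Na",), ("Va",)), (("Nd",), ("Vd",))):
        return True                      # rules (18) and (19): base cases
    if len(nouns) > 1 and len(verbs) > 1:
        for case in ("a", "d"):          # rules (20) and (21): strip a matching pair
            if nouns[0] == "N" + case and verbs[0] == "V" + case:
                return derives_T2(nouns[1:], verbs[1:])
    return False

def derives_S1(tokens):
    """Rule (17): S1(C u v) <- T2(u, v).  Try every split of the rest."""
    if not tokens or tokens[0] != "C":
        return False
    rest = tuple(tokens[1:])
    return any(derives_T2(rest[:i], rest[i:]) for i in range(1, len(rest)))
```

For example, `derives_S1(["C", "Na", "Nd", "Va", "Vd"])` holds, while the mismatched `["C", "Nd", "Va"]` is rejected, as the text requires.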
Let us look at how this grammar will generate the right strings
and structures for these examples. We give the derivation trees for
the examples above, together with the matching value trees. We will
start with the two trivial ones, which aren't very revealing. First we
have the example C Na Va, with the derivation tree on the left and
the value tree on the right:

S1(C Na Va)          C Na Va
├─ C                 ├─ C
└─ T2(Na, Va)        └─ (Na, Va)
   ├─ Na                ├─ Na
   └─ Va                └─ Va

Now we have the example C Nd Vd:

S1(C Nd Vd)          C Nd Vd
├─ C                 ├─ C
└─ T2(Nd, Vd)        └─ (Nd, Vd)
   ├─ Nd                ├─ Nd
   └─ Vd                └─ Vd

Now let's consider the longer example with three pairs: C Na Na Nd Va Va Vd.
We have a derivation tree that looks like the following diagram.

S1(C Na Na Nd Va Va Vd)
├─ C
└─ T2(Na Na Nd, Va Va Vd)
   ├─ Na
   ├─ Va
   └─ T2(Na Nd, Va Vd)
      ├─ Na
      ├─ Va
      └─ T2(Nd, Vd)
         ├─ Nd
         └─ Vd

This is a bit cluttered, so in the next diagram of the same sentence
we have drawn dotted lines from the terminal symbols in the derivation
tree to the symbols in their positions in the original string, so that
the structure of the derivation can be seen more easily.
[Diagram: the derivation tree above, with dotted lines linking each
terminal symbol in the tree to its position in the string
C Na Na Nd Va Va Vd.]
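The value computation that these diagrams depict can also be written out directly, one function per rule. This is a sketch with invented helper names, building the pair of strings bottom-up exactly as the derivation tree does:

```python
def t2_base(case):
    """Rules (18)/(19): the smallest matching noun-verb pair."""
    return ("N" + case, "V" + case)

def t2_extend(case, pair):
    """Rules (20)/(21): prepend a matching noun and verb to the two components."""
    u, v = pair
    return ("N" + case + " " + u, "V" + case + " " + v)

def s1(pair):
    """Rule (17): concatenate the two components after the complementizer C."""
    u, v = pair
    return "C " + u + " " + v

# The three-pair example, built from the inside out:
sentence = s1(t2_extend("a", t2_extend("a", t2_base("d"))))
# sentence == "C Na Na Nd Va Va Vd"
```

Each noun and its matching verb enter the derivation in the same step, which is exactly the cross-serial pairing discussed below.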
We also do the same thing but where we mark one of the nonterminals,
here boxed in red, and also mark, boxed in blue, the discontinuous
constituent that is derived from that nonterminal.

[Diagram: the same derivation tree, with one T2 nonterminal boxed in
red and the discontinuous constituent it derives boxed in blue within
the string C Na Na Nd Va Va Vd.]

Finally, by looking at the diagram we can see that we have the
right dependencies between the Ns and the Vs: if we look at which
ones are derived in the same step, or directly from the same nonter-
minal, we can see that these correspond to the cross-serial dependen-
cies. Reusing the same diagram and adding lines between Ns and Vs
that are derived in the same step, we get the following situation.
[Diagram: the same derivation tree, with lines connecting each N to
the V introduced in the same derivation step, showing the cross-serial
dependency pattern.]

Relative clauses

Let us now try to use MCFGs to do a direct analysis of the relative
clause example we looked at before. We repeat the figure here.

[Figure 3: Movement analysis.]
NP
├─ D: the
└─ N
   ├─ N: book
   └─ RP
      ├─ R: that
      └─ S
         ├─ NP: I
         └─ VP
            ├─ V: read
            └─ t: e

The point of the movement analysis is that the word "book" occurs
both in the gap, and then in the position as head of the NP, where it
is pronounced. We need it to be in both places so it can be analysed
in its two semantic roles, a fact that is captured even more directly
by the copy theory of movement.

The key insight, which is captured by the Minimalist grammar
analysis, is that we don't need to move anything, as long as we derive
the word in two places. So this analysis has two parts. The first is
that in the position of the gap, we derive the word or phrase that is
going to move, but instead of actually concatenating it together, it is
held in a waiting position where it is kept until the derivation arrives
at the particular point where it is actually pronounced. The second
part is where the moving component actually lands, and then it is
integrated into the final structure. So in an MG analysis the nodes
of the derivation tree have two parts: the part that is actually being
built, and a list of zero or more constituents that are in the process of
moving.

In the diagram here, we have marked the first part in a blue box,
and the second part in a red box. In the blue box in the tree, we see
that a transitive verb is combined with the noun "book"; however,
instead of being directly combined into a symbol of dimension 1, we
keep it to one side, as it is going to move to another location where
it will be pronounced. Higher up in the tree, in the local tree in the
red box, we see where the moving constituent lands. Here we have a
rule that takes the moving constituent, the N, and combines it into the
structure. Thus the unary tree in the red box has a dimension 1
symbol at the parent, and a dimension 2 symbol as the single child.
In between the two boxes we have a succession of rules that resemble
context-free rules.

[Diagram: the corresponding derivation tree. The unary step from N
down to the pair RP,N is boxed in red; the node VP,N is boxed in blue.]
NP
├─ D: the
└─ N
   └─ RP,N
      ├─ R: that
      └─ S,N
         ├─ NP: I
         └─ VP,N
            ├─ V: read
            └─ N: book

We can put this as a value tree in the following example:

the book that I read
├─ the
└─ book that I read
   └─ (that I read, book)
      ├─ that
      └─ (I read, book)
         ├─ I
         └─ (read, book)
            ├─ read
            └─ book
In fact we will make a minor notational difference and switch the
order of the moving and head labels, so that they reflect the final
order in the surface string. (This is because we want to use only
non-permuting rules: with this ordering, when we look at Figure 4,
the concatenations in the value tree on the right hand side do not
permute.)

Figure 4: The swapped order, with the derivation tree on the left and
the value tree on the right; here they are shown one after the other.

NP
├─ D: the
└─ N
   └─ N,RP
      ├─ R: that
      └─ N,S
         ├─ NP: I
         └─ N,VP
            ├─ N: book
            └─ V: read

the book that I read
├─ the
└─ book that I read
   └─ (book, that I read)
      ├─ that
      └─ (book, I read)
         ├─ I
         └─ (book, read)
            ├─ book
            └─ read

We can now write down the sorts of rules we need here: for each
local tree, assuming that the not very plausible labels on the trees
are OK, we write down a rule. The nonterminal symbols here in the
MCFG will just be one for each label that we get in the tree. Note
that these correspond in some sense to tuples of more primitive la-
bels. Thus the node in the tree labeled N,S will turn into an atomic
symbol, NS, in the MCFG we are writing, but it has some internal
structure that we flatten out: it is an S with an N that is moving out
of it.

NP1(uv) ← D1(u), N1(v)            (22)
N1(uv) ← NRP2(u, v)               (23)
NRP2(u, wv) ← R1(w), NS2(u, v)    (24)
NS2(u, wv) ← NP1(w), NVP2(u, v)   (25)
NVP2(u, v) ← N1(u), V1(v)         (26)
Let's look at these rules in a little more detail. The first rule is kind
of trivial: it is just a regular CFG rule, since all three of the symbols
have dimension 1. The second one is more interesting: here we have
a symbol with dimension 2, NRP2. This keeps track of two chunks:
the first is the moving chunk, which is the N, and the second is the
chunk that it is moving out of, which is the relative clause. The rule
just says that this is the point at which the piece can stop moving, so
the concatenation looks trivial. The third rule (with NRP2 on the left
hand side) is also interesting: it is an example of a much larger class
of rules. This rule just transfers the moving constituent up the tree.
In a longer example like

(27) This is the book that I told you Bob said I should read.

we will have a long sequence of rules of this type that pass the
moving constituent up. Generally we will have lots of rules like
MX2(u, wv) ← Y1(w), MZ2(u, v). These rules mean that we have
a constituent with something moving out of it: this is the MZ2 non-
terminal, where the first component u is the moving piece and the
second component v is the material that it is moving out of. We also
have a normal constituent without any movement, Y1. The result is a
constituent MX2 where the u is still moving: the w is concatenated
onto the v and the u remains as a separate piece. These correspond to
rules without a moving constituent like X1(wv) ← Y1(w), Z1(v).
(The symbols MX2, X1 and so on are unrelated in the MCFG
formalism, but are related at a linguistic level.)

Finally we have the rule labeled (26). This rule has a symbol of di-
mension 2 on the left hand side, and two symbols of dimension 1
on the right hand side. This rule applies when the constituent starts
to move. We introduce the noun that is the object of the verb and
the verb at the same time, but instead of concatenating them as one
might normally, we displace the noun and keep it as a separate com-
ponent. Crucially, the semantic relation between the noun and the
verb is local: they are introduced at the same step in the derivation
tree, so that when we want to do semantic interpretation they will
be close to each other and the appropriate meaning can be con-
structed compositionally. In many languages there might also be
an agreement relation between the verb and the object, which again
could be enforced through purely local rules.
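To make the mechanics of rules (22)–(26) concrete, here is a small sketch (the function names are mine, not part of the formalism): each rule becomes a function from daughter values to the mother value, with dimension-2 values represented as pairs (moving part, host string).

```python
def nvp2(n, v):
    """Rule (26): NVP2(u, v) <- N1(u), V1(v).  The noun starts moving."""
    return (n, v)

def ns2(np, nvp):
    """Rule (25): NS2(u, wv) <- NP1(w), NVP2(u, v).  Pass the mover up."""
    u, v = nvp
    return (u, np + " " + v)

def nrp2(r, ns):
    """Rule (24): NRP2(u, wv) <- R1(w), NS2(u, v).  Pass the mover up."""
    u, v = ns
    return (u, r + " " + v)

def n1(nrp):
    """Rule (23): N1(uv) <- NRP2(u, v).  The mover lands."""
    u, v = nrp
    return u + " " + v

def np1(d, n):
    """Rule (22): an ordinary context-free concatenation."""
    return d + " " + n

phrase = np1("the", n1(nrp2("that", ns2("I", nvp2("book", "read")))))
# phrase == "the book that I read"
```

Composing the functions in the shape of the derivation tree yields the surface string, with "book" introduced next to "read" but pronounced only in its landing position.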
You could also split this rule into two separate rules using some-
thing like a trace or phonetically null element. If we define a null
element using the symbol T1 (for trace), we can define a rule T1(ε) ←
that introduces the empty string ε. We then define a nonterminal of
dimension 2, NT2, which generates the moving word and the trace.
We can then rewrite the rule using the two rules

NVP2(u1, vu2) ← NT2(u1, u2), V1(v)
NT2(u, v) ← N1(u), T1(v)

Such an approach would also be appropriate for languages which
have resumptive pronouns that are not phonetically null. In that
case we would have a pronounced element in place of the empty
string in the T1 rule.
Now when we consider the MCFG that this is a fragment of, it is
clear that it is inadequate in quite a fundamental respect: it fails to
capture generalisations between the nonterminal symbols. This is the
same limitation that plain CFGs have, but it is even more acute now.
One of the problems with CFGs that was fixed by GPSG[9] is the fact
that there is no relationship between different symbols: either they
are identical or they are completely unrelated. In real languages we
have, for example, singular NPs and plural NPs. These need to be
represented by different nonterminals in a CFG, but there is no way
in a plain CFG to represent their relationship. The grammar must
tediously spell out every rule twice. Similarly with an MCFG, we have
to spell out every combination of rules with moving and nonmoving
constituents. The end result will be enormous: finite, of course, but
too large to be plausibly learned. Indeed the results that show the
equivalence of MCFGs and MGs[10] tend to reveal an exponential
blowup in the size of the MCFG that results from the conversion
from a Minimalist Grammar. Thus it is important to augment MCFGs
with an appropriate feature calculus that can compactly represent the
overly large MCFGs, and reduce the redundancies to an acceptable
level.

9. G. Gazdar, E. Klein, G. Pullum, and I. Sag. Generalised Phrase
Structure Grammar. Basil Blackwell, 1985.
10. J. Michaelis. Transforming linear context-free rewriting systems
into minimalist grammars. In P. de Groote, G. Morrill, and C. Retoré,
editors, Logical Aspects of Computational Linguistics, pages 228–244.
Springer, 2001.

Discussion

The relation to tree adjoining grammars (TAG) and other similar
formalisms, such as Linear Context Free Rewriting Systems (LCFRS),
is very close. LCFRSs are really just another name for MCFGs, and as
Joshi et al.[11] showed, the derivation trees of a wide range of mildly
context-sensitive formalisms can be modeled by LCFRSs. In a TAG
we have an adjunction operation that corresponds to a restricted
subset of rules: the derivation trees will only use well-nested rules.
Thus in a TAG we cannot have a rule that looks like this:

N2(ux, vy) ← P2(u, v), Q2(x, y)   (28)

11. A.K. Joshi, K. Vijay-Shanker, and D.J. Weir. The convergence of
mildly context-sensitive grammar formalisms. In Peter Sells et al.,
editor, Foundational Issues in Natural Language Processing, pages 31–81.
MIT Press, Cambridge, MA, 1991.

This rule violates the well-nestedness condition because the u, v pair
and the x, y pair overlap rather than being nested one within the
other. Look at the left hand side of this rule and see how the vari-
ables from P2 and from Q2 interleave with each other. These two
rules, on the other hand, are well-nested:

N2(ux, yv) ← P2(u, v), Q2(x, y)   (29)
N2(xu, vy) ← P2(u, v), Q2(x, y)   (30)

It is an open question whether non-well-nested rules are necessary
for natural language description.
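One way to make the crossing test concrete is to encode a rule by the variable sequences of its left hand side and the variable sets of its right-hand-side nonterminals, and then check that no two daughters' variables interleave in the pattern a b a b. The encoding below is my own sketch, not standard notation, and assumes non-erasing rules (every daughter variable occurs in the left hand side).

```python
from itertools import combinations

def well_nested(lhs_args, daughters):
    """Check the well-nestedness condition on an MCFG rule (a sketch).

    lhs_args: list of variable sequences, one per LHS component,
              e.g. rule (28) N2(ux, vy) is [["u", "x"], ["v", "y"]].
    daughters: list of variable sets, one per RHS nonterminal,
              e.g. [{"u", "v"}, {"x", "y"}] for P2(u, v), Q2(x, y).
    """
    owner = {v: i for i, d in enumerate(daughters) for v in d}
    # Left-to-right sequence of daughter indices across the whole LHS.
    seq = [owner[v] for args in lhs_args for v in args]
    for a, b in combinations(range(len(daughters)), 2):
        pat = [x for x in seq if x in (a, b)]
        # Collapse adjacent repeats; an alternation of length >= 4
        # means the two daughters cross (a b a b).
        collapsed = [pat[0]] + [x for i, x in enumerate(pat[1:], 1) if x != pat[i - 1]]
        if len(collapsed) >= 4:
            return False
    return True
```

Rule (28) fails this test, while rules (29) and (30) pass, matching the discussion above.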
MCFGs in their full generality allow symbols of dimension greater
than 2. This gives rise to a complicated hierarchy of languages that
we will just sketch here. There are two factors that affect this hierar-
chy: one is the maximum dimension of the symbols in the grammar,
and the other is the maximum number of symbols that we have on
the right hand side of the rules. For CFGs we know, since all CFGs
can be converted into Chomsky normal form, that as long as we are
allowed to have at least 2 symbols on the right hand sides of rules,
there is no restriction. This is not the case for MCFGs. We will
call the rank of a grammar the maximum number of nonterminals
that we have on the right hand side of a rule. There is a complicated
two-dimensional hierarchy of MCFGs based on their rank and dimen-
sion. It seems that for natural languages we only need some simple
ones at the bottom of the hierarchy, with perhaps dimension and
rank only of 2. If we write r for the rank and d for the dimension,
then MCFGs can be parsed in time O(n^((r+1)d)). Note that if we
consider CFGs in Chomsky normal form, we have d = 1 and r = 2, so
this gives us an O(n^3) parsing algorithm, as we would expect. For
the case of r = 2 and d = 2, which may be adequate for natural
languages, we need O(n^6).
MCFGs also have a lot of the nice properties that CFGs have:
they form what is called an abstract family of languages. This means,
for example, that if we have two languages generated by MCFGs,
then their union can also be defined by an MCFG: they are closed
under the union operation. They are also closed under many other
operations: intersection with regular sets, Kleene star and so on.
This means they are a natural computational class in a certain sense.
They are also not very linguistic: you could equally well use them
for modelling DNA sequences or anything else. Some of the other
formalisms equivalent to MCFGs, or to subclasses of them, have
quite a lot of interesting linguistic detail baked into the formalism in
some way. MCFGs are quite neutral in this respect, just like CFGs.
This may or may not be a good thing, depending on your point of
view.


A brief comment about the notation: the form we have used for the
rules here was introduced by Kanazawa[12] and is in our opinion
a great improvement on the original notation of Seki et al.[13] The
notation is perhaps closest to that of some other closely related
formalisms, such as Range Concatenation Grammars[14] and
Literal Movement Grammars[15]: these notations are based on what
are called Horn clauses, which are often used in logic programming.
We have used ← instead of the traditional connective := to empha-
size the relation with the derivation operation denoted by ⇒. The
technical literature tends to use the original notation.

12. Makoto Kanazawa. The pumping lemma for well-nested multiple
context-free languages. In Developments in Language Theory, pages
312–325. Springer, 2009.
13. H. Seki, T. Matsumura, M. Fujii, and T. Kasami. On multiple
context-free grammars. Theoretical Computer Science, 88(2):191–229, 1991.
14. P. Boullier. Chinese Numbers, MIX, Scrambling, and Range
Concatenation Grammars. In Proceedings of the 9th Conference of the
European Chapter of the Association for Computational Linguistics
(EACL '99), pages 8–12, 1999.
15. A.V. Groenink. Mild context-sensitivity and tuple-based general-
izations of context-free grammar. Linguistics and Philosophy,
20(6):607–636, 1997.
If we are interested in cognitive modeling, we are always working
at a significant level of abstraction: as Stabler argues[16], we should
focus on the abstract thing that is being computed. The equiva-
lence of MCFGs and MGs, as well as the equivalence of subclasses
of MCFGs to TAGs and the other well-known equivalences, when
taken together, suggest that MCFGs define the right combinatorial
operations to model natural language syntax. From a linguistic mod-
eling point of view, MCFGs have two compelling advantages: first,
they can be efficiently parsed, and secondly, various subclasses can be
efficiently learned[17].

One of the deep sociological divides in formal/theoretical syn-
tax is between formalisms that use movement and formalisms that
don't. One fascinating implication of the research into MCFGs is that
it indicates that this divide does not depend on any technical differ-
ence. In a certain sense, there is a precise formal equivalence between
movement-based models and monostratal ones.

16. E.P. Stabler. Computational perspectives on minimalism. In
Cedric Boeckx, editor, Oxford Handbook of Linguistic Minimalism,
pages 617–641. 2011.
17. Ryo Yoshinaka. Polynomial-time identification of multiple
context-free languages from positive data and membership queries.
In Proceedings of the International Colloquium on Grammatical Inference,
pages 230–244, 2010; and Ryo Yoshinaka. Efficient learning of
multiple context-free languages with multidimensional substitutability
from positive data. Theoretical Computer Science, 412(19):1821–1831, 2011.

Acknowledgments

Nothing in this note is original; I am merely re-explaining and re-
packaging the ideas of Makoto Kanazawa, Ed Stabler and many oth-
ers. My thanks to Ryo Yoshinaka for checking an earlier draft of this
note. Thanks also to Chris Brew and Misha Becker for suggesting
corrections. There is only a very partial list of citations here: apologies
to anyone who feels slighted.

Comments and corrections are very welcome (genuinely: I am
not just being polite) and should be sent to [email protected].

References

P. Boullier. Chinese Numbers, MIX, Scrambling, and Range Con-
catenation Grammars. In Proceedings of the 9th Conference of the European
Chapter of the Association for Computational Linguistics (EACL '99),
pages 8–12, 1999.

N. Chomsky. Three models for the description of language. IEEE
Transactions on Information Theory, 2(3):113–124, 1956.

G. Gazdar, E. Klein, G. Pullum, and I. Sag. Generalised Phrase Struc-
ture Grammar. Basil Blackwell, 1985.

A.V. Groenink. Mild context-sensitivity and tuple-based generaliza-
tions of context-free grammar. Linguistics and Philosophy, 20(6):607–636,
1997.

A.K. Joshi, K. Vijay-Shanker, and D.J. Weir. The convergence of
mildly context-sensitive grammar formalisms. In Peter Sells et al.,
editor, Foundational Issues in Natural Language Processing, pages 31–81.
MIT Press, Cambridge, MA, 1991.

L. Kallmeyer. Parsing Beyond Context-Free Grammars. Springer Verlag,
2010.

Makoto Kanazawa. The pumping lemma for well-nested multiple
context-free languages. In Developments in Language Theory, pages
312–325. Springer, 2009.

J. Michaelis. Transforming linear context-free rewriting systems into
minimalist grammars. In P. de Groote, G. Morrill, and C. Retoré,
editors, Logical Aspects of Computational Linguistics, pages 228–244.
Springer, 2001.

P.S. Peters and R.W. Ritchie. On the generative power of transfor-
mational grammars. Information Sciences, 6:49–83, 1973.

H. Seki, T. Matsumura, M. Fujii, and T. Kasami. On multiple
context-free grammars. Theoretical Computer Science, 88(2):191–229, 1991.

S. Shieber. Evidence against the context-freeness of natural lan-
guage. Linguistics and Philosophy, 8:333–343, 1985.

E. Stabler. Derivational minimalism. In C. Retoré, editor, Logical
Aspects of Computational Linguistics (LACL 1996), pages 68–95. Springer,
1997.

E.P. Stabler. Computational perspectives on minimalism. In Cedric
Boeckx, editor, Oxford Handbook of Linguistic Minimalism, pages
617–641. 2011.

K. Vijay-Shanker and David J. Weir. The equivalence of four exten-
sions of context-free grammars. Mathematical Systems Theory, 27(6):
511–546, 1994.

Ryo Yoshinaka. Polynomial-time identification of multiple context-
free languages from positive data and membership queries. In
Proceedings of the International Colloquium on Grammatical Inference,
pages 230–244, 2010.

Ryo Yoshinaka. Efficient learning of multiple context-free languages
with multidimensional substitutability from positive data. Theoreti-
cal Computer Science, 412(19):1821–1831, 2011.
