Unit - 3
Context-Free Grammar
A context-free grammar (CFG) consists of a set of rules that describe how to form the sentences
in a language. Each rule has a left-hand side, which identifies a syntactic category, and a right-
hand side, which defines its alternative component parts, reading from left to right.
E.g., the rule s --> np vp means that "a sentence is defined as a noun phrase followed
by a verb phrase." Figure 1 shows a simple CFG that describes the sentences from a
small subset of English.
A sentence in the language defined by a CFG is a series of words that can be derived by
systematically applying the rules, beginning with a rule that has S on its left-hand side. A
parse of the sentence is a series of rule applications in which a syntactic category is replaced
by the right-hand side of a rule that has that category on its left-hand side, and the final rule
application yields the sentence itself. E.g., a parse of the sentence "the giraffe dreams" is:
S => np vp
=> det n vp
=> the n vp
=> the giraffe vp
=> the giraffe iv
=> the giraffe dreams
A convenient way to describe a parse is to show its parse tree, which is simply a graphical
display of the parse. The figure shows a parse tree for the sentence "the giraffe dreams". Note that
the root of every subtree has a grammatical category that appears on the left-hand side of a rule,
and the children of that root are identical to the elements on the right-hand side of that rule.
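To make this concrete, here is a minimal sketch of the toy grammar in NLTK's notation (assuming the nltk package is installed; the grammar below covers only the rules used in the derivation above, with the category names capitalized):

import nltk

# The toy grammar from the derivation above, in NLTK's CFG notation.
grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N
    VP -> IV
    Det -> 'the'
    N -> 'giraffe'
    IV -> 'dreams'
""")

# A chart parser enumerates every parse licensed by the grammar.
parser = nltk.ChartParser(grammar)
for tree in parser.parse(['the', 'giraffe', 'dreams']):
    tree.pretty_print()  # prints the parse tree as ASCII art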
Dependency Grammar
Traditionally, a dependency grammar belongs to the class of grammars that
emphasize words rather than constituents. Grammars that are built primarily
on constituents are known as phrase structure grammars. Phrase structure grammars are
thus constituent-based, while dependency grammars are word-based.
While phrase structure grammars see sentences and clauses structured in terms of constituents,
dependency grammars assume that sentence and clause structure derives from dependency
relationships between words. The difference is illustrated in the trees below:
Tree (or phrase marker) (a) shows what a traditional phrase structure grammar would view
as the structure of the sentence They killed the man with a gun. One can see, for instance, that
the preposition with forms a prepositional phrase with the noun phrase a gun. Compare this to
the dependency tree (b): here the preposition dominates the noun gun, which in turn dominates
the article a.
When one compares the number of nodes across the two trees, one finds that (a) contains 12
nodes, while (b) contains only 7. Assuming that the two representations convey the same
information about the utterance, the one that does so without explanatory deficiencies is the
more minimal of the two. In general, dependency grammars are more minimal than phrase
structure grammars because they assume less structure.
A second point to acknowledge is that the term grammar can have two meanings; one
meaning is rather general and refers to how linguistic units are structured. In this broader
sense, grammar is a hyperonym to syntax, morphology, and phonology. The narrow
meaning refers only to syntax. The word grammar in dependency grammar is traditionally
understood in the narrow sense, i.e. dependency grammars are theories of syntax, but not
theories of morphology or phonology. The motivation for understanding dependency
grammars as dependency syntaxes comes from their being word-based. Historically,
dependency grammars have struggled to establish themselves in the broader sense
of grammar.
The grammar explained here differs from traditional dependency grammars insofar as it
is catena-based. Since catenae operate in syntax as well as in morphology, a catena-based
dependency grammar can hope to become a grammar in the broader sense. Even though the
ultimate aim is to see the distinction between syntax and morphology as a continuum, the
discussion here will branch out into a syntax section and a morphology section.
The section on core notions has established the concept of the catena, and how it relates
to strings, components, and constituents. The purpose of this section is to show
that catenae play an important role in many phenomena seen as central in syntax research.
These areas include displacement, idiom formation, ellipsis, predicate structure,
and constructions. These phenomena will be briefly addressed in turn. Prior to this, several
other important terms will be introduced.
The terms string, catena, component, and constituent have already been defined. In addition
to these, the following terms are also necessary: root, head, dependent, governor,
and governee.
string: a word or combination of words that is continuous with respect to precedence
catena: a word or combination of words that is continuous with respect to dominance
component: a word or combination of words that is both a string and a catena
constituent: a component that is complete
root: the one word in a given catena that is not dominated by any other word in that catena
head: the one word that immediately dominates a given catena
dependent: a constituent that is immediately dominated by a given word
governor: the one word that licenses the appearance of a given catena
governee: a catena the appearance of which is licensed by a given word
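As a brief illustration (a sketch, not part of the traditional presentation), a dependency tree such as tree (b) above can be encoded in Python simply by recording the index of each word's head; attaching "with" to "killed" is one assumed reading of the sentence:

sentence = ["They", "killed", "the", "man", "with", "a", "gun"]
heads    = [1,      None,     3,     1,     1,      6,   4]
# "killed" is the root (head = None); "with" (index 4) is taken to depend
# on "killed" (index 1), "gun" (6) on "with", and "a" (5) on "gun",
# mirroring "the preposition dominates the noun gun, which in turn
# dominates the article a."

def dependents(i):
    # The words immediately dominated by word i (its dependents).
    return [j for j, h in enumerate(heads) if h == i]

print(dependents(1))  # [0, 3, 4]: "They", "man" and "with" depend on "killed"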
Chomsky Normal Form
A context-free grammar (CFG) is in Chomsky Normal Form (CNF) if all production rules
satisfy one of the following conditions:
A non-terminal generating a terminal (e.g., X->x)
A non-terminal generating two non-terminals (e.g., X->YZ)
The start symbol generating ε (e.g., S->ε)
Consider the following grammars,
G1 = {S->a, S->AZ, A->a, Z->z}
G2 = {S->a, S->aZ, Z->a}
The grammar G1 is in CNF, as its production rules satisfy the conditions specified for CNF.
However, the grammar G2 is not in CNF, as the production rule S->aZ contains a terminal
followed by a non-terminal, which violates the conditions specified for CNF.
Note –
For a given grammar, there can be more than one equivalent CNF.
A grammar in CNF generates the same language as the original CFG.
CNF is used as a preprocessing step for many algorithms on CFGs, such as CYK (the
membership algorithm) and bottom-up parsers.
Generating a string w of length n requires 2n−1 production steps in CNF: n−1 applications
of binary rules to produce n non-terminals, plus n applications of terminal rules.
Any context-free grammar that does not have ε in its language has an equivalent CNF.
How to convert CFG to CNF?
Step 1. Eliminate the start symbol from the RHS.
If the start symbol S occurs on the RHS of any production in the grammar, create a new
production:
S0->S
where S0 is the new start symbol.
Step 2. Eliminate null, unit and useless productions.
If the CFG contains null, unit or useless production rules, eliminate them.
Step 3. Eliminate terminals from the RHS if they occur together with other terminals or
non-terminals. e.g., the production rule X->xY can be decomposed as:
X->ZY
Z->x
Step 4. Eliminate RHS with more than two non-terminals.
e.g., the production rule X->XYZ can be decomposed as:
X->PZ
P->XY
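As an illustration of Steps 3 and 4, here is a minimal Python sketch (the helper names and the dict-of-tuples grammar encoding are assumptions of the sketch, not a standard API); terminals are lowercase strings and non-terminals uppercase:

def to_cnf_rules(grammar):
    # grammar: dict mapping a non-terminal to a list of RHS tuples.
    new = {}
    fresh = iter("PQRTUVWXYZ")   # fresh non-terminal names, assumed unused
    term_map = {}                # terminal -> the non-terminal that lifts it

    def lift(sym):
        # Step 3: replace a terminal inside a long RHS by a new
        # non-terminal N with the rule N -> terminal.
        if sym.islower():
            if sym not in term_map:
                n = next(fresh)
                term_map[sym] = n
                new[n] = [(sym,)]
            return term_map[sym]
        return sym

    for lhs, rhss in grammar.items():
        for rhs in rhss:
            if len(rhs) == 1:                  # A -> a is already in CNF
                new.setdefault(lhs, []).append(rhs)
                continue
            rhs = tuple(lift(s) for s in rhs)  # Step 3
            while len(rhs) > 2:                # Step 4: binarize the RHS
                n = next(fresh)
                new[n] = [rhs[:2]]
                rhs = (n,) + rhs[2:]
            new.setdefault(lhs, []).append(rhs)
    return new

# S -> aSB | a  becomes  S -> QB | a  with  P -> a, Q -> PS
print(to_cnf_rules({"S": [("a", "S", "B"), ("a",)]}))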
Example – Let us take an example to convert CFG to CNF. Consider the given grammar G1:
S → ASB
A → aAS|a|ε
B → SbS|A|bb
Step 1. As the start symbol S appears on the RHS, we create a new production rule S0->S.
Therefore, the grammar becomes:
S0->S
S → ASB
A → aAS|a|ε
B → SbS|A|bb
Step 2. As the grammar contains the null production A->ε, removing it yields:
S0->S
S → ASB|SB
A → aAS|aS|a
B → SbS| A|ε|bb
This now creates the null production B->ε; removing it yields:
S0->S
S → AS|ASB| SB| S
A → aAS|aS|a
B → SbS| A|bb
This in turn creates the unit production B->A; removing it yields:
S0->S
S → AS|ASB| SB| S
A → aAS|aS|a
B → SbS|bb|aAS|aS|a
Also, removing the unit production S0->S from the grammar yields:
S0-> AS|ASB| SB| S
S → AS|ASB| SB| S
A → aAS|aS|a
B → SbS|bb|aAS|aS|a
Also, removing the unit productions S->S and S0->S from the grammar yields:
S0-> AS|ASB| SB
S → AS|ASB| SB
A → aAS|aS|a
B → SbS|bb|aAS|aS|a
Step 3. In the production rules A->aAS|aS and B->SbS|aAS|aS, the terminals a and b occur
on the RHS together with non-terminals. Replacing them on the RHS yields:
S0-> AS|ASB| SB
S → AS|ASB| SB
A → XAS|XS|a
B → SYS|bb|XAS|XS|a
X →a
Y→b
Also, B->bb cannot be part of CNF; replacing it with B->VV (where V->b) yields:
S0-> AS|ASB| SB
S → AS|ASB| SB
A → XAS|XS|a
B → SYS|VV|XAS|XS|a
X→a
Y→b
V→b
Step 4: In the production rule S0->ASB, the RHS has more than two symbols; decomposing
it yields:
S0-> AS|PB| SB
S → AS|ASB| SB
A → XAS|XS|a
B → SYS|VV|XAS|XS|a
X→a
Y→b
V→b
P → AS
Similarly, S->ASB has more than two symbols; decomposing it yields:
S0-> AS|PB| SB
S → AS|QB| SB
A → XAS|XS|a
B → SYS|VV|XAS|XS|a
X→a
Y→b
V→b
P → AS
Q → AS
Similarly, A->XAS has more than two symbols; decomposing it yields:
S0-> AS|PB| SB
S → AS|QB| SB
A → RS|XS|a
B → SYS|VV|XAS|XS|a
X→a
Y→b
V→b
P → AS
Q → AS
R → XA
Similarly, B->SYS has more than two symbols; decomposing it yields:
S0 -> AS|PB| SB
S → AS|QB| SB
A → RS|XS|a
B → TS|VV|XAS|XS|a
X→a
Y→b
V→b
P → AS
Q → AS
R → XA
T → SY
Similarly, B->XAS has more than two symbols; decomposing it yields:
S0-> AS|PB| SB
S → AS|QB| SB
A → RS|XS|a
B → TS|VV|US|XS|a
X→a
Y→b
V→b
P → AS
Q → AS
R → XA
T → SY
U → XA
This is the required CNF for the given grammar.
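As a quick sanity check, the final grammar can be verified programmatically; this sketch simply transcribes the rules above (with symbols as one-character strings, and S0 written as "S0") and asserts the CNF shape of every RHS:

final = {
    "S0": ["AS", "PB", "SB"], "S": ["AS", "QB", "SB"],
    "A": ["RS", "XS", "a"],   "B": ["TS", "VV", "US", "XS", "a"],
    "X": ["a"], "Y": ["b"], "V": ["b"],
    "P": ["AS"], "Q": ["AS"], "R": ["XA"], "T": ["SY"], "U": ["XA"],
}
for lhs, rhss in final.items():
    for rhs in rhss:
        # CNF: a single terminal, or exactly two non-terminals.
        assert (len(rhs) == 1 and rhs.islower()) or \
               (len(rhs) == 2 and rhs.isupper()), (lhs, rhs)
print("all rules are in CNF")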
Ambiguity in NLP
Since natural language can at times be open to multiple interpretations, this ambiguity is
passed on to the computers that try to understand the natural language input given to them.
It can often be difficult to fully understand a sentence when we are not given enough context
or when the grammar is poor.
This section goes over the different types of ambiguity found in NLP.
Part-of-Speech Tagging Ambiguity
POS tagging refers to the process of classifying the words in a text by part of speech –
whether a word is a verb, a noun, etc. Often, the same word can take on multiple parts of
speech depending on how the sentence is constructed. For example, it is quite common to
see words that can be used both as a verb and as a noun – such as "book" in "a good book"
(noun) versus "book a table" (verb).
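A small sketch of this in Python (assuming nltk and its averaged_perceptron_tagger data are installed; exact tags can vary):

import nltk

# The same word "book" receives different POS tags depending on context.
print(nltk.pos_tag("I read a good book".split()))
# typically: [..., ('book', 'NN')] – tagged as a noun

print(nltk.pos_tag("Please book a table for two".split()))
# typically: [..., ('book', 'VB')] – tagged as a verb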
Syntactic Ambiguity
This ambiguity arises because the same exact sentence can be interpreted differently based on
how the sentence is parsed. Take a sentence like "The boy kicked the ball in his jeans."
This sentence can be construed as the boy either kicking the ball while wearing his jeans, or
kicking the ball while the ball was in the jeans, depending on how the sentence is parsed.
Scope Ambiguity
Here we look at ambiguities that occur due to quantifiers. Recalling some terminology from
mathematical logic, or just basic grammar, words like 'every', 'all' and 'a' come to mind.
Consider a sentence like "All students learn a programming language." Due to the scope
created by the sequential use of the quantifiers 'all' followed by 'a', this sentence can have
two different meanings:
The first is that all students learn the same programming language (one language exists that
every student learns).
The second is that each student learns some programming language, which does not have to
be the same one.
Lexical Ambiguity
Certain words can have multiple different meanings. There are two forms of lexical
ambiguity: polysemy and homonymy.
Polysemy − This is when the same word has different but related meanings depending on
usage, e.g., the word foot. Foot can describe the body part or the foot of a building;
essentially, the word foot describes the base of something.
Homonymy − This occurs when words share the same spelling or pronunciation but have
entirely different meanings. While superficially the same, they are completely different in
meaning. The word bass, for example, can refer to the musical instrument or to a type of
fish. Another example, given here to clarify that not just spelling but pronunciation matters
too, is horse and hoarse: these two sound the same, but horse refers to the animal and
hoarse refers to a rough voice from a sore throat.
Semantic Ambiguity
Now, instead of a word having multiple meanings, sentences can have multiple meanings
depending on the context. For example, the sentence "He ate the burnt lasagna and pie"
could mean one of two things −
He ate the burnt lasagna and the burnt pie (both were burnt).
He ate the burnt lasagna and the pie (only the lasagna was burnt).
Referential Ambiguity
Referential ambiguity occurs when a phrase can have multiple interpretations because
multiple objects are involved and the referencing is not clear. For example, take a sentence
like "I saw the man with the telescope."
This can mean two things depending on who has the telescope: either the speaker used the
telescope to see the man, or the man being observed had the telescope.
Anaphoric Ambiguity
Here we have an ambiguity loosely similar to referential ambiguity, but more fixated on
pronouns. The use of pronouns can cause confusion if multiple people are mentioned in a
sentence. Take a sentence like "Michelle went to the movies with Romany, and she bought
the popcorn."
From the sentence alone it is not exactly clear whether 'she' refers to Michelle or Romany.
Parsing
The word 'parsing', which originates from the Latin word 'pars' (meaning 'part'), is used to
draw the exact meaning, or dictionary meaning, from text. Parsing is also called syntactic
analysis or syntax analysis. By comparing the text against the rules of a formal grammar,
syntax analysis checks the text for meaningfulness. A sentence like "Give me hot ice-cream",
for example, would be rejected by the parser or syntactic analyzer.
In this sense, we can define parsing, or syntactic analysis, as follows −
It may be defined as the process of analyzing strings of symbols in natural language,
conforming to the rules of a formal grammar.
We can understand the relevance of parsing in NLP with the help of the following points −
Dialogue systems and summarization are examples of NLP applications where deep parsing
is used, while information extraction and text mining are examples of NLP applications
where shallow parsing is used.
Recursive Descent Parsing
Recursive descent parsing is one of the most straightforward forms of parsing. Following are
some important points about the recursive descent parser −
It follows a top-down process, attempting to verify that the syntax of the input is correct by
reading the text from left to right.
It uses one recursive procedure per non-terminal of the grammar, as sketched below.
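A minimal sketch of this idea (illustrative only: one method per non-terminal of the toy grammar used earlier, with a little extra vocabulary invented for the demo, and no backtracking since each category here has a single rule):

class RecursiveDescentParser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def expect(self, words):
        # Consume the next token if it belongs to the given word class.
        if self.pos < len(self.tokens) and self.tokens[self.pos] in words:
            self.pos += 1
            return True
        return False

    def parse_s(self):   # S -> NP VP
        return self.parse_np() and self.parse_vp()

    def parse_np(self):  # NP -> Det N
        return self.expect({"the", "a"}) and self.expect({"giraffe", "man"})

    def parse_vp(self):  # VP -> IV
        return self.expect({"dreams", "sleeps"})

p = RecursiveDescentParser(["the", "giraffe", "dreams"])
print(p.parse_s() and p.pos == len(p.tokens))  # True: the sentence is accepted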
Regexp Parsing
Regexp parsing is one of the most commonly used parsing techniques. Following are some
important points about the Regexp parser −
As the name implies, it uses a regular expression defined in the form of grammar on
top of a POS-tagged string.
It basically uses these regular expressions to parse the input sentences and generate a
parse tree from them, as sketched below.
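A minimal sketch using NLTK's RegexpParser (assuming nltk is installed; the chunk pattern and the tagged sentence are invented for the demo):

import nltk

# A regular-expression grammar over POS tags: an NP chunk is an optional
# determiner, any number of adjectives, then a noun.
grammar = "NP: {<DT>?<JJ>*<NN>}"
parser = nltk.RegexpParser(grammar)

tagged = [("a", "DT"), ("very", "RB"), ("heavy", "JJ"),
          ("orange", "JJ"), ("book", "NN")]
print(parser.parse(tagged))
# (S a/DT very/RB (NP heavy/JJ orange/JJ book/NN))
# "a" and "very" fall outside the chunk because the RB tag breaks the pattern.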
Cocke–Younger–Kasami (CYK) Algorithm
Grammar denotes the syntactical rules for conversation in natural language. But in the theory
of formal language, grammar is defined as a set of rules that can generate strings. The set of all
strings that can be generated from a grammar is called the language of the grammar.
Context Free Grammar:
We are given a Context Free Grammar G = (V, X, R, S) and a string w, where:
V is a finite set of variables or non-terminal symbols,
X is a finite set of terminal symbols,
R is a finite set of rules,
S is the start symbol, a distinct element of V, and
V and X are assumed to be disjoint sets.
The Membership problem is defined as: Grammar G generates a language L(G). Is the given
string a member of L(G)?
Chomsky Normal Form:
A Context Free Grammar G is in Chomsky Normal Form (CNF) if each rule of G is of one of
the forms:
A –> BC, [ exactly two non-terminal symbols on the RHS ]
A –> a, or [ a single terminal symbol on the RHS ]
S –> ε [ the start symbol may derive the null string ]
Cocke-Younger-Kasami Algorithm
It solves the membership problem using a dynamic programming approach. The algorithm is
based on the principle that the solution to problem [i, j] can be constructed from the solution
to subproblem [i, k] and the solution to subproblem [k + 1, j]. The algorithm requires the
grammar G to be in Chomsky Normal Form (CNF). Note that any context-free grammar can
be systematically converted to CNF. This restriction is employed so that each problem can
only be divided into two subproblems and not more, to bound the time complexity.
How does the CYK Algorithm work?
For a string of length N, construct a table T of size N x N. Each cell in the table T[i, j] is the
set of all constituents that can produce the substring spanning from position i to j. The process
involves filling the table with the solutions to the subproblems encountered in the bottom-up
parsing process. Therefore, cells will be filled from left to right and bottom to top.
For example, with N = 5, the bottom two rows of the table contain the cells:

      1    2    3    4        5
  4                  [4, 4]   [4, 5]
  5                           [5, 5]
In T[i, j], the row number i denotes the start index and the column number j denotes the end
index.
The algorithm considers every possible subsequence of letters and adds K to T[i, j] if the
sequence of letters starting from i to j can be generated from the non-terminal K. For
subsequences of length 2 and greater, it considers every possible partition of the subsequence
into two parts, and checks if there is a rule of the form A → BC in the grammar where B and C
can generate the two parts respectively, based on already existing entries in T. The sentence
can be produced by the grammar only if the entire string is matched by the start symbol, i.e.,
if S is a member of T[1, n].
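A minimal Python sketch of this procedure (illustrative; the grammar encoding and the 0-indexed table are choices of the sketch, not part of the text):

def cyk(words, unary, binary, start="S"):
    # T[i][j] holds the set of non-terminals that derive words[i..j]
    # (0-indexed, inclusive spans).
    n = len(words)
    T = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):                    # length-1 spans: A -> a
        T[i][i] = {A for A, a in unary if a == w}
    for length in range(2, n + 1):                   # longer spans, bottom-up
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):                    # every split point
                for A, (B, C) in binary:             # every rule A -> BC
                    if B in T[i][k] and C in T[k + 1][j]:
                        T[i][j].add(A)
    return start in T[0][n - 1]                      # membership test

# A tiny CNF grammar for "the giraffe dreams" (names invented for the demo).
unary = [("Det", "the"), ("N", "giraffe"), ("VP", "dreams")]
binary = [("S", ("NP", "VP")), ("NP", ("Det", "N"))]
print(cyk("the giraffe dreams".split(), unary, binary))  # True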
Let us start filling up the table from left to right and bottom to top, according to the rules
described above:
         1 (a)    2 (very)   3 (heavy)   4 (orange)    5 (book)
1  a     Det      –          –           NP            NP
2  very           Adv        AP          Nom           Nom
3  heavy                     A, AP       Nom           Nom
4  orange                                Nom, A, AP    Nom
5  book                                                Nom