NLP UNIT-II PPT
Unit-II
Syntax Analysis:
2.1 Parsing Natural Language
2.2 Treebanks: A Data-Driven Approach to Syntax
2.3 Representation of Syntactic Structure
2.4 Parsing Algorithms
2.5 Models for Ambiguity Resolution in Parsing, Multilingual Issues
Parsing in NLP is the process of determining the syntactic structure of a text by analysing
its constituent words based on an underlying grammar.
Example Grammar:
• The outcome of the parsing process is a parse tree, in which sentence is the root;
intermediate nodes such as noun_phrase and verb_phrase have children and are hence
called non-terminals; and the leaves of the tree, ‘Tom’, ‘ate’, ‘an’, ‘apple’, are
called terminals.
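A minimal sketch of this example using NLTK (assumed to be installed); the rules below are an illustrative reconstruction that mirrors the non-terminal names used above, not necessarily the exact example grammar from the slides.

```python
# Toy CFG for "Tom ate an apple"; non-terminal names mirror the text above.
import nltk
from nltk import CFG

grammar = CFG.fromstring("""
sentence -> noun_phrase verb_phrase
noun_phrase -> 'Tom' | determiner noun
verb_phrase -> verb noun_phrase
determiner -> 'an'
noun -> 'apple'
verb -> 'ate'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse(['Tom', 'ate', 'an', 'apple']):
    tree.pretty_print()  # sentence is the root; noun_phrase, verb_phrase, ... are
                         # non-terminals; the leaves 'Tom', 'ate', 'an', 'apple' are terminals
```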
Parse Tree:
A treebank can be defined as a linguistically annotated corpus that includes some kind of
syntactic analysis over and above part-of-speech tagging.
A sentence is parsed by relating each word to other words in the sentence which depend on it.
The syntactic parsing of a sentence consists of finding the correct syntactic structure of that
sentence in the given formalism/grammar.
Dependency grammar (DG) and phrase structure grammar (PSG) are two such formalisms.
PSG breaks a sentence into constituents (phrases), which are in turn broken into smaller
constituents.
It describes phrase and clause structure, for example NP, PP, VP, etc.
DG: the syntactic structure consists of lexical items linked by binary asymmetric relations called
dependencies.
It is interested in the grammatical relations between individual words.
It does not propose a recursive structure but rather a network of relations between words.
These relations can also have labels.
Constituency tree vs Dependency tree
Dependency structures explicitly represent
- Head-dependent relations (directed arcs)
- Functional categories (arc labels)
- Possibly some structural categories (POS)
Phrase structures explicitly represent
- Phrases (non-terminal nodes)
- Structural categories (non-terminal labels)
- Possibly some functional categories (grammatical functions) (see the sketch after this list)
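As a small illustration of the contrast, the sketch below encodes both analyses of "Tom ate an apple" in plain Python; the dependency labels (nsubj, obj, det) are assumed conventions, not taken from the slides.

```python
# Contrasting representations for "Tom ate an apple".

# Dependency structure: lexical items linked by binary, asymmetric, labelled arcs.
# Each entry is (head, dependent, label); the verb "ate" is the root.
dependency_arcs = [
    ("ate", "Tom",   "nsubj"),  # subject depends on the verb
    ("ate", "apple", "obj"),    # object depends on the verb
    ("apple", "an",  "det"),    # determiner depends on the noun
]

# Phrase structure: nested constituents labelled with structural categories.
constituency_tree = ("S",
                     ("NP", "Tom"),
                     ("VP", ("V", "ate"),
                            ("NP", ("DT", "an"), ("NN", "apple"))))

for head, dep, label in dependency_arcs:
    print(f"{head} --{label}--> {dep}")
```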
Data-driven dependency parsing involves defining candidate dependency trees for an input sentence, and then:
Learning: scoring possible dependency graphs for a given sentence, usually by factoring the
graphs into their component arcs.
Parsing: searching for the highest-scoring graph for a given sentence (sketched below).
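A minimal sketch of this learning/parsing split under an arc-factored model; the arc scores and candidate trees are made-up illustrations, and a real parser would search with a maximum-spanning-tree or dynamic-programming algorithm rather than enumerating candidates.

```python
# Arc-factored (graph-based) dependency parsing: every candidate tree is scored as
# the sum of the scores of its arcs; parsing selects the highest-scoring tree.
# Hypothetical learned arc scores, keyed by (head, dependent).
arc_score = {
    ("ROOT", "ate"): 5.0, ("ate", "Tom"): 4.0, ("ate", "apple"): 3.5,
    ("apple", "an"): 2.0, ("Tom", "an"): 0.1,
}

def tree_score(arcs):
    """Factor the graph into its component arcs and sum their scores."""
    return sum(arc_score.get(arc, float("-inf")) for arc in arcs)

# Two candidate dependency trees for "Tom ate an apple"
candidates = [
    [("ROOT", "ate"), ("ate", "Tom"), ("ate", "apple"), ("apple", "an")],
    [("ROOT", "ate"), ("ate", "Tom"), ("ate", "apple"), ("Tom", "an")],
]

best = max(candidates, key=tree_score)  # parsing = search for the highest-scoring graph
print(best, tree_score(best))
```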
Syntax
In NLP, the syntactic analysis of natural language input can vary from very low-level,
such as simply tagging each word in the sentence with a part of speech (POS), to very high-level,
such as full parsing.
In syntactic parsing, ambiguity is a particularly difficult problem because the most plausible
analysis has to be chosen from an exponentially large number of alternative analyses.
From tagging to full parsing, algorithms that can handle such ambiguity have to be carefully
chosen.
Here we explore syntactic analysis methods from tagging to full parsing and the use of
supervised machine learning to deal with ambiguity.
2.1 Parsing Natural Language
In a text-to-speech application, input sentences are to be converted to a spoken output that
should sound like it was spoken by a native speaker of the language.
Example: He wanted to go for a drive in the country.
There is a natural pause between the words drive and in that reflects an
underlying hidden structure of the sentence.
Parsing can provide a structural description that identifies such a break in the intonation.
A simpler case: The cat who lives dangerously had nine lives.
In this case, a text-to-speech system needs to know that the first instance of the word lives is
a verb and the second instance is a noun before it can begin to produce the natural
intonation for this sentence.
This is an instance of the part-of-speech (POS) tagging problem, where each word in the
sentence is assigned its most likely part of speech.
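A minimal sketch of POS tagging with NLTK (assumed installed, with its tokenizer and tagger data downloaded); the exact tags depend on the tagger used, but a correct analysis distinguishes the verb and noun uses of "lives".

```python
# POS tagging sketch with NLTK's default tagger.
import nltk

# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
# (resource names may vary slightly across NLTK versions)
sentence = "The cat who lives dangerously had nine lives."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# A correct analysis tags the first "lives" as a verb (VBZ) and the second as a
# plural noun (NNS) -- exactly the distinction a text-to-speech system needs.
```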
Another motivation for parsing comes from the natural language task of summarization, in
which several documents about the same topic should be condensed down to a small digest
of information.
Such a summary may be in response to a question that is answered in the set of documents.
In this case, a useful subtask is to compress an individual sentence so that only the relevant
portions of the sentence are included in the summary.
For example:
Original: Beyond the basic level, the operations of the three products vary widely.
Compressed: The operations of the products vary.
An elegant way to approach this task is to first parse the sentence to find its
constituents, recursively partitioning the words of the sentence into individual
phrases such as verb phrases and noun phrases.
The output of the parser for the input sentence is shown in Fig.
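As a rough illustration of parse-based compression, the sketch below uses a hand-written bracketing of the example sentence (an assumed parse, not the parser output shown in the figure) and a deliberately simple pruning heuristic that drops modifier constituents.

```python
# Parse-based sentence compression sketch using NLTK's Tree class.
from nltk import Tree

parse = Tree.fromstring("""
(S
  (PP (IN Beyond) (NP (DT the) (JJ basic) (NN level)))
  (, ,)
  (NP (NP (DT the) (NNS operations)) (PP (IN of) (NP (DT the) (CD three) (NNS products))))
  (VP (VBP vary) (ADVP (RB widely)))
  (. .))
""")

PRUNE = {"ADVP", "CD", ","}          # modifier constituents to drop everywhere

def compress(tree):
    """Recursively copy the tree, dropping pruned constituents."""
    children = []
    for child in tree:
        if isinstance(child, Tree):
            if child.label() in PRUNE:
                continue
            if child.label() == "PP" and tree.label() == "S":
                continue             # also drop the sentence-initial PP modifier
            children.append(compress(child))
        else:
            children.append(child)
    return Tree(tree.label(), children)

print(" ".join(compress(parse).leaves()))   # -> roughly "the operations of the products vary ."
```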
To explain some details of phrase structure analysis, consider the Penn Treebank, a project
that annotated about 40,000 sentences from the Wall Street Journal with phrase structure trees.
The SBARQ label marks wh-questions, i.e., those that contain a gap and therefore require a
trace.
• Wh- moved noun phrases are labeled WHNP and put inside SBARQ. They bear an identity
index that matches the reference index on the *T* in the position of the gap.
• However, questions that are missing both subject and auxiliary are labeled SQ.
• NP-SBJ marks noun phrases that are subjects.
• *T* marks traces of wh-movement; the empty trace bears an index (here it is 1) and is
associated with the WHNP constituent with the same index, as in the illustrative bracketing below.
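A hand-constructed, Penn-Treebank-style bracketing (illustrative only, not an actual corpus tree) makes these labels concrete.

```python
# Illustrative Penn-Treebank-style tree showing SBARQ, WHNP-1, SQ, NP-SBJ, and the
# co-indexed *T*-1 trace (hand-written, not taken from the corpus).
from nltk import Tree

question = Tree.fromstring("""
(SBARQ
  (WHNP-1 (WP What))
  (SQ (VBZ is)
      (NP-SBJ (NNP Tim))
      (VP (VBG eating)
          (NP (-NONE- *T*-1))))
  (. ?))
""")
question.pretty_print()
```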
Parsing Algorithms
• Given an input sentence, a parser produces an output analysis of that sentence.
• Treebank parsers do not need an explicit grammar, but to make the discussion of parsing
algorithms simpler, we use a CFG.
• Consider a simple CFG G that can be used to derive strings such as "a and b or c" from the
start symbol N.
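A minimal NLTK sketch of such a grammar, reconstructed from the description above (the exact rules in the slides may differ); it produces two parses for "a and b or c".

```python
# Ambiguous toy CFG deriving strings like "a and b or c" from start symbol N.
import nltk
from nltk import CFG

G = CFG.fromstring("""
N -> N 'and' N
N -> N 'or' N
N -> 'a' | 'b' | 'c'
""")

parser = nltk.ChartParser(G)
for tree in parser.parse("a and b or c".split()):
    print(tree)   # two trees: (a and b) or c   vs   a and (b or c)
```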
• Here we want to provide a model that matches the intuition that the second tree above is
preferred over the first.
• The parses can be thought of as ambiguous (leftmost to rightmost) derivations of the following
CFG:
• We assign scores or probabilities to the rules in the CFG in order to provide a score or probability
for each derivation.
• Given these rule probabilities, the only deciding factor in choosing between the two
parses of "John bought a shirt with pockets" is the relative probability of the two rules
NP -> NP PP and VP -> VP PP; the probability for NP -> NP PP is set higher in the preceding PCFG.
• The rule probabilities can be derived from a treebank. Consider a treebank with three
trees t1, t2, t3.
• If we assume that tree t1 occurred 10 times in the treebank, t2 occurred 20 times, and t3
occurred 50 times, then the PCFG we obtain from this treebank is:
• For the input "a a a" there are two parses using the above PCFG, with probabilities
P1 = 0.125 × 0.334 × 0.285 ≈ 0.0119 and P2 = 0.25 × 0.667 × 0.714 ≈ 0.119.
• The parse tree p2 is the most likely tree for that input.
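A minimal sketch of how these derivation probabilities are computed as products of rule probabilities; only the factors quoted above are used, since the treebank-derived rules themselves are not reproduced here.

```python
# Each parse's probability is the product of the probabilities of the rules it uses.
import math

p1_rule_probs = [0.125, 0.334, 0.285]   # rule probabilities used by parse tree p1
p2_rule_probs = [0.25, 0.667, 0.714]    # rule probabilities used by parse tree p2

P1 = math.prod(p1_rule_probs)           # ~0.0119
P2 = math.prod(p2_rule_probs)           # ~0.119

best = "p2" if P2 > P1 else "p1"
print(f"P1 = {P1:.4f}, P2 = {P2:.4f}, most likely parse: {best}")
```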
Generative models
• To find the most plausible parse tree, the parser has to choose between the possible
derivations, each of which can be represented as a sequence of decisions.
• Let each derivation be D = d1, d2, …, dn, the sequence of decisions used to build the
parse tree.
• Then for input sentence x, the output parse tree y is defined by the sequence of steps in the
derivation.
• The probability for each derivation is P(D) = ∏ i=1..n P(di | d1, …, di-1).
• The conditioning context in the probability P(di | d1, …, di-1) is called the history and
corresponds to a partially built parse tree (as defined by the derivation sequence so far).
• We make a simplifying assumption that keeps the conditioning context to a finite set by
grouping the histories into equivalence classes using a function Φ, so that
P(di | d1, …, di-1) ≈ P(di | Φ(d1, …, di-1)).
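A minimal sketch of this history-based model with equivalence classes; the function phi and the probability table are illustrative assumptions, since in a real parser P(d | Φ(history)) would be estimated from a treebank.

```python
# History-based derivation probability with equivalence classing of histories.
def phi(history, k=2):
    """Map an unbounded history d_1..d_{i-1} to a finite equivalence class:
    here, simply the last k decisions."""
    return tuple(history[-k:])

# Hypothetical conditional probabilities P(d_i | phi(d_1..d_{i-1}))
prob = {
    ((), "S->NP VP"): 1.0,
    (("S->NP VP",), "NP->John"): 0.4,
    (("S->NP VP", "NP->John"), "VP->V NP"): 0.7,
}

def derivation_probability(decisions):
    p, history = 1.0, []
    for d in decisions:
        p *= prob.get((phi(history), d), 1e-6)   # back off to a tiny probability if unseen
        history.append(d)
    return p

print(derivation_probability(["S->NP VP", "NP->John", "VP->V NP"]))   # 1.0 * 0.4 * 0.7 = 0.28
```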