
Natural Language Processing

Unit-II
Syntax Analysis:
2.1 Parsing Natural Language
2.2 Treebanks: A Data-Driven Approach to Syntax
2.3 Representation of Syntactic Structure
2.4 Parsing Algorithms
2.5 Models for Ambiguity Resolution in Parsing, Multilingual Issues
 Parsing in NLP is the process of determining the syntactic structure of a text by analysing its constituent words based on an underlying grammar.
Example Grammar:
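A minimal set of grammar rules consistent with the example discussed below (the exact rules on the original slide may differ):

sentence -> noun_phrase verb_phrase
noun_phrase -> 'Tom' | determiner noun
verb_phrase -> verb noun_phrase
determiner -> 'an'
verb -> 'ate'
noun -> 'apple'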

• Then, the outcome of the parsing process would be a parse tree, where sentence is the root; intermediate nodes such as noun_phrase, verb_phrase, etc. have children and hence are called non-terminals; and the leaves of the tree 'Tom', 'ate', 'an', 'apple' are called terminals.
Parse Tree:
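As a sketch, this parse tree can be reproduced with NLTK's chart parser, assuming the nltk package is available (the grammar simply restates the rules above):

import nltk

grammar = nltk.CFG.fromstring("""
    sentence -> noun_phrase verb_phrase
    noun_phrase -> 'Tom' | determiner noun
    verb_phrase -> verb noun_phrase
    determiner -> 'an'
    verb -> 'ate'
    noun -> 'apple'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("Tom ate an apple".split()):
    tree.pretty_print()   # sentence is the root; 'Tom', 'ate', 'an', 'apple' are the leaves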
 A treebank can be defined as a linguistically annotated corpus that includes some kind of
syntactic analysis over and above part-of-speech tagging.
 A sentence is parsed by relating each word to other words in the sentence which depend on it.
 The syntactic parsing of a sentence consists of finding the correct syntactic structure of that
sentence in the given formalism/grammar.
 Dependency grammar (DG) and phrase structure grammar (PSG) are two such formalisms.
 PSG breaks a sentence into constituents (phrases), which are then broken into smaller constituents.
 PSG describes phrase and clause structure, for example NP, PP, VP, etc.
 DG: syntactic structure consists of lexical items, linked by binary asymmetric relations called dependencies.
 DG is interested in grammatical relations between individual words.
 DG does not propose a recursive structure; rather, it proposes a network of relations.
 These relations can also have labels.
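For illustration, the dependencies for Tom ate an apple could be written as labelled head-to-dependent relations (the label names follow common conventions and are illustrative):

root(ROOT, ate)
nsubj(ate, Tom)
obj(ate, apple)
det(apple, an)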
Constituency tree vs Dependency tree
 Dependency structures explicitly represent
- Head-dependent relations (directed arcs)
- Functional categories (arc labels)
- Possibly some structural categories (POS)
 Phrase structure trees explicitly represent
- Phrases (non-terminal nodes)
- Structural categories (non-terminal labels)
- Possibly some functional categories (grammatical functions)
Defining candidate dependency trees for an input sentence
 Learning: scoring possible dependency graphs for a given sentence, usually by factoring the
graphs into their component arcs
 Parsing: searching for the highest scoring graph for a given sentence
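A minimal sketch of arc-factored scoring, assuming a graph is represented as a set of (head, dependent) arcs and some learned arc_score function (all names are illustrative):

def graph_score(arcs, arc_score):
    # Arc-factored model: the score of a dependency graph is the
    # sum of the scores of its individual (head, dependent) arcs.
    return sum(arc_score(head, dep) for head, dep in arcs)

# Parsing then searches for the highest-scoring candidate graph:
# best = max(candidate_graphs, key=lambda g: graph_score(g, arc_score))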
Syntax
 In NLP, the syntactic analysis of natural language input can vary from very low-level, such as simply tagging each word in the sentence with a part of speech (POS), to very high-level, such as full parsing.
 In syntactic parsing, ambiguity is a particularly difficult problem because the most plausible analysis has to be chosen from an exponentially large number of alternative analyses.
 From tagging to full parsing, algorithms that can handle such ambiguity have to be carefully chosen.
 Here we explore syntactic analysis methods from tagging to full parsing and the use of supervised machine learning to deal with ambiguity.
2.1 Parsing Natural Language
 In a text-to-speech application, input sentences are to be converted to a spoken output that
should sound like it was spoken by a native speaker of the language.
 Example: He wanted to go for a drive in the country.
 There is a natural pause between the words drive and in in this sentence that reflects an underlying hidden structure to the sentence.
 Parsing can provide a structural description that identifies such a break in the intonation.
 A simpler case: The cat who lives dangerously had nine lives.
 In this case, a text-to-speech system needs to know that the first instance of the word lives is
a verb and the second instance is a noun before it can begin to produce the natural
intonation for this sentence.
 This is an instance of the part-of-speech (POS) tagging problem, where each word in the sentence is assigned its most likely part of speech.
 Another motivation for parsing comes from the natural language task of summarization, in
which several documents about the same topic should be condensed down to a small digest
of information.
 Such a summary may be in response to a question that is answered in the set of documents.
 In this case, a useful subtask is to compress an individual sentence so that only the relevant portions of the sentence are included in the summary.
 For example:
Beyond the basic level, the operations of the three products vary widely.
The operations of the products vary.
 An elegant way to approach this task is to first parse the sentence to find the various constituents, recursively partitioning the words in the sentence into individual phrases such as a verb phrase or a noun phrase.
 The output of the parser for the input sentence is shown in Fig.

 Another example is paraphrasing.


 In the sentence fragment below, the capitalized phrase EUROPEAN COUNTRIES can be replaced with other phrases without changing the essential meaning of the sentence.
 A few examples of replacement phrases are shown in the sentence fragments that follow.
Open borders imply increasing racial fragmentation in EUROPEAN COUNTRIES.

Open borders imply increasing racial fragmentation in the countries of Europe.
Open borders imply increasing racial fragmentation in European states.
Open borders imply increasing racial fragmentation in Europe.
Open borders imply increasing racial fragmentation in European nations.
Open borders imply increasing racial fragmentation in European countries.
• In contemporary NLP, syntactic parsers are routinely used in many applications, including but not limited to statistical machine translation, information extraction from text collections, language summarization, producing entity grids for language generation, and error correction in text.

2.2 Treebanks: A Data-Driven Approach to Syntax
 Parsing recovers information that is not explicit in the input sentence.
 This implies that a parser requires some knowledge (syntactic rules), in addition to the input sentence, about the kind of syntactic analysis that should be produced as output.
 One method to provide such knowledge to the parser is to write down a grammar of the language – a set of rules of syntactic analysis, for example as a CFG.
 Natural language is far too complex for us to simply list all the syntactic rules in terms of a CFG.
 There is also a second knowledge acquisition problem: not only do we need to know the syntactic rules for a particular language, but we also need to know which analysis is the most plausible for a given input sentence.
 The construction of a treebank is a data-driven approach to syntax analysis that allows us to address both of these knowledge acquisition bottlenecks in one stroke.
 A treebank is simply a collection of sentences (also called a corpus of text), where each
sentence is provided a complete syntax analysis.
 The syntactic analysis for each sentence has been judged by a human expert as the most plausible analysis for that sentence.
 A lot of care is taken during the human annotation process to ensure that a consistent
treatment is provided across the treebank for related grammatical phenomena.
 There is no set of syntactic rules or linguistic grammar explicitly provided by a treebank, and
typically there is no list of syntactic constructions provided explicitly in a treebank.
 A detailed set of assumptions about the syntax is typically used as an annotation guideline
to help the human experts produce the single-most plausible syntactic analysis for each
sentence in the corpus.
 Treebanks provide a solution to the two kinds of knowledge acquisition bottlenecks.
 Treebanks solve the first knowledge acquisition problem of finding the grammar underlying
the syntax analysis because the syntactic analysis is directly given instead of a grammar.
 In fact, the parser does not necessarily need any explicit grammar rules as long as it can
faithfully produce a syntax analysis for an input sentence.
 Treebanks solve the second knowledge acquisition problem as well.
 Because each sentence in a treebank has been given its most plausible syntactic analysis, supervised machine learning methods can be used to learn a scoring function over all possible syntax analyses.
 Two main approaches to syntax analysis are used to construct treebanks: dependency graphs and phrase structure trees.
 These two representations are very closely related to each other, and under some assumptions one representation can be converted to the other.
 Dependency analysis is typically favoured for languages, such as Czech and Turkish, that have free word order.
 Phrase structure analysis is often used to provide additional information about long-distance dependencies, and is mostly used for languages like English and French.
• NLP is the capability of computer software to understand natural language.
• There are a variety of languages in the world.
• Each language has its own structure (SVO or SOV), called its grammar, which has a certain set of rules that determine what is allowed and what is not allowed.
• English follows S V O order: I eat mango. Other languages follow S O V or O S V.
• Grammar is defined as the rules for forming well-structured sentences.
• Different types of grammar in NLP:
1. Context-Free Grammar (CFG)
2. Constituency Grammar (CG) or Phrase Structure Grammar
3. Dependency Grammar (DG)
Context-Free Grammar (CFG)
• Mathematically, a grammar G can be written as a 4-tuple (N, T, S, P)
• N or VN = set of non-terminal symbols, or variables.
• T or ∑ = set of terminal symbols.
• S = Start symbol where S ∈ N
• P = Production rules for Terminals as well as Non-terminals.
• Each production has the form α → β, where α and β are strings over VN ∪ ∑, and at least one symbol of α belongs to VN.
• Example: John hit the ball
S -> NP VP
VP -> V NP
V -> hit
NP -> D N | N
D -> the
N -> John | ball
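As a sketch, this grammar can be written down directly as the 4-tuple G = (N, T, S, P) in Python (a plain data representation, not a parser):

N = {"S", "NP", "VP", "V", "D", "N"}     # non-terminal symbols
T = {"John", "hit", "the", "ball"}       # terminal symbols
S = "S"                                  # start symbol
P = {                                    # production rules
    "S":  [("NP", "VP")],
    "VP": [("V", "NP")],
    "V":  [("hit",)],
    "NP": [("D", "N"), ("N",)],
    "D":  [("the",)],
    "N":  [("John",), ("ball",)],
}
G = (N, T, S, P)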
2.3 Representation of Syntactic Structure
2.3.1 Syntax Analysis Using Dependency Graphs
 The main philosophy behind dependency graphs is to connect a word (the head of a phrase) with the dependents in that phrase.
 The notation connects a head with its dependent using a directed (asymmetric) connection.
 Dependency graphs, just like phrase structure trees, are a representation that is consistent with many different linguistic frameworks.
 The words in the input sentence are treated as the only vertices in the graph, which are linked
together by directed arcs representing syntactic dependencies.
 In dependency-based syntactic parsing, the task is to derive a syntactic structure for an input sentence by identifying the syntactic head of each word in the sentence.
 This defines a dependency graph, where the nodes are the words of the input sentence and
arcs are the binary relations from head to dependent.
 In a dependency tree analysis, each word depends on exactly one parent: either another word or a dummy root symbol.
 By convention, in a dependency tree the index 0 is used to indicate the root symbol, and the directed arcs are drawn from the head word to the dependent word.
 The Fig. shows a dependency tree for a Czech sentence taken from the Prague Dependency Treebank.
 Each node in the graph shows a word, its part of speech, and the position of the word in the sentence.
 For example, [fakulte, N3, 7] is the seventh word in the sentence, with POS tag N3.
 The node [#, ZSB,0] is the root node of the dependency tree.
 There are many variations of dependency syntactic analysis, but the basic textual format for a dependency tree can be written in the following form, where each dependent word specifies the head word in the sentence, and exactly one word is dependent on the root of the sentence.
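For example, using Tom ate an apple again, such a textual format might look like this (one line per word: position, word, head position; head 0 marks the root):

1 Tom    2
2 ate    0
3 an     4
4 apple  2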
 An important notion in dependency analysis is that of projectivity, which is a constraint imposed by the linear order of words on the dependencies between words.
 A projective dependency tree is one where if we put the words in a linear order based on the
sentence with the root symbol in the first position, the dependency arcs can be drawn above
the words without any crossing dependencies.
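A minimal sketch of a projectivity check under this definition, assuming the tree is given as a list of head indices (word i+1 has head heads[i]; 0 denotes the root):

def is_projective(heads):
    # Represent each arc as an interval (left position, right position).
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for l1, r1 in arcs:
        for l2, r2 in arcs:
            if l1 < l2 < r1 < r2:   # arcs (l1, r1) and (l2, r2) cross
                return False
    return True

print(is_projective([2, 0, 4, 2]))  # True: "Tom ate an apple" is projective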
2.3.2 Syntax Analysis Using Phrase Structure Trees
 A phrase structure syntax analysis of a sentence derives from the traditional sentence diagrams that partition a sentence into constituents; larger constituents are formed by merging smaller ones.
 Phrase structure analyses also typically incorporate ideas from generative grammar (from linguistics) to deal with displaced constituents or apparent long-distance relationships between heads and constituents.
 A phrase structure tree can be viewed as implicitly having a predicate-argument structure
associated with it.
 A sentence includes a subject and a predicate. The subject is a noun phrase (NP) and the predicate is a verb phrase (VP).
 For example, consider the phrase structure analysis of Mr. Baker seems especially sensitive, taken from the Penn Treebank.
 The subject of the sentence is marked with the SBJ marker and the predicate of the sentence is marked with the PRD marker.
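The Penn Treebank-style bracketing for this example looks roughly as follows (reconstructed; the original figure may differ in detail):

(S (NP-SBJ (NNP Mr.) (NNP Baker))
   (VP (VBZ seems)
       (ADJP-PRD (RB especially) (JJ sensitive))))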

 NNP: proper noun, singular; VBZ: verb, third person singular present; ADJP: adjective phrase; RB: adverb; JJ: adjective
 The same sentence gets the following dependency tree analysis: some of the information
from the bracketing labels from the phrase structure analysis gets mapped onto the labelled
arcs of the dependency analysis.

 We now explain some details of phrase structure analysis in the Penn Treebank, a project that annotated about 40,000 sentences from the Wall Street Journal with phrase structure trees.
 The SBARQ label marks wh-questions, i.e., those that contain a gap and therefore require a trace.
• Wh-moved noun phrases are labeled WHNP and put inside SBARQ. They bear an identity index that matches the reference index on the *T* in the position of the gap.
• However, questions that are missing both subject and auxiliary are labeled SQ.
• NP-SBJ marks noun phrases that can be subjects.
• *T* marks traces for wh-movement; the empty trace has an index (here it is 1) and is associated with the WHNP constituent bearing the same index.
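A schematic illustration of these annotations, following Penn Treebank conventions (an illustrative example, not taken from the original slides):

(SBARQ (WHNP-1 (WP What))
       (SQ (VBZ is)
           (NP-SBJ (NNP Tim))
           (VP (VBG eating)
               (NP (-NONE- *T*-1))))
       (. ?))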
2.4 Parsing Algorithms
• Given an input sentence, a parser produces an output analysis of that sentence.
• Treebank parsers do not need an explicit grammar, but to make the discussion of parsing algorithms simpler, we use a CFG.
• Consider a simple CFG G that can be used to derive strings such as a and b or c from the start symbol N.
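The grammar itself is not reproduced in this text version, but its rules can be reconstructed from the reduce actions in the shift-reduce trace below:

N -> N and N
N -> N or N
N -> a
N -> b
N -> c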

• An important concept for parsing is a derivation.


• For the input string a and b or c, the following sequence of sentential forms, each step separated by the ⇒ symbol, is called a derivation.
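Reconstructed from the parse built in the trace below, one such rightmost derivation is:

N
⇒ N or N
⇒ N or c
⇒ N and N or c
⇒ N and b or c
⇒ a and b or c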
• In this derivation, each line is called a sentential form.
• In the above derivation, we restricted ourselves to expanding only the rightmost nonterminal in each sentential form.
• This method is called the rightmost derivation of the input using a CFG.
• This derivation sequence exactly corresponds to the construction of the following parse tree
from left to right, one symbol at a time.

• However, a unique derivation sequence is not guaranteed.


• There can be many different derivations.
• For example, another rightmost derivation results in the following parse tree.
Shift-Reduce Parsing
• To build a parser, we need an algorithm that can perform the steps in the above rightmost
derivation for any grammar and for any input string.
• Every CFG turns out to have an automaton that is equivalent to it, called a pushdown automaton (just as a regular expression can be converted to a finite-state automaton).
• This gives an algorithm for parsing that is general for any given CFG and input string.
• The algorithm is called shift-reduce parsing, and it uses two data structures: a buffer for input symbols and a stack for storing CFG symbols.
S.No | Parse Tree                       | Stack   | Input        | Action
1    |                                  |         | a and b or c | Init
2    | a                                | a       | and b or c   | Shift a
3    | (N a)                            | N       | and b or c   | Reduce N -> a
4    | (N a) and                        | N and   | b or c       | Shift and
5    | (N a) and b                      | N and b | or c         | Shift b
6    | (N a) and (N b)                  | N and N | or c         | Reduce N -> b
7    | (N (N a) and (N b))              | N       | or c         | Reduce N -> N and N
8    | (N (N a) and (N b)) or           | N or    | c            | Shift or
9    | (N (N a) and (N b)) or c         | N or c  |              | Shift c
10   | (N (N a) and (N b)) or (N c)     | N or N  |              | Reduce N -> c
11   | (N (N (N a) and (N b)) or (N c)) | N       |              | Reduce N -> N or N
12   | (N (N (N a) and (N b)) or (N c)) | N       |              | Accept
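A minimal sketch of this shift-reduce loop for the toy grammar above (it reduces greedily whenever a rule matches the top of the stack, which happens to reproduce the trace; a general parser would need a strategy for resolving shift/reduce conflicts):

RULES = [
    ("N", ["N", "and", "N"]),
    ("N", ["N", "or", "N"]),
    ("N", ["a"]),
    ("N", ["b"]),
    ("N", ["c"]),
]

def shift_reduce(tokens, rules=RULES, start="N"):
    stack, buffer = [], list(tokens)
    while buffer or stack != [start]:
        for lhs, rhs in rules:
            if stack[-len(rhs):] == rhs:   # top of stack matches a rule's RHS
                del stack[-len(rhs):]
                stack.append(lhs)          # reduce
                break
        else:
            if not buffer:
                return False               # stuck: cannot reduce or shift
            stack.append(buffer.pop(0))    # shift
    return True                            # accept

print(shift_reduce("a and b or c".split()))  # True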
Hypergraphs and Chart Parsing (CYK Parsing)
• For general CFGs, in the worst case such a parser might have to resort to backtracking, which means re-parsing the input, leading to a running time that is exponential in the worst case.
• Variants of the CYK algorithm are often used in statistical parsers that attempt to search the space of possible parse trees without the limitation of purely left-to-right parsing.
• One of the earliest recognition parsing algorithms is the CYK (Cocke, Younger, Kasami) parsing algorithm, and it works only with grammars in Chomsky normal form (CNF).
CYK example:
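The worked example from the slide is not reproduced here; as a sketch, a CYK recognizer for a grammar in CNF can be written as follows (the grammar encoding and names are illustrative):

from itertools import product

def cyk(tokens, unary, binary, start="S"):
    # chart[i][j] holds the nonterminals that derive tokens[i..j]
    n = len(tokens)
    chart = [[set() for _ in range(n)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        chart[i][i] = set(unary.get(tok, ()))   # A -> terminal rules
    for span in range(2, n + 1):                # span widths 2..n
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):               # split point
                for B, C in product(chart[i][k], chart[k + 1][j]):
                    chart[i][j] |= binary.get((B, C), set())  # A -> B C rules
    return start in chart[0][n - 1]

# CNF version of the earlier toy grammar for "John hit the ball":
unary = {"John": {"NP"}, "hit": {"V"}, "the": {"D"}, "ball": {"N"}}
binary = {("NP", "VP"): {"S"}, ("V", "NP"): {"VP"}, ("D", "N"): {"NP"}}
print(cyk("John hit the ball".split(), unary, binary))  # True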
2.5 Models for Ambiguity Resolution in Parsing
• Here we discuss the modelling aspects of parsing: how to design features and ways to resolve ambiguity in parsing.
Probabilistic context-free grammar
• Ex: John bought a shirt with pockets

• Here we want to provide a model that matches the intuition that the second tree above is
preferred over the first.
• The two parses can be thought of as ambiguous (leftmost to rightmost) derivations of the following CFG:

• We assign scores or probabilities to the rules in the CFG in order to provide a score or probability for each derivation.
• Given these rule probabilities, the only deciding factor for choosing between the two parses of John bought a shirt with pockets is the two rules NP -> NP PP and VP -> VP PP. The probability for NP -> NP PP is set higher in the preceding PCFG.
• The rule probabilities can be derived from a treebank. Consider a treebank with three trees t1, t2, t3.
• If we assume that tree t1 occurred 10 times in the treebank, t2 occurred 20 times, and t3 occurred 50 times, then the PCFG we obtain from this treebank is:
• For the input a a a there are two parses using the above PCFG, with probabilities P1 = 0.125 × 0.334 × 0.285 = 0.0119 and P2 = 0.25 × 0.667 × 0.714 = 0.119.
 The parse tree with probability P2 is the most likely tree for that input.
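Each derivation probability is just the product of the probabilities of the rules used in that parse, as a quick check reproduces:

from math import prod

p1 = prod([0.125, 0.334, 0.285])   # ≈ 0.0119
p2 = prod([0.25, 0.667, 0.714])    # ≈ 0.119, the preferred parse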

Generative models
• To find the most plausible parse tree, the parser has to choose between the possible derivations, each of which can be represented as a sequence of decisions.
• Let each derivation be D = d1, d2, ..., dn, the sequence of decisions used to build the parse tree.
• Then for an input sentence x, the output parse tree y is defined by the sequence of steps in the derivation.
• The probability of each derivation is the product of the probabilities of its decisions:
P(D) = P(d1) × P(d2 | d1) × ... × P(dn | d1, ..., dn-1)
• The conditioning context in each probability P(di | d1, ..., di-1) is called the history and corresponds to a partially built parse tree (as defined by the derivation sequence).
• We make a simplifying assumption that keeps the conditioning context to a finite set by grouping the histories into equivalence classes using a function Φ, so that P(di | d1, ..., di-1) ≈ P(di | Φ(d1, ..., di-1)).

Discriminative models for Parsing


• Collins created a simple notation and framework that describes various discriminative approaches to learning for parsing.
• This framework is called a global linear model.
• Let X be the set of inputs and Y the set of possible outputs, which can be sequences of POS tags, parse trees, or dependency analyses.
• Each pair x ∈ X, y ∈ Y is mapped to a d-dimensional feature vector Φ(x, y), with each dimension being a real number.
• A weight parameter vector w ∈ R^d assigns a weight to each feature in Φ(x, y), representing the importance of that feature.
• The value of Φ(x, y) · w is the score of (x, y). The higher the score, the more plausible it is that y is the output for x.
• The function GEN(x) generates the set of possible outputs y for a given x.
• Having Φ(x, y) · w and GEN(x) specified, we would like to choose the highest-scoring candidate y* from GEN(x) as the most plausible output:
F(x) = y* = argmax_{y ∈ GEN(x)} Φ(x, y) · w
where F(x) returns the highest-scoring output y* from GEN(x).


• A conditional random field (CRF) defines the conditional probability as a linear score for each candidate y and a global normalization term:
P(y | x) = exp(Φ(x, y) · w) / Σ_{y' ∈ GEN(x)} exp(Φ(x, y') · w)
• A simple linear model that ignores the normalization term is:
F(x) = argmax_{y ∈ GEN(x)} Φ(x, y) · w
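A minimal sketch of decoding with such a global linear model, assuming GEN and the feature map phi are given (all names are illustrative):

def decode(x, GEN, phi, w):
    # F(x) = argmax over y in GEN(x) of the inner product phi(x, y) . w
    def score(y):
        return sum(wi * fi for wi, fi in zip(w, phi(x, y)))
    return max(GEN(x), key=score)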


• There are two general approaches to parsing:
1. Top-down parsing (starts with the start symbol)
2. Bottom-up parsing (starts from the terminals)
