Unit 2 new

This unit covers parsing in natural language processing (NLP): its definition, role, and challenges; the two main paradigms, constituency and dependency parsing; treebanks such as the Penn Treebank and Universal Dependencies, which supply training and evaluation data for parsers; the main parsing algorithms and their complexities; and applications in NLP tasks such as grammar checking and machine translation.

1. Parsing Natural Language


• Definition: Parsing (syntax analysis) is the process of analyzing a sentence’s grammatical structure according to a formal grammar. A parser takes an input sentence and produces a structured representation (often a parse tree or similar) after checking its syntax. In NLP this identifies constituents (such as noun phrases) and relations, enabling higher-level understanding.
• Role in NLP: Syntax analysis is a core phase of NLP: after tokenization and POS tagging, parsing detects phrase structure and relationships. It checks that the sentence is well-formed and produces an output (e.g. a parse tree or dependency graph) used by downstream tasks such as semantic analysis and machine translation. Parsers also report syntax errors or ambiguities when grammar rules fail.
• Top-Down vs Bottom-Up Parsing: In top-down parsing (e.g. recursive descent, Earley), the parser begins with the start symbol of the grammar and tries to rewrite it to match the input. In bottom-up parsing (e.g. shift-reduce, CYK), the parser begins with the input tokens and incrementally builds sub-phrases until it reaches the start symbol. Top-down parsing is goal-driven (predicting structure), while bottom-up parsing is data-driven (combining adjacent tokens). Top-down methods may backtrack over grammar rules, whereas bottom-up methods “shift” tokens onto a stack and “reduce” them by grammar rules (see section 4). A small demo of both directions follows this list.
• Challenges: Natural language is highly ambiguous and irregular. A given sentence can often be parsed in multiple valid ways (syntactic ambiguity, e.g. PP-attachment or coordination ambiguities). Parsers must handle lexical ambiguity (words with multiple possible POS tags) and structural ambiguity (multiple parse trees). Context-sensitive constructs (such as long-distance dependencies) and incomplete or noisy input also pose challenges. Robust parsing often uses probabilistic grammar models (trained on treebanks) or heuristic disambiguation to choose the most likely parse.
• Constituency vs Dependency: There are two main paradigms of syntactic parsing: constituency (phrase-structure) parsing and dependency parsing. Constituency parsing identifies nested phrase constituents (NP, VP, etc.) and yields a hierarchical parse tree with nonterminal labels. Dependency parsing, in contrast, represents a sentence as a directed graph of binary head–dependent relations between words: each word (vertex) connects via a labeled edge (subject, object, modifier, etc.) to its governor, forming a tree over the words of the sentence. Both approaches capture grammatical structure but focus on different aspects: constituency parses emphasize phrase boundaries, while dependency parses emphasize word-to-word syntactic relations. A code sketch contrasting the two representations also follows this list.
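To illustrate the two search directions concretely, here is a minimal sketch assuming Python with NLTK installed; the toy grammar and sentence are illustrative assumptions, not part of the original notes. NLTK ships a simple recursive-descent (top-down) parser and a simple shift-reduce (bottom-up) parser that can both parse the same sentence:

import nltk

# A toy grammar with no left recursion, so both demo parsers can handle it.
grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> 'John' | 'Mary'
    VP -> V NP
    V  -> 'loves'
""")
sentence = "John loves Mary".split()

# Top-down: start from S and expand rules to match the input.
top_down = nltk.RecursiveDescentParser(grammar)
for tree in top_down.parse(sentence):
    print(tree)

# Bottom-up: shift words onto a stack and reduce them by grammar rules.
bottom_up = nltk.ShiftReduceParser(grammar)
for tree in bottom_up.parse(sentence):
    print(tree)

Both parsers print the same tree, (S (NP John) (VP (V loves) (NP Mary))); only the order in which they build it differs.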
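To make the constituency/dependency contrast concrete, here is a minimal sketch (again assuming Python with NLTK) that builds a constituency tree for “John loves Mary” from bracketed notation and lists the same sentence’s dependency analysis as head–relation–dependent triples; the UD-style labels are illustrative rather than gold annotation.

import nltk

# Constituency view: nested phrases with nonterminal labels.
constituency = nltk.Tree.fromstring(
    "(S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))))"
)
constituency.pretty_print()      # draws the hierarchical tree
print(constituency.leaves())     # ['John', 'loves', 'Mary']

# Dependency view: one (head, relation, dependent) triple per word.
dependencies = [
    ("loves", "nsubj", "John"),  # John is the subject of loves
    ("loves", "obj", "Mary"),    # Mary is the object of loves
    ("ROOT", "root", "loves"),   # loves is the root of the sentence
]
for head, relation, dependent in dependencies:
    print(f"{relation}({head}, {dependent})")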
2. Treebanks
• What is a Treebank: A treebank is a corpus of sentences, each annotated with its syntactic structure. In linguistics, a treebank is essentially a parsed text corpus, where each sentence is paired with its parse (constituency or dependency). Treebanks are typically hand-annotated (or semi-automatically annotated and then corrected) by linguists. Constructing a treebank is labor-intensive, but it provides “gold-standard” data for empirical NLP.
• Penn Treebank: The Penn Treebank (PTB) is a landmark English treebank of news text (Wall Street Journal). It contains about one million words of 1989 WSJ text bracketed in Penn Treebank style (a phrase-structure annotation). In PTB II style, sentences are annotated with full constituent parse trees and POS tags. For example, PTB’s bracketed notation for “John loves Mary” is (S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))) (. .)). PTB also includes predicate-argument and disfluency annotations (from earlier phases), but it is best known as a standard dataset for training and evaluating constituency parsers.
• Universal Dependencies (UD): Universal Dependencies is a cross-linguistic dependency treebank framework. UD provides consistent annotation of parts of speech, morphological features, and syntactic dependencies across many languages. It is an open community project (200+ treebanks, 150+ languages) aiming for uniform annotation guidelines. In UD, each sentence is parsed into a dependency graph (usually a tree) linking words by grammatical relations (e.g. nsubj(loves, John)). UD treebanks are widely used to train dependency parsers for multilingual NLP.
• Training Parsers: Treebanks are used to train and evaluate statistical or neural parsers. For constituency parsing, PTB (and similar phrase-structure treebanks) provides training examples from which a parser can learn grammar rules or rule probabilities. For dependency parsing, UD corpora serve as gold data. A parser (e.g. a probabilistic CFG or a neural parser) can be learned to reproduce the annotated parses. In fact, state-of-the-art NLP components (POS taggers, parsers, semantic analyzers, machine translation) often rely on treebank annotations. Having a large, high-quality treebank lets NLP systems learn grammars and patterns automatically, greatly improving accuracy compared to hand-crafted grammars. A small sketch of inducing a PCFG from treebank-style trees follows this list.
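As a sketch of how treebank data drives parser training (assuming Python with NLTK), the snippet below reads two PTB-style bracketed trees that stand in for a real treebank, counts their productions, and induces a small PCFG that a probabilistic parser can then use; a real setup would read thousands of trees from the Penn Treebank instead.

from nltk import Nonterminal, Tree, ViterbiParser, induce_pcfg

# Toy "treebank": PTB-style bracketed parses (stand-ins for real WSJ trees).
treebank = [
    Tree.fromstring("(S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))))"),
    Tree.fromstring("(S (NP (NNP Mary)) (VP (VBZ sees) (NP (NNP John))))"),
]

# Collect every CFG production observed in the annotated parses.
productions = []
for tree in treebank:
    productions += tree.productions()

# Estimate rule probabilities from treebank frequencies, giving a PCFG.
grammar = induce_pcfg(Nonterminal("S"), productions)
print(grammar)

# The induced grammar can now drive a probabilistic parser.
parser = ViterbiParser(grammar)
for parse in parser.parse("John loves Mary".split()):
    print(parse)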
3. Representation of Syntactic Structure
Figure: Example constituency parse tree for “John hit the ball.” The root node S splits into NP and VP, with further children down to the words.
A constituency parse tree (phrase-structure tree) represents the hierarchical phrase structure of a sentence. Each internal node is a nonterminal category (S, NP, VP, Det, N, etc.), and the leaves are the actual words (terminals). The tree above shows S → NP VP, NP → John, VP → V NP, etc., with part-of-speech and phrase labels. Such a tree corresponds to a context-free derivation of the sentence. In bracketed notation, the same structure can be written as (S (NP John) (VP (V hit) (NP (Det the) (N ball)))), making the hierarchy explicit. Constituency trees make it easy to see phrasal spans and sub-phrases (e.g. which words form the noun phrase). They are typically generated by a context-free grammar: for example, a simple CFG might include rules like S → NP VP, NP → Det N, VP → V NP, etc. (A context-free grammar (CFG) is a set of productions with a single nonterminal on the left-hand side, allowing the modular “block structure” of sentences.)
• CFG Example: A sample CFG for English might include:
o S → NP VP
o NP → Det N | N
o VP → V NP | V
o Det → “the” | “a”; N → “ball” | “dog” | ...; V → “hit” | “chased” | ...
Such rules generate the above parse tree by expanding S into NP (“John”) and VP (“hit the ball”). The simplicity of CFGs makes them suitable for capturing phrase structure and generating constituency trees. A runnable version of this grammar appears just below.
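The following is a runnable sketch of this toy grammar, assuming Python with NLTK; NP is given an extra lexical option for “John” (an addition to the rules listed above) so the example sentence is covered.

import nltk

# The toy CFG above, in NLTK's grammar syntax (terminals are quoted).
# NP -> 'John' is added so the example sentence can be parsed.
grammar = nltk.CFG.fromstring("""
    S   -> NP VP
    NP  -> Det N | 'John'
    VP  -> V NP | V
    Det -> 'the' | 'a'
    N   -> 'ball' | 'dog'
    V   -> 'hit' | 'chased'
""")

parser = nltk.ChartParser(grammar)   # a dynamic-programming chart parser
for tree in parser.parse("John hit the ball".split()):
    print(tree)        # (S (NP John) (VP (V hit) (NP (Det the) (N ball))))
    tree.pretty_print()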
Figure: Example dependency parse (graph) with labeled arcs. Each word is a node; edges indicate syntactic relations (SBJ = subject, VC = copula, etc.).
A dependency graph represents syntax as word-to-word relations. Here each node is a word (with its POS), and directed arcs link governors (heads) to dependents, labeled with grammatical relations. For example, in the graph above for “A hearing is scheduled on the issue today,” the main verb “scheduled” is the root, with edges like nsubj(scheduled, hearing) and obl(scheduled, issue) (subject and oblique arguments). Dependency parses contain no phrasal nodes: all structure is captured by the directed tree over the words. Dependency edges encode who depends on whom: subject, object, modifiers, and so on. This form is popular in many NLP applications because it directly shows predicate–argument structure and is often simpler to work with programmatically. Universal Dependencies is one common scheme for such annotations. Unlike constituency trees, dependency graphs are typically drawn as flattened directed trees or graphs (sometimes with curved arcs, as above); they abstract away from phrase spans and emphasize binary word relations.
• Key Differences: Both representations are tree structures but with different emphases. A constituency tree explicitly shows nested phrases and uses nonterminal labels (e.g. “NP” nodes covering multi-word spans), while a dependency tree has only word nodes with labeled edges. Constituency parses make it easy to identify constituents (for example, all words under an NP node), whereas dependency parses highlight head–modifier relations. Notationally, constituency trees are often given as labeled brackets, while dependency parses can be encoded as directed graphs (or as head–relation–dependent triples); a small sketch of this encoding follows below. In practice, languages with rigid word order (like English) often use constituency trees, whereas free-word-order languages often use dependency annotation, but either style can be applied to any language.
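As a sketch of the triple encoding (plain Python, no external libraries), the rows below mimic the core columns of the CoNLL-U format used by UD treebanks (word index, form, head index, relation) for the earlier example sentence; the relation labels are illustrative rather than gold UD annotation.

# Core CoNLL-U-style columns: ID, FORM, HEAD, DEPREL (HEAD 0 marks the root).
# Relation labels are illustrative, not gold UD annotation.
rows = [
    (1, "A",         2, "det"),
    (2, "hearing",   4, "nsubj"),
    (3, "is",        4, "aux"),
    (4, "scheduled", 0, "root"),
    (5, "on",        7, "case"),
    (6, "the",       7, "det"),
    (7, "issue",     4, "obl"),
    (8, "today",     4, "obl"),
]

# Map word indices to forms so heads can be printed by name.
forms = {index: form for index, form, _, _ in rows}
forms[0] = "ROOT"

# Print the parse as relation(head, dependent) triples.
for index, form, head, relation in rows:
    print(f"{relation}({forms[head]}, {form})")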
4. Parsing Algorithms
• CYK Parser: The Cocke–Younger–Kasami (CYK) algorithm is a bottom-up chart parser for CFGs in Chomsky Normal Form (CNF). It uses dynamic programming to fill a triangular table of sub-spans. CYK checks all ways to split each span and apply grammar rules; in the worst case it takes $O(n^3 \cdot |G|)$ time (with $n$ the sentence length and $|G|$ the grammar size). This cubic complexity gives it a good worst-case guarantee for general CFG parsing. CYK always finds all valid parses (it can build a parse forest) but requires converting the grammar to CNF. In practice, pure CYK is less used in NLP (due to the CNF constraint), but it illustrates bottom-up chart parsing well; a minimal recognizer sketch appears after this list.
• Earley Parser: Earley’s algorithm is a top-down chart parser that can handle any CFG. It incrementally builds states (dotted rules) across input positions, using dynamic programming to avoid re-parsing. Earley parsing runs in $O(n^3)$ time in the general case, but is faster for simpler grammars ($O(n^2)$ for unambiguous grammars and $O(n)$ for deterministic grammars). It handles left-recursive rules gracefully and produces complete parse forests. Earley is widely cited in computational linguistics for full-sentence parsing because of its generality. (The Earley recognizer can be extended to produce actual parse trees, not just a yes/no recognition result.) Its chart-based approach shares sub-results and is amenable to probabilistic scoring as well.
• Shift-Reduce Parser: A shift-reduce parser (a type of bottom-up parser) reads the input left to right using a stack. At each step it either shifts the next input token onto the stack or reduces a sequence of stack items by a grammar rule. In compilers this underlies LR parsing; in NLP it underlies many deterministic dependency parsers (transition-based parsing). Shift-reduce parsing can be extremely efficient: without backtracking, its running time scales essentially linearly with sentence length (each token is shifted once). The Wikipedia analysis notes that a shift-reduce parser “has no backing up” and that its execution time is linear in the input size. (In practice, some lookahead or conflict resolution is used to decide between shifts and reduces; ambiguous grammars can force backtracking, which would increase the cost, but well-designed grammars and parsers avoid this.) A minimal transition sketch appears after this list.
• Chart Parsing (General): Chart parsers in NLP (such as CYK and Earley) use a “chart” (table) to record intermediate hypotheses, avoiding redundant work. The chart can be filled in top-down or bottom-up order. Because the chart records partial parses for spans, chart parsing easily handles ambiguity by keeping multiple possibilities. Time complexity for chart parsing is generally polynomial (often $O(n^3)$) in the sentence length and grammar size. Chart algorithms (Earley, CYK, and their variants) all exemplify dynamic-programming parsing.
• Complexity Summary: In summary, most general CFG parsers have $O(n^3)$ worst-case complexity, though practical performance can be better. CYK is always cubic, $O(n^3)$, for CNF grammars. Earley is $O(n^3)$ in the worst case but can be $O(n^2)$ or $O(n)$ for many practical grammars. Shift-reduce (LR-style) parsing is effectively linear-time for unambiguous grammars. In NLP, statistical parsers often trade a little efficiency for accuracy by scoring multiple parses (e.g. PCFG parsing with the CYK algorithm, or beam-search shift-reduce).
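To make the chart-based dynamic programming concrete, here is a minimal CYK recognizer over a toy CNF grammar (plain Python; the grammar, sentences, and function name are illustrative assumptions, not part of the original notes).

from itertools import product

# Toy CNF grammar: binary rules (B, C) -> A and lexical rules word -> {A}.
BINARY = {
    ("NP", "VP"): "S",
    ("Det", "N"): "NP",
    ("V", "NP"): "VP",
}
LEXICAL = {
    "John": {"NP"}, "hit": {"V"}, "the": {"Det"}, "ball": {"N"},
}

def cyk_recognize(words, start="S"):
    n = len(words)
    # table[i][j] holds the nonterminals that derive words[i..j] inclusive.
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        table[i][i] = set(LEXICAL.get(word, set()))
    for span in range(2, n + 1):              # span length
        for i in range(n - span + 1):         # span start
            j = i + span - 1                  # span end
            for k in range(i, j):             # split point
                for B, C in product(table[i][k], table[k + 1][j]):
                    A = BINARY.get((B, C))
                    if A:
                        table[i][j].add(A)
    return start in table[0][n - 1]           # does S cover the whole sentence?

print(cyk_recognize("John hit the ball".split()))   # True
print(cyk_recognize("the ball hit".split()))        # False (no S over the whole string)

The triple loop over span length, start position, and split point is exactly where the $O(n^3)$ term comes from; the inner rule lookup contributes the grammar-size factor.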
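And here is a minimal sketch of the shift/reduce mechanism as used by transition-based dependency parsers (arc-standard style, plain Python); a real parser would choose each action with a trained classifier, whereas the action sequence below is hard-coded for illustration, and the function name is hypothetical.

# Arc-standard transitions: SHIFT moves the next word onto the stack;
# LEFT/RIGHT reduce the top two stack items into a head -> dependent arc.
def arc_standard(words, actions):
    stack, buffer, arcs = [], list(words), []
    for action in actions:
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "LEFT":            # top of stack is head of second-from-top
            dependent = stack.pop(-2)
            arcs.append((stack[-1], dependent))
        elif action == "RIGHT":           # second-from-top is head of top
            dependent = stack.pop()
            arcs.append((stack[-1], dependent))
    return arcs

# Each token is shifted exactly once, so the number of actions (and hence the
# running time) grows linearly with sentence length.
actions = ["SHIFT", "SHIFT", "LEFT", "SHIFT", "RIGHT"]
print(arc_standard(["John", "loves", "Mary"], actions))
# [('loves', 'John'), ('loves', 'Mary')]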
Applications: Accurate syntactic parsing is a building block for many NLP tasks. Constituency parses are used in grammar checking, question answering (to find noun–verb relations), and syntax-based translation. Dependency parses are widely used in relation extraction, semantic role labeling, and as features in machine translation (dependency structure often aligns better across languages). In practice, modern NLP systems typically use probabilistic or neural parsers trained on treebanks, outputting either parse trees or dependency graphs as needed. The annotated structures from treebanks have improved tools such as parsers, taggers, and MT systems, and parsing remains a fundamental component of advanced NLP pipelines.