Unit 2: Parsing in Natural Language Processing

This unit discusses parsing in natural language processing (NLP): its definition, role, challenges, and the two main paradigms, constituency and dependency parsing. It also covers treebanks, including the Penn Treebank and Universal Dependencies, which are essential for training parsers, and outlines the main parsing algorithms, their complexities, and their applications in NLP tasks such as grammar checking and machine translation.
1. Parsing Natural Language
Definition: Parsing (syntax analysis) is the process of analyzing a sentence's grammatical structure according to a formal grammar. A parser takes an input sentence and, after checking its syntax, produces a structured representation (typically a parse tree or a dependency graph). In NLP this identifies constituents (such as noun phrases) and relations between words, enabling higher-level understanding.

Role in NLP: Syntax analysis is a core phase of NLP. After tokenization and POS tagging, parsing detects phrase structure and grammatical relationships. It checks that the sentence is well-formed and produces an output (e.g. a parse tree or dependency graph) used by downstream tasks such as semantic analysis and machine translation. Parsers also report syntax errors or ambiguities when the grammar rules fail to apply.

Top-Down vs Bottom-Up Parsing: In top-down parsing (e.g. recursive descent, Earley), the parser begins with the start symbol of the grammar and tries to rewrite it to match the input. In bottom-up parsing (e.g. shift-reduce, CYK), the parser begins with the input tokens and incrementally builds larger sub-phrases until it reaches the start symbol. Top-down parsing is goal-driven (predicting structure), while bottom-up parsing is data-driven (combining adjacent tokens). Top-down methods may backtrack over grammar rules, whereas bottom-up methods "shift" tokens onto a stack and "reduce" them by grammar rules (see section 4).

Challenges: Natural language is highly ambiguous and irregular. A given sentence can often be parsed in multiple valid ways (syntactic ambiguity, e.g. PP-attachment or coordination ambiguities). Parsers must handle lexical ambiguity (words with multiple possible POS tags) and structural ambiguity (multiple possible parse trees). Context-sensitive constructions (such as long-distance dependencies) and incomplete or noisy input also pose challenges. Robust parsing therefore often uses probabilistic grammar models (trained on treebanks) or heuristic disambiguation to choose the most likely parse.

Constituency vs Dependency: There are two main paradigms of syntactic parsing: constituency (phrase-structure) parsing and dependency parsing. Constituency parsing identifies nested phrase constituents (NP, VP, etc.) and yields a hierarchical parse tree with nonterminal labels. Dependency parsing, in contrast, represents a sentence as a directed graph of binary head–dependent relations between words. In a dependency parse, each word (vertex) is connected by labeled edges (subject, object, modifier, etc.) to its governor, forming a dependency tree. Both approaches capture grammatical structure but focus on different aspects: constituency parses emphasize phrase boundaries, while dependency parses emphasize word-to-word syntactic relations.

2. Treebanks

What is a Treebank: A treebank is a corpus of sentences, each annotated with its syntactic structure. In linguistics, a treebank is essentially a parsed text corpus in which each sentence is paired with its parse (constituency or dependency). Treebanks are typically hand-annotated (or semi-automatically annotated and then corrected) by linguists. Constructing a treebank is labor-intensive, but it provides "gold-standard" data for empirical NLP.

Penn Treebank: The Penn Treebank (PTB) is a landmark English treebank of news text (the Wall Street Journal). It contains about one million words of 1989 WSJ text bracketed in Penn Treebank style (a phrase-structure annotation). In PTB II style, sentences are annotated with full constituent parse trees and POS tags. For example, PTB's bracketed notation for "John loves Mary" is: (S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))) (. .)). PTB also includes predicate-argument and disfluency annotations (from earlier phases), but its main legacy is as a standard dataset for training and evaluating constituency parsers.
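As a concrete illustration of this bracketed format, the short sketch below reads the example above into a tree object. It assumes the NLTK library is installed; NLTK's Tree class reads exactly this notation.

# A minimal sketch, assuming the NLTK library is available (pip install nltk).
from nltk import Tree

ptb_string = "(S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))) (. .))"
tree = Tree.fromstring(ptb_string)   # build a Tree object from the brackets
tree.pretty_print()                  # draw the constituency tree as text
print(tree.leaves())                 # ['John', 'loves', 'Mary', '.']
print(tree.pos())                    # [('John', 'NNP'), ('loves', 'VBZ'), ...]

Real PTB files contain many such bracketed trees; the same fromstring call applies to each one in turn.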
Universal Dependencies (UD): Universal Dependencies is a cross-linguistic dependency treebank framework. UD provides consistent annotation of parts of speech, morphological features, and syntactic dependencies across many languages. It is an open community project (200+ treebanks, 150+ languages) that aims for uniform annotation guidelines. In UD, each sentence is parsed into a dependency graph (usually a tree) linking words by grammatical relations (e.g. nsubj(loves, John)). UD treebanks are widely used to train dependency parsers for multilingual NLP.

Training Parsers: Treebanks are used to train and evaluate statistical or neural parsers. For constituency parsing, PTB (and similar phrase-structure treebanks) provides training examples from which a parser can learn grammar rules or rule probabilities. For dependency parsing, UD corpora serve as the gold data. A parser (e.g. a probabilistic CFG or a neural parser) can be trained to reproduce the annotated parses. In fact, state-of-the-art NLP components (POS taggers, parsers, semantic analyzers, machine translation systems) often rely on treebank annotations. Having a large, high-quality treebank lets NLP systems learn grammars and patterns automatically, greatly improving accuracy compared to hand-crafted grammars.

3. Representation of Syntactic Structure

Figure: Example constituency parse tree for "John hit the ball." The root node S splits into NP and VP, with further children down to the words.

A constituency parse tree (phrase-structure tree) represents the hierarchical phrase structure of a sentence. Each internal node is a nonterminal category (S, NP, VP, Det, N, etc.), and the leaves are the actual words (terminals). The tree above shows S → NP VP, NP → John, VP → V NP, etc., with part-of-speech and phrase labels. Such trees correspond to a context-free derivation of the sentence. In bracketed notation, the same structure can be written as (S (NP John) (VP (V hit) (NP (Det the) (N ball)))), which makes the hierarchy explicit. Constituency trees make it easy to see phrasal spans and sub-phrases (e.g. which words form the noun phrase). They are typically generated by a context-free grammar: for example, a simple CFG might include rules like S → NP VP, NP → Det N, VP → V NP, etc. (A context-free grammar (CFG) is a set of productions, each with a single nonterminal on its left-hand side, allowing the modular "block structure" of sentences.)

CFG Example: A sample CFG for English might include:
o S → NP VP
o NP → Det N | N
o VP → V NP | V
o Det → "the" | "a"
o N → "ball" | "dog" | ...
o V → "hit" | "chased" | ...
Such rules generate the parse tree above by expanding S into NP ("John") and VP ("hit the ball"). The simplicity of CFGs makes them well suited to capturing phrase structure and generating constituency trees.

Figure: Example dependency parse (graph) with labeled arcs. Each word is a node; edges indicate syntactic relations (SBJ = subject, VC = verb chain, etc.).

A dependency graph represents syntax as word-to-word relations. Each node is a word (with its POS tag), and directed arcs link governors (heads) to their dependents, labeled by grammatical relations. For example, in the graph above for "A hearing is scheduled on the issue today," the main verb "scheduled" is the root, with edges such as nsubj(scheduled, hearing) and obl(scheduled, issue) (subject and oblique argument). Dependency parses contain no phrasal nodes: all structure is captured by the directed tree over the words. Dependency edges encode who depends on whom: subject, object, modifiers, and so on. This form is popular in many NLP applications because it directly exposes predicate–argument structure and is often simpler to work with programmatically. Universal Dependencies is one common scheme for such annotations. Unlike constituency trees, dependency graphs are typically drawn as flat directed trees or graphs (sometimes with curved arcs, as above); they abstract away from phrase spans and emphasize binary word-to-word relations.
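To make this representation concrete, the sketch below (plain Python, standard library only) stores the example parse as head–relation–dependent triples and queries it. The nsubj and obl labels come from the example above; the remaining labels are illustrative UD-style choices, not an authoritative annotation.

# A minimal sketch of a dependency parse stored as (head, relation, dependent)
# triples for "A hearing is scheduled on the issue today".
# nsubj and obl come from the example in the text; the other labels are
# illustrative UD-style choices.
deps = [
    ("scheduled", "nsubj",    "hearing"),  # subject of the main verb
    ("hearing",   "det",      "A"),
    ("scheduled", "aux:pass", "is"),
    ("scheduled", "obl",      "issue"),    # oblique argument
    ("issue",     "case",     "on"),
    ("issue",     "det",      "the"),
    ("scheduled", "obl:tmod", "today"),
]

# Every word except the root appears exactly once as a dependent, so the
# triples form a tree; the root is the head that is never a dependent.
heads = {h for h, _, _ in deps}
dependents = {d for _, _, d in deps}
print(heads - dependents)                                             # {'scheduled'}

# Query the graph: what is the grammatical subject of "scheduled"?
print([d for h, r, d in deps if h == "scheduled" and r == "nsubj"])   # ['hearing']

Many toolkits expose the same information as one head index and relation label per token, which is essentially the CoNLL-U format used by UD treebanks.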
Key Differences: Both representations are tree structures, but with different emphases. A constituency tree explicitly shows nested phrases and uses nonterminal labels (e.g. "NP" nodes covering multi-word spans), while a dependency tree has only word nodes connected by labeled edges. Constituency parses make it easy to identify constituents (for example, all words under an NP node), whereas dependency parses highlight head–modifier relations. Notationally, constituency trees are often written as labeled brackets, while dependency parses can be encoded as directed graphs (or as head–relation–dependent triples). In practice, languages with rigid word order (like English) have often been annotated with constituency trees, whereas free-word-order languages are often annotated with dependencies, but either style can be applied to any language.

4. Parsing Algorithms

CYK Parser: The Cocke–Younger–Kasami (CYK) algorithm is a bottom-up chart parser for CFGs in Chomsky Normal Form (CNF). It uses dynamic programming to fill a triangular table of sub-spans. CYK checks all ways to split each span and apply grammar rules; in the worst case it takes $O(n^3 \cdot |G|)$ time, with $n$ the sentence length and $|G|$ the grammar size. This cubic bound makes it efficient in the worst case for general CFG parsing. CYK always finds all valid parses (it can build a parse forest) but requires the grammar to be converted to CNF. In practice, pure CYK is less used in NLP (because of the CNF constraint), but it is the clearest illustration of bottom-up chart parsing.

Earley Parser: Earley's algorithm is a top-down chart parser that can handle any CFG. It incrementally builds states (dotted rules) across input positions, using dynamic programming to avoid re-parsing. Earley parsing runs in $O(n^3)$ time in the general case, but is faster for simpler grammars (e.g. $O(n^2)$ for unambiguous grammars and linear $O(n)$ for most deterministic grammars). It handles left-recursive rules gracefully and produces complete parse forests. Earley is widely cited in computational linguistics for full-sentence parsing because of its generality. (The Earley recognizer can be extended to produce actual parse trees, not just a yes/no recognition result.) Earley's chart-based approach shares sub-results and is also amenable to probabilistic scoring.

Shift-Reduce Parser: A shift-reduce parser (a type of bottom-up parser) reads the input left to right using a stack. At each step it either shifts the next input token onto the stack or reduces a sequence of items at the top of the stack by a grammar rule. In compilers this underlies LR parsing; in NLP it underlies many deterministic dependency parsers (transition-based parsing). Shift-reduce parsing can be extremely efficient: without backtracking, its running time scales essentially linearly with sentence length (each token is shifted once). The standard analysis (e.g. in the Wikipedia article on shift-reduce parsing) notes that such a parser "has no backing up" and that its execution time is linear in the input size. (In practice, some lookahead or conflict-resolution strategy is used to decide between shifting and reducing; ambiguous grammars can force backtracking, which would increase the cost, but well-designed grammars and parsers avoid this.)
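The following sketch makes the shift and reduce actions concrete using the sample CFG from section 3. It is plain Python with greedy reduction and no lookahead, so it is only an illustration of the mechanism, not a complete parser.

# A minimal shift-reduce recognizer sketch for the toy CFG from section 3.
# Greedy reduction with no lookahead; rule order decides reduce conflicts.
# Real LR or transition-based parsers use parse tables or learned classifiers.
GRAMMAR = [                     # (left-hand side, right-hand side)
    ("NP", ("Det", "N")),       # tried before NP -> N, so "the ball" reduces correctly
    ("NP", ("N",)),
    ("VP", ("V", "NP")),
    ("S",  ("NP", "VP")),
]

def shift_reduce(tags):
    """Return True if the POS-tag sequence reduces to the start symbol S."""
    stack, buffer = [], list(tags)
    while buffer or len(stack) > 1:
        for lhs, rhs in GRAMMAR:                 # REDUCE if a rule's RHS
            if tuple(stack[-len(rhs):]) == rhs:  # matches the top of the stack
                stack[-len(rhs):] = [lhs]
                break
        else:
            if not buffer:                       # nothing to reduce or shift
                return False
            stack.append(buffer.pop(0))          # SHIFT the next token
    return stack == ["S"]

# "John hit the ball" tagged as N V Det N:
print(shift_reduce(["N", "V", "Det", "N"]))      # True
print(shift_reduce(["Det", "V", "N"]))           # False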
Chart Parsing (General): Chart parsers in NLP (such as CYK and Earley) use a "chart" (a table) to record intermediate hypotheses, avoiding redundant work. The chart can be filled in top-down or bottom-up order. Because the chart records partial parses for every span, chart parsing handles ambiguity naturally by keeping multiple possibilities alive. The time complexity of chart parsing is generally polynomial (typically $O(n^3)$) in the sentence length and grammar size. Chart algorithms (Earley, CYK, and their variants) all exemplify dynamic-programming parsing; a minimal CYK-style sketch appears at the end of this unit.

Complexity Summary: In summary, most general CFG parsers have $O(n^3)$ worst-case complexity, though practical performance can be better. CYK is always cubic, $O(n^3)$, for CNF grammars. Earley is $O(n^3)$ in the worst case but can be $O(n^2)$ or $O(n)$ in many practical cases. Shift-reduce (LR) parsing is effectively linear-time for deterministic (unambiguous) grammars. In NLP, statistical parsers often trade some efficiency for accuracy by scoring multiple parses (e.g. PCFG parsing with the CYK algorithm, or beam-search shift-reduce parsing).

Applications: Accurate syntactic parsing is a building block for many NLP tasks. Constituency parses are used in grammar checking, question answering (to find noun–verb relations), and syntax-based translation. Dependency parses are widely used in relation extraction, semantic role labeling, and as features in machine translation (dependency structure often aligns better across languages). In practice, modern NLP systems use probabilistic or neural parsers trained on treebanks, outputting either parse trees or dependency graphs as needed. The annotated structures from treebanks have improved tools such as parsers, taggers, and MT systems, and parsing remains a fundamental component of advanced NLP pipelines.
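Finally, to illustrate the chart-filling idea referenced above, here is a minimal CYK recognizer sketch. It is plain Python; the toy CNF grammar and the tag-level input are illustrative assumptions, not a real NLP grammar.

# A minimal CYK recognizer sketch over a toy CNF grammar, illustrating the
# O(n^3 * |G|) chart-filling loops described in section 4.
# CNF rules are either A -> B C (binary) or A -> terminal (here, a POS tag).
BINARY = [("S", "NP", "VP"), ("VP", "V", "NP"), ("NP", "Det", "N")]
LEXICAL = [("NP", "N"), ("Det", "Det"), ("N", "N"), ("V", "V")]

def cyk(tags):
    """Return True if the tag sequence is derivable from S."""
    n = len(tags)
    # chart[i][j] = set of nonterminals deriving tags[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, tag in enumerate(tags):                   # length-1 spans
        chart[i][i + 1] = {a for a, t in LEXICAL if t == tag}
    for length in range(2, n + 1):                   # longer spans, bottom up
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):                # every split point
                for a, b, c in BINARY:
                    if b in chart[i][k] and c in chart[k][j]:
                        chart[i][j].add(a)
    return "S" in chart[0][n]

print(cyk(["N", "V", "Det", "N"]))   # True  ("John hit the ball")
print(cyk(["Det", "N", "N"]))        # False

The three nested loops over span length, start position, and split point give the cubic dependence on sentence length, while the innermost loop over rules contributes the grammar-size factor.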