Unit 2: Parsing in Natural Language Processing

This unit discusses parsing in natural language processing (NLP): its definition, role, challenges, and the two main paradigms, constituency and dependency parsing. It also covers treebanks, including the Penn Treebank and Universal Dependencies, which are essential for training parsers, and outlines the main parsing algorithms, their complexities, and their applications in NLP tasks such as grammar checking and machine translation.
1. Parsing Natural Language
Definition: Parsing (syntax analysis) is the process of analyzing a sentence's grammatical structure according to a formal grammar. A parser takes an input sentence and, after checking its syntax, produces a structured representation (typically a parse tree or a dependency graph). In NLP this identifies constituents (such as noun phrases) and relations between words, enabling higher-level understanding.

Role in NLP: Syntax analysis is a core phase of NLP. After tokenization and POS tagging, parsing detects phrase structure and grammatical relationships. It checks that the sentence is well-formed and produces an output (e.g. a parse tree or dependency graph) used by downstream tasks such as semantic analysis and machine translation. Parsers also report syntax errors or ambiguities when the grammar rules fail to apply.

Top-Down vs Bottom-Up Parsing: In top-down parsing (e.g. recursive descent, Earley), the parser begins with the start symbol of the grammar and tries to rewrite it to match the input. In bottom-up parsing (e.g. shift-reduce, CYK), the parser begins with the input tokens and incrementally builds larger sub-phrases until it reaches the start symbol. Top-down parsing is goal-driven (predicting structure), while bottom-up parsing is data-driven (combining adjacent tokens). Top-down methods may backtrack over grammar rules, whereas bottom-up methods "shift" tokens onto a stack and "reduce" them by grammar rules (see section 4).

Challenges: Natural language is highly ambiguous and irregular. A given sentence can often be parsed in multiple valid ways (syntactic ambiguity, e.g. PP-attachment or coordination ambiguities). Parsers must handle lexical ambiguity (words with multiple possible POS tags) and structural ambiguity (multiple possible parse trees). Context-sensitive constructions (such as long-distance dependencies) and incomplete or noisy input also pose challenges. Robust parsing therefore often uses probabilistic grammar models (trained on treebanks) or heuristic disambiguation to choose the most likely parse.

Constituency vs Dependency: There are two main paradigms of syntactic parsing: constituency (phrase-structure) parsing and dependency parsing. Constituency parsing identifies nested phrase constituents (NP, VP, etc.) and yields a hierarchical parse tree with nonterminal labels. Dependency parsing, in contrast, represents a sentence as a directed graph of binary head–dependent relations between words. In a dependency parse, each word (vertex) is connected by labeled edges (subject, object, modifier, etc.) to its governor, forming a dependency tree. Both approaches capture grammatical structure but focus on different aspects: constituency parses emphasize phrase boundaries, while dependency parses emphasize word-to-word syntactic relations.

2. Treebanks

What is a Treebank: A treebank is a corpus of sentences, each annotated with its syntactic structure. In linguistics, a treebank is essentially a parsed text corpus in which each sentence is paired with its parse (constituency or dependency). Treebanks are typically hand-annotated (or semi-automatically annotated and then corrected) by linguists. Constructing a treebank is labor-intensive, but it provides "gold-standard" data for empirical NLP.

Penn Treebank: The Penn Treebank (PTB) is a landmark English treebank of news text (the Wall Street Journal). It contains about one million words of 1989 WSJ text bracketed in Penn Treebank style (a phrase-structure annotation). In PTB II style, sentences are annotated with full constituent parse trees and POS tags. For example, PTB's bracketed notation for "John loves Mary" is: (S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))) (. .)). PTB also includes predicate-argument and disfluency annotations (from earlier phases), but its main legacy is as a standard dataset for training and evaluating constituency parsers.
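As a concrete illustration of this bracketed format, the short sketch below reads the example above into a tree object. It assumes the NLTK library is installed; NLTK's Tree class reads exactly this notation.

# A minimal sketch, assuming the NLTK library is available (pip install nltk).
from nltk import Tree

ptb_string = "(S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))) (. .))"
tree = Tree.fromstring(ptb_string)   # build a Tree object from the brackets
tree.pretty_print()                  # draw the constituency tree as text
print(tree.leaves())                 # ['John', 'loves', 'Mary', '.']
print(tree.pos())                    # [('John', 'NNP'), ('loves', 'VBZ'), ...]

Real PTB files contain many such bracketed trees; the same fromstring call applies to each one in turn.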
Universal Dependencies (UD): Universal Dependencies is a cross-linguistic dependency treebank framework. UD provides consistent annotation of parts of speech, morphological features, and syntactic dependencies across many languages. It is an open community project (200+ treebanks, 150+ languages) that aims for uniform annotation guidelines. In UD, each sentence is parsed into a dependency graph (usually a tree) linking words by grammatical relations (e.g. nsubj(loves, John)). UD treebanks are widely used to train dependency parsers for multilingual NLP.

Training Parsers: Treebanks are used to train and evaluate statistical or neural parsers. For constituency parsing, PTB (and similar phrase-structure treebanks) provides training examples from which a parser can learn grammar rules or rule probabilities. For dependency parsing, UD corpora serve as the gold data. A parser (e.g. a probabilistic CFG or a neural parser) can be trained to reproduce the annotated parses. In fact, state-of-the-art NLP components (POS taggers, parsers, semantic analyzers, machine translation systems) often rely on treebank annotations. Having a large, high-quality treebank lets NLP systems learn grammars and patterns automatically, greatly improving accuracy compared to hand-crafted grammars.

3. Representation of Syntactic Structure

Figure: Example constituency parse tree for "John hit the ball." The root node S splits into NP and VP, with further children down to the words.

A constituency parse tree (phrase-structure tree) represents the hierarchical phrase structure of a sentence. Each internal node is a nonterminal category (S, NP, VP, Det, N, etc.), and the leaves are the actual words (terminals). The tree above shows S → NP VP, NP → John, VP → V NP, etc., with part-of-speech and phrase labels. Such trees correspond to a context-free derivation of the sentence. In bracketed notation, the same structure can be written as (S (NP John) (VP (V hit) (NP (Det the) (N ball)))), which makes the hierarchy explicit. Constituency trees make it easy to see phrasal spans and sub-phrases (e.g. which words form the noun phrase). They are typically generated by a context-free grammar: for example, a simple CFG might include rules like S → NP VP, NP → Det N, VP → V NP, etc. (A context-free grammar (CFG) is a set of productions, each with a single nonterminal on its left-hand side, allowing the modular "block structure" of sentences.)

CFG Example: A sample CFG for English might include:
o S → NP VP
o NP → Det N | N
o VP → V NP | V
o Det → "the" | "a"
o N → "ball" | "dog" | ...
o V → "hit" | "chased" | ...
Such rules generate the parse tree above by expanding S into NP ("John") and VP ("hit the ball"). The simplicity of CFGs makes them well suited to capturing phrase structure and generating constituency trees.

Figure: Example dependency parse (graph) with labeled arcs. Each word is a node; edges indicate syntactic relations (SBJ = subject, VC = verb chain, etc.).

A dependency graph represents syntax as word-to-word relations. Each node is a word (with its POS tag), and directed arcs link governors (heads) to their dependents, labeled by grammatical relations. For example, in the graph above for "A hearing is scheduled on the issue today," the main verb "scheduled" is the root, with edges such as nsubj(scheduled, hearing) and obl(scheduled, issue) (subject and oblique argument). Dependency parses contain no phrasal nodes: all structure is captured by the directed tree over the words. Dependency edges encode who depends on whom: subject, object, modifiers, and so on. This form is popular in many NLP applications because it directly exposes predicate–argument structure and is often simpler to work with programmatically. Universal Dependencies is one common scheme for such annotations. Unlike constituency trees, dependency graphs are typically drawn as flat directed trees or graphs (sometimes with curved arcs, as above); they abstract away from phrase spans and emphasize binary word-to-word relations.
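To make this representation concrete, the sketch below (plain Python, standard library only) stores the example parse as head–relation–dependent triples and queries it. The nsubj and obl labels come from the example above; the remaining labels are illustrative UD-style choices, not an authoritative annotation.

# A minimal sketch of a dependency parse stored as (head, relation, dependent)
# triples for "A hearing is scheduled on the issue today".
# nsubj and obl come from the example in the text; the other labels are
# illustrative UD-style choices.
deps = [
    ("scheduled", "nsubj",    "hearing"),  # subject of the main verb
    ("hearing",   "det",      "A"),
    ("scheduled", "aux:pass", "is"),
    ("scheduled", "obl",      "issue"),    # oblique argument
    ("issue",     "case",     "on"),
    ("issue",     "det",      "the"),
    ("scheduled", "obl:tmod", "today"),
]

# Every word except the root appears exactly once as a dependent, so the
# triples form a tree; the root is the head that is never a dependent.
heads = {h for h, _, _ in deps}
dependents = {d for _, _, d in deps}
print(heads - dependents)                                             # {'scheduled'}

# Query the graph: what is the grammatical subject of "scheduled"?
print([d for h, r, d in deps if h == "scheduled" and r == "nsubj"])   # ['hearing']

Many toolkits expose the same information as one head index and relation label per token, which is essentially the CoNLL-U format used by UD treebanks.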
Key Differences: Both representations are tree structures, but with different emphases. A constituency tree explicitly shows nested phrases and uses nonterminal labels (e.g. "NP" nodes covering multi-word spans), while a dependency tree has only word nodes connected by labeled edges. Constituency parses make it easy to identify constituents (for example, all words under an NP node), whereas dependency parses highlight head–modifier relations. Notationally, constituency trees are often written as labeled brackets, while dependency parses can be encoded as directed graphs (or as head–relation–dependent triples). In practice, languages with rigid word order (like English) have often been annotated with constituency trees, whereas free-word-order languages are often annotated with dependencies, but either style can be applied to any language.

4. Parsing Algorithms

CYK Parser: The Cocke–Younger–Kasami (CYK) algorithm is a bottom-up chart parser for CFGs in Chomsky Normal Form (CNF). It uses dynamic programming to fill a triangular table of sub-spans. CYK checks all ways to split each span and apply grammar rules; in the worst case it takes $O(n^3 \cdot |G|)$ time, with $n$ the sentence length and $|G|$ the grammar size. This cubic bound makes it efficient in the worst case for general CFG parsing. CYK always finds all valid parses (it can build a parse forest) but requires the grammar to be converted to CNF. In practice, pure CYK is less used in NLP (because of the CNF constraint), but it is the clearest illustration of bottom-up chart parsing.

Earley Parser: Earley's algorithm is a top-down chart parser that can handle any CFG. It incrementally builds states (dotted rules) across input positions, using dynamic programming to avoid re-parsing. Earley parsing runs in $O(n^3)$ time in the general case, but is faster for simpler grammars (e.g. $O(n^2)$ for unambiguous grammars and linear $O(n)$ for most deterministic grammars). It handles left-recursive rules gracefully and produces complete parse forests. Earley is widely cited in computational linguistics for full-sentence parsing because of its generality. (The Earley recognizer can be extended to produce actual parse trees, not just a yes/no recognition result.) Earley's chart-based approach shares sub-results and is also amenable to probabilistic scoring.

Shift-Reduce Parser: A shift-reduce parser (a type of bottom-up parser) reads the input left to right using a stack. At each step it either shifts the next input token onto the stack or reduces a sequence of items at the top of the stack by a grammar rule. In compilers this underlies LR parsing; in NLP it underlies many deterministic dependency parsers (transition-based parsing). Shift-reduce parsing can be extremely efficient: without backtracking, its running time scales essentially linearly with sentence length (each token is shifted once). The standard analysis (e.g. in the Wikipedia article on shift-reduce parsing) notes that such a parser "has no backing up" and that its execution time is linear in the input size. (In practice, some lookahead or conflict-resolution strategy is used to decide between shifting and reducing; ambiguous grammars can force backtracking, which would increase the cost, but well-designed grammars and parsers avoid this.)
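The following sketch makes the shift and reduce actions concrete using the sample CFG from section 3. It is plain Python with greedy reduction and no lookahead, so it is only an illustration of the mechanism, not a complete parser.

# A minimal shift-reduce recognizer sketch for the toy CFG from section 3.
# Greedy reduction with no lookahead; rule order decides reduce conflicts.
# Real LR or transition-based parsers use parse tables or learned classifiers.
GRAMMAR = [                     # (left-hand side, right-hand side)
    ("NP", ("Det", "N")),       # tried before NP -> N, so "the ball" reduces correctly
    ("NP", ("N",)),
    ("VP", ("V", "NP")),
    ("S",  ("NP", "VP")),
]

def shift_reduce(tags):
    """Return True if the POS-tag sequence reduces to the start symbol S."""
    stack, buffer = [], list(tags)
    while buffer or len(stack) > 1:
        for lhs, rhs in GRAMMAR:                 # REDUCE if a rule's RHS
            if tuple(stack[-len(rhs):]) == rhs:  # matches the top of the stack
                stack[-len(rhs):] = [lhs]
                break
        else:
            if not buffer:                       # nothing to reduce or shift
                return False
            stack.append(buffer.pop(0))          # SHIFT the next token
    return stack == ["S"]

# "John hit the ball" tagged as N V Det N:
print(shift_reduce(["N", "V", "Det", "N"]))      # True
print(shift_reduce(["Det", "V", "N"]))           # False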
Chart Parsing (General): Chart parsers in NLP (such as CYK and Earley) use a "chart" (a table) to record intermediate hypotheses, avoiding redundant work. The chart can be filled in top-down or bottom-up order. Because the chart records partial parses for every span, chart parsing handles ambiguity naturally by keeping multiple possibilities alive. The time complexity of chart parsing is generally polynomial (typically $O(n^3)$) in the sentence length and grammar size. Chart algorithms (Earley, CYK, and their variants) all exemplify dynamic-programming parsing; a minimal CYK-style sketch appears at the end of this unit.

Complexity Summary: In summary, most general CFG parsers have $O(n^3)$ worst-case complexity, though practical performance can be better. CYK is always cubic, $O(n^3)$, for CNF grammars. Earley is $O(n^3)$ in the worst case but can be $O(n^2)$ or $O(n)$ in many practical cases. Shift-reduce (LR) parsing is effectively linear-time for deterministic (unambiguous) grammars. In NLP, statistical parsers often trade some efficiency for accuracy by scoring multiple parses (e.g. PCFG parsing with the CYK algorithm, or beam-search shift-reduce parsing).

Applications: Accurate syntactic parsing is a building block for many NLP tasks. Constituency parses are used in grammar checking, question answering (to find noun–verb relations), and syntax-based translation. Dependency parses are widely used in relation extraction, semantic role labeling, and as features in machine translation (dependency structure often aligns better across languages). In practice, modern NLP systems use probabilistic or neural parsers trained on treebanks, outputting either parse trees or dependency graphs as needed. The annotated structures from treebanks have improved tools such as parsers, taggers, and MT systems, and parsing remains a fundamental component of advanced NLP pipelines.
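Finally, to illustrate the chart-filling idea referenced above, here is a minimal CYK recognizer sketch. It is plain Python; the toy CNF grammar and the tag-level input are illustrative assumptions, not a real NLP grammar.

# A minimal CYK recognizer sketch over a toy CNF grammar, illustrating the
# O(n^3 * |G|) chart-filling loops described in section 4.
# CNF rules are either A -> B C (binary) or A -> terminal (here, a POS tag).
BINARY = [("S", "NP", "VP"), ("VP", "V", "NP"), ("NP", "Det", "N")]
LEXICAL = [("NP", "N"), ("Det", "Det"), ("N", "N"), ("V", "V")]

def cyk(tags):
    """Return True if the tag sequence is derivable from S."""
    n = len(tags)
    # chart[i][j] = set of nonterminals deriving tags[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, tag in enumerate(tags):                   # length-1 spans
        chart[i][i + 1] = {a for a, t in LEXICAL if t == tag}
    for length in range(2, n + 1):                   # longer spans, bottom up
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):                # every split point
                for a, b, c in BINARY:
                    if b in chart[i][k] and c in chart[k][j]:
                        chart[i][j].add(a)
    return "S" in chart[0][n]

print(cyk(["N", "V", "Det", "N"]))   # True  ("John hit the ball")
print(cyk(["Det", "N", "N"]))        # False

The three nested loops over span length, start position, and split point give the cubic dependence on sentence length, while the innermost loop over rules contributes the grammar-size factor.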