CS224N 2025
Diyi Yang
Lecture 4: Dependency Parsing
Lecture Plan
Finish backpropagation (10 mins)
Syntactic Structure and Dependency parsing
1. Syntactic Structure: Consistency and Dependency (20 mins)
2. Dependency Grammar and Treebanks (15 mins)
3. Transition-based dependency parsing (15 mins)
4. Neural dependency parsing (20 mins)
Key Learnings: Explicit linguistic structure and how a neural net can decide it
Reminders/comments:
• In Assignment 2, you build a neural dependency parser using PyTorch!
• Come to the PyTorch tutorial, Friday, 1:30pm Gates B01
• Final project discussions – come meet with us; focus of Tuesday class in week 4
Back-Prop in General Computation Graph
1. Fprop: visit nodes in topological sort order
   - Compute the value of each node given its predecessors (assume a single scalar output z)
2. Bprop:
   - Initialize the output gradient to 1
   - Visit nodes in reverse order: compute the gradient wrt each node using the gradients wrt its successors
   - For a node x with successors {y1, …, yn}:  ∂z/∂x = Σ_i (∂z/∂y_i)(∂y_i/∂x)
Implementation: forward/backward API
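As a concrete, illustrative sketch of the forward/backward API (not the actual assignment code): each gate caches its inputs on forward() and combines them with the upstream gradient in backward(). The class and variable names here are assumptions.

```python
class MultiplyGate:
    """Illustrative node: z = x * y; forward() caches inputs for use in backward()."""
    def forward(self, x, y):
        self.x, self.y = x, y   # cache the values needed for the backward pass
        return x * y

    def backward(self, dz):
        # dz is the upstream gradient dL/dz; return the gradients dL/dx and dL/dy
        dx = dz * self.y        # local gradient dz/dx = y
        dy = dz * self.x        # local gradient dz/dy = x
        return dx, dy

# forward pass in topological order, backward pass in reverse order
gate = MultiplyGate()
z = gate.forward(3.0, 4.0)      # z = 12.0
dx, dy = gate.backward(1.0)     # seed the output gradient with 1 -> dx = 4.0, dy = 3.0
```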
Manual Gradient checking: Numeric Gradient
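A minimal sketch of a numeric gradient check using the central difference (f(x + h) − f(x − h)) / (2h); the example function and tolerance below are illustrative, not part of the course code.

```python
def numeric_grad(f, x, h=1e-4):
    """Central-difference estimate of df/dx at a scalar point x."""
    return (f(x + h) - f(x - h)) / (2 * h)

# compare an analytic gradient against the numeric estimate
f = lambda x: x ** 3           # example function
analytic = 3 * 2.0 ** 2        # df/dx = 3x^2 at x = 2
numeric = numeric_grad(f, 2.0)
assert abs(analytic - numeric) < 1e-5
```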
Summary
• But why take a class on compilers or systems when they are implemented for you?
• Understanding what is going on under the hood is useful!
Lecture Plan
✓ Finish backpropagation (10 mins)
Syntactic Structure and Dependency parsing
1. Syntactic Structure: Consistency and Dependency (20 mins)
2. Dependency Grammar and Treebanks (15 mins)
3. Transition-based dependency parsing (15 mins)
4. Neural dependency parsing (20 mins)
1. The linguistic structure of sentences – two views: Constituency
= phrase structure grammar = context-free grammars (CFGs)
Phrase structure organizes words into nested constituents
[Slide example: words combine into nested constituents – noun phrases such as "the cat" and "a dog"; adjective and prepositional-phrase modifiers such as "large", "barking", "cuddly", "in a crate", "on the table", "by the door"; and verbs such as "talk to" and "walked behind" that combine with these phrases.]
Two views of linguistic structure: Dependency structure
• Dependency structure shows which words depend on (modify, attach to, or are
arguments of) which other words.
Why do we need sentence structure?
Human listeners need to work out what modifies [attaches to] what
Prepositional phrase attachment ambiguity
A prepositional phrase can attach to the verb or to a preceding noun phrase: in "She saw the man with a telescope", "with a telescope" can modify either "saw" or "the man".
PP attachment ambiguities multiply
A sentence with several prepositional phrases has many possible attachment structures, and the number grows rapidly with the number of PPs.
Coordination scope ambiguity
"Shuttle veteran and longtime NASA executive Fred Gregory appointed to board" – is this one person (a shuttle veteran who is also a longtime NASA executive) or two?
Adjectival/Adverbial Modifier Ambiguity
Verb Phrase (VP) attachment ambiguity
Dependency paths help extract semantic interpretation –
simple practical example: extracting protein-protein interaction
[Figure: dependency tree of an example sentence, with relations such as nsubj and ccomp attaching to "demonstrated"; the dependency path between the protein mentions is used to extract the interaction.]
2. Dependency Grammar and Dependency Structure
Dependency syntax postulates that syntactic structure consists of relations between
lexical items, normally binary asymmetric relations (“arrows”) called dependencies
[Figure: dependency tree for "Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas".]
The arrows are commonly typed with the name of grammatical relations (subject, prepositional object, apposition, etc.) – e.g. nsubj:pass, aux, obl, nmod, case, cc, conj, appos, flat in the tree above.
[Image: "Birch bark MS from Kashmir of the Rupavatra", Wellcome Images L0032691, CC BY 4.0, http://wellcomeimages.org/indexplus/image/L0032691.html
But this comes from much later – originally the grammar was oral]
Dependency Grammar/Parsing History
• The idea of dependency structure goes back a long way
• To Pāṇini’s grammar (c. 5th century BCE)
• Basic approach of 1st millennium Arabic grammarians
• Constituency/context-free grammar is a new-fangled invention
• 20th century invention (R.S. Wells, 1947; then Chomsky 1953, etc.)
• Modern dependency work is often sourced to Lucien Tesnière (1959)
• Was dominant approach in “East” in 20th Century (Russia, China, …)
• Good for freer word order, inflected languages like Russian (or Latin!)
• Used in some of the earliest parsers in NLP, even in the US:
• David Hays, one of the founders of U.S. computational linguistics, built early (first?)
dependency parser (Hays 1962) and published on dependency grammar in Language
Dependency Grammar and Dependency Structure
• Some people draw the arrows one way; some the other way!
• Tesnière had them point from head to dependent – we follow that convention
• We usually add a fake ROOT so every word is a dependent of precisely 1 other node
The rise of annotated data & Universal Dependencies treebanks
Brown corpus (1967; PoS tagged 1979); Lancaster-IBM Treebank (starting late 1980s);
Marcus et al. 1993, The Penn Treebank, Computational Linguistics;
Universal Dependencies: http://universaldependencies.org/
The rise of annotated data
Starting off, building a treebank seems a lot slower and less useful than writing a grammar by hand.
But a treebank gives us many things: reusability of the labor, broad coverage, frequencies and distributional information, and a way to evaluate NLP systems.
Dependency Conditioning Preferences
What are the straightforward sources of information for dependency parsing?
1. Bilexical affinities: the dependency [discussion → issues] is plausible
2. Dependency distance: most dependencies are between nearby words
3. Intervening material: dependencies rarely span intervening verbs or punctuation
4. Valency of heads: how many dependents on which side are usual for a head?
Basic transition-based dependency parser
A greedy parser maintains a stack σ (initially [root]), a buffer β (initially the input words), and a set of dependency arcs A (initially empty), and repeatedly applies one of three actions – Shift, Left-Arc, or Right-Arc – until the buffer is empty and only [root] remains on the stack.
Arc-standard transition-based parser
(there are other transition schemes …)
Analysis of "I ate fish"

Start:      stack [root]          buffer: I ate fish    A = {}
Shift:      [root] I              buffer: ate fish
Shift:      [root] I ate          buffer: fish
Left-Arc:   [root] ate            buffer: fish          A += nsubj(ate → I)
Shift:      [root] ate fish       buffer: (empty)
Right-Arc:  [root] ate            buffer: (empty)       A += obj(ate → fish)
Right-Arc:  [root]                buffer: (empty)       A += root([root] → ate)
Finish:     A = { nsubj(ate → I), obj(ate → fish), root([root] → ate) }

Nota bene: in this example the "correct" next transition was made at each step, but a parser has to work this out – by exploring or inferring!
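The walkthrough can be written out as a small sketch of the arc-standard transitions; the data structures and the hard-coded "gold" transition sequence below are illustrative, not Assignment 2's interface.

```python
def arc_standard(words, transitions):
    """Apply a sequence of arc-standard transitions.
    transitions: ("S",) for Shift, ("LA", label) for Left-Arc, ("RA", label) for Right-Arc."""
    stack, buffer, arcs = ["[root]"], list(words), []
    for t in transitions:
        if t[0] == "S":                       # Shift: move the first buffer word onto the stack
            stack.append(buffer.pop(0))
        elif t[0] == "LA":                    # Left-Arc: second-from-top becomes dependent of top
            dep = stack.pop(-2)
            arcs.append((t[1], stack[-1], dep))
        elif t[0] == "RA":                    # Right-Arc: top becomes dependent of second-from-top
            dep = stack.pop()
            arcs.append((t[1], stack[-1], dep))
    return arcs

# the "correct" transition sequence for "I ate fish" from the slide
gold = [("S",), ("S",), ("LA", "nsubj"), ("S",), ("RA", "obj"), ("RA", "root")]
print(arc_standard(["I", "ate", "fish"], gold))
# [('nsubj', 'ate', 'I'), ('obj', 'ate', 'fish'), ('root', '[root]', 'ate')]
```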
MaltParser [Nivre and Hall 2005]
• We have yet to explain how we choose the next action
• Answer: Stand back, I know machine learning!
• Each action is predicted by a discriminative classifier (e.g., softmax classifier) over each
legal move
• Max of 3 untyped choices (max of |R| × 2 + 1 when typed)
• Features: top of stack word, POS; first in buffer word, POS; etc.
• There is NO search (in the simplest form)
• But you can profitably do a beam search if you wish (slower but better):
• You keep k good parse prefixes at each time step
• The model's accuracy is fractionally below the state of the art in dependency parsing, but
• It provides very fast linear-time parsing, with high accuracy – great for parsing the web
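A hedged sketch of this decision loop: score each transition with some trained classifier and take the best legal one. The legality rules follow the arc-standard system above; the feature dictionary and score function are illustrative stand-ins, not MaltParser's actual feature model.

```python
def legal_moves(stack, buffer):
    """Shift needs a non-empty buffer; an arc needs at least two items on the stack,
    and Left-Arc may not make [root] a dependent."""
    moves = []
    if buffer:
        moves.append("Shift")
    if len(stack) >= 2:
        moves.append("Right-Arc")
        if stack[-2] != "[root]":
            moves.append("Left-Arc")
    return moves

def choose_action(stack, buffer, score):
    """score(features, move) -> float is some trained classifier, e.g. a softmax."""
    feats = {"s1": stack[-1] if stack else None,    # word on top of the stack
             "b1": buffer[0] if buffer else None}   # first word in the buffer
    return max(legal_moves(stack, buffer), key=lambda m: score(feats, m))
```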
Conventional Feature Representation
binary, sparse feature vector: [0 0 0 1 0 0 1 0 … 0 0 1 0], dim = 10^6 – 10^7
Feature templates: usually a combination of 1–3 elements from the configuration
Indicator features
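As an illustrative sketch (the template names here are made up, not the exact templates used in practice), indicator features turn (template, value) pairs from the configuration into active dimensions of a huge sparse binary vector.

```python
def indicator_features(config):
    """config: dict with e.g. top-of-stack word/POS and first-buffer word/POS.
    Returns the set of active (binary) feature strings."""
    feats = set()
    feats.add("s1.word=" + config["s1_word"])                                  # single-element template
    feats.add("s1.pos=" + config["s1_pos"])
    feats.add("b1.word=" + config["b1_word"])
    feats.add("s1.pos+b1.pos=" + config["s1_pos"] + "|" + config["b1_pos"])    # pair template
    return feats

# each active feature is a 1 in a ~10^6-10^7 dimensional sparse vector
print(indicator_features({"s1_word": "has", "s1_pos": "VBZ",
                          "b1_word": "control", "b1_pos": "NN"}))
```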
Evaluation of Dependency Parsing: (labeled) dependency accuracy
ROOT She saw the video lecture
0    1   2   3   4     5

UAS (unlabeled attachment score) = 4 / 5 = 80%
LAS (labeled attachment score)   = 2 / 5 = 40%

(word index, head index, word, relation)
Gold                         Parsed
1  2  She      nsubj         1  2  She      nsubj
2  0  saw      root          2  0  saw      root
3  5  the      det           3  4  the      det
4  5  video    nn            4  5  video    nsubj
5  2  lecture  obj           5  2  lecture  ccomp
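The two scores can be computed directly from (head, relation) pairs; this tiny script just re-derives the numbers in the example above.

```python
# (head index, relation) for each word of "She saw the video lecture" (0 = ROOT)
gold   = [(2, "nsubj"), (0, "root"), (5, "det"), (5, "nn"),    (2, "obj")]
parsed = [(2, "nsubj"), (0, "root"), (4, "det"), (5, "nsubj"), (2, "ccomp")]

uas = sum(g[0] == p[0] for g, p in zip(gold, parsed)) / len(gold)   # correct head only
las = sum(g == p       for g, p in zip(gold, parsed)) / len(gold)   # correct head and label
print(f"UAS = {uas:.0%}, LAS = {las:.0%}")   # UAS = 80%, LAS = 40%
```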
4. Why do we gain from a neural dependency parser?
Indicator Features Revisited
Categorical features are:
• Problem #1: sparse
• Problem #2: incomplete
• Problem #3: expensive to compute
Neural approach: learn a dense and compact feature representation
dense feature vector: [0.1 0.9 -0.2 0.3 … -0.1 -0.5], dim = ~1000
A neural dependency parser [Chen and Manning 2014]
• Results on English parsing to Stanford Dependencies:
• Unlabeled attachment score (UAS) = percentage of words with the correct head
• Labeled attachment score (LAS) = percentage of words with the correct head and dependency label
First win: Distributed Representations
• Words are represented as d-dimensional dense vectors (word embeddings), so similar words have nearby vectors.
• Meanwhile, part-of-speech tags (POS) and dependency labels are also represented as d-dimensional vectors.
• The smaller discrete sets also exhibit many semantic similarities.
[Figure: embedding space in which similar items are close together, e.g. "is", "was", "were"; "good"; "come".]
Extracting Tokens & vector representations from configuration
(word, POS tag, dependency label for each position)
s1       good     JJ    ∅
s2       has      VBZ   ∅
b1       control  NN    ∅
lc(s1)   ∅        ∅     ∅
rc(s1)   ∅        ∅     ∅
lc(s2)   He       PRP   nsubj
rc(s2)   ∅        ∅     ∅
The concatenation of the vector representations of all of these is the neural representation of a configuration.
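In PyTorch, this concatenation might look like the following sketch; the vocabulary sizes, embedding dimension, and NULL handling are assumptions, not the exact Assignment 2 setup.

```python
import torch
import torch.nn as nn

d = 50                                                              # illustrative embedding dimension
word_emb  = nn.Embedding(num_embeddings=10000, embedding_dim=d)     # words (incl. a NULL id)
pos_emb   = nn.Embedding(num_embeddings=50,    embedding_dim=d)     # POS tags
label_emb = nn.Embedding(num_embeddings=40,    embedding_dim=d)     # dependency labels

def configuration_vector(word_ids, pos_ids, label_ids):
    """ids are LongTensors for the chosen positions (s1, s2, b1, lc(s1), ...).
    Returns the concatenation of all of their embeddings as one long vector."""
    parts = [word_emb(word_ids), pos_emb(pos_ids), label_emb(label_ids)]
    return torch.cat([p.reshape(-1) for p in parts])

x = configuration_vector(torch.tensor([3, 17, 42]),   # e.g. ids for good, has, control
                         torch.tensor([5, 8, 2]),     # JJ, VBZ, NN
                         torch.tensor([0, 0, 0]))     # NULL labels
print(x.shape)   # torch.Size([450]) = 3 tokens * 3 feature types * d=50
```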
Second win: Deep Learning classifiers are non-linear classifiers
• A softmax classifier assigns classes y ∈ C based on inputs x ∈ ℝ^d via the probability:
  p(y | x) = exp(W_y · x) / Σ_{c ∈ C} exp(W_c · x)
• Traditional ML classifiers (including Naïve Bayes, SVMs, logistic regression and the softmax classifier) are not very powerful classifiers: they only give linear decision boundaries
• But neural networks can use multiple layers to learn much more complex nonlinear decision boundaries
Neural Dependency Parser Model Architecture
(A simple feed-forward neural network multi-class classifier)
Input layer x: look up the embeddings of the chosen tokens and concatenate them
Log loss (cross-entropy error) will be back-propagated to the embeddings
Wins: distributed representations, and a non-linear classifier!
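A sketch of such a feed-forward classifier in PyTorch; the layer sizes are illustrative, and ReLU is used here as a simplification (Chen and Manning's original model used a cube activation).

```python
import torch
import torch.nn as nn

class TransitionClassifier(nn.Module):
    """Feed-forward multi-class classifier over parser transitions (illustrative sizes)."""
    def __init__(self, input_dim=450, hidden_dim=200, n_transitions=3):
        super().__init__()
        self.hidden = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()                         # simplification; the paper uses x^3
        self.out = nn.Linear(hidden_dim, n_transitions)

    def forward(self, x):
        return self.out(self.relu(self.hidden(x)))   # logits; softmax is inside the loss

model = TransitionClassifier()
loss_fn = nn.CrossEntropyLoss()                       # log loss, back-propagated to the embeddings
logits = model(torch.randn(32, 450))                  # a batch of configuration vectors
loss = loss_fn(logits, torch.randint(0, 3, (32,)))
loss.backward()
```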
Dependency parsing for sentence structure
Chen & Manning (2014) showed that neural networks can accurately
determine the structure of sentences, supporting meaning interpretation
This paper was the first simple and successful neural dependency parser
Further developments in transition-based neural dependency parsing
This work was further developed and improved by others, including in particular at Google
• Bigger, deeper networks with better tuned hyperparameters
• Beam search
• Global, conditional random field (CRF)-style inference over the decision sequence
Leading to SyntaxNet and the Parsey McParseFace model (2016):
“The World’s Most Accurate Parser”
https://research.googleblog.com/2016/05/announcing-syntaxnet-worlds-most.html
Method                 UAS     LAS   (PTB WSJ SD 3.3)
Chen & Manning 2014    92.0    89.7
Weiss et al. 2015      93.99   92.05
Andor et al. 2016      94.61   92.79
Graph-based dependency parsers
• Compute a score for every possible dependency for each word
• Doing this well requires good “contextual” representations of each word token,
which we will develop in coming lectures
[Figure: example scores for candidate heads of a word, e.g. 0.3, 0.5, 0.8, 2.0 – the parser chooses the highest-scoring head.]
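As a hedged sketch of the idea (not an actual graph-based parser): score every (head, dependent) pair with some scoring function – here an arbitrary bilinear form over made-up word vectors – and greedily pick the best head for each word. Real graph-based parsers instead find the maximum-spanning tree over these scores.

```python
import numpy as np

def greedy_graph_parse(word_vectors, root_vector, W):
    """word_vectors: (n, d) contextual vectors; root_vector: (d,); W: (d, d) scoring matrix.
    Scores every candidate head for every word and greedily picks the argmax head
    (index 0 = ROOT, index i > 0 = word i)."""
    candidates = np.vstack([root_vector, word_vectors])   # (n+1, d): ROOT plus all words
    scores = candidates @ W @ word_vectors.T              # scores[h, m] = score of head h for word m
    np.fill_diagonal(scores[1:], -np.inf)                 # a word cannot be its own head
    return scores.argmax(axis=0)                          # best head index for each word

heads = greedy_graph_parse(np.random.randn(5, 8), np.random.randn(8), np.random.randn(8, 8))
print(heads)   # one head index per word
```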