CS224N 2025 Lecture 04: Dependency Parsing

The document outlines a lecture on Dependency Parsing within the context of Natural Language Processing with Deep Learning, covering topics such as syntactic structure, dependency grammar, and various parsing methods. Key learnings include the importance of understanding linguistic structure for effective language interpretation and the implementation of a neural dependency parser using PyTorch. The lecture also emphasizes the significance of backpropagation and gradient computation in neural networks.


Natural Language Processing

with Deep Learning


CS224N/Ling284

Diyi Yang
Lecture 4: Dependency Parsing
Lecture Plan
Finish backpropagation (10 mins)
Syntactic Structure and Dependency parsing
1. Syntactic Structure: Consistency and Dependency (20 mins)
2. Dependency Grammar and Treebanks (15 mins)
3. Transition-based dependency parsing (15 mins)
4. Neural dependency parsing (20 mins)

Key Learnings: Explicit linguistic structure and how a neural net can decide it

Reminders/comments:
• In Assignment 2, you build a neural dependency parser using PyTorch!
• Come to the PyTorch tutorial, Friday, 1:30pm Gates B01
• Final project discussions – come meet with us; focus of Tuesday class in week 4
2
Back-Prop in General Computation Graph
1. Fprop: visit nodes in topological sort order
   • Compute the value of each node given its predecessors (single scalar output at the end)
2. Bprop:
   • Initialize the output gradient = 1
   • Visit nodes in reverse order:
     Compute the gradient wrt each node using the gradients wrt its successors:
     ∂z/∂x = Σᵢ (∂z/∂yᵢ)(∂yᵢ/∂x), where {y₁, …, yₙ} = successors of x

Done correctly, the big O() complexity of fprop and bprop is the same.
In general, our nets have a regular layer structure, so we can use matrices and Jacobians…
3
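To make the fprop/bprop recipe concrete, here is a minimal sketch of a scalar computation graph in Python (not from the slides; the tiny expression z = x·y + x and the class names are purely illustrative):

```python
# Minimal computation-graph sketch: fprop in topological order,
# bprop in reverse order, accumulating gradients from successors.
# Illustrative example graph: z = x*y + x  (all scalars).

class Node:
    def __init__(self, parents=()):
        self.parents = parents   # predecessor nodes
        self.value = None        # set during fprop
        self.grad = 0.0          # dz/d(this node), set during bprop

class Input(Node):
    def __init__(self, value):
        super().__init__()
        self.value = value
    def forward(self): pass
    def backward(self): pass     # leaf: nothing to propagate further

class Mul(Node):
    def forward(self):
        a, b = self.parents
        self.value = a.value * b.value
    def backward(self):
        a, b = self.parents
        a.grad += self.grad * b.value   # local gradient d(ab)/da = b
        b.grad += self.grad * a.value   # local gradient d(ab)/db = a

class Add(Node):
    def forward(self):
        a, b = self.parents
        self.value = a.value + b.value
    def backward(self):
        a, b = self.parents
        a.grad += self.grad * 1.0
        b.grad += self.grad * 1.0

x, y = Input(3.0), Input(4.0)
m = Mul(parents=(x, y))
z = Add(parents=(m, x))
topo = [x, y, m, z]              # topological order

for node in topo:                # 1. fprop
    node.forward()
z.grad = 1.0                     # 2. bprop: initialize output gradient = 1
for node in reversed(topo):
    node.backward()

print(z.value)   # 15.0
print(x.grad)    # dz/dx = y + 1 = 5.0
print(y.grad)    # dz/dy = x = 3.0
```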
Automatic Differentiation

• The gradient computation can be


automatically inferred from the symbolic
expression of the fprop
• Each node type needs to know how to
compute its output and how to compute
the gradient wrt its inputs given the
gradient wrt its output
• Modern DL frameworks (Tensorflow,
PyTorch, etc.) do backpropagation for
you but mainly leave layer/node writer
to hand-calculate the local derivative
4
Backprop Implementations

5
Implementation: forward/backward API

6
Implementation: forward/backward API

7
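The code shown on these two slides is not reproduced here; as a rough sketch of the kind of forward/backward API being described (class and method names are assumptions, not the slides' actual code), a node caches what it needs in forward and converts the upstream gradient into input gradients in backward:

```python
class MultiplyGate:
    """Sketch of a node with a forward/backward API: forward caches
    whatever backward will need; backward turns the upstream gradient
    into gradients wrt each input via the local gradients."""
    def forward(self, x, y):
        self.x, self.y = x, y          # cache inputs for the backward pass
        return x * y
    def backward(self, dz):            # dz = upstream gradient dL/d(output)
        dx = dz * self.y               # local gradient d(xy)/dx = y
        dy = dz * self.x               # local gradient d(xy)/dy = x
        return dx, dy

gate = MultiplyGate()
out = gate.forward(3.0, 4.0)
dx, dy = gate.backward(1.0)            # pretend dL/dout = 1
print(out, dx, dy)                     # 12.0 4.0 3.0
```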
Manual Gradient checking: Numeric Gradient

• For small h (≈ 1e-4),  f′(θ) ≈ ( f(θ + h) − f(θ − h) ) / 2h

• Easy to implement correctly
• But approximate and very slow:
  • You have to recompute f for every parameter of your model

• Useful for checking your implementation


• In the old days, we hand-wrote everything, doing this everywhere was the key test
• Now much less needed; you can use it to check layers are correctly implemented

8
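A small sketch of the numeric gradient check described above, using the centered-difference estimate with h ≈ 1e-4 (the test function here is an arbitrary illustration):

```python
import numpy as np

def numeric_gradient(f, theta, h=1e-4):
    """Approximate df/dtheta_i with the centered difference
    (f(theta + h*e_i) - f(theta - h*e_i)) / (2h) for each parameter i.
    Slow: two evaluations of f per parameter."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        old = theta.flat[i]
        theta.flat[i] = old + h
        f_plus = f(theta)
        theta.flat[i] = old - h
        f_minus = f(theta)
        theta.flat[i] = old                  # restore the parameter
        grad.flat[i] = (f_plus - f_minus) / (2 * h)
    return grad

# Check an analytic gradient against the numeric one
theta = np.random.randn(5)
f = lambda t: np.sum(t ** 2)                 # analytic gradient: 2*t
print(np.max(np.abs(numeric_gradient(f, theta) - 2 * theta)))   # tiny, ~1e-8
```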
Summary

We’ve mastered the core technology of neural nets!

• Backpropagation: recursively (and hence efficiently) apply the chain rule along the computation graph
  • [downstream gradient] = [upstream gradient] x [local gradient]

• Forward pass: compute results of operations and save intermediate values
• Backward pass: apply chain rule to compute gradients
9
Why learn all these details about gradients?
• Modern deep learning frameworks compute gradients for you!
• Come to the PyTorch introduction this Friday!

• But why take a class on compilers or systems when they are implemented for you?
• Understanding what is going on under the hood is useful!

• Backpropagation doesn’t always work perfectly out of the box


• Understanding why is crucial for debugging and improving models
• See Karpathy article (in syllabus):
• https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b
• Example in future lecture: exploding and vanishing gradients

10
Lecture Plan
✓ Finish backpropagation (10 mins)
Syntactic Structure and Dependency parsing
1. Syntactic Structure: Consistency and Dependency (20 mins)
2. Dependency Grammar and Treebanks (15 mins)
3. Transition-based dependency parsing (15 mins)
4. Neural dependency parsing (20 mins)

Key Learnings: Explicit linguistic structure and how a neural net can decide it

Reminders/comments:
• In Assignment 2, you build a neural dependency parser using PyTorch!
• Come to the PyTorch tutorial, Friday, 1:30pm Gates B01
• Final project discussions – come meet with us; focus of Tuesday class in week 4
11
1. The linguistic structure of sentences – two views: Constituency
= phrase structure grammar = context-free grammars (CFGs)
Phrase structure organizes words into nested constituents

Starting unit: words


the, cat, cuddly, by, door

Words combine into phrases


the cuddly cat, by the door

Phrases can combine into bigger phrases


the cuddly cat by the door

12
The linguistic structure of sentences – two views: Constituency =
phrase structure grammar = context-free grammars (CFGs)
Phrase structure organizes words into nested constituents.

the cat
a dog
large in a crate
barking on the table
cuddly by the door
large barking
talk to
walked behind

14
Two views of linguistic structure: Dependency structure
• Dependency structure shows which words depend on (modify, attach to, or are
arguments of) which other words.

Look in the large crate in the kitchen by the door

16
Why do we need sentence structure?

Humans communicate complex ideas by composing words together into bigger units to convey complex meanings

Human listeners need to work out what modifies [attaches to] what

A model needs to understand sentence structure in order to be able to interpret language correctly

18
Prepositional phrase attachment ambiguity

19
Prepositional phrase attachment ambiguity

Scientists count [whales from space]    ("from space" attaches to "whales": they count the whales that are in space)

Scientists count whales [from space]    ("from space" attaches to "count": the counting is done from space)

20
PP attachment ambiguities multiply

• A key parsing decision is how we ‘attach’ various constituents
  • PPs, adverbial or participial phrases, infinitives, coordinations, etc.

• Catalan numbers: Cn = (2n)!/[(n+1)!n!]
  • An exponentially growing series, which arises in many tree-like contexts:
  • E.g., the number of possible triangulations of a polygon with n+2 sides
  • Turns up in triangulation of probabilistic graphical models (CS228)….
21
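As a quick sanity check of the Catalan formula above, a few lines compute the first values of the series:

```python
from math import factorial

def catalan(n):
    # C_n = (2n)! / ((n+1)! * n!)
    return factorial(2 * n) // (factorial(n + 1) * factorial(n))

print([catalan(n) for n in range(8)])   # [1, 1, 2, 5, 14, 42, 132, 429]
```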
Coordination scope ambiguity

[Shuttle veteran and longtime NASA executive] Fred Gregory appointed to board
  (one person: Fred Gregory, who is both a shuttle veteran and a longtime NASA executive)

[Shuttle veteran] and [longtime NASA executive Fred Gregory] appointed to board
  (two people: an unnamed shuttle veteran, and Fred Gregory)

23
Coordination scope ambiguity

24
Adjectival/Adverbial Modifier Ambiguity

25
Verb Phrase (VP) attachment ambiguity

26
Dependency paths help extract semantic interpretation –
simple practical example: extracting protein-protein interaction

Sentence: "The results demonstrated that KaiC interacts rhythmically with SasA, KaiA, and KaiB."
(figure: dependency parse of the sentence, with relations such as nsubj, ccomp, mark, advmod, det, case, nmod:with, conj:and, cc)

Interaction patterns extracted along dependency paths:
KaiC ←nsubj– interacts –nmod:with→ SasA
KaiC ←nsubj– interacts –nmod:with→ SasA –conj:and→ KaiA
KaiC ←nsubj– interacts –nmod:with→ SasA –conj:and→ KaiB

[Erkan et al. EMNLP 07, Fundel et al. 2007, etc.]

27
2. Dependency Grammar and Dependency Structure
Dependency syntax postulates that syntactic structure consists of relations between
lexical items, normally binary asymmetric relations (“arrows”) called dependencies

Example: "Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas."

(figure: dependency tree of the sentence, rooted at "submitted";
 submitted → Bills, were, Brownback; Bills → ports; ports → on, and, immigration;
 Brownback → by, Senator, Republican; Republican → Kansas; Kansas → of)
28
Dependency Grammar and Dependency Structure
Dependency syntax postulates that syntactic structure consists of relations between
lexical items, normally binary asymmetric relations (“arrows”) called dependencies

The arrows are commonly typed with the name of grammatical relations (subject, prepositional object, apposition, etc.)

(same sentence as before; figure: its dependency tree with labeled arcs such as
 nsubj:pass, aux, obl, nmod, case, appos, flat, cc, conj)
29
Dependency Grammar and Dependency Structure
Dependency syntax postulates that syntactic structure consists of relations between
lexical items, normally binary asymmetric relations (“arrows”) called dependencies

An arrow connects a head with a dependent

Usually, dependencies form a tree (a connected, acyclic, single-root graph)

(same sentence and labeled dependency tree as on the previous slide)
30
Pāṇini’s grammar (c. 5th century BCE)

Gallery: http://wellcomeimages.org/indexplus/image/L0032691.html
CC BY 4.0 File:Birch bark MS from Kashmir of the Rupavatra Wellcome L0032691.jpg
But this comes from much later – originally the grammar was oral
31
Dependency Grammar/Parsing History
• The idea of dependency structure goes back a long way
• To Pāṇini’s grammar (c. 5th century BCE)
• Basic approach of 1st millennium Arabic grammarians
• Constituency/context-free grammar is a new-fangled invention
• 20th century invention (R.S. Wells, 1947; then Chomsky 1953, etc.)
• Modern dependency work is often sourced to Lucien Tesnière (1959)
• Was dominant approach in “East” in 20th Century (Russia, China, …)
• Good for free-er word order, inflected languages like Russian (or Latin!)
• Used in some of the earliest parsers in NLP, even in the US:
• David Hays, one of the founders of U.S. computational linguistics, built early (first?)
dependency parser (Hays 1962) and published on dependency grammar in Language

32
Dependency Grammar and Dependency Structure

ROOT Discussion of the outstanding issues was completed .

• Some people draw the arrows one way; some the other way!
• Tesnière had them point from head to dependent – we follow that convention
• We usually add a fake ROOT so every word is a dependent of precisely 1 other node

33
The rise of annotated data & Universal Dependencies treebanks
Brown corpus (1967; PoS tagged 1979); Lancaster-IBM Treebank (starting late 1980s);
Marcus et al. 1993, The Penn Treebank, Computational Linguistics;
Universal Dependencies: http://universaldependencies.org/

34
The rise of annotated data
Starting off, building a treebank seems a lot slower and less useful than writing a grammar
(by hand)

But a treebank gives us many things


• Reusability of the labor
• Many parsers, part-of-speech taggers, etc. can be built on it
• Valuable resource for linguistics
• Broad coverage, not just a few intuitions
• Frequencies and distributional information
• A way to evaluate NLP systems

35
Dependency Conditioning Preferences
What are the straightforward sources of information for dependency parsing?
1. Bilexical affinities The dependency [discussion → issues] is plausible
2. Dependency distance Most dependencies are between nearby words
3. Intervening material Dependencies rarely span intervening verbs or punctuation
4. Valency of heads How many dependents on which side are usual for a head?

ROOT Discussion of the outstanding issues was completed .


36
Dependency Parsing
• A sentence is parsed by choosing for each word what other word (including ROOT) it is
a dependent of

• Usually some constraints:


• Only one word is a dependent of ROOT
• Don’t want cycles A → B, B → A
• This makes the dependencies a tree
• Final issue is whether arrows can cross (be non-projective) or not

ROOT I ’ll give a talk tomorrow on neural networks


37
Projectivity
• Definition of a projective parse: There are no crossing dependency arcs when the
words are laid out in their linear order, with all arcs above the words
• Dependencies corresponding to a CFG tree must be projective
• I.e., by forming dependencies by taking 1 child of each category as head
• Most syntactic structure is projective like this, but dependency theory normally does
allow non-projective structures to account for displaced constituents
• You can’t easily get the semantics of certain constructions right without these
nonprojective dependencies

Who did Bill buy the coffee from yesterday ?


38
3. Methods of Dependency Parsing
1. Dynamic programming
Eisner (1996) gives a clever algorithm with complexity O(n3), by producing parse items
with heads at the ends rather than in the middle
2. Graph algorithms
You create a Minimum Spanning Tree for a sentence
McDonald et al.’s (2005) O(n2) MSTParser scores dependencies independently using an
ML classifier (he uses MIRA, for online learning, but it can be something else)
Neural graph-based parser: Dozat and Manning (2017) et seq. – very successful!
3. Constraint Satisfaction
Edges are eliminated that don’t satisfy hard constraints. Karlsson (1990), etc.
4. “Transition-based parsing” or “deterministic dependency parsing”
Greedy choice of attachments guided by good machine learning classifiers
E.g., MaltParser (Nivre et al. 2008). Has proven highly effective. And fast.
39
Greedy transition-based parsing [Nivre 2003]
• A simple form of a greedy discriminative dependency parser
• The parser does a sequence of bottom-up actions
• Roughly like “shift” or “reduce” in a shift-reduce parser – CS143, anyone?? – but the
“reduce” actions are specialized to create dependencies with head on left or right
• The parser has:
• a stack σ, written with top to the right
• which starts with the ROOT symbol
• a buffer β, written with top to the left
• which starts with the input sentence
• a set of dependency arcs A
• which starts off empty
• a set of actions

40
Basic transition-based dependency parser

Start: σ = [ROOT], β = w1, …, wn , A = ∅

1. Shift        σ, wi|β, A  ➔  σ|wi, β, A
2. Left-Arc_r   σ|wi|wj, β, A  ➔  σ|wj, β, A ∪ {r(wj, wi)}
3. Right-Arc_r  σ|wi|wj, β, A  ➔  σ|wi, β, A ∪ {r(wi, wj)}
Finish: σ = [w], β = ∅

41
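Below is a minimal Python sketch of this arc-standard transition system (not the assignment's code); it simply applies a given action sequence, whereas a real parser must predict each action:

```python
def parse_with_oracle(words, actions):
    """Apply a sequence of arc-standard transitions.
    words: sentence tokens; actions: list of (action, relation) pairs,
    where action is "S" (Shift), "LA" (Left-Arc_r), or "RA" (Right-Arc_r)."""
    stack = ["ROOT"]            # sigma, top at the right
    buffer = list(words)        # beta, front at the left
    arcs = set()                # A
    for act, rel in actions:
        if act == "S":                       # Shift
            stack.append(buffer.pop(0))
        elif act == "LA":                    # Left-Arc_r
            wj = stack.pop(); wi = stack.pop()
            arcs.add((rel, wj, wi))          # r(wj, wi): head wj, dependent wi
            stack.append(wj)
        elif act == "RA":                    # Right-Arc_r
            wj = stack.pop(); wi = stack.pop()
            arcs.add((rel, wi, wj))          # r(wi, wj): head wi, dependent wj
            stack.append(wi)
    assert len(stack) == 1 and not buffer    # finish condition
    return arcs

arcs = parse_with_oracle(
    ["I", "ate", "fish"],
    [("S", None), ("S", None), ("LA", "nsubj"),
     ("S", None), ("RA", "obj"), ("RA", "root")])
print(arcs)   # three arcs: nsubj(ate → I), obj(ate → fish), root(ROOT → ate)
```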
Arc-standard transition-based parser
(there are other transition schemes …)
Analysis of “I ate fish”
(transition definitions as on the previous slide)

Start:  σ = [ROOT]            β = [I, ate, fish]    A = ∅
Shift:  σ = [ROOT, I]         β = [ate, fish]
Shift:  σ = [ROOT, I, ate]    β = [fish]
42
Arc-standard transition-based parser
Analysis of “I ate fish” (continued)

Left-Arc:   σ = [ROOT, ate]         β = [fish]   A += nsubj(ate → I)
Shift:      σ = [ROOT, ate, fish]   β = []
Right-Arc:  σ = [ROOT, ate]         β = []       A += obj(ate → fish)
Right-Arc:  σ = [ROOT]              β = []       A += root([root] → ate)
Finish:     A = { nsubj(ate → I), obj(ate → fish), root([root] → ate) }

Nota bene: in this example I’ve made the “correct” next transition at each step.
But a parser has to work this out – by exploring or inferring!
43
MaltParser [Nivre and Hall 2005]
• We have left to explain how we choose the next action
• Answer: Stand back, I know machine learning!
• Each action is predicted by a discriminative classifier (e.g., softmax classifier) over each
legal move
• Max of 3 untyped choices (max of |R| × 2 + 1 when typed)
• Features: top of stack word, POS; first in buffer word, POS; etc.
• There is NO search (in the simplest form)
• But you can profitably do a beam search if you wish (slower but better):
• You keep k good parse prefixes at each time step
• The model’s accuracy is fractionally below the state of the art in dependency parsing,
but
• It provides very fast linear time parsing, with high accuracy – great for parsing the web
44
Conventional Feature Representation

binary, sparse feature vector:  0 0 0 1 0 0 1 0 … 0 0 1 0
dim = 10⁶ – 10⁷

Feature templates: usually a combination of 1–3 elements from the configuration

Indicator features

45
Evaluation of Dependency Parsing: (labeled) dependency accuracy

Acc = (# correct deps) / (# of deps)

Example:  ROOT  She  saw  the  video  lecture
            0    1    2    3     4       5

UAS = 4 / 5 = 80%
LAS = 2 / 5 = 40%

Gold                          Parsed
1  2  She      nsubj          1  2  She      nsubj
2  0  saw      root           2  0  saw      root
3  5  the      det            3  4  the      det
4  5  video    nn             4  5  video    nsubj
5  2  lecture  obj            5  2  lecture  ccomp

46
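A small sketch of how UAS and LAS could be computed from per-word (head, label) predictions; the data format is an assumption, but the numbers reproduce the example above:

```python
def attachment_scores(gold, pred):
    """gold, pred: lists of (head_index, label) per word, in order.
    UAS: fraction of words with the correct head.
    LAS: fraction with both the correct head and the correct label."""
    assert len(gold) == len(pred)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)
    las = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return uas, las

# "She saw the video lecture" example from above
gold = [(2, "nsubj"), (0, "root"), (5, "det"), (5, "nn"),    (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (4, "det"), (5, "nsubj"), (2, "ccomp")]
print(attachment_scores(gold, pred))   # (0.8, 0.4)
```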
4. Why do we gain from a neural dependency parser?
Indicator Features Revisited
Categorical (indicator) features are:
• Problem #1: sparse
• Problem #2: incomplete
• Problem #3: expensive to compute
  • More than 95% of parsing time is consumed by feature computation

Neural approach: learn a dense and compact feature representation

dense vector:  0.1  0.9  -0.2  0.3  …  -0.1  -0.5
dim ≈ 1000

48
A neural dependency parser [Chen and Manning 2014]
• Results on English parsing to Stanford Dependencies:
• Unlabeled attachment score (UAS) = head
• Labeled attachment score (LAS) = head and label

Parser UAS LAS sent. / s


MaltParser 89.8 87.2 469
MSTParser 91.4 88.1 10
TurboParser 92.3 89.6 8
C & M 2014 92.0 89.7 654

49
First win: Distributed Representations

• We represent each word as a d-dimensional dense vector (i.e., word embedding)
  • Similar words are expected to have close vectors.

• Meanwhile, part-of-speech tags (POS) and dependency labels are also represented as d-dimensional vectors.
  • The smaller discrete sets also exhibit many semantic similarities:
    • NNS (plural noun) should be close to NN (singular noun).
    • nummod (numerical modifier) should be close to amod (adjective modifier).

(figure: an embedding space in which similar items cluster, e.g., was / were / is, come / go, good)

50
Extracting Tokens & vector representations from configuration

• We extract a set of tokens based on the stack / buffer positions:

            word     POS    dep.
  s1        good     JJ     ∅
  s2        has      VBZ    ∅
  b1        control  NN     ∅
  lc(s1)    ∅        ∅      ∅
  rc(s1)    ∅        ∅      ∅
  lc(s2)    He       PRP    nsubj
  rc(s2)    ∅        ∅      ∅

A concatenation of the vector representations of all of these is the neural representation of a configuration

51
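As a rough sketch of turning such a configuration into a single input vector (the embedding tables and token list here are illustrative assumptions, not the assignment's actual interface):

```python
import numpy as np

d = 50                                     # embedding dimension
rng = np.random.default_rng(0)
word_emb = {w: rng.standard_normal(d) for w in ["<NULL>", "He", "has", "good", "control"]}
pos_emb  = {p: rng.standard_normal(d) for p in ["<NULL>", "PRP", "VBZ", "JJ", "NN"]}
dep_emb  = {r: rng.standard_normal(d) for r in ["<NULL>", "nsubj"]}

def configuration_vector(tokens):
    """tokens: (word, POS, dep) triples for positions such as
    s1, s2, b1, lc(s1), lc(s2), ...; missing items are "<NULL>".
    Returns the concatenation of all their embeddings."""
    parts = []
    for word, pos, dep in tokens:
        parts += [word_emb[word], pos_emb[pos], dep_emb[dep]]
    return np.concatenate(parts)

tokens = [("good", "JJ", "<NULL>"), ("has", "VBZ", "<NULL>"),
          ("control", "NN", "<NULL>"), ("<NULL>", "<NULL>", "<NULL>"),
          ("He", "PRP", "nsubj")]
x = configuration_vector(tokens)
print(x.shape)    # (750,) = 5 tokens * 3 embeddings * 50 dims
```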
Second win: Deep Learning classifiers are non-linear classifiers
• A softmax classifier assigns classes y ∈ C based on inputs x ∈ ℝᵈ via the probability:

  p(y | x) = exp(Wy x) / Σc∈C exp(Wc x)
• Traditional ML classifiers (including Naïve Bayes, SVMs, logistic regression and softmax
classifier) are not very powerful classifiers: they only give linear decision boundaries
• But neural networks can use multiple layers to learn much more complex nonlinear
decision boundaries

52
Neural Dependency Parser Model Architecture
(A simple feed-forward neural network multi-class classifier)
Softmax probabilities over the transitions { Shift, Left-Arc_r, Right-Arc_r }

Output layer:  y = softmax(Uh + b2)
Hidden layer:  h = ReLU(Wx + b1)
Input layer:   x = lookup + concat (of word, POS, and dependency-label embeddings)

Log loss (cross-entropy error) will be back-propagated to the embeddings.

The hidden layer re-represents the input, moving it into an intermediate vector space where it can be easily classified with a (linear) softmax.

Wins:
Distributed representations!
Non-linear classifier!

53
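A minimal PyTorch sketch of this architecture (dimensions, feature counts, and class names are illustrative; the assignment's actual model differs in details):

```python
import torch
import torch.nn as nn

class NeuralDependencyParser(nn.Module):
    """Feed-forward classifier over parser configurations:
    embedding lookup + concat -> ReLU hidden layer -> softmax over
    {Shift, Left-Arc_r, Right-Arc_r} transitions."""
    def __init__(self, vocab_size, n_features=36, embed_dim=50,
                 hidden_dim=200, n_transitions=3):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(n_features * embed_dim, hidden_dim)   # W, b1
        self.output = nn.Linear(hidden_dim, n_transitions)            # U, b2

    def forward(self, feature_ids):                    # (batch, n_features) int ids
        x = self.embeddings(feature_ids)               # lookup
        x = x.view(feature_ids.size(0), -1)            # concat into one vector
        h = torch.relu(self.hidden(x))                 # h = ReLU(Wx + b1)
        return self.output(h)                          # logits; softmax is folded into the loss

model = NeuralDependencyParser(vocab_size=10_000)
logits = model(torch.randint(0, 10_000, (8, 36)))      # a batch of 8 configurations
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 3, (8,)))   # log loss
loss.backward()                                        # gradients flow back into the embeddings
print(logits.shape, loss.item())
```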
Dependency parsing for sentence structure
Chen & Manning (2014) showed that neural networks can accurately
determine the structure of sentences, supporting meaning interpretation

This paper was the first simple and successful neural dependency parser

The dense representations (and non-linear classifier) let it outperform other greedy parsers in both accuracy and speed

54
Further developments in transition-based neural dependency parsing

This work was further developed and improved by others, including in particular at Google
• Bigger, deeper networks with better tuned hyperparameters
• Beam search
• Global, conditional random field (CRF)-style inference over the decision sequence
Leading to SyntaxNet and the Parsey McParseFace model (2016):
“The World’s Most Accurate Parser”
https://research.googleblog.com/2016/05/announcing-syntaxnet-worlds-most.html
Method UAS LAS (PTB WSJ SD 3.3)
Chen & Manning 2014 92.0 89.7
Weiss et al. 2015 93.99 92.05
Andor et al. 2016 94.61 92.79

55
Graph-based dependency parsers
• Compute a score for every possible dependency for each word
• Doing this well requires good “contextual” representations of each word token,
which we will develop in coming lectures

0.5 0.8

0.3 2.0

ROOT The big cat sat

e.g., picking the head for “big”


56
Graph-based dependency parsers
• Compute a score for every possible dependency (choice of head) for each word
• Doing this well requires more than just knowing the two words
• We need good “contextual” representations of each word token, which we will
develop in the coming lectures
• Repeat the same process for each other word; find the best parse (MST algorithm)
0.5 0.8

0.3 2.0

ROOT The big cat sat


e.g., picking the head for “big”
57
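As a rough sketch of the graph-based idea: score every candidate head for every word, then decode. Here the scores are random placeholders and decoding is a greedy argmax purely for illustration; a real parser (e.g., Dozat and Manning) uses a learned biaffine scorer over contextual representations and MST decoding:

```python
import numpy as np

words = ["ROOT", "The", "big", "cat", "sat"]
n = len(words)

# score[h, d] = how good it looks for word h to be the head of word d.
# In a real parser these come from a neural scorer over contextual
# representations; random numbers here are purely illustrative.
rng = np.random.default_rng(0)
score = rng.standard_normal((n, n))
np.fill_diagonal(score, -np.inf)      # a word cannot head itself
score[:, 0] = -np.inf                 # ROOT never gets a head

# Greedy decoding: pick the best-scoring head for every word.
# (This can produce cycles; MST decoding guarantees a tree.)
heads = score[:, 1:].argmax(axis=0)
for d, h in enumerate(heads, start=1):
    print(f"{words[h]} -> {words[d]}")
```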
A Neural graph-based dependency parser
[Dozat and Manning 2017; Dozat, Qi, and Manning 2017]

• This paper revived interest in graph-based dependency parsing in a neural world


• Designed a biaffine scoring model for neural dependency parsing
• Also crucially uses a neural sequence model, something we discuss later
• Really great results!
• But slower than the simple neural transition-based parsers
• There are n2 possible dependencies in a sentence of length n

Method UAS LAS (PTB WSJ SD 3.3)


Chen & Manning 2014 92.0 89.7
Weiss et al. 2015 93.99 92.05
Andor et al. 2016 94.61 92.79
Dozat & Manning 2017 95.74 94.08
58
