NLP 4
Syntax Analysis
Mrs. Priyanka Bhoir
Steps in NLP
• Morphological Analysis
• Syntactic Analysis (e.g., parsing "John ate the apple")
• Semantic Analysis
POS Tagging
Annotate each word in a sentence with a part-of-speech.
POS Tags / Word Classes
o 9 traditional word classes of parts of speech
◦ Noun, verb, adjective, preposition, adverb, article, pronoun, conjunction, interjection
N noun chair, bandwidth, pacing
V verb study, debate, munch
ADJ adjective purple, tall, ridiculous
ADV adverb unfortunately, slowly
P preposition of, by, to
PRO pronoun I, me, mine
DET determiner the, a, that, those
Defining POS Tagging
The process of assigning a part-of-speech tag to each word in an input text or corpus.
WORDS: the koala put the keys on the table
TAGS: N, V, P, DET
Tagged: the/DET koala/N put/V the/DET keys/N on/P the/DET table/N
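As a quick illustration of what a tagger produces, here is a minimal sketch using NLTK's off-the-shelf pretrained tagger; it assumes the nltk package and its tokenizer and tagger resources are installed, and the exact tags can vary by model version.

# Minimal sketch: tag a sentence with NLTK's pretrained tagger.
# Assumes: pip install nltk, plus nltk.download('punkt') and
# nltk.download('averaged_perceptron_tagger') have been run once.
import nltk

tokens = nltk.word_tokenize("The koala put the keys on the table")
print(nltk.pos_tag(tokens))
# Roughly: [('The', 'DT'), ('koala', 'NN'), ('put', 'VBD'), ('the', 'DT'),
#           ('keys', 'NNS'), ('on', 'IN'), ('the', 'DT'), ('table', 'NN')]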
Part-of-Speech Tagsets
There are various tag sets to choose from.
•The choice of the tag set depends on the nature of the application.
–We may use a small tag set (more general tags) or
–a large tag set (finer tags).
•Some widely used part-of-speech tag sets:
–Penn Treebank has 45 tags
–Brown Corpus has 87 tags
–C7 tag set has 146 tags
•In a tagged corpus, each word is associated with a tag from the used tag set.
Penn Treebank Tagset
Tag Ambiguity
Tagging Whole Sentences with POS is Hard
Ambiguous POS contexts
◦ E.g., Time flies like an arrow. (time can be a noun or a verb, flies a noun or a verb, like a verb or a preposition)
How do we disambiguate POS?
❑ Tagging is a disambiguation task: words are ambiguous (they have more than one
possible part of speech), and the goal is to find the correct tag for the situation.
❑ Many words have only one POS tag (e.g., is, Mary, smallest).
❑ Transformation-based tagging
◦ Learned rules (statistical and linguistic)
◦ E.g., Brill tagger
Rule-Based Tagging
o The rule-based approach uses handcrafted sets of rules to tag the input sentence.
o Stage 1:
o Typically…start with a dictionary of words and possible tags
o Assign all possible tags to words using the dictionary
o Stage 2:
o Write rules by hand to selectively remove tags
o Stop when each word has exactly one (presumably correct) tag
Rules based on context
Please book that flight.   (book = verb)
I bought a book.           (book = noun)
•Rule-1: if the previous word is "to", then eliminate all noun tags.
•Rule-2: if the previous tag is an article, then eliminate all verb tags.
(Rules look at a context window around the ambiguous word "book": words W(n-1), W(n), W(n+1) and tags T(n-1), T(n), T(n+1).)
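The sketch below walks through both stages with a hypothetical toy dictionary and the two rules above; the dictionary entries and tag choices are made up for illustration and are not taken from any real tagger.

# Toy two-stage rule-based tagger: dictionary lookup, then rule-based pruning.
# The dictionary and rules are illustrative only.
TAG_DICT = {
    "please": {"UH", "VB"},
    "book":   {"NN", "VB"},
    "that":   {"DT", "IN"},
    "flight": {"NN"},
    "i":      {"PRP"},
    "bought": {"VBD"},
    "a":      {"DT"},
    "to":     {"TO"},
}

def tag(words):
    # Stage 1: assign all possible tags from the dictionary (default NN if unknown).
    candidates = [set(TAG_DICT.get(w.lower(), {"NN"})) for w in words]
    # Stage 2: prune with hand-written contextual rules.
    for i in range(1, len(words)):
        prev = words[i - 1].lower()
        # Rule-1: after "to", eliminate noun readings (if another reading remains).
        if prev == "to" and candidates[i] - {"NN"}:
            candidates[i] -= {"NN"}
        # Rule-2: after an article/determiner, eliminate verb readings.
        if "DT" in TAG_DICT.get(prev, set()) and candidates[i] - {"VB", "VBD"}:
            candidates[i] -= {"VB", "VBD"}
    return list(zip(words, candidates))

print(tag("Please book that flight".split()))  # "book" stays ambiguous: no rule fires
print(tag("I bought a book".split()))          # Rule-2 leaves only the NN reading of "book"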
Start with a POS Dictionary
Step 1: Assign All Possible POS to Each Word
She       → PRP
promised  → VBD, VBN
to        → TO
back      → VB, JJ, RB, NN
the       → DT
bill      → NN, VB
Step 2: Apply Rules Eliminating Some POS
E.g., eliminate VBN if VBD is an option when VBN|VBD follows "<start> PRP"
She       → PRP
promised  → VBD, VBN   (the rule applies here)
to        → TO
back      → VB, JJ, RB, NN
the       → DT
bill      → NN, VB
Apply Rules Eliminating Some POS
After applying the rule, the VBN reading of "promised" is eliminated:
She       → PRP
promised  → VBD
to        → TO
back      → VB, JJ, RB, NN
the       → DT
bill      → NN, VB
Rule-Based Part-of-Speech Tagging: Example
(Figure: an example sentence with candidate tags Pronoun, Verb, Article, Verb/Noun, shown before and after rule application.)
Properties of Rule-Based POS Tagging
o Rule-based POS taggers possess the following properties:
o These taggers are knowledge-driven taggers.
o The rules in rule-based POS tagging are built manually.
o The information is coded in the form of rules.
o They use a limited number of rules, approximately around 1000.
o Smoothing and language modeling are defined explicitly in rule-based taggers.
EngCG ENGTWOL Tagger
• Richer dictionary: includes morphological and syntactic features as well as possible POS.
• Uses two-level morphological analysis on the input and returns all possible POS.
• Applies negative constraints (> 3,744) to rule out incorrect POS.
Sample ENGTWOL Dictionary
ENGTWOL Tagging: Stage 1
Step 1: Run words through the FST morphological analyzer to get POS info from morphology.
E.g.: Pavlov had shown that salivation …
Pavlov      PAVLOV N NOM SG PROPER
had         HAVE V PAST VFIN SVO
            HAVE PCP2 SVO
shown       SHOW PCP2 SVOO SVO SV
that        ADV
            PRON DEM SG
            DET CENTRAL DEM SG
            CS
salivation  N NOM SG
ENGTWOL Tagging: Stage 2
Step 2: Apply NEGATIVE constraints.
E.g., the adverbial "that" rule:
◦ Eliminate all readings of that except the adverbial one, as in "It isn't that odd."
POS Tagging Approaches
Rule-based tagging
◦ E.g., EngCG ENGTWOL tagger
Stochastic tagging
◦ Probabilistic models, e.g., HMM tagger
Transformation-based tagging
◦ Learned rules (statistical and linguistic)
◦ E.g., Brill tagger
Stochastic POS Tagging
o A model that includes frequency or probability (statistics) can be called stochastic.
o The simplest stochastic taggers apply the following approaches for POS tagging:
o Word Frequency Approach
◦ Uses the probability that a word occurs with a particular tag:
choose the tag encountered most frequently with the word in the training set (see the sketch below).
o Tag Sequence Probabilities
◦ Calculates the probability of a given sequence of tags occurring.
◦ Called the n-gram approach because the best tag for a given word is determined
by the probability with which it occurs with the n previous tags.
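A minimal sketch of the word-frequency (most-frequent-tag) baseline; the tiny hand-made "training corpus" here is invented for illustration and stands in for a real tagged corpus.

from collections import Counter, defaultdict

# Tiny stand-in for a tagged training corpus: (word, tag) pairs.
training = [
    ("the", "DT"), ("race", "NN"), ("is", "VBZ"), ("on", "IN"),
    ("they", "PRP"), ("race", "VB"), ("to", "TO"), ("race", "NN"),
]

counts = defaultdict(Counter)
for word, tag in training:
    counts[word][tag] += 1

def most_frequent_tag(word, default="NN"):
    # Pick the tag seen most often with this word in training;
    # unseen words fall back to a default tag.
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return default

print(most_frequent_tag("race"))   # 'NN' (seen twice vs 'VB' once)
print(most_frequent_tag("flies"))  # 'NN' (unseen word -> default)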
Properties of Stochastic POS Tagging
Stochastic POS taggers possess the following properties:
o Tagging is based on the probability of a tag occurring.
o It requires a training corpus.
o There is no probability for words that do not appear in the training corpus.
o It uses a testing corpus that is different from the training corpus.
o The simplest form chooses the most frequent tag associated with a word in the training corpus.
Stochastic tagging
– We call the tags hidden because they are not observed.
• A Hidden Markov Model (HMM) allows us to talk about both observed events (like the
words that we see in the input) and hidden events (like the part-of-speech tags) that we
think of as causal factors in our probabilistic model.
• E.g., P(John bit the apple) = ? What is the most likely hidden tag sequence behind this
observed sentence?
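Below is a minimal Viterbi decoding sketch for a bigram HMM tagger over that example; every probability in it is a made-up toy number, not an estimate from a real corpus.

import math

# Toy HMM: made-up probabilities for illustration only.
tags = ["NN", "VB", "DT"]
start_p = {"NN": 0.3, "VB": 0.2, "DT": 0.5}
trans_p = {  # P(tag_i | tag_{i-1})
    "NN": {"NN": 0.2, "VB": 0.6, "DT": 0.2},
    "VB": {"NN": 0.3, "VB": 0.1, "DT": 0.6},
    "DT": {"NN": 0.8, "VB": 0.1, "DT": 0.1},
}
emit_p = {   # P(word | tag)
    "NN": {"john": 0.3, "apple": 0.4, "bit": 0.1},
    "VB": {"bit": 0.7, "apple": 0.05, "john": 0.05},
    "DT": {"the": 0.9},
}

def viterbi(words):
    # V[i][t] = (best log-score of any tag path ending in t at position i, best previous tag)
    V = [{t: (math.log(start_p[t]) + math.log(emit_p[t].get(words[0], 1e-6)), None)
          for t in tags}]
    for w in words[1:]:
        row = {}
        for t in tags:
            best_prev = max(tags, key=lambda p: V[-1][p][0] + math.log(trans_p[p][t]))
            score = (V[-1][best_prev][0] + math.log(trans_p[best_prev][t])
                     + math.log(emit_p[t].get(w, 1e-6)))
            row[t] = (score, best_prev)
        V.append(row)
    # Backtrace the best tag sequence.
    best = max(tags, key=lambda t: V[-1][t][0])
    path = [best]
    for row in reversed(V[1:]):
        path.append(row[path[-1]][1])
    return list(reversed(path))

print(viterbi("john bit the apple".split()))  # -> ['NN', 'VB', 'DT', 'NN'] with these toy numbers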
Entropy
• Event: X
• Probability: P(X)
• Surprise: log(1/P(X)). If the probability is high, the surprise is low.
• Entropy: expected surprise. For an obvious event, entropy will be low; and for rare events?
Entropy
• Entropy (or self-information) is the average uncertainty of a single random variable:
H(X) = - Σ_x p(x) log2 p(x)
(i) H(X) = 0 only when the value of X is determinate, hence providing no new information.
(ii) H(X) ≥ 0, and H(X) > 0 when the value of X is not deterministic, hence providing new information.
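A minimal numeric check of this definition, with made-up example distributions:

import math

def entropy(probs):
    # H(X) = -sum p(x) * log2 p(x); terms with p = 0 contribute nothing.
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1.0]))                     # 0 bits: determinate outcome, no surprise (may print as -0.0)
print(entropy([0.5, 0.5]))                # 1.0 bit: fair coin
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits: uniform over 4 outcomes
print(entropy([0.9, 0.1]))                # ~0.47 bits: skewed, less uncertain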
Entropy: high at first and last word
Aoccdrnig to rseearch at an Elingsh uinervtisy, it deosn't
mttaer in waht oredr the ltteers in a wrod are, the olny
iprmoatnt tihng is that the frist and lsat ltteer is at the rghit
pclae. The rset can be a toatl mses and you can sitll raed it
wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter
by it slef but the wrod as a wlohe.
(Figure panels: 0% Removed, 10% Removed, 20% Removed, 30% Removed, 40% Removed.)
Maximum Entropy Model
Limitations of HMM
• Limited context (only the previous tag)
• Cannot use morphological clues (suffix, capitalization)
• Trouble handling unknown words (no emission probabilities)
• No additional heterogeneous observations/features used
With no constraints, the most uniform distribution:
P(Noun) = 1/4, P(Verb) = 1/4, P(Adjective) = 1/4, P(Adverb) = 1/4
Constraints:
P(Verb) + P(Noun) = 3/4
P(Noun) + P(Adjective) + P(Adverb) + P(Verb) = 1
Answer (the maximum-entropy distribution satisfying the constraints):
P(Verb) = 3/8, P(Noun) = 3/8, P(Adjective) = 1/8, P(Adverb) = 1/8
(moving away from the uniform distribution)
Adding Constraints
❑ Brings the distribution further from the uniform distribution
❑ Raises the maximum likelihood of the data
❑ Lowers the maximum entropy (a small numeric check follows below)
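A small worked check of the example above, reusing the numbers from the slide:

import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform     = [1/4, 1/4, 1/4, 1/4]   # no constraints: maximum entropy is 2 bits
constrained = [3/8, 3/8, 1/8, 1/8]   # max-entropy solution under P(Verb) + P(Noun) = 3/4

print(entropy(uniform))      # 2.0
print(entropy(constrained))  # ~1.81: adding the constraint lowered the maximum entropy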
Entropy
• Event: X; Probability: P(X); Surprise: log(1/P(X)). If the probability is high, the surprise is low.
• Entropy is the expected surprise: H(P) = E_P[ log2 (1/P(X)) ] = - Σ_x P(x) log2 P(x)
• For an obvious event, entropy will be low; and for rare events?
Generative vs. Discriminative
Generative:
◦ Assigns a joint probability to paired observation and label sequences: P(X, Y)
◦ P(S, O) = ∏ P(Si | Si-1) · P(Oi | Si)
◦ Assumes features are independent
Discriminative:
◦ Assigns a conditional probability to paired observation and label sequences: P(Y | X)
◦ P(S | O) = ∏ P(Si | Si-1, Oi)
◦ No longer assumes that features are independent
Conditional Random Fields (CRFs)
o Discriminative model
o (Figure: a linear chain of labels connected to their observations.)
o Transition functions add associations between transitions from one label to another.
o State functions determine the identity of the state based on the input.
o CRFs are based on the idea of Markov Random Fields
◦ Modelled as an undirected graph connecting labels with observations
Conditional Random Fields
• Each feature gets a learned weight, e.g., a positive weight value means a strong feature for this state.
• E.g., if the input word is a full stop, then the POS tag is <e>: a strong state feature for that input word and output tag.
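A minimal sketch of the kinds of state and transition features such a model would weight; the feature names and the example sentence are invented for illustration and do not follow any particular CRF library's API.

# Illustrative CRF-style features for one position in a sentence.
# In a real CRF each feature gets a learned weight; positive weights make
# the (label, feature) pairing more likely, negative weights less likely.

def state_features(words, i, label):
    w = words[i]
    return {
        f"word={w.lower()}&tag={label}": 1.0,
        f"suffix3={w[-3:]}&tag={label}": 1.0,
        f"is_capitalized={w[0].isupper()}&tag={label}": 1.0,
        f"is_fullstop={w == '.'}&tag={label}": 1.0,
    }

def transition_feature(prev_label, label):
    return {f"prev={prev_label}&tag={label}": 1.0}

words = ["She", "promised", "."]
print(state_features(words, 2, "<e>"))   # the full-stop feature fires for the end tag
print(transition_feature("VBD", "<e>"))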
POS Tagging Approaches
Rule-based tagging
◦ E.g., EngCG ENGTWOL tagger
Transformation-based tagging
◦ E.g., Brill tagger
Transformation-Based (Brill) Tagging
Also known as Brill tagging.
Combines rule-based and stochastic tagging:
◦ Like the rule-based approach, rules are used to specify tags in a certain environment.
◦ Like the stochastic approach, a tagged corpus is used to find the most likely tags.
◦ Before the rules are applied, the tagger labels every word with its most likely tag.
◦ Rules are learned from data.
Transformation-Based Tagging: Example
Example: He is expected to race tomorrow
• First, label every word with its most-likely tag.
E.g., for occurrences of race in the corpus: P(NN|race) = .98, P(VB|race) = .02
– he/PRP is/VBZ expected/VBN to/TO race/NN tomorrow/NN
• After selecting the most-likely tags, we apply transformation rules:
– Change NN to VB when the previous tag is TO
– This rule converts race/NN into race/VB:
– he/PRP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
• This may not work for every case:
– "… according to race", where race really is a noun.
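A minimal sketch of applying that single transformation to an initially most-likely-tagged sentence (toy data, with the rule from the slide):

# Start: every word carries its most-likely tag from the corpus.
tagged = [("he", "PRP"), ("is", "VBZ"), ("expected", "VBN"),
          ("to", "TO"), ("race", "NN"), ("tomorrow", "NN")]

def apply_rule(tagged, from_tag, to_tag, when_prev_tag):
    # "Change from_tag to to_tag when the previous tag is when_prev_tag."
    out = list(tagged)
    for i in range(1, len(out)):
        if out[i][1] == from_tag and out[i - 1][1] == when_prev_tag:
            out[i] = (out[i][0], to_tag)
    return out

print(apply_rule(tagged, "NN", "VB", "TO"))
# [..., ('to', 'TO'), ('race', 'VB'), ('tomorrow', 'NN')]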
TBL Tagging Algorithm
How Are Transformation Rules Learned?
We assume that we have a tagged corpus.
• The Brill tagger algorithm has three major steps:
1. Tag the corpus with the most likely tag for each word (unigram model).
2. Choose a transformation that deterministically replaces an existing tag with a new tag
such that the resulting tagged training corpus has the lowest error rate.
3. Apply the transformation to the training corpus and add the rule to the end of the rule set.
These steps are repeated until a stopping criterion is reached (see the sketch below).
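A compressed, hypothetical sketch of that greedy loop over a single rule template ("change tag a to tag b when the previous tag is z") and a one-sentence toy corpus; a real implementation would use many templates and a full corpus.

def apply_rule(tagged, a, b, z):
    # "Change tag a to tag b when the previous tag is z."
    out = list(tagged)
    for i in range(1, len(out)):
        if out[i][1] == a and out[i - 1][1] == z:
            out[i] = (out[i][0], b)
    return out

def errors(predicted, gold):
    return sum(p[1] != g[1] for p, g in zip(predicted, gold))

def learn_rules(initial, gold, tagset, max_rules=5):
    current, rules = list(initial), []
    for _ in range(max_rules):
        best = None
        # Try every instantiation of the template (a -> b when the previous tag is z).
        for a in tagset:
            for b in tagset:
                for z in tagset:
                    candidate = apply_rule(current, a, b, z)
                    gain = errors(current, gold) - errors(candidate, gold)
                    if best is None or gain > best[0]:
                        best = (gain, (a, b, z), candidate)
        if best[0] <= 0:          # stopping criterion: no rule reduces the error
            break
        rules.append(best[1])
        current = best[2]
    return rules

gold = [("he", "PRP"), ("is", "VBZ"), ("expected", "VBN"),
        ("to", "TO"), ("race", "VB"), ("tomorrow", "NN")]
initial = [("he", "PRP"), ("is", "VBZ"), ("expected", "VBN"),
           ("to", "TO"), ("race", "NN"), ("tomorrow", "NN")]
print(learn_rules(initial, gold, ["PRP", "VBZ", "VBN", "TO", "NN", "VB"]))
# -> [('NN', 'VB', 'TO')], i.e. "change NN to VB when the previous tag is TO"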
Transformation Rules
A transformation rule is selected from a small set of templates,
e.g., "if word-1 is an X and the word is a Y, then change the tag to Z".
Template form: change tag a to tag b when … (conditions on the surrounding words and tags).
TBL Issues
• Problem: could keep applying (new) transformations ad infinitum.
• Problem: rules are learned in an ordered sequence.
• Problem: rules may interact, i.e., rules may make errors that are corrected by later rules.
• More complex problem: tagging multi-part words.
Dan | ...