NLP-Lectures 4, 5, 6

POS TAGGING ALGORITHMS

Lecture 4
POS TAGGING

• POS tagging is the process of assigning a part-of-speech tag to each word in a sequence.
• Words can belong to many parts of speech. For example, back:
✓The back/JJ door (adjective)
✓On its back/NN (noun)
✓Win the voters back/RB (adverb)
✓Promise to back/VB you in a fight (verb)
• We want to decide the appropriate tag given a particular sequence of tokens.
POS TAGGING ALGORITHMS

• Rule-based tagging uses a limited number of hand-written static rules to resolve tag
ambiguity, which causes high development cost but also high precision.
• Statistical/stochastic tagging: needs supervised learning with tagged corpora and
statistical inference. It is language-independent and provides acceptable precision.
✓ HMM tagging is a probabilistic method that chooses the tag sequence which
maximizes the product of the word likelihoods and the tag sequence probability.
• Hybrid-based tagging:
✓ Maximum Entropy tagging: combines several knowledge sources.
✓ Transformation-based tagging: based on automatically acquired rules.
✓ Decision tree tagging.
RULE-BASED TAGGING

• First stage − it uses a dictionary to assign each word a list of potential parts-of-speech.
• Second stage − it uses large lists of hand-written disambiguation rules to narrow the
list down to a single part-of-speech for each word.
• Properties of Rule-Based POS Tagging:
✓ These taggers are knowledge-driven taggers.
✓ The rules in Rule-based POS tagging are built manually.
✓ The information is coded in the form of rules.
✓ A limited number of rules is used, typically around 1,000.
✓ This causes high development cost but also high precision.
RULE-BASED TAGGING

• Start with a dictionary


• Assign all possible tags to words from the dictionary.
• Write rules (‘by hand’) to selectively remove tags
RULE-BASED TAGGING EXAMPLE

Rule: Eliminate VBN (past participle) if VBD (past tense) is an option when
(VBN or VBD) follows “<s> PRP (personal pronoun)”.
These kinds of rules become unwieldy and force determinism where there may not be any; a minimal sketch of applying such a rule follows.
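
• A minimal sketch of such a rule in Python (toy dictionary and a single hypothetical rule, not the tagger's actual rule set):

# A toy dictionary mapping words to all of their possible tags (hypothetical entries).
DICTIONARY = {
    "i": {"PRP"},
    "promised": {"VBD", "VBN"},
    "to": {"TO"},
    "back": {"VB", "NN", "JJ", "RB"},
}

def apply_vbn_rule(candidates):
    """Eliminate VBN if VBD is also an option and the word follows a sentence-initial PRP."""
    for i, tags in enumerate(candidates):
        follows_sentence_initial_pronoun = (i == 1 and candidates[0] == {"PRP"})
        if follows_sentence_initial_pronoun and {"VBD", "VBN"} <= tags:
            tags.discard("VBN")
    return candidates

words = ["i", "promised", "to", "back"]
candidates = [set(DICTIONARY[w]) for w in words]
print(apply_vbn_rule(candidates))
# 'promised' keeps only VBD; 'back' stays ambiguous until further rules apply.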
STOCHASTIC/ STATISTIC POS TAGGING

• The model that includes frequency or probability (statistics) can be called stochastic.
• Word Frequency Approach
✓It disambiguates words based on the probability that a word occurs
with a particular tag.
✓The tag encountered most frequently with the word in the training set
is the one assigned to an ambiguous instance of that word (see the sketch below).
✓The main issue with this approach is that it may yield an inadmissible
sequence of tags.
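
• A minimal sketch of the word-frequency baseline, assuming a toy hand-tagged corpus (illustrative data only):

from collections import Counter, defaultdict

# Toy tagged training corpus: (word, tag) pairs (hypothetical).
training = [("the", "DT"), ("back", "NN"), ("door", "NN"),
            ("win", "VB"), ("the", "DT"), ("voters", "NNS"), ("back", "RB"),
            ("promise", "VB"), ("to", "TO"), ("back", "VB"), ("you", "PRP")]

tag_counts = defaultdict(Counter)
for word, tag in training:
    tag_counts[word][tag] += 1

def most_frequent_tag(word):
    # Assign the tag seen most often with this word in training (ties broken arbitrarily).
    return tag_counts[word].most_common(1)[0][0] if word in tag_counts else "NN"

print([most_frequent_tag(w) for w in ["promise", "to", "back", "the", "door"]])
# 'back' always gets its single most frequent tag, regardless of context.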
STOCHASTIC/ STATISTIC POS TAGGING

• Tag Sequence Probabilities


✓The tagger calculates the probability of a given sequence of tags
occurring.
✓It is also called the n-gram approach, because the best tag for a given
word is determined by the probability with which it occurs with the n
previous tags.
PROPERTIES OF STOCHASTIC POS TAGGING

• This approach is based on the probability of a tag occurring.


• It requires a tagged training corpus.
• There is no probability for words that do not appear in the training corpus.
• It uses a testing corpus different from the training corpus.
• The simplest stochastic tagger chooses the most frequent tag associated with each
word in the training corpus.
POS TAGGING AND LANGUAGE MODEL
Can We Use Statistics Instead?

• Bayes’ Rule

𝑃(𝑋, 𝑌) = 𝑃(𝑋) 𝑃(𝑌|𝑋) = 𝑃(𝑌) 𝑃(𝑋|𝑌)


CHAIN RULE

• The chain rule (this extends Bayes' rule to longer sequences):

P(w1, w2, …, wn) = P(w1) P(w2|w1) P(w3|w1, w2) … P(wn|w1, …, wn−1)
                 = ∏k P(wk | w1 … wk−1)
LANGUAGE MODEL

• Language model: The statistical model of a language.


(e.g., probabilities of words in an ordered sequence).
WHAT DO WE DO WITH A LANGUAGE MODEL

• A language model can do word prediction (guess the next word…).


• Given a sequence of letters, what is the likelihood of the next letter?
• Language models can score and sort sentences.

• A language model can do POS tagging.


Example: “Promise to back the ball”
P(V, TO, V, DT, N) >> P(V, TO, N, DT, N)
CHAIN RULE

• To assign probabilities to entire sequences.

• To estimate the probability of the last word of an n-gram given the previous
words.
CHAIN RULE

• Example: the probability that the next word is food after “I like Chinese”, i.e., P(food | I like Chinese).
PROBLEM WITH CHAIN RULE

• The longer the sequence, the less likely we are to find it in a training
corpus
THANKS
N-GRAM
Lecture 5
PROBLEM WITH CHAIN RULE

• Assume statistical independence


MARKOV ASSUMPTION (SOLUTION)

• The probability of the next word depends only on the previous k words.
• N-gram is the simplest model that assigns probabilities to sentences and sequences of
words.

• N-Gram probabilities come from a training corpus.


• The larger the n, the more parameters there are to estimate.
N-GRAM

• N-gram is the simplest model that assigns probabilities to sentences and


sequences of words.
• An n-gram is a sequence of N words: “please turn your homework”
✓ A unigram is a single word like “please”, “turn”, “your”, “homework”
✓ A bigram is a two-word sequence of words like
“please turn”, “turn your”, or ”your homework”
✓ A trigram is a three-word sequence of words like
“please turn your”, or “turn your homework”.
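
• A minimal sketch of n-gram extraction over the example phrase (standard library only):

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "please turn your homework".split()
print(ngrams(tokens, 1))  # unigrams: ('please',), ('turn',), ('your',), ('homework',)
print(ngrams(tokens, 2))  # bigrams:  ('please', 'turn'), ('turn', 'your'), ('your', 'homework')
print(ngrams(tokens, 3))  # trigrams: ('please', 'turn', 'your'), ('turn', 'your', 'homework')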
N-gram (Start and End Symbols)

• Bigram models pad each sentence with a start symbol <s> and an end symbol </s>, so that
probabilities such as P(please | <s>) and P(</s> | homework) are defined at sentence boundaries.
EXAMPLE

• Let’s compute simple N-gram models of speech queries about restaurants.


• Unigram
EXAMPLE – BIGRAM COUNT
EXAMPLE- BIGRAM PROBABILITIES

• Obtain likelihoods by dividing bigram counts by unigram counts.


EXAMPLE- BIGRAM PROBABILITIES
Using Bigrams to Estimate the Probability of Whole Sentences
• We need to use the start (<s>) and end (</s>) tags here.
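
• A minimal sketch of bigram estimation and whole-sentence probability with <s> and </s>, assuming a tiny made-up corpus rather than the restaurant data shown above:

from collections import Counter

# Tiny hypothetical training corpus, one sentence per line.
corpus = ["i want chinese food", "i want british food", "i like chinese food"]
sentences = [["<s>"] + line.split() + ["</s>"] for line in corpus]

unigrams = Counter(w for s in sentences for w in s)
bigrams = Counter((s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1))

def p_bigram(w, prev):
    # MLE estimate: C(prev, w) / C(prev)
    return bigrams[(prev, w)] / unigrams[prev]

def p_sentence(text):
    s = ["<s>"] + text.split() + ["</s>"]
    p = 1.0
    for prev, w in zip(s, s[1:]):
        p *= p_bigram(w, prev)
    return p

print(p_bigram("want", "i"))              # C(i want) / C(i) = 2/3
print(p_sentence("i want chinese food"))  # product of the bigram probabilities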
N-GRAM AND HIDDEN MARKOV
MODELS
PART-OF-SPEECH-TAGGING USING HIDDEN
MARKOV MODEL
• A special case of Bayesian inference.
• HMM taggers make two simplifying assumptions.
✓The first assumption is that the probability of a word appearing depends
only on its own part-of-speech tag; it is independent of the surrounding
words and tags.
✓The second assumption is that the probability of a tag appearing depends
only on the previous tag (the bigram assumption).
• What is the best sequence of tags which corresponds to the sequence of words?
PART-OF-SPEECH-TAGGING USING HIDDEN
MARKOV MODEL

First assumption (word likelihood):  P(w1 … wn | t1 … tn) ≈ ∏i P(wi | ti)
Second assumption (tag bigram):      P(t1 … tn) ≈ ∏i P(ti | ti−1)

PART-OF-SPEECH TAGGING USING HIDDEN
MARKOV MODEL
• Combining the two assumptions, the tagger chooses the tag sequence that maximizes
  ∏i P(wi | ti) P(ti | ti−1)
• Example: determiners are very likely to precede adjectives and nouns, as in sequences
like:
that/DT flight/NN and the/DT yellow/JJ hat/NN.
• We expect the probabilities P(NN|DT) and P(JJ|DT) to be high. But in English, adjectives don't tend
to precede determiners, so the probability P(DT|JJ) ought to be low.
PART-OF-SPEECH-TAGGING USING HMM

• The likelihood 𝑷(𝒘𝒊 |𝒕𝒊 ) represents the probability that, given we see a particular tag,
it is associated with a particular word.
• For example, if we see the tag VBZ (third-person singular present verb) and have to
guess the word it labels, we would likely guess is, since the
verb to be is so common in English.
• A word likelihood such as P(is|VBZ) is again estimated by counting: out of all the times we see
VBZ in a corpus, how many of those times it labels the word is, i.e., P(is|VBZ) = C(VBZ, is) / C(VBZ). A counting-based sketch follows.
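
• A minimal counting-based sketch of estimating the emission probabilities P(w|t) and transition probabilities P(t|t_prev), assuming a toy tagged corpus:

from collections import Counter, defaultdict

# Toy tagged sentences (hypothetical): lists of (word, tag) pairs.
tagged = [[("she", "PRP"), ("is", "VBZ"), ("happy", "JJ")],
          [("he", "PRP"), ("is", "VBZ"), ("here", "RB")],
          [("time", "NN"), ("flies", "VBZ")]]

tag_counts = Counter()
emission_counts = defaultdict(Counter)    # tag -> word counts
transition_counts = defaultdict(Counter)  # previous tag -> tag counts

for sentence in tagged:
    prev = "<s>"
    for word, tag in sentence:
        tag_counts[tag] += 1
        emission_counts[tag][word] += 1
        transition_counts[prev][tag] += 1
        prev = tag

def p_emission(word, tag):
    # P(word | tag) = C(tag, word) / C(tag)
    return emission_counts[tag][word] / tag_counts[tag]

def p_transition(tag, prev):
    # P(tag | prev) = C(prev, tag) / C(prev)
    return transition_counts[prev][tag] / sum(transition_counts[prev].values())

print(p_emission("is", "VBZ"))     # 2/3: of the 3 VBZ tokens, 2 label "is"
print(p_transition("VBZ", "PRP"))  # 2/2 = 1.0 in this toy corpus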
EXAMPLE

• Is race a verb (VB) or a common noun (NN)? A sketch of the comparison follows.
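
• A sketch of the comparison for race preceded by the tag TO (as in “to race”); the probability values are purely illustrative placeholders, not counts from any particular corpus:

# Hypothetical transition and likelihood values (illustrative only).
transition = {
    ("VB", "TO"): 0.83,     # P(VB | TO): verbs commonly follow "to"
    ("NN", "TO"): 0.00047,  # P(NN | TO)
}
likelihood = {
    ("race", "VB"): 0.00012,  # P(race | VB)
    ("race", "NN"): 0.00057,  # P(race | NN)
}

score_vb = transition[("VB", "TO")] * likelihood[("race", "VB")]
score_nn = transition[("NN", "TO")] * likelihood[("race", "NN")]
print("VB" if score_vb > score_nn else "NN", score_vb, score_nn)
# Here the VB reading wins even though P(race | NN) > P(race | VB).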


THANKS
CORPUS
Lecture 6 (part 2)
CORPUS

• A corpus is a large and structured set of machine-readable texts that have been
produced in a natural communicative setting.
• Its plural is corpora.
• Language is infinite but a corpus must be finite in size.
• Main Elements in designing a corpus:
✓Corpus Representativeness
✓Corpus Size
Corpus Representativeness

• “A corpus is thought to be representative of the language variety it is supposed


to represent if the findings based on its contents can be generalized to the said
language variety”.
• “Representativeness refers to the extent to which a sample includes the full
range of variability in a population”.
• Representativeness of a corpus is determined by the following two factors:
✓Balance: the range of genres included in a corpus.
✓Sampling: how the chunks for each genre are selected.
Corpus Representativeness- Balance

• Corpus balance – the range of genres included in a corpus.


• A balanced corpus covers a wide range of text categories, which are supposed to be
representative of the language.
• There is no reliable scientific measure of balance; in practice, the accepted balance
is determined by the corpus's intended uses.
Corpus Representativeness- Sampling

• Corpus representativeness and balance are very closely associated with sampling.


• Sampling decisions include: the kinds of texts included, the number of texts, the selection of
particular texts, the selection of text samples from within texts, and the
length of text samples.
• Each of these involves a sampling decision, whether conscious or not.
CORPUS SIZE

• How large should the corpus be? There is no specific answer to this question.
• The size of the corpus depends upon the purpose as well as on some practical
considerations as follows:
✓Kind of query anticipated from the user.
✓The methodology used by the users to study the data.
✓Availability of the source of data.
• With the advancement in technology, the corpus size also increases.
EXAMPLES OF CORPUS
TREE-BANK CORPUS

• A linguistically parsed text corpus that annotates syntactic or semantic sentence


structure.
• The term ‘treebank’ reflects the fact that the most common way of representing
the grammatical analysis is by means of a tree structure.
• Treebanks are created on the top of a corpus, which has already been annotated
with part-of-speech tags.
TYPES OF TREE-BANK CORPUS

• Semantic Treebanks
✓These treebanks use a formal (if-then) representation of a sentence's semantic
structure.
✓They vary in the depth of their semantic/meaning representation.
✓Examples:
o Robot Commands Treebank,
o Geoquery,
o Groningen Meaning Bank,
o RoboCup Corpus.
TYPES OF TREE-BANK CORPUS

• Syntactic Treebanks
✓ In contrast to semantic treebanks, they annotate the parsed syntactic tree (e.g., dependency grammar).
✓ For example:
o The Penn Arabic Treebank and the Columbia Arabic Treebank are syntactic treebanks created for the
Arabic language.
o The Sinica Treebank is a syntactic treebank created for the Chinese language.
o Lucy, Susanne, and the BLLIP WSJ corpus are syntactic corpora created for the English language.
o The Penn Treebank in English (with shallow semantics).
Applications of Treebank Corpus

• In Computational Linguistics
✓part-of-speech taggers, parsers, semantic analyzers and machine translation
systems.
• In Corpus Linguistics
✓ study syntactic phenomena.
• In Theoretical Linguistics and Psycholinguistics
✓Interaction evidence.
PROPBANK CORPUS

• PropBank, more specifically called “Proposition Bank”, is a corpus annotated with
predicate-argument relations (basic semantic information).
• The corpus is a verb-oriented resource; the annotations are closely related to
the syntactic level.
• In Natural Language Processing (NLP), the PropBank project has played a very significant
role: it supports semantic role labeling.
• Semantic role labeling: assigning labels to words or phrases according to their semantic
role (agent, goal, result).
VERBNET (VN)

• VerbNet (VN) is a hierarchical, domain-independent lexical resource and the largest
verb lexicon available for English.
• It incorporates both semantic as well as syntactic information about its contents.
• VN is a broad-coverage verb lexicon having mappings to other lexical resources
such as WordNet, Xtag and FrameNet.
• It is organized into verb classes.
VERBNET (VN)

• Each VerbNet (VN) class contains:


✓A set of syntactic descriptions or syntactic frames
o Such as transitive, intransitive, prepositional phrases, etc.
✓A set of semantic descriptions
o Such as human, organization
WORDNET

• It is a lexical database for the English language.


• It is available as part of the NLTK corpus collection.
• Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms
called Synsets.
• All the synsets are linked with the help of conceptual-semantic and lexical
relations.
• WordNet is used for various purposes such as word-sense disambiguation,
information retrieval, automatic text classification, machine translation, and
similarity computation (see the sketch below).
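
• A minimal sketch using NLTK's WordNet interface (assumes nltk is installed and the WordNet data can be downloaded):

import nltk
nltk.download("wordnet", quiet=True)  # one-time download of the WordNet data
from nltk.corpus import wordnet as wn

# Synsets: sets of cognitive synonyms for a word.
for synset in wn.synsets("bank")[:3]:
    print(synset.name(), "->", synset.definition())

# Lexical/semantic relations and similarity between concepts.
dog, cat = wn.synset("dog.n.01"), wn.synset("cat.n.01")
print(dog.lemma_names())        # synonyms in the synset
print(dog.hypernyms())          # more general concepts
print(dog.wup_similarity(cat))  # Wu-Palmer similarity score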
OTHER EXAMPLES

• Switchboard corpus
120 hours ≈ 2.4M tokens
2.4K spoken telephone conversations between US English speakers.
• Brown corpus
1M tokens, 61,805 types. Balanced collection of genres in US English from
1961.
THANKS
N-GRAM EVALUATION
Lecture 6
N-GRAM (LOG Probability)

• Multiplying many small probabilities can cause numerical underflow, so in practice we add
log probabilities instead: log(p1 × p2 × … × pn) = log(p1) + log(p2) + … + log(pn).

• Example: Given the log of these conditional probabilities:


log(P(Mary|<s>)) = -4   log(P(likes|Mary)) = -7   log(P(cats|likes)) = -100   log(P(</s>|cats)) = -1
• Approximate the log probability of the following sentence with bigrams : “<s> Mary likes
cats </s>”
• Solution:
log(P(<s> Mary likes cats </s>)) = (-4)+(-7)+(-100)+(-1)= -112
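
• A minimal sketch of the same computation: with log probabilities, the product of bigram probabilities becomes a sum.

import math

# Given log conditional probabilities from the example above.
log_p = {("Mary", "<s>"): -4, ("likes", "Mary"): -7,
         ("cats", "likes"): -100, ("</s>", "cats"): -1}

sentence = ["<s>", "Mary", "likes", "cats", "</s>"]
log_prob = sum(log_p[(w, prev)] for prev, w in zip(sentence, sentence[1:]))
print(log_prob)            # -112
print(math.exp(log_prob))  # back to a raw probability (assuming natural logarithms)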
EVALUATING A LANGUAGE MODEL

• How can we quantify the goodness of a model?


• How do we know whether one model is better than another?
• N-gram language models are evaluated by separating the corpus into a training set and a
test set, training the model on the training set, and evaluating on the test set.

• There are 2 general ways of evaluating LMs:


✓Extrinsic: in terms of some external measure (this depends on some task
or application).
✓Intrinsic: in terms of properties of the LM itself.
EXTRINSIC EVALUATION

• The utility of a language model is often determined in practice, by embedding it in an
application and measuring task performance.


• Example:
1. Alternately embed LMs A and B into a speech recognizer.
2. Run speech recognition using each model.
3. Compare recognition rates between the system that uses LM A and the
system that uses LM B.
INTRINSIC EVALUATION

• An intrinsic evaluation metric is one which measures the quality of a model independent
of any application.
• Perplexity is the most common intrinsic evaluation metric for N-gram language models.
• The perplexity (PP) of a language model on a test set is the inverse probability of the test
set, normalized by the number of words.
• The higher the conditional probability of the word sequence, the lower the
perplexity.
• Minimizing perplexity is equivalent to maximizing the test set probability according
to the language model.
• Perplexity is related inversely to the likelihood of the test sequence according to the
model.
PERPLEXITY

• For a test set W = w1 w2 … wN, the perplexity is the inverse probability of
the test set, normalized by the number of words:
  PP(W) = P(w1 w2 … wN)^(−1/N)
• Expanding the probability of W with the chain rule:
  PP(W) = ( ∏ i=1..N  1 / P(wi | w1 … wi−1) )^(1/N)
• If we are computing the perplexity of W with a bigram language model:
  PP(W) = ( ∏ i=1..N  1 / P(wi | wi−1) )^(1/N)
• Lower perplexity → a better model.
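
• A minimal sketch of bigram perplexity on a toy training/test pair, following the formulas above; unsmoothed MLE probabilities are assumed, so every test bigram must appear in training:

import math
from collections import Counter

train = [["<s>"] + s.split() + ["</s>"] for s in
         ["i want chinese food", "i want british food", "i like chinese food"]]
unigrams = Counter(w for s in train for w in s)
bigrams = Counter((s[i], s[i + 1]) for s in train for i in range(len(s) - 1))

def p_bigram(w, prev):
    return bigrams[(prev, w)] / unigrams[prev]  # MLE, no smoothing

def perplexity(test_sentence):
    s = ["<s>"] + test_sentence.split() + ["</s>"]
    log_prob = sum(math.log(p_bigram(w, prev)) for prev, w in zip(s, s[1:]))
    N = len(s) - 1                  # number of predicted words (including </s>)
    return math.exp(-log_prob / N)  # PP(W) = P(W)^(-1/N)

print(perplexity("i want chinese food"))  # lower is better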
PROBLEMS IN N-GRAM

• Sparse data: our maximum likelihood estimates are based on a particular, limited set
of training data.
• Because any corpus is limited, some perfectly acceptable English word
sequences are missing (zero-probability N-grams).
• A few words occur very frequently.
• Many words occur very infrequently.
• If we have no way to determine the distribution of unseen N-grams, how can we
estimate them?
SMOOTHING

• Assign some non-zero probability to any N-gram, even one that was never
observed in training.
• Smoothing addresses the poor estimates that are due to variability in
small data sets.
• Make the distribution more uniform.
SMOOTHING

• Smoothing algorithms provide a better way of estimating the probability of N-grams.

✓ Laplace Smoothing (Add-one smoothing)

Types: the number of distinct words in a corpus, i.e., the vocabulary size V.


SMOOTHING

• Smoothing algorithms provide a better way of estimating the probability of N-grams


than Maximum Likelihood Estimation.
✓ Laplace Smoothing (Add-one smoothing), with vocabulary size V:
  P_Laplace(wn | wn−1) = (C(wn−1 wn) + 1) / (C(wn−1) + V)

Does this give a proper probability distribution? Yes.


Laplace-smoothed bigram counts and Laplace-smoothed probabilities for the restaurant example (V = 1446). [tables not shown; a sketch follows]
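
• A minimal sketch of add-one (Laplace) smoothing for bigrams, assuming the same toy counting scheme as the earlier bigram sketch:

from collections import Counter

train = [["<s>"] + s.split() + ["</s>"] for s in
         ["i want chinese food", "i want british food", "i like chinese food"]]
unigrams = Counter(w for s in train for w in s)
bigrams = Counter((s[i], s[i + 1]) for s in train for i in range(len(s) - 1))
V = len(unigrams)  # vocabulary size (word types, here including <s> and </s>)

def p_laplace(w, prev):
    # P*(w | prev) = (C(prev, w) + 1) / (C(prev) + V)
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

print(p_laplace("want", "i"))  # seen bigram: its count is discounted
print(p_laplace("food", "i"))  # unseen bigram: no longer zero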
ADD SMOOTHING (For Larger Corpora)
GOOD-TURING

• Define 𝑁𝑐 as the number of N-grams that occur c times.


• Idea: get rid of zeros by re-estimating the count c from the number of N-grams that
occur c + 1 times:
  c* = (c + 1) · Nc+1 / Nc
• Example: see the sketch after the bullets below.
• Zipf's law intuition:
✓Unseen words should behave like hapax legomena (words that occur only once).
✓ Words that occur a lot should behave like other words that occur a lot.
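
• A minimal sketch of the Good-Turing adjusted count c* = (c+1)·Nc+1/Nc on a hypothetical count table; real implementations also smooth the Nc values themselves:

from collections import Counter

# Hypothetical bigram counts: how many times each bigram was seen.
bigram_counts = {"a b": 3, "b c": 2, "c d": 2, "d e": 1, "e f": 1, "f g": 1}
N = Counter(bigram_counts.values())  # N[c] = number of bigrams seen exactly c times

def good_turing_count(c):
    """Adjusted count c* = (c + 1) * N_{c+1} / N_c (undefined when either count is 0)."""
    if N[c] == 0 or N[c + 1] == 0:
        return None
    return (c + 1) * N[c + 1] / N[c]

total = sum(bigram_counts.values())
print(good_turing_count(1))  # hapax counts are discounted: 2 * N2 / N1 = 2 * 2 / 3
print(N[1] / total)          # probability mass reserved for unseen bigrams: N1 / total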
GOOD-TURING ADJUSTMENTS
GOOD-TURING LIMITATIONS
SMOOTHING

• Smoothing algorithms:

✓ Katz smoothing
✓Simple interpolation (Jelinek-Mercer)
✓Absolute discounting
✓ Kneser-Ney smoothing
• Commonly used N-gram smoothing algorithms rely on lower-order
N-gram counts via backoff or interpolation.
BACKOFF
Interpolation
SMOOTHING

• Interpolation involves combining higher- and lower-order models.


• Interpolation always mixes the probability estimates from all the N-gram
estimators, i.e., we do a weighted interpolation of trigram, bigram, and
unigram counts.
• In a Katz backoff N-gram model, if the N-gram we need has zero
counts, we approximate it by backing off to the (N-1)-gram. We continue
backing off until we reach a history that has some counts.
• Only “back off ” to a lower order N-gram if we have zero evidence for a
higher-order N-gram.
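
• A minimal sketch of linear interpolation and a simplified (“stupid”) backoff with hand-picked weights; this is not the full Katz computation, which also discounts and renormalizes:

def interpolate(p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    """P_hat(w | w1, w2) = l1*P(w|w1,w2) + l2*P(w|w2) + l3*P(w); the weights sum to 1."""
    l1, l2, l3 = lambdas
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

# Even if the trigram was never seen (p_tri = 0), the interpolated estimate stays non-zero.
print(interpolate(p_tri=0.0, p_bi=0.2, p_uni=0.05))  # 0.065

def backoff(p_tri, p_bi, p_uni, alpha=0.4):
    """Back off to a lower-order estimate only when the higher order has zero evidence."""
    if p_tri > 0:
        return p_tri
    if p_bi > 0:
        return alpha * p_bi
    return alpha * alpha * p_uni

print(backoff(p_tri=0.0, p_bi=0.2, p_uni=0.05))  # backs off to the bigram estimate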
SMOOTHING

• Jelinek-Mercer performs better on small training sets; Katz performs


better on large training sets.
• Katz smoothing performs well on N-grams with large counts; Kneser-Ney
is best for small counts.
• Interpolated models are superior to backoff models for low (nonzero)
counts.
THANKS
