NLP-Lectures 4,5,6
Lecture 4
POS TAGGING
• Rule-based tagging uses a limited number of hand-written static rules to resolve tag
ambiguity, which causes high development cost but also high precision.
• Statistical/Stochastic tagging: needs supervised learning with tagged corpora and
statistical inference. It is language independent and provides acceptable precision.
✓ HMM tagging is a probabilistic method that chooses the tag sequence which
maximizes the product of the word likelihoods and the tag sequence probability
(see the sketch after this list).
• Hybrid-based tagging:
✓ Maximum Entropy tagging: Combination of several knowledge sources.
✓ Transformation based tagging: Based on rules automatically acquired.
✓ Decision tree tagging.
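As a rough illustration of the HMM idea above, the sketch below brute-forces the tag sequence that maximizes the product of word likelihoods and tag transition probabilities. All probabilities, words, and tags are invented toy values, not taken from the lecture; a real tagger would use the Viterbi algorithm rather than exhaustive search.

```python
from itertools import product

# Toy HMM parameters, invented for illustration only.
# Transition probabilities P(tag_i | tag_{i-1}); "<s>" marks the sentence start.
trans = {
    ("<s>", "PRP"): 0.4, ("<s>", "NN"): 0.3,
    ("PRP", "VBD"): 0.5, ("PRP", "VBN"): 0.05, ("PRP", "NN"): 0.1,
    ("VBD", "NN"): 0.3, ("VBN", "NN"): 0.2, ("NN", "NN"): 0.1,
}
# Word likelihoods P(word | tag).
emit = {
    ("PRP", "she"): 0.2,
    ("VBD", "saw"): 0.1, ("VBN", "saw"): 0.02, ("NN", "saw"): 0.001,
    ("NN", "birds"): 0.05,
}
TAGS = ["PRP", "VBD", "VBN", "NN"]

def hmm_tag(words):
    """Return the tag sequence maximizing P(words | tags) * P(tags)."""
    best_tags, best_p = None, 0.0
    for tags in product(TAGS, repeat=len(words)):  # exhaustive search over tag sequences
        p, prev = 1.0, "<s>"
        for w, t in zip(words, tags):
            p *= trans.get((prev, t), 0.0) * emit.get((t, w), 0.0)
            prev = t
        if p > best_p:
            best_tags, best_p = tags, p
    return best_tags, best_p

print(hmm_tag(["she", "saw", "birds"]))  # ≈ (('PRP', 'VBD', 'NN'), 6e-05)
```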
RULE-BASED TAGGING
• First stage − it uses a dictionary to assign each word a list of potential parts-of-speech.
• Second stage − it uses large lists of hand-written disambiguation rules to narrow the list
down to a single part-of-speech for each word.
• Properties of Rule-Based POS Tagging:
✓ These taggers are knowledge-driven taggers.
✓ The rules in Rule-based POS tagging are built manually.
✓ The information is coded in the form of rules.
✓ These taggers use a limited number of rules, approximately 1,000.
✓ This causes high development cost but also high precision.
RULE-BASED TAGGING
Rule: Eliminate VBN (past participle) if VBD (past tense) is an option when
(VBN or VBD) follows “<s> PRP (personal pronoun)”
These kinds of rules become unwieldy and force determinism where there may not be any.
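A minimal sketch of how a hand-written rule like the one above might be encoded; the token and candidate-tag representation is an assumption for illustration, not the lecture's implementation.

```python
def disambiguate(tokens, candidates):
    """candidates[i] is the set of possible tags for tokens[i].
    Rule: eliminate VBN if VBD is also an option and the word directly
    follows a sentence-initial personal pronoun ("<s> PRP")."""
    tags = [set(c) for c in candidates]
    for i, cand in enumerate(tags):
        if {"VBN", "VBD"} <= cand and i == 1 and "PRP" in tags[0]:
            cand.discard("VBN")
    return tags

# "<s> She promised ..." -> 'promised' keeps only VBD.
print(disambiguate(["She", "promised"], [{"PRP"}, {"VBD", "VBN"}]))
```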
STOCHASTIC/STATISTICAL POS TAGGING
• The model that includes frequency or probability (statistics) can be called stochastic.
• Word Frequency Approach
✓ It disambiguates words based on the probability that a word occurs with a particular tag.
✓ The tag encountered most frequently with the word in the training set is the one
assigned to an ambiguous instance of that word (see the sketch after this list).
✓ The main issue with this approach is that it may yield inadmissible sequences of tags.
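A minimal sketch of the word frequency approach, assuming a tiny invented tagged corpus; the words, tags, and the "NN" fallback for unknown words are placeholders.

```python
from collections import Counter, defaultdict

# Tiny hand-made tagged corpus (invented): (word, tag) pairs.
tagged = [("the", "DT"), ("can", "NN"), ("can", "MD"), ("can", "MD"),
          ("rusts", "VBZ"), ("the", "DT")]

counts = defaultdict(Counter)
for word, tag in tagged:
    counts[word][tag] += 1

def most_frequent_tag(word):
    """Assign the tag seen most often with this word in the training set."""
    return counts[word].most_common(1)[0][0] if word in counts else "NN"

print([most_frequent_tag(w) for w in ["the", "can", "rusts"]])
# -> ['DT', 'MD', 'VBZ']: 'can' gets MD (2 of its 3 training occurrences)
```

Tagging the can rusts word by word yields DT MD VBZ, an inadmissible sequence (a modal directly followed by a VBZ verb), which is exactly the weakness noted above.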
STOCHASTIC/STATISTICAL POS TAGGING
Can We Use Statistics Instead?
• Bayes’ Rule: P(t|w) = P(w|t) P(t) / P(w)
• For tagging, choose the tag sequence t̂ = argmax P(t|w) = argmax P(w|t) P(t),
since P(w) is the same for every candidate tag sequence.
LANGUAGE MODEL
• To estimate the probability of the last word of an n-gram given the previous
words.
CHAIN RULE
• By the chain rule, P(w1 w2 ... wn) = P(w1) P(w2|w1) P(w3|w1 w2) ... P(wn|w1 ... wn-1).
• For example, the probability that the next word is food after I like Chinese is
P(food | I like Chinese).
PROBLEM WITH CHAIN RULE
• The longer the sequence, the less likely we are to find it in a training
corpus
THANKS
N-GRAM
Lecture 5
PROBLEM WITH CHAIN RULE
• Markov assumption: the probability of the next word depends only on the previous k words.
• The N-gram is the simplest model that assigns probabilities to sentences and sequences of
words.
Bigram: P(wi | w1 ... wi-1) ≈ P(wi | wi-1)
N-gram (Start Symbols)
N-gram (End Symbol)
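A minimal sketch of bigram maximum-likelihood estimation with start and end symbols (conventionally written <s> and </s>); the toy sentences and the resulting counts are invented placeholders.

```python
from collections import Counter

# Toy training corpus (invented); each sentence is padded with <s> and </s>.
sentences = [["I", "like", "Chinese", "food"],
             ["I", "like", "tea"],
             ["I", "drink", "tea"]]

unigrams, bigrams = Counter(), Counter()
for s in sentences:
    padded = ["<s>"] + s + ["</s>"]
    unigrams.update(padded)
    bigrams.update(zip(padded, padded[1:]))

def p_bigram(w, prev):
    """Maximum-likelihood estimate P(w | prev) = C(prev, w) / C(prev)."""
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def p_sentence(words):
    """Bigram probability of a whole sentence, including the </s> transition."""
    p, prev = 1.0, "<s>"
    for w in words + ["</s>"]:
        p *= p_bigram(w, prev)
        prev = w
    return p

print(p_bigram("like", "I"))             # 2/3
print(p_sentence(["I", "like", "tea"]))  # 1 * 2/3 * 1/2 * 1 ≈ 0.333
```

Padding with <s> gives the first word a conditioning context, and </s> lets the model assign probability to where a sentence ends.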
EXAMPLE
• The likelihood P(wi|ti) represents the probability, given that we see a given tag,
that it will be associated with a given word.
• For example, if we were to see the tag VBZ (third person singular present verb) and
guess the verb that is likely to have that tag, we might guess the verb is, since the
verb to be is so common in English.
• A word likelihood probability like P(is|VBZ) is estimated, again by counting: out of the
times we see VBZ in a corpus, how many of those times VBZ labels the word is
(see the sketch below).
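A minimal sketch of that counting, assuming a tiny invented tagged corpus rather than a real one.

```python
from collections import Counter

# Tiny tagged corpus (invented for illustration): (word, tag) pairs.
tagged = [("she", "PRP"), ("is", "VBZ"), ("happy", "JJ"),
          ("he", "PRP"), ("is", "VBZ"), ("here", "RB"),
          ("it", "PRP"), ("runs", "VBZ"), ("fast", "RB")]

tag_counts = Counter(tag for _, tag in tagged)
word_tag_counts = Counter(tagged)

def likelihood(word, tag):
    """P(word | tag) = C(word tagged as tag) / C(tag)."""
    return word_tag_counts[(word, tag)] / tag_counts[tag]

print(likelihood("is", "VBZ"))    # 2/3: of the 3 VBZ tokens, 2 are 'is'
print(likelihood("runs", "VBZ"))  # 1/3
```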
CORPUS
• A corpus is a large and structured set of machine-readable texts that have been
produced in a natural communicative setting.
• Its plural is corpora.
• Language is infinite but a corpus must be finite in size.
• Main Elements in designing a corpus:
✓Corpus Representativeness
✓Corpus Size
Corpus Size
• How large should the corpus be? There is no specific answer to this question.
• The size of the corpus depends upon the purpose as well as on some practical
considerations as follows:
✓Kind of query anticipated from the user.
✓The methodology used by the users to study the data.
✓Availability of the source of data.
• With the advancement in technology, the corpus size also increases.
EXAMPLES OF CORPUS
TREE-BANK CORPUS
• Semantic Treebanks
✓ These Treebanks use a formal representation (if-then) of a sentence’s semantic
structure.
✓They vary in the depth of their semantic/meaning representation.
✓Examples:
o Robot Commands Treebank,
o Geoquery,
o Groningen Meaning Bank,
o RoboCup Corpus.
TYPES OF TREE-BANK CORPUS
• Syntactic Treebanks
✓ Opposite to the semantic Treebanks.
✓ Annotated with parsed syntactic trees (e.g., dependency grammar).
✓ For example,
o Penn Arabic Treebank and Columbia Arabic Treebank are syntactic Treebanks created
for the Arabic language.
o The Sinica Treebank is a syntactic Treebank created for the Chinese language.
o Lucy, Susanne, and the BLLIP WSJ syntactic corpus were created for the English language.
o The Penn Treebank in English (with shallow semantic annotation).
Applications of Treebank Corpus
• In Computational Linguistics
✓part-of-speech taggers, parsers, semantic analyzers and machine translation
systems.
• In Corpus Linguistics
✓ study syntactic phenomena.
• In Theoretical Linguistics and Psycholinguistics
✓Interaction evidence.
PROPBANK CORPUS
• Switchboard corpus: 120 hours ≈ 2.4M tokens; 2.4K spoken telephone conversations
between US English speakers.
• Brown corpus: 1M tokens, 61,805 types; a balanced collection of genres in US English
from 1961.
THANKS
N-GRAM EVALUATION
Lecture 6
N-GRAM (LOG Probability)
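The heading above refers to working in log space: a sentence probability is a product of many small conditional probabilities, so implementations sum log probabilities instead to avoid numeric underflow. A minimal sketch, with invented probability values.

```python
import math

# Conditional bigram probabilities for one sentence (invented values).
probs = [0.2, 0.05, 0.1, 0.3]

product_p, log_sum = 1.0, 0.0
for p in probs:
    product_p *= p          # product of probabilities shrinks toward underflow
    log_sum += math.log(p)  # sum of logs stays in a safe numeric range

print(product_p)            # ≈ 0.0003 (very long products eventually underflow to 0.0)
print(math.exp(log_sum))    # ≈ 0.0003, recovered from the log-space sum
```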
• An intrinsic evaluation metric is one which measures the quality of a model independent
of any application.
• Perplexity is the most common intrinsic evaluation metric for N-gram language models.
• The perplexity (PP) of a language model on a test set is the inverse probability of the test
set, normalized by the number of words.
• The higher the conditional probability of the word sequence, the lower the
perplexity.
• Minimizing perplexity is equivalent to maximizing the test set probability according
to the language model.
• Perplexity is related inversely to the likelihood of the test sequence according to the
model.
For a test set W = w1 w2 . . . wN, the perplexity is the inverse probability of the test set,
normalized by the number of words:
PP(W) = P(w1 w2 . . . wN)^(-1/N), i.e. the N-th root of 1 / P(w1 w2 . . . wN).
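A minimal sketch of this computation, done in log space for numerical stability; the conditional probabilities below are invented placeholders for whatever the language model assigns to each test word given its history.

```python
import math

def perplexity(cond_probs):
    """PP(W) = P(w1 ... wN) ** (-1/N), computed in log space for stability."""
    n = len(cond_probs)
    log_prob = sum(math.log(p) for p in cond_probs)
    return math.exp(-log_prob / n)

# Conditional probabilities P(w_i | history) over a test set (invented values).
print(perplexity([0.2, 0.1, 0.25, 0.2]))      # ≈ 5.6
print(perplexity([0.02, 0.01, 0.025, 0.02]))  # ≈ 56.2
```

The second call, with lower conditional probabilities, yields a higher perplexity, matching the point above that higher probability means lower perplexity.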
• Sparse data is caused by the fact that our maximum likelihood estimates are based on
one particular set of training data.
• But because any corpus is limited, some perfectly acceptable English word
sequences are missing. (Zero probability N-gram).
• A few words occur very frequently.
• Many words occur very infrequently.
• If we have no way to determine the distribution of unseen N-grams, how can we
estimate them?
SMOOTHING
• Assign some non-zero probability to any N-gram, even one that was never
observed in training.
• Smoothing addresses the poor estimates that are due to variability in
small data sets.
• Make the distribution more uniform.
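As a concrete warm-up before the specific algorithms listed later, here is a minimal sketch of add-one (Laplace) smoothing for bigrams. Add-one smoothing is not named on these slides; it is used here only to show how probability mass is moved from seen to unseen N-grams. The counts and vocabulary are invented.

```python
from collections import Counter

# Toy bigram and unigram counts (invented); V is the vocabulary size.
bigram_counts = Counter({("I", "like"): 2, ("like", "tea"): 1})
unigram_counts = Counter({"I": 3, "like": 2, "tea": 2})
V = len(unigram_counts)

def p_add_one(w, prev):
    """Add-one (Laplace) smoothed bigram:
    P(w | prev) = (C(prev, w) + 1) / (C(prev) + V)."""
    return (bigram_counts[(prev, w)] + 1) / (unigram_counts[prev] + V)

print(p_add_one("tea", "like"))  # seen bigram:   (1 + 1) / (2 + 3) = 0.4
print(p_add_one("tea", "I"))     # unseen bigram: (0 + 1) / (3 + 3) ≈ 0.167, no longer zero
```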
SMOOTHING
• Zipf:
✓Unseen words should behave more like hapax legomena.
✓ Words that occur a lot should behave like other words that occur a lot.
GOOD-TURING ADJUSTMENTS
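A minimal sketch of the Good-Turing adjustment, which replaces a raw count c with c* = (c + 1) * N(c+1) / N(c), where N(c) is the number of N-gram types seen exactly c times, and reserves N(1)/N of the probability mass for unseen events; the counts below are invented.

```python
from collections import Counter

# Invented N-gram counts; N(c) = number of types occurring exactly c times.
ngram_counts = Counter({"a b": 1, "b c": 1, "c d": 1,
                        "d e": 2, "e f": 2, "f g": 3})
freq_of_freqs = Counter(ngram_counts.values())   # {1: 3, 2: 2, 3: 1}

def good_turing(c):
    """Adjusted count c* = (c + 1) * N(c+1) / N(c); undefined if N(c+1) is zero."""
    if freq_of_freqs[c] == 0 or freq_of_freqs[c + 1] == 0:
        return None   # in practice, high counts are left unadjusted or smoothed
    return (c + 1) * freq_of_freqs[c + 1] / freq_of_freqs[c]

print(good_turing(1))   # (2 * 2) / 3 ≈ 1.33: singletons are discounted
print(good_turing(2))   # (3 * 1) / 2 = 1.5
# Probability mass reserved for unseen N-grams: N(1) / total tokens = 3 / 10
print(freq_of_freqs[1] / sum(ngram_counts.values()))
```

When N(c+1) is zero the adjusted count is undefined, which is one of the practical limitations the next heading refers to.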
GOOD-TURING LIMITATIONS
SMOOTHING
• Smoothing algorithms:
✓ Katz smoothing
✓Simple interpolation (Jelinek-Mercer)
✓Absolute discounting
✓ Kneser-Ney smoothing
• Commonly used N-gram smoothing algorithms rely on lower-order
N-gram counts via backoff or interpolation.
BACKOFF
Interpolation
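A minimal sketch of simple linear (Jelinek-Mercer style) interpolation of a bigram estimate with a unigram estimate; the lambda weight and the hard-coded probabilities are invented placeholders, and p_bigram / p_unigram stand in for estimators trained elsewhere.

```python
def p_interpolated(w, prev, p_bigram, p_unigram, lam=0.7):
    """Linear interpolation: lam * P_bigram(w | prev) + (1 - lam) * P_unigram(w)."""
    return lam * p_bigram(w, prev) + (1 - lam) * p_unigram(w)

# Invented stand-in estimates: the bigram ("like", "soup") was never seen.
p_bi = lambda w, prev: 0.0 if (prev, w) == ("like", "soup") else 0.5
p_uni = lambda w: 0.01

print(p_interpolated("tea", "like", p_bi, p_uni))   # 0.7*0.5 + 0.3*0.01 = 0.353
print(p_interpolated("soup", "like", p_bi, p_uni))  # 0.7*0.0 + 0.3*0.01 = 0.003 (non-zero)
```

Mixing in the lower-order unigram estimate keeps unseen bigrams from receiving zero probability, which is the role of lower-order counts noted in the bullets above.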
SMOOTHING