AIM-502: UNIT-2 WORD LEVEL ANALYSIS

2.1 Explain the usage of Unsmoothed and Smoothed N-grams


• N-grams are a type of language model used to predict the next word in a sequence of words.
• Unsmoothed N-grams estimate probabilities directly from raw counts in the training data (maximum likelihood estimation), with no adjustment for sequences that were never observed.
• This means an unsmoothed model assigns zero probability to any N-gram that does not appear in the training data, even if the word sequence is perfectly valid.
• Smoothed N-grams adjust the raw counts so that some probability mass is reserved for rare and unseen N-grams (for example Laplace/add-one, Good-Turing, or Kneser-Ney smoothing).
• This means a smoothed model can assign a small, non-zero probability to sequences it has never seen.
• The main advantage of unsmoothed N-grams is their simplicity: the probabilities exactly reflect the frequencies observed in the training data.
• Their main disadvantage is poor generalization: a single unseen N-gram makes the probability of an entire sentence zero, which is a serious problem with sparse training data.
• The main advantage of smoothed N-grams is that they generalize better to new text, because rare and unseen events still receive some probability.
• Their main disadvantage is that smoothing slightly distorts the probabilities of frequently observed N-grams, since probability mass is taken from seen events and given to unseen ones.
• In general, unsmoothed N-grams are adequate only when the training data covers almost every sequence the model will meet, while smoothed N-grams are preferred in practice because real text always contains unseen word combinations (a short sketch follows this list).
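To make the difference concrete, here is a minimal Python sketch (the toy corpus and function names are assumptions for illustration, not part of the course notes) that estimates bigram probabilities with and without add-one (Laplace) smoothing:

```python
from collections import Counter

# Toy corpus, assumed for illustration only
corpus = "there was heavy rain there was heavy flood".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)  # vocabulary size

def p_unsmoothed(w_prev, w):
    """Maximum-likelihood (unsmoothed) bigram probability."""
    return bigrams[(w_prev, w)] / unigrams[w_prev]

def p_laplace(w_prev, w):
    """Add-one (Laplace) smoothed bigram probability."""
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

print(p_unsmoothed("heavy", "rain"))   # seen bigram: 0.5
print(p_unsmoothed("heavy", "snow"))   # unseen bigram: exactly 0.0
print(p_laplace("heavy", "snow"))      # unseen bigram: small but non-zero
```

The unsmoothed estimate gives the unseen bigram "heavy snow" zero probability, while the smoothed estimate reserves a little probability mass for it.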
2.2 Analyze N-grams
N-gram is a sequence of N words used in the modeling of NLP. Consider an example statement for modeling: "I love reading history books and watching documentaries". In a one-gram or unigram, there is a one-word sequence; for the above statement the unigrams are "I", "love", "reading", "history", "books", "and", "watching", "documentaries". In a two-gram or bigram, there is a two-word sequence, i.e. "I love", "love reading", or "history books". In a three-gram or trigram, there is a three-word sequence, i.e. "I love reading", "reading history books", or "and watching documentaries".
[Figure: illustration of N-gram modeling for N = 1, 2, 3]


Given the previous N-1 words, an N-gram model predicts the most probable word to follow the sequence. The model is a probabilistic language model trained on a collection of text. It is useful in applications such as speech recognition and machine translation. A simple model has limitations that can be improved with smoothing, interpolation, and backoff. So, the N-gram language model is about finding probability distributions over sequences of words. Consider the sentences "There was heavy rain" and "There was heavy flood". From experience we can say that the first statement sounds better. The N-gram language model tells us that "heavy rain" occurs more frequently than "heavy flood", so the first statement is more likely to occur and will be selected by the model. In a one-gram (unigram) model, the model relies only on how often each word occurs, without considering the previous words. In a 2-gram (bigram) model, only the previous word is considered when predicting the current word. In a 3-gram (trigram) model, the two previous words are considered.
In the N-gram language model the following probabilities are calculated:
P("There was heavy rain") = P("There", "was", "heavy", "rain") = P("There") P("was" | "There") P("heavy" | "There was") P("rain" | "There was heavy")

Since it is not practical to estimate conditional probabilities with such long histories, the "Markov assumption" is used to approximate this with a bigram model:
P("There was heavy rain") ≈ P("There") P("was" | "There") P("heavy" | "was") P("rain" | "heavy")
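As a minimal sketch of this bigram approximation (the toy corpus is an assumption for illustration; unseen words would need smoothing as in Section 2.1):

```python
from collections import Counter

# Assumed toy corpus for illustration
corpus = "there was heavy rain there was heavy rain there was heavy flood".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
N = len(corpus)

def sentence_prob_bigram(sentence):
    """P(w1..wn) ≈ P(w1) * Π P(wi | wi-1) under the Markov assumption."""
    words = sentence.lower().split()
    prob = unigrams[words[0]] / N                      # P(w1)
    for prev, cur in zip(words, words[1:]):
        prob *= bigrams[(prev, cur)] / unigrams[prev]  # P(cur | prev)
    return prob

print(sentence_prob_bigram("There was heavy rain"))   # higher probability
print(sentence_prob_bigram("There was heavy flood"))  # lower probability
```

Because "heavy rain" is counted more often than "heavy flood" in the toy corpus, the first sentence receives the higher probability, which is exactly the selection behaviour described above.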

2.3 Describe Interpolation and Backoff-Word Classes


Interpolation and backoff-word classes are two techniques used in natural language processing (NLP) to improve the accuracy of language models.
Interpolation estimates an n-gram probability as a linear combination of all the lower-order probabilities. For instance, a 4-gram probability can be estimated using a combination of trigram, bigram and unigram probabilities. The weights with which these are combined can be estimated by reserving some part of the corpus (held-out data) for this purpose. In effect, the prediction is a weighted average of the predictions from the different models, with the weights reflecting how reliable each model is.
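A minimal sketch of linear interpolation (the λ weights and the reuse of the toy counts from the earlier sketch are assumptions; in practice the weights would be tuned on held-out data):

```python
from collections import Counter

corpus = "there was heavy rain there was heavy rain there was heavy flood".split()

unigrams = Counter(corpus)
bigrams  = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
N = len(corpus)

# Interpolation weights (assumed values; normally estimated on held-out data)
L1, L2, L3 = 0.1, 0.3, 0.6

def p_interpolated(w1, w2, w3):
    """P(w3 | w1 w2) as a weighted mix of trigram, bigram and unigram estimates."""
    p_uni = unigrams[w3] / N
    p_bi  = bigrams[(w2, w3)] / unigrams[w2] if unigrams[w2] else 0.0
    p_tri = trigrams[(w1, w2, w3)] / bigrams[(w1, w2)] if bigrams[(w1, w2)] else 0.0
    return L3 * p_tri + L2 * p_bi + L1 * p_uni

print(p_interpolated("was", "heavy", "rain"))   # all three orders contribute
print(p_interpolated("was", "heavy", "there"))  # trigram and bigram unseen: unigram still contributes
```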
Backoff-word classes are a technique used to improve the accuracy of language models for rare words. A hierarchy of word classes is created, with the most common words in the highest class. When a rare word is encountered, the language model first tries to find a word in the same class; if no word is found in that class, it backs off to the next highest class, and so on.
Interpolation and backoff-word classes are both effective techniques for improving the accuracy of language models.

While backoff considers each lower order one at a time, interpolation considers all the
lower order probabilities together.
However, interpolation is more computationally expensive than backoff-word classes.
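To contrast with interpolation, here is a minimal sketch of order-based backoff in the style of "stupid backoff" (a simplified variant, not the class-hierarchy scheme described above; it reuses the counts from the interpolation sketch, and the 0.4 discount factor is an assumption):

```python
def p_backoff(w1, w2, w3, alpha=0.4):
    """Fall back to lower orders one at a time, discounting by alpha at each step."""
    if bigrams[(w1, w2)] and trigrams[(w1, w2, w3)]:
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]     # trigram is seen: use it
    if unigrams[w2] and bigrams[(w2, w3)]:
        return alpha * bigrams[(w2, w3)] / unigrams[w2]       # otherwise back off to the bigram
    return alpha * alpha * unigrams[w3] / N                   # otherwise back off to the unigram

print(p_backoff("was", "heavy", "rain"))   # trigram seen, no backoff needed
print(p_backoff("was", "heavy", "there"))  # backs off all the way to the unigram
```

Unlike interpolation, which always mixes every order, backoff only consults a lower-order model when the higher-order count is missing, which is why it is cheaper to compute.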

2.4 Explain Part-of-Speech Tagging


Part-of-speech (POS) tagging is a process in natural language processing (NLP) where each word in a text is labeled with its corresponding part of speech. This can include nouns, verbs, adjectives, and other grammatical categories.
POS tagging is useful for a variety of NLP tasks, such as information extraction, named
entity recognition, and machine translation. It can also be used to identify the grammatical
structure of a sentence and to disambiguate words that have multiple meanings.
POS tagging is typically performed using machine learning algorithms, which are trained
on a large annotated corpus of text. The algorithm learns to predict the correct POS tag for
a given word based on the context in which it appears.
There are various POS tagging schemes that have been developed, each with its own set
of tags and rules. Some common POS tagging schemes include the Penn Treebank
tagset and the Universal Dependencies tagset.
Let’s take an example,
Text: “The cat sat on the mat.”
POS tags:
 The: determiner
 cat: noun
 sat: verb
 on: preposition
 the: determiner
 mat: noun
In this example, each word in the sentence has been labeled with its corresponding part of
speech. The determiner “the” is used to identify specific nouns, while the noun “cat” refers
to a specific animal. The verb “sat” describes an action, and the preposition “on” describes
the relationship between the cat and the mat.
POS tagging is a useful tool in natural language processing (NLP) because it allows algorithms to understand the grammatical structure of a sentence and to disambiguate words that have multiple meanings.
Identifying the part of speech of a word is not just a matter of mapping each word to a fixed POS tag. The same word may take different part-of-speech tags in different contexts, so a single common mapping from words to tags is not possible.
For a huge corpus, manually finding the part of speech of each word is not a scalable solution, as the tagging itself might take days. This is why we rely on tool-based POS tagging.
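As a small illustration of tool-based tagging, the sketch below assumes the NLTK library is installed and its tokenizer and tagger models are available (the exact resource names can vary with the NLTK version, and NLTK is only one of several possible tools):

```python
import nltk

# One-time downloads of the tokenizer and tagger models (assumed available)
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

text = "The cat sat on the mat."
tokens = nltk.word_tokenize(text)
print(nltk.pos_tag(tokens))
# Expected output (Penn Treebank tags), roughly:
# [('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'), ('on', 'IN'),
#  ('the', 'DT'), ('mat', 'NN'), ('.', '.')]
```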

2.5 Differentiate Rule-based stochastic and Transformation-based tagging


Rule-based POS Tagging
One of the oldest techniques of tagging is rule-based POS tagging. Rule-based taggers use a dictionary or lexicon to obtain the possible tags for each word. If a word has more than one possible tag, rule-based taggers use hand-written rules to identify the correct one. Disambiguation is performed by analyzing the linguistic features of a word along with its preceding and following words. For example, if the preceding word is an article, then the word in question must be a noun.
As the name suggests, all such information in rule-based POS tagging is coded in the form of rules.
These rules may be either −
 Context-pattern rules

 Or, as regular expressions compiled into finite-state automata, intersected with a lexically ambiguous sentence representation.
We can also understand Rule-based POS tagging by its two-stage architecture −
 First stage − In the first stage, it uses a dictionary to assign each word a list of
potential parts-of-speech.
 Second stage − In the second stage, it uses large lists of hand-written disambiguation rules to narrow the list down to a single part-of-speech for each word, as in the sketch below.
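A minimal sketch of this two-stage idea (the toy lexicon, tag names, and hand-written rules are all assumptions for illustration, not an actual tagger):

```python
# Stage 1: a small lexicon mapping words to their possible tags (assumed)
lexicon = {"the": ["DET"], "book": ["NOUN", "VERB"],
           "flies": ["NOUN", "VERB"], "quickly": ["ADV"]}

def rule_based_tag(words):
    tagged = []
    for i, w in enumerate(words):
        candidates = lexicon.get(w, ["NOUN"])        # stage 1: dictionary lookup
        tag = candidates[0]
        if len(candidates) > 1:                      # stage 2: hand-written rules
            prev = tagged[i - 1][1] if i > 0 else None
            if prev == "DET" and "NOUN" in candidates:
                tag = "NOUN"                         # after an article, prefer a noun
            elif prev == "NOUN" and "VERB" in candidates:
                tag = "VERB"                         # after a noun, prefer a verb
        tagged.append((w, tag))
    return tagged

print(rule_based_tag("the book flies quickly".split()))
# [('the', 'DET'), ('book', 'NOUN'), ('flies', 'VERB'), ('quickly', 'ADV')]
```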
Stochastic POS Tagging
Another technique of tagging is Stochastic POS Tagging. Now, the question that arises
here is which model can be stochastic. The model that includes frequency or probability
(statistics) can be called stochastic. Any number of different approaches to the problem of
part-of-speech tagging can be referred to as stochastic tagger.
The simplest stochastic tagger applies the following approaches for POS tagging −
Word Frequency Approach
In this approach, the stochastic taggers disambiguate the words based on the probability
that a word occurs with a particular tag. We can also say that the tag encountered most
frequently with the word in the training set is the one assigned to an ambiguous instance
of that word. The main issue with this approach is that it may yield inadmissible sequences of tags.
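A minimal sketch of this word-frequency approach (the tiny tagged training set and function names are assumptions for illustration):

```python
from collections import Counter, defaultdict

# Assumed toy training data: (word, tag) pairs
training = [("the", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
            ("the", "DET"), ("dog", "NOUN"), ("runs", "VERB"),
            ("run", "VERB"), ("run", "NOUN"), ("run", "VERB")]

tag_counts = defaultdict(Counter)
for word, tag in training:
    tag_counts[word][tag] += 1

def most_frequent_tag(word):
    """Assign the tag seen most often with this word in the training set."""
    if word in tag_counts:
        return tag_counts[word].most_common(1)[0][0]
    return "NOUN"  # naive default for unknown words

print(most_frequent_tag("run"))   # VERB (2 of its 3 training occurrences)
print(most_frequent_tag("cat"))   # NOUN
```

Because each word is tagged in isolation, such a tagger can easily produce a tag sequence that is grammatically inadmissible, which is exactly the weakness noted above.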
Tag Sequence Probabilities
It is another approach of stochastic tagging, where the tagger calculates the probability of
a given sequence of tags occurring. It is also called n-gram approach. It is called so
because the best tag for a given word is determined by the probability at which it occurs
with the n previous tags.
Transformation-based Tagging
Transformation-based tagging is also called Brill tagging. It is an instance of transformation-based learning (TBL), a rule-based algorithm for automatically assigning POS tags to the given text. TBL allows us to express linguistic knowledge in a readable form and transforms one state into another by applying transformation rules.
It draws inspiration from both of the previously explained taggers − rule-based and stochastic. Like a rule-based tagger, it is based on rules that specify what tags need to be assigned to what words. Like a stochastic tagger, it is a machine learning technique in which the rules are automatically induced from data.
Working of Transformation-Based Learning (TBL)
In order to understand the working and concept of transformation-based taggers, we need
to understand the working of transformation-based learning. Consider the following steps
to understand the working of TBL −
 Start with the solution − The TBL usually starts with some solution to the problem
and works in cycles.
 Most beneficial transformation chosen − In each cycle, TBL will choose the
most beneficial transformation.
 Apply to the problem − The transformation chosen in the last step will be applied
to the problem.
The algorithm stops when the transformation selected in step 2 no longer adds value, or when there are no more transformations to be selected. This kind of learning is best suited to classification tasks.


Rule-based tagging
 These taggers are knowledge-driven taggers.
 The rules in rule-based POS tagging are built manually.
 The information is coded in the form of rules.
 There is a limited number of rules, approximately around 1000.
 Smoothing and language modeling are defined explicitly in rule-based taggers.

Stochastic tagging
 This POS tagging is based on the probability of a tag occurring.
 It requires a training corpus.
 There is no probability for words that do not exist in the corpus.
 It uses a different testing corpus (other than the training corpus).
 It is the simplest POS tagging because it chooses the most frequent tag associated with a word in the training corpus.

Transformation-based tagging
 We learn a small set of simple rules, and these rules are enough for tagging.
 Development as well as debugging is very easy in TBL because the learned rules are easy to understand.
 Complexity in tagging is reduced because TBL interlaces machine-learned and human-generated rules.
 A transformation-based tagger is much faster than a Markov-model tagger.
 Transformation-based learning (TBL) does not provide tag probabilities.
 In TBL, the training time is very long, especially on large corpora.

2.6 Identify the Issues in PoS tagging


 The main problem with POS tagging is ambiguity. In English, many common words
have multiple meanings and therefore multiple POS. The job of a POS tagger is to
resolve this ambiguity accurately based on the context of use.
For example, the word "shot" can be a noun or a verb. When used as a verb, it could
be in past tense or past participle.
 Context: The part-of-speech tag of a word can depend on the context in which it is used.
For example, the word "run" can be a noun (a race) or a verb (to move quickly), as illustrated in the sketch after this list.
 Variation: The part-of-speech tags of words can vary depending on the dialect or style of the text.
For example, "gonna" is an informal contraction of "going to"; dialectal and informal spellings like this may not fit a standard tagset cleanly, so taggers must handle them specially.

Despite these challenges, part-of-speech tagging is an important task in NLP. It can be used for a variety of tasks, such as machine translation, information retrieval, and sentiment analysis.

2.7 Compare Hidden Markov and Maximum Entropy models


Hidden Markov Model (HMM)
 HMM is a generative model: words are modelled as observations generated from hidden states (the tags).
 It can also be said that HMM uses the joint probability, maximizing the probability of the word sequence.
 In HMM, for the tag-sequence decoding problem, the probabilities are obtained by training on a text corpus.
 HMM is not flexible when it comes to adding features.

Maximum Entropy Model (MEMM)
 MEMM is a discriminative model: it directly uses the posterior probability P(T|W), that is, the probability of a tag sequence given a word sequence.
 MEMM uses conditional probability, conditioned on the previous tag and the current word.
 In MEMM, we build a distribution by adding features, which can be hand-crafted or picked out by training. The idea is to select the maximum entropy distribution given the constraints specified by the features.
 MEMM is more flexible because we can add features such as capitalization, hyphens or word endings, which are hard to consider in HMM. MEMM also allows for diverse, non-independent features.
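To illustrate the HMM side of this comparison, here is a minimal Viterbi decoding sketch over a tiny hand-specified HMM (all tags and probabilities are assumptions for illustration, not trained values):

```python
# Toy HMM: two tags with hand-set probabilities (assumed for illustration)
tags = ["NOUN", "VERB"]
start = {"NOUN": 0.6, "VERB": 0.4}                      # P(tag at position 0)
trans = {("NOUN", "NOUN"): 0.3, ("NOUN", "VERB"): 0.7,  # P(next tag | previous tag)
         ("VERB", "NOUN"): 0.6, ("VERB", "VERB"): 0.4}
emit = {("NOUN", "dogs"): 0.5, ("NOUN", "run"): 0.1,    # P(word | tag)
        ("VERB", "dogs"): 0.05, ("VERB", "run"): 0.5}

def viterbi(words):
    """Most probable tag sequence T maximizing the joint P(T, W)."""
    best = [{t: (start[t] * emit.get((t, words[0]), 1e-6), [t]) for t in tags}]
    for w in words[1:]:
        column = {}
        for t in tags:
            p, path = max(
                (best[-1][pt][0] * trans[(pt, t)] * emit.get((t, w), 1e-6),
                 best[-1][pt][1] + [t])
                for pt in tags)
            column[t] = (p, path)
        best.append(column)
    return max(best[-1].values())[1]

print(viterbi(["dogs", "run"]))   # expected: ['NOUN', 'VERB']
```

The HMM scores the joint probability of tags and words through its transition and emission tables; a MEMM would instead model the probability of each tag given the previous tag, the current word, and any extra features directly.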
