POS Tagging (1)
Tagging is a kind of classification that may be defined as the automatic assignment of descriptors to tokens. Here the descriptor is called a tag, and it may represent part-of-speech information, semantic information and so on.
Part-of-Speech (PoS) tagging, then, may be defined as the process of assigning one of the parts of speech to a given word; it is generally called POS tagging. In simple words, POS tagging is the task of labelling each word in a sentence with its appropriate part of speech. We already know that parts of speech include nouns, verbs, adverbs, adjectives, pronouns, conjunctions and their sub-categories.
Most POS tagging approaches fall under Rule-based POS tagging, Stochastic POS tagging or Transformation-based tagging.
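As a quick, hedged illustration, the following sketch tags a sentence with NLTK's off-the-shelf perceptron tagger (a stochastic tagger). It assumes NLTK is installed; the exact names of the data packages it downloads can vary slightly between NLTK versions.

# Sketch: POS tagging with NLTK's built-in tagger.
# Assumes nltk is installed; data package names may differ by NLTK version.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('quick', 'JJ'), ('fox', 'NN'), ...] (exact tags depend on the model)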
Transformation-based Tagging
Transformation-based tagging is also called Brill tagging. It is an instance of transformation-based learning (TBL), a rule-based algorithm for automatically assigning POS tags to the given text. TBL captures linguistic knowledge in a readable form and transforms one state to another by applying transformation rules.
It draws inspiration from both of the previously explained taggers, rule-based and stochastic. Like a rule-based tagger, it relies on rules that specify which tags should be assigned to which words. Like a stochastic tagger, it is a machine-learning technique in which the rules are automatically induced from data.
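As a hedged sketch of how such rules are induced in practice, the code below trains a Brill tagger with NLTK on a slice of the Penn Treebank sample, starting from a simple unigram baseline. The corpus, the fntbl37 template set and the rule limit are illustrative choices, not anything prescribed by the text.

# Sketch: inducing Brill transformation rules with NLTK.
# Assumes nltk is installed and the 'treebank' sample has been fetched
# via nltk.download('treebank'); corpus and settings are illustrative.
from nltk.corpus import treebank
from nltk.tag import DefaultTagger, UnigramTagger
from nltk.tag.brill import fntbl37
from nltk.tag.brill_trainer import BrillTaggerTrainer

train_sents = treebank.tagged_sents()[:3000]

# Initial state: a unigram lookup tagger, falling back to NN for unknown words.
baseline = UnigramTagger(train_sents, backoff=DefaultTagger("NN"))

# Learn up to 10 transformation rules that correct the baseline's mistakes.
trainer = BrillTaggerTrainer(baseline, fntbl37(), trace=0)
brill_tagger = trainer.train(train_sents, max_rules=10)

for rule in brill_tagger.rules():        # the induced, human-readable rules
    print(rule)
print(brill_tagger.tag("The dogs bark loudly".split()))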
Example
Hidden Markov Models (HMMs), which underlie stochastic tagging, can be illustrated with a coin-tossing experiment. Suppose a sequence of hidden coin-tossing experiments is carried out and we see only the observation sequence of heads and tails. The actual details of the process, such as how many coins are used and the order in which they are selected, are hidden from us. By observing this sequence of heads and tails alone, we can build several HMMs to explain it. Following is one form of Hidden Markov Model for this problem −
We assume that there are two states in the HMM, each corresponding to the selection of a different biased coin. The following matrix gives the state transition probabilities −
A = [ a11  a12 ]
    [ a21  a22 ]
Here,
aij = the probability of a transition from state i to state j.
a11 + a12 = 1 and a21 + a22 = 1
P1 = the probability of heads for the first coin, i.e. the bias of the first coin.
P2 = the probability of heads for the second coin, i.e. the bias of the second coin.
We can also create an HMM model assuming that there are 3 coins or more.
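To make the two-coin model concrete, here is a minimal simulation sketch in NumPy. The numeric values chosen for A, P1, P2 and the initial distribution are illustrative assumptions, not values given above.

# Sketch: generating a heads/tails sequence from the two-coin HMM.
# The hidden state is which biased coin is tossed; we observe only H/T.
import numpy as np

rng = np.random.default_rng(0)

A  = np.array([[0.7, 0.3],      # a11, a12: transitions out of state 1 (coin 1)
               [0.4, 0.6]])     # a21, a22: transitions out of state 2 (coin 2)
P  = np.array([0.9, 0.2])       # P1, P2: probability of heads for each coin
pi = np.array([0.5, 0.5])       # initial state distribution

def simulate(T):
    """Return the hidden state sequence and the observed H/T sequence."""
    states, obs = [], []
    s = rng.choice(2, p=pi)                 # pick the first coin
    for _ in range(T):
        obs.append("H" if rng.random() < P[s] else "T")
        states.append(s + 1)                # 1-based, matching the text
        s = rng.choice(2, p=A[s])           # move to the next coin
    return states, obs

hidden, observed = simulate(10)
print("hidden coins :", hidden)
print("observations :", "".join(observed))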
This way, we can characterize an HMM by the following elements −
N, the number of states in the model (in the above example N = 2, only two states).
M, the number of distinct observation symbols that can appear in each state (in the above example M = 2, i.e., H or T).
A, the state transition probability distribution (the matrix A in the above example).
P, the probability distribution of the observable symbols in each state (in our example, P1 and P2).
I, the initial state distribution.
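Given these five elements, one common way to score how well a particular model explains an observed heads/tails sequence is the forward recursion for P(observations | model). The sketch below reuses the illustrative numbers from the simulation above and is only one possible formulation, not something prescribed by the text.

# Sketch: forward algorithm, P(observation sequence | HMM) for the coin model.
import numpy as np

A  = np.array([[0.7, 0.3], [0.4, 0.6]])   # state transition probabilities
B  = np.array([[0.9, 0.1],                # state 1: P(H), P(T), i.e. P1 and 1 - P1
               [0.2, 0.8]])               # state 2: P(H), P(T), i.e. P2 and 1 - P2
pi = np.array([0.5, 0.5])                 # initial state distribution
sym = {"H": 0, "T": 1}                    # the M = 2 observation symbols

def likelihood(obs):
    """Sum over all hidden state paths of P(path, obs) given the model."""
    alpha = pi * B[:, sym[obs[0]]]            # initialise with the first symbol
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, sym[o]]    # propagate through A, then emit
    return alpha.sum()

print(likelihood("HHTHTT"))   # probability this model assigns to the sequence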
When an HMM is used for POS tagging, the hidden states are the tags and the observations are the words. The tagger looks for the tag sequence C1,..., CT that is most probable given the word sequence W1,..., WT; by Bayes' rule, and because the probability of the words themselves does not depend on the chosen tags, this is equivalent to finding the sequence that maximizes
PROB (C1,..., CT) * PROB (W1,..., WT | C1,..., CT) (1)
Two simplifying assumptions make the two probabilities in equation (1) tractable to estimate.
First Assumption
The first probability in equation (1) is approximated by assuming that the probability of a tag depends only on the previous tag (bigram model), on the previous two tags (trigram model) or, in general, on the previous n-1 tags (n-gram model), which, mathematically, can be expressed as follows −
PROB (C1,..., CT) = Πi=1..T PROB (Ci | Ci-n+1,..., Ci-1) (n-gram model)
PROB (C1,..., CT) = Πi=1..T PROB (Ci | Ci-1) (bigram model)
The beginning of a sentence can be accounted for by assuming an initial probability for each tag.
PROB (C1 | C0) = PROBinitial (C1)
Second Assumption
The second probability in equation (1) can be approximated by assuming that a word appears in a category independently of the words in the preceding or succeeding categories, which can be expressed mathematically as follows −
PROB (W1,..., WT | C1,..., CT) = Πi=1..T PROB (Wi | Ci)
Now, on the basis of the above two assumptions, our goal reduces to finding a sequence C
which maximizes
Πi=1...T PROB(Ci|Ci-1) * PROB(Wi|Ci)
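To show what maximizing this product means, the following sketch scores every candidate tag sequence for a tiny two-word sentence by brute force. The tag set and all probability values are made-up illustrative numbers; in practice the probabilities would be estimated from a corpus as described below, and the search would use dynamic programming rather than enumeration.

# Sketch: choose the tag sequence C maximising
#   prod_i PROB(Ci | Ci-1) * PROB(Wi | Ci)
# by exhaustive search. All numbers below are hypothetical illustrations.
from itertools import product

TAGS = ["NOUN", "VERB"]

# PROB(Ci | Ci-1); "<s>" is the pseudo-tag C0 for the start of the sentence.
trans = {("<s>", "NOUN"): 0.8, ("<s>", "VERB"): 0.2,
         ("NOUN", "NOUN"): 0.3, ("NOUN", "VERB"): 0.7,
         ("VERB", "NOUN"): 0.6, ("VERB", "VERB"): 0.4}

# PROB(Wi | Ci), the lexical probabilities.
emit = {("dogs", "NOUN"): 0.2, ("dogs", "VERB"): 0.01,
        ("bark", "NOUN"): 0.02, ("bark", "VERB"): 0.3}

def score(words, tags):
    p, prev = 1.0, "<s>"
    for w, c in zip(words, tags):
        p *= trans[(prev, c)] * emit[(w, c)]
        prev = c
    return p

words = ["dogs", "bark"]
best = max(product(TAGS, repeat=len(words)), key=lambda tags: score(words, tags))
print(best, score(words, best))   # expected best sequence: ('NOUN', 'VERB')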
The question that arises here is whether converting the problem to the above form has really helped us. The answer is yes, it has. If we have a large tagged corpus, then the two probabilities in the above formula can be calculated as −
PROB (Ci = VERB | Ci-1 = NOUN) = (# of instances where VERB follows NOUN) / (# of instances where NOUN appears) (2)
PROB (Wi | Ci) = (# of instances where Wi appears with tag Ci) / (# of instances where Ci appears) (3)
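The counting in equations (2) and (3) is straightforward to implement. The sketch below estimates both probabilities from a tiny hand-made tagged corpus; the corpus itself is an illustrative assumption.

# Sketch: estimating PROB(Ci | Ci-1) and PROB(Wi | Ci) by counting,
# exactly as in equations (2) and (3). The toy corpus is hypothetical.
from collections import Counter

corpus = [  # a list of (word, tag) sentences
    [("dogs", "NOUN"), ("bark", "VERB")],
    [("cats", "NOUN"), ("chase", "VERB"), ("dogs", "NOUN")],
]

tag_count = Counter()        # of instances where tag C appears
bigram_count = Counter()     # of instances where tag Cj follows tag Ci
word_tag_count = Counter()   # of instances where word W appears with tag C

for sent in corpus:
    prev = None
    for word, tag in sent:
        tag_count[tag] += 1
        word_tag_count[(word, tag)] += 1
        if prev is not None:
            bigram_count[(prev, tag)] += 1
        prev = tag

def prob_transition(ci, prev):           # equation (2)
    return bigram_count[(prev, ci)] / tag_count[prev]

def prob_emission(word, ci):             # equation (3)
    return word_tag_count[(word, ci)] / tag_count[ci]

print(prob_transition("VERB", "NOUN"))   # P(VERB | NOUN) = 2 / 3 on this corpus
print(prob_emission("dogs", "NOUN"))     # P(dogs | NOUN) = 2 / 3 on this corpus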