A Review of Techniques For Morphological Analysis in Natural Language Processing

https://doi.org/10.58506/ajstss.v1i2.11
https://journals.must.ac.ke © 2022 The Authors. Published by Meru University of Science and Technology
This article is published on an open access basis under the CC BY-SA 4.0 license.
Introduction

Morphological analysis is the study of how words in a language are formed by combining morphemes (Ding et al., 2019; Premjith et al., 2018; Yambao & Cheng, 2020). It is a subfield of semantic analysis comprising the following subtasks: morphological segmentation, lemmatization, POS tagging and stemming.

Tasks in morphological analysis

Morphological segmentation is an NLP task that involves dissecting words into their constituent morphemes (Liu et al., 2021; Wang et al., 2019; Yang et al., 2019). It is an important task because it alleviates out-of-vocabulary and data sparsity problems (Liu et al., 2021). Lemmatization is the process of converting an inflected word, like talked, to its citation form, which is talk (Malaviya et al., 2019), using a lemmatizer (Ingolfsdottir et al., 2019). This task aims at reducing a given word to the form that represents its entry in a dictionary (Chary et al., 2019). POS tagging aims at assigning each word a tag label that indicates its syntactic role in a sentence, such as plural, verb or noun. The tags identify the grammatical function a word plays in the sentence (Ayana, 2015). POS tagging is similar to chunking (Kudo & Matsumoto, 2002). Two approaches can be used to perform POS tagging (Tukur et al., 2020): the rule-based approach, where rules of the language are handwritten and manually verified, and the corpus-based or feature-engineering approach, which learns from a training dataset that acts as the knowledge resource. Stemming is the process of reducing a word to its stem, which is not necessarily its morphological root (Koirala & Shakya, 2020); it reduces a word ending without adhering to morphological rules (Ingolfsdottir et al., 2019). Stemming can be seen as a subtask of lemmatization in which the concern is only to remove the suffix (Patel & Patel, 2019).
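To make the distinction concrete, the short sketch below contrasts stemming with lemmatization using NLTK's off-the-shelf PorterStemmer and WordNetLemmatizer; the tooling is our illustration and is not drawn from the works reviewed here.

```python
# Minimal sketch contrasting stemming with lemmatization, using
# NLTK's off-the-shelf tools (library choice is ours; the surveyed
# works use their own language-specific systems).
# Requires: pip install nltk
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)   # lexicon used by the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["talked", "studies"]:
    print(word,
          "stem:", stemmer.stem(word),                   # crude suffix removal
          "lemma:", lemmatizer.lemmatize(word, pos="v"))  # citation form

# "studies" stems to "studi", which is not a word, while the
# lemmatizer returns the dictionary entry "study"; "talked" maps
# to "talk" under both.
```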
Techniques for morphological analysis

Various techniques have been applied to the morphological analysis of natural languages. These techniques are reviewed here per task; depending on the focus of the technique, different results have been achieved across different test cases.

a) Morphological segmentation

Machine Learning

Three neural models that treat chunks as the basic unit for labeling in a sequence labeling problem were proposed by (Zhai et al., 2017). The first model is a bidirectional LSTM. The second model is also a bidirectional LSTM, used for encoding and sentence representation, and it improves the performance of the first. The third model is a greedy encoder-decoder-pointer framework for segmentation, and it improves on both the first and the second models. The CoNLL 2000 shared task dataset was used as the data source. The third model performed better than the first two, including the baseline model, achieving a 94.72 F1 score. The challenge faced during the experiment was that models I and II failed to consistently improve the final F1 score.

An unsupervised morphological segmentation dataset created by the University of Pennsylvania and the Linguistic Data Consortium for the DARPA LORELEI Program was presented by (Mott et al., 2020). It contains about 2000 tokens for morphological segmentation for each of 9 resource-poor languages, with root information for 7 of the languages. The nine languages are: Akan (2048 tokens), Hindi (2028 tokens), Hungarian (2027 tokens), Indonesian (2035 tokens), Russian (2050 tokens), Spanish (2050 tokens), Swahili (2023 tokens), Tagalog (2001 tokens) and Tamil (2028 tokens). Annotation was conducted in two phases: a first pass done in 2018 and a quality-control pass done in 2019. Four systems were evaluated on the dataset: Morfessor (Creutz & Lagus, 2007), MorphoChain (Narasimhan et al., 2015), ILP and ParaMA (Xu et al., 2018). On Swahili, MorphoChain outperformed the three other counterparts with an F1 measure of 0.4306, closely followed by Morfessor with 0.4320. This can be explained by Morfessor's poor performance on Bantu morphology (Pauw & de Schryver, 2008). A limitation of this resource is that it does not distinguish between inflectional and derivational morphology.

A method of segmenting both Chinese and Japanese using both word-level and character-level information was presented by (Nakagawa, 2004). The datasets used for Chinese word segmentation are the Academia Sinica corpus, the Hong Kong City University corpus and the Beijing University corpus; the dataset used for Japanese segmentation was the RWCP corpus. For comparison, the Bakeoff-1, Bakeoff-2, Bakeoff-3, Maximum Matching and Character Tagging systems were used for Chinese segmentation, while the ChaSen, Maximum Matching and Character Tagging systems were used for Japanese. The model achieved the following F-scores on Chinese segmentation: 0.972 on the Academia Sinica corpus, 0.950 on the Hong Kong City University corpus and 0.954 on the Beijing University corpus, all better than those posted by the benchmark systems. On Japanese segmentation, the model achieved an F-score of 0.993, better than the closest benchmark system, ChaSen, at 0.991. A strength of the model is that its advantage over the benchmark systems is larger on unknown words than on known words.
MorphAGram (Eskander et al., 2020) was presented as a publicly accessible unsupervised framework for morphological segmentation based on Adaptor Grammars (AG) and a previous work (Eskander et al., 2016). The model was evaluated on 12 languages and performed well. The adaptor grammars consist of probabilistic context-free grammars and a caching model. Datasets used for the experiments include 50,000 words from the Morpho Challenge competition (German, Finnish, Turkish and English), 50,000 words of the Georgian Wikipedia (Georgian), 50,000 words of the Arabic PATB corpus (Arabic) and a 2132-word dataset drawn from (Kann et al., 2018) (Wixarika, Mexicanero, Yorem Nokki and Nahuatl). Text segmentation can be either transductive (where a word needs to be in the learner's vocabulary first) or inductive (where a word does not need to be in the learner's vocabulary). The research concluded that inductive text segmentation brought no improvement in performance. Morfessor (Creutz & Lagus, 2007) and MorphoChain (Narasimhan et al., 2014) were chosen as the baselines. When the Boundary Precision Recall metric is used, the model outperforms Morfessor (Creutz & Lagus, 2007) and MorphoChain on all languages. On these metrics, language independence reduces error rates relative to Morfessor by 26.0% and to MorphoChain by 38.0%. However, Morfessor is not well equipped to handle Bantu morphology (Pauw & de Schryver, 2008).

In their morphological segmentation work for the Tigrinya language, (Tedla & Yamamoto, 2018) combine CRFs with LSTMs for detecting morpheme boundaries. Begin (B), Inside (I), Outside (O), Single (S) and End (E) labels are used to annotate morphemes in order to mark morpheme boundaries. A window size of 5 characters was used for word embeddings, and 10-fold cross-validation was performed on a 45,127-token corpus. The BIE tagging strategy achieved the highest F1 score (94.67), followed by the BIES strategy (94.59), then the BIO strategy (90.11) and lastly the BIOES strategy (88.39). The experiment shows that LSTMs performed better than their CRF counterpart, with both outperformed by bidirectional LSTMs. The corpus size was small, which contributed to the poor performance of the BIOES strategy, which requires more detail.
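The boundary-labeling schemes compared here can be reproduced in a few lines. The sketch below (our illustration, not the authors' implementation) converts a word with a known gold segmentation into per-character BIES labels; merging or dropping labels yields the BIE, BIO and BIOES variants.

```python
# Sketch (our illustration, not the authors' code): derive the
# per-character labels of the BIES scheme from a gold morpheme
# segmentation; merging E into I gives the BIO-style variants.
def bies_labels(morphemes):
    labels = []
    for m in morphemes:
        if len(m) == 1:
            labels.append("S")     # single-character morpheme
        else:
            labels.extend(["B"] + ["I"] * (len(m) - 2) + ["E"])
    return labels

word = ["un", "break", "able"]     # hypothetical gold segmentation
print(list("".join(word)))
print(bies_labels(word))           # B E B I I I E B I I E
# Morpheme boundaries fall after every E or S, so a sequence model
# that predicts these labels recovers the segmentation.
```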
A bidirectional LSTM model is presented by (Almuhareb et al., 2019) for the word segmentation of Arabic, with data sourced from the Arabic Treebank. Their character embeddings used a window size of 5, and the model was trained on a 48-million-token dataset. Word segmentation without rewriting achieved an F1 score of 97.65%, but was outperformed when rewriting was used, which improved the F1 score by 4.03%. More training epochs were needed to improve accuracy for tokens in the dataset that appeared less frequently, and even then the model failed to learn the least frequently occurring label.

In their work on morphological segmentation for Persian, (Ansari et al., 2019) use supervised methods trained on a well-labelled manual corpus. In their experiment, the bidirectional LSTM model outperformed the other models with an F score of 90.53, closely followed by the unidirectional LSTM at 88.80. The k-Nearest Neighbor model outperformed all other models in predicting boundaries.

Chinese word segmentation as a tagging problem based on word-internal positions was presented by (Xue, 2003). Tagging is based on maximum entropy. The experiment was branched into two: the first comprising a maximum matching method and serving as the baseline, and the second comprising the maximum entropy model. The dataset used for the experiment was the Xinhua newswire section of the Penn Chinese Treebank; training data consisted of 237,791 words, while the test set consisted of 12,598 words. The maximum entropy model achieved better results than the maximum matching method for the segmentation task, achieving an F-score of 94.98%. When the test set had no new words, the maximum matching method achieved an F-score of 95.15%, compared to a score of 89.77% when the test set contained new words. The model was also capable of segmenting personal names, achieving a recall of 86.86%. The notable challenge with this segmentation approach is that it was not able to accurately segment foreign personal names.
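The character-tagging formulation can be sketched compactly. The toy example below casts segmentation as per-character classification with a maximum entropy model (multinomial logistic regression); the feature template and tag inventory are simplified stand-ins, not Xue's exact configuration.

```python
# Sketch of segmentation as character tagging in the spirit of Xue
# (2003): a maximum entropy model (multinomial logistic regression)
# assigns each character a word-internal position tag. The feature
# template and tag inventory here are simplified assumptions.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def char_features(sent, i):
    return {
        "c0": sent[i],                                  # current character
        "c-1": sent[i - 1] if i > 0 else "<s>",         # left context
        "c+1": sent[i + 1] if i < len(sent) - 1 else "</s>",
    }

train = [["AB", "C"], ["C", "AB"]]   # toy pre-segmented "sentences"
X, y = [], []
for words in train:
    sent = "".join(words)
    tags = [t for w in words
              for t in (["S"] if len(w) == 1
                        else ["B"] + ["M"] * (len(w) - 2) + ["E"])]
    X += [char_features(sent, i) for i in range(len(sent))]
    y += tags

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X, y)
test = "ABC"
print(model.predict([char_features(test, i) for i in range(len(test))]))
```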
A morphological analyzer incorporated into an English-to-Swahili, Russian and Hebrew phrase-based machine translation model was proposed by (Chahuneau et al., 2013). The model first identifies a meaning-bearing stem in the target language and then selects the appropriate inflection using a discriminative model. The translations are generated in short phrases called synthetic phrases, according to rule extraction techniques (Chiang, 2007). Only Russian segmentation was based on supervised methods. The assumption in the unsupervised method was to decompose a word into prefixes, a stem and suffixes; a regular grammar was developed to model possible morphemes in the morphologically rich languages, in which a word comprises a set of prefixes, a stem and a set of suffixes. Inflections are predicted using stochastic gradient descent to maximize the conditional log-likelihood of source-language sentence-feature pairs. A Conditional Random Field (CRF) tagger, trained on sections 02-21 of the Penn Treebank, is used on the source language, together with the TurboParser for dependency parsing, also trained on the Penn Treebank. The Global Voices project and the Helsinki Corpus of Swahili were chosen as the Swahili datasets. The synthetic phrases model outperformed the class-based language model on all test cases. The English-to-Swahili translation task outperformed the other tasks, achieving a BLEU score of about 19.0, followed by Hebrew at about 17.6 and lastly Russian at about 16.2. The strengths of the model are that 1) translation is context-based, 2) it does not require language-specific engineering, and 3) it works with a syntax- or phrase-based decoder without modification. The model is also able to generate unseen inflections (Botha & Blunsom, 2014). The weakness of the model is that the intrinsic inflectional dataset for evaluation was noisy, owing to errors in word alignments, with accuracy on predicting Swahili inflection being 78.2%, higher than Russian (71.2%) but lower than Hebrew (85.5%).
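The discriminative inflection-selection step can be illustrated with a small classifier. The sketch below trains a linear model by stochastic gradient descent on the conditional log-likelihood; the features and inflection classes are invented for illustration and do not reproduce the published feature set.

```python
# Sketch of discriminative inflection selection in the spirit of
# Chahuneau et al. (2013): a linear classifier trained by stochastic
# gradient descent on the conditional log-likelihood picks an
# inflection class for a target-language stem from source-side
# features. Features and classes below are invented for illustration.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# (source-side features, inflection class of the aligned target word)
train = [
    ({"src": "books", "src_pos": "NNS"}, "pl"),
    ({"src": "book",  "src_pos": "NN"},  "sg"),
    ({"src": "cars",  "src_pos": "NNS"}, "pl"),
    ({"src": "car",   "src_pos": "NN"},  "sg"),
]
X = [feats for feats, _ in train]
y = [label for _, label in train]

# loss="log_loss" makes SGD maximize the conditional log-likelihood.
model = make_pipeline(DictVectorizer(),
                      SGDClassifier(loss="log_loss", max_iter=1000))
model.fit(X, y)
print(model.predict([{"src": "dogs", "src_pos": "NNS"}]))  # expected: ['pl']
```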
Rule based approaches

Morphological segmentation is incorporated into the Abu-MaTran project systems for the English-to-Finnish language pair (Sanchez et al., 2016). Segmentation and deep learning address the data scarcity problem and Finnish's complex morphology. The morphological segmentation applied was rule-based, and the Moses toolkit (Koehn et al., 2007) was used to preprocess the training corpora. The corpora used for the experiments included newsdev2015, newstest2015 and an SMT-translated corpus of Finnish to English. The research concluded that rule-based morphological segmentation improved quality for both neural machine translation and statistical machine translation, and that neural machine translation achieves better results than statistical machine translation. The disadvantage of this experiment is that it took at least 5 days to train the models.

The fsm2 finite state method for the automatic analysis of Runyakitara nouns is presented by (Katushemererwe & Issue, 2010). All noun lexemes in the language were built into fsm2. Nouns were extracted from a Runyakitara dictionary and manually coded into noun sub-classes. The model comprises three modules/files: a special symbol file, a noun grammar file and a replacement rule file, which together constitute the finite state transducer. The symbol specification file contains a mapping between human-readable symbols and the integers representing these symbols in the system. The noun grammar file contains quasi context-free grammars. The replace rules are applied to enforce grammatical forms of nouns, such as replacing 'u' with 'w' whenever 'm' occurs to the left of 'u' and either 'a', 'o' or 'i' occurs to its right. Further, the replacement rules can modify the noun by deletion, substitution or insertion of symbols. The system was evaluated on a dataset extracted from a weekly newspaper and an orthography reference book, both in a different language, achieving a precision of 80% on 4472 words and a recall of 80% on the 5599-word corpus.
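The replacement rule quoted above is a context-sensitive rewrite, which can be sketched with a regular expression; fsm2 itself compiles such rules into a transducer, and the test strings here are invented.

```python
# Sketch of one replacement rule: "rewrite u as w when 'm' is to its
# left and 'a', 'o' or 'i' is to its right", approximated with a
# regular expression (fsm2 compiles such rules into a transducer;
# the test strings are invented).
import re

rule = re.compile(r"(?<=m)u(?=[aoi])")

for noun in ["muana", "muoyo", "mundu"]:
    print(noun, "->", rule.sub("w", noun))
# muana -> mwana and muoyo -> mwoyo, while mundu is unchanged
# because the character after 'u' is 'n', not 'a', 'o' or 'i'.
```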
Runyagram, a formal system for the morphological segmentation of Runyakitara verbs based on the fsm2 interpreter, is presented by (Fridah & Thomas, 2010). Just like their similar model for nouns (Katushemererwe & Issue, 2010), the Runyagram finite state transducer comprises a special symbol file, a grammar file and a replacement rule file containing about 34 rules. The verb grammar is defined according to the number of morphemes a verb can take, from minimum to maximum. The grammar is converted into an unweighted finite-state acceptor by converting rules into directed graphs; the grammar, which contained about 330 rules, was thus converted into a finite-state acceptor with about 1200 transitions and about 800 states. The system was tested against 3971 verbs from an orthography reference book and a dictionary of another Bantu language, scoring a recall of 86% and a precision of 82%.

A Setswana tokenizer based on two transducers and a finite-state morphological analyzer was presented by (Pretorius & Pretorius, 2009). The system is mainly oriented towards disjunctive orthography. Morphotactics were developed in the lexc tool of the Xerox finite state tools, while morphological alternations were modeled in the xfst tool. The contents of the lexc and xfst tools are combined into a finite state transducer, which constitutes the morphological analyzer.
Errors in their tokenizer output were presented to humans for examination. 547 Setswana orthographic words were obtained to evaluate the system, with text hand-tokenized by an expert as the benchmark. The tokenizer was able to tokenize 95 orthographic verbs out of 111 tokens drawn from the initial test set that contained more than one orthographic word. The results showed that the overall length of the input tokens improved the general tokenization. The overall F-score of the system was 0.93. The strength of the system was that it could tokenize more input words; its weakness is that the morphological analyzer was underdeveloped and that it was unable to perform tokenization based on the context of the tokens.
A finite state morphological analyzer for Ekegusii verbs was presented by (Elwell, 2008). The model is based on morphemes and implemented in the Xerox finite state tools. All finite and non-finite forms were captured using only one regular expression. The morphosyntax of the verb is realized by specifying the set of morphemes that can occupy each morpheme slot. The challenge with the system is that it poorly handled imbrication, which arises when there is a widened range of verbal roots. The evaluation results of the system are unavailable.
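The slot-based view of verb morphotactics can be made concrete with a toy enumeration: each slot lists the morphemes allowed to fill it, and taking one choice per slot yields candidate verb forms. The slot contents below are hypothetical; Elwell's analyzer encodes the slots in a single Xerox regular expression rather than by enumeration.

```python
# Sketch of slot-based verb morphotactics: each slot lists the
# morphemes allowed to fill it, and taking one choice per slot
# enumerates candidate verb forms. Slot contents are hypothetical.
from itertools import product

slots = [
    ["n", "o", "a"],      # subject-prefix slot (hypothetical morphemes)
    ["", "ka"],           # tense/aspect slot (hypothetical)
    ["gend"],             # verb-root slot (hypothetical)
    ["a", "ire"],         # final-suffix slot (hypothetical)
]

for parts in product(*slots):
    print("".join(parts))   # e.g. ngenda, nkagendire, ...
```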
A rule-based model for stemming Nepali text was presented by (Koirala & Shakya, 2020). A manually annotated corpus was extracted from online news portals, consisting of 4383 articles with 118,056 unique words. To classify news topics, 1400 news articles drawn from sports, global, politics, economy, literature, society and technology were extracted from a Nepali news website and subdivided into a 70% training set and a 30% test set. The research concluded that stemmed classification outperformed the non-stemmed counterpart, with an F1-score difference of 0.02 and a significantly reduced vocabulary size of features.

b) Lemmatization

Machine Learning

Lematus, a system that performs context-based lemmatization using an encoder and decoder, was presented by (Bergmanis & Goldwater, 2018). The system does not use a morphological tagger (Malaviya et al., 2019). The model is based on the Nematus (Sennrich et al., 2017) neural machine translation toolkit. The benchmark systems for the experiment are Morfette (Chrupała et al., 2008), Lemming (Muller et al., 2015) and a context-sensitive lemmatizer based on two bidirectional gated recurrent neural networks (Chakrabarty et al., 2017). The dataset for the experiment was the Universal Dependency Treebank v2.0 dataset with 20 languages. The model achieves 94.9% accuracy, outperforming the benchmarks, with the closest model achieving 94.1%. However, a challenge with the model is that not relying on morphological tags makes the system unrealistic, as morphosyntactic annotation must be available in corpora that have been annotated with token-level lemmata (Malaviya et al., 2019).
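Encoder-decoder lemmatization of this kind is usually framed as character-level transduction. The sketch below shows one plausible way of building Lematus-style training pairs, spelling out the target word together with a window of sentence context; the marker symbols and window size are our assumptions, not the published specification.

```python
# Sketch of building context-sensitive lemmatization training pairs:
# the word to be lemmatized is represented character by character
# together with a window of sentence context, and an encoder-decoder
# maps this input to the lemma's characters. The <lc>/<rc> markers
# and the window size are our assumptions for illustration.
def encode_example(sentence, index, n_context=15):
    word = sentence[index]
    left = " ".join(sentence[:index])[-n_context:]
    right = " ".join(sentence[index + 1:])[:n_context]
    return list(left) + ["<lc>"] + list(word) + ["<rc>"] + list(right)

sentence = ["she", "talked", "to", "them"]
src = encode_example(sentence, 1)   # input sequence for "talked"
tgt = list("talk")                  # decoder target: the lemma
print(src)
print(tgt)
```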
A contextual neural model for lemmatization was presented by (Malaviya et al., 2019). The model employs morphological tagging (assigning words their POS tags and further morphological information (Yildiz & Tantug, 2019)) to provide a summary of the context of the word in the sentence; the input of the lemmatizer is the output of the morphological tagger. The Universal Dependencies Treebanks were the data source for the experiments. The lemmatizer is a 2-layer bidirectional LSTM encoder and a 1-layer bidirectional LSTM decoder consisting of 400 hidden units. The baseline systems for the experiment include Lematus (Bergmanis & Goldwater, 2018), UDPipe (Straka & Strakova, 2017), Lemming (Muller et al., 2015) and Morfette (Chrupała et al., 2008). The experiments show that morphological taggers improve the general performance of lemmatizers. The overall accuracy of the proposed model is 95.42%, better than Lematus at about 95.05%, with all models tested across 20 languages.

A sequence-to-sequence lemmatizer is presented by (Celano, 2020) for the closed EvaLatin shared task. The lemmatizer was implemented in Keras and training spanned 10 epochs. Lemmatization relied on POS tags generated from LightGBM, which serve to disambiguate word forms. The model's accuracy on the development set and test set is 99.82% and 97.63% respectively. This model, however, could not lemmatize Arabic numerals.

Rule based approaches

A rule-based approach to Sinhala lemmatization is presented by (Nandathilaka et al., 2018). Their model relied on a POS tagger to detect the part of speech of a word before lemmatization was performed. Roots were manually annotated based on their role in the sentences, which were derived from social media text. A total of 30 rules were created to guide the lemmatization of nouns. The model was tested on 300 words obtained from Facebook, achieving an accuracy of 73.33%. The weaknesses of this model are that it did not rely on a formal lexicon and that its accuracy depended on how correctly the POS tagger was configured.

A lemmatizer for Gujarati text based on a stemmer is presented by (Patel & Patel, 2019).
The dataset was manually created and contained both the stem and lemma of each word. The model allows new words to be added to the dictionary, and wrong stems can be manually handpicked and rectified. The model accurately performed stemming on 98.33% of the total 2097 words tested, and 239 new words were added to the dictionary. One weakness of the system is that a derived stem could also be the lemma of another word bearing a different meaning. The model can also give erroneous output if a certain inflection is also part of another, totally different inflection, since it uses the shortest-affix-match technique to locate affixes: only one inflection is removed, leaving part of the other inflection behind, and this distorted remainder then appears as part of the stem. The strength of the model lies in the predefined format of the vocabulary, which greatly boosts its results.
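The shortest- versus longest-affix-match design choice at issue in this and the following system can be shown in a few lines; the suffix list below is invented.

```python
# Sketch contrasting shortest- and longest-affix-match stripping,
# the design choice at issue here (the suffix list is invented).
SUFFIXES = ["s", "ness", "lessness"]

def strip_suffix(word, longest=True):
    matches = [s for s in SUFFIXES if word.endswith(s)]
    if not matches:
        return word
    chosen = max(matches, key=len) if longest else min(matches, key=len)
    return word[: -len(chosen)]

word = "carelessness"
print(strip_suffix(word, longest=False))  # 'carelessnes' - distorted stem
print(strip_suffix(word, longest=True))   # 'care'
# Shortest match removes only "s", leaving a distorted remainder, the
# failure mode described above; longest match over-strips instead,
# which is the overfitting risk noted for the Kannada lemmatizer below.
```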
A rule-based lemmatizer for Kannada is proposed by (Prathibha & Padma, 2016). A manual dictionary of both verb and noun roots was created. The model relies on the longest-affix-match technique to locate affixes within a word, and the affixes are also manually collected. The lemmatizer automatically updates the dictionary with new lemmata. The weaknesses of the model are that errors can arise from affixes absent from the vocabulary, rule violations, misspelt input and the overfitting that arises from longest-affix-match; the model can also perform poorly on input with multiple suffixes. Tested on four datasets, the model achieved an average overall accuracy of 93.50%.

A rule-based lemmatizer for Punjabi is presented by (Puri, 2018). The model relies on the synonym replacement algorithm to obtain the lemma of a word based on predefined rules, looking up the shortest synonym of an input word in a dictionary. It also relies on a list of suffixes to guide which affixes to strip from a word. The model achieved an F score of 86 when tested on 10 articles containing a total of 3979 words. The weaknesses of this model are its reliance on named entity recognition during lookup, the small number of words in its database, and the fact that it did not consider the context and part of speech of an input word.
subtask by 1% and the partitioning subtask by 5%,
c) Part-of-Speech (POS) tagging implying that the baseNP subtask is better favored by
reference to actual words. Further, the models im-
Machine Learning proved in accuracy if more words were supplied for
training, achieving 90.5% and 83.5% precision on
Bi-directional long short-term memory models baseNP and partitioning subtasks respectively.
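The architecture shared by the neural taggers in this section can be sketched compactly. The PyTorch model below is a minimal bidirectional LSTM tagger of the kind evaluated here; dimensions and vocabulary sizes are placeholders, and the auxiliary loss of the multi-task variant is omitted for brevity.

```python
# Minimal bidirectional LSTM tagger sketch in PyTorch; dimensions
# and vocabulary sizes are placeholders, not the published setup.
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, tagset_size, emb_dim=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True,
                            batch_first=True)
        self.out = nn.Linear(2 * hidden, tagset_size)  # both directions

    def forward(self, token_ids):
        states, _ = self.lstm(self.embed(token_ids))
        return self.out(states)          # one tag score vector per token

model = BiLSTMTagger(vocab_size=1000, tagset_size=17)  # e.g. 17 UD tags
tokens = torch.randint(0, 1000, (1, 6))   # one dummy 6-token sentence
print(model(tokens).shape)                # torch.Size([1, 6, 17])
```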
The Target Preserved Adversarial Neural Network (TPANN) for POS tagging of Twitter text was presented by (Gui et al., 2017). The POS tagger is based on a bidirectional LSTM, an adversarial network and an autoencoder. The feature extraction component relies on a CNN for character-embedding feature extraction, and the POS tagger itself is a feed-forward classifier with a softmax layer. The datasets used to support the experiments ranged over labeled out-of-domain, labeled in-domain and unlabeled in-domain data. The out-of-domain data comprised the Wall Street Journal portion of the Penn Treebank v3 and was used for training POS tagging. The labeled in-domain data was extracted from three benchmarks for comparison with the proposed method: RIT-Twitter, NPSChat and ARK-Twitter; this data was used to further train and evaluate the POS tagger. The unlabeled data was collected at large scale from Twitter through its application programming interface. The model achieved 94.1% accuracy when evaluated on NPSChat, better than the 90.8% accuracy achieved by previous work; on RIT-Twitter, the model achieved 90.92% accuracy.

Transformation-based learning for chunking is applied by (Ramshaw & Marcus, 1999). The model applies Brill's POS tagger (Brill, 1992) to assign chunk tags to each word based on its POS tag. The experiments relied on data sourced from the Wall Street Journal section of the Penn Treebank, with 50,000 words used in each test set. The experiments subdivide chunking into two subtasks: the baseNP and partitioning chunk tasks. The model shows that using the words themselves, rather than relying only on POS tags, improved the baseNP subtask by 1% and the partitioning subtask by 5%, implying that chunking is better served by reference to actual words. Further, the models improved in accuracy as more words were supplied for training, achieving 90.5% and 83.5% precision on the baseNP and partitioning subtasks respectively.
Semantic/Syntactic Extraction Using a Neural Network Architecture (SENNA), a model that relies on a feed-forward neural network and word embeddings for NLP tasks such as POS tagging, NER, semantic role labelling and chunking, was presented by (Collobert et al., 2011). The model achieved a best F1-score of 94.32%, 0.03% higher than the benchmark system.

An unsupervised algorithm to identify verb arguments, with POS tagging as the only annotation requirement, is proposed by (Abend et al., 2009). MXPOST (Ratnaparkhi, 1996) and a decision-tree-based tagger (Schmid, 1994) were used to extract POS tags for English and Spanish respectively. The English branch of the experiment sourced data from the PropBank corpus: training data consisted of 207 sentences with 132 distinct verbs, and test data comprised 6007 sentences with 1008 distinct verbs. The Spanish branch sourced data from the Spanish Wikipedia, resulting in 200 sentences with 313 verb instances for training and 848 sentences with 1279 verb instances for testing. On English test data, the model achieved an F1 score of 59.14% when using clause detection, compared to the 57.35% score of the baseline system. On Spanish, the model achieved an F1 score of 23.87% when using the collocation maximum F-score, compared to the 21.62% score of the baseline system.
The results achieved by SENNA across POS tagging, chunking, NER and semantic role labelling (Collobert et al., 2011) are summarized in Table 1.

Table 1: SENNA's performance on POS, Chunking, NER, and SRL
Statistical approach

A POS tagger that incorporates Hidden Markov Models and a unigram model was presented by (Tukur et al., 2019). The purpose of the HMM is to assign tags to a sentence based on its context, while the unigram assigns POS tags on a per-word basis. The sentences in the corpus were split into bigrams using a Hidden Markov Model-based sentence analyzer. The system accurately tagged 20% of the words in the corpus. This figure is considerably low considering that the corpus had words drawn from a native website; however, it is commendable as the first POS tagger for the Hausa language.

A technique for tagging parts of speech in Hausa using Hidden Markov Models was proposed by (Tukur et al., 2020). A corpus of Hausa words was used as the knowledge resource. The performance of the system was tested using 187 Hausa words that were presented to a Hausa expert for verification. The system accurately tagged 76.795% of the words, better than the 20% accuracy achieved by the first POS tagger for the Hausa language (Tukur et al., 2019). The challenges the experiment faced were that the corpus lacked enough words and that the system could not correctly tag all the words, with conjunctions being the least correctly tagged at 50%.

A POS tagger based on Hidden Markov Models, implementing the Viterbi algorithm for optimization, was developed by (Mamo & Meshesha, 2011) for Afaan Oromo. To analyze the performance of the model, the corpus was divided into nine folds for training and the remaining one fold for testing; each test set contained about 146 words. The bigram algorithm recorded 91.97% accuracy, while the unigram algorithm correctly tagged 87.5% of the words.
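As a concrete illustration of the decoding step these HMM taggers share, the sketch below implements Viterbi search over invented bigram transition and emission probabilities; it is a didactic toy, not any of the authors' systems.

```python
# Sketch of Viterbi decoding for a bigram HMM tagger of the kind
# used in these systems; the toy probabilities are invented.
def viterbi(words, tags, start, trans, emit):
    # best[i][t]: probability of the best tag path ending in t at word i
    best = [{t: start.get(t, 1e-9) * emit[t].get(words[0], 1e-9)
             for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] * trans[p].get(t, 1e-9)
                  * emit[t].get(words[i], 1e-9)) for p in tags),
                key=lambda x: x[1])
            best[i][t], back[i][t] = score, prev
    t = max(best[-1], key=best[-1].get)      # best final tag
    path = [t]
    for i in range(len(words) - 1, 0, -1):   # follow back-pointers
        t = back[i][t]
        path.append(t)
    return path[::-1]

tags = ["N", "V"]
start = {"N": 0.7, "V": 0.3}
trans = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
emit = {"N": {"dogs": 0.6, "bark": 0.1}, "V": {"bark": 0.7}}
print(viterbi(["dogs", "bark"], tags, start, trans, emit))  # ['N', 'V']
```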
Conditional random fields are applied to capture code-switched pattern sequences in order to tag words extracted from social media text with accurate POS information (Ghosh et al., 2016). The targeted code-switched languages are Bengali, Hindi and Tamil on mostly-English text. The dataset comprised utterances from each of the languages to English, resulting in a total of 44,908 utterances for training and 27,028 utterances for testing. POS tagging was performed using two taggers: first the Stanford POS tagger, and later the Conditional Random Field (CRF) tagger for language identification. Compared to the Stanford model, the CRF performed better on code switches from each language to English, achieving the following accuracies for code-switches to English: Bengali 75.22%, Hindi 73.2% and Tamil 64.83%.

A POS tagger for Bengali using CRFs is proposed by (Ekbal, 2007). 26 POS tags were used for the experiment. The CRF method was chosen as it works better than the Hidden Markov Model (HMM) for languages that lack large annotated corpora. The POS tagger comprises context word features, word suffix, word prefix and named entity recognition. The model was trained using 72,341 words, with the training corpus sourced from the NLPAI_Contest06 and SPSAL2007 data; 20,000 wordforms were presented to the tagger during testing. The model recorded 86.4% accuracy when the bare CRF was used alone; however, the accuracy improved to 90.3% when the CRF was combined with Named Entity Recognition (NER), the Bengali lexicon and unknown-word features. This CRF model thus achieves better results than (Ghosh et al., 2016), even though their CRF was meant for a different purpose.
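The kinds of features these CRF taggers rely on (context words plus affix substrings) can be sketched as a feature-template function. The exact template below is ours, and the training stub assumes the third-party sklearn-crfsuite package rather than the authors' tooling.

```python
# Sketch of the feature types described for a CRF POS tagger:
# context words plus suffix and prefix substrings. The template is
# our assumption, and training uses the third-party sklearn-crfsuite
# package (pip install sklearn-crfsuite).
import sklearn_crfsuite

def word_features(sent, i):
    w = sent[i]
    feats = {
        "word": w,
        "prev": sent[i - 1] if i > 0 else "<s>",       # context words
        "next": sent[i + 1] if i < len(sent) - 1 else "</s>",
    }
    for k in (1, 2, 3):                                # word suffix/prefix
        feats[f"suffix{k}"] = w[-k:]
        feats[f"prefix{k}"] = w[:k]
    return feats

sent = ["ami", "bhat", "khai"]     # toy romanized Bengali sentence
X = [[word_features(sent, i) for i in range(len(sent))]]
y = [["PRP", "NN", "VB"]]          # toy gold tags

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))              # reproduces the toy tags
```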
Rule based approaches

… assigned a tag regardless. The data set for this experiment was rather small.
References

Chakrabarty, A., Pandit, O. A., & Garain, U. (2017). Context sensitive lemmatization using two successive bidirectional gated recurrent networks. ACL 2017 - 55th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers). https://doi.org/10.18653/v1/P17-1136
Chary, M., Parikh, S., Manini, A. F., Boyer, E. W., & Radeos, M. (2019). A review of natural language processing in medical education. Western Journal of Emergency Medicine. https://doi.org/10.5811/westjem.2018.11.39725
Chiang, D. (2007). Hierarchical phrase-based translation. Computational Linguistics. https://doi.org/10.1162/coli.2007.33.2.201
Chrupała, G., Dinu, G., & van Genabith, J. (2008). Learning morphology with Morfette. Proceedings of the 6th International Conference on Language Resources and Evaluation, LREC 2008.
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research.
Creutz, M., & Lagus, K. (2007). Unsupervised models for morpheme segmentation and morphology learning. ACM Transactions on Speech and Language Processing. https://doi.org/10.1145/1217098.1217101
Ding, C., Aye, H. T. Z., Pa, W. P., Nwet, K. T., Soe, K. M., Utiyama, M., & Sumita, E. (2019). Towards Burmese (Myanmar) morphological analysis: Syllable-based tokenization and part-of-speech tagging. ACM Transactions on Asian and Low-Resource Language Information Processing. https://doi.org/10.1145/3325885
Ekbal, A. (2007). Bengali part of speech tagging using Conditional Random Field. Proceedings of Seventh ….
Elwell, R. (2008). Finite state methods for Bantu verb morphology. Computational Linguistics for Less-Studied Languages, X, 56–67.
Eskander, R., Callejas, F., Nichols, E., Klavans, J., & Muresan, S. (2020). MorphAGram: Evaluation and framework for unsupervised morphological segmentation. Aclweb.Org.
Eskander, R., Rambow, O., & Yang, T. (2016). Extending the use of adaptor grammars for unsupervised morphological segmentation of unseen languages. COLING 2016 - 26th International Conference on Computational Linguistics, Proceedings of COLING 2016: Technical Papers.
Fridah, K., & Thomas, H. (2010). Finite state methods in morphological analysis of Runyakitara verbs. Nordic Journal of African Studies.
Ghosh, S., Ghosh, S., & Das, D. (2016). Part-of-speech tagging of code-mixed social media text. https://doi.org/10.18653/v1/w16-5811
Gui, T., Zhang, Q., Huang, H., Peng, M., & Huang, X. (2017). Part-of-speech tagging for Twitter with adversarial neural networks. EMNLP 2017 - Conference on Empirical Methods in Natural Language Processing, Proceedings. https://doi.org/10.18653/v1/d17-1256
Ingolfsdottir, S. L., Loftsson, H., Daðason, J. F., & Bjarnadottir, K. (2019). Nefnir: A high accuracy lemmatizer for Icelandic. http://arxiv.org/abs/1907.11907
Kann, K., Mager, M., Meza-Ruiz, I., & Schutze, H. (2018). Fortification of neural morphological segmentation models for polysynthetic minimal-resource languages. NAACL HLT 2018 - 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference. https://doi.org/10.18653/v1/n18-1005
Katushemererwe, F., & Issue, S. (2010). Fsm2 and the morphological analysis of Bantu nouns – First experiences from Runyakitara. 4(1), 58–69.
Koehn, P., Zens, R., Dyer, C., Bojar, O., Constantin, A., Herbst, E., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., & Moran, C. (2007). Moses: Open source toolkit for statistical machine translation. Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions - ACL '07. https://doi.org/10.3115/1557769.1557821
Koirala, P., & Shakya, A. (2020). A Nepali rule based stemmer and its performance on different NLP applications. http://arxiv.org/abs/2002.09901
Kudo, T., & Matsumoto, Y. (2002). Chunking with support vector machines. Journal of Natural Language Processing. https://doi.org/10.5715/jnlp.9.5_3
Liu, Su, X., Zhang, H., Gao, G., & Bao, F. (2021). Incorporating inner-word and out-word features for Mongolian morphological segmentation. https://doi.org/10.18653/v1/2020.coling-main.408
Malaviya, C., Wu, S., & Cotterell, R. (2019). A simple joint model for improved contextual neural lemmatization. NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference.
Mamo, G., & Meshesha, M. (2011). Parts of speech tagging for Afaan Oromo. International Journal of Advanced Computer Science and Applications. https://doi.org/10.14569/specialissue.2011.010301
Mott, J., Bies, A., Strassel, S., Kodner, J., Richter, C., Xu, H., & Marcus, M. (2020). Morphological segmentation for low resource languages. 3996–4002.
Muller, T., Cotterell, R., Fraser, A., & Schutze, H. (2015). Joint lemmatization and morphological tagging with LEMMING. Conference Proceedings - EMNLP 2015: Conference on Empirical Methods in Natural Language Processing. https://doi.org/10.18653/v1/d15-1272
Nakagawa, T. (2004). Chinese and Japanese word segmentation using word-level and character-level information. https://doi.org/10.3115/1220355.1220422
Nandathilaka, M., Ahangama, S., & Thilini Weerasuriya, G. (2018). A rule-based lemmatizing approach for Sinhala language. 2018 3rd International Conference on Information Technology Research, ICITR 2018. https://doi.org/10.1109/ICITR.2018.8736134
Narasimhan, K., Barzilay, R., & Jaakkola, T. (2015). An unsupervised method for uncovering morphological chains. Transactions of the Association for Computational Linguistics. https://doi.org/10.1162/tacl_a_00130
Narasimhan, K., Karakos, D., Schwartz, R., Tsakalidis, S., & Barzilay, R. (2014). Morphological segmentation for keyword spotting. EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference. https://doi.org/10.3115/v1/d14-1095
Neale, S., Donnelly, K., Watkins, G., & Knight, D. (2019). Leveraging lexical resources and constraint grammar for rule-based part-of-speech tagging in Welsh. LREC 2018 - 11th International Conference on Language Resources and Evaluation.
Patel, H., & Patel, B. (2019). Stemmatizer—Stemmer-based lemmatizer for Gujarati text. Advances in Intelligent Systems and Computing. https://doi.org/10.1007/978-981-13-2285-3_78
Pauw, G., & de Schryver, G. M. (2008). Improving the computational morphological analysis of a Swahili corpus for lexicographic purposes. Lexikos.
Plank, B., Søgaard, A., & Goldberg, Y. (2016). Multilingual part-of-speech tagging with bidirectional long short-term memory models and auxiliary loss. 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016 - Short Papers. https://doi.org/10.18653/v1/p16-2067
Prathibha, R. J., & Padma, M. C. (2016). Design of rule based lemmatizer for Kannada inflectional words. 2015 International Conference on Emerging Research in Electronics, Computer Science and Technology, ICERECT 2015. https://doi.org/10.1109/ERECT.2015.7499024
Premjith, B., Soman, K. P., & Kumar, M. A. (2018). A deep learning approach for Malayalam morphological analysis at character level. Procedia Computer Science. https://doi.org/10.1016/j.procs.2018.05.058
Pretorius, & Bosch, S. E. (2003). Finite-state computational morphology: An analyzer prototype for Zulu. In Machine Translation. https://doi.org/10.1007/s10590-004-2477-4
Pretorius, & Pretorius, L. (2009). Setswana tokenisation and computational verb morphology: Facing the challenge of a disjunctive orthography. Computational Linguistics.
Puri, R. (2018). A rule based approach for lemmatisation of Punjabi text documents. 27(63019), 216–224.
Purnamasari, K. K., & Suwardi, I. S. (2018). Rule-based part of speech tagger for Indonesian language. IOP Conference Series: Materials Science and Engineering. https://doi.org/10.1088/1757-899X/407/1/012151
Ramshaw, L. A., & Marcus, M. P. (1999). Text chunking using transformation-based learning. https://doi.org/10.1007/978-94-017-2390-9_10
Ratnaparkhi, A. (1996). A maximum entropy model for part-of-speech tagging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
Sadia, R., Rahman, M. A., & Seddiqui, M. H. (2019). N-gram statistical stemmer for Bangla corpus. 2–6. http://arxiv.org/abs/1912.11612
Sadredini, E., Guo, D., Bo, C., Rahimi, R., Skadron, K., & Wang, H. (2018). A scalable solution for rule-based part-of-speech tagging on novel hardware accelerators. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. https://doi.org/10.1145/3219819.3219889
Sanchez, Cartagena, V. M., & Toral, A. (2016). Abu-MaTran at WMT 2016 translation task: Deep learning, morphological segmentation and tuning on character sequences. https://doi.org/10.18653/v1/w16-2322
Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. Proceedings of the International Conference on New Methods in Language Processing.
Sennrich, R., Firat, O., Cho, K., Birch, A., Haddow, B., Hitschler, J., Junczys-Dowmunt, M., Laubli, S., Barone, A. V. M., Mokry, J., & Nadejde, M. (2017). Nematus: A toolkit for neural machine translation. 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017 - Proceedings of the Software Demonstrations. https://doi.org/10.18653/v1/e17-3017
Sodhar, I. N., Jalbani, A. H., Channa, M. I., & Hakro, D. N. (2019). Parts of speech tagging of Romanized Sindhi text by applying rule based model. 19(11), 91–96. https://doi.org/10.13140/RG.2.2.35194.03524
Straka, M., & Strakova, J. (2017). Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. CoNLL 2017 - SIGNLL Conference on Computational Natural Language Learning, Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. https://doi.org/10.18653/v1/k17-3009
Tedla, Y., & Yamamoto, K. (2018). Morphological segmentation with LSTM neural networks for Tigrinya. International Journal on Natural Language Computing. https://doi.org/10.5121/ijnlc.2018.7203
Tesfaye, D. (2011). A rule-based Afan Oromo grammar checker. International Journal of Advanced Computer Science and Applications, 2(8). https://doi.org/10.14569/ijacsa.2011.020823
Tukur, A., Umar, K., & Muhammad, A. S. (2019). Tagging part of speech in Hausa sentences. 2019 15th International Conference on Electronics, Computer and Computation, ICECCO 2019. https://doi.org/10.1109/ICECCO48375.2019.9043198
Tukur, A., Umar, K., & Sa, A. (2020). Parts-of-speech tagging of Hausa-based texts using Hidden Markov Model. 6(2), 303–313.
Wang, Fam, R., Bao, F., Lepage, Y., & Gao, G. (2019). Neural morphological segmentation model for Mongolian. Proceedings of the International Joint Conference on Neural Networks. https://doi.org/10.1109/IJCNN.2019.8852050
Xu, H., Marcus, M., Yang, C., & Ungar, L. (2018). Unsupervised morphology learning with statistical paradigms. Proceedings of COLING 2018, the 27th International Conference on Computational Linguistics.
Xue, N. (2003). Chinese word segmentation as character tagging. Computational Linguistics.
Yambao, & Cheng, C. (2020). Feedforward approach to sequential morphological analysis in the Tagalog language. 2020 International Conference on Asian Language Processing, IALP 2020. https://doi.org/10.1109/IALP51396.2020.9310516
Yang, Y., Li, S., Zhang, Y., & Zhang, H. P. (2019). Point the point: Uyghur morphological segmentation using PointerNetwork with GRU. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). https://doi.org/10.1007/978-3-030-32381-3_30
Yildiz, E., & Tantug, A. C. (2019). Morpheus: A neural network for jointly learning contextual lemmatization and morphological tagging. https://doi.org/10.18653/v1/w19-4205
Zhai, F., Potdar, S., Xiang, B., & Zhou, B. (2017). Neural models for sequence chunking. 31st AAAI Conference on Artificial Intelligence, AAAI 2017.