Challenges in Arabic NLP
All content following this page was uploaded by Khaled Shaalan on 26 October 2018.
Chapter 3
Khaled Shaalan1, Sanjeera Siddiqui2, Manar Alkhatib3 and Azza Abdel Monem4
Faculty of Engineering & IT, The British University in Dubai123,
Block 11, Dubai International Academic City,
P.O. Box 345015, Dubai, UAE
School of Informatics, University of Edinburgh1, UK
Faculty of Computer and Information Sciences, Ain Shams University4,
Abbassia, 11566 Cairo, Egypt
[email protected], [email protected],
[email protected] and [email protected]
1. Introduction
2. Challenges
a The top alveolar ridge is located on the roof of the mouth between the
upper teeth and the hard palate.
62 Khaled Shaalan et al.
and a rich part system. Arabic makes heavy use of inflection because of
its affixes, which include prepositions and pronouns. Arabic morphology
is complex because there are about 10,000 roots that are the basis for
nouns and verbs27. There are 120 patterns in Arabic morphology.
Ref. 28 highlighted the importance of 5000 roots for Arabic morphology.
Word order in Arabic is variable. A writer is free to choose the word to
emphasize and place it at the head of the sentence. Generally, the
syntactic analyzer parses the input tokens produced by the lexical
analyzer and tries to identify the sentence structure using Arabic grammar
rules. The relatively free word order in an Arabic sentence causes
syntactic ambiguities, which require investigating all the possible
grammar rules as well as the agreement between constituents13,24.
In this paper, we discuss the challenges of the Arabic language with
regard to its characteristics and their related computational problems at
the orthographic, morphological, and syntactic levels. In automating the
analysis of Arabic sentences, these levels overlap, as they all help in
making sense of words and in disambiguating the sentence.
Table 1. The Hamza diacritic is determined by its own diacritics and the preceding letter.
Verb | Transliteration | Sentence | Change applied to the present form of the verb
دعا | Da-aa | لم يدعُ | Omit the last long vowel “و” and add the present tense letter “ي”
سعى | Sa-aa | لم يسعَ | Omit the last long vowel “ى” and add the present tense letter “ي”
صلى | Sala | لم يصلِ | Omit the last long vowel “ي” and add the present tense letter “ي”
زار | Zara | لم يزرْ | Omit the middle long vowel “ا” and add the present tense letter “ي”
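
The changes listed in Table 1 can be read as a small rule set. The
following is a minimal Python sketch of those rules, not the authors'
implementation. The function name jussive_after_lam and the set of weak
letters are assumptions made for illustration. The sketch works on the
undiacritized past-tense forms in the Verb column and ignores the final
short vowel shown in the Sentence column.

```python
# Letters treated as weak (long vowels) at the end of a past-tense verb.
# This set is an assumption for the sketch, not a complete linguistic rule.
WEAK_FINAL = {"ا", "ى", "ي", "و"}


def jussive_after_lam(past):
    """Return the undiacritized verb form used after the particle lam."""
    if past[-1] in WEAK_FINAL:
        # Verb ends in a long vowel: omit the last long vowel.
        stem = past[:-1]
    elif len(past) == 3 and past[1] == "ا":
        # Three-letter verb with a middle alif: omit the middle long vowel.
        stem = past[0] + past[2]
    else:
        # No weak letter found: keep the letters unchanged.
        stem = past
    # Add the present tense letter and the negation particle.
    return "لم " + "ي" + stem


for verb in ["دعا", "سعى", "صلى", "زار"]:
    print(verb, "->", jussive_after_lam(verb))
```

The four outputs match the Sentence column of Table 1 once its short
vowel diacritics are removed.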
Remedies to resolve this type of ambiguity might not necessarily fix all
problems33,34. For example, consider the sentence “رأيت أمل” (I saw
hope/Amal), which can have either meaning.
2.1.4. Vowels
In written Arabic, there are two types of vowels: diacritical symbols and
long vowels. Arabic text is predominantly written without diacritics,
which leads to major linguistic ambiguities in most cases, as an Arabic
word has different meanings depending on how it is diacritized. A
diacritic sign (Tashkeel or Harakat) is not an orthographic letter. It is
a mark placed above or below a consonant to give it a sound. Ref. 35
presented a good survey of recent works in the area of automatic
diacritization. There are three groups of diacritics32,36. The first group
consists of the short vowel diacritics: Fatha (◌َ), Damma (◌ُ), and
Kasra (◌ِ). The second group represents the doubled case ending diacritics
(Nunation or Tanween): Tanween Fatha (◌ً), Tanween Kasra (◌ٍ), and
Tanween Damma (◌ٌ). These are vowels occurring at the end of nominal
words (nouns, adjectives and adverbs) indicating nominal indefiniteness.
The third group is composed of Shadda (◌ّ) and Sukuun (◌ْ).
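
The effect of dropping diacritics can be illustrated with a short Python
sketch. It groups the eight marks named above by their Unicode code points
and shows how differently diacritized words collapse into one written
form. The three example words (knowledge, flag, he taught) are
illustrative choices and do not come from the chapter, and the function
names are hypothetical.

```python
# Unicode code points for the three diacritic groups described above.
SHORT_VOWELS = {"\u064E": "Fatha", "\u064F": "Damma", "\u0650": "Kasra"}
TANWEEN = {
    "\u064B": "Tanween Fatha",
    "\u064C": "Tanween Damma",
    "\u064D": "Tanween Kasra",
}
SHADDA_SUKUUN = {"\u0651": "Shadda", "\u0652": "Sukuun"}
ALL_DIACRITICS = {**SHORT_VOWELS, **TANWEEN, **SHADDA_SUKUUN}


def strip_diacritics(word):
    """Remove the eight diacritics, leaving only the letters."""
    return "".join(ch for ch in word if ch not in ALL_DIACRITICS)


def describe_diacritics(word):
    """List the names of the diacritics in the order they appear."""
    return [ALL_DIACRITICS[ch] for ch in word if ch in ALL_DIACRITICS]


# Illustrative words (not from the chapter), written with escapes so the
# combining marks are explicit.
ILM = "\u0639\u0650\u0644\u0652\u0645"                 # knowledge
ALAM = "\u0639\u064E\u0644\u064E\u0645"                # flag
ALLAMA = "\u0639\u064E\u0644\u0651\u064E\u0645\u064E"  # he taught

bare_forms = {strip_diacritics(w) for w in (ILM, ALAM, ALLAMA)}
print(bare_forms)                # a single undiacritized form
print(describe_diacritics(ILM))  # ['Kasra', 'Sukuun']
```

Because all three words share one bare form, a system reading
undiacritized text has to recover the intended reading from context.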
“وسيحضرونها” (and they will bring it, wasayahdurunaha). This word can be
written in this form:
In this example, the lemma “حضر” (hadr) accepts three prefixes: “و” (wa),
“س” (sa), and “ي” (ya), and two suffixes: “ون” (waw-nun) and “ها” (ha).
Thus, because of the complexity of Arabic morphology, building an Arabic
NLP system is a challenging task.
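
The decomposition in this example can be sketched as greedy affix peeling.
This is a minimal illustration in Python, assuming only the three prefixes
and two suffixes named above. The function name segment and the minimum
stem length of three letters are assumptions, and a real morphological
analyzer needs a far larger affix inventory and a lexicon.

```python
# Affixes taken from the example above, in the order they attach.
PREFIXES = ["و", "س", "ي"]  # conjunction, future marker, present tense letter
SUFFIXES = ["ها", "ون"]     # outermost suffix first
MIN_STEM = 3                # assumption: never cut below three letters


def segment(word):
    """Peel known prefixes and suffixes off a word, outermost first."""
    prefixes, suffixes = [], []
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= MIN_STEM:
            prefixes.append(p)
            word = word[len(p):]
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= MIN_STEM:
            suffixes.insert(0, s)  # keep suffixes in reading order
            word = word[:-len(s)]
    return prefixes, word, suffixes


prefixes, stem, suffixes = segment("وسيحضرونها")
print(prefixes, stem, suffixes)
```

For the example word the sketch returns the three prefixes, the stem
“حضر”, and the two suffixes.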
An early step in analyzing Arabic text is tokenization: identifying the
words in the input sentence by their types and properties and outputting
them as tokens. Segmentation can go wrong when word fragments that belong
to the lemma of a word are mistaken for part of its prefix or suffix and
are therefore separated from the rest of the word during tokenization.
This problem arises in Named Entity Recognition, where the ending
character n-grams of a named entity are mistaken for object or
personal/possessive anaphora and are separated by tokenization19.
Moreover, the POS tagger used for the training and test data may have
produced some incorrect tags, increasing the noise even further.
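
The segmentation error described here can be reproduced with a naive
suffix splitter. The sketch below is illustrative only. The suffix table,
the example name “مالك” (Malik), and the names list used as a guard are
assumptions chosen for the demonstration, not resources from the chapter.

```python
# A few possessive pronoun suffixes with rough English glosses.
POSSESSIVE_SUFFIXES = {"ك": "your", "ه": "his", "ها": "her", "ي": "my"}


def naive_split(token):
    """Split off a trailing pronoun suffix, trying longer suffixes first."""
    for suffix in sorted(POSSESSIVE_SUFFIXES, key=len, reverse=True):
        if token.endswith(suffix) and len(token) > len(suffix) + 1:
            return token[:-len(suffix)], suffix
    return token, ""


def guarded_split(token, known_names):
    """Skip splitting when the token is in a list of known names."""
    if token in known_names:
        return token, ""
    return naive_split(token)


# The name Malik is wrongly read as "مال" (money) plus "ك" (your).
print(naive_split("مالك"))
# A hypothetical names list prevents the split.
print(guarded_split("مالك", {"مالك"}))
```

A names list only covers names that are known in advance, so this kind of
guard reduces the noise rather than removing it.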
Another morphological challenge, highlighted by Ref. 46, concerns
relationships between words. The syntactic relationship that a word has
with other words in the sentence shows itself in its inflectional endings
and not in its position relative to those words. For example, in
“المعلم المخلص يحترمه طلابه” (Al Mu’alim al-mukhlis yahtarimaho
Tulabaho, the faithful teacher is respected by his students), the suffix
pronoun “ـه” (Heh) in the two words “يحترمه” (yahtarima-ho,
respected-him) and “طلابه” (Tulaba-ho, students-his) refers to the word
“المعلم” (Al Mu’alim, teacher-the).
Generally, Arabic computational morphology is challenging because the
morphological structure of Arabic also comprises a predominant system of
clitics. These are morphemes that are grammatically independent, but
morphologically dependent on another word or phrase47. Consequently, one
can naturally conclude that this proportion is higher for Arabic data than
for other languages with less complex morphology, because the same word
can be joined to various affixes and clitics and thus the vocabulary is
much larger. The following Arabic words: “مكتوب” (Maktoob, Written),
“كتابات” (Kitabat, Writings), “كاتب” (Katib, Writer), “كتاب” (Kitab,
Book), “كتب” (Kutob, Books), “مكتب” (Maktab, Office), “مكتبة” (Maktabah,
Library), and “كتابة” (Kitabah, Writing) are derived from the same
trilateral root of three consonants, with the origin verb “كتب” (Ktb,
Wrote). They also refer to the same concept. To extract the stem from
words, there are two types of stemming. The first type is light stemming,
which removes affixes (prefixes, infixes, and suffixes) formed by
combinations of the letters of the word “سألتمونيها” (sa'altamuniha). The
second type is heavy stemming (i.e. root stemming), which extracts the
root of the word and implicitly includes light stemming48,49.
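
The contrast between the two types can be sketched in Python. The light
stemmer below strips a tiny, assumed affix list, while the root check only
tests whether the three root consonants occur in order inside a word. It
is a subsequence test used to show the shared root, not a root stemmer,
and both function names are hypothetical.

```python
WORDS = ["مكتوب", "كتابات", "كاتب", "كتاب", "كتب", "مكتب", "مكتبة", "كتابة"]
ROOT = "كتب"

# Tiny illustrative affix lists; real light stemmers use larger ones.
LIGHT_PREFIXES = ("ال", "و")
LIGHT_SUFFIXES = ("ات", "ة")


def light_stem(word):
    """Strip at most one listed prefix and one listed suffix."""
    for p in LIGHT_PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in LIGHT_SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word


def contains_root(word, root):
    """True if the root consonants appear in the word in the same order."""
    letters = iter(word)
    return all(ch in letters for ch in root)


for w in WORDS:
    print(w, light_stem(w), contains_root(w, ROOT))
```

Light stemming leaves several distinct stems, such as “كتاب” and “مكتب”,
whereas every word in the list contains the root consonants in order. This
is why root stemming conflates more word forms than light stemming.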
2.2.3. Annexation
Another morphological challenge in Arabic is that a word can be formed by
joining two words into one. The joined words can be nouns, verbs, or
particles. Although this is not common in traditional Arabic, it is used
in Modern Standard Arabic. Usually, the compound word is semantically
transparent: its meaning is compositional in the sense that the meaning of
the whole is equal to the meaning of the parts put together50. For
example, the word “رأسمالية” (capitalism, rasimalia) comes from the
compound of two nouns “رأس المال” (capital, ras almal); the word “مادام”
(as long as, madam) comes from the compound of a particle “ما” (ma) and a
verb “دام” (dam); and the word “كيفما” (however) comes from the compound
of two particles “كيف” (kayf) and “ما” (ma). The meaning of a compound
word is important for understanding Arabic text, which is a challenge for
POS tagging and for applications that require semantic processing51.
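
One simple way to expose the parts of such compounds is a lexicon lookup
over every split point. The sketch below is a hedged illustration: the
three-entry lexicon and the function name split_compound are assumptions,
and a practical system would need a full lexicon plus morphological
analysis.

```python
# Hypothetical mini-lexicon holding the parts named in the examples above.
LEXICON = {"ما", "دام", "كيف"}


def split_compound(word, lexicon=LEXICON):
    """Return every two-way split whose halves are both in the lexicon."""
    return [
        (word[:i], word[i:])
        for i in range(1, len(word))
        if word[:i] in lexicon and word[i:] in lexicon
    ]


print(split_compound("مادام"))
print(split_compound("كيفما"))
print(split_compound("رأسمالية"))
```

The first two words split into their parts. The third returns an empty
list because the derived form fuses and alters its parts, so plain string
splitting is not enough.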
that are unable to capture the effects of inflectional variation. Thus,
they can cause problems in Machine Translation, Information Retrieval,
and Text Summarization, among other NLP applications. Such expressions are
termed idiomatic multiword expressions. Other multiword expressions are
words that co-occur more often than not, but with transparent
compositional semantics, such as “رئيس الدولة” (The president of the
country, rayiys alddawla). As such, they do not pose a challenge in NLP
applications. Such expressions could be of interest if we categorize them
into types, as in Named Entity Recognition, i.e. as contextual cues.
Ambiguous Anaphora
Pronominal anaphora is very widely used in Arabic. The pronoun has an
empty semantic structure and no meaning independent of its antecedent,
the main subject. This pronoun could be a third-person pronoun, called
“ضمير الغائب” (damir alghayib) in Arabic, such as “ها” /hA/
(her/hers/it/its), “ه” /h/ (him/his/it/its), “هم” /hm/ (masculine:
them/their), and “هن” /hn/ (feminine: them/their).
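
A common first step in resolving such pronouns is to keep only the
candidate antecedents that agree with the pronoun. The sketch below
assumes hand-annotated gender and number for each candidate noun. The
feature table follows the glosses above, while the candidate lists and the
function name are hypothetical. It also ignores the "it/its" readings, so
it is a simplification.

```python
# Gender and number for the four third-person pronouns listed above.
PRONOUN_FEATURES = {
    "ها": ("feminine", "singular"),
    "ه": ("masculine", "singular"),
    "هم": ("masculine", "plural"),
    "هن": ("feminine", "plural"),
}


def candidate_antecedents(pronoun, nouns):
    """Keep the nouns whose (gender, number) match the pronoun.

    nouns is a list of (word, gender, number) tuples annotated by hand.
    """
    wanted = PRONOUN_FEATURES[pronoun]
    return [word for word, gender, number in nouns if (gender, number) == wanted]


clear = [("المعلم", "masculine", "singular"), ("الطلاب", "masculine", "plural")]
ambiguous = [("المعلم", "masculine", "singular"), ("المدير", "masculine", "singular")]

print(candidate_antecedents("ه", clear))      # one candidate remains
print(candidate_antecedents("ه", ambiguous))  # two candidates remain
```

When more than one candidate survives the filter, agreement alone cannot
pick the antecedent, which is the ambiguity this section refers to.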
Challenges in Arabic Natural Language Processing 75
Hidden Anaphora
Another major kind of anaphora is hidden anaphora. It is restricted to the
subject position when there is no overt noun or pronoun acting as the
subject. This is evident in the following sentence:
“الملاحظة على اللوح، معقدة” (The note on the board, complex), where the
pronoun “هي” is not present in the sentence, i.e.
“الملاحظة على اللوح، هي معقدة”. This is called “zero anaphora”. The human
mind can determine the hidden anaphora (antecedent), but it causes
grammatical mistakes in automated NLP systems.
2.3.4. Agreement
Agreement is a major syntactic principle that affects the analysis and
generation of an Arabic sentence, and it is very significant for difficult
NLP applications such as Machine Translation and Question Answering13,47.
Agreement in Arabic is full or partial and is sensitive to word order
effects1. An adjective in Arabic usually follows the noun it modifies,
“الموصوف” (almawsuf), and fully agrees with it in number, gender, case,
and definiteness, e.g. “الولد المجتهد” (The diligent boy, alwalad
almujtahid) and “الأولاد المجتهدون” (The diligent boys, al'awlad
almujtahidun). The verb is marked for agreement depending on the word
order of the subject relative to the verb, see Figure 1.
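
Full noun-adjective agreement can be expressed as a comparison of feature
bundles. The following Python sketch assumes the four features are already
available from a morphological analyzer. The Features class, the
mismatches function, and the hand-assigned feature values are all
illustrative assumptions.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Features:
    number: str     # "singular", "dual", or "plural"
    gender: str     # "masculine" or "feminine"
    case: str       # "nominative", "accusative", or "genitive"
    definite: bool  # True when the word is definite


def mismatches(noun, adjective):
    """Return the names of the features on which the two words disagree."""
    return [
        f.name
        for f in fields(Features)
        if getattr(noun, f.name) != getattr(adjective, f.name)
    ]


# Hand-assigned features for the words in the example above.
boy = Features("singular", "masculine", "nominative", True)
boys = Features("plural", "masculine", "nominative", True)
diligent_sg = Features("singular", "masculine", "nominative", True)
diligent_pl = Features("plural", "masculine", "nominative", True)

print(mismatches(boy, diligent_sg))   # []
print(mismatches(boys, diligent_pl))  # []
print(mismatches(boys, diligent_sg))  # ['number']
```

A grammar checker or a generation component could use the returned list to
report, or repair, the specific feature that breaks agreement.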
3. Conclusion
one mapping between the letters in the language and the sounds with
which they are associated. An Arabic word does not dedicate letters to
represent short vowels. Letters change form depending on their place in
the word, and there is no notion of capitalization. In MSA texts, short
vowels are optional, which makes it even more difficult for non-native
speakers of Arabic to learn the language and presents challenges for
analyzing Arabic words. Morphologically, the word structure is both rich
and compact, such that a single word can represent a phrase or a complete
sentence. Syntactically, the Arabic sentence is long with complex syntax.
Arabic anaphora increases the ambiguity of the language, as in some cases
a Machine Translation system fails to identify the correct antecedent
because the antecedent is ambiguous. External knowledge is needed to
identify the correct antecedent. Moreover, Arabic sentence constituents
(free word order) can be swapped without affecting structure or meaning,
which adds more syntactic and semantic ambiguity and requires more complex
analysis. In addition, agreement in Arabic is full or partial and is
sensitive to word order effects.
The Arabic language differs from other languages because of its complex
and ambiguous structure, which the computational system has to deal with
at each linguistic level.
References
Language Resources and Tools, NEMLAR, 22nd–23rd Sept., Egypt, pp. 118–122
(2004).
35. Azmi and R. Almajed, A survey of automatic Arabic diacritization techniques, Natural
Language Engineering, Cambridge University Press, UK, 21(3):477–495 (2015).
36. S. Abu-Rabia, The Role of Vowels in Reading Semitic Scripts: Data from Arabic and
Hebrew, Reading and Writing: An Interdisciplinary Journal, 14, 39–59 (2001). DOI:
10.1023/A:1008147606320.
37. Farghaly, Three Level Morphology for Arabic, presented at the Arabic Morphology
Workshop, Linguistics Summer Institute, Stanford, CA, (1987).
38. T. McCarthy, The critical theory of Jurgen Habermas, Studies in Soviet Thought,
Springer, Berlin Heidelberg, 23(1):77–79 (1982).
39. Soudi, G. Neumann and A. Bosch, Arabic computational morphology: knowledge-
based and empirical methods, vol. 38, Springer, Dordrecht (2007).
40. Shoukry and A. Rafea, Sentence-level Arabic sentiment analysis, 2012 International
Conference on Collaboration Technologies and Systems (CTS), Denver, CO, USA,
2012, pp. 546–550 (2012). DOI: 10.1109/CTS.2012.6261103.
41. S. S. Al-Fedaghi and F. Al-Anzi, A New Algorithm to Generate Arabic Root-Pattern
forms, In Proceedings of the 11th national Computer Conference and Exhibition, pp.
391–400 (1989).
42. N. De Roeck and W. Al-Fares, A morphologically sensitive clustering algorithm for
identifying Arabic roots, In Proceedings of the 38th Annual Meeting on Association
for Computational Linguistics, Association for Computational Linguistics, pp. 199–206
(2000).
43. S. Mesfar, Towards a cascade of morpho-syntactic tools for Arabic natural language
processing, In Computational Linguistics and Intelligent Text Processing, Springer
Berlin Heidelberg, pp. 150–162 (2010).
44. Y. Benajiba, M. Diab and P. Rosso, Arabic named entity recognition using optimized
feature sets, In Proceedings of the Conference on Empirical Methods in Natural
Language Processing, Association for Computational Linguistics, pp. 284–293 (2008).
45. Y. Benajiba, P. Rosso and M. J. Bened, ANERsys: An Arabic Named Entity
Recognition system based on Maximum Entropy, In Proc. of CICLing-2007, Springer-
Verlag, LNCS series (4394), pp. 143–153 (2007).
46. K. Thakur, Genitive Construction in Hindi. M. Phil Thesis, University of Delhi, India
(1997).
47. K. Shaalan, Arabic GramCheck: A Grammar Checker for Arabic, Software Practice
and Experience, John Wiley & Sons Ltd., UK, 35(7):643–665 (2005).
48. M. N. Al-Kabi, S. Kazakzeh, B. Abu Atab, S. Al-Rababah and S. Alsmadi, A Novel
Root based Arabic Stemmer, Journal of King Saud University, Computer and
Information Sciences, 27(2):94–103 (2015). DOI: 10.1016/j.jksuci.2014.04.001
49. H. K. AlAmeed, S. O. AlKitbi, A. A. AlKaabi, K. S. AlShebli, N. F. AlShamsi, N. H.
AlNuaimi, and S. S. AlMuhairi, Arabic Light Stemmer: A new enhanced approach, In
Proceedings of the Second International Conference on Innovations in Information
Technology (IIT'05), Dubai, UAE (2005).
50. W. M. Amer, Compounding in English and Arabic: A contrastive study,
Technical Report (2010), available online at:
http://site.iugaza.edu.ps/wamer/files/2010/02/Compounding-in-English-and-Arabic.pdf
51. S. Elkateb, W. Black, P. Vossen, D. Farwell, H. Rodríguez, A. Pease and M. Alkhalifa,
Arabic WordNet and the challenges of Arabic, In Proceedings of Arabic NLP/MT
Conference, London, UK (2006).
52. K. Shaalan, An Intelligent Computer Assisted Language Learning System for Arabic
Learners. Computer Assisted Language Learning: An International Journal, Taylor &
Francis Group Ltd., 18(1 & 2):81–108 (2005).
53. Hammo, A. Moubaiddin, N. Obeid, and A. Tuffaha, Formal Description of Arabic
Syntactic Structure in the Framework of the Government and Binding Theory,
Computacion y Sistemas, 18(3):611–625 (2014).
54. S. Hammami, L. Belguith and A. Hamadou, Arabic Anaphora Resolution: Corpora
Annotation with Co-referential Links, The International Arab Journal of Information
Technology, 6(5):481–489 (2009).
55. R. Al-Sabbagh and K. Elghamry, Arabic Anaphora Resolution: A Distributional,
Monolingual and Bilingual Approach, Faculty of Al-Alsun, Ain Shams University,
Cairo, Egypt (2002).
56. S. Usama, On issues of Arabic syntax: An essay in syntactic argumentation, Brill’s
Annual of Afroasiatic Languages and Linguistics, pp. 236–280 (2011).
57. M. Shquier and T. Sembok, Word agreement and ordering in English-Arabic machine
translation, 2008 International Symposium on Information Technology, IEEE Explore,
Kuala Lumpur, pp. 1–10 (2008). DOI: 10.1109/ITSIM.2008.4631625.