An Algerian Dialect Study and Resources
Salima Harrat∗ , Karima Meftouh† , Mourad Abbas‡ , Khaled-Walid Hidouci§ and Kamel Smaili¶
∗ Ecole Supérieure d’Informatique (ESI), Algiers, Algeria
† Badji Mokhtar University, Annaba, Algeria
‡ CRSTDLA Centre de Recherche Scientifique et Technique
Abstract—Arabic is the official language of all Arab countries; it is used for official speech, newspapers, public administration and school. In parallel, for everyday communication, non-official talks, songs and movies, Arab people use their dialects, which are inspired by Standard Arabic and differ from one Arab country to another. This linguistic phenomenon is called diglossia, a situation in which two distinct varieties of a language are spoken within the same speech community. It is observed throughout all Arab countries: Standard Arabic is widely written but not used in everyday conversation, whereas dialect is widely spoken in everyday life but almost never written. Thus, in the NLP area, a lot of work has been dedicated to written Arabic. In contrast, Arabic dialects were until recently not studied enough; interest in them is recent. The first work on these dialects began in the last decade for Middle-East ones. Dialects of the Maghreb are just beginning to be studied. Compared to written Arabic, dialects are under-resourced languages which suffer from a lack of NLP resources despite their wide use. In this paper we deal with the Arabic Algerian dialect, a non-resourced language for which no known resource is available to date. We present a first linguistic study introducing its most important features and we describe the resources that we created from scratch for this dialect.

Keywords—Arabic dialect, Algerian dialect, Modern Standard Arabic, Grapheme to Phoneme Conversion, Morphological Analysis

I. INTRODUCTION

Under-resourced languages are languages which lack resources dedicated to natural language processing. In fact, these languages suffer from the unavailability of basic tools like corpora, mono- or multilingual dictionaries, morphological and syntactic analyzers, etc. This lack of resources makes working with these languages a great challenge, especially when we deal with unwritten languages like Arabic dialects. Compared to other under-resourced languages, Arabic dialects present the following additional difficulties:

• Since they are spoken languages, they are not written and there are no established rules to write them. The same word could have many orthographic forms which are all acceptable, since there are no writing rules to serve as a reference.

• The flexibility at the grammatical and lexical levels despite their belonging to the Arabic language.

• Besides the fact that these dialects are different from Arabic, they are also different from each other. For instance, dialects of the Maghreb differ from those of the Middle East. They may also differ inside the same country.

• These dialects are also widely influenced by other languages such as French, English, Spanish, Turkish and Berber.

In Algeria, as in all Arab countries, these dialects are used in everyday conversations. However, with the advent of the internet they are increasingly used in social networks and forums. They emerge on the web as a real communication language because of the ease of communicating in dialect, especially for people with a low level of education. Unfortunately, basic NLP tools for these dialects are not available.

This work is a first part of the project TORJMAN1, which is a Speech-To-Speech Translator between Algerian Arabic dialects and MSA. Unlike Middle-East Arabic dialects, Algerian Arabic dialects are non-resourced languages; they lack all kinds of NLP resources. Consequently, TORJMAN begins from scratch.

1 TORJMAN is a national research project which is totally financed by the Algerian research ministry; this appellation means translator or interpreter in English.

In this paper, we describe and extend the resource creation tasks for the Arabic dialect of Algeria that appeared in [1] and [2]. We focus on the Algiers dialect, which is the spoken Arabic of Algiers (capital city of Algeria) and its periphery. This choice is justified by the fact that this dialect is the one we know best and practice, since we are native speakers of it. For convenience of reference, we will refer to the Algiers dialect as ALG.

This paper is organized as follows: before dealing with the Algerian dialect, we give in Section II a brief overview of the Arabic language, whereas in Section III we present different aspects of ALG. The following sections are dedicated to the resources that we created: we detail how we made the first corpus of the Algiers dialect (Section IV). Then we present the ALG grapheme-phoneme converter (Section V), which has allowed us to get a phonetized corpus of the Algiers dialect. In Section VI we describe how we created a morphological analyzer for ALG by adapting BAMA [3], the well-known analyser for MSA. Finally, we conclude by summarizing the main ideas of this work and by giving our future directions.

II. ARABIC LANGUAGE

Arabic is a Semitic language used by around 420 million people. It is the official language of about 22 countries. Arabic is a generic term covering three separate groups:
384 | P a g e
www.ijacsa.thesai.org
(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 7, No. 3, 2016
• Classical Arabic: principally defined as the Arabic used in the Qur’an and in the earliest literature from the Arabian peninsula; it also forms the core of much literature until the present day.

• Modern Standard Arabic: generally referred to as MSA (Alfus’ha in Arabic), it is the variety of Arabic which was retained as the official language in all Arab countries, and as a common language. It is essentially a modern variant of Classical Arabic. Standard Arabic is not acquired as a mother tongue; rather, it is learned as a second language at school and through exposure to formal broadcast programs (such as the daily news), religious practice, and newspapers [4].

• Arabic dialects: also called colloquial Arabic or vernaculars, these are spoken varieties of the Arabic language. In contrast to Classical Arabic and MSA, they are not written. These dialects have mixed forms with many variations. They are influenced both by the ancient local tongues and by European languages such as French, Spanish, English, and Italian. Differences between these variants of spoken Arabic throughout the Arab world can be large enough to make them incomprehensible to one another. Hence, given the large differences between dialects, we can consider them as disparate languages depending on the geographical place in which they are practiced. Thus, most of the literature describes Arabic dialects from the viewpoint of an east-west dichotomy [5]:
    ◦ Middle-East dialects: include the spoken Arabic of the Arabian peninsula (Gulf countries and Yemen), Levantine dialects (Syrian, Lebanese, Palestinian and Jordanian), the Iraqi dialect, and the Egyptian and Sudanese dialects.
    ◦ Maghreb dialects: spoken mostly in Algeria, Tunisia, Morocco, Libya and Mauritania. Note that Maltese, a form of Arabic dialect, is spoken mostly in Malta.

[...] Arabic in various language levels: phonological differences between Classical Arabic and spoken Arabic are moderate (compared to other language-dialect pairs), whereas grammatical differences are the most striking ones. At the lexical level, differences are marked by variations in form and by differences of use and meaning.

Indeed, at the phonological level, ALG (naturally) shares most features related to Arabic. In addition to the 28 consonant phonemes of Arabic (given in Table I), the ALG consonantal system includes non-Arabic phonemes like /g/, as in the word ڨاع (all), and the phonemes /p/ and /v/, used mainly in words borrowed from French, as in the case of پومپة (adapted from the French word "pompe", which means a pump) and ڤليزة (adapted from the French word "valise", which means a bag). Also, it should be noted that the use of the phonemes (ظ) and (ذ) is very rare: most of the time ظ is pronounced /d‘/ (ض) and ذ is pronounced /d/ (د). The same is observed for /T/ (ث), which is pronounced /t/ (ت). Note that the last two substitutions are also observed for the Jordanian dialect [9].

TABLE I: Arabic phonemes using SAMPA

Letter  Phoneme    Letter  Phoneme    Letter  Phoneme
أ       /?/        ز       /z/        ق       /q/
ب       /b/        س       /s/        ك       /k/
ت       /t/        ش       /S/        ل       /l/
ث       /T/        ص       /s‘/       م       /m/
ج       /Z/        ض       /d‘/       ن       /n/
ح       /x/        ط       /t‘/       ه       /h/
خ       /X/        ظ       /D‘/       و       /w/
د       /d/        ع       /?‘/       ي       /j/
ذ       /D/        غ       /G/
ر       /r/        ف       /f/
ـَ       /a/        ـِ       /i/        ـُ       /u/
ا       /a:/       ي       /i:/       و       /u:/
[...] and affixes) but it must be marked by the ante-position of a particle or an expression that indicates the future, like أومبعد (later), غدوا (tomorrow), next month, etc.

TABLE VII: The verb لعب conjugation in the present tense.

Person       Pronoun    ALG      MSA       English
1st Person   أنا        نلعب     ألعب      I play
             حنا        نلعبو    نلعب      We play
2nd Person   انت        تلعبي    تلعبين    You play
             انت        تلعب     تلعب      You play
             انتوما     تلعبو    تلعبون    You play
3rd Person   هي         تلعب     تلعب      She plays
             هو         يلعب     يلعب      He plays
             هوما       يلعبو    يلعبون    They play

• The imperative: It expresses commands or requests, and is used only for the second person. It is generally realised by adding the prefix ا and the suffixes ي and و to the verb.

TABLE VIII: The verb خرج conjugation in the present tense.

Pronoun    ALG      MSA       English
انت        أخرجي    أخرجي     Get out (you, singular, feminine)
انت        أخرج     أخرج      Get out (you, singular, masculine)
انتوما     أخرجو    أخرجوا    Get out (you, plural, feminine & masculine)

7 See next section III-B2

2) Declension: Singular word declension in written Arabic corresponds to three cases: the nominative, the genitive, and the accusative, which take the short vowels ـُ , ـِ and [...]

[...] (for both the accusative and genitive) depending on the grammatical function of the word. For example, the masculine regular plural of the MSA word معلم (teacher) could be معلمون (nominative case) or معلمين (accusative or genitive). In contrast, the ALG word رايح (going), for instance, always takes رايحين for the regular plural whatever its grammatical category.

• Feminine regular plural: obtained by adding the suffix ات to the word without changing the structure of the word, as in MSA, but with a single difference in case endings. Indeed, in MSA, the feminine regular plural has the following case marks (اتُ or اتٌ for the nominative and اتِ or اتٍ for the accusative and genitive), whereas ALG has only one case mark, which is the Sukun السكون (absence of vowel, whose symbol is ـْ). For example, the plural of the MSA word جميلة is جميلاتٌ or جميلاتٍ (also جميلاتُ or جميلاتِ), and the plural of the ALG word شابة is always شابات (both the MSA and the ALG words mean beautiful).

• Broken plural: an irregular form of plural which modifies the structure of the singular word to get its plural. As in MSA, it has different rules depending on the word pattern. Like singular words, the MSA broken plural takes the three case endings; in ALG it does not.

In Table IX we give an example for each ALG plural category.

Another major difference between the Algiers dialect and written Arabic is the absence of the dual (a kind of plural which designates two items). Indeed, in MSA, for example, the dual of ولد (a boy) is ولدان (the word is post-fixed by ان for the nominative case or ين for both accusative and genitive). In ALG, the dual is generally obtained by the word زوج (two) followed by the plural
(feminine or masculine) of the noun or the adjective. For example, the dual of ولد is زوج ولاد (two boys).

[...]
Regular feminine    طبيبة      طبيبات           طبيبة               طبيباتٌ / طبيباتٍ     Doctor/Doctors
Case ending         No vowel   ات               ـُ ـٌ ـَ ـً ـِ ـٍ     اتُ / اتٌ ، اتِ / اتٍ
Irregular           طير        طيور             طير                 طيور                Bird/Birds
                    يوم        ايام / ايامات    يوم                 ايام                Day/Days
Case ending         No vowel   No vowel         ـُ ـٌ ـَ ـً ـِ ـٍ     ـُ ـٌ ـَ ـً ـِ ـٍ

C. Syntactic level

1) Declarative form: Word order in an ALG declarative sentence is relatively flexible. Indeed, in common usage ALG sentences could begin with the verb, the subject or even the object. This order is based on the importance given by the speaker to each of these entities; usually the sentence begins with the item that the speaker wishes to highlight. In Table X we give an example of different word orders for the same sentence. It should be noted that the first two forms (SVO, [...]

TABLE XI: Interrogative particles and pronouns in ALG and their equivalents in MSA.

ALG            MSA        English
شكون           من         Who
اما            أي         Which
وين            أين        Where
منين           من أين     From where
واشن / واش     ماذا       What
باش            بماذا      With what
فاش            في ماذا    In what
وقتاش          متى        When
وعلاش          لماذا      Why
كفاش           كيف        How
شحال           كم         How many

[...]
5    الولد مريض           الولد مريض           The boy is ill
     ماشي مريض الولد      ليس الولد بمريض      The boy is not ill
6    الولد مريض           الولد مريض           The boy is ill
     الولد ماشي مريض      الولد ليس مريضا      The boy is not ill
IV. CORPUS CREATION

As mentioned above, this work began from scratch. No resources of any kind were available for the Algiers dialect. The foundation stone of the work was a corpus that we created by transcribing conversations recorded from everyday life and also from some TV shows and movies. This transcription step required conventional writing rules to make the transcribed text homogeneous. Considering the fact that ALG is an Arabic dialect, we adopted the following writing policy: when writing a word in the Algiers dialect, we check whether there is an Arabic word close to this dialect word; if there is, we adopt the Arabic writing for the dialect word, otherwise the word is written as it is pronounced.

The transcription step produced a corpus of 6400 sentences that we afterwards translated to MSA. Thus, we got a parallel corpus of 6400 aligned sentences. In Table XIII, we give information about the size of this corpus.

TABLE XIII: Parallel corpus description.

Corpus    #Distinct words    #Words
ALG       8966               38707
MSA       9131               40906

It should be noted that all tasks described above were done by hand. It was time consuming, but the result was a clean parallel corpus. Furthermore, the ALG side of this corpus has been vocalized with our diacritizer described in [11] and used to develop the first NLP resources dedicated to an Algerian dialect (to our knowledge). The next sections of this paper describe these resources.

V. GRAPHEME-TO-PHONEME CONVERSION

As pointed out above, the general purpose of the project TORJMAN is a speech translation system between Modern Standard Arabic and the Algiers dialect. Such a system must include a Text-to-Speech module that requires a Grapheme-To-Phoneme converter. We therefore dedicated our efforts to developing this converter by using the ALG vocalized corpus described earlier.

Grapheme-to-Phoneme (G2P) conversion, or phonetic transcription, is the process which converts the written form of a word to its pronunciation form. Grapheme-phoneme conversion is not a simple matter, especially for non-transparent languages like English, where a phoneme may be represented by a letter or a group of letters and vice versa. Unlike English, Arabic is considered a transparent language: the relationship between grapheme and phoneme is one to one, but note that this feature is conditioned by the presence of diacritics. Lack of vocalization generates ambiguity at all levels (lexical, syntactic and semantic) and consequently at the phonetic level. For example, the phonetic transcription of the word كتب /ktb/ could be /kataba/, /kutiba/, /kutubun/, /kutubi/, /katbin/... The Algiers dialect obeys the same rule: without diacritics, grapheme-phoneme conversion is a difficult issue to resolve.

Most works on G2P conversion follow one of two approaches. The first is the dictionary-based approach, where a phonetized dictionary contains, for each word of the language, its correct pronunciation; G2P conversion is then reduced to a lookup in this dictionary. The second approach is rule-based [12], [13], [14]: the conversion is done by applying phonetic rules, which are deduced from phonological and phonetic studies of the considered language or learned on a phonetized corpus using a statistical approach based on significant quantities of data [15], [16]. For the Algiers dialect, which is a non-resourced language, a dictionary-based solution for a G2P converter is not feasible, since a phonetized dictionary with a large amount of data is not available. The first intuitive approach (given the lack of resources) is a rule-based one, but the specificity of the Algiers dialect (which we detail in the next section) led us to a statistical approach in order to consider all features related to this language.

A. Issues of G2P conversion for Algiers dialect

Algiers dialect G2P conversion obeys the same rules as MSA. Indeed, ALG could be considered a transparent language, since the alignment between grapheme and phoneme is one to one when the input text is vocalized. Unfortunately, it is not as simple as this, since ALG contains many words borrowed from foreign languages, most of which have been altered phonologically and adapted to it. In particular, the vocabulary of this dialect contains many French words used in everyday conversation. French borrowed words could be divided into two categories: the first includes French words that have been phonologically altered, such as the word فاميلية (famille in French, family), and the second includes words which are uttered as in French, like the word سور (sûr in French, sure), whose utterance is /syö/ (/y/ is not an Arabic phoneme but a French phoneme). This last category constitutes a serious problem for G2P conversion, since these words do not obey Arabic pronunciation rules.

[...] difference that in MSA the ا is pronounced if the definite article is at the beginning of the sentence.
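The dictionary-based approach mentioned in Section V reduces G2P conversion to a lookup, and the كتب example shows why unvocalized input is ambiguous. The sketch below is only an illustration of that idea: `PRON_DICT` is a hypothetical toy dictionary built from the pronunciations quoted in the text, and `g2p_lookup` and `is_ambiguous` are assumed helper names, not part of any resource described in the paper.

```python
from typing import Dict, List

# Written forms used in this toy example.
VOCALIZED = "كَتَبَ"    # fully diacritized form
UNVOCALIZED = "كتب"   # same consonant skeleton without diacritics

# Hypothetical toy pronunciation dictionary (illustration only).
# Each written form maps to every pronunciation it may stand for.
PRON_DICT: Dict[str, List[str]] = {
    VOCALIZED: ["kataba"],
    UNVOCALIZED: ["kataba", "kutiba", "kutubun", "kutubi", "katbin"],
}


def g2p_lookup(word: str) -> List[str]:
    """Return all pronunciations listed for `word` (empty list if unknown)."""
    return PRON_DICT.get(word, [])


def is_ambiguous(word: str) -> bool:
    """A written form is ambiguous when it has more than one pronunciation."""
    return len(g2p_lookup(word)) > 1
```

In this toy setting, `is_ambiguous(UNVOCALIZED)` is true while `is_ambiguous(VOCALIZED)` is false, which mirrors the point that the one-to-one grapheme-phoneme relationship only holds for vocalized text. An unknown word returns an empty list, which is the practical limit of a dictionary-based converter for a language without a large phonetized dictionary.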
TABLE XIV: Example of French words used in ALG.

• When the definite article ال is followed by [...]

[...] The ه is not pronounced in the Algiers dialect when it is preceded by ـُ.
Example: كتابه (his book) ⇒ /kta:bu/

11) Words containing the sequence ن ب
When a ن is followed by a ب, the ن is pronounced as /m/.
Example: منبر (a foretop) ⇒ /mambar/

12) Gemination rule
When the Shadda appears on a consonant, this consonant is doubled (geminated).
Example: سكر (sugar) ⇒ /sukkur/

It should be noted that most of these rules could be applied to other Algerian dialects and to Arabic dialects close to them, such as Tunisian and Moroccan.

Experiment: As indicated above, for the experiment we used our ALG vocalized corpus, which includes three categories of words:

1) Arabic words.
2) French words that are phonologically altered and whose pronunciation is realized with Arabic phonemes.
3) French words whose pronunciation is realized with French phonemes.

We applied the phonetization rules seen above on the ALG corpus. In addition to Arabic words, French words of the second category are correctly phonetized because their phonetic realization is close to the Algiers dialect. For example, the word كوزينة (kitchen, original French word cuisine), which is a phonologically altered French borrowing, is correctly converted to /ku:zina/, while a word in the third category such as كونكسيون (connection, original French word connexion) is incorrectly converted to /ku:niksju:n/, since it is realized as /kOnnEksjÕ/ with French phonemes. Considering these words, system accuracy is 92%. The issue with these words is that we cannot introduce rules for French words written in Arabic script, since the relation between Arabic graphemes and French phonemes is not one to one. For example, the grapheme و in a French word written in Arabic script could correspond to the French phonemes /y/, /u/, /O/ or /O/ (see some examples in Table XV).

TABLE XV: Examples of mappings between Arabic grapheme و and French phonemes.

[...] representation (a set of phonemes). This system uses the Moses package [17], Giza++ [18] for alignment and SRILM [19] for language model training. The main motivation for using a statistical approach is that we can include French phonemes in the training data. For building this system, the first component is a parallel corpus including a text and its phonetic representation. As this resource was not available, we created it by using the rule-based converter described above. We proceeded as follows: we used the rule-based system to convert Arabic words and phonologically altered French words (categories 1 and 2) to Arabic phonemes. For French words realized with French phonemes (category 3), we began by identifying them, transliterated them to their original form in Latin script, and then converted them to French phonemes (using a free French G2P converter); all these operations were done by hand. For example, the word كونكسيون is transliterated to connexion, then converted to /kOnnEksjÕ/.

This system operates at the grapheme and phoneme level: we split the parallel corpus into individual graphemes and phonemes, including a special character as word separator in order to restore the word after the conversion process (see Table XVI).

TABLE XVI: Examples of aligned graphemes and phonemes.

Grapheme    ن     ـَ     س     ـْ      ك     ـُ     ت     ـْ
Phoneme     /n/   /a/   /s/   Null   /k/   /u/   /t/   Null

Experiment: For evaluating the statistical approach, we split the parallel corpus into three datasets: training data (80%), tuning data (10%) and testing data (10%). First we tested the statistical approach on a corpus containing only Arabic words and phonologically altered French words (categories 1 and 2). We got an accuracy of 93%. Then we proceeded to a test on a corpus including the three word categories, and system accuracy decreased to 85%. This result is due to the increase in the number of hypotheses for each grapheme caused by introducing French phonemes in the training data. The grapheme و, for example, is phonetized in some Arabic words (category 1) as the French phonemes /y/ or /Õ/ instead of the Arabic long vowel /u:/, and as the phoneme /Õ/ instead of /u:n/. Conversely, some words in category 3 are phonetized with Arabic phonemes, for example by substituting /u:/ for the phonemes /y/, /u/, /O/ or /O/, and /a:/ for /E/.

D. Discussion
[...] converter. Unfortunately, we do not have sufficient data for testing such a converter, since our corpus includes only about 1k words of category 3. In terms of resources, this work allowed us to build a phonetized dictionary for the Algiers dialect; to our knowledge, no such resource is available at this time.

VI. MORPHOLOGICAL ANALYZER FOR ALGERIAN DIALECT

A. Related works

Compared to MSA, there are few Morphological Analysers (MA) dedicated to Arabic dialects. Works in this area could be divided into two categories. The first includes MAs that are built from scratch, such as in [20] and [21]; the second includes works that attempt to adapt existing MSA Morphological Analysers to Arabic dialects. This trend is adopted for several dialects since it is not time consuming. In [22], the authors used BAMA, the Buckwalter Arabic Morphological Analyser [3], by extending its affixes table with Levantine/Egyptian dialectal affixes. The same approach is adopted in [23], where a list of dialectal affixes (belonging to four Arabic dialects) was added to the Al-Khalil [24] affix list. The authors in [25] converted the ECAL (Egyptian Colloquial Arabic Lexicon) to the SAMA (Standard Arabic Morphological Analyzer) representation [26]. For the Tunisian dialect, the authors in [27] adapted the Al-Khalil MA: they created a lexicon by converting MSA patterns to Tunisian dialect patterns and then extracting specific roots and patterns from a training corpus that they created.

B. Adopted Approach

To build an MA for the Algiers dialect, we decided to adapt BAMA, since this is not time consuming and takes advantage of the fact that BAMA is widely used. BAMA is based on a dictionary of three tables containing Arabic stems, suffixes and prefixes, and three compatibility tables defining relations between stems, prefixes and suffixes. BAMA is adapted by populating these tables with dialect data.

C. Building the dialect dictionary

We built the dialect dictionary by adopting the following principle: in order to exploit the BAMA dictionary, we kept from it all entries that also belong to ALG, with some modifications (for example, the MSA prefixes ي, ت and ب are used in ALG, so we kept them as ALG prefixes). Besides that, we deleted all entries which are not suitable for the Algiers dialect. Moreover, we created entries that are purely dialectal and which never existed in the MSA dictionary.

1) Affixes tables: For the affixes tables, affixes common to MSA and ALG are kept (in the prefixes and suffixes tables), whereas all other MSA affixes which do not belong to the dialect were deleted. In addition, some dialect affixes which do not exist in MSA were added to the affixes tables. Note that when an affix is deleted, all complex affixes where it occurs are also deleted.

1) Prefixes table: We kept some prefixes unchanged, like the prefixes ي and ت that precede imperfect verbs (for the singular third person masculine and feminine, respectively). We eliminated purely MSA prefixes12 and all complex prefixes where they appear, such as the prefix س (expressing the future when it precedes imperfect verbs) and the prefix ف13 (conjunction); some examples are given in Table XVII.

12 Prefixes that could not belong to the Algiers dialect.
13 Note that ف as an MSA conjunction prefix has been deleted (since it does not exist in ALG), and ف as a preposition prefix has been created.

TABLE XVII: Examples of kept, deleted and added prefixes in ALG prefixes table.

Kept pref.    Description
ي ، ت         Imperfect Verb Prefix (sing., third person, masc., fem.)
ال            Noun Prefix (definite article)
ب ، ل         Preposition Prefix

Del. pref.    Description
ف             Conjunction Prefix
س             Future Imperfect Verb Prefix
فبال          Conj. Pre. + Preposition Pre. + Definite Art. Pre.

Add. pref.    Description
ف             Preposition Prefix
فال           Preposition Pre. + Definite Art. Pre.
ين            Perfect verb pre. (past voice, (sing., masc.) and (plu., masc./fem.))
تن            Perfect verb pre. (past voice, (sing., fem.))

2) Suffixes table: We also eliminated all MSA suffixes not used in the Algiers dialect, mainly:

• Suffixes related to the dual, both feminine and masculine,
• Feminine plural suffixes,
• All word case ending suffixes.

All complex suffixes where they appear were also deleted. Likewise, we added dialectal suffixes like the suffix ش for negation and all complex suffixes that must be included with it. We also integrated a set of suffixes to take into account the various writings of dialect words, which are not normalized. An example is the suffix و, which could express the plural (feminine and masculine) at the end of a verb, or a possessive pronoun at the end of a noun, exactly like the MSA suffix ه. We give in Table XVIII a set of examples of each case.

2) Stems table: The dialect stems table was populated with the lexicon of the Algiers dialect corpus and with MSA stems included in BAMA. We used a part (85%, 9170 distinct words) of our ALG corpus for creating dialect stems; the remaining 15% (1618 distinct words) is used for testing.

Stems from ALG corpus lexicon

First, we began by extracting a list of nouns easily identifiable by the affix ة and the definite article ال (used only with nouns). We deleted these two affixes from all extracted words, then from the obtained list of words we created stem entries according to BAMA. Next, the rest of the corpus was analysed and classified into three sets: function words, verbs and nouns (which do not include the ة and ال affixes), and converted to stems according to the BAMA stem categories. Note that we added some stem categories to take into account all dialectal features. For example, in MSA the perfect verb stem category
TABLE XVIII: Examples of kept, deleted and added suffixes in ALG suffixes table.

Kept suff.    Description
ين            Accusative/genitive Noun Suffix (masc., plu.)
ات            Noun Suffix (fem., plu.)
ت             Perfect Verb Suffix (fem., sing.)

Del. suff.    Description
ن             Perfect/Imperfect Verb Suffix (subject, plu., fem.)
تما           Perfect/Imperfect Verb Suffix (subject, dual, fem./masc., 2nd person)
هما           Perfect/Imperfect Verb Suffix (direct object, dual, fem./masc., 3rd person)
ون            Nominative Noun Suffix (masc., plu.)
ان            Nominative Noun Suffix (masc., dual)
هن            Perfect/Imperfect Verb Suffix (direct object, plu., fem.)
تهن           Perfect Verb Suffix (subject sing., 2nd person, masc., direct object, plu., 3rd person, fem.)

Add. suff.    Description
ش             Perfect/Imperfect Verb Negation Suffix
همش           Perfect/Imperfect Verb Negation Suffix (direct object, plu., 3rd person, masc./fem.)
كمش           Perfect/Imperfect Verb Negation Suffix (direct object, plu., 2nd person, masc./fem.)
و             Per./Imp. Verb Suffix (direct object, plu., masc., fem.)
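The table-driven design attributed to BAMA in Section VI-B (three lexicon tables plus three compatibility tables) can be sketched as follows. This is a minimal illustration under stated assumptions, not BAMA's actual code or data format: every entry, every category name and the function `analyse` are hypothetical, and words are written in a simplified Latin transliteration to keep the example readable.

```python
from typing import List, Tuple

# Hypothetical lexicon tables: surface form -> morphological category.
PREFIXES = {"": "Pref-0", "y": "IVPref", "Al": "NPref"}
STEMS = {"lEb": "IV", "ktAb": "N"}
SUFFIXES = {"": "Suff-0", "w": "IVSuff-pl"}

# Hypothetical compatibility tables: allowed category pairs.
PREFIX_STEM = {("IVPref", "IV"), ("NPref", "N"), ("Pref-0", "N")}
STEM_SUFFIX = {("IV", "Suff-0"), ("IV", "IVSuff-pl"), ("N", "Suff-0")}
PREFIX_SUFFIX = {
    ("IVPref", "Suff-0"), ("IVPref", "IVSuff-pl"),
    ("NPref", "Suff-0"), ("Pref-0", "Suff-0"),
}


def analyse(word: str) -> List[Tuple[str, str, str]]:
    """Return every (prefix, stem, suffix) split licensed by the tables."""
    analyses = []
    for i in range(len(word) + 1):          # end of the prefix
        for j in range(i, len(word) + 1):   # end of the stem
            pre, stem, suf = word[:i], word[i:j], word[j:]
            if pre not in PREFIXES or stem not in STEMS or suf not in SUFFIXES:
                continue
            p, s, x = PREFIXES[pre], STEMS[stem], SUFFIXES[suf]
            # All three pairwise compatibility checks must succeed.
            if ((p, s) in PREFIX_STEM and (s, x) in STEM_SUFFIX
                    and (p, x) in PREFIX_SUFFIX):
                analyses.append((pre, stem, suf))
    return analyses
```

With these toy tables, `analyse("ylEbw")` yields the single split `("y", "lEb", "w")`, while `analyse("AllEb")` yields nothing because the noun prefix is not compatible with a verb stem. Adapting such an analyser to a dialect then amounts to editing the six tables, which is the strategy the paper describes.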
with the pattern فَعَل covers the three persons, the two genders, the singular, the dual and the plural; only the relevant suffixes are added to it to obtain its different inflected forms. In ALG, we split this stem category into two distinct stems, فْعَل and فَعْل, to cover all inflected forms of perfect verbs. In Table XIX we give an example related to the stem سمع (to hear).

TABLE XIX: Example of splitting a MSA stem to two Dialectal stems.

Eng. pro.    Dia. pro.    Dia. verb    Dia. stem    MSA pro.    MSA verb    MSA stem
She          هي           سمعت         سَمْع          هي          سمعت        سَمِع
They         هوما         سمعو         سَمْع          هم          سمعوا       سَمِع
He           هو           سمع          سْمَع          هو          سمع         سَمِع
We           حنا          سمعنا        سْمَع          نحن         سمعنا       سَمِع

Exploiting MSA BAMA stems

1) Verbs
The main idea for creating ALG verb stems from MSA stems is to use verb patterns. For example, the verbs having the ALG pattern فْعَل are in most cases Arabic verbs with the patterns فَعَل, فَعِل or فَعُل. Some other ALG verbs keep the same pattern as in MSA, like verbs with the pattern فَعَّل. From the stems table, we extracted all perfect verbs having the patterns فَعَل, فَعِل, فَعُل and فَعَّل. After that, the verbs having the first three patterns were converted to the Algiers dialect pattern by changing diacritic marks to فْعَل, while the verbs corresponding to the pattern فَعَّل were kept as they are (since this pattern is used in ALG). At this stage, we had constructed a set of Arabic verb stems having a dialect pattern; we analysed them and eliminated all stems that are not used in ALG. We give in Table XX some examples.

TABLE XX: Examples of converted stems from MSA to ALG.

Stem    ALG Dialect    MSA      English
ضرب     ضْرَب           ضَرَب     He beat
شرب     شْرَب           شَرِب     He drank
بدل     بَدَّل           بَدَّل     He changed
كبر     كْبَر           كَبُر     He grew

We proceeded as explained above for other patterns such as تفعل, تفاعل, فاعل and استفعل. It should be noted that we constructed imperfect verb stems and command verb stems from the ALG perfect verb stems that we created as described above.

2) Nouns
We kept all proper nouns from the MSA stems table because it contains a large number of entries related to countries, currencies, personal names, etc. We analysed all other types of words and kept those existing in ALG by modifying diacritics, or by adding or deleting one or more letters.

3) Function words
We deleted all function words that do not exist in ALG, like relative pronouns and personal pronouns related to the dual and feminine plural, then we translated the remaining ones to ALG.

Note that we introduced dialect stems with the non-Arabic letters ڨ (G), ڤ (V) and پ (P) in the stems table, and we modified the BAMA code to handle words containing these letters. Also, since every stem entry in BAMA contains an English gloss, when creating a dialect entry we added the Arabic word to the English gloss, so that each dialect entry is associated with an English and an Arabic gloss.

After creating the affixes and stems tables for ALG, the compatibility tables of BAMA were updated according to the data included in these tables.

D. Experiment

As mentioned above, we tested our MA on the Algiers dialect corpus; the test set contains 1618 distinct words extracted from 600 sentences chosen randomly. We consider
393 | P a g e
www.ijacsa.thesai.org
(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 7, No. 3, 2016
that a word is correctly analysed if it is correctly decomposed into prefix+stem+suffix and if all the features related to them are correct (POS, gender, number, person). We first tested the MA with stems extracted only from the ALG corpus lexicon, then we introduced stems created from the MSA stems table. We list the obtained results in Table XXI.

TABLE XXI: Results of ALG morphological Analyser.

Results               ALG corpus stems    MSA stems + ALG corpus stems
# Analysed words      703                 1115
Percentage            43%                 69%
# Unanalysed words    915                 503
Percentage            57%                 31%

We examined the words for which no answer was given by the morphological analyzer (see Table XXII). Most of the cases are:

• French words which do not exist in the stems table, like úæJ QK (électricité, electricity), or words like PñJ Jm .' @ (ingénieur, engineer) and ðQÒJ JË@ (numéro, number) that are included in the stems table but with another orthography (respectively PñJ J m.' @ and ð Q ÒJ JË@). The same case is observed for nouns written with the long vowel @ at the end instead of è, such as ACK (place).
• We also noticed that some words are written with missing letters, like the word AË@ which appears in the stems table as ZAË@. The same case is noticed for úÍA¯ (he said to me) instead of úÍA¯ or úÍ ÈA¯, or ñÊJ¯ (I said to him) instead of ñÊJʯ or ñË IÊ ¯.
• Some unanalyzed words are also proper nouns.

TABLE XXII: Examples of unanalyzed words.

Unanalyzed word  Corresponding stem  English
K QK@ HA I K QK@  Internet
YªJ.Ó@  YªJ.Óð@  After
QAÖß QK QÖß QK  Trimester
®J ÊJ K àñ ®J ÊK àñ  Phone

VII. CONCLUSION

This paper summarizes a first attempt to work on Algerian Arabic dialects, which are non-resourced languages. These dialects lag behind other dialects of the Middle East, to which several works have been dedicated and for which many NLP tools have been produced. The presented work is the first part of a big project of speech translation between MSA and Algerian dialects. We focus in this first part on the one spoken in Algiers and its periphery. We began with a study showing all features related to it, then we introduced resources that we created from scratch. This process was expensive in terms of time and human effort but the results were worth it. We obtained a cleaned corpus of Algiers dialect aligned to MSA; this corpus is the first parallel corpus which includes Algerian dialect to date. We also presented the Grapheme-to-Phoneme converter that we created for Algiers dialect. We combined a rule-based approach with a statistical approach. The level of correctness for the G2P converter is about 85%. In terms of corpus resources, this task enabled us to transcribe the ALG corpus to a phonetic form. We also proposed a morphological analyser for ALG that we adapted from the well-known BAMA dedicated to MSA. We reached an accuracy rate of 69% when evaluating it on a dataset extracted from the ALG corpus. Our future work, before developing a statistical machine translation system, is to extend the corpus we created to other Algerian Arabic dialects, and to adapt all tools dedicated to ALG to these dialects.

ACKNOWLEDGEMENT

This work has been supported by PNR (Projet National de Recherche of Algerian Ministry of Higher Education and Scientific Research).

REFERENCES

[1] S. Harrat, K. Meftouh, M. Abbas, and K. Smaili, “Building resources for Algerian Arabic dialects,” in Proceedings of Interspeech, 2014, pp. 2123–2127.
[2] S. Harrat, K. Meftouh, M. Abbas, and K. Smaili, “Grapheme to phoneme conversion: An Arabic dialect case,” in Proceedings of 4th International Workshop On Spoken Language Technologies For Under-resourced Languages SLTU, 2014, pp. 257–262.
[3] T. Buckwalter, “Buckwalter Arabic morphological analyzer version 1.0,” Linguistic Data Consortium LDC2002L49, 2002.
[4] K. Kirchhoff, J. Bilmes, S. Das, N. Duta, M. Egan, G. Ji, F. He, J. Henderson, D. Liu, M. Noamany, P. Schone, R. Schwartz, and D. Vergyri, “Novel approaches to Arabic speech recognition: Report from the 2002 Johns-Hopkins summer workshop,” in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’03), vol. 1, April 2003, pp. I–344–I–347.
[5] R. Hetzron, The Semitic Languages, ser. Routledge language family descriptions. Routledge, 1997. [Online]. Available: https://books.google.com/books?id=nbUOAAAAQAAJ
[6] J. C. Watson, The phonology and morphology of Arabic. Oxford University Press, 2007.
[7] A. Boucherit, L’Arabe parlé à Alger. ANEP Edition, 2002.
[8] C. A. Ferguson, “Diglossia,” Word, vol. 15, pp. 325–340, 1959.
[9] F. H. Amer, B. A. Adaileh, and B. A. Rakhieh, “Arabic diglossia: A phonological study,” Argumentum 7, Debreceni Egyetemi Kiadó, Tanulmàny, pp. 19–36, 2011.
[10] C. A. Ferguson, “Two problems in Arabic phonology,” Word, vol. 13, pp. 460–478, 1957.
[11] S. Harrat, M. Abbas, K. Meftouh, and K. Smaili, “Diacritics restoration for Arabic dialect texts,” in Proceedings of Interspeech, 2013, pp. 125–132.
[12] M. Alghamdi, H. Almuhtasab, and M. Alshafi, “Arabic phonological rules,” Journal of King Saud University: Computer Sciences and Information (in Arabic), vol. 16, pp. 1–25, 2004.
[13] Y. A. El-Imam, “Phonetization of Arabic: rules and algorithms,” Computer Speech & Language, vol. 18, no. 4, pp. 339–373, 2004.
[14] M. Zeki, O. O. Khalifa, and A. Naji, “Development of an Arabic text-to-speech system,” in International Conference on Computer and Communication Engineering (ICCCE). IEEE, 2010, pp. 1–5.
[15] P. Taylor, “Hidden Markov model for grapheme to phoneme conversion,” in Proceedings of Interspeech, 2005, pp. 1973–1976.
[16] K. U. Ogbureke, P. Cahill, and J. Carson-Berndsen, “Hidden Markov models with context-sensitive observations for grapheme-to-phoneme conversion,” in Proceedings of Interspeech, 2010, pp. 1105–1108.
[17] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst, “Moses: Open Source Toolkit for Statistical Machine Translation,” Proceedings of the Annual Meeting of the Association for Computational Linguistics, demonstration session, pp. 177–180, 2007.
Appendix
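As an illustration of the evaluation described in Section D, the following minimal Python sketch shows one way the correctness criterion and the percentages of Table XXI could be expressed. It is not the authors' code: the `Analysis` record, its field names and the functions `is_correct` and `rate` are hypothetical names introduced here, and the sample stem is a placeholder. Only the word counts (1618, 703, 1115) are taken from Table XXI.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Analysis:
    """One analysis of a word (hypothetical layout, not BAMA's output format)."""
    prefix: str
    stem: str
    suffix: str
    pos: str
    gender: str
    number: str
    person: str


def is_correct(predicted: Optional[Analysis], gold: Analysis) -> bool:
    """A word is correct only if the prefix+stem+suffix split and
    every feature (POS, gender, number, person) match the reference."""
    return predicted is not None and predicted == gold


def rate(count: int, total: int) -> int:
    """Share of `total`, as a percentage rounded to a whole number."""
    return round(100 * count / total)


# Word counts reported in Table XXI.
TOTAL_WORDS = 1618
ANALYSED = {
    "ALG corpus stems": 703,
    "MSA stems + ALG corpus stems": 1115,
}

for setting, analysed in ANALYSED.items():
    unanalysed = TOTAL_WORDS - analysed
    print(f"{setting}: {rate(analysed, TOTAL_WORDS)}% analysed, "
          f"{rate(unanalysed, TOTAL_WORDS)}% unanalysed")

# Placeholder example: an unanalysed word (None) never counts as correct.
gold = Analysis("", "stem", "", "verb", "masc", "sg", "3")
print(is_correct(gold, gold), is_correct(None, gold))
```

Comparing whole records rather than individual fields mirrors the strict criterion stated in Section D: a single wrong feature makes the analysis of that word incorrect.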