Received 10 July 2023, accepted 21 August 2023, date of publication 29 August 2023, date of current version 7 September 2023.

Digital Object Identifier 10.1109/ACCESS.2023.3309642

Challenges in Rendering Arabic Text to English Using Machine Translation: A Systematic Literature Review
SHAHAB AHMAD ALMAAYTAH¹ AND SOLEMAN AWAD ALZOBIDY²
¹ Department of English Language and Humanity, Applied College, King Faisal University, Hofuf 31982, Saudi Arabia
² Department of English Language and Translation Studies, College of Sciences and Theoretical Studies, Saudi Electronic University, Riyadh 11673, Saudi Arabia

Corresponding author: Shahab Ahmad Almaaytah ([email protected])


The authors extend their appreciation to the Deanship of Scientific Research at King Faisal University for funding this research work
through the project number GRANT2, 503.

ABSTRACT Arabic text can be translated into English using a variety of machine translation techniques, but translating contemporary Arabic text into English still poses a number of challenges. To identify the challenges encountered when translating Arabic text into English using machine translation, a systematic literature review (SLR) approach is used. The SLR steps (protocol creation, initial and final selection, quality assessment, data extraction, and synthesis) are followed. Nineteen challenges are reported from the SLR process, based on fifty-six research papers. The four most important problems are carefully examined, and the solutions proposed by other researchers are discussed. Word sense disambiguation, Arabic named entities, rich and complex morphology, and low resources are the four critical challenges in rendering Arabic text into English text. Other challenges are also reported in this article.

INDEX TERMS Natural language processing, machine translation, Arabic, systematic literature review,
challenges.

The associate editor coordinating the review of this manuscript and approving it for publication was Agostino Forestiero.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/

VOLUME 11, 2023, p. 94772

I. INTRODUCTION
Machine translation (MT) has advanced for practically all languages in recent years and has become important in many applications [1]. Current MT advancements have greatly improved translation quality [2], and correct, precise machine translations are increasingly in demand. Finding an adequate and ideal translation, however, is a difficult task in any linguistic context [3], [4]. Several machine translation systems already exist, including Al-Mutarjim™ Al-Arabey 3.0¹, Sakhr², SYSTRAN³, Shaheen⁴, Bing Translator⁵, Babylon⁶, and Google Translate⁷. Various studies [3], [5], [6], [7] highlight challenges, such as linguistic mistakes, which indicate that translation quality for the Arabic language needs further improvement. The challenges faced by machine translation can be broadly divided into two groups: technical challenges and linguistic challenges.

¹ https://al-mutarjim-al-arabey.software.informer.com/3.0/
² http://www.sakhr.com/index.php/en/
³ https://www.systransoft.com/
⁴ https://mt.qcri.org/api
⁵ https://www.bing.com/translator/
⁶ https://translation.babylon-software.com/
⁷ https://translate.google.com

A major technical challenge associated with Arabic machine translation (AMT) is the lack of datasets and lexical resources that can be used as common benchmarks for conducting unified tests. In fact, researchers frequently collect only data relevant to their own fields of study, ignoring a wide range of other fields, and then use these data to try to fix the linguistic problems with Arabic. MT is made more difficult by additional technical difficulties such as out-of-vocabulary (OOV) words, extremely long sentences, and out-of-domain test data [8]. Examples of effective solutions include byte-pair encoding (BPE) [9], character-level BPE variation [10], and hybrid approaches [11].

A main linguistic challenge is the nature of the Arabic language, which has a great degree of ambiguity, linguistic complexity, and variety when compared to other languages. Other Arabic features like word order freedom, several diacritization schemas, and a wide variety of dialectal variants along
social and geographic dimensions present serious linguistic problems to MT [12]. For instance, it has been demonstrated that the performance of AMT can be improved [13], [14], [15], [16], [17] by pre-processing the Arabic source through morphological segmentation [14], [15], syntactical reordering [16], and hybridization.

A survey on Arabic machine translation was conducted to explore the machine translation techniques available in the literature and to encourage researchers to study them. That survey summarized the major techniques used in machine translation from Arabic into English and discussed their strengths and weaknesses [4].

Various surveys [4], [18], [19] have thoroughly examined the topic of machine translation from Arabic to other languages. All of these earlier analyses and studies came to the conclusion that it is difficult to design a good MT system that satisfies human criteria [4]. However, none of the mentioned survey papers performed a systematic literature review to identify the existing challenges of Arabic Machine Translation (AMT). This paper adopts the systematic literature review approach to identify the various challenges, and their possible solutions, that exist in the literature. These challenges are then classified. To accomplish this, we address the following research questions:

Research Question 1: What are the challenges, as identified in the literature, of rendering Arabic text to English using machine translation?

Research Question 2: What are the proposed solutions and their limitations, as identified in the literature, of rendering Arabic text to English using machine translation?

II. ARABIC MACHINE TRANSLATION MECHANISMS
Rule-based, statistical, and neural machine translation are the three basic mechanisms for machine translation [4]. These three approaches are also used in Arabic machine translation.

A. RULE-BASED MACHINE TRANSLATION
In rule-based machine translation, a set of linguistic rules is used to translate the source text into the target text [4], [20], [21], [22]. A language specialist usually develops the rules. The use of bilingual or multilingual lexicons, including those for Arabic and other languages, is another component of this strategy. Note that both the lexicons and the rule collection are constructed manually.

The main Arabic MT system, known as UniArab, was created by Salem et al. [23] as a global MT system based on a linguistic model. The strength of this method lies in its ability to thoroughly analyse both the syntactic and semantic levels, as indicated in [4], and in the fact that it still works for language pairs with little available parallel data, that is, low-resource language pairs [22]. However, it is hard to create rules that apply to all languages because doing so requires extensive linguistic expertise and a high-quality dictionary. The latter is more expensive to construct and might not include all the terms. In addition, linguistic specialists are required to develop thorough rules.

B. STATISTICAL MACHINE TRANSLATION
Statistical machine translation (SMT) [4], [20], [21], [22] uses statistical models learned from datasets made up of sentence-aligned parallel corpora. For the majority of languages, phrase-based SMT models provide state-of-the-art performance [24]. In this approach, the translation model is first trained on the bilingual corpus to estimate the probability that the source sentence is a translated version of the target sentence. The language model is then trained on monolingual corpora and is used to improve the fluency of the output translation. Finally, the sentence that maximizes the product of the language model and translation model probabilities is selected as the most probable sentence in the target language. Phrase-based, syntax-based, and hierarchical phrase-based models are the three types of SMT models [22].

Statistical MT can handle ambiguity by recording phrase-based translations with their frequency of occurrence in a phrase table [4], [20]. The translation generated through this approach is more fluent and natural. In addition, this mechanism is language independent, and it is easy, cheap, and fast to build.

C. NEURAL MACHINE TRANSLATION
Neural MT (NMT) models have been proposed and have outperformed other mechanisms, though these models need a huge amount of parallel data to be trained. Convolutional neural networks (CNN) are used to encode a source text into a continuous vector, and recurrent neural networks (RNN) are used as the decoder to predict the words in the target language. The concept of the attention mechanism was developed by Bahdanau et al. [2], where the decoder attends to the whole input or to any element of the input text. A vector with the same size as the input sequence is produced by calculating attention using each encoder output and the current hidden state.

A neural MT model between Arabic and English was created by Almahairi et al. [25]. Some studies investigate neural characteristics for Arabic text [25], [26]. The primary difference between neural and statistical MT is that the latter relies on a separate language model, while the former has seen success in a variety of domains [25] in terms of fluency and accuracy. The fundamental issue with NMT, however, is that it requires a large parallel corpus [25], which increases the complexity of training the model.

Baniata et al. [27] introduced a Transformer-based neural machine translation model for Arabic text. This system used subword units and a shared vocabulary within the Arabic dialects to enhance the behavior of the multi-head attention sublayers of the encoder. Experiments were carried out to validate that the proposed mechanism adequately addresses the unknown word issue and boosts the quality of Arabic translation. The self-attention-based Transformer [28] is a stack of layers in a sequence-to-sequence model. To create
non-linearity, each layer first uses self-attention to extract information from the entire sentence. This is followed by a point-wise feed-forward network. To enhance the outcomes on Arabic-English translation, a deep learning architecture based on Convolutional Neural Networks (CNNs) and the Transformer model was developed [29]. Experiments on the UN Arabic-English datasets showed that the Transformer-based model performs better than the most advanced Arabic MT systems.

III. RESEARCH METHODOLOGY
A Systematic Literature Review (SLR) process [30] is used for data collection. It is a structured and defined procedure for finding, evaluating, and analysing published primary studies in order to answer a particular research question. Systematic literature reviews differ from ordinary literature surveys because they are explicitly planned and methodically executed. By finding, analysing, and summarising all available information on a particular research subject, a systematic review may offer a higher level of validity in its conclusions than is achievable in any one of the papers it examines.

To plan the review's strategy, a systematic review protocol is created. These are the key steps in this methodology:

A. SEARCH STRATEGY
The following steps are used to construct the search terms.
Step 1: Use the research questions to derive the major terms, by identifying the population, intervention, and outcome.
Step 2: For these major terms, find alternative spellings and synonyms.
Step 3: Verify the keywords in any relevant paper.
Step 4: Use Boolean operators if the database allows: 'OR' to concatenate alternative spellings and synonyms, and 'AND' to concatenate the major terms.

The results of Step 1 for the research questions are:
RQ1: Machine translation, Challenges, Rendering advertisement Arabic text to English
RQ2: Machine translation, Mechanisms and its limitations, Rendering advertisement Arabic text to English

The results of Step 2 for the research questions are:
RQ1:
Machine translation: (MT, Computer translation, Automatic translation, Automatic text conversion)
Challenges: (Challenges, Difficulties, Threats, Complaints, Hardships, Hardness, Problems, Complications, Obstacles)
Rendering advertisement Arabic text: (Arabic translating, Arabic text translation, Arabic broadcasting text)
RQ2:
Machine translation: (MT, Computer translation, Automatic translation, Automatic text conversion)
Solution and limitation: (Mechanisms, Techniques, Methods, Strategies, issues, restrictions)
Rendering advertisement Arabic text: (Arabic translating, Arabic text translation, Arabic broadcasting text)

The results of Step 3 for the research questions are:
Machine translation, Arabic machine translation, Machine translation challenges, advertisement text translation.

The results of Step 4 for the research questions are:
RQ1:
("Machine translation" OR MT OR "Computer translation" OR "Automatic translation" OR "Automatic text conversion") AND (Challenges OR Difficulties OR Threats OR Complaints OR Hardships OR Hardness OR Problems OR Complications OR Obstacles) AND ("Rendering advertisement Arabic text" OR "Arabic translating" OR "Arabic text translation" OR "Arabic broadcasting text" OR "Arabic language")
RQ2:
("Machine translation" OR MT OR "Computer translation" OR "Automatic translation" OR "Automatic text conversion") AND ("Solution and limitation" OR Mechanisms OR Techniques OR Methods OR Strategies OR issues OR restrictions) AND ("Rendering advertisement Arabic text" OR "Arabic translating" OR "Arabic text translation" OR "Arabic broadcasting text" OR "Arabic language")

B. RESOURCES TO BE SEARCHED
The following digital libraries and databases will be searched.
• IEEE Xplore (ieeexplore.ieee.org)
• ACM Digital Library (www.acm.org)
• Google Scholar (scholar.google.com)
• ScienceDirect (sciencedirect.com)
• SpringerLink (springerlink.com)

C. SEARCH CONSTRAINTS AND VALIDATION
We search the aforementioned resources for all published literature relevant to our search terms (strings). Because we are looking for all relevant literature, we do not set any date boundaries. A prior trial search was conducted on the ScienceDirect (sciencedirect.com) and IEEE Xplore (ieeexplore.ieee.org) digital libraries using a set of major terms: ("machine translation") AND (challenges OR solutions OR "proposed solutions" OR limitations) AND ("Arabic text" OR "Arabic advertisement text").

The trial search returned a number of related research papers. These papers will be used to validate our search terms (strings).

D. PUBLICATION SELECTION
The publication selection procedure will be carried out using publication inclusion criteria, publication exclusion criteria, and selection of the primary sources. The main purpose of this procedure is to choose only those search results that are relevant to our research questions. We will select only those research papers, reports, and books that relate to Arabic machine translation. Other research papers, reports, and books not related to Arabic machine translation will be ignored.

The inclusion criteria are listed below:
• Research work that describes challenges in Arabic machine translation
• Research work that describes difficulties in translating advertisement text from Arabic to English
• Research work that identifies an Arabic machine translation system
• Research work that shows different solutions for translating Arabic text to English
• Research work that describes limitations in translating Arabic text

Exclusion criteria are used to decide which pieces of literature (research papers, reports, and books) found by the search terms will not be selected for review. The criteria are listed below:
• Research work that is not relevant to the research questions
• Research work that does not describe Arabic machine translation
• Research work on topics other than machine translation

E. SELECTING PRIMARY SOURCES
Primary sources will be initially selected by analyzing the title, keywords, and abstracts of the searched literature. This review will exclude any searched literature that has no relevance to the research questions.

The primary sources chosen during this initial selection process will be checked against the above inclusion and exclusion criteria by reviewing the full text of the research papers. If any uncertainty occurs regarding the inclusion or exclusion decision, the case will be sent to the secondary reviewer. The process will be checked by the third reviewer.

A record of the inclusion or exclusion decision for each primary source will be maintained properly. It will include the justification for whether or not the primary source has been included in the final review.

F. PUBLICATION QUALITY ASSESSMENT
The publication quality assessment is carried out when the final selection of publications is completed. This assessment is performed in parallel with the data extraction process.

The quality assessment checklist contains the following questions, which will be marked as "Yes", "No", "partial", or "NA":
• Is it clearly identified challenges/difficulties during tracking object(s) in the augmented reality environment?
• Is it clearly identified fields/area in the augmented reality tracking environment?
• Is it clear how to solve challenges/difficulties during tracking object(s) in the augmented reality environment?

IV. RESULTS
The publication follows the SLR procedure as given in [31].

A. SEARCHING AND SELECTING SOURCES
The process of searching and selecting the primary research for both research questions is shown in Table 1.

The planned selection process had two parts: an initial selection, from the search results, of papers that could plausibly satisfy the selection criteria, based on a reading of the title and abstract of the papers; followed by a final selection, from the initially selected list, of papers that satisfy the selection criteria, based on a reading of the entire papers. To reduce researcher bias, an inter-rater reliability test was performed in which the secondary reviewer randomly selected five publications from the "Total Searched" list and performed the initial and final selection processes. The results were compared with the results produced by the primary author, and no disagreements were found. We identified fifty-six (56) research articles for research question 1 and thirty-nine (39) research articles for research question 2, as shown in Table 1. Eighteen (18) research articles are common to both research questions, so data extraction is performed on seventy-seven (77) distinct research papers (56 + 39 - 18 = 77).

From the final selected research papers, the data related to our questions are extracted. This extraction contains the challenges reported in these research articles and their possible solutions.

B. DATA ANALYSIS
In data analysis, the first step in organizing quantitative data is to group scores or values into frequencies, because frequency analysis is helpful for the treatment of descriptive information. The number of occurrences and the percentage of each challenge in Arabic MT are reported using these frequency tables.

To answer RQ1, Table 2 shows the list of challenges in translating Arabic advertisement text to English as identified through the SLR. "Word sense disambiguation/Ambiguity" and "Arabic named entity" are the most common challenges in translating advertisement text from Arabic to English. The results also indicate that "Rich and complex morphology" and "Low resource" are the other critical challenges reported in the literature.

A comparison of the main challenges in Arabic machine translation is presented below.

1) WORD SENSE DISAMBIGUATION
The Arabic language has many different types of ambiguity; depending on the situation, many words can have many meanings. For instance, the word ‘ ’ can mean two different things: first, as a verb, "go", and second, as a noun, "gold". A human using common sense can recognise this ambiguity with ease, but a machine translating the text cannot tell the difference. Instead, MT needs more intricate computation and analysis to accurately determine the meaning; this procedure is known as Word Sense Disambiguation (WSD) [32], [33].

Word Sense Disambiguation (WSD) is the challenge of determining a word's sense (meaning) in a certain context. WSD in Natural Language Processing (NLP) is the process of automatically figuring out a word's meaning by taking the surrounding context into account [34]. One illustration of an ambiguous Arabic term is the word ‘ ’ (Khal), which can be rendered as either "empty," "imagined," or "battalion."
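The idea of resolving a word's sense from its surrounding context can be illustrated with a minimal overlap-based sketch in the spirit of knowledge-based WSD. It is not the method of any cited work. The three senses are the renderings of (Khal) given above; the cue words attached to each sense, the English context sentence, and the names `SENSE_CUES` and `disambiguate` are hypothetical and chosen only for illustration.

```python
# Hypothetical sense inventory: each English rendering of the ambiguous word
# is paired with illustrative cue words (assumed, not taken from any lexicon).
SENSE_CUES = {
    "empty": {"room", "house", "vacant", "nothing", "place"},
    "imagined": {"thought", "believed", "dream", "mind", "supposed"},
    "battalion": {"army", "soldiers", "war", "commander", "troops"},
}


def disambiguate(context_words, sense_cues=SENSE_CUES):
    """Pick the sense whose cue words overlap most with the context.

    Returns (sense, scores); sense is None when no cue word matches.
    """
    context = {word.lower() for word in context_words}
    # Score each sense by the number of cue words found in the context.
    scores = {sense: len(cues & context) for sense, cues in sense_cues.items()}
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        return None, scores
    return best, scores


if __name__ == "__main__":
    # Glossed context of a sentence that contains the ambiguous word.
    sentence = "the commander led the soldiers to war".split()
    sense, scores = disambiguate(sentence)
    print(sense, scores)
```

A real system would replace the hand-written cue sets with glosses from a lexical resource or with statistics learned from a corpus, which corresponds to the knowledge-based and corpus-based families of approaches.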


The three meanings of the word (Khal) are mixed together because the Arabic writing system is undiacritized and vowelless. In general, Arabic is full of polysemous terms. The astounding repetition of names for human body parts in Arabic is an intriguing finding. For instance, when thinking of the term "head," one would picture the neck, nose, eyes, ears, and tongue [35].

Arabic letters can also be ambiguous when attached to morphemes to create ambiguous compound words, so the problem is not restricted to Arabic words. For example, adding the letter ‘ ’, which is equivalent to the English letter "b", changes an atomic word into a compound word, as in ‘ ’ meaning "in the school", ‘ ’ meaning "by the money", ‘ ’ meaning "at the door", and ‘ ’ meaning "using the pen". This is because the letter ‘ ’ can have any of the following meanings when used as a prefix: through, in, by, for, and at. These are only five of the ten functions that the letter ‘ ’ can play when prefixed to various nouns [35].

The biggest hurdle for WSD comes from Arabic texts that lack diacritical marks, since the absence of diacritics increases the number of possible meanings for a word and consequently makes disambiguation much more challenging. According to the Arabic WordNet (AWN), the word ‘ ’ (Sawt), for instance, has 11 senses when written without diacritics, but only two when it is written as ‘ ’ (Sawata). The word ‘ ’, which has seven senses, is another example [36].

The works [32], [37], [38], [39], [40] attempt to handle the challenge of WSD. All of the WSD mechanisms make use of the words in a sentence to mutually disambiguate each other. The techniques differ in the type and source of knowledge that the lexical units in a sentence convey. Therefore, all of these methodologies can be categorized as knowledge-based or corpus-based approaches.

2) ARABIC NAMED ENTITY
Identifying and categorising proper names within an open-domain text is the goal of the Named Entity Recognition (NER) task. Because of the intricate morphology of Arabic, this NLP task is generally recognised to be more challenging for the Arabic language. The use of NER has also been demonstrated to improve the performance of NLP tasks like machine translation, information retrieval, and question answering. The lack of capitalization, the extensive lexical variety, and the inconsistent manner in which Arabic names are written all contribute to the difficulty of named entities [41]. For either meaning-based translation or phoneme-based transliteration to yield a trustworthy translation result, named entities must be handled correctly [42]. These issues were addressed by Shaalan [41], who developed a productive and robust Arabic named entity recognition system.

Ameur et al. [19] proposed an approach to MT between Arabic and English using an attention-based encoder-decoder. The outcomes demonstrated the effectiveness of their strategy compared to some earlier studies. Recently, to improve NE transliteration, Alkhatib and Shaalan [42] used a hybrid deep learning approach based on CNN, followed by Bi-LSTM and CRF. Their findings on the ANERcorp and Kalimat corpora demonstrate the effectiveness of their model for Arabic-English machine transliteration, producing state-of-the-art results.

3) RICH AND COMPLEX MORPHOLOGY
The morphology of Arabic is rich and complicated, and very different from that of English. By reducing the size of the source vocabulary and enhancing the accuracy of word alignments, pre- and post-processing of Arabic source text using NLP techniques like segmentation and tokenization has been proven to increase the performance of MT. For example, the single Arabic word ‘ ’, translated as "and his home" in English, is formed by attaching the prefix ‘ ’ ("and") and the suffix ‘ ’ ("him") to the base lexeme ‘ ’ ("home"). Numerous publications have addressed this difficult aspect of Arabic [14], [25], [43], [44], [45], [46], [47].

Compared with prior AMT research, few MT investigations address this issue. An attention-based neural machine translation model between Arabic and English was proposed by Almahairi et al. [25] and achieved the highest level of accuracy.

More recently, Garcia-Martinez et al. [48] examined the results of decomposing the target words of an Arabic-French factored NMT model employing linguistic preprocessing. At decoding time, their model predicted the lemma together with a combination of the following elements: the POS tag, the tense, the gender, the number, the person, and the case information. To replicate low-resource and rich-resource behaviors, the model was trained on small and large parallel training datasets, respectively. Both their factored and conventional NMT architectures used BPE segmentation. The factored NMT models performed significantly better, according to their evaluation results on various test sets.

4) LOW RESOURCE
Arabic is a low-resource language. Since learning depends on the volume of training data, MT performs better for high-resource languages than for low-resource languages [8]. The millions of training examples that MT systems learn from show a direct correlation with accuracy. Triangular MT [49], back-translation [49], fine-tuning [50], multilingual NMT [50], and zero-shot transfer [51] are only a few of the strategies that have been researched for handling low-resource languages.

Some works that try to handle low-resource languages are worth mentioning. For instance, to enhance the translation performance of low-resource language pairs, the authors in [49] presented a novel triangular training architecture. In this design, a rich language is used as the intermediate latent variable, and the translation models of the rich language are jointly optimised using a unified bidirectional Expectation-Maximization (EM) algorithm. On the MultiUN and IWSLT2012 datasets, their strategy dramatically improves the translation quality of rare languages like Arabic.
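The role of a rich language as an intermediate variable can be illustrated at toy scale by composing two word-translation tables through a pivot. The sketch below shows only the marginalisation p(t | s) = Σ_z p(z | s) · p(t | z); it does not implement the EM training of [49]. The choice of French as the pivot, the romanized source word, and every probability value are assumptions made for illustration.

```python
from collections import defaultdict


def compose_through_pivot(p_pivot_given_src, p_tgt_given_pivot):
    """Return p(t | s) = sum_z p(z | s) * p(t | z) for word-level tables."""
    composed = defaultdict(lambda: defaultdict(float))
    for src, pivots in p_pivot_given_src.items():
        for pivot, p_zs in pivots.items():
            # Pivot words with no entry in the second table contribute nothing.
            for tgt, p_tz in p_tgt_given_pivot.get(pivot, {}).items():
                composed[src][tgt] += p_zs * p_tz
    return {src: dict(tgts) for src, tgts in composed.items()}


# Toy tables; all entries and probabilities are hypothetical.
AR_TO_FR = {"kitab": {"livre": 0.7, "ouvrage": 0.3}}
FR_TO_EN = {
    "livre": {"book": 0.9, "pound": 0.1},
    "ouvrage": {"book": 0.6, "work": 0.4},
}

AR_TO_EN = compose_through_pivot(AR_TO_FR, FR_TO_EN)
best = max(AR_TO_EN["kitab"], key=AR_TO_EN["kitab"].get)
print(best, AR_TO_EN["kitab"])
```

Summing over pivot words lets evidence from several pivot translations reinforce the same target word, which is why the composed table can favour a target that neither pivot path supports on its own with high confidence.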


TABLE 1. Data sources and search results.

TABLE 2. Challenges of AMT identified through SLR (in descending order of frequency).

A character-based hybrid NMT model that mixes RNN and CNN networks was presented by Almansor and Al-Ani [52]. They developed their model using only 90K sentence pairs from a very small subset of the TED parallel corpora, including the IWSLT 2016 Arabic-English corpus. Compared with the OpenNMT word-based NMT model on Arabic-English, they noted considerable gains.

Abid [53] developed an NMT model as a way to enhance MT models without using any outside sources of data. To achieve solid baselines, the author bootstrapped already-existing parallel sentences and combined them with multilingual training. The author produced a benchmark dataset in four languages (Egyptian, Levantine, Arabic, and English) and made it available to the public for free. According to the findings of the study, a multilingual dialect model with MSA and bootstrapping produces the best outcomes. AraBench [54] was developed as an evaluation suite for machine translation from dialectal Arabic to English. Its authors used a variety of training settings, including fine-tuning, back-translation, and data augmentation. The evaluation suite opens a wide range of research frontiers in low-resource machine translation, such as Arabic dialect translation. Both the dialectal system and the evaluation suite are accessible to the general public for academic research. Data augmentation techniques are investigated for synthesizing dialectal Arabic-English code-switching (CS) text [55].
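One simple way to picture the synthesis of code-switched text is lexicon-based word replacement: some source tokens are swapped for their English equivalents. The sketch below is only an illustration of that idea and should not be read as the augmentation method of [55]. The romanized placeholder tokens, the two-entry lexicon, and the `switch_prob` and `seed` parameters are all hypothetical.

```python
import random


def synthesize_code_switched(tokens, lexicon, switch_prob=0.5, seed=0):
    """Replace lexicon-covered tokens with English with probability switch_prob."""
    rng = random.Random(seed)  # seeded so that the augmentation is reproducible
    switched = []
    for token in tokens:
        if token in lexicon and rng.random() < switch_prob:
            switched.append(lexicon[token])
        else:
            switched.append(token)
    return switched


# Hypothetical romanized tokens and a tiny bilingual lexicon.
LEXICON = {"kitab": "book", "jadid": "new"}
SENTENCE = ["ishtaraytu", "kitab", "jadid"]

if __name__ == "__main__":
    for seed in range(3):
        print(synthesize_code_switched(SENTENCE, LEXICON, seed=seed))
```

Varying the seed yields different switch patterns for the same sentence, so a single parallel sentence can contribute several synthetic training examples.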
S. A. Almaaytah, S. A. Alzobidy: Challenges in Rendering Arabic Text to English Using MT

The quality of the generated sentences is assessed through human evaluation, and the effectiveness of data augmentation is evaluated on machine translation (MT), automatic speech recognition (ASR), and speech translation (ST) tasks. Results showed that data augmentation improved perplexity, WER on the ASR task, and BLEU on the MT and ST tasks over a baseline trained on the available data without augmentation.

A transformation mechanism is proposed to augment data during training, extending the distribution of authentic data [56]. In particular, it uses augmented data as auxiliary tasks to provide new contexts when the target prefix is not helpful for next-word prediction. This enhances the encoder and steadily increases its contribution by forcing the Grammatical Error Correction (GEC) model to pay more attention to the text representations of the encoder during decoding. The impact of this approach was investigated using a Transformer-based model for low-resource Arabic GEC. Experimental results showed that the proposed approach outperformed the baseline, the most common data augmentation methods, and classical synthetic data approaches.

V. DISCUSSION

In this study, the challenges encountered when translating Arabic text to English are categorized according to how they are reported in the research papers. This section highlights the most significant areas examined in the reviewed studies.

Machine translation is an NP-hard problem that aims to produce accurate translations. Although technology has advanced significantly over the last decade, more work is still needed in its development. As a result, even after post-processing, the meaning of the original text is still not rendered precisely.

Despite numerous attempts to improve word alignment in AMT systems, it still falls short of the mark and exhibits divergence in terms of fluency. As a result, to improve fluency in the target language, an effective alignment procedure is needed after translation. One method for doing this is through an induced alignment during the decoding step, as was done for English.

In terms of computation, it is essential to find a way to accelerate neural network training at both the computation and memory levels, especially for morphologically rich words, to enable the use of much larger vocabularies for both the source and target languages, long sentences, and low-frequency words. With existing MT approaches, OOV, uncommon and unknown words, ambiguous words, and misspelt words are difficult to manage. One of the fundamental issues when working with the Arabic language is its intricate and rich morphology, which differs greatly from that of Indo-European languages (such as English). The reviewed experiments demonstrated that morphological segmentation and Arabic tokenization can significantly enhance overall translation results.

VI. CONCLUSION AND FUTURE WORK

A Systematic Literature Review (SLR) is used to identify the challenges encountered when translating Arabic text to English. The four main challenges encountered while rendering Arabic text into English are word sense disambiguation, Arabic named entities, rich and complex morphology, and low resource availability. The four most important problems are carefully examined, and other researchers have suggested solutions. Other challenges include Arabic vocalization, unknown words and spelling mistakes, dialectal variation, free word order, out-of-domain, out-of-vocabulary, word alignment, computational overhead, sentence length, accuracy, fluency, performance, computation time, data sparsity, and noisy data. Some of these challenges depend on other challenges during the translation process. Although technology has advanced significantly in the last decade, much effort is still needed to obtain high accuracy and good fluency.

Various machine translation techniques for translating Arabic text to English are examined in this research work. We intend to create a new Arabic to English machine translation system in the future that will handle the main challenges discussed in this research paper.

Conflicts of Interest: The authors declare no conflict of interest.

ACKNOWLEDGMENT

(Shahab Ahmad Almaaytah and Soleman Awad Alzobidy contributed equally to this work.)

REFERENCES

[1] N. T. Alsohybe, N. A. Dahan, and F. M. Ba-Alwi, "Machine-translation history and evolution: Survey for Arabic-English translations," 2017, arXiv:1709.04685.
[2] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," 2014, arXiv:1409.0473.
[3] L. Alkhawaja, H. Ibrahim, F. Ghnaim, and S. Awwad, "Neural machine translation: Fine-grained evaluation of Google translate output for English-to-Arabic translation," Int. J. English Linguistics, vol. 10, no. 4, p. 43, Apr. 2020.
[4] A. Alqudsi, N. Omar, and K. Shaker, "Arabic machine translation: A survey," Artif. Intell. Rev., vol. 42, no. 4, pp. 549–572, Dec. 2014.
[5] O. Jabak, "Assessment of Arabic-English translation produced by Google translate," Int. J. Linguistics, Literature Transl., vol. 2019, pp. 1–12, Jul. 2019.
[6] M. H. Al-Khresheh and S. A. Almaaytah, "English proverbs into Arabic through machine translation," Int. J. Appl. Linguistics English Literature, vol. 7, no. 5, pp. 158–166, 2018.
[7] M. E. Marouani, T. Boudaa, and N. Enneya, "Statistical error analysis of machine translation: The case of Arabic," Computación Sistemas, vol. 24, no. 3, pp. 1053–1061, Sep. 2020.
[8] P. Koehn and R. Knowles, "Six challenges for neural machine translation," 2017, arXiv:1706.03872.
[9] R. Sennrich, B. Haddow, and A. Birch, "Neural machine translation of rare words with subword units," 2015, arXiv:1508.07909.
[10] W. Ling, I. Trancoso, C. Dyer, and A. W. Black, "Character-based neural machine translation," 2015, arXiv:1511.04586.
[11] M.-T. Luong and C. D. Manning, "Achieving open vocabulary neural machine translation with hybrid word-character models," 2016, arXiv:1604.00788.
[12] N. Y. Habash, "Introduction to Arabic natural language processing," Synth. Lectures Hum. Lang. Technol., vol. 3, no. 1, pp. 1–187, 2010.
[13] A. Alqudsi, N. Omar, and K. Shaker, "A hybrid rules and statistical method for Arabic to English machine translation," in Proc. 2nd Int. Conf. Comput. Appl. Inf. Secur. (ICCAIS), May 2019, pp. 1–7.
[14] N. Habash and F. Sadat, "Arabic preprocessing schemes for statistical machine translation," in Proc. Hum. Lang. Technol. Conf. NAACL, Companion Volume, Short Papers, 2006, pp. 49–52.
[15] M. Oudah, A. Almahairi, and N. Habash, "The impact of preprocessing on Arabic-English statistical and neural machine translation," 2019, arXiv:1906.11751.


[16] M. Ellouze, W. Neifar, and L. H. Belguith, "Word alignment applied on English-Arabic parallel corpus," in Proc. LPKM, 2018, pp. 1–9.
[17] M. E. Marouani, T. Boudaa, and N. Enneya, "Incorporation of linguistic features in machine translation evaluation of Arabic," in Proc. Int. Conf. Big Data, Cloud Appl. Cham, Switzerland: Springer, 2018, pp. 1–18.
[18] H. M. Elsherif and T. R. Soomro, "Perspectives of Arabic machine translation," J. Eng. Sci. Technol., vol. 12, no. 9, pp. 2315–2332, 2017.
[19] M. S. H. Ameur, F. Meziane, and A. Guessoum, "Arabic machine translation: A survey of the latest trends and challenges," Comput. Sci. Rev., vol. 38, Nov. 2020, Art. no. 100305.
[20] M. Alkhatib and K. Shaalan, "The key challenges for Arabic machine translation," in Intelligent Natural Language Processing: Trends and Applications. Cham, Switzerland: Springer, 2018, pp. 139–156.
[21] B. Kituku, L. Muchemi, and W. Nganga, "A review on machine translation approaches," Indonesian J. Elect. Eng. Comput. Sci., vol. 1, no. 1, pp. 182–190, 2016.
[22] L. Han, "Machine translation evaluation resources and methods: A survey," 2016, arXiv:1605.04515.
[23] Y. Salem, A. Hensman, and B. Nolan, "Implementing Arabic-to-English machine translation using the role and reference grammar linguistic model," in Proc. 8th Annu. Conf. Inf. Technol. Telecommun., Galway, Ireland, 2008, pp. 103–110.
[24] F. J. Och and H. Ney, "The alignment template approach to statistical machine translation," Comput. Linguistics, vol. 30, no. 4, pp. 417–449, 2004.
[25] A. Almahairi, K. Cho, N. Habash, and A. Courville, "First result on Arabic neural machine translation," 2016, arXiv:1606.02680.
[26] A. Alrajeh, "A recipe for Arabic-English neural machine translation," 2018, arXiv:1808.06116.
[27] L. H. Baniata, I. K. Ampomah, and S. Park, "A transformer-based neural machine translation model for Arabic dialects that utilizes subword units," Sensors, vol. 21, no. 19, p. 6509, 2021.
[28] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, and I. Polosukhin, "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017, pp. 1–11.
[29] A. I. E. Farouk, "Transformer model and convolutional neural networks (CNNs) for Arabic to English machine translation," in Proc. 5th Int. Conf. Big Data Internet Things. Cham, Switzerland: Springer, 2022, pp. 399–410.
[30] B. Kitchenham, O. P. Brereton, D. Budgen, M. Turner, J. Bailey, and S. Linkman, "Systematic literature reviews in software engineering-a systematic literature review," Inf. Softw. Technol., vol. 51, no. 1, pp. 7–15, Jan. 2009.
[31] L. Shamseer, D. Moher, M. Clarke, D. Ghersi, A. Liberati, M. Petticrew, P. Shekelle, and L. A. Stewart, "Preferred reporting items for systematic review and meta-analysis protocols (PRISMA-P) 2015: Elaboration and explanation," BMJ, vol. 349, pp. 1–12, Jan. 2015.
[32] M. Hadni, S. E. A. Ouatik, and A. Lachkar, "Word sense disambiguation for Arabic text categorization," Int. Arab J. Inf. Technol., vol. 13, no. 1A, pp. 215–222, 2016.
[33] S. A.-A. Mussa and S. Tiun, "Word sense disambiguation on English translation of holy Quran," Bull. Electr. Eng. Informat., vol. 4, no. 3, pp. 241–247, Sep. 2015.
[34] R. Navigli, "Word sense disambiguation: A survey," ACM Comput. Surv., vol. 41, no. 2, pp. 1–69, 2009.
[35] E. Abuelyaman, L. Rahmatallah, W. Mukhtar, and M. Elagabani, "Machine translation of Arabic language: Challenges and keys," in Proc. 5th Int. Conf. Intell. Syst., Modeling Simulation, Jan. 2014, pp. 111–116.
[36] N. Bouhriz, F. Benabbou, and E. Habib, "Word sense disambiguation approach for Arabic text," Int. J. Adv. Comput. Sci. Appl., vol. 7, no. 4, pp. 1–10, 2016.
[37] W. A. Gale, K. W. Church, and D. Yarowsky, "A method for disambiguating word senses in a large corpus," Comput. Humanities, vol. 26, nos. 5–6, pp. 415–439, Dec. 1992.
[38] I. Dagan and A. Itai, "Word sense disambiguation using a second language monolingual corpus," Comput. Linguistics, vol. 20, no. 4, pp. 563–596, 1994.
[39] F. Ahmed and A. Nürnberger, "Arabic/English word translation disambiguation using parallel corpora and matching schemes," in Proc. 12th Annu. Conf. Eur. Assoc. Mach. Transl., 2008, pp. 6–11.
[40] D. Yarowsky, "Unsupervised word sense disambiguation rivaling supervised methods," in Proc. 33rd Annu. Meeting Assoc. Comput. Linguistics, 1995, pp. 1522–1531.
[41] K. Shaalan, "A survey of Arabic named entity recognition and classification," Comput. Linguistics, vol. 40, no. 2, pp. 469–510, Jun. 2014.
[42] M. Alkhatib and K. Shaalan, "Boosting Arabic named entity recognition transliteration with deep learning," in Proc. 33rd Int. Flairs Conf., 2020, pp. 484–487.
[43] Y.-S. Lee, K. Papineni, S. Roukos, O. Emam, and H. Hassan, "Language model based Arabic word segmentation," in Proc. 41st Annu. Meeting Assoc. Comput. Linguistics, 2003, pp. 399–406.
[44] A. Hatem and N. Omar, "Syntactic reordering for Arabic-English phrase-based machine translation," in Database Theory and Application, Bio-Science and Bio-Technology. Cham, Switzerland: Springer, 2010, pp. 198–206.
[45] E. A. Mohammed and M. J. A. Aziz, "English to Arabic machine translation based on reordring algorithm," J. Comput. Sci., vol. 7, no. 1, pp. 120–128, Jan. 2011.
[46] W. Antoun, F. Baly, and H. Hajj, "AraBERT: Transformer-based model for Arabic language understanding," 2020, arXiv:2003.00104.
[47] N. Zalmout and N. Habash, "Optimizing tokenization choice for machine translation across multiple target languages," Prague Bull. Math. Linguistics, vol. 108, no. 1, pp. 257–269, Jun. 2017.
[48] M. García-Martínez, W. Aransa, F. Bougares, and L. Barrault, "Addressing data sparsity for neural machine translation between morphologically rich languages," Mach. Transl., vol. 34, no. 1, pp. 1–20, Apr. 2020.
[49] S. Ren, W. Chen, S. Liu, M. Li, M. Zhou, and S. Ma, "Triangular architecture for rare language translation," 2018, arXiv:1805.04813.
[50] P. Shapiro and K. Duh, "Comparing pipelined and integrated approaches to dialectal Arabic neural machine translation," in Proc. 6th Workshop NLP Similar Lang., Varieties Dialects, 2019, pp. 214–222.
[51] W. Lan, Y. Chen, W. Xu, and A. Ritter, "GigaBERT: Zero-shot transfer learning from English to Arabic," in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2020, pp. 1–5.
[52] E. H. Almansor and A. Al-Ani, "A hybrid neural machine translation technique for translating low resource languages," in Proc. Int. Conf. Mach. Learn. Data Mining Pattern Recognit. Cham, Switzerland: Springer, 2018, pp. 347–356.
[53] W. Abid, "The SADID evaluation datasets for low-resource spoken language machine translation of Arabic dialects," in Proc. 28th Int. Conf. Comput. Linguistics, 2020, pp. 6030–6043.
[54] H. Sajjad, A. Abdelali, N. Durrani, and F. Dalvi, "AraBench: Benchmarking dialectal Arabic-English machine translation," in Proc. 28th Int. Conf. Comput. Linguistics, 2020, pp. 5094–5107.
[55] I. Hamed, N. Habash, S. Abdennadher, and N. T. Vu, "Investigating lexical replacements for Arabic-English code-switched data augmentation," 2022, arXiv:2205.12649.
[56] A. Solyman, M. Zappatore, W. Zhenyu, Z. Mahmoud, A. Alfatemi, A. O. Ibrahim, and L. A. Gabralla, "Optimizing the impact of data augmentation for low-resource grammatical error correction," J. King Saud Univ.-Comput. Inf. Sci., vol. 35, no. 6, Jun. 2023, Art. no. 101572.

SHAHAB AHMAD ALMAAYTAH received the Ph.D. degree in linguistics and translation. He is currently a Coordinator with the English Language Department, King Faisal University. His research interests include machine translation (MT), machine learning (ML), natural language processing, linguistics, and translation studies.

SOLEMAN AWAD ALZOBIDY received the Ph.D. degree in linguistics from Mysore University, India, in 2015. He is currently an Assistant Professor in linguistics with Saudi Electronic University, where he has been a Faculty Member since 2016. His research interests include computational linguistics, machine translation, and translation studies.
