From Recurrent Neural Network Techniques To Pre-Trained Models: Emphasis On The Use in Arabic Machine Translation
1. INTRODUCTION
Over the last few years, machine translation (MT) has been extremely valuable in a wide range of
applications and has made progress for almost all languages [1]-[6]. However, the limited training corpora of low-resource languages result in poorer translation performance. Furthermore, given that utilizing an open vocabulary
in MT systems yields high computational cost, such systems constrain the vocabulary to those words that occur
most often in the training corpus. This degrades system performance, especially for morphologically rich languages, since many words are missing from the target vocabulary (out-of-vocabulary (OOV) words) and
therefore remain unknown to the system. The Arabic language has received considerable attention in the MT community over the past decade. Arabic is the official language of 25 countries; it is spoken by more than 375 million people and ranks as the fifth most spoken language in the world. It is written from right to left using a cursive script, and its alphabet contains 28 letters comprising consonants and vowels. However, the morphology of the Arabic language, along with other linguistic aspects, makes MT to and from Arabic considerably more difficult. The morphological richness of the language, characterized by a pervasive agglutination phenomenon, means that a single Arabic word may express what requires an entire phrase in English, as illustrated by the word وبمدارسهم (/wbmdArshm/), which means “and in their schools” in English. The agglutination phenomenon in such languages results in an increased number of OOV
words in neural machine translation (NMT) systems. To address these challenges, researchers have explored
alternative models that utilize smaller orthographic vocabulary units instead of complete words. One approach
is to represent words as sequences of characters, which can be achieved through techniques like byte pair encod-
ing (BPE) [7], or even by considering individual characters as the basic units. These alternatives successfully address the OOV problem but entail a significant loss of semantic and syntactic information, resulting in mistranslations [8], [9].
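As an illustration of the subword approach, the following minimal sketch trains a BPE vocabulary with the SentencePiece toolkit and segments the agglutinated example word; the corpus file name, vocabulary size, and the printed segmentation are placeholders assumed for the example rather than the settings used in this paper.

```python
import sentencepiece as spm

# Learn a BPE subword vocabulary from a raw (unsegmented) Arabic corpus.
# "train.ar" and the 32k vocabulary size are illustrative placeholders.
spm.SentencePieceTrainer.train(
    input="train.ar",
    model_prefix="ar_bpe",
    vocab_size=32000,
    model_type="bpe",
    character_coverage=1.0,   # keep the full Arabic character set
)

# Segment an agglutinated word into subword units so it is no longer OOV.
sp = spm.SentencePieceProcessor(model_file="ar_bpe.model")
print(sp.encode("وبمدارسهم", out_type=str))  # e.g. ['▁و', 'ب', 'مدارس', 'هم']
```

Because every word is decomposed into units drawn from a small closed inventory, the model can represent rare or unseen surface forms, at the cost of the semantic and syntactic information mentioned above.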
MT can be classified into three main categories: rule-based machine translation (RBMT), statistical machine translation (SMT), and NMT. RBMT relies on linguistic rules created by language experts, making
it dependent on extensive dictionaries and significant linguistic knowledge [10]. However, building such re-
sources can be expensive, and it is challenging to create rules that cover all languages. SMT, on the other hand,
is a data-driven approach that employs probabilistic models. It consists of three primary components: the translation model, the language model, and the decoder. The translation model estimates the probability that a
source sentence corresponds to a target sentence based on a bilingual corpus. The language model, trained on
a monolingual corpus, enhances the fluency of the translation. In the decoding phase, the most probable target
sentence is determined using the language and translation models. SMT can handle ambiguity by utilizing a
phrase table that records phrase-based translations and their frequency of occurrence, resulting in more fluent
and natural translations compared to RBMT [11].
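The three components just described are commonly tied together by the standard noisy-channel formulation of SMT; the formula below is the textbook version, given here only to make the roles of the translation model, the language model, and the decoder explicit.

```latex
\hat{e} = \operatorname*{arg\,max}_{e} P(e \mid f)
        = \operatorname*{arg\,max}_{e} \underbrace{P(f \mid e)}_{\text{translation model}} \;
          \underbrace{P(e)}_{\text{language model}}
```

Here f denotes the source sentence, e a candidate target sentence, and the arg max search over candidate sentences is carried out by the decoder.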
SMT has been known to struggle when translating sentences that significantly differ from the content
in the training data [10]. In recent years, NMT has gained substantial attention from the research community
due to its remarkable performance [12]-[15]. NMT models employ an end-to-end encoder-decoder frame-
work. In this architecture, the encoder plays a crucial role in converting a source sentence into a continuous
vector, commonly known as a context vector. This vector captures the pertinent information derived from
the input sentence. Once the encoder has produced the context vector, the decoder utilizes it to generate the
translation in the target language, progressing word by word. Furthermore, large pre-trained transformer-based language models (PTMs), such as the bidirectional encoder representations from transformers (BERT) [16] and generative pre-trained transformer (GPT) [17] families of models, have recently swept through natural language processing, attaining state-of-the-art performance on many tasks. The attractive side of this push towards large architectures pre-trained on massive collections of text is that the pre-trained checkpoints, as well as the inference code, are freely accessible. This can save hundreds of tensor processing unit (TPU)/graphics processing unit (GPU) hours, since warm-starting a model from a pre-trained checkpoint generally requires fewer fine-tuning steps while still achieving substantial improve-
ments in performance. More significantly, the feasibility of starting from a state-of-the-art model
such as BERT motivates the community to significantly advance toward developing both improved and easily
reusable MT systems. However, despite the success of these PTMs on benchmarks such as GLUE and the Stanford question answering dataset (SQuAD), there is still a need for research to explore their potential for other applications,
particularly in the area of sequence-to-sequence (Seq2Seq) models for MT. Arabic is one language that could
benefit from this research, as there is a growing demand for MT systems that can accurately translate Arabic
text into other languages. Hence, in this paper, we present a transformer-based Seq2Seq model for Arabic MT
that leverages the publicly available AraBERT and AraGPT-2 pre-trained checkpoints. Our model is initial-
ized using a combination of these checkpoints, and we explore various settings to find the optimal initialization
method. We show that our approach outperforms randomly initialized models and achieves new state-of-the-art
results in Arabic MT.
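As a rough sketch of this warm-starting strategy, the snippet below builds a Seq2Seq model from publicly released checkpoints with the Hugging Face transformers library; the checkpoint identifiers (aubmindlab/bert-base-arabertv01 and aubmindlab/aragpt2-base) and the token settings are assumptions made for illustration, not the exact configuration of our experiments.

```python
from transformers import AutoTokenizer, EncoderDecoderModel

# Assumed public checkpoint names; substitute the ones actually used.
ENC_CKPT = "aubmindlab/bert-base-arabertv01"   # AraBERT (encoder side)
DEC_CKPT = "aubmindlab/aragpt2-base"           # AraGPT-2 (decoder side)

# Each side keeps its own tokenizer, since the two checkpoints use
# different vocabularies.
enc_tokenizer = AutoTokenizer.from_pretrained(ENC_CKPT)
dec_tokenizer = AutoTokenizer.from_pretrained(DEC_CKPT)

# Warm-start the encoder and the decoder from the pre-trained checkpoints;
# only the decoder's cross-attention weights are initialized randomly and
# must be learned during fine-tuning on the parallel corpus.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(ENC_CKPT, DEC_CKPT)

# Token ids the Seq2Seq wrapper needs for training and generation
# (assuming the decoder tokenizer defines BOS/PAD tokens).
model.config.decoder_start_token_id = dec_tokenizer.bos_token_id
model.config.pad_token_id = enc_tokenizer.pad_token_id
```

Fine-tuning on the parallel corpus then proceeds exactly as for a randomly initialized transformer, which is what makes the different initialization schemes directly comparable.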
The rest of the paper is organized as follows. Section 2 summarizes the research work that has been
done with regard to Arabic MT. Section 3 describes the models and pre-trained checkpoints used in this work.
Section 4 reports the experiments considered in this paper and discusses the findings. Lastly, a conclusion and
future perspectives are set out.
2. RELATED WORKS
Over the past years, there has been a notable surge in research studies focused on the NMT paradigm.
In this section, we categorize the existing research on Arabic NMT into two primary classifications:
– Pre- and post-processing: these studies aim to improve the quality of NMT systems by utilizing pre-
processing and/or post-processing techniques. This includes techniques such as segmentation, normal-
ization, tokenization, and post-processing re-scoring. The focus is on optimizing the input data and
refining the output translations to improve overall performance.
– Morphology, vocabulary, and factored NMT: this category investigates the incorporation of diverse lin-
guistic knowledge sources into baseline NMT systems. These studies examine the impact of incorpo-
rating morphological information, exploring different vocabulary sizes and subword units, and incor-
porating hierarchical or factored approaches to improve translation quality. These approaches leverage
linguistic factors to enhance the NMT models.
the model to better handle Arabic morphology. Alternatively, a SentencePiece model (an unsupervised text tokenizer and detokenizer) is trained on unsegmented text to generate the second release of AraBERT (AraBERTv0.1), which involves no segmentation. The model was trained on a large-scale dataset composed of a combination of
Arabic Wikipedia, Arabic Gigaword, and OSCAR Arabic. This version of the model is particularly useful for
tasks where pre-segmented text is not available, such as social media or dialectal Arabic. The final vocabulary
size is also 64k tokens, but it includes fewer subword units than AraBERTv1.
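To make the effect of this tokenization concrete, the short snippet below tokenizes the agglutinated example word with an AraBERTv0.1 checkpoint; the Hugging Face model identifier and the shown subword split are assumptions for illustration and depend on the released vocabulary.

```python
from transformers import AutoTokenizer

# Assumed identifier of the AraBERTv0.1 checkpoint (no pre-segmentation).
tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv01")

# The subword vocabulary splits the agglutinated word into smaller units,
# so clitics such as the conjunction and the preposition do not need to be
# separated by a morphological segmenter beforehand.
print(tokenizer.tokenize("وبمدارسهم"))   # e.g. ['وب', '##مدارس', '##هم']
```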
The best model (BiGRU as the encoder and BiLSTM as the decoder with an attention mechanism and FastText embeddings) achieved in this case a BLEU score of 42.18% compared to 43.09% obtained without pre-
processing. The observed results can be attributed to the effectiveness of Arabic preprocessing in addressing
data sparsity and managing tokens that may not be present in the training corpus. Based on this analysis, the optimal combination is BiGRU as the encoder, BiLSTM as the decoder, the attention mechanism, and Arabic preprocessing.
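For reference, BLEU scores such as those reported above can be computed on detokenized output with the sacrebleu library; the paper does not prescribe a particular implementation, and the two-sentence lists below are placeholders standing in for the actual system outputs and references.

```python
import sacrebleu

# Placeholder system outputs and references (one reference per hypothesis).
hypotheses = ["and in their schools the students study arabic"]
references = [["and in their schools the pupils study arabic"]]

# corpus_bleu expects a list of hypothesis strings and a list of
# reference streams (one list of strings per reference set).
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```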
Table 1 presents the baseline scores for the best model (BiGRU as the encoder, BiLSTM as the decoder, the attention mechanism, and FastText embeddings), the original transformer model, and our transformer implementation with the same hyper-parameters. Our implementation achieves significantly higher
BLEU points than the best model. The middle section of Table 1 presents the findings for different initialization
schemes using AraBERT and AraGPT-2 pre-trained checkpoints. For AraBERT, we choose the AraBERTv0.1-
base checkpoint for initializing the encoder, the decoder, or both. First, we note that it is more beneficial to initialize the encoder side of the model with the AraBERT checkpoint. In addition, models initialized with
the AraBERT checkpoint (AraBERT2RND, RND2AraBERT, AraBERT2AraBERT, and AraBERTSHARE) re-
ceive a significant boost.
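A minimal sketch of how the AraBERT2AraBERT and AraBERTSHARE variants could be assembled is shown below, again with the Hugging Face warm-starting utility; the checkpoint identifier and the tie_encoder_decoder flag reflect our assumptions about recent transformers releases rather than the exact training code used here.

```python
from transformers import EncoderDecoderModel

CKPT = "aubmindlab/bert-base-arabertv01"   # assumed AraBERT checkpoint id

# AraBERT2AraBERT: encoder and decoder are both warm-started from AraBERT
# but keep separate copies of the weights.
bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained(CKPT, CKPT)

# AraBERTSHARE: the decoder re-uses (ties) the encoder parameters, roughly
# halving the number of stored weights and reducing the memory footprint.
bert_share = EncoderDecoderModel.from_encoder_decoder_pretrained(
    CKPT, CKPT, tie_encoder_decoder=True
)
```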
For AraGPT, we adopt the AraGPT2-base checkpoint to initialize the decoder. The AraGPT-based models (RND2AraGPT and AraBERT2AraGPT) are less efficient, particularly when AraGPT is used as the decoder and the target language is English. This is because the AraGPT model has been pre-
trained primarily on Arabic text.
5. CONCLUSION
MT is a complex task, and different languages may require different approaches to achieve the best
results. Arabic is a Semitic language with a complex structure that differs from that of European languages.
Therefore, the same MT approach may not work as well for Arabic as for European languages. Recently, neural
network-based MT has emerged as an alternative approach to traditional SMT. In this study, we compare the
performance of seven deep learning (DL) models based on LSTM, GRU, BiLSTM, and BiGRU as simple encoders/decoders
with attention mechanisms and different word embeddings, including Word2Vec, GloVe, and FastText. We
also investigate the effect of Arabic text preprocessing on the MT models’ performance. We explored different
transformer encoder-decoder models and initialized them in different ways, including random initialization and
warm-starting with public checkpoints of AraBERT and AraGPT-2. Our findings suggest that pre-trained encoder checkpoints are crucial for Arabic MT, and that sharing weights between the encoder and decoder minimizes the memory footprint. We also found that combining AraBERT and AraGPT-2 in a single model does not improve efficiency compared to a randomly initialized base model. However, we noted that it is more beneficial to initialize the encoder side of the model with the
AraBERT checkpoint. Our findings provide insights into the selection and use of pre-trained checkpoints in
neural network-based MT models, which can facilitate the development of more accurate and efficient MT
systems for Arabic. As part of future work, we believe that there is still a lot of potential in combining different
pre-trained models for MT, and we plan to investigate the impact of BERT and GPT checkpoints for multilin-
gual NMT. Additionally, we aim to evaluate different language-specific BERT model checkpoints and assess
the performance of the transformer when using the multilingual version. These investigations will help us to
better understand the strengths and limitations of different MT models and inform the development of more
effective and efficient MT systems.
ACKNOWLEDGEMENTS
We acknowledge the financial support for this research from the Centre National pour la Recherche
Scientifique et Technique (CNRST) Morocco and Khawarizmi Project.
REFERENCES
[1] N. Alsohybe, N. Dahan, and F. B.-Alwi, “Machine-translation history and evolution: survey for Arabic-English translations,” Current
Journal of Applied Science and Technology, vol. 23, no. 4, pp. 1–19, 2017, doi: 10.9734/cjast/2017/36124.
[2] A. Al-Janabi, E. A. Al-Zubaidi, and B. M. Merzah, “Detecting translation borrowings in huge text collections using vari-
ous methods,” Indonesian Journal of Electrical Engineering and Computer Science, vol. 30, no. 3, pp. 1609–1616, 2023, doi:
10.11591/ijeecs.v30.i3.pp1609-1616.
[3] R. Chingamtotattil and R. Gopikakumari, “Neural machine translation for Sanskrit to Malayalam using morphology and evolution-
ary word sense disambiguation,” Indonesian Journal of Electrical Engineering and Computer Science, vol. 28, no. 3, pp. 1709–1719,
2022, doi: 10.11591/ijeecs.v28.i3.pp1709-1719.
[4] M. K. Nyein and K. M. Soe, “Source side pre-ordering using recurrent neural networks for English-Myanmar machine
translation,” International Journal of Electrical and Computer Engineering, vol. 11, no. 5, pp. 4513–4521, 2021, doi:
10.11591/ijece.v11i5.pp4513-4521.
[5] P. Wijonarko and A. Zahra, “Spoken language identification on 4 Indonesian local languages using deep learning,” Bulletin of
Electrical Engineering and Informatics, vol. 11, no. 6, pp. 3288–3293, 2022, doi: 10.11591/eei.v11i6.4166.
[6] T. M. Angona et al., “Automated Bangla sign language translation system for alphabets by means of MobileNet,” Telkom-
nika (Telecommunication Computing Electronics and Control), vol. 18, no. 3, pp. 1292–1301, 2020, doi: 10.12928/TELKOM-
NIKA.V18I3.15311.
[7] R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of rare words with subword units,” in Proceedings of the 54th
Annual Meeting of the Association for Computational Linguistics, 2016, pp. 1715–1725, doi: 10.18653/v1/P16-1162.
[8] D. Ataman, M. Negri, M. Turchi, and M. Federico, “Linguistically motivated vocabulary reduction for neural machine translation
from Turkish to English,” The Prague Bulletin of Mathematical Linguistics, vol. 108, no. 1, pp. 331–342, 2017, doi: 10.1515/pralin-
2017-0031.
[9] A. Tamchyna, M. W. -D. Marco, and A. Fraser, “Modeling target-side inflection in neural machine translation,” in Proceedings of
the Second Conference on Machine Translation, 2017, pp. 32–42, doi: 10.18653/v1/W17-4704.
[10] L. S. Hadla, T. M. Hailat, and M. N. Al-Kabi, “Evaluating Arabic to English machine translation,” International Journal of Advanced
Computer Science and Applications, vol. 5, no. 11, pp. 68–73, 2014.
[11] M. Alkhatib and K. Shaalan, “The key challenges for Arabic machine translation,” Studies in Computational Intelligence, vol. 740,
pp. 139–156, 2018, doi: 10.1007/978-3-319-67056-0 8.
[12] K. Cho et al., “Learning Phrase Representations using RNN encoder-decoder for statistical machine translation,” in Proceedings of
the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1724–1734.
[13] A. Vaswani et al., “Attention is all you need,” in Advances in Neural Information Processing Systems 30: Annual Conference on
Neural Information Processing Systems, 2017, pp. 5998–6008.
[14] G. Manias, A. Mavrogiorgou, A. Kiourtis, and D. Kyriazis, “An evaluation of neural machine translation and pre-trained word
embeddings in multilingual neural sentiment analysis,” in 2020 IEEE International Conference on Progress in Informatics and
Computing (PIC), 2020, pp. 274–283, doi: 10.1109/PIC50277.2020.9350849.
[15] B. Klimova, M. Pikhart, A. D. Benites, C. Lehr, and C. S. -Stockhammer, “Neural machine translation in foreign language
teaching and learning: a systematic review,” Education and Information Technologies, vol. 28, no. 1, pp. 663–682, 2023, doi:
10.1007/s10639-022-11194-2.
[16] J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language under-
standing,” in 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, 2019, vol. 1, pp. 4171–4186.
[17] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding by generative pre-training,” OpenAI,
pp. 1–12, 2018.
[18] H. Sajjad, F. Dalvi, N. Durrani, A. Abdelali, Y. Belinkov, and S. Vogel, “Challenging language-dependent segmentation for Arabic:
an application to machine translation and part-of-speech tagging,” in Proceedings of the 55th Annual Meeting of the Association for
Computational Linguistics, 2017, vol. 2, pp. 601–607, doi: 10.18653/v1/P17-2095.
[19] M. Oudah, A. Almahairi, and N. Habash, “The impact of preprocessing on Arabic-English statistical and neural machine transla-
tion,” in Proceedings of Machine Translation Summit XVII Volume 1: Research Track, 2019, pp. 214–221.
[20] M. S. H. Ameur, A. Guessoum, and F. Meziane, “Improving Arabic neural machine translation via n-best list re-ranking,” Machine
Translation, vol. 33, no. 4, pp. 279–314, 2019, doi: 10.1007/s10590-019-09237-6.
[21] A. Alrajeh, “A recipe for Arabic-English neural machine translation,” Arxiv-Computer Science, vol. 1, pp. 1–5, 2018.
[22] S. Ding, A. Renduchintala, and K. Duh, “A call for prudent choice of subword merge operations in neural machine translation,” in
Proceedings of Machine Translation Summit XVII: Research Track, 2019, pp. 204–213.
[23] D. Ataman, W. Aziz, and A. Birch, “A latent morphology model for open-vocabulary neural machine translation,” in 8th International
Conference on Learning Representations, ICLR 2020, 2020, pp. 1–15.
[24] D. Ataman, O. Firat, M. A. D. Gangi, M. Federico, and A. Birch, “On the importance of word boundaries in character-level
neural machine translation,” in Proceedings of the 3rd Workshop on Neural Generation and Translation, 2019, pp. 187–193, doi:
10.18653/v1/D19-5619.
[25] X. Liu, D. F. Wong, Y. Liu, L. S. Chao, T. Xiao, and J. Zhu, “Shared-private bilingual word embeddings for neural machine
translation,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 3613–3622,
doi: 10.18653/v1/P19-1352.
[26] A. Abdelali, K. Darwish, N. Durrani, and H. Mubarak, “A fast and furious segmenter for Arabic,” in Proceedings of the 2016
conference of the North American chapter of the association for computational linguistics: Demonstrations, 2016, pp. 11–16.
[27] W. Antoun, F. Baly, and H. Hajj, “ARAGPT2: pre-trained transformer for Arabic language generation,” in WANLP 2021 - 6th
Arabic Natural Language Processing Workshop, Proceedings of the Workshop, 2021, pp. 196–207.
[28] M. Cettolo, C. Girardi, and M. Federico, “WIT3: web inventory of transcribed and translated talks,” in Proceedings of the 16th
Annual Conference of the European Association for Machine Translation, EAMT 2012, 2012, pp. 261–268.
[29] M. Cettolo, J. Niehues, S. Stüker, L. Bentivogli, R. Cattoni, and M. Federico, “The IWSLT 2016 evaluation campaign,” in Proceed-
ings of the 13th International Conference on Spoken Language Translation, 2016, pp. 1–14.
BIOGRAPHIES OF AUTHORS
Nouhaila Bensalah received her M.Sc. degree in 2018 in Informatics and Telecommu-
nications from the Department of Physics, Faculty of Sciences, Mohammed V University, Rabat,
Morocco. Currently, she is actively pursuing a Ph.D. degree at the esteemed LIM Laboratory of In-
formatics, Faculty of Sciences and Techniques Mohammedia. With a strong academic background
and a keen interest in cutting-edge technologies, Nouhaila is actively engaged in research activities.
She has made significant contributions to her field through her participation in international and na-
tional conferences, where she has presented her work in the form of eight publications. Her research
interests revolve around machine learning and natural language processing (NLP), with a particu-
lar focus on Arabic machine translation. She is deeply committed to advancing the understanding
and application of these fields, aiming to contribute to the development of improved techniques
and methodologies in the domain of Arabic machine translation. She can be contacted at email:
[email protected].
Abdellah Adib received the Doctorat de 3ème Cycle and the Doctorat d’Etat-es-Sciences
degrees in Statistical Signal Processing from the Mohammed V University, Rabat, Morocco, in 1996
and 2004, respectively. Since 1997, he has been an assistant professor at the Scientific Institute of
Rabat and a professor of higher education at the Faculty of Science and Technology of Mohamme-
dia since 2008. He was head of the Department between 2012 and 2015. He was a member of the
Scientific Committee of FSTM for two terms, 2015-2018 and 2018-2021. He was also a member of
the CNRST scientific committees as well as an expert evaluator for information technologies for two
consecutive terms, 2013-2016 and 2016-2020. Since 1993, his research has focused on automatic in-
formation processing, source separation, and applications (seismic, biomedical, and speech signals).
He is also the author or co-author of more than 30 papers in international journals and more than
80 papers in international conferences. He has been a member of several technical committees of
IEEE, EURASIP, and Springer. He has supervised more than 20 theses in different fields related to
his favorite areas. He can be contacted at email: [email protected].