


AraPunc: Arabic Punctuation Restoration Using Transformers

Abdelrahman Sakr
Computer and Systems Engineering Department
Alexandria University
Alexandria, Egypt
[email protected]

Marwan Torki
Computer and Systems Engineering Department
Alexandria University
Alexandria, Egypt
[email protected]

Abstract—Adding punctuation to Arabic text enhances readability and clarity. This is evident in many applications, such as automatic speech recognition (ASR) and machine translation (MT) systems. In this paper, we introduce a new punctuation dataset. Our AraPunc dataset is based on pre-processing of the Tashkeela "Arabic diacritization corpus". We keep six classes: space '0', full-stop '.', comma ',', colon ':', semicolon ';', and question mark '?'. We treat the punctuation restoration task as a token-wise classification problem that assigns a class (one of the six classes) to each word in the input sentence. We train different transformer-based language models on our new dataset. We found that XLM-RoBERTa outperforms other transformer-based models, with a macro-average F1-score of 0.7851 on the AraPunc test set. We also allowed cross-finetuning between the QCRI Aljazeera Speech Recognition (QASR) dataset and our novel AraPunc dataset: we achieved a macro-average F1-score of 0.7050 on the QASR test set after training the model first on the AraPunc dataset. Our experiments revealed that AraPunc provides better representations, which makes it more suitable for fine-tuning models for the punctuation restoration task. We release our dataset and code to facilitate future research on this topic¹.

Index Terms—Punctuation, Arabic Transformers, AraPunc.

I. INTRODUCTION

The punctuation restoration process is a way to retrieve the missing punctuation marks in a text to make it more readable. It breaks down long sentences into meaningful chunks, enabling readers to process the information before proceeding. Punctuation also helps clarify the meaning of a sentence or passage; it gives structure to the text, which helps readers understand relationships between words and phrases. Without punctuation, the text will be ambiguous. Punctuation also has a great effect on conveying the tone of a sentence. For example, the question mark '?' signals a question to the reader. Colons ':', semicolons ';', and commas ',' indicate transitions within a sentence that affect the rhythm and flow of the text.

The existence of punctuation is very important in translation, to detect the context of the text correctly and not confuse the meaning of one sentence with another [1].

In the summarization task, it is required to summarize each paragraph and each sentence while keeping the same meaning. Missing punctuation marks in the text make the summarization task harder. Restoring the missing punctuation therefore gives a correct structure to the text, enabling better summarization.

Punctuation is also important for the output of ASR systems, to enhance the structure of the generated text. ASR output without sufficient punctuation will affect downstream systems that depend on it, such as speaker diarization systems.

In Arabic, it is very rare to find a dataset that can be used directly to build punctuation restoration models. The QASR [2] dataset is a large-scale annotated Arabic speech corpus. It contains a large number of words and sentences, but it is proposed for conversations only, so it cannot be extended to build a general punctuation restoration system. Moreover, its text is categorized into only four classes: space, full-stop, comma, and question mark. These specific classes restrict the text to these categories, which could affect the ability to train a general artificial intelligence (AI) model to restore missing punctuation in data that includes additional or different types of punctuation marks.

We propose a more general-purpose dataset that enables training on more classes of punctuation. To achieve that goal, we choose to build the AraPunc dataset on the Tashkeela [3] dataset. Tashkeela [3] is an Arabic diacritization corpus that contains over 75 million words. It has different types of punctuation within its sentences. It is collected from Islamic classical books and modern standard Arabic texts, so it can serve as a general dataset, not limited to conversational-style text like the QASR [2] dataset. In addition to building the AraPunc dataset, we train several transformer-based models to address the punctuation restoration task.

To summarize, our contributions in this paper are multifold:
• We introduce the general-purpose AraPunc dataset containing six classes of punctuation.
• We train several transformer-based models to address the punctuation restoration task in Arabic.
• We also experiment with cross-finetuning between the QASR [2] dataset and our novel AraPunc dataset. Using the AraPunc dataset, we managed to train a transformer-based model whose performance increases when it is fine-tuned on the QASR dataset in a later step.

¹ https://github.com/Body123/Arabic-Punctuation-Prediction
II. RELATED WORK

There are several studies on punctuation restoration for transcripts generated by ASR systems, but research on general-purpose punctuation restoration is limited. The extensive survey by Păiş and Tufiş [4] addressed the distinction between methods that include audio-specific features and methods that use lexical features only. Among the approaches they covered are rule-based and bootstrapping approaches that derive rules from large corpora, such as [5]. Among n-gram approaches, [6] addressed sentence boundary detection. Among sequence tagging models, [7] was interested in punctuation recovery and capitalization restoration for the Spanish and Portuguese languages, employing a Maximum Entropy model to tackle the problem. In [8], a boosting approach for sentence-end detection was proposed using the ICSIBoost tool². Conditional random field (CRF) probabilistic models for restoring commas, exclamation marks, periods, and question marks were proposed using a factorial CRF (F-CRF) model in [9]. There are also many approaches that retrieve missing punctuation marks using neural architectures; one of them is implemented using Gated Recurrent Units (GRU) combined with an attention mechanism in [10]. CRFs and neural network architectures are considered the most frequently used techniques among previous models.

² https://github.com/benob/icsiboost

In [11], the authors were interested in the output of medical ASR systems, using a pre-trained transformer model with sub-word embeddings to overcome lexical sparsity in the medical domain. They apply a fine-tuning step on medical data and a task adaptation step, randomly masking punctuation before training the actual model. They achieved an F1-score of 0.92 for the full-stop and 0.81 for the comma with Bio-BERT [12], which was trained on biomedical corpora.

In [13], the authors solved this task as token-wise prediction and experimented with various transformer-based language models. Their results highlighted that multilingual transformer-based models produce better results than monolingual models. Notably, this holds true for languages that adhere to a left-to-right structure.

For the Arabic language, the authors of [14] used a traditional approach to punctuation and spelling correction. They used Support Vector Machine (SVM) and CRF classifiers for the classification task. Using morphological information and part-of-speech (POS) tags, they managed to achieve an F1-score of 0.56.

The QASR [2] Arabic dataset was designed for restoring punctuation related to ASR systems. It provides a large scale of Arabic audio, accompanied by annotations that include punctuation marks embedded within the annotated text. It uses only spaces, commas, full stops, and question marks. In order to prepare the data for training an AI model, several pre-processing steps are applied. Firstly, a maximum window of 120 tokens is used to segment the utterances from the same speaker. Next, utterances without any punctuation are filtered out. Additionally, utterances containing six or fewer words are also removed. Finally, diacritics and brackets are eliminated from the text. The authors mention that the distribution of punctuation in the corpus is highly imbalanced, due to the nature of the Arabic language. They use a simple transformer-biLSTM architecture as a baseline and achieve F1-scores for space, comma, full-stop, and question mark of 0.979, 0.447, 0.615, and 0.605, respectively.

In summary, most of the data used to train these models comes from speech recognition datasets. Other forms of data, like books, newspapers, etc., have not been explored for this task. As a result, we aim to use general data such as the Tashkeela [3] dataset, applying some pre-processing steps to make it suitable for punctuation restoration in Arabic text. We investigate various language models utilizing the transformer architecture, such as BERT [15] and RoBERTa [16], to build a strong AI model that can accurately restore the missing punctuation in a text.

III. APPROACH

The transformer [17] is a deep learning model architecture that utilizes an encoder-decoder structure. It uses self-attention mechanisms to capture contextual relationships between tokens in a sequence. In Fig. 1, we show how the transformer architecture can be adapted for punctuation restoration.

Fig. 1. The transformer architecture can be used for punctuation restoration. The encoder part takes the unpunctuated text. The decoder part adds punctuation to the output text.

Combining the transformer architecture with transfer learning has proven to be beneficial [15], [18]. This involves utilizing a pre-trained transformer model and fine-tuning it specifically for a downstream task. Applying this strategy has resulted in notable performance enhancements across various Natural Language Processing (NLP) tasks. In our case of restoring missing punctuation, we adapt a similar approach to leverage the advantages of transfer learning.
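To make the fine-tuning strategy concrete, the following minimal sketch (ours, not the authors' released code, and assuming the Hugging Face transformers library) loads a pre-trained checkpoint with a token-classification head sized to the six AraPunc classes; the label order is an assumption:

```python
# Minimal sketch: a pre-trained transformer with a token-classification head
# over the six AraPunc classes (space plus five punctuation marks).
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["0", ".", ",", ":", ";", "?"]  # label order is an assumption

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-large",
    num_labels=len(LABELS),
    id2label=dict(enumerate(LABELS)),
    label2id={label: i for i, label in enumerate(LABELS)},
)
```

The same construction applies to the other checkpoints compared in the next section by swapping the model name.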
A. Models

We use the following base models to apply our fine-tuning strategy:

1) XLM-RoBERTa: XLM-RoBERTa [19] is a multilingual version of RoBERTa [16]. The model is pre-trained on 2.5TB of filtered CommonCrawl data, which includes text from 100 languages. We specifically utilize the XLM-RoBERTa-large³ variant due to its extensive parameter count of 355M; this larger parameter count enables the model to capture more complex patterns.

2) CamelBERT: CAMeL-Lab [20] is an Arabic pre-trained language model trained on mixed data: Modern Standard Arabic (MSA), dialectal Arabic (DA), and classical Arabic (CA) datasets. We use this variant⁴ because of the large number of Arabic words it is trained on (17.3B words) and the variety of data types it was trained on.

3) AraBERTV2: AraBERT [21] is an Arabic pre-trained language model based on Google's BERT [15] architecture, using the same BERT-Base configuration. There are multiple versions of this model; we choose AraBERTv2⁵ as it is a more recent version. It is trained on 200M sentences and 8.6B words and has 136M parameters.

³ https://huggingface.co/xlm-roberta-large/
⁴ https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-mix
⁵ https://huggingface.co/aubmindlab/bert-base-arabertv2
B. Architecture

We follow the architecture of [13], which utilizes transfer learning of pre-trained language models to solve the punctuation restoration task. The general structure of this architecture requires the dataset to adhere to a particular format: the first column contains the word, and the second column contains a specific punctuation mark. We follow this format for our new AraPunc dataset.

In this architecture, each document is tokenized and then divided into sequences of 512 tokens. This limitation is due to the maximum number of tokens that can be processed by the pre-trained language models. However, this process introduces potential issues, such as starting or ending a sequence in the middle of a sentence, which may affect the contextual understanding required for successful prediction. To address this problem, the architecture employs an overlapping window technique, similar to the stride parameter in convolutional neural networks. A window size of 100 is used during training, where the data is processed in overlapping segments. The loss calculation is performed over the entire sequence, including the overlapping regions. This approach can be considered a form of data augmentation, as it generates new training sequences.
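A minimal sketch of this chunking step is given below. It reflects our own reading of the windowing (in particular, the 100-token window is interpreted as the step between consecutive window starts), not the authors' implementation:

```python
# Illustrative sketch: cut a long token sequence into overlapping chunks of
# at most `seq_len` tokens, advancing by `step` tokens so that consecutive
# chunks overlap. `step` is assumed to correspond to the 100-token window
# described above.
def overlapping_windows(token_ids, seq_len=512, step=100):
    """Yield overlapping chunks of token_ids, each at most seq_len long."""
    if len(token_ids) <= seq_len:
        yield token_ids
        return
    start = 0
    while True:
        yield token_ids[start:start + seq_len]
        if start + seq_len >= len(token_ids):
            break  # the last chunk already reaches the end of the document
        start += step
```

Because each chunk is scored over its full length, tokens in the overlapping regions contribute to the loss more than once, which is what produces the augmentation effect described above.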

C. Implementation

The training uses a learning rate of 4e−5 with Adafactor [22] as the optimization algorithm. We train all experiments until convergence, trying to maximize the macro-average F1-score on the dev sets, and test the generalization on the test sets.
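Continuing the earlier sketch, the reported optimizer settings could be reproduced along the following lines; the Adafactor implementation in transformers is assumed, and the flag choices are ours, made so that the fixed 4e−5 learning rate is actually used:

```python
# Illustrative sketch: Adafactor with the fixed learning rate reported above.
# relative_step and scale_parameter are disabled so that lr=4e-5 is applied
# as-is rather than being replaced by Adafactor's time-dependent schedule.
from transformers.optimization import Adafactor

optimizer = Adafactor(
    model.parameters(),  # `model` from the token-classification sketch
    lr=4e-5,
    scale_parameter=False,
    relative_step=False,
    warmup_init=False,
)
```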
IV. ARAPUNC DATASET

We introduce a new dataset, constructed through the following steps:
• From the Tashkeela [3] dataset, we retrieve the text files. Each line within these files contains various sentences. We shuffle these lines so that the train/dev/test sets contain different types of writing styles. This works because each line contains many different sentences, and most lines end with a full-stop mark.
• We remove diacritics from words in the whole dataset.
• We remove artifacts such as HTML code, Unicode symbols, different types of brackets, forward slashes, backslashes, dollar signs, etc.
• We remove all text that does not contain punctuation.
• To facilitate the extraction of punctuation marks, we add a space before and after them. This aids identification and extraction.
• We partition the acquired lines into an 80% training set and a 20% test set.
• We split the train set into a training part representing 85% of the initial train set and a dev set representing 15% of the initial train set.
• The final distribution of the AraPunc dataset is therefore: the train set represents 68%, the dev set 12%, and the test set 20% of the data. See the distribution of the AraPunc dataset in Table I.
• The data was then converted to a tab-separated format, such that the first column contains the word and the second column contains a specific punctuation mark if the word is followed by punctuation, or 0 if it is not. This format is consistent with the one used for the shared task on Sentence End and Punctuation Prediction in NLG Text [23], held at the Swiss Text Analytics Conference in 2021. See examples from the AraPunc dataset in Table II.

TABLE I
Distribution of punctuation classes in AraPunc

Class | Train              | Dev               | Test
,     | 1756058 (4.713%)   | 309118 (4.711%)   | 514741 (4.7%)
.     | 638133 (1.712%)    | 112409 (1.713%)   | 187367 (1.71%)
?     | 51798 (0.139%)     | 9193 (0.14%)      | 15448 (0.1411%)
0     | 33639104 (90.28%)  | 5923672 (90.28%)  | 9888211 (90.3%)
:     | 939876 (2.522%)    | 165549 (2.52%)    | 275918 (2.51%)
;     | 233479 (0.626%)    | 40846 (0.622%)    | 67756 (0.618%)

TABLE II
Snippet from the AraPunc dataset (columns of Arabic word–label pairs in the word/label format described above).
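The cleaning and labeling steps above can be sketched as follows. This is a rough illustration under our own assumptions about the regular expressions and about normalizing Arabic versus Latin punctuation forms, not the authors' exact rules:

```python
# Illustrative sketch of the dataset-construction steps: strip Arabic
# diacritics, space out punctuation, and emit one word per line with its
# punctuation label in the tab-separated SEPP-NLG-style format.
import re

DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")  # tanween..sukun, dagger alif
PUNCT = {".", ",", ":", ";", "?", "\u061F", "\u060C", "\u061B"}  # + Arabic forms

def to_tsv_lines(text):
    text = DIACRITICS.sub("", text)
    # add a space before and after each punctuation mark, as described above
    text = re.sub(r"([.,:;?\u061F\u060C\u061B])", r" \1 ", text)
    tokens = text.split()
    lines, i = [], 0
    while i < len(tokens):
        word = tokens[i]
        if word in PUNCT:
            i += 1  # stray punctuation with no preceding word: skip it
            continue
        if i + 1 < len(tokens) and tokens[i + 1] in PUNCT:
            lines.append(f"{word}\t{tokens[i + 1]}")
            i += 2
        else:
            lines.append(f"{word}\t0")
            i += 1
    return lines
```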
V. EXPERIMENTS AND RESULTS

In this section, we provide detailed comparisons between different transformer-based language models on our AraPunc dataset. We evaluate on the QASR [2] test set, which is not seen during training on the AraPunc dataset. We also present cross-finetuning between the QASR dataset and our novel AraPunc dataset, showing the effect of our general-purpose AraPunc dataset on specific-domain datasets such as QASR. We employ the macro-average F1-score metric to assess our outcomes.

We follow [2] to prepare the QASR dataset for our experiments. We then split the QASR dataset into train/dev/test sets. We use the test set for evaluation and make its distribution close to the dev set presented in [2] for comparison purposes. We present the distribution of the QASR dataset in Table III.

TABLE III
Distribution of punctuation classes in the QASR dataset

Class | Train              | Dev              | Test
,     | 415714 (3.18%)     | 6525 (3.29%)     | 4545 (3.37%)
.     | 146951 (1.12%)     | 2307 (1.166%)    | 1322 (0.981%)
?     | 83577 (0.6641%)    | 1373 (0.6941%)   | 780 (0.578%)
0     | 12409931 (95.05%)  | 187595 (94.84%)  | 128094 (95.06%)
A. Explore Different Language Models (DLM)

In this experiment, we train one RoBERTa-based [16] language model, namely XLM-RoBERTa [19], and two BERT-based [15] language models, namely Camel-BERT [20] and AraBERTv2 [21], on the AraPunc train set.

This experiment involves choosing the optimal checkpoint that maximizes the macro-average F1-score on the AraPunc dev set. Subsequently, this checkpoint is tested on the AraPunc test set to assess its generalization ability. This checkpoint is also selected to continue the other experiments we present. We show detailed results of this experiment in Table IV. We contrast the macro-average F1-scores of the different transformer-based language models in Table V. We observe that XLM-RoBERTa outperforms CamelBERT and AraBERTV2.
B. Evaluation on the QASR dataset

Direct Testing on the QASR test set: In this experiment, we take the optimal checkpoint obtained from exploring different language models (DLM) and use it directly to test on the QASR test set. Since the QASR dataset contains only 4 classes, whenever the model predicts one of the additional AraPunc classes (':', ';') we map it to space (0). This experiment achieved a macro-average F1-score of 0.4441; see the results in the testing (DT) part of Table VI.
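This label mapping and scoring step could look roughly as follows (our own sketch, assuming scikit-learn for the metric):

```python
# Illustrative sketch of the evaluation protocol above: predictions for the
# two classes QASR lacks are collapsed to space ('0') before computing the
# macro-average F1-score.
from sklearn.metrics import f1_score

QASR_CLASSES = {"0", ".", ",", "?"}

def qasr_macro_f1(y_true, y_pred):
    """Collapse AraPunc-only labels (':' and ';') to '0', then score."""
    mapped = [p if p in QASR_CLASSES else "0" for p in y_pred]
    return f1_score(y_true, mapped, average="macro")
```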
Train on the QASR dataset: In this experiment, we train the XLM-RoBERTa language model on the train set of the QASR dataset. This experiment achieved a macro-average F1-score of 0.6539 on the QASR [2] test set. Detailed results are in the train-from-scratch (Scratch) part of Table VI.

Fine-tune on the QASR dataset: In this experiment, we take the optimal checkpoint obtained from the DLM experiment and fine-tune it on the QASR dataset. This experiment achieved a macro-average F1-score of 0.7050. Detailed results are in the fine-tuning-from-AraPunc (F-tune AraPunc) part of Table VI. We discuss our insights from these experiments in Section VI.

C. Fine-tune from the QASR dataset (F-tune QASR)

In this experiment, we take the optimal checkpoint obtained from training on the QASR dataset and fine-tune it on the AraPunc dataset. This experiment achieved a macro-average F1-score of 0.7735. Detailed results are in Table VII.

D. Qualitative Results

We present practical examples that compare the outputs of fine-tuning from AraPunc and training from scratch on the QASR [2] dataset in Table VIII. We also present practical examples that compare the outputs of the optimal checkpoint from exploring different language models on the AraPunc dataset and of fine-tuning from the QASR dataset in Table IX. We discuss our insights from these examples in Section VI.
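For qualitative comparisons like these, per-word predictions have to be spliced back into running text. A rough sketch of that step (our own construction, assuming a fast Hugging Face tokenizer and the model from the earlier sketches) is:

```python
# Illustrative sketch: predict one of the six classes per word and splice
# the predicted punctuation marks back into the text.
import torch

def punctuate(words, tokenizer, model):
    enc = tokenizer(words, is_split_into_words=True,
                    return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits[0]
    pred = logits.argmax(-1).tolist()
    word_ids = enc.word_ids()  # sub-token -> word index (None for specials)
    out = []
    for i, word in enumerate(words):
        first_subtoken = word_ids.index(i)  # score the word's first sub-token
        label = model.config.id2label[pred[first_subtoken]]
        out.append(word if label == "0" else word + label)
    return " ".join(out)
```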
VI. DISCUSSION

The objective of exploring different language models is to determine the choice of transformer-based language model on the AraPunc dataset. From Table V, the XLM-RoBERTa [19] language model achieved the highest macro-average F1-scores of 0.7853 and 0.7851 on the AraPunc dev set and test set, respectively. We therefore choose it as the baseline for our subsequent experiments.

The objective of the evaluation on the QASR dataset is to examine the effect of the AraPunc dataset on datasets from different domains, such as QASR [2], which comes from the speech domain. The direct-testing results in Table VI show that our scores were affected badly by the domain change.
TABLE IV
Reported Precision (P), Recall (R) and macro-average F1-score (F1) on the AraPunc dev/test sets. Results in each cell map to XLM-RoBERTa / CamelBERT / AraBERTv2, respectively.

Set  | Metric | 0                     | .                     | ;                     | ?                     | ,                     | :
Dev  | P      | 0.9700/0.9686/0.9683  | 0.8577/0.8440/0.8300  | 0.6949/0.6652/0.6649  | 0.7886/0.7745/0.7050  | 0.6849/0.6729/0.6712  | 0.7804/0.7571/0.7671
Dev  | R      | 0.9780/0.9767/0.9767  | 0.8792/0.8439/0.8571  | 0.6989/0.7402/0.7183  | 0.7507/0.7424/0.8093  | 0.5686/0.5440/0.5410  | 0.7833/0.7826/0.7683
Dev  | F1     | 0.9740/0.9727/0.9725  | 0.8683/0.8439/0.8433  | 0.6969/0.7007/0.6906  | 0.7692/0.7581/0.7536  | 0.6213/0.6016/0.5991  | 0.7818/0.7696/0.7677
Test | P      | 0.9700/0.9688/0.9684  | 0.8581/0.8441/0.8278  | 0.6957/0.6647/0.6647  | 0.7904/0.7756/0.7144  | 0.6849/0.6708/0.6706  | 0.7787/0.7556/0.7661
Test | R      | 0.9780/0.9765/0.9766  | 0.8800/0.8455/0.8581  | 0.6959/0.7400/0.7163  | 0.7499/0.7377/0.8092  | 0.5688/0.5455/0.5421  | 0.7826/0.7833/0.7685
Test | F1     | 0.9740/0.9726/0.9725  | 0.8689/0.8448/0.8427  | 0.6958/0.7003/0.6896  | 0.7696/0.7562/0.7589  | 0.6215/0.6017/0.5996  | 0.7806/0.7692/0.7673
TABLE V
Macro-average F1-score comparison

Model        | Dev F1-Score | Test F1-Score
XLM-RoBERTa  | 0.7853       | 0.7851
CamelBERT    | 0.7744       | 0.7741
AraBERTV2    | 0.7711       | 0.7717

TABLE VI
Reported Precision (P), Recall (R) and macro-average F1-score (F1) on the QASR test set

Experiment                               | Metric | 0      | .      | ?      | ,
Testing (DT)                             | P      | 0.9735 | 0.1359 | 0.4141 | 0.2123
                                         | R      | 0.9450 | 0.1921 | 0.3679 | 0.3659
                                         | F1     | 0.9590 | 0.1592 | 0.3897 | 0.2687
Fine-tune from AraPunc (F-tune AraPunc)  | P      | 0.9793 | 0.6512 | 0.7404 | 0.5523
                                         | R      | 0.9862 | 0.5847 | 0.6949 | 0.4638
                                         | F1     | 0.9827 | 0.6162 | 0.7169 | 0.5042
Train from scratch (Scratch)             | P      | 0.9936 | 0.5024 | 0.4965 | 0.3383
                                         | R      | 0.9402 | 0.6241 | 0.8231 | 0.7879
                                         | F1     | 0.9662 | 0.5567 | 0.6194 | 0.4734

TABLE VII
Reported Precision (P), Recall (R) and macro-average F1-score (F1) on the AraPunc test set

Metric | 0      | .      | ;      | ?      | ,      | :
P      | 0.9679 | 0.8511 | 0.6772 | 0.7712 | 0.6742 | 0.7731
R      | 0.9778 | 0.8640 | 0.6780 | 0.7558 | 0.5431 | 0.7628
F1     | 0.9728 | 0.8575 | 0.6776 | 0.7634 | 0.6016 | 0.7679

TABLE VIII
Comparison of ground truth (GT) vs. fine-tune from AraPunc (F-tune AraPunc) vs. train from scratch (Scratch) on samples from the QASR test set (Arabic word sequences with their per-word labels).
The objective of fine-tuning from AraPunc and training from scratch on the QASR dataset is to examine the effect of the initial weights our model gains from being trained first on the AraPunc dataset before the fine-tuning process on the QASR dataset begins. The results in Table VI show an increase in performance due to the initial weights the model gained from the AraPunc dataset.

The practical examples from the QASR test set in Table VIII compare the ground truth (GT) of the QASR test set, the output obtained from fine-tuning from AraPunc, and the output obtained from training from scratch on the QASR dataset. We can see that the AraPunc dataset increases the performance of the model's predictions for the question mark and the full-stop, and teaches it the complex pattern of when to place a comma or not, which helps explain the evaluation results reported in Table VI.

The objective of fine-tuning from the QASR [2] dataset is to examine the effect of the initial weights that the model gains from being trained first on the QASR dataset before the fine-tuning process on the AraPunc dataset begins.
Compared with training from scratch on AraPunc, this experiment shows a decrease in performance in Table VII. This is because QASR contains fewer classes, and the additional AraPunc classes, like the colon and the semicolon, were previously mapped in the QASR dataset to space or to other, inappropriate punctuation marks. That causes confusion in the initial weights of the model before fine-tuning on the AraPunc dataset begins.

The practical examples from the AraPunc test set in Table IX compare the ground truth (GT) of the AraPunc test set, the output obtained from training from scratch on AraPunc, and the output obtained from fine-tuning from the QASR [2] dataset. We can see that the QASR dataset confuses the model in the prediction of the semicolon, and that the comma pattern in the QASR dataset is tied to speech style, which helps explain the evaluation results reported in Table VII.

TABLE IX
Comparison of ground truth (GT) vs. best model from Explore Different Language Models (DLM) vs. fine-tune from the QASR dataset (F-tune QASR) on samples from the AraPunc test set (Arabic word sequences with their per-word labels).
VII. CONCLUSION

In this paper, we introduced a new general-purpose dataset and compared it with a specific-domain dataset, the QASR [2] dataset, showing the positive effect of a general-purpose dataset like AraPunc on a specific-domain dataset like QASR. AraPunc can therefore play a good role as a baseline dataset that improves the performance of AI models trained on datasets from other domains, such as the speech domain.

We also showed that the transformer architecture can successfully solve the problem of Arabic punctuation restoration and improves the results for this task. We showed that models based on RoBERTa [16], like the XLM-RoBERTa [19] language model, achieve higher performance than models based on BERT [15], such as CAMeL-Lab [20] or AraBERT [21], due to the larger data size and longer training time of RoBERTa-based models⁶.

⁶ https://towardsdatascience.com/bert-roberta-distilbert-xlnet-which-one-to-use-3d5ab82ba5f8

REFERENCES

[1] M. M. Mogahed, "Punctuation marks make a difference in translation: Practical examples," Online Submission, 2012.
[2] H. Mubarak, A. Hussein, S. A. Chowdhury, and A. Ali, "QASR: QCRI Aljazeera speech resource – a large scale annotated Arabic speech corpus," arXiv preprint arXiv:2106.13000, 2021.
[3] T. Zerrouki and A. Balla, "Tashkeela: Novel corpus of Arabic vocalized texts, data for auto-diacritization systems," Data in Brief, vol. 11, pp. 147–151, 2017.
[4] V. Păiş and D. Tufiş, "Capitalization and punctuation restoration: a survey," Artificial Intelligence Review, vol. 55, pp. 1681–1722, Jul. 2021.
[5] G. Petasis, F. Vichot, F. Wolinski, G. Paliouras, V. Karkaletsis, and C. D. Spyropoulos, "Using machine learning to maintain rule-based named-entity recognition and classification systems," in Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pp. 426–433, 2001.
[6] A. Stolcke and E. Shriberg, "Automatic linguistic segmentation of conversational speech," in Proceedings of the Fourth International Conference on Spoken Language Processing (ICSLP '96), vol. 2, pp. 1005–1008, IEEE, 1996.
[7] F. Batista, I. Trancoso, and N. Mamede, "Automatic recovery of punctuation marks and capitalization information for Iberian languages," in I Joint SIG-IL/Microsoft Workshop on Speech and Language Technologies for Iberian Languages, Porto Salvo, Portugal, pp. 99–102, 2009.
[8] M. Hasan, R. Doddipatla, and T. Hain, "Multi-pass sentence-end detection of lecture speech," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[9] W. Lu and H. T. Ng, "Better punctuation prediction with dynamic conditional random fields," in Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pp. 177–186, 2010.
[10] O. Tilk and T. Alumäe, "Bidirectional recurrent neural network with attention mechanism for punctuation restoration," in Interspeech, pp. 3047–3051, 2016.
[11] M. Sunkara, S. Ronanki, K. Dixit, S. Bodapati, and K. Kirchhoff, "Robust prediction of punctuation and truecasing for medical ASR," arXiv preprint arXiv:2007.02025, 2020.
[12] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang, "BioBERT: a pre-trained biomedical language representation model for biomedical text mining," Bioinformatics, vol. 36, pp. 1234–1240, Sep. 2019.
[13] O. Guhr, A.-K. Schumann, F. Bahrmann, and H. J. Böhme, "FullStop: Multilingual deep models for punctuation prediction," June 2021.
[14] M. Attia, M. Al-Badrashiny, and M. Diab, "GWU-HASP: Hybrid Arabic spelling and punctuation corrector," in Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP), Doha, Qatar, pp. 148–154, Association for Computational Linguistics, Oct. 2014.
[15] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186, Association for Computational Linguistics, June 2019.
[16] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," arXiv preprint arXiv:1907.11692, 2019.
[17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[18] M. Hassib, N. Hossam, J. Sameh, and M. Torki, "AraDepSu: Detecting depression and suicidal ideation in Arabic tweets using transformers," in Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP), Abu Dhabi, United Arab Emirates (Hybrid), pp. 302–311, Association for Computational Linguistics, Dec. 2022.
[19] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov, "Unsupervised cross-lingual representation learning at scale," CoRR, vol. abs/1911.02116, 2019.
[20] G. Inoue, B. Alhafni, N. Baimukan, H. Bouamor, and N. Habash, "The interplay of variant, size, and task type in Arabic pre-trained language models," CoRR, vol. abs/2103.06678, 2021.
[21] W. Antoun, F. Baly, and H. M. Hajj, "AraBERT: Transformer-based model for Arabic language understanding," CoRR, vol. abs/2003.00104, 2020.
[22] N. Shazeer and M. Stern, "Adafactor: Adaptive learning rates with sublinear memory cost," CoRR, vol. abs/1804.04235, 2018.
[23] D. Tuggener and A. Aghaebrahimian, "The sentence end and punctuation prediction in NLG text (SEPP-NLG) shared task 2021," in Swiss Text Analytics Conference – SwissText 2021, Online, 14–16 June 2021, CEUR Workshop Proceedings, 2021.
