Ridouane Tachicart1 (corresponding author), Karim Bouzoubaa2, Driss Namly3
1 LARGESS, FSJES El Jadida, Chouaib Doukkali University, El Jadida, Morocco
[email protected]
3 Mohammadia School of Engineers, Mohamed V University in Rabat, Rabat, Morocco
[email protected]
1 Introduction
In recent years, the field of natural language processing (NLP) has witnessed
tremendous growth and advancement [1], thanks to the increasing availability of large-
scale datasets and the development of powerful machine-learning algorithms. One
critical aspect of NLP that has received significant attention is lexicon creation.
Lexicons are comprehensive databases of words and their meanings, organized into
structured formats, and play a vital role in many NLP tasks, including machine
translation, sentiment analysis, and speech recognition.
The need for lexicon creation arises from the necessity for a deep understanding of
language and its nuances in various NLP tasks. With the explosion of social media and
other web-based platforms, there is an enormous amount of unstructured textual data
that requires effective processing and analysis [2]. Consequently, researchers and
practitioners have recognized the importance of developing lexicons specialized for
different domains, languages, and applications.
Traditionally, lexicon creation has relied on manual efforts by linguists and
lexicographers. However, the advent of large-scale data and advancements in machine
learning have revolutionized this task [3]. Data-driven approaches now enable the
automatic creation of lexicons by analyzing vast amounts of textual data to extract
insights about word usage, semantic relationships, and other linguistic properties [4].
Despite these advancements, several challenges remain, such as ensuring data quality
and avoiding linguistic or cultural biases in the resulting lexicons.
This paper explores the challenges and best practices in lexicon creation, presenting a
case study on developing a Moroccan Arabic lexicon using a hybrid approach that
combines manual annotation and machine learning techniques. Section 2 explores the
methods involved in creating effective lexicons, and Section 3 identifies the related
challenges. Section 4 presents best practices for quality control, bias reduction, ethical
considerations, and interoperability, and discusses the use of machine learning
algorithms and crowdsourcing for automated lexicon creation. Section 5 details the
case study method for creating a Moroccan Arabic lexicon and the effectiveness of the
hybrid approach in improving lexicon accuracy and coverage. The
paper concludes with a comprehensive understanding of lexicon creation and its
implications.
2.4 Crowdsourcing
Crowdsourcing is another technique used for lexicon creation, especially for under-
resourced languages [20]. This method involves outsourcing the annotation task to a
large number of individuals, usually through online platforms [11]. Crowdsourcing can
be a cost-effective and scalable way to annotate large amounts of data quickly,
especially for tasks that require subjective judgments, such as sentiment analysis or
emotion classification. Crowdsourcing can also provide a diverse range of perspectives
and reduce the risk of bias that may arise from relying on a single annotator. However,
the quality of the annotations may vary widely, depending on the expertise and
motivation of the crowd workers, and the quality control mechanisms may require
significant resources to implement effectively [21]. In addition,
crowdsourcing may not be suitable for all types of lexicon creation tasks, especially
those that require specialized knowledge or domain expertise. In general, the
crowdsourcing approach can be a useful tool for creating lexicons at scale, but it
requires careful planning, monitoring, and evaluation to ensure the quality and
reliability of the results.
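To make the quality-control point concrete, the sketch below shows one common aggregation strategy, majority voting over redundant labels. The words, labels, and agreement threshold are illustrative assumptions, not data from any study cited here; real pipelines would typically add worker screening and agreement measures such as those surveyed in [13].

```python
# A minimal sketch of majority-vote aggregation for crowdsourced labels.
# All annotations below are hypothetical.
from collections import Counter

# Each word was labeled by several (hypothetical) crowd workers.
raw_annotations = {
    "great":    ["positive", "positive", "positive", "neutral"],
    "terrible": ["negative", "negative", "neutral"],
    "table":    ["neutral", "neutral", "positive"],
}

def aggregate(labels, min_agreement=0.5):
    """Return the majority label, or None when agreement is too low."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count / len(labels) > min_agreement else None

lexicon = {word: aggregate(labels) for word, labels in raw_annotations.items()}
print(lexicon)
# {'great': 'positive', 'terrible': 'negative', 'table': 'neutral'}
```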
Following this approach, the work of [22] proposes a crowdsourcing method for lexicon
acquisition. The authors introduce the concept of "pure emotions" and develop a lexicon
of such emotions that can be used for sentiment analysis. The lexicon is built by
crowdsourcing emotional associations with various words, and the resulting emotions
are classified as positive, negative, or neutral. The work also evaluates the effectiveness
of the pure emotion lexicon in sentiment analysis and finds that it outperforms
traditional lexicons.
Overall, lexicon creation is an essential task in NLP applications, and there are several
approaches to it, each with its advantages and limitations. Manual annotation can result
in high-quality lexicons but is time-consuming and subjective. Machine learning-based
methods can handle large volumes of data quickly but may not capture the nuances of
language as well as manual annotation. Hybrid approaches can leverage the strengths
of both but may require significant resources to implement. Corpus-based generation
and crowdsourcing have their advantages and limitations as described above. The
choice of approach will depend on the specific requirements of the application and the
available resources.
Lexicon creation is a critical task in natural language processing (NLP) that involves
developing comprehensive and accurate dictionaries of words and their meanings.
Lexicons are essential for many NLP applications, such as named entity recognition,
machine translation, and sentiment analysis [22]. However, lexicon creation is a
complex and challenging task that requires careful attention to several factors. In this
section, we will explore some of the key challenges and best practices in lexicon
creation and discuss possible solutions.
3.1 Challenges
To create the training dataset, we utilized the Moroccan UGT corpus (2.1 million
words), written in Arabic script and gathered in a prior study [32]. We augmented
the original data by incorporating texts sourced from Moroccan websites and blogs,
resulting in a total of 3.6 million words. To ensure uniformity, we employed an
automated normalizer to remove numbers, special characters, and non-Arabic letters.
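As an illustration of this preprocessing step, the sketch below keeps only Arabic-script characters and whitespace; the exact rules applied to the Moroccan UGT corpus are not reproduced here, so the character ranges and the sample sentence are assumptions.

```python
# A minimal sketch of the normalization described above: strip digits,
# special characters, and non-Arabic letters, then collapse whitespace.
import re

# Arabic-Indic digits (U+0660-U+0669) sit inside the Arabic block, so they
# are removed first; everything outside U+0600-U+06FF is removed next.
DIGITS = re.compile(r"[0-9\u0660-\u0669]+")
NON_ARABIC = re.compile(r"[^\u0600-\u06FF\s]+")

def normalize(text: str) -> str:
    text = DIGITS.sub(" ", text)
    text = NON_ARABIC.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize("شريت طوموبيلة hier 2023 !!"))  # -> "شريت طوموبيلة"
```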
In this section, we describe the specific machine learning models employed in the
creation of the Moroccan Arabic lexicon, their parameters, the rationale for choosing
them, and the performance metrics used to evaluate their effectiveness.
Training
• Embedding size = 300: This represents the dimension of the embedding space.
• Batch size = 201: This indicates the number of training examples (tuples) the
neural network processes in each training step.
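A minimal sketch of this training step with the fastText Python bindings is shown below; the corpus path is hypothetical, and parameters not listed above are left at their defaults (the bindings expose no explicit batch-size argument, so that setting is omitted here).

```python
# A minimal sketch, assuming a fastText skip-gram model; "corpus.txt"
# (one normalized Moroccan Arabic sentence per line) is a hypothetical path.
import fasttext

model = fasttext.train_unsupervised(
    "corpus.txt",
    model="skipgram",
    dim=300,  # embedding size, as listed above
)
model.save_model("model.bin")  # binary model with full n-gram information

# Export a plain-text "model.vec" holding one aggregated vector per word.
with open("model.vec", "w", encoding="utf-8") as f:
    f.write(f"{len(model.words)} {model.get_dimension()}\n")
    for word in model.words:
        vector = " ".join(f"{x:.4f}" for x in model.get_word_vector(word))
        f.write(f"{word} {vector}\n")
```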
After the training process, two files were obtained: “model.vec,” containing only the
aggregated Moroccan word vectors, and “model.bin,” which includes the vectors for
all the Moroccan (UGT) word n-grams. We then evaluated the model by computing
precision and recall. It should be noted that model optimization can be performed
using the binary file.
Evaluation
After training the FastText model on the Moroccan Arabic dataset, we conducted a
comprehensive evaluation to quantify its performance using precision, recall, and F-
measure. In Table 2, we provide the results of evaluating the output model. The
Precision score was 0.91, indicating that 91% of the identified variants were relevant.
Recall was measured at 0.87, demonstrating that the model successfully captured 87%
of all relevant variants present in the data. The F-measure (or F1 score) for our model
was 0.89, reflecting a high level of accuracy and effectiveness in identifying
orthographic variants. The Error Rate was calculated at 0.0865. These metrics highlight
the model's robustness and reliability in enhancing the Moroccan Arabic lexicon
through the integration of machine learning techniques.
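For reference, the reported F-measure follows directly from the precision and recall above:

```latex
F_1 = \frac{2PR}{P + R} = \frac{2 \times 0.91 \times 0.87}{0.91 + 0.87} \approx 0.89
```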
The orthographic variations extracted exhibited a degree of noise, with nearly 35% of
these variants not relevant to their corresponding Moroccan normalized words. This
discrepancy arose because the FastText nearest neighbor feature not only identifies
orthographically similar words (utilizing n-gram information) but also includes
semantically close words (using full word context). For instance, in Figure 1, the words
لوطو (loto) /car/, طاكسي (taksi) /taxi/, and تاكسي (taksi) /taxi/, though semantically
related, are not considered orthographic variants of طوموبيلة (tomobila) /car/.
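A minimal sketch of such a nearest-neighbor query with the fastText Python bindings is given below; the output ranking is illustrative rather than an actual result from our model.

```python
# A minimal sketch, assuming the binary model trained in the previous step.
import fasttext

model = fasttext.load_model("model.bin")

# get_nearest_neighbors returns (cosine similarity, word) pairs. Because
# fastText mixes subword (n-gram) and contextual information, the list holds
# both orthographic variants and mere synonyms such as لوطو /car/, which is
# exactly the noise discussed above.
for score, word in model.get_nearest_neighbors("طوموبيلة", k=10):
    print(f"{score:.3f}\t{word}")
```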
To filter this noise, a similarity score S_ngram is calculated based on word bi-grams,
and only candidates with S_ngram < 0.5 are discarded. After implementing these
rules, the set containing at least one orthographic
variant candidate was reduced to 30.13%. It is worth noting that the process of lexicon
refinement proved to be time-consuming compared to all the preceding stages
combined.
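The exact definition of S_ngram is not reproduced here; the sketch below assumes one common choice, the Dice coefficient over character bi-grams, purely for illustration.

```python
# A minimal sketch, assuming S_ngram is a Dice coefficient over character
# bi-grams; the actual formula used in the refinement stage may differ.
def bigrams(word: str) -> set:
    """Set of character bi-grams of a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def s_ngram(a: str, b: str) -> float:
    """Dice similarity over character bi-grams; 1.0 means identical sets."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba or not bb:
        return 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))

# An orthographic variant pair scores high and is kept...
print(s_ngram("طاكسي", "تاكسي"))    # 0.75 >= 0.5 -> kept
# ...while a mere synonym of طوموبيلة scores low and is discarded.
print(s_ngram("لوطو", "طوموبيلة"))  # 0.2 < 0.5 -> discarded
```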
4.6 Discussion
The development of the Moroccan Arabic lexicon using a hybrid approach combining
machine learning and automatic refinement techniques has proven to be effective in
improving the accuracy and coverage of lexicons. Our study indicates that the hybrid
approach successfully identifies a significant portion of orthographic variants in
Moroccan Arabic, with 53.14% of normalized words possessing at least one variant.
This demonstrates the potential of combining manual efforts with machine learning to
enhance lexicon creation. The refinement process, which reduced the set of candidates
to 30.13%, highlights the importance of meticulous post-processing to ensure quality.
However, several challenges were encountered, including ensuring the quality and
uniformity of the data. The preprocessing steps, such as normalization and cleaning,
were crucial in preparing the dataset for effective training, yet the presence of noise and
irrelevant variants in the initial extraction highlighted the limitations of automated
techniques and the need for robust preprocessing methods. Additionally, while FastText
proved effective in identifying orthographic variants, the inclusion of semantically
similar words as variants indicated a limitation of the model, necessitating additional
refinement steps that were time-consuming but essential for improving precision.
Balancing manual annotation with automated efforts posed a significant challenge.
While machine learning can efficiently process large datasets, manual annotation
remains crucial for ensuring accuracy and addressing nuances that automated methods
may miss. Our hybrid approach aimed to leverage the strengths of both, but finding the
optimal balance required careful consideration and iterative refinement.
5 Conclusion
The field of natural language processing has advanced significantly in recent years due
to the availability of big data and advances in machine learning and AI techniques. One
crucial aspect of NLP is the creation of accurate and comprehensive lexicons. However,
traditional approaches to lexicon creation are often time-consuming, expensive, and
limited in coverage. In this paper, we explored the challenges and best practices in
lexicon creation and presented a case study of developing a Moroccan Arabic lexicon
using a hybrid approach. We discussed the challenges of subjectivity and ambiguity in
natural language, the quality of the data, the scalability and efficiency of the annotation
process, and the ethical and social implications of lexicon creation. We also proposed
best practices for developing comprehensive annotation guidelines, using domain-
specific sources of data, and leveraging machine learning and other innovative
techniques. The case study that we highlighted in this paper demonstrates the
effectiveness of the hybrid approach in improving the accuracy and coverage of the
created lexicon and highlights the importance of balancing manual annotation and
machine learning techniques.
References
[1] X. Chen, H. Xie and X. Tao, "Vision, status, and research topics of Natural
Language Processing," Natural Language Processing Journal, vol. 1, p. 100001,
2022.
[2] D. Camacho, M. V. Luzón and E. Cambria, "New trends and applications in
social media analytics," Future Generation Computer Systems, pp. 318-321,
2021.
[3] S. Verma, R. Sharma, S. Deb and D. Maitra, "Artificial intelligence in
marketing: Systematic review and future research direction," International
Journal of Information Management Data Insights, vol. 1, no. 1, 2021.
[4] F. Jun, C. Gong, G. Cheng, L. Xiaodong and R. Y. K. Lau, "Automatic
Approach of Sentiment Lexicon Generation for Mobile Shopping Reviews,"
Wireless Communications and Mobile Computing, 2018.
[5] B. Sagot, "Automatic Acquisition of a Slovak Lexicon from a Raw Corpus,"
in Text, Speech and Dialogue, Berlin, Heidelberg, Springer Berlin Heidelberg,
2005, pp. 156-163.
[6] P. Hanks and J. Pustejovsky, "A Pattern Dictionary for Natural Language
Processing," Revue française de linguistique appliquée, vol. X, no. 2, pp. 63-
82, 2005.
[7] R. Tachicart and K. Bouzoubaa, "Moroccan Arabic vocabulary generation
using a rule-based approach," Journal of King Saud University - Computer and
Information Sciences, 2021b.
[8] M. M. Angioni et al., "A Big Data framework based on Apache Spark for
Industry-specific Lexicon Generation for Stock Market Prediction," in ICFNDS
'21, Dubai, United Arab Emirates, 2022.
[9] H. Kohli, H. Feng, N. Dronen, C. McCarter, S. Moeini and A. Kebarighotbi,
"How Lexical is Bilingual Lexicon Induction?," in Findings of the Association
for Computational Linguistics: NAACL 2024, Mexico City, Mexico,
Association for Computational Linguistics, 2024, pp. 4381-4386.
[10] S. M. Mohammad and P. D. Turney, "Crowdsourcing a word-emotion
association lexicon," Computational Intelligence, vol. 29, no. 3, pp. 436-465, 2013.
[11] R. Martins, J. Almeida, P. Novais and P. Henriques, "Creating a social
media-based personal emotional lexicon," in Proceedings of the 24th Brazilian
Symposium on Multimedia and the Web, Salvador, BA, Brazil, 2018.
[12] M. Burghardt, D. Granvogl and C. Wolff, "Creating a Lexicon of Bavarian
Dialect by Means of Facebook Language Data and Crowdsourcing," in
Proceedings of the Tenth International Conference on Language Resources
and Evaluation (LREC'16), Portorož: European Language Resources
Association (ELRA), 2016, pp. 2029-2033.
[13] R. Artstein and M. Poesio, "Inter-Coder Agreement for Computational
Linguistics," Computational Linguistics, vol. 34, no. 4, pp. 555-596, 2008.
[14] S. R. El-Beltagy, "NileULex: A Phrase and Word Level Sentiment Lexicon
for Egyptian and Modern Standard Arabic," in Proceedings of the Tenth
International Conference on Language Resources and Evaluation (LREC'16),
Portorož, Slovenia, 2016.
[15] M. Youssef and S. R. El-Beltagy, "An Arabic Sentiment Lexicon Built
Through Automatic Lexicon Expansion," Procedia Computer Science, vol.
142, pp. 94-103, 2018.
[16] M. Z. Asghar, S. Ahmad, M. Qasim, S. R. Zahra and F. M. Kundi,
"SentiHealth: creating health-related sentiment lexicon using hybrid
approach," SpringerPlus, vol. 5, no. 1, 2016.
[17] B. Heerschop, A. Hogenboom and F. Frasincar, "Sentiment Lexicon
Creation from Lexical Resources," in Business Information Systems, Berlin,
Heidelberg, Springer Berlin Heidelberg, 2011, pp. 185-196.
[18] S. H. Muhammad, P. Brazdil and A. Jorge, "Incremental Approach for
Automatic Generation of Domain-Specific Sentiment Lexicon," in Advances
in Information Retrieval, Cham, Springer International Publishing, 2020, pp.
619-623.
[19] A. Atiah Alsolamy, M. Ahmed Siddiqui and I. H. Khan, "A Corpus Based
Approach to Build Arabic Sentiment Lexicon," International Journal of
Information Engineering and Electronic Business, vol. 6, pp. 16-23, 2019.
[20] V. Benko, "Crowdsourcing for the Slovak Morphological Lexicon," in
Slovensko-český Natural Language Processing Workshop (SloNLP 2018),
Košice, Slovakia, 2018.
[21] J. Čibej, D. Fišer and I. Kosem, "The role of crowdsourcing in
lexicography," in Electronic lexicography in the 21st century: linking lexical
data in the digital age. Proceedings of the eLex 2015 conference, I. Kosem,
M. Jakubíček, J. Kallas and S. Krek, Eds., Herstmonceux Castle, United
Kingdom, 2015, pp. 70-83.
[22] G. Haralabopoulos and E. Simperl, "Crowdsourcing for Beyond Polarity
Sentiment Analysis: A Pure Emotion Lexicon," arXiv, 2017.
[23] J. Wiebe, "Subjectivity Word Sense Disambiguation," in Proceedings of the
3rd Workshop in Computational Approaches to Subjectivity and Sentiment
Analysis, Jeju, Republic of Korea, 2012.
[24] Y. Apurwa, P. Aarshil and S. Manan, "A comprehensive review on
resolving ambiguities in natural language processing," AI Open, vol. 2, pp. 85-
92, 2021.
[25] C. Hutto and E. Gilbert, "VADER: A Parsimonious Rule-Based Model for
Sentiment Analysis of Social Media Text," Proceedings of the International
AAAI Conference on Web and Social Media, vol. 8, no. 1, pp. 216-225, 2014.
[26] H. Aristar-Dry, S. Drude, M. Windhouwer, J. Gippert and I. Nevskaya,
"“Rendering Endangered Lexicons Interoperable through Standards
Harmonization”: the RELISH project," in Proceedings of the Eighth
International Conference on Language Resources and Evaluation
(LREC'12), Istanbul, Turkey, 2012.
[27] E. Zavitsanos, G. Tsatsaronis, I. Varlamis and G. Paliouras, "Scalable
Semantic Annotation of Text Using Lexical and Web Resources," in Artificial
Intelligence: Theories, Models and Applications, 2010.
[28] A. Al-Laith, M. Shahbaz, H. F. Alaskar and A. Rehmat, "AraSenCorpus: A
Semi-Supervised Approach for Sentiment Annotation of a Large Arabic Text
Corpus," Applied Sciences, vol. 11, no. 5, 2021.
[29] X. Liao and Z. Zhao, "Unsupervised Approaches for Textual Semantic
Annotation, A Survey," ACM Computing Surveys, vol. 52, no. 4, pp. 1-45,
2019.
[30] S. M. Mohammad, "Best Practices in the Creation and Use of Emotion
Lexicons," arXiv, 2023.
[31] P. Bojanowski, E. Grave, A. Joulin and T. Mikolov, "Enriching Word
Vectors with Subword Information," Transactions of the Association for
Computational Linguistics, vol. 5, pp. 135-146, 2017.
[32] R. Tachicart and K. Bouzoubaa, "An Empirical Analysis of Moroccan
Dialectal User-Generated Text," in 11th International Conference
Computational Collective Intelligence, Hendaye, France, 2019b.
[33] R. Tachicart and K. Bouzoubaa, "Towards Automatic Normalization of the
Moroccan Dialectal Arabic User Generated Text," in Arabic Language
Processing: From Theory to Practice, Springer International Publishing,
2019c, pp. 264-275.
[34] V. I. Levenshtein, "Binary Codes Capable of Correcting Deletions,
Insertions and Reversals," Soviet Physics Doklady, vol. 10, p. 707, 1966.
[35] U. Pfeifer, T. Poersch and N. Fuhr, "Retrieval effectiveness of proper
name search methods," Information Processing & Management, vol. 32,
no. 6, pp. 667-679, 1996.
Declarations
Ethical Approval
Not applicable.
Competing interests
The authors declare no competing interests regarding the research, authorship, and
publication of this article.
Authors' contributions
Ridouane Tachicart and Karim Bouzoubaa conceived of the presented idea, developed
the theory and performed the computations. Karim Bouzoubaa and Driss Namly
verified the analytical methods and supervised the findings of this work. All authors
discussed the results and contributed to the final manuscript.
Funding Declaration
No funding.