
Effective Techniques in Lexicon Creation: Moroccan Arabic Focus
Ridouane Tachicart1 (corresponding author), Karim Bouzoubaa2, Driss Namly3
1 LARGESS, FSJES El Jadida, Chouaib Doukkali University, El Jadida, Morocco
[email protected]
2 Mohammadia School of Engineers, Mohamed V University in Rabat, Rabat, Morocco
[email protected]
3 Mohammadia School of Engineers, Mohamed V University in Rabat, Rabat, Morocco
[email protected]

Abstract. Natural language processing (NLP) has seen significant advancements
due to the growing availability of data and improvements in machine learning
techniques. A critical task in NLP is lexicon creation, which involves developing
comprehensive and accurate dictionaries of words and their meanings.
Traditional methods, such as manual creation or expert acquisition, are often
time-consuming and limited in scope. This paper explores the challenges and best
practices in lexicon creation and presents a case study on developing a Moroccan
Arabic lexicon using a hybrid approach that combines manual annotation and
machine learning. We address issues of subjectivity, ambiguity, data quality,
scalability, and the ethical implications of lexicon creation. The case study
demonstrates the hybrid approach's effectiveness in enhancing lexicon accuracy
and coverage, emphasizing the balance between manual annotation and machine
learning. This paper provides valuable insights for NLP practitioners and
researchers, showcasing efficient and effective lexicon creation techniques.

Keywords Lexicon creation, Moroccan Arabic, Natural Language Processing,
Machine Learning, Orthographic Variation, Lexical Resources.

1 Introduction

In recent years, the field of natural language processing (NLP) has witnessed
tremendous growth and advancement [1], thanks to the increasing availability of large-
scale datasets and the development of powerful machine-learning algorithms. One
critical aspect of NLP that has received significant attention is lexicon creation.
Lexicons are comprehensive databases of words and their meanings, organized into
structured formats, and play a vital role in many NLP tasks, including machine
translation, sentiment analysis, and speech recognition.
The need for lexicon creation arises from the necessity for a deep understanding of
language and its nuances in various NLP tasks. With the explosion of social media and
other web-based platforms, there is an enormous amount of unstructured textual data
that requires effective processing and analysis [2]. Consequently, researchers and
practitioners have recognized the importance of developing lexicons specialized for
different domains, languages, and applications.
Traditionally, lexicon creation has relied on manual efforts by linguists and
lexicographers. However, the advent of large-scale data and advancements in machine
learning have revolutionized this task [3]. Data-driven approaches now enable the
automatic creation of lexicons by analyzing vast amounts of textual data to extract
insights about word usage, semantic relationships, and other linguistic properties [4].
Despite these advancements, several challenges remain, such as ensuring data quality
and avoiding linguistic or cultural biases in the resulting lexicons.
This paper explores the challenges and best practices in lexicon creation, presenting a
case study on developing a Moroccan Arabic lexicon using a hybrid approach that
combines manual annotation and machine learning techniques. Section 2 reviews the
main techniques for building lexicons. Section 3 identifies the related challenges and
outlines best practices for quality control, bias reduction, ethical considerations, and
interoperability. Section 4 details the case study on creating a Moroccan Arabic
lexicon and the effectiveness of the hybrid approach in improving lexicon accuracy
and coverage. Section 5 concludes the paper with a comprehensive view of lexicon
creation and its implications.

2 Lexicon creation techniques

Lexicon creation is an important component of natural language processing tasks [5],
[6]. It involves the development of a structured database of words and their features
such as meanings, forms, morpho-syntactic tags, and relationships, which serves as a
foundational resource for various NLP applications such as machine translation,
morphological analysis, and information retrieval [7]. With the increasing availability
of big data and advances in artificial intelligence, new emerging techniques for lexicon
creation can overcome the limitations of traditional approaches [8]. Machine learning
algorithms such as word embeddings [9] as well as crowdsourcing [10] are some of the
methods that can be employed for lexicon creation. Additionally, social media data can
serve as a valuable resource for constructing and enriching lexicons [11], [12]. The
choice of technique and data source depends on the specific requirements and
constraints of the lexicon creation task.
2.1 Manual Annotation

Manual annotation is a traditional approach to lexicon creation that involves human
experts annotating a corpus of text to identify and define the relevant terms [1]. This
approach is particularly useful for creating domain-specific lexicons, where experts
have knowledge of the relevant vocabulary [2]. The manual annotation process is
usually carried out by trained annotators who carefully review the text and assign labels
to each term based on their understanding of the context and the meaning of the term
[13]. Manual annotation can result in high-quality lexicons with precise definitions and
can capture subtle nuances and context-specific meanings of terms especially when
developing Gold Standards [3]. However, it can be time-consuming and labor-
intensive, especially for large lexicons [4], and is prone to subjectivity and
inconsistency, as different annotators may have different interpretations [5].
To illustrate this approach, we can cite the work of [14], in which the authors proposed
NileULex, a sentiment lexicon that includes both word-level and phrase-level
annotations for Egyptian and Modern Standard Arabic. Their approach involves
manual annotation of sentiment polarity labels for words and phrases extracted from a
large corpus of text. The resulting NileULex contains more than 17,000 word-level
entries and 6,000 phrase-level entries, covering a range of domains and topics.

2.2 Machine learning

Machine learning-based methods are an increasingly popular approach to lexicon
creation that uses algorithms to automatically extract and classify terms from a corpus
of text [6]. These methods can be supervised, unsupervised, or semi-supervised,
depending on the availability of annotated data. Supervised methods require labeled
data, where each term is annotated with its corresponding label, to train a model to
classify new terms. Unsupervised methods do not require any labeled data, and instead,
rely on clustering or other techniques to group terms with similar characteristics. Semi-
supervised methods combine both approaches by using a small amount of labeled data
to train a model and then using it to label the remaining data automatically. Machine
learning-based methods can handle large volumes of data quickly and efficiently,
generalize to new domains or languages, and be less subjective and more consistent
than manual annotation. However, they may not capture the subtle nuances and context-
specific meanings of terms as well as manual annotation and may require significant
computational resources and expertise to implement and fine-tune the models.
One work following this approach is MoArLex [15], an Arabic sentiment lexicon built
through automatic lexicon expansion. The authors expanded the NileULex lexicon
using word embeddings. In total, MoArLex contains over 26,000 Arabic words that
have been annotated with their sentiment polarity.
2.3 Corpus-based generation

Corpus-based generation is a semi-automatic technique used to create a lexicon by
analyzing a large corpus of text [17]. The process involves identifying and extracting
relevant terms from the corpus and then filtering and refining the list to create a final
lexicon. Corpus-based generation is widely used in natural language processing (NLP)
applications such as text information retrieval, classification, and sentiment analysis
[18]. Creating a lexicon through corpus-based generation involves several steps.
It starts with corpus collection to collect a large corpus of text that is relevant to the
domain or topic of interest. Once the corpus is collected, it needs to be pre-processed
to remove noise and irrelevant information. The next step is to identify the relevant
terms in the corpus. This can be done using techniques such as frequency analysis, part-
of-speech tagging, and named entity recognition. After identifying the terms, they need
to be filtered and refined to create a final lexicon. This can be done using various
techniques such as removing duplicate terms, filtering out irrelevant terms, and
manually reviewing the list to ensure accuracy and completeness. The final step is to
validate the lexicon to ensure that it is accurate and relevant to the domain or task. This
can be done by evaluating the lexicon using various metrics such as precision, recall,
and F1 score.
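To make the term-identification step concrete, the following minimal Python sketch shows a frequency-analysis pass only; the function name, thresholds, and input format are illustrative assumptions, and part-of-speech tagging or named entity recognition would be layered on top in a real pipeline.

```python
import re
from collections import Counter

def extract_candidate_terms(documents, min_freq=5, stopwords=frozenset()):
    """Frequency-based candidate extraction from a pre-processed corpus.

    documents: iterable of cleaned text strings (placeholder input).
    Returns (term, frequency) pairs frequent enough to deserve manual review.
    """
    counts = Counter()
    for text in documents:
        # Very crude tokenization; a real pipeline would add POS tagging / NER.
        tokens = re.findall(r"\w+", text.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return [(t, f) for t, f in counts.most_common() if f >= min_freq]

# The resulting list would then be filtered, deduplicated, and manually reviewed
# before being validated against a gold standard (precision, recall, F1).
```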
One of the main advantages of corpus-based generation is that it can be used to create
a lexicon that is specific to a particular domain or task. It also allows for the
identification of new or emerging terms that may not be present in existing lexicons.
However, it can be time-consuming and requires a large corpus of text to be effective.
In this direction, the work of [19] proposes a corpus-based approach to create a
sentiment lexicon for the Arabic language. The authors use a large corpus of Arabic
text to identify words and phrases with positive or negative sentiments. They also
consider the effect of negation and intensifiers on sentiment, which is common in the
Arabic language. The lexicon is evaluated using a dataset of Arabic social media
content and is found to perform better than other Arabic sentiment lexicons created
using manual annotation or automatic methods alone.

2.4 Crowdsourcing

Crowdsourcing is another technique used for lexicon creation, especially for under-
resourced languages [20]. This method involves outsourcing the annotation task to a
large number of individuals, usually through online platforms [11]. Crowdsourcing can
be a cost-effective and scalable way to annotate large amounts of data quickly,
especially for tasks that require subjective judgments, such as sentiment analysis or
emotion classification. Crowdsourcing can also provide a diverse range of perspectives
and reduce the risk of bias that may arise from relying on a single annotator. However,
the quality of the annotations may vary widely, depending on the expertise and
motivation of the crowd workers, and the quality control mechanisms may require
significant resources to implement effectively [21]. In addition,
crowdsourcing may not be suitable for all types of lexicon creation tasks, especially
those that require specialized knowledge or domain expertise. In general, the
crowdsourcing approach can be a useful tool for creating lexicons at scale, but it
requires careful planning, monitoring, and evaluation to ensure the quality and
reliability of the results.
Following this approach, the work of [22] proposes a crowdsourcing method for lexicon
acquisition. The authors introduce the concept of "pure emotions" and develop a lexicon
of such emotions that can be used for sentiment analysis. The lexicon is built by
crowdsourcing emotional associations with various words, and the resulting emotions
are classified as positive, negative, or neutral. The work also evaluates the effectiveness
of the pure emotion lexicon in sentiment analysis and finds that it outperforms
traditional lexicons.
Overall, lexicon creation is an essential task in NLP applications, and there are several
approaches to it, each with its advantages and limitations. Manual annotation can result
in high-quality lexicons but is time-consuming and subjective. Machine learning-based
methods can handle large volumes of data quickly but may not capture the nuances of
language as well as manual annotation. Hybrid approaches can leverage the strengths
of both but may require significant resources to implement. Corpus-based generation
and crowdsourcing have their advantages and limitations as described above. The
choice of approach will depend on the specific requirements of the application and the
available resources.

2.5 Hybrid Approach

The hybrid approach to lexicon creation integrates manual annotation, automated
processes, and machine learning techniques, harnessing the strengths of each to create
a more robust and comprehensive lexicon [8]. In this approach, human experts annotate
a subset of the data to train a machine-learning model, which can then be used to
annotate the rest of the data automatically [9]. This approach can combine the precision
of manual annotation with the scalability of machine learning, reducing the subjectivity
and inconsistency of manual annotation [10]. The hybrid approach can be customized
to specific domains or languages and can result in high-quality lexicons. However, it
may require significant resources and expertise to implement and fine-tune the hybrid
models, and it may still be affected by the quality and representativeness of the training
data. Overall, the hybrid approach can be a powerful tool for creating lexicons that are
both accurate and scalable.
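To illustrate the general workflow (this is a generic sketch, not the method of [16] nor the pipeline used in our case study), a small expert-labeled seed can be used to train a classifier that then labels the remaining entries automatically; the data, features, and model choice below are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical data: a small expert-annotated seed and a larger unannotated pool.
seed_terms = ["زوين", "خايب", "مزيان", "قبيح"]       # manually labeled terms
seed_labels = ["positive", "negative", "positive", "negative"]
unlabeled_terms = ["فرحان", "عيان"]                   # terms still to be labeled

# Character n-gram features are a simple stand-in for richer representations.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 3))
X_seed = vectorizer.fit_transform(seed_terms)

model = LogisticRegression(max_iter=1000).fit(X_seed, seed_labels)

# The trained model extends the expert annotations to the rest of the data;
# low-confidence predictions would normally be routed back to human annotators.
predictions = model.predict(vectorizer.transform(unlabeled_terms))
print(list(zip(unlabeled_terms, predictions)))
```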
One example of a work that uses the hybrid approach is the work of [16]. The authors
propose a hybrid approach for creating a sentiment lexicon specific to health-related
content. The authors combine manual annotation by health experts and automatic
methods based on machine learning to create a lexicon containing positive and negative
health-related terms. The lexicon is evaluated using a dataset of health-related tweets
and is found to outperform other lexicons created using either manual annotation or
automatic methods alone. The study demonstrates the effectiveness of combining
human expertise with machine learning-based methods for sentiment analysis in the
context of health-related content.

3 Challenges and best practices

Lexicon creation is a critical task in natural language processing (NLP) that involves
developing comprehensive and accurate dictionaries of words and their meanings.
Lexicons are essential for many NLP applications, such as named entity recognition,
machine translation, and sentiment analysis [22]. However, lexicon creation is a
complex and challenging task that requires careful attention to several factors. In this
section, we will explore some of the key challenges and best practices in lexicon
creation and discuss possible solutions.

3.1 Challenges

● Subjectivity and ambiguity


One of the main challenges in lexicon creation is the subjectivity [23] and ambiguity
[24] of natural language. Many words and expressions have multiple meanings and
connotations depending on the context and the speaker's intention. This makes it
difficult to develop clear and consistent definitions for each word in the lexicon.
Moreover, there may be differences in interpretation and usage across different regions,
dialects, and cultures. To address this challenge, lexicon creators need to develop
comprehensive annotation guidelines that capture the nuances and complexities of the
language and account for cultural and linguistic diversity.

● Quality of the data


Another challenge in lexicon creation is the quality of the data. The accuracy and
completeness of a lexicon depend on the quality and representativeness of the
underlying data [25]. However, digital text data is often noisy and heterogeneous,
containing misspellings, grammatical errors, slang, and other forms of non-standard
language. In addition, digital text data may not be representative of the target
population, especially if it is collected from social media or other online sources. To
address this challenge, lexicon creators need to use domain-specific sources of data to
improve the accuracy and coverage of the lexicon and develop preprocessing and
cleaning techniques that can filter out irrelevant or low-quality data.
● Interoperability
Interoperability is a major issue in lexicon creation, as different NLP systems may
require different formats, sizes, and content of lexicons. The lack of interoperability
can make it difficult to reuse and integrate existing resources, resulting in significant
duplication of effort and reduced efficiency [26]. To address the interoperability issues,
it is important to consider standardization, modularization, and metadata as key
solutions. Standardization of lexicon format and content can ensure consistency and
compatibility with other NLP resources, while modularization can enable the creation
of lexicons with different levels of granularity to meet the varying requirements of
different applications. Additionally, including metadata in a lexicon can provide
important contextual information about the lexicon and its contents, facilitating its
wider use and integration into different NLP systems.

● Scalability and efficiency


Another challenge in lexicon creation is the scalability and efficiency of the annotation
process [27]. Traditional manual annotation methods can be time-consuming and labor-
intensive, and may not be feasible for large-scale lexicon creation tasks. Machine
learning-based methods can help to automate the annotation process, but they require
large amounts of labeled data to train accurate models. To overcome this challenge,
researchers are exploring semi-supervised [28] and unsupervised machine learning
techniques [29] that can learn from partially labeled or unlabeled data, and active
learning strategies that can select the most informative samples for annotation.

● Privacy and ethical concerns


Finally, lexicon creators need to consider the ethical and social implications of their
work. Lexicons can be used for a variety of purposes, some of which may have
unintended consequences, such as reinforcing stereotypes, perpetuating biases, or
infringing on privacy rights. To address this challenge, lexicon creators need to develop
ethical guidelines [30] and evaluate the potential impact of their lexicons on different
stakeholders and communities.
In conclusion, lexicon creation is a complex and challenging task that requires careful
attention to several factors. By developing comprehensive annotation guidelines, using
domain-specific sources of data, and leveraging machine learning and other innovative
techniques, we can create lexicons that are both accurate and scalable. Moreover, by
considering the ethical and social implications of our work, we can ensure that our
lexicons are used responsibly and for the benefit of society.

4 Case study: Moroccan Arabic lexicon creation

Creating lexicons for a low-resource language such as Moroccan Arabic can be a
particularly challenging task due to the limited availability of annotated data and
linguistic resources. In this section, we present a case study that focuses on the creation
of a lexicon for Moroccan Arabic, a dialect of Arabic spoken in Morocco. The case
study outlines the methodology used to collect and annotate the data, the challenges
encountered during the creation process, and the evaluation of the resulting lexicon.

4.1 Overview of Moroccan Arabic and its linguistic features

Moroccan Arabic, also known as Darija, is a dialect of Arabic spoken in Morocco by
approximately 35 million people. It is the most widely spoken language in the country
and has a unique set of linguistic features that distinguish it from other Arabic dialects.
One of the most notable features of Moroccan Arabic is its use of Tamazight, Spanish,
and French loanwords. Tamazight, a language spoken by indigenous populations in
North Africa, has had a significant influence on Moroccan Arabic vocabulary.
Similarly, due to the country's historical ties to France and Spain, many French and
Spanish words have been adopted into the Moroccan Arabic lexicon. Table 1 presents
a selection of Moroccan Arabic vocabulary, showcasing its diverse origins.

Table 1: Sample of Moroccan Arabic lexicon

MA word | Word origin | Language origin | English translation
فالصو [fAlSw] | falso | Spanish | Scammed
بريكولاج [brykwlAj] | bricolage | French | DIY
مش [m$] | موش | Tamazight | Cat
كتب [ktb] | كتب | Arabic | To write
كلا [klA] | أكل | Arabic | To eat
كونجولا [kwnjwlA] | congeler | French | To freeze
زكا [zgA] | يزكا | Tamazight | To calm
بوماضة [bwmADp] | pomada | Spanish | Ointment
لحم [lHm] | لحم | Arabic | Meat
4.2 Data Collection and Preprocessing

To create the training dataset, we utilized the Moroccan UGT corpus (2.1 million
words), composed in Arabic script and gathered in a prior study [32]. We augmented
the original data by incorporating texts sourced from Moroccan websites and blogs,
resulting in a total of 3.6 million words. To ensure uniformity, we employed an
automated normalizer to remove numbers, special characters, and non-Arabic letters.
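The normalizer itself is not detailed here; as a rough sketch under that caveat, a regular-expression pass that strips numbers, special characters, and non-Arabic letters could look like the following (the Unicode range and the example are illustrative, and extended Arabic letters used in Moroccan texts may need to be added).

```python
import re

# Keep Arabic letters (U+0621..U+064A) and whitespace; digits, Latin letters,
# punctuation, and other symbols are dropped, mirroring the normalization step above.
_NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]+")
_SPACES = re.compile(r"\s+")

def normalize(text: str) -> str:
    """Remove numbers, special characters, and non-Arabic letters."""
    return _SPACES.sub(" ", _NON_ARABIC.sub(" ", text)).strip()

print(normalize("شريت tomobile جديدة ب 50000 درهم !!"))  # -> "شريت جديدة ب درهم"
```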

4.3 Machine Learning Integration

In this section, we describe the specific machine learning models employed in the
creation of the Moroccan Arabic lexicon, their parameters, the rationale for choosing
them, and the performance metrics used to evaluate their effectiveness.

For the task of creating a lexicon of Moroccan orthographic variations (OV), we
selected FastText [31], a library for efficient text classification and representation
learning. FastText was chosen for its ability to handle large datasets and its
effectiveness in capturing word similarities through n-grams and neural network
embeddings. FastText's n-gram model is particularly suitable for handling the
morphological richness of Moroccan Arabic, where words can have multiple
orthographic variants.

Training

We employed an unsupervised model as a crucial element in constructing the OV
lexicon. Initially, we adhered to the FastText guidelines, specifically utilizing
FastText's autotune feature to automatically optimize and determine the most suitable
hyperparameters for our dataset. The following specifications were configured during
the training process:

• Window size = 2: This denotes the size of the word context.

• Number of epochs = 5: This parameter governs the total number of iterations
the algorithm performs during training across the entire dataset.

• Embedding size = 300: This represents the dimension of the embedding space.

• Batch size = 201: This indicates the number of tuples on which the neural
network operates in each training step.
After the training process, two files were obtained: “model.vec,” containing only the
aggregated Moroccan word vectors, and “model.bin,” which includes the vectors for
all the Moroccan (UGT) word n-grams. We then evaluated the model by computing
the precision and recall. It should be noted that model optimization can be performed
using the binary file.
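For illustration, this training step can be reproduced with the fasttext Python package roughly as follows; the input file name, the skip-gram choice, and anything not listed among the hyperparameters above are assumptions on our part.

```python
import fasttext

# "moroccan_ugt.txt" is a placeholder for the preprocessed 3.6M-word corpus.
model = fasttext.train_unsupervised(
    "moroccan_ugt.txt",
    model="skipgram",  # skip-gram vs. cbow is not stated above; assumed here
    ws=2,              # window size = 2
    epoch=5,           # number of epochs = 5
    dim=300,           # embedding size = 300
)

# save_model() writes the binary model (the "model.bin" used later for inference);
# a plain-text "model.vec" can be dumped from model.get_words() and
# model.get_word_vector() if word vectors alone are needed.
model.save_model("model.bin")
```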

Evaluation

To evaluate the effectiveness of the machine learning approach, we employed the
following performance metrics:

• Precision: The proportion of relevant orthographic variants among the
retrieved variants.

• Recall: The proportion of relevant orthographic variants that were correctly
identified.

• Error Rate: The proportion of orthographic variants that were incorrectly
identified as relevant.
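As a sketch of how these metrics can be computed for a single normalized word, given the model's candidate variants and a manually validated gold set, one might write the following; the error-rate formula is one plausible reading of the definition above, not a quoted one.

```python
def evaluate_entry(predicted: set, gold: set) -> dict:
    """Set-based metrics for one normalized word's variant candidates.

    predicted: variants proposed by the model; gold: manually validated variants.
    The error-rate definition (irrelevant share of retrieved candidates) is an
    interpretation of the description above, not a formula taken from the study.
    """
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    error_rate = 1.0 - precision
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "error_rate": error_rate}
```

Averaging such per-entry scores across the reference vocabulary would yield corpus-level figures comparable to those reported in Table 2.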

After training the FastText model on the Moroccan Arabic dataset, we conducted a
comprehensive evaluation to quantify its performance using precision, recall, and F-
measure. In Table 2, we provide the results of evaluating the output model. The
Precision score was 0.91, indicating that 91% of the identified variants were relevant.
Recall was measured at 0.87, demonstrating that the model successfully captured 87%
of all relevant variants present in the data. The F-measure (or F1 score) for our model
was 0.89, reflecting a high level of accuracy and effectiveness in identifying
orthographic variants. The Error Rate was calculated at 0.0865. These metrics highlight
the model's robustness and reliability in enhancing the Moroccan Arabic lexicon
through the integration of machine learning techniques.

Table 2: Evaluation Metrics for the Moroccan Arabic Lexicon Model

Metric | Precision | Recall | F-measure | Error Rate
Value | 0.91 | 0.87 | 0.89 | 0.0865
4.4 Lexicon Inference

Utilizing a Moroccan Reference Vocabulary (MRV) comprising 4.5 million normalized
Moroccan words [33], we examined the binary file to identify the nearest neighbor
vectors for each word within the vocabulary. Assuming that the extracted vectors
correspond to orthographic variations of the words, our findings revealed that 53.14%
of the normalized Moroccan words possess at least one orthographic variant. However,
the described process, conducted without any refinement tasks, was unsuccessful in
extracting orthographic variants for the remaining 46.86% of the reference vocabulary.
Figure 1 presents a sample of the identified orthographic variants for the Moroccan
normalized word ‫ طوموبيلة‬/car/ (tomobila), along with the corresponding orthographic
similarity rate.

Figure 1: Model’s Prediction results of the word: ‫طوموبيلة‬
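A minimal sketch of this inference step with the fasttext Python API is shown below; the neighbor count and similarity cutoff are illustrative values, not parameters taken from the study.

```python
import fasttext

model = fasttext.load_model("model.bin")

def variant_candidates(normalized_word: str, k: int = 10, min_sim: float = 0.6):
    """Nearest-neighbor words treated as orthographic-variant candidates.

    min_sim is an illustrative similarity cutoff, not a value from the study.
    get_nearest_neighbors() returns (similarity, word) pairs.
    """
    return [(w, s) for s, w in model.get_nearest_neighbors(normalized_word, k=k)
            if s >= min_sim]

# For a word such as طوموبيلة, the neighbors mix true spelling variants with
# semantically related words (e.g. طاكسي), which the refinement step later removes.
```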


4.5 Lexicon Refinement

The orthographic variations extracted exhibited a degree of noise, with nearly 35% of
these variants not relevant to their corresponding Moroccan normalized words. This
discrepancy arose because the FastText nearest neighbor feature not only identifies
orthographically similar words (utilizing n-gram information) but also includes
semantically close words (using full word context). For instance, in Figure 1, the words
‫( لوطو‬loto) /car/, ‫( طاكسي‬taksi) /taxi/, and ‫( تاكسي‬taksi) /taxi/, though synonyms, are not
considered orthographic variants of ‫طوموبيلة‬.

To refine the OV lexicon, we employed a character-level rule-based technique that
assesses orthographic similarities between words. In addition to considering the
Levenshtein distance [34], we eliminated candidate orthographic variants that do not
share a significant number of sub-n-grams with the Moroccan normalized word. The n-
gram similarity score, denoted as Sngram [35], is computed as follows: Sngram = α/β

where:

• α represents the number of unique sub-n-grams shared between the Moroccan
Reference Vocabulary word (WMRV) and the candidate orthographic variant (WOV).

• β is the total sum of unique sub-n-grams in both WMRV and WOV.

Sngram is calculated based on word bi-grams, and candidates with Sngram < 0.5 are
discarded. After implementing these rules, the proportion of normalized words with at
least one orthographic variant candidate was reduced to 30.13%. It is worth noting that
the lexicon refinement process proved to be more time-consuming than all the
preceding stages combined.
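The bigram filter can be sketched as follows; reading β as the number of unique bi-grams occurring in either word is our interpretation of the definition above, and the complementary Levenshtein-distance check is omitted for brevity.

```python
def bigrams(word: str) -> set:
    """Unique character bi-grams of a word (the sub-n-grams used by Sngram)."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def sngram(w_mrv: str, w_ov: str) -> float:
    """Sngram = alpha / beta.

    alpha: bi-grams shared by the reference word (WMRV) and the candidate (WOV).
    beta:  read here as the number of unique bi-grams occurring in either word;
           the wording above could also be read as |bigrams(WMRV)| + |bigrams(WOV)|.
    """
    a, b = bigrams(w_mrv), bigrams(w_ov)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def keep_candidate(w_mrv: str, w_ov: str, threshold: float = 0.5) -> bool:
    """Filtering rule: candidates with Sngram below the threshold are discarded."""
    return sngram(w_mrv, w_ov) >= threshold

print(keep_candidate("طوموبيلة", "طاكسي"))  # False: a synonym, not a spelling variant
```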
4.6 Discussion

The development of the Moroccan Arabic lexicon using a hybrid approach combining
machine learning and automatic refinement techniques has proven to be effective in
improving the accuracy and coverage of lexicons. Our study indicates that the hybrid
approach successfully identifies a significant portion of orthographic variants in
Moroccan Arabic, with 53.14% of normalized words possessing at least one variant.
This demonstrates the potential of combining manual efforts with machine learning to
enhance lexicon creation. The refinement process, which reduced the set of candidates
to 30.13%, highlights the importance of meticulous post-processing to ensure quality.

However, several challenges were encountered, including ensuring the quality and
uniformity of the data. The preprocessing steps, such as normalization and cleaning,
were crucial in preparing the dataset for effective training, yet the presence of noise and
irrelevant variants in the initial extraction highlighted the limitations of automated
techniques and the need for robust preprocessing methods. Additionally, while FastText
proved effective in identifying orthographic variants, the inclusion of semantically
similar words as variants indicated a limitation of the model, necessitating additional
refinement steps that were time-consuming but essential for improving precision.
Balancing manual annotation with automated efforts posed a significant challenge.
While machine learning can efficiently process large datasets, manual annotation
remains crucial for ensuring accuracy and addressing nuances that automated methods
may miss. Our hybrid approach aimed to leverage the strengths of both, but finding the
optimal balance required careful consideration and iterative refinement.

The development of an accurate and comprehensive Moroccan Arabic lexicon has
important implications for various NLP applications, such as machine translation,
sentiment analysis, and speech recognition systems, particularly for languages and
dialects with limited existing resources. The process of lexicon creation must also
consider ethical and social implications to ensure that the lexicon does not reinforce
linguistic or cultural biases, emphasizing the need for ongoing evaluation and
refinement to maintain fairness and inclusivity in NLP applications. The scalability of
the hybrid approach suggests that similar methods could be applied to other languages
and dialects, enabling researchers to create high-quality lexicons for under-resourced
languages and contributing to the broader field of NLP.

5 Conclusion

The field of natural language processing has advanced significantly in recent years due
to the availability of big data and advances in machine learning and AI techniques. One
crucial aspect of NLP is the creation of accurate and comprehensive lexicons. However,
traditional approaches to lexicon creation are often time-consuming, expensive, and
limited in coverage. In this paper, we explored the challenges and best practices in
lexicon creation and presented a case study of developing a Moroccan Arabic lexicon
using a hybrid approach. We discussed the challenges of subjectivity and ambiguity in
natural language, the quality of the data, the scalability and efficiency of the annotation
process, and the ethical and social implications of lexicon creation. We also proposed
best practices for developing comprehensive annotation guidelines, using domain-
specific sources of data, and leveraging machine learning and other innovative
techniques. The case study that we highlighted in this paper demonstrates the
effectiveness of the hybrid approach in improving the accuracy and coverage of the
created lexicon and highlights the importance of balancing manual annotation and
machine learning techniques.

References

[1] C. Xieling, X. Haoran and T. Xiaohui, "Vision, status, and research topics
of Natural Language Processing," Natural Language Processing Journal, vol.
1, p. 100001, 2022.
[2] D. Camacho, M. V. Luzón and E. Cambria, "New trends and applications in
social media analytics," Future Generation Computer Systems, pp. 318-321,
2021.
[3] V. Sanjeev, S. Rohit, D. Subhamay and M. Debojit, "Artificial intelligence
in marketing: Systematic review and future research direction," International
Journal of Information Management Data Insights, vol. 1, no. 1, 2021.
[4] F. Jun, C. Gong, G. Cheng, L. Xiaodong and L. Raymond Y. K., "Automatic
Approach of Sentiment Lexicon Generation for Mobile Shopping Reviews,"
Wireless Communications and Mobile Computing, pp. 1530-8669, 2018.
[5] B. Sagot, "Automatic Acquisition of a Slovak Lexicon from a Raw Corpus,"
in Text, Speech and Dialogue, Berlin, Heidelberg, Springer Berlin Heidelberg,
2005, pp. 156-163.
[6] P. Hanks and J. Pustejovsky, "A Pattern Dictionary for Natural Language
Processing," Revue française de linguistique appliquée, vol. X, no. 2, pp. 63-
82, 2005.
[7] R. Tachicart and K. Bouzoubaa, "Moroccan Arabic vocabulary generation
using a rule-based approach," Journal of King Saud University - Computer and
Information Sciences, 2021b.
[8] M. Angioni et al., "A Big Data framework based on Apache Spark for
Industry-specific Lexicon Generation for Stock Market Prediction," in
ICFNDS '21, Dubai, United Arab Emirates, 2022.
[9] H. Kohli, H. Feng, N. Dronen, C. McCarter, S. Moeini and A. Kebarighotbi,
"How Lexical is Bilingual Lexicon Induction?," in Findings of the Association
for Computational Linguistics: NAACL 2024, Mexico City, Mexico,
Association for Computational Linguistics, 2024, pp. 4381-4386.
[10] S. M. Mohammad and P. D. Turney, "Crowdsourcing a word-emotion
association lexicon," Computational Intelligence, vol. 29, no. 3, pp. 436-465, 2013.
[11] R. Martins, J. Almeida, P. Novais and P. Henriques, "Creating a social
media-based personal emotional lexicon," in Proceedings of the 24th Brazilian
Symposium on Multimedia and the Web, Salvador, BA, Brazil, 2018.
[12] M. Burghardt, D. Granvogl and C. Wolff, "Creating a Lexicon of Bavarian
Dialect by Means of Facebook Language Data and Crowdsourcing," in
Proceedings of the Tenth International Conference on Language Resources
and Evaluation (LREC'16), Portorož: European Language Resources
Association (ELRA), 2016, pp. 2029-2033.
[13] R. Artstein and M. Poesio, "Inter-Coder Agreement for Computational
Linguistics," Computational Linguistics, vol. 34, no. 4, pp. 555-596, 2008.
[14] S. R. El-Beltagy, "NileULex: A Phrase and Word Level Sentiment Lexicon
for Egyptian and Modern Standard Arabic," in Proceedings of the Tenth
International Conference on Language Resources and Evaluation (LREC'16),
Portorož, Slovenia, 2016.
[15] Y. Mohab and R. E.-B. Samhaa, "An Arabic Sentiment Lexicon Built
Through Automatic Lexicon Expansion," Procedia Computer Science, vol.
142, pp. 94-103, 2018.
[16] M. Z. Asghar, S. Ahmad, M. Qasim, S. R. Zahra and F. M. Kundi,
"SentiHealth: creating health-related sentiment lexicon using hybrid
approach," SpringerPlus, vol. 5, no. 1, 2016.
[17] B. Heerschop, A. Hogenboom and F. Frasincar, "Sentiment Lexicon
Creation from Lexical Resources," in Business Information Systems, Berlin,
Heidelberg, Springer Berlin Heidelberg, 2011, pp. 185-196.
[18] S. H. Muhammad, P. Brazdil and A. Jorge, "Incremental Approach for
Automatic Generation of Domain-Specific Sentiment Lexicon," in Advances
in Information Retrieval, Cham, Springer International Publishing, 2020, pp.
619-623.
[19] A. Atiah Alsolamy, M. Ahmed Siddiqui and I. H. Khan, "A Corpus Based
Approach to Build Arabic Sentiment Lexicon," International Journal of
Information Engineering and Electronic Business, vol. 6, pp. 16-23, 2019.
[20] V. Benko, "Crowdsourcing for the Slovak Morphological Lexicon," in
Slovenskočeský Natural Language Processing Workshop (SloNLP 2018),
Košice, Slovakia, 2018.
[21] J. Čibej, D. Fišer and I. Kosem, "The role of crowdsourcing in
lexicography," in Electronic lexicography in the 21st century: linking lexical
data in the digital age. Proceedings of the eLex 2015 conference, I. Kosem et
al., Eds., Herstmonceux Castle, United Kingdom, 2015, pp. 70-83.
[22] G. Haralabopoulos and E. Simperl, "Crowdsourcing for Beyond Polarity
Sentiment Analysis A Pure Emotion Lexicon," ArXiv, 2017.
[23] J. Wiebe, "Subjectivity Word Sense Disambiguation," in Proceedings of the
3rd Workshop in Computational Approaches to Subjectivity and Sentiment
Analysis, Jeju, Republic of Korea, 2012.
[24] Y. Apurwa, P. Aarshil and S. Manan, "A comprehensive review on
resolving ambiguities in natural language processing," AI Open, vol. 2, pp. 85-
92, 2021.
[25] C. Hutto and E. Gilbert, "VADER: A Parsimonious Rule-Based Model for
Sentiment Analysis of Social Media Text," Proceedings of the International
AAAI Conference on Web and Social Media, vol. 8, no. 1, pp. 216-225, 2014.
[26] H. Aristar-Dry, S. Drude, M. Windhouwer, J. Gippert and I. Nevskaya,
"“Rendering Endangered Lexicons Interoperable through Standards
Harmonization”: the RELISH project," in Proceedings of the Eighth
International Conference on Language Resources and Evaluation
(LREC'12), Istanbul, Turkey, 2012.
[27] E. Zavitsanos, G. Tsatsaronis, I. Varlamis and G. Paliouras, "Scalable
Semantic Annotation of Text Using Lexical and Web Resources," in Artificial
Intelligence: Theories, Models and Applications, 2010.
[28] A. Al-Laith, M. Shahbaz, H. F. Alaskar and A. Rehmat, "AraSenCorpus: A
Semi-Supervised Approach for Sentiment Annotation of a Large Arabic Text
Corpus," Applied Sciences, vol. 11, no. 5, 2011.
[29] L. Xiaofeng and Z. Zhiming, "Unsupervised Approaches for Textual
Semantic Annotation, A Survey," ACM Computing Surveys, vol. 52, no. 4, pp.
1-45, 2019.
[30] M. M. Saif, "Best Practices in the Creation and Use of Emotion Lexicons,"
arXiv, 2023.
[31] P. Bojanowski, E. Grave, A. Joulin and T. Mikolov, "Enriching Word
Vectors with Subword Information," Transactions of the Association for
Computational Linguistics, vol. 5, pp. 135-146, 2017.
[32] R. Tachicart and K. Bouzoubaa, "An Empirical Analysis of Moroccan
Dialectal User-Generated Text," in 11th International Conference
Computational Collective Intelligence, Hendaye, France, 2019b.
[33] R. Tachicart and k. Bouzoubaa, "Towards Automatic Normalization of the
Moroccan Dialectal Arabic User Generated Text," in Arabic Language
Processing: From Theory to Practice, Springer International Publishing,
2019c, pp. 264-275.
[34] V. I. Levenshtein, "Binary Codes Capable of Correcting Deletions,
Insertions and Reversals," Soviet Physics Doklady, vol. 10, p. 707, 1966.
[35] P. Ulrich, P. Thomas and F. Norbert, "Retrieval effectiveness of proper
name search methods," Information Processing & Management, vol. 32, no. 6,
pp. 667-679, 1996.

Declarations

Ethical Approval
not applicable

Competing interests
The authors declare no competing interests regarding the research, authorship, and
publication of this article.

Authors' contributions
Ridouane Tachicart and Karim Bouzoubaa conceived of the presented idea, developed
the theory and performed the computations. Karim Bouzoubaa and Driss Namly
verified the analytical methods and supervised the findings of this work. All authors
discussed the results and contributed to the final manuscript.

Funding Declaration
No Funding

Availability of data and materials


The lexicon will be available under a license.
