Available online at www.sciencedirect.com

ScienceDirect
Procedia Computer Science 35 (2014) 437 – 446

18th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems - KES2014

Topic segmentation for textual document written in Arabic language

Anja Habacha Chaibi∗, Marwa Naili, Samia Sammoud
RIADI-ENSI, University of Manouba, Manouba 2010, Tunisia

Abstract
Topic segmentation is important for many natural language processing applications such as information retrieval and text
summarization. In our work, we are interested in the topic segmentation of textual documents. We present a survey of related
works, particularly C99 and TextTiling. Then, we propose an adaptation of these topic segmenters for textual documents written
in the Arabic language, named ArabC99 and ArabTextTiling. For the experiments, we construct an Arabic corpus based on newspapers
from different Arab countries. Finally, we evaluate the performance of these new segmenters by comparing them with each other
and with related works using the WindowDiff and F-measure metrics.
© 2014 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license
(http://creativecommons.org/licenses/by-nc-nd/3.0/).
Peer-review under responsibility of KES International.
Keywords: Topic segmentation; Arabic language processing; ArabC99; ArabTextTiling.

1. Introduction

The aim of topic segmentation is to divide a document into segments such that each segment is thematically coherent
and consecutive segments are about different topics. This technique is used to improve access to information.
For example, in information retrieval, short relevant text segments that directly correspond to the user’s query can
be returned instead of long documents. In text summarization, a better summary can be obtained from topically
segmented documents. In recent years, several approaches to topic segmentation have been proposed; they can
be classified as endogenous or exogenous. Endogenous approaches exploit information contained
in the text to be segmented, such as lexical repetition. Exogenous approaches, on the other hand, use external resources
such as thesauri, dictionaries and co-occurrence networks.
While extensive research has targeted topic segmentation for the English language, few studies have addressed other
languages, Arabic in particular. In recent years, many topic segmenters for English have been presented.
For example, TextTiling, developed by Hearst [1], uses a sliding window and computes similarities between adjacent
blocks based on their frequency vectors. Choi [2] presented the C99 topic segmenter. This algorithm is based on lexical
cohesion and uses the cosine metric to compute similarity between sentences. Later, Choi [3] improved his algorithm by
using Latent Semantic Analysis (LSA) to extract semantic knowledge from corpora. Ferret [4] proposed his own topic
segmenter, F06, which is based on the TextTiling algorithm. He later improved it by using the thematic similarities
between words and named the result F06T. He also used a co-occurrence network in a third algorithm, F06C. Finally, he
combined F06T and F06C into the F06CT algorithm.

∗ Corresponding author. Tel.: +216-95-238-571.
E-mail address: [email protected]

1877-0509 © 2014 Published by Elsevier B.V.
doi:10.1016/j.procs.2014.08.124
Unlike for English, there is a lack of research for the Arabic language. Topic segmentation for Arabic has been
addressed by Brants et al. [5] in 2002, El-Shayeb et al. [6] in 2007, Touir et al. [7] in 2008 and Harrag et al. [8]
in 2010.
Brants et al. [5] presented a new topic segmenter, TopSeg, which is based on Probabilistic Latent Semantic Analysis
(PLSA). The aim of this segmenter is to identify boundaries between concatenated texts. The limitation of this work is
that TopSeg skips the stemming step and uses full-form terms and character n-grams, which increases the execution time.
For the evaluation, Brants et al. [5] used the error probability, which has been criticized for being biased: for
example, it penalizes false negatives (missed boundaries) more than false positives (erroneous additional boundaries).
El-Shayeb et al. [6] presented a comparative analysis of three text segmentation algorithms (SeLeCT, LCseg and
TextTiling) on Arabic news stories. They also described a combined system of SeLeCT and LCseg (ModSeLeCT). Each
algorithm was implemented, adapted to the Arabic language and evaluated on an Arabic Reuters news story dataset.
The objective of that work is to identify boundaries between 1000 concatenated news stories. Four metrics were used
for the evaluation: Recall, Precision, Rseg and WindowDiff.
Touir et al. [7] presented an automatic technique to help segment Arabic texts while preserving their semantics. This
technique is based on an empirical study of sentence and clause connectors. To evaluate the segmentation process, only
ten Arabic essays were segmented, and the results were compared to manual segmentations performed by linguistic
experts. The comparison used two factors: correct hit and incorrect hit. A correct hit is a position marked by the
process as a segment boundary with which the judge agrees. An incorrect hit is a position marked by the process as a
segment boundary with which the judge disagrees.
Harrag et al. [8] adapted the two topic segmenters C99 and TextTiling to the Arabic language using a light stemmer. As
a result, two segmenters were presented: TopSegArab and ArabTiling. To evaluate the performance of these two systems,
only five texts were segmented, and the results were compared with the judgments of a group of seven readers. The
comparison used three metrics: Recall, Precision and F-measure, which are more specific to information retrieval than
to topic segmentation.

In this paper, we focus on topic segmentation of an Arabic text corpus. The paper is
organized as follows: Section 2 presents the adaptation of C99 [2] and TextTiling [1] to the Arabic language; Section 3
describes our Arabic test corpus; Section 4 deals with the evaluation of the proposed segmenters; and finally, Section 5
is dedicated to the conclusion and our future work.

2. Adaptation to Arabic language

Currently, the best-known algorithms for topic segmentation are C99 [2] and TextTiling [1]. These two segmenters were
designed for the English language, so we decided to adapt them to the Arabic language. In this section, we present
two text segmentation systems, ArabC99 and ArabTextTiling, which are based respectively on the
C99 [2] algorithm and the TextTiling [1] algorithm.

2.1. ArabC99

ArabC99 is a topic segmenter dedicated to the Arabic language. It is based on lexical cohesion and goes through
two main steps: pre-processing and segmentation. The pre-processing step includes the following operations:
word extraction, stop word elimination and stemming. First, ArabC99 extracts all the words from the input text.
Then, the words that carry no useful content are eliminated using a list of stop words specific to the Arabic language.
Next, a stemming program is applied to obtain the root of each remaining word. For that, we use Shereen Khoja’s
stemmer [9], which is designed for the Arabic language. Khoja’s stemmer [9] removes the longest suffix and the longest
prefix. Then it matches the remaining word against verbal and noun patterns to extract the root. The stemmer makes
use of several linguistic data files, such as lists of all diacritic characters, punctuation characters, definite
articles and 168 stop words. To illustrate the pre-processing step, Fig. 1 shows the three operations applied to one
Arabic sentence, along with a translation into English.

Fig. 1. The pre-processing step.
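
The three pre-processing operations described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word set is a tiny hypothetical sample (not the 168-word list mentioned above), and `toy_stem` is a placeholder that merely drops a leading definite article; it does not reproduce the root extraction of Khoja's stemmer. The function names are ours.

```python
import re

# Hypothetical, tiny stop-word sample for illustration only.
STOP_WORDS = {"في", "من", "على", "إلى", "عن"}

# Arabic diacritics (U+064B to U+0652) and the tatweel character (U+0640).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")
# Runs of Arabic letters (U+0621 to U+064A).
ARABIC_WORD = re.compile(r"[\u0621-\u064A]+")


def extract_words(text):
    """Word extraction: strip diacritics, then keep runs of Arabic letters."""
    return ARABIC_WORD.findall(DIACRITICS.sub("", text))


def toy_stem(word):
    """Placeholder stemmer (assumption): drops a leading definite article.

    This is NOT Khoja's root extraction; a real stemmer would be plugged in here.
    """
    if word.startswith("ال") and len(word) > 4:
        return word[2:]
    return word


def preprocess(sentence):
    """Word extraction, stop-word elimination, then stemming."""
    words = extract_words(sentence)
    useful = [w for w in words if w not in STOP_WORDS]
    return [toy_stem(w) for w in useful]
```

Calling `preprocess` on each sentence of a document yields the lists of stems that the segmentation step works on.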

The second step is the segmentation, which is the same as in the C99 algorithm [2]. As shown in Fig. 2, this step is
divided into four phases. The first phase is the construction of the frequency dictionary: each sentence is represented
by a vector that contains the frequency of each word. The second phase is the construction of the similarity matrix:
the similarity between each pair of sentences is computed using the cosine measure, as shown in equation 1. In the
third phase, the rank matrix is calculated from the similarity matrix: each value in the similarity matrix is replaced
by its rank in the local region. As shown in equation 2, the rank is the proportion of neighboring elements with a
lower similarity value. Finally, the fourth phase identifies the topic boundaries using Reynar’s maximisation
algorithm [10].

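$$\mathrm{Sim}(x, y) = \frac{\sum_{j} f_{x,j} \, f_{y,j}}{\sqrt{\sum_{j} f_{x,j}^{2} \, \sum_{j} f_{y,j}^{2}}} \tag{1}$$

where $f_{x,j}$ denotes the frequency of word $j$ in sentence $x$ and $f_{y,j}$ denotes the frequency of word $j$ in sentence $y$.

$$\mathrm{Rank}(x, y) = \frac{\text{number of elements with a lower similarity value than } \mathrm{Sim}(x, y)}{\text{number of elements examined}} \tag{2}$$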
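
As a minimal sketch of the first three phases, the following Python code builds the frequency vectors, the similarity matrix of equation 1 and the rank matrix of equation 2. The function names and the `radius` parameter (half-width of the local region) are assumptions made for illustration; the boundary search of the fourth phase is not shown.

```python
import math
from collections import Counter


def cosine(fx, fy):
    """Equation 1 applied to two word-frequency dictionaries."""
    dot = sum(fx[w] * fy[w] for w in fx if w in fy)
    norm = math.sqrt(sum(v * v for v in fx.values()) * sum(v * v for v in fy.values()))
    return dot / norm if norm else 0.0


def similarity_matrix(sentences):
    """sentences: one list of stems per sentence."""
    vectors = [Counter(s) for s in sentences]  # phase 1: frequency vectors
    n = len(vectors)
    return [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]


def rank_matrix(sim, radius=1):
    """Equation 2 over a (2 * radius + 1)-wide square region clipped at the edges."""
    n = len(sim)
    rank = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            lower = examined = 0
            for i in range(max(0, x - radius), min(n, x + radius + 1)):
                for j in range(max(0, y - radius), min(n, y + radius + 1)):
                    if (i, j) == (x, y):
                        continue  # the element itself is not a neighbor
                    examined += 1
                    lower += sim[i][j] < sim[x][y]
            rank[x][y] = lower / examined if examined else 0.0
    return rank
```

Ranking inside a local region replaces absolute similarity values, which are unreliable for short sentences, with their relative order among neighbors.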

2.2. ArabTextTiling

ArabTextTiling is a topic segmentation algorithm for the Arabic language. Like ArabC99, this algorithm contains two
main steps: pre-processing and segmentation. The pre-processing step consists of the following sequence of
operations: word extraction, stop word elimination and stemming. For the stemming phase, we again use Khoja’s
stemmer [9]. The pre-processing step of ArabTextTiling is thus the same as the pre-processing step of ArabC99
(Fig. 1). The second step is the same as in TextTiling [1] and includes three phases (Fig. 3).
The first phase is the construction of blocks using a sliding window. Then, the similarity between adjacent blocks is
calculated by the cosine measure as in equation 1. This formula yields a score between 0 and 1, inclusive. These
scores are plotted as shown in Fig. 3. Finally, boundaries are determined by changes in the sequence of similarity
scores. In Hearst's TextTiling, these changes are detected as the deepest valleys in the similarity curve, that is,
the highest peaks of the resulting depth scores.

Fig. 2. The second step of ArabC99.

Fig. 3. The second step of ArabTextTiling.
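
The three phases can be sketched as follows. In Hearst's original TextTiling, the changes in the similarity sequence are scored with a depth score, which is high where the similarity curve has a deep valley between two peaks; the sketch follows that convention. The `block_size` parameter, the function names and the default cutoff (the mean depth score) are simplifying assumptions made here, not settings reported in this paper.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine measure between two word-frequency dictionaries."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def gap_similarities(units, block_size=2):
    """Similarity between the blocks on the left and right of every gap."""
    scores = []
    for gap in range(1, len(units)):
        left = Counter(w for u in units[max(0, gap - block_size):gap] for w in u)
        right = Counter(w for u in units[gap:gap + block_size] for w in u)
        scores.append(cosine(left, right))
    return scores


def depth_scores(scores):
    """Depth of each gap: climb to the nearest peak on both sides."""
    depths = []
    for i, s in enumerate(scores):
        left = s
        for v in reversed(scores[:i]):
            if v < left:
                break
            left = v
        right = s
        for v in scores[i + 1:]:
            if v < right:
                break
            right = v
        depths.append((left - s) + (right - s))
    return depths


def boundaries(units, block_size=2, cutoff=None):
    """Return the indices of the units that start a new segment."""
    depths = depth_scores(gap_similarities(units, block_size))
    if not depths:
        return []
    if cutoff is None:
        cutoff = sum(depths) / len(depths)  # assumed default: mean depth
    return [i + 1 for i, d in enumerate(depths) if d > cutoff]
```

Here `units` is a list of pre-processed sentences (lists of stems), and a returned index `i` means that a boundary is placed just before unit `i`.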

2.3. Example of topic segmentation

To illustrate the concept of topic segmentation, we present the following example, in which we
ran ArabTextTiling on an Arabic document. This document is a concatenation of two
articles from the Algerian newspaper Al-Chourouk [11]. The first article is about a protest by secondary school
teachers condemning the phenomenon of school violence in Algeria. The second article is about the organization of the
elective General Assembly of the Algerian Football Federation. As shown in Fig. 4, our segmenter automatically and
correctly identified the two parts of the document.

Fig. 4. The result of ArabTextTiling.

3. Arabic test corpus

We used Arabic newspapers: the Arabic version of The Diplomatic World [12] and the Tunisian [13], Egyptian [14] and
Algerian [11] versions of a newspaper named Al-Chourouk. We collected 120 articles dealing with various
topics such as politics, sports, culture, history, technology and arts. These articles are divided into four
collections of documents. The first collection contains 35 articles from the Tunisian newspaper. The second collection
includes 35 articles from the Egyptian newspaper. The third collection consists of 30 articles from the Algerian
newspaper. Finally, the fourth collection is made up of 20 articles from The Diplomatic World.
The originality of our research lies in the fact that we used a substantial amount of data (120 raw texts),
unlike the works of Touir et al. [7] (10 texts) and Harrag et al. [8] (5 texts). Also, the aim of our work is not only
the identification of boundaries between concatenated texts, as in most of the research presented until now.
We constructed two sets. For the first set, we combined four documents in series in order to identify
boundaries between texts. For the second set, as shown in Fig. 5, we combined four documents section by
section, taking each document in turn.

Fig. 5. The construction of the second set.
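
The two constructions can be sketched as below. The sketch assumes that each document is given as a list of sections and each section as a list of sentences, and it reads "by section and in turn" as a round-robin interleaving of the sections of the documents; both the data layout and this reading are assumptions, as are the function names.

```python
from itertools import zip_longest


def in_series(documents):
    """First set: documents follow one another; boundaries fall between texts."""
    sentences, boundaries = [], []
    for doc in documents:
        for section in doc:
            sentences.extend(section)
        boundaries.append(len(sentences))
    return sentences, boundaries[:-1]  # the end of the text is not a boundary


def by_section_in_turn(documents):
    """Second set: take one section from each document in turn (assumed reading)."""
    sentences, boundaries = [], []
    for round_of_sections in zip_longest(*documents):
        for section in round_of_sections:
            if section is None:  # this document has no sections left
                continue
            sentences.extend(section)
            boundaries.append(len(sentences))
    return sentences, boundaries[:-1]
```

Both functions return the combined sentence list together with the reference boundary positions, which is the form an evaluation metric needs.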

4. Evaluation

Unlike the inadequate metrics used in the works of Brants et al. [5] and Touir et al. [7], we used
two metrics for the evaluation: F-measure and WindowDiff. WindowDiff is a variant of the Pk
measure that uses a sliding window. The window makes it possible to take into account the number of boundaries between
two sentences separated by a distance k (k is the length of the sliding window). The F-measure, on the other hand,
is seen as a compromise between the two major evaluation standards, Recall and Precision.
Using F-measure and WindowDiff, we evaluated our new topic segmenters on the two test sets
described in the previous section.
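
For concreteness, both metrics can be sketched as follows, assuming that a segmentation is encoded as a list with one 0/1 entry per gap between consecutive sentences (1 marks a boundary). This encoding and the function names are assumptions for illustration, and the F-measure shown counts only exact boundary matches. WindowDiff is an error rate, so lower is better, whereas a higher F-measure is better.

```python
def window_diff(reference, hypothesis, k):
    """Share of windows of k gaps in which the two boundary counts differ."""
    if len(reference) != len(hypothesis):
        raise ValueError("both segmentations must cover the same gaps")
    windows = len(reference) - k + 1
    if windows <= 0:
        raise ValueError("k is larger than the number of gaps")
    errors = sum(
        sum(reference[i:i + k]) != sum(hypothesis[i:i + k])
        for i in range(windows)
    )
    return errors / windows


def f_measure(reference, hypothesis):
    """Harmonic mean of Precision and Recall over exact boundary positions."""
    true_positives = sum(r and h for r, h in zip(reference, hypothesis))
    if true_positives == 0:
        return 0.0
    precision = true_positives / sum(hypothesis)
    recall = true_positives / sum(reference)
    return 2 * precision * recall / (precision + recall)
```

A boundary placed one gap away from the reference counts as a complete miss for the F-measure, while WindowDiff penalizes it only in the windows where the boundary counts disagree, which is why the two metrics are reported together.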

4.1. Evaluation results using the first set

In this subsection, we present the results of each algorithm (ArabC99 and ArabTextTiling) for the first set, using
curves and summary tables to analyze the results.

As stated in Section 3, there are four collections of documents, which are marked in the following curves.
The first collection, from the Tunisian newspaper [13], lies between the origin of the curve and point
A. The second collection, from the Egyptian newspaper [14], lies between points A and B.
The third collection, from the Algerian newspaper [11], lies between B and C. Finally, the fourth
collection, from the Arabic version of The Diplomatic World [12], lies between C and D.

Fig. 6 shows the F-measure values for the 120 documents. ArabC99 performs better than
ArabTextTiling, especially for the second and fourth collections. For the first collection, the performance
of the two segmenters is close.
Table 1 shows the average F-measure results of ArabC99 and ArabTextTiling. ArabC99
outperforms ArabTextTiling in all four collections of documents. Also, for the fourth collection, the F-measure values
of both segmenters decrease relative to the other collections.

Table 1. Average F-measure results for the first set.

Segmenter        First collection   Second collection   Third collection   Fourth collection
ArabTextTiling   0.366              0.173               0.422              0.094
ArabC99          0.523              0.559               0.803              0.395

Fig. 6. F-measure results for the first set.

Fig. 7 shows the WindowDiff values for the four collections of documents. Overall, ArabC99 performs
better than ArabTextTiling: the error rate of ArabC99 is lower than the error
rate of ArabTextTiling, especially for the last two collections.

Fig. 7. WindowDiff results for the first set.

Table 2 contains the average WindowDiff results of ArabC99 and ArabTextTiling for the first set. The results
show that ArabC99 has a lower error rate than ArabTextTiling, especially for the last two collections.

Table 2. Average WindowDiff results for the first set.

Segmenter        First collection   Second collection   Third collection   Fourth collection
ArabTextTiling   0.561              0.596               0.521              0.658
ArabC99          0.434              0.409               0.279              0.311

4.2. Evaluation results using the second set

In this subsection, we present the results of ArabC99 and ArabTextTiling for the second set using the two metrics
F-measure and WindowDiff.

Fig. 8 shows the F-measure results for the 120 documents for our two topic segmenters. For the
third and fourth collections, ArabC99 always performs better than ArabTextTiling.
For the first and second collections, however, the performance of the two segmenters is close, and for some documents
ArabTextTiling outperforms ArabC99.

Fig. 8. F-measure results for the second set.

Table 3 contains the average F-measure values. As shown in this table, ArabC99
performs better than ArabTextTiling on average. We also notice that the F-measure value of ArabTextTiling
decreases for the last collection.

Table 3. Average F-measure results for the second set.

Segmenter        First collection   Second collection   Third collection   Fourth collection
ArabTextTiling   0.452              0.205               0.353              0.131
ArabC99          0.542              0.486               0.767              0.452

Fig. 9 shows the WindowDiff values for the four collections of documents. For the first two collections,
the performance of ArabC99 and ArabTextTiling is close: ArabC99 outperforms ArabTextTiling for
some documents and the opposite is true for others. For the last two collections, however, ArabC99 performs better
than ArabTextTiling, especially for the fourth collection.

Fig. 9. WindowDiff results for the second set.

Table 4 presents the segmentation performance of ArabC99 and ArabTextTiling using the WindowDiff metric. For
the first two collections, ArabTextTiling outperforms ArabC99, with lower error rates. For the last two
collections, ArabC99 performs better than ArabTextTiling.

Table 4. Average WindowDiff results for the second set.

Segmenter        First collection   Second collection   Third collection   Fourth collection
ArabTextTiling   0.451              0.509               0.594              0.578
ArabC99          0.526              0.628               0.438              0.372

4.3. Discussion

In general, ArabC99 outperforms ArabTextTiling on the two sets. However, the performance of each segmenter is
not the same across the four collections of documents. This difference can be explained by the fact that the
collections come from different sources (Tunisian, Egyptian and Algerian). These countries all use the Arabic language,
but each one is characterized by its own style. Also, for the fourth collection, which treats only one topic,
politics, the performance of ArabTextTiling decreases and ArabC99 remains the better of the two.
Furthermore, this collection is also characterized by the length of its documents.

Moreover, if we compare our work with the work of Harrag et al. [8], our results are different. As
shown in Table 5, Harrag et al. [8] found that ArabTiling, the Arabic version of TextTiling, is better than
TopSegArab, the Arabic version of C99. Indeed, compared to TopSegArab, ArabTiling has the better values of
Precision (0.81) and F-measure (0.65). In our work, by contrast, we found that ArabC99 outperforms ArabTextTiling on
the two test sets. The evaluation of Harrag et al. [8] is hard to rely on because it uses only five texts as a data
set, which is too few for a trustworthy evaluation. The difference may also stem from the use of a different stemmer,
which requires further study, and of a different data source.

Table 5. Comparison of the algorithms with an Arabic corpus [8].

Segmenter       Recall   Precision   F-measure
Human judges    0.81     0.84        0.82
ArabTiling      0.55     0.81        0.65
TopSegArab      0.54     0.64        0.58

5. Conclusion

In this paper, we proposed an adaptation of two topic segmenters (C99 and TextTiling) for textual documents written
in the Arabic language. The Arabic segmenters ArabC99 and ArabTextTiling use Khoja’s stemmer. We evaluated these
segmenters with two test sets based on newspapers from different Arab countries. The originality of this work is that
the evaluation is not limited to the detection of boundaries between texts. We also noticed that, for the Arabic
language, the differences between the dialects of countries can matter in the topic segmentation process. In addition,
we remark that the adaptation of a topic segmenter depends on the choice of the Arabic stemmer, and this
effect must be studied in depth.

Several directions for future work follow from this study. First, we must evaluate our new segmenters
using an Arabic benchmark such as the corpus of Latifa [15], which includes 842684 words and 415 texts. We must also
study the integration of Latent Semantic Analysis (LSA) and Probabilistic Latent Semantic Analysis (PLSA) into
topic segmentation. To go further, we should propose a new multilingual hybrid topic segmenter that uses internal
and external resources.

References

1. M. A. Hearst. "TextTiling: Segmenting text into multi-paragraph subtopic passages". Computational Linguistics, 23(1), 1997, pp. 33-64.
2. F. Y. Y. Choi. "Advances in domain independent linear text segmentation". Proceedings of NAACL-00, 2000, pp. 26-33.
3. F. Y. Y. Choi, P. Wiemer-Hastings, and J. Moore. "Latent Semantic Analysis for Text Segmentation". Proceedings of EMNLP, 2001, pp. 109-117.
4. O. Ferret. "Improving text segmentation by combining endogenous and exogenous methods". TAL, Volume 47, n2/2006, pp. 111-135.
5. T. Brants, F. Chen, and A. Farahat. "Arabic Document Topic Analysis". TREC 2002. Gaithersburg: NIST, 2002.
6. M. A. El-Shayeb, S. R. El-Beltagy, and A. Rafea. "Comparative Analysis of Different Text Segmentation Algorithms on Arabic News Stories".
In Proc. IEEE International Conference on Information Reuse and Integration, 2007, pp. 441-446.
7. A. A. Touir, H. Makhtour, and W. Al-Sanea. "Semantic-Based Segmentation of Arabic Texts". Inf. Tech. J., 7(7), 2008, pp. 1009-1015.
8. F. Harrag, A. H. Cherif, and A. S. Al-Salman. "Comparative Study of Topic Segmentation Algorithms Based On Lexical Cohesion: Experimental
Results on Arabic Language". The Arabian Journal for Science and Engineering, Volume 35, Number 2C, 2010.
9. S. Khoja. "Stemming Arabic Text". http://zeus.cs.pacificu.edu/shereen/research.htm. (Accessed April 03, 2013)
10. J. Reynar. "Topic Segmentation: Algorithms and Application". Ph.D. thesis. Computer and Information Science. University of Pennsylvania,
Pennsylvania, USA, 1998.
11. Echourouk-algerie. http://www.okbob.net/article-lire-le-jounal-echourouk-algerie-75761076.html. (Accessed April 17, 2013)
12. Le Monde Diplomatique. http://www.mondiploar.com/index.php3. (Accessed April 26, 2013)
13. Pressetunisie. http://www.pressetunisie.net/alchourouk.php. (Accessed April 14, 2013)
14. Shorouknews. http://shorouknews.com/egypt. (Accessed April 16, 2013)
15. Latifa Al-Sulaiti’s Homepage. http://www.comp.leeds.ac.uk/eric/latifa/research.htm. (Accessed March 17, 2014)

English Around The World
No ratings yet
English Around The World
13 pages
21.03.20 6º Ano Homework (Dever de Casa)
No ratings yet
21.03.20 6º Ano Homework (Dever de Casa)
3 pages
Cambridge Dictionary Key
No ratings yet
Cambridge Dictionary Key
2 pages
English Viva
No ratings yet
English Viva
140 pages
Effective Speaking Techniques
No ratings yet
Effective Speaking Techniques
39 pages
Spark 3 Blueprint Tremujori 2 Dhe Test Key
No ratings yet
Spark 3 Blueprint Tremujori 2 Dhe Test Key
2 pages
The Structural Study of Myth: Claude Levi-Strauss
No ratings yet
The Structural Study of Myth: Claude Levi-Strauss
9 pages
LESSON PLAN - FORM 1 - Amanah - : (Below List Only Applicable For L.A Lesson,)
No ratings yet
LESSON PLAN - FORM 1 - Amanah - : (Below List Only Applicable For L.A Lesson,)
2 pages
Modul EM Final
No ratings yet
Modul EM Final
78 pages
UNIT 11 Review
No ratings yet
UNIT 11 Review
2 pages
How To Write A History Essay at CCA
100% (1)
How To Write A History Essay at CCA
5 pages
ENK BridgeCourse 23 24
No ratings yet
ENK BridgeCourse 23 24
33 pages
Grammar and Beyond 2, Units 25-28, Final Exam Review
No ratings yet
Grammar and Beyond 2, Units 25-28, Final Exam Review
4 pages
Cambridge Grammar of English - Types of Verbs
No ratings yet
Cambridge Grammar of English - Types of Verbs
19 pages
adjectives-and-prepositions-british-english-student
No ratings yet
adjectives-and-prepositions-british-english-student
7 pages

∗ Corresponding author. Tel.: +216-95-238-571.

2. Adaptation to Arabic language

Fig. 1. The pre-processing step.

Number of elements with a lower similarity value compared to sim(x,y)

Fig. 2. The second step of ArabC99.

Fig. 3. The second step of ArabTextTiling.

2.3. Example of topic segmentation

Fig. 4. The result of ArabTextTiling.

3. Arabic test corpus

Fig. 5. The construction of the second set.

4.1. Evaluation results using the first set:

Table 1. Average F-measure results for the first set.

Fig. 6. F-measure results for the first set.

rate of ArabTextTiling especially for the two last collections.

Fig. 7. WindowDiff results for the first set.

Table 2. Average WindowDiff results for the first set.

4.2. Evaluation results using the second set:

Fig. 8. F-measure results for the second set.

Table 3. Average F-measure results for the second set.

Fig. 9. WindowDiff results for the second set.

Table 4. Average WindowDiff results for the second set.

Table 5. Comparison of the algorithms with an Arabic corpus 8 .
