A Jaccard's Similarity Score Based Methodology For Kannada Text Document Summarization

Abstract—The techniques for better Information Retrieval (IR) have become indispensable as the amount of information available on the World Wide Web is steadily increasing. A method based on the Jaccard similarity score for Kannada-language text document summarization is devised, which produces a coherent summary of a given document. The corpus used in this work is prepared from the Kannada web portals kanaja and kannadawebdunia. The results obtained are compared with keyword-based and sentence-ranking-based methods of summarization. The results obtained are satisfactory.

Keywords—Summary, Jaccard, similarity, score, extractive, human summary

I. INTRODUCTION

The steady increase in the amount of information present online gives rise to the "Information Overload" problem. Better Information Retrieval (IR) techniques could hence be used to address this problem. Text document summarization is an important IR technique which presents a document in a more concise and compact way, and thus helps address information overload effectively. There are millions of web pages available online in Indian languages. Based on a survey, there are around 60 news publications and blogs available in the Indian languages. Information Retrieval in the Indian languages in general, and in the Kannada language in particular, is a relatively new phenomenon. Kannada is a regional language spoken in the state of Karnataka, with millions of speakers in the state. Around 10,000 million web pages in Kannada are available online. Reading and understanding the contents of these pages may be a time-consuming task. Hence, we thought it important to develop methodologies for Kannada text document summarization. The dataset used in this work is collected from two Kannada web sites, kanaja and kannadawebdunia. The rest of the paper is organized as follows: Section II highlights the literature survey; Section III describes the methodology adopted in this work; Section IV presents the Results and Discussion.

II. LITERATURE SURVEY

Better summarization techniques and tools have played a significant role in information retrieval. It is often said that summarization is all about reducing the content of a document. The following is the gist of the literature survey carried out.

Context factors such as input, purpose, and output factors influence summarization. Marina Litvak et al. (2008) influenced this work with 'Graph-Based Keyword Extraction for Single-Document Summarization' [1]. They introduce two approaches: supervised and unsupervised. In the supervised approach, classification algorithms are trained on a summarized collection of documents. In the unsupervised approach, the HITS algorithm is run on the document graphs.

'New Methods in Automatic Extracting' by H. P. Edmundson (1969) [2] states that the choice of sentences for the summary is automatic: the sentences with the greatest potential for conveying the meaning of the document are selected. Components such as title and heading words, pragmatic words, and structural indicators are considered.

'The Automatic Creation of Literature Abstracts' by H. P. Luhn (1958) [3] indicates that statistical information derived from word frequency and distribution is used to compute scores for sentences and their significance, in order to generate an 'auto abstract'.

The work of Wen Xiao and Giuseppe Carenini (2019) [12] involves a novel neural single-document extractive text summarization model. The model incorporates both the global and local context within the current topic.

The paper 'A Summarization System for Scientific Documents' (2019) [13] presents a novel system for summarization of Computer Science publications. A qualitative user study was done to identify the most valuable scenarios for understanding scientific documents; based on the findings, a system for summarization of scientific documents was presented for different information requirements.

The work 'Abstractive Document Summarization with a Graph-Based Attentional Neural Model' (2017) [14] addresses the need to find salient content in the original document. A graph-based attention mechanism in a hierarchical encoder framework is proposed, and a hierarchical beam search algorithm is used to generate multi-sentence summaries.

The paper 'Combining Different Summarization Techniques for Legal Text' (2012) [15] presents a hybrid summarization technique in which a number of different techniques are combined in a rule-based system. This approach has been used to summarize legal documents.

These surveys indicate multiple attempts to create effective summaries that are on par with human summaries. Summaries produced by machines may not be coherent, as coherence requires Natural Language Processing.
Authorized licensed use limited to: Carleton University. Downloaded on June 18,2021 at 00:19:02 UTC from IEEE Xplore. Restrictions apply.
III. METHODOLOGY

The methodology that has been followed can be described in the following nine steps.

Step 1: Prepare the input file

Webpage contents form the input file for our work. The contents of a webpage cannot be processed directly, as they contain tags and other formatting. The webpage contents are copied to a text file, and special symbols are removed.

Step 2: Identify sentences

Once the input file is prepared as required, we split it into a set of sentences, where each sentence is treated as an array, and all the '.' characters are replaced with '\n'. Thereby, the requirement for manual intervention is reduced.

Step 3: Stemming

The input to the stemmer is the refined document containing the array of sentences, and the output is the corresponding root words. Stemming is applied to the feature words of a category. Thus, the summarized document does not contain any redundant words.

Step 4: Eliminate stop words

These are high-frequency words that have to be identified and removed. To prepare the stop-word list, we take the standard stop-word list in English and translate it into Kannada. Pronouns, and words that do not have a meaning in the dictionary, are regarded as stop words. The exhaustive list of stop words prepared is stored as a hash table. Words in the document are checked against the hash table, and matching words are removed from the document, ensuring that the document contains only important words.

Step 5: Form the similarity matrix

The Jaccard coefficient measures the resemblance between sample sets, which in this case are sentences, and is defined as the size of the intersection divided by the size of the union of the sample sets. Each sentence is compared with every other sentence in the document, and in this manner we compute a similarity matrix:

Similarity(s1, s2) = (total number of common words between the sentences) / (log(number of words ∈ s1) + log(number of words ∈ s2))

We store each sentence-pair score for a document in matrix form.

Step 6: Calculate the sentence rank

The row sum is calculated for each row in the similarity matrix:

Row sum(s1) = S(s1, s2) + S(s1, s3) + … + S(s1, sn)

The row sum is regarded as the rank of a sentence, and reflects the significance of a sentence relative to the other sentences in a document.

Hypothesis: Since the sentence ranks are computed by taking all sentences into consideration, a higher sentence rank indicates a sentence that is similar to many others in the document. Thus, keeping the sentences with higher ranks gives more weight to the summary, as it reduces the redundant sentences.

The result obtained is stored in a structure, with the sentence position as the key and the rank as the value.

Step 7: Sort the rank vector

Once the sentences have been ranked, they are sorted in descending order. This helps in choosing the top-ranked sentences.

Step 8: Find the positions of the top 'm' sentences

Once the sorted list of ranks is obtained, the original positions of these ranks have to be determined. For that, the structure from Step 6 is looked up in reverse, using the rank to find the position. Here 'm' is a user-defined variable that stores the limit of the summary, i.e., the number of sentences to be present in the summary.

Step 9: Fetch the chosen sentences and display the abstract

The final step fetches the sentences which are to be a part of the summary from the input file and displays them.

Algorithm:

Takes an input file in text format and produces a sentence-limited summary.

Input: File; the number of sentences required in the output, m
Output: Summary containing the required number of sentences

Logic:
Prepare the input file
Read from the file and build an array of sentences
For each sentence from the file:
    Extract words into an array
    For each word in the array:
        Apply stemming and store back the root word
For each sentence having root words:
    Eliminate stop words
For i = 1 to n:
    For j = 1 to n:
        Similarity(i, j) = common words in sentences (i, j) /
            (log(total words in sentence i) + log(total words in sentence j))
For i = 1 to n:
    Rank(i) = Σ (j = 1 to n) Similarity(i, j)
Sort the rank vector in descending order
Get the positions of the top 'm' sentences
Sort the position vector in ascending order
Extract all the sentences in the position vector from the original input file

IV. RESULTS AND DISCUSSION

Fig 2: Summary for 10 documents (Religion)

Reviewer 2: The summary produced by one person may not be similar to the summary produced by another. A sentence that is considered important by one person may not be considered important by another. Another very influential factor in determining the importance of a sentence is the proficiency of the person in that subject.
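As a concrete illustration, the similarity scoring and row-sum ranking of Steps 5 and 6 can be sketched in Python. This is a minimal sketch, not the system's implementation: the function names are illustrative, sentences are assumed to already be lists of root words (after stemming and stop-word removal), and the guard against a zero denominator for single-word sentences is our addition, since the formula above leaves that case unspecified.

```python
import math

def similarity(s1, s2):
    # Step 5: count of common words over the sum of the logs of the
    # two sentence lengths (s1, s2 are lists of root words).
    denom = math.log(len(s1)) + math.log(len(s2))
    if denom == 0:  # guard: both sentences contain a single word
        return 0.0
    return len(set(s1) & set(s2)) / denom

def rank_sentences(sentences):
    # Step 6: the rank of a sentence is the row sum of its similarity
    # scores against every other sentence in the document.
    n = len(sentences)
    return [sum(similarity(sentences[i], sentences[j])
                for j in range(n) if j != i)
            for i in range(n)]
```

A sentence sharing words with many others thus accumulates a large row sum and ranks high, matching the hypothesis stated in Step 6.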
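Putting the nine steps together, the whole pipeline can be sketched end-to-end as follows. Again, this is a self-contained illustrative sketch rather than the system itself: the similarity scoring is repeated inline, stemming (Step 3) and stop-word elimination (Step 4) are omitted for brevity, and the sentence splitter simply follows Step 2 in splitting on '.'.

```python
import math

def summarize(text, m):
    # Step 2: split the text into sentences, then into word arrays.
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    words = [s.split() for s in sentences]
    n = len(words)

    def sim(a, b):
        # Step 5: common words over summed log sentence lengths.
        denom = math.log(len(a)) + math.log(len(b))
        return len(set(a) & set(b)) / denom if denom else 0.0

    # Step 6: rank = row sum of the similarity matrix.
    ranks = [sum(sim(words[i], words[j]) for j in range(n) if j != i)
             for i in range(n)]
    # Steps 7-8: positions of the top-m ranked sentences...
    top = sorted(range(n), key=lambda i: ranks[i], reverse=True)[:m]
    # Step 9: ...restored to document order and emitted as the summary.
    return '. '.join(sentences[i] for i in sorted(top)) + '.'
```

Note that the chosen positions are re-sorted into ascending order before the sentences are emitted, so the extract preserves the original reading order of the document.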
Comparison of the current approach with the keyword-based summarization methodology:

The two figures below, Figures 5 and 6, show the similarity between human-generated and machine-generated extracts for the keyword-based summarization approach [10, 11]. The sample graphs given in these figures indicate that the similarity between the machine-generated summary and the human summary is rather low.

Fig 5: Graph showing Machine Summary 1 vs. Human Summary (Sports)

Fig 6: Graph of Machine Summary 1 vs. Machine Summary 2 (Sports)

V. CONCLUSION

A higher row sum indicates that the similarity of that particular sentence is higher compared to all other sentences in the document, so the inclusion of such a sentence would represent all the other sentences in the document, which adds weight to the generated summary. The summary generated by this method is on par with the human summary. The results obtained are comparatively better, as demonstrated by the sample graphs obtained for the previous approach by Jayashree R. et al. [10, 11].

REFERENCES

[1] Marina Litvak and Mark Last, 'Graph-Based Keyword Extraction for Single-Document Summarization', Proceedings of the 2nd Workshop on Multi-source, Multilingual Information Extraction and Summarization, Coling 2008, pages 17-24, 2008.
[2] H. P. Edmundson, 'New Methods in Automatic Extracting', Journal of the ACM 16(2), pages 264-285, 1969.
[3] H. P. Luhn, 'The Automatic Creation of Literature Abstracts', presented at the IRE National Convention, 1958.
[4] Mari-Sanna Paukkeri and Timo Honkela, 'Likey: Unsupervised Language-independent Keyphrase Extraction', Proceedings of the 5th International Workshop on Semantic Evaluation, ACL 2010, pages 162-165, 2010.
[5] Letian Wang and Fang Li, 'SJTULTLAB: Chunk Based Method for Keyphrase Extraction', Proceedings of the 5th International Workshop on Semantic Evaluation, ACL 2010, pages 158-161, 2010.
[6] Fumiyo Fukumoto, Akina Sakai, and Yoshimi Suzuki, 'Eliminating Redundancy by Spectral Relaxation for Multi-Document Summarization', Proceedings of the 2010 Workshop on Graph-based Methods for Natural Language Processing, ACL 2010, pages 98-102, 2010.
[7] Ahmet Aker and Trevor Cohn, 'Multi-document Summarization Using A* Search and Discriminative Training', Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 482-491, 2010.
[8] B. K. Boguraev and C. Kennedy, 'Salience-based Content Characterisation of Text Documents', Proceedings of the ACL Workshop on Intelligent, Scalable Text Summarization, 1997.
[9] Eduard Dragut, Fang Fang, Prasad, and Weiyi Meng, 'Stop Word and Related Problems in Web Interface Integration', Proceedings of the VLDB Endowment, Volume 2, Issue 1, pages 349-360, 2009.
[10] Jayashree R. and Srikantamurthy K., 'Text Document Summarization in the Kannada Language Using Keyword Extraction', Proceedings of the 1st International Conference on Artificial Intelligence, Soft Computing and Applications (AIAA), pages 48-54, Tirunelveli, India, 2011.
[11] Jayashree R., Srikantamurthy K., and Basavaraj S. Anami, 'Categorized Text Document Summarization in the Kannada Language by Sentence Ranking', ISDA 2012, pages 776-781.
[12] Wen Xiao and Giuseppe Carenini, 'Extractive Summarization of Long Documents by Combining Global and Local Context', 2019.
[13] S. Erera, M. Shmueli-Scheuer, G. Feigenblat, O. P. Nakash, O. Boni, H. Roitman, D. Cohen, B. Weiner, Y. Mass, O. Rivlin, and G. Lev, 'A Summarization System for Scientific Documents', 2019.
[14] Jiwei Tan, Xiaojun Wan, and Jianguo Xiao, 'Abstractive Document Summarization with a Graph-Based Attentional Neural Model', Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017.
[15] F. Galgani, P. Compton, and A. Hoffmann, 'Combining Different Summarization Techniques for Legal Text', Proceedings of the Workshop on Innovative Hybrid Approaches to the Processing of Textual Data, pages 115-123, 2012.