2020 4th International Conference on Computer, Communication and Signal Processing (ICCCSP)

Text Summarization Using Text Frequency Ranking Sentence Prediction

978-1-7281-6509-7/20/$31.00 ©2020 IEEE | DOI: 10.1109/ICCCSP49186.2020.9315203

Meena S M
Department of Computer Science and Engineering
Thiagarajar College of Engineering
Madurai, India
[email protected]

Ramkumar M P
Department of Computer Science and Engineering
Thiagarajar College of Engineering
Madurai, India
[email protected]

Asmitha R E
Department of Computer Science and Engineering
Thiagarajar College of Engineering
Madurai, India
[email protected]

Emil Selvan G S R
Department of Computer Science and Engineering
Thiagarajar College of Engineering
Madurai, India
[email protected]

Abstract— In the era of information technology, data plays a significant role. The data available on the internet are unstructured and not concise. Text summarization was introduced to turn raw data into structured, readable, coherent and concise text and to extract a summary from it. Text summarization provides a summary of the useful information in the raw data without losing its main theme. Readers today face the challenge of reading comments, reviews, news articles, blogs, etc., which are often informal and noisy, and retrieving the correct gist of such text is a difficult task. To overcome these problems, the TFRSP (Text Frequency Ranking Sentence Prediction) algorithm is proposed to generate a precise summary using supervised and unsupervised learning algorithms. The proposed approach combines TF-IDF-TR (Term Frequency – Inverse Document Frequency – Text Rank) as an unsupervised learning algorithm with the Seq2Seq (Sequence to Sequence) model as a supervised learning algorithm to obtain the benefits of both extractive and abstractive summarization. The results of the proposed TFRSP approach are compared with existing text summarization methods using the Recall-Oriented Understudy for Gisting Evaluation (ROUGE); the approach attains a high ROUGE score and hence a high summary accuracy.

Keywords— Text Summarization, Natural Language Processing, Extractive Summarization, Unsupervised learning, Abstractive Summarization, Supervised learning, ROUGE

I. INTRODUCTION
The digitalized world is driving rapid growth in the field of information retrieval. People rely on a variety of resources to stay updated and, with time as a factor, want information in a short and precise form. A major issue with news articles and online reviews or feedback is that they are hard to draw conclusions from unless they are read completely. This led to the development of text summarization for better information retrieval. Text summarization is the concept of extracting the main body of information from the original text as a brief, orderly and human-interpretable summary [1]. Automatic text summarization uses the ideas of Natural Language Processing (NLP) to obtain the summary systematically, producing a system-generated summary that a human can readily interpret.

Text summarization is classified, as shown in Fig. 1, into extractive and abstractive summarization:

1. Extractive summarization: It extracts the most relevant and significant sentences, which are a subset of the sentences in the original text [2]. Extractive summarization is similar to highlighting the prominent sentences in the original text document.

2. Abstractive summarization: It captures the main gist of the original text and generates the summary in its own words. Abstractive summarization is equivalent to rewriting the original text with new phrases to produce the summary [2].

Both extractive and abstractive summarization have benefits and drawbacks of their own. Extractive summarization picks out sentences that are vital and correct but suffers from incoherence, while abstractive summarization generates the summary word by word, so it is crisp and readable but may lose key facts in the case of huge documents.

The proposed TFRSP algorithm integrates extractive and abstractive summarization using unsupervised and supervised learning algorithms. Term Frequency-Inverse Document Frequency (TF-IDF) is used in combination with the Text Rank (TR) algorithm for extractive summarization; this combination, termed TF-IDF-TR in the proposed TFRSP approach, is an unsupervised learning technique that needs no supervision of the model. Abstractive summarization uses the sequence to sequence (seq2seq) model, a supervised learning algorithm that requires training and testing datasets. The TFRSP method collects its dataset from Amazon product reviews and preprocesses it. In the first phase, the TF-IDF-TR algorithm generates the extractive summary, which is then fed as input to the second phase, the abstractive seq2seq model, to obtain a more precise summary. Performance is evaluated using the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score.

The key reason for choosing the TFRSP algorithm is to combine the strengths of both abstractive and extractive summarization techniques. Extractive summarization works on huge documents and produces the summary by retrieving the important parts of the text. Abstractive summarization works only on smaller documents but can produce a summary with higher accuracy, in the form of a human-like summary. When these two different approaches are
combined, the resulting summary could be more accurate. The ROUGE score evaluates the summary produced by the system against the human-interpreted reference summary.

Fig. 1. Extractive and Abstractive Text Summarization

The key benefit of text summarization is that it minimizes the reading time of end users. Text summarization plays an important part in fields such as medicine, news articles, financial and legal document analysis, blogs, literature, online reviews, etc. [3].

II. RELATED WORK
The literature survey for text summarization is briefly explained in this section. The discussion mainly concerns the supervised and unsupervised algorithms and the evaluation metric [4] used to evaluate the generated summary.

A. Term Frequency (TF) - Inverse Document Frequency (IDF)
The algorithm used in extractive summarization is TF-IDF. Here, TF represents Term Frequency, in which the frequency of each word is counted. The frequency acquired is used to find the importance of the word: the higher the frequency of the word, the higher its importance in the document [5] [6]. Put simply, TF measures how frequently the word occurs in the document [7]. IDF represents the Inverse Document Frequency, which allots a higher value to rare words and a lower value to recurrent words. At times TF wrongly treats stop words as important because of their frequent occurrence. To rectify this issue, IDF identifies words that occur rarely across the documents. IDF acts as the inverse counterpart of TF; combining the two gives TF-IDF, which is the product of TF and IDF. The formula for TF-IDF is given below [5][7]:

$$\mathrm{TF}(i,j) = \text{frequency of the word } i \text{ in the document } j$$

$$\mathrm{IDF}(i,N) = \log\frac{N}{df_i} \qquad (1)$$

$$\text{TF-IDF}(i,j) = \mathrm{TF}(i,j) \times \mathrm{IDF}(i,N)$$

In (1), TF(i, j) represents the frequency of the word i in the document j [7]. N denotes the number of documents in the dataset and df_i denotes the number of documents containing the word i at least once. The value of df_i is higher when the word is used in many documents. TF-IDF is the value of the word i in the document j out of the N documents [5].

B. Text Rank Algorithm
Text Rank is an unsupervised algorithm used for ranking sentences by assigning each a weight. It originates from Google's PageRank algorithm, which ranks pages based on their hyperlinks and importance [8]. As a graph-based ranking algorithm, it constructs a directed graph from the sentences. The sentences are treated as nodes (vertices), and the similarity between two sentences is represented by an edge connecting them [9]. Text Rank is a recommendation-based algorithm in which the importance of a sentence is recommended by the vertices connected to it by edges within the graph.

C. Sequence to Sequence Model
The proposed TFRSP algorithm involves an abstractive summarization algorithm, the sequence to sequence model, which creates new phrases while retaining the meaning of the source document. The sequence to sequence model was first introduced by Google and powers applications such as Google Translate, image captioning, text summarization, online chatbots, etc. It is an encoder-decoder based model that maps input and output sequences of different lengths to each other [10]. The encoder and decoder use Long Short Term Memory (LSTM) as a subcomponent, which is useful in capturing long-term dependencies. The encoder-decoder model has two phases, training and inference [11], and both the encoder and the decoder are used in each. In the training phase, the encoder reads the entire input word by word, processes the information present in the input sequence and stores it as a hidden state. The hidden state from the encoder initializes the decoder, which is trained to predict the next word in the sequence given the previous word and hidden state [12]. In the inference phase, the sequence to sequence model is tested on new sequences for which the target summary sequence is not known [13].

D. ROUGE Score
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a metric for text summarization. It measures the n-gram matches between the system summary produced by text summarization and the reference summary written by humans. The ROUGE measure includes precision, recall and F-measure values, which together yield the ROUGE score. The ROUGE-1 score evaluates the overlap of unigrams between the system-generated summary and the human-generated reference summary [4]. ROUGE-2 evaluates the overlap of bigrams in the same way as ROUGE-1. Among the various ROUGE-n scores, ROUGE-1 is considered to have the highest accuracy in finding the overlapping words.

$$\mathrm{Recall} = \frac{\text{number of overlapping words}}{\text{total words in the reference summary}} \qquad (2)$$

$$\mathrm{Precision} = \frac{\text{number of overlapping words}}{\text{total words in the system summary}} \qquad (3)$$

$$\text{F-measure} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (4)$$
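
As a rough illustration of (2)-(4), the ROUGE-1 measures can be computed from clipped unigram counts. The sketch below is a minimal Python version under stated assumptions: lowercasing and whitespace splitting stand in for real tokenization, and the function name `rouge_1` and the example strings are illustrative and do not come from the paper. The rouge package cited in [23] may tokenize and score differently.

```python
from collections import Counter


def rouge_1(system_summary: str, reference_summary: str) -> dict:
    """Return ROUGE-1 recall, precision and F-measure for two summaries."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    system_tokens = system_summary.lower().split()
    reference_tokens = reference_summary.lower().split()

    # Clipped overlap: each unigram counts at most as often as it
    # appears in both summaries.
    overlap = sum((Counter(system_tokens) & Counter(reference_tokens)).values())

    # Recall is measured against the reference, precision against the system.
    recall = overlap / len(reference_tokens) if reference_tokens else 0.0
    precision = overlap / len(system_tokens) if system_tokens else 0.0

    # F-measure is the harmonic mean of precision and recall.
    if precision + recall == 0:
        f_measure = 0.0
    else:
        f_measure = 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision, "f_measure": f_measure}


# Hypothetical example: the reference string is made up for illustration.
scores = rouge_1("great product", "great quality product")
print(scores)
```

For this pair, both system unigrams appear in the three-word reference, so recall is 2/3, precision is 1 and the F-measure is 0.8.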

In (2), Recall refers to the extent to which the reference summary is covered by the system summary. In (3), Precision refers to the extent to which the summary generated by the system is relevant [1]. In (4), F-measure is the balanced score, or harmonic mean, of Precision (3) and Recall (2) [7].

E. Literature Review
Studies of the various techniques in text summarization are reviewed in this section, covering the text summarization algorithms proposed earlier.

In [5], Joo-Chang Kim and Kyungyong Chung proposed associative feature extraction from health data. In that paper, text preprocessing is performed on health big data, and TF-IDF is used to find the most relevant words in the document. In addition, TF-C-IDF is incorporated to extract associative features from the corpus retrieved by TF-IDF. The associative keywords are further analyzed with the Apriori algorithm, a data mining algorithm that derives association rules from relations in large data sets. The main advantage of the method is that it digitalizes health data and extracts the proper associative keywords.

In [14], Shahzad Qaiser and Ramsha Ali mainly discussed the benefits and drawbacks of TF-IDF and solutions for better, improved algorithms. TF-IDF is a straightforward algorithm that is easy to deploy and retrieves the most appropriate information from the data. At times, however, the algorithm used in the paper cannot find the most prominent word because of slight changes in tense forms. To overcome these issues, a new approach is proposed that combines classification algorithms with TF-IDF.

In [15], Rajendra Kumar Roul and Jajati Keshari Sahoo proposed combining sentiment analysis with extractive summarization, although the paper mainly emphasized the extractive summarization techniques. Hierarchical summarization involving four main algorithms, TextRank, LexRank, Latent Semantic Analysis (LSA) and SumBasic, is introduced, and the TextRank algorithm is found to perform far better than the other algorithms.

In [16], Madhurima Dutta et al. proposed an extractive summarization approach using concepts from graph theory. The Infomap clustering algorithm is used to cluster the sentences, and the similarities among the sentences are found by constructing a graph.

In [1], Chandu Parmar et al. compared algorithms for abstractive summarization. The sequence to sequence model is compared with a bidirectional Long Short Term Memory (LSTM) model, and with the approach used in the paper, the summaries produced by the sequence to sequence model are found to be more efficient than those of the bidirectional LSTM. The Amazon reviews and CNN datasets are used for the comparison and implementation of the algorithms.

In [17], Ramesh Nallapati et al. proposed a model for abstractive text summarization using an Attentional Encoder-Decoder Recurrent Neural Network. The algorithm showed promising results on multi-sentence documents.

III. PROPOSED SYSTEM
The steps involved in generating the summary are discussed in this section. First, datasets are collected as the raw input data. Second, the text is preprocessed to obtain cleaned and structured text. Finally, the cleaned text is summarized using the TFRSP algorithm.

A. Dataset Collection
There are two ways to obtain datasets. A dataset can come in a well-defined format, such as a .csv file obtained from Kaggle or other dataset collection websites. Alternatively, it can be collected through web scraping, which extracts raw data from websites. Here, the Amazon Product Reviews dataset is collected from Kaggle [18]. The dataset is preprocessed, and the summary is then generated through combined unsupervised and supervised techniques.

B. Preprocessing
The raw input data collected from different sources are preprocessed using various Natural Language Processing (NLP) libraries [19] for string or text processing. NLP is used to manipulate large amounts of natural language data, and text summarization involves a set of NLP activities.

Fig. 2. Text Preprocessing

The raw input dataset undergoes various preprocessing stages. First, the noisy input data is checked for redundancy and null values. The null values are removed, and the data is sent for tokenization, in which the whole text document is tokenized into sentences and then into words. Stop word removal is then performed on the collection of words obtained [16] [20]. It is followed by lemmatization, which is a technique to obtain the root word from the extracted
keywords. Lemmatization yields a valid dictionary word, which the stemming process does not guarantee. The words are converted to lowercase and the cleaned text is obtained. The preprocessing steps are shown in the flow in Fig. 2.

The TFRSP algorithm used to obtain a concise and meaningful summary is shown in Fig. 3. The TFRSP algorithm consists of two phases, namely the unsupervised phase and the supervised phase, as represented in Fig. 4.

C. Unsupervised Phase
The preprocessed data is fed into the extractive summarization stream, where the TF-IDF-TR algorithm is applied to the preprocessed text document. Here TF-IDF is used to extract the most frequent words, and the sentences are ranked using the Text Rank algorithm. With TF-IDF-TR, the coherence of the summary is maintained because the sentences are ranked based on their similarity. The weighted graph is constructed after the TF-IDF step, and the words are considered as the vertices of the graph. The Text Rank algorithm uses a cosine similarity matrix to find the most important words or sentences on the basis of their similarity and relevance. In the TFRSP algorithm, however, Text Rank is combined with TF-IDF to produce the most important sentences based on the ranking algorithm along with the term frequency calculation.

Fig. 3. Algorithm for TFRSP

D. Supervised Phase
The summary extracted by extractive summarization in the unsupervised phase is fed as input to the abstractive summarization phase, in which the supervised sequence to sequence model is used. Because extractive summarization reduces the size of the original document, it benefits abstractive summarization. The sequence to sequence model is applied to the phase 1 summary. The model is initially trained with the reference summary datasets. The output of the extractive summarization is considered to be the hidden state in the encoder-decoder model. The final summary is predicted with the help of the hidden state and the trained reference model. The summary obtained from abstractive summarization is more precise, and the loss of data is minimized because the initial extractive summarization step retains the most important facts.

The complete working of the TFRSP approach for summary generation is shown in Fig. 4. The input text document is preprocessed and the extractive summary is obtained from the unsupervised phase. The output of phase 1 is then fed into the supervised phase, which generates the final summary, as shown in Fig. 4.

Fig. 4. Unsupervised and supervised phases of TFRSP

IV. RESULTS AND DISCUSSION
The integrated approach for summary generation is implemented in a Python 3 Jupyter Notebook [21]. The original review text is given as input to the first phase of the proposed TFRSP algorithm to produce the phase 1 summary as an intermediate result, which then serves as input to the next phase of the TFRSP algorithm to generate the final output summary, as shown in Fig. 5.

Original Review Text: I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most.

Phase 1 Summary: The product looks more like a stew than a processed meat and it smells better.

Final summary: great product.

Fig. 5. Example of sample review text and its summary

The performance of the TFRSP method is compared with the other existing summarization methods [22] using the ROUGE score. The package used to calculate the
ROUGE score is rouge-0.3.2 [23]. ROUGE is a summary accuracy evaluation metric that compares the human-generated reference summary with the system-generated summary. The performance of the Amazon product review summarization is analyzed using the ROUGE-1 score, which comprises F-measure, precision and recall, as given in Table 1.

TABLE 1. Performance Comparison of existing algorithms with proposed TFRSP algorithm using ROUGE score

                          ROUGE-1 Score
Algorithms                F-measure   Precision   Recall
Extractive (Text Rank)    0.051387    0.027464    0.077614
Abstractive (Seq2Seq)     0.121599    0.149289    0.106128
TFRSP Method              0.248323    0.287440    0.204339

From the experimental results in Table 1, the ROUGE score of the integrated text summarization approach using the TFRSP algorithm is found to be higher than those of the existing extractive and abstractive methodologies taken separately, as shown in Fig. 6. The proposed approach using the TFRSP algorithm generates a precise summary that is similar to a human-interpreted summary.

Fig. 6. Performance analysis of the experimental results

V. CONCLUSION
The TFRSP algorithm generates a summary of the Amazon product reviews by combining unsupervised (extractive summarization) and supervised (abstractive summarization) algorithms, producing an integrated approach with an increased accuracy of 87.58 % when compared with the traditional methods of text summarization. The summary generated from the combined supervised and unsupervised learning results in a 38.42 % increase in ROUGE score over the existing methods. The proposed method could be further improved by combining classification techniques such as Naive Bayes, Decision Tree, etc. with TF-IDF. The proposed method could also be tested on various datasets, and the accuracy of the generated summary could be increased by increasing the number of epochs.

REFERENCES
[1] Parmar, Chandu, Ranjan Chaubey, and Kirtan Bhatt. "Abstractive Text Summarization Using Artificial Intelligence." Available at SSRN 3370795 (2019).
[2] Gupta, Vanyaa, Neha Bansal, and Arun Sharma. "Text summarization for big data: A comprehensive survey." In International Conference on Innovative Computing and Communications, pp. 503-516. Springer, Singapore, 2019.
[3] Applications of automatic summarization: https://blog.frase.io/20-applications-of-automatic-summarization-in-the-enterprise/
[4] Shanmugasundaram Hariharan. "Studies on intrinsic summary evaluation", International Journal of Artificial Intelligence and Soft Computing, 2010.
[5] Kim, Joo-Chang, and Kyungyong Chung. "Associative feature information extraction using text mining from health big data." Wireless Personal Communications 105, no. 2 (2019): 691-707.
[6] Bhavadharani, M., M. P. Ramkumar, and Selvan GSR Emil. "Performance Analysis of Ranking Models in Information Retrieval." In 2019 3rd International Conference on Trends in Electronics and Informatics (ICOEI), pp. 1207-1211. IEEE, 2019.
[7] Pan, Suhan, Zhiqiang Li, and Juan Dai. "An improved TextRank keywords extraction algorithm." In Proceedings of the ACM Turing Celebration Conference-China, pp. 1-7. 2019.
[8] Mihalcea, Rada. "Graph-based ranking algorithms for sentence extraction, applied to text summarization." In Proceedings of the ACL Interactive Poster and Demonstration Sessions, pp. 170-173. 2004.
[9] Mallick, Chirantana, Ajit Kumar Das, Madhurima Dutta, Asit Kumar Das, and Apurba Sarkar. "Graph-based text summarization using modified TextRank." In Soft Computing in Data Analytics, pp. 137-146. Springer, Singapore, 2019.
[10] Song, Shengli, Haitao Huang, and Tongxiao Ruan. "Abstractive text summarization using LSTM-CNN based deep learning." Multimedia Tools and Applications 78, no. 1 (2019): 857-875.
[11] "Advances in Computational Intelligence", Springer Science and Business Media LLC, 2019.
[12] Understanding Encoder-Decoder Sequence to sequence model: https://towardsdatascience.com/understanding-encoder-decoder-sequence-to-sequence-model-679e04af4346
[13] Text Summarization using Sequence to sequence encoder decoder model: https://www.analyticsvidhya.com/blog/2019/06/comprehensive-guide-text-summarization-using-deep-learning-python/
[14] Qaiser, Shahzad, and Ramsha Ali. "Text mining: use of TF-IDF to examine the relevance of words to documents." International Journal of Computer Applications 181, no. 1 (2018): 25-29.
[15] Roul, Rajendra Kumar, and Jajati Keshari Sahoo. "Sentiment Analysis and Extractive Summarization Based Recommendation System." In Computational Intelligence in Data Mining, pp. 473-487. Springer, Singapore, 2020.
[16] Dutta, Madhurima, Ajit Kumar Das, Chirantana Mallick, Apurba Sarkar, and Asit K. Das. "A Graph Based Approach on Extractive Summarization." In Emerging Technologies in Data Mining and Information Security, pp. 179-187. Springer, Singapore, 2019.
[17] Nallapati, Ramesh, Bowen Zhou, Caglar Gulcehre, and Bing Xiang. "Abstractive text summarization using sequence-to-sequence rnns and beyond." arXiv preprint arXiv:1602.06023 (2016).
[18] Kaggle Dataset: https://www.kaggle.com/skathirmani/amazon-reviews
[19] "Natural Language Processing and Chinese Computing", Springer Science and Business Media LLC, 2018.
[20] Bhavadharani, M., M. P. Ramkumar, and Emil Selvan GSR. "Information Retrieval in Search Engines Using Pseudo Relevance Feedback Mechanism." In 2019 International Conference on Vision Towards Emerging Trends in Communication and Networking (ViTECoN), pp. 1-5. IEEE, 2019.
[21] Python 3 Jupyter Notebook: https://jupyter.org/
[22] "Computational Intelligence in Data Mining", Springer Science and Business Media LLC, 2020.
[23] ROUGE: https://pypi.org/project/rouge/
