Meena S M
Department of Computer Science and Engineering
Thiagarajar College of Engineering
Madurai, India
[email protected]

Ramkumar M P
Department of Computer Science and Engineering
Thiagarajar College of Engineering
Madurai, India
[email protected]
Abstract— In the era of information technology, data plays a significant role. The data which prevails on the internet is unstructured and is not in a concise form. To make the raw data structured, readable, coherent and concise, and to extract a summary of the data, the concept of text summarization is introduced. Text summarization involves providing a summary of the useful information from the raw data without dissolving the main theme of the data. Nowadays readers face the challenge of reading comments, reviews, news articles, blogs, etc., as they are too informal and noisy. Retrieving the correct gist of the text, which is necessary for all readers, is a quite difficult task. In order to overcome the problems faced by readers, the TFRSP (Text Frequency Ranking Sentence Prediction) algorithm is proposed to generate a precise summary using supervised and unsupervised learning algorithms. The proposed approach uses the combination of TF-IDF-TR (Term Frequency – Inverse Document Frequency – Text Rank) as an unsupervised learning algorithm and the Seq2Seq (Sequence to Sequence) model as a supervised learning algorithm to obtain the benefits of both extractive and abstractive summarization. The results of the proposed TFRSP approach are compared with existing methods of text summarization using the Recall Oriented Understudy for Gisting Evaluation (ROUGE); the approach attains a high ROUGE score and hence achieves high summary accuracy.

Keywords— Text Summarization, Natural Language Processing, Extractive Summarization, Unsupervised learning, Abstractive Summarization, Supervised learning, ROUGE

I. INTRODUCTION

The digitalized world has brought rapid growth in the field of information retrieval. People rely on a variety of resources to stay updated. Considering time as a factor, people want information to be presented in a short and precise manner. A major issue with news articles and online reviews or feedback is that they are hard to conclude unless they are read completely. This has led to the evolution of the text summarization concept for the betterment of information retrieval. Text summarization is the concept of extracting the main corpus of information as a summary from the original text in a brief, orderly and human-interpretable manner [1]. Automatic text summarization uses the ideas of Natural Language Processing (NLP) to obtain the summary systematically, producing a human-interpretable summary in the form of a system-generated summary.

Text summarization is classified, as shown in fig. 1, into extractive and abstractive summarization:

1. Extractive summarization: It extracts the most relevant and significant sentences, which are a subset of the sentences of the original text [2]. Extractive summarization is similar to highlighting the prominent sentences of the original text document.

2. Abstractive summarization: It extracts the main gist of the original text and generates the summary in its own words. Abstractive summarization is equivalent to recreating the original text with new phrases to produce the summary [2].

Both extractive summarization and abstractive summarization have benefits and drawbacks of their own. Extractive summarization picks out the sentences that are vital and correct but suffers from incoherency, while abstractive summarization generates a word-by-word summary that is crisp and readable but may lose key facts in the case of huge documents.

The proposed TFRSP algorithm deals with the integration of both extractive summarization and abstractive summarization, using supervised and unsupervised learning algorithms. Term Frequency-Inverse Document Frequency (TF-IDF) is used in combination with the Text Rank (TR) algorithm in extractive summarization. It is an unsupervised learning technique that does not need supervision of the model, and it is modified as TF-IDF-TR in the proposed TFRSP approach. Abstractive summarization involves the sequence to sequence (seq2seq) model, which is a supervised learning algorithm that includes training and testing datasets. The TFRSP method collects the dataset from Amazon product reviews, which is preprocessed; the TF-IDF-TR algorithm is used to generate the extractive summary in the first phase, which is then fed as input into the second-phase abstractive seq2seq model to obtain a more precise summary. The effectiveness of the resulting summary is evaluated using the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score.

The key factor in choosing the TFRSP algorithm is to combine the strengths of both abstractive and extractive summarization techniques. Extractive summarization works on huge documents and can produce the summary in the form of important corpus retrieval. Abstractive summarization works only on smaller documents, but it can produce the summary with higher accuracy, in the form of a human-like summary. When these two different approaches are combined, the resulting summary can be more accurate. The ROUGE score evaluates the summary produced by the system against the human-interpreted reference summary.
Fig. 1. Extractive and Abstractive Text Summarization

The key benefit of text summarization is that it minimizes the reading time of end-users. Text summarization plays an important part throughout the fields of medicine, news articles, financial and legal document analysis, blogs, literature, online reviews, etc. [3]

II. RELATED WORK

The literature survey for text summarization is briefly explained in the following segment. The main discussion concerns the supervised and unsupervised algorithms and the evaluation metric [4] used to evaluate the generated summary.
A. Term Frequency (TF) - Inverse Document Frequency (IDF)

The algorithm used in the extractive summarization is TF-IDF. Here, TF represents Term Frequency, in which the frequency of the words is counted. The frequency acquired is used to find the importance of a word: the higher the frequency of the word, the higher its importance in the document [5] [6]. The simplest way of explaining TF is that it measures how frequently a word occurs in the document [7]. IDF represents the Inverse Document Frequency, which allots a higher value to rare words and a lower value to recurrent words. At times TF miscalculates stop words as important because of their frequent occurrence. To rectify this issue, IDF identifies the words that occur rarely across the documents. IDF denotes the inverse counterpart of TF; when combined, the two produce TF-IDF, which is the multiplication of TF and IDF. The formulas for TF-IDF are given below [5][7]:

\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}

\mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}    (1)

\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)    (2)

where f_{t,d} is the number of occurrences of term t in document d, and N is the total number of documents in the corpus D.
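As a concrete illustration of equations (1) and (2), the short Python sketch below computes TF and IDF by hand; the toy review snippets and query terms are placeholders, not taken from the paper's dataset.

import math

docs = [
    "the battery life of this product is great".split(),
    "the product stopped working after a week".split(),
    "great sound quality and great battery life".split(),
]

def tf(term, doc):
    # term frequency: occurrences of the term divided by the document length
    return doc.count(term) / len(doc)

def idf(term, docs):
    # inverse document frequency: log of total documents over documents containing the term
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    # equation (2): product of term frequency and inverse document frequency
    return tf(term, doc) * idf(term, docs)

for term in ("great", "the", "battery"):
    print(term, round(tf_idf(term, docs[2], docs), 3))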
B. Text Rank Algorithm

Text Rank is an unsupervised algorithm used for ranking sentences with the help of weights. The Text Rank algorithm originates from Google's PageRank algorithm, which ranks pages based on their hyperlinks and their importance [8]. As its name suggests, it is a graph-based ranking algorithm: a graph is constructed from the sentences, where the sentences are considered as nodes or vertices and the similarity between two nodes is captured by the edge connecting them [9]. Text Rank is a recommendation-based algorithm in which the importance of a sentence is recommended by the vertices connected to it by edges within the graph.
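The sketch below illustrates this graph-based ranking idea in plain Python: sentences become nodes, a simple word-overlap similarity supplies the edge weights, and scores are obtained by a PageRank-style power iteration. The example sentences, similarity function and damping factor are illustrative assumptions, not the paper's exact formulation.

sentences = [
    "the battery lasts two full days on a single charge",
    "the charger that ships with the phone is very slow",
    "battery life on a single charge is the best feature",
    "the box also contains a short usb cable",
]

def overlap_similarity(a, b):
    # edge weight: shared words, normalised by the two sentence lengths
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / (len(wa) + len(wb))

n = len(sentences)
sim = [[0.0 if i == j else overlap_similarity(sentences[i], sentences[j])
        for j in range(n)] for i in range(n)]

scores = [1.0 / n] * n           # initial rank of every sentence node
damping = 0.85
for _ in range(50):              # power iteration, as in PageRank
    scores = [(1 - damping) / n + damping *
              sum(sim[j][i] / sum(sim[j]) * scores[j]
                  for j in range(n) if sum(sim[j]) > 0)
              for i in range(n)]

ranked = sorted(zip(scores, sentences), reverse=True)
print(ranked[0][1])              # highest-ranked sentence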
C. Sequence to Sequence Model

The proposed TFRSP algorithm involves an abstractive summarization algorithm termed the sequence to sequence (seq2seq) model, which is useful for creating new phrases while retaining the meaning of the source document. The sequence to sequence model was first introduced by Google and powers applications such as Google Translate, image captioning, text summarization, online chatbots, etc. It is an encoder-decoder based model that maps input and output sequences of different lengths to each other [10]. The encoder and decoder use Long Short Term Memory (LSTM) units, which are useful for capturing long-term dependencies. The encoder-decoder model operates in two phases, training and inference [11], and the encoder and decoder are used in both. In the training phase, the encoder reads the entire input word by word, processes the information present in the input sequence, and stores it as a hidden state. The hidden state from the encoder initializes the decoder, which is trained to predict the next word in the sequence from the previous hidden state and word [12]. In the inference phase, the sequence to sequence model is tested on new sequences for which the target summary sequence is not known [13].
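The following Keras sketch shows one way such an LSTM encoder-decoder can be wired for training; the vocabulary sizes and latent dimension are illustrative assumptions, and the model is only compiled here, not trained on the paper's data.

from tensorflow.keras.layers import Dense, Embedding, Input, LSTM
from tensorflow.keras.models import Model

src_vocab, tgt_vocab, latent_dim = 10000, 5000, 256  # assumed sizes

# encoder: reads the integer-encoded review and keeps its final LSTM states
enc_inputs = Input(shape=(None,))
enc_emb = Embedding(src_vocab, latent_dim)(enc_inputs)
_, state_h, state_c = LSTM(latent_dim, return_state=True)(enc_emb)

# decoder: initialised with the encoder states, predicts the next summary word
dec_inputs = Input(shape=(None,))
dec_emb = Embedding(tgt_vocab, latent_dim)(dec_inputs)
dec_lstm, _, _ = LSTM(latent_dim, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])
dec_outputs = Dense(tgt_vocab, activation="softmax")(dec_lstm)

model = Model([enc_inputs, dec_inputs], dec_outputs)
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")
model.summary()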
D. ROUGE Score

ROUGE (Recall Oriented Understudy for Gisting Evaluation) is a metric for text summarization. It is used to measure the n-gram matches between the system summary retrieved from text summarization and the reference summary generated by humans. The ROUGE measure includes precision, recall and f-measure values, which when combined yield the ROUGE score. The ROUGE-1 score evaluates the overlap of unigrams between the system-generated summary and the human-generated reference summary [4]. ROUGE-2 evaluates the overlap of bigrams in a similar way to ROUGE-1. Among the various ROUGE-n scores, ROUGE-1 is considered to have the highest accuracy in finding the overlapping words.
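A minimal sketch of the ROUGE-1 computation described above is given below; the system and reference summaries are placeholder strings, and clipped unigram counts are used for the overlap.

from collections import Counter

system = "battery life is great and charging is fast".split()
reference = "the battery life is great".split()

# clipped unigram matches between the system summary and the reference
overlap = sum((Counter(system) & Counter(reference)).values())

precision = overlap / len(system)
recall = overlap / len(reference)
f_measure = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f_measure, 3))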
keywords. Lemmatization is a process of obtaining the dictionary form of a word, which the stemming process lacks. The words are converted into lowercase and the cleaned text is obtained. The preprocessing steps are shown in the flow in fig. 2.
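A minimal preprocessing sketch along these lines (lowercasing, keeping alphabetic tokens, removing stop words and lemmatizing) is shown below using NLTK; the toolkit choice and the sample review are assumptions, not necessarily what was used in this work.

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    tokens = word_tokenize(text.lower())                  # lowercase and tokenize
    tokens = [t for t in tokens if t.isalpha()]           # keep alphabetic tokens only
    tokens = [t for t in tokens if t not in stop_words]   # drop stop words
    return [lemmatizer.lemmatize(t) for t in tokens]      # dictionary (lemma) form

print(preprocess("The batteries were lasting longer than expected!"))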
The TFRSP algorithm used to obtain a concise and meaningful summary is shown in fig. 3. The TFRSP algorithm consists of two phases, namely the unsupervised phase and the supervised phase, as represented in fig. 4.

… minimized as the initial process of extractive summarization is done to consider the most important facts.

The complete working of the TFRSP approach for summary generation is shown in figure 4. The input text document is preprocessed and the extractive summary is obtained from the unsupervised phase. Then the output of phase 1 is fed into the supervised phase, which generates the final summary, as shown in figure 4.
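The two-phase flow described above can be summarised by the following structural sketch; the phase functions here are trivial placeholders that stand in for the preprocessing, TF-IDF-TR and seq2seq components detailed in the surrounding subsections, not the paper's implementation.

def preprocess(text):
    # placeholder for the preprocessing step (cleaning, lowercasing, lemmatization)
    return text.lower()

def unsupervised_phase(text, top_k=3):
    # placeholder for the TF-IDF-TR extractive phase: keep the first k sentences
    return ". ".join(text.split(". ")[:top_k])

def supervised_phase(text):
    # placeholder for the seq2seq abstractive phase
    return text

def tfrsp_summarize(document, top_k=3):
    cleaned = preprocess(document)                    # preprocessing
    extractive = unsupervised_phase(cleaned, top_k)   # phase 1: extractive summary
    return supervised_phase(extractive)               # phase 2: final abstractive summary

print(tfrsp_summarize("Great battery. Slow charger. Nice screen. Average camera."))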
C. Unsupervised Phase
The data after preprocessing is fed into the extractive summarization stream, where the TF-IDF-TR algorithm is applied over the preprocessed text document. Here the concept of TF-IDF is used, where the most frequent words are extracted, and the ranking of sentences is done using the Text Rank algorithm. Using the concept of TF-IDF-TR, the coherency of the summary is maintained, as the sentences are ranked based on the similarity between sentences. The weighted graph is constructed after the TF-IDF algorithm, and the words are considered as the vertices of the graph. The Text Rank algorithm uses the cosine similarity matrix to find the most important words or sentences with the help of similarity and relevancy between the sentences. In the TFRSP algorithm, Text Rank is combined with TF-IDF to produce the most important sentences based on the ranking algorithm along with the term frequency calculation.
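One possible realisation of this combined TF-IDF and Text Rank ranking is sketched below, using scikit-learn for the TF-IDF sentence vectors and the cosine similarity matrix, and networkx for the PageRank-style scoring; the library choices and sample sentences are assumptions rather than the exact implementation used in this work.

import networkx as nx
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "the battery easily lasts two days on one charge",
    "battery life is the best feature of this phone",
    "the screen scratches far too easily",
    "a screen protector is included in the box",
]

# TF-IDF vector for every sentence, then the pairwise cosine similarity matrix
tfidf = TfidfVectorizer().fit_transform(sentences)
sim_matrix = cosine_similarity(tfidf)
np.fill_diagonal(sim_matrix, 0.0)          # no self-similarity edges

# weighted sentence graph; PageRank supplies the Text Rank scores
graph = nx.from_numpy_array(sim_matrix)
scores = nx.pagerank(graph)

# the highest-ranked sentences form the extractive summary
top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:2]
print([sentences[i] for i in sorted(top)])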