
Received 31 October 2022, accepted 13 December 2022, date of publication 20 December 2022, date of current version 29 December 2022.
Digital Object Identifier 10.1109/ACCESS.2022.3231016

Feature Based Automatic Text Summarization Methods: A Comprehensive State-of-the-Art Survey

DIVAKAR YADAV1 (Senior Member, IEEE), RISHABH KATNA2, ARUN KUMAR YADAV1, AND JORGE MORATO3
1 School of Computer and Information Sciences (SOCIS), Indira Gandhi National Open University (IGNOU), Maidan Garhi, New Delhi 110068, India
2 Department of Computer Science and Engineering, National Institute of Technology Hamirpur (NIT Hamirpur), Hamirpur, Himachal Pradesh 177005, India
3 Department of Computer Science, Universidad Carlos III de Madrid, 28911 Leganés, Spain
Corresponding author: Jorge Morato ([email protected])

ABSTRACT With the advent of the World Wide Web, numerous online platforms generate huge amounts of textual material, including social networks, online blogs, magazines, etc. This textual content contains useful information that can be used to advance humanity. Text summarization has been a significant area of research in natural language processing (NLP). With the expansion of the internet, the amount of data in the world has exploded, and large volumes of data make locating the required and best information time-consuming. It is impractical to manually summarize petabytes of data; hence, automatic text summarization is rising in popularity. This study presents a comprehensive overview of the current status of text summarization approaches, techniques, standard datasets, assessment criteria, and future research directions. The summarization approaches are assessed on several characteristics, including the approach used, the number of documents, the summarization domain, the document language, the nature of the output summary, etc. The study concludes with a discussion of the many obstacles and research opportunities linked to text summarization research that may be relevant for future researchers in this field.

INDEX TERMS Abstractive summarization, cosine-similarity, deep learning, extractive summarization, graph-based algorithm, neural networks.
I. INTRODUCTION
The World Wide Web (WWW) has become an immense information resource. Today, some websites generate more data every day than was produced in the previous ten years combined. However, the majority of the data generated by these websites is irrelevant, redundant, and noisy, masking the most pertinent information. In addition, users must explore several files and web pages to find the information they seek, which wastes the time of many users. A strong document summary can fix this issue: if every online page provided a concise summary of its content, it would save time for many users and boost website engagement. However, it is not possible to manually summarize each web page on the World Wide Web. Automatic text summarization (ATS) technologies can resolve the issue. Consequently, ATS has become a focus of NLP study.
ATS systems are designed to accomplish objectives such as extracting the most important and relevant information from a document and generating summaries that are much shorter than the original content. ATS systems can generally be categorized into one of the following categories:
a. Single-document summarization system: This type generates a single summary for a single document.
b. Multi-document summarization system: The generation of a single summary for multiple documents is performed in this type.
Multi-document systems are more susceptible to duplication and inaccuracy, because various documents may contain identical sentences representing different information (inaccuracy) and different sentences representing identical information (redundancy).

The associate editor coordinating the review of this manuscript and approving it for publication was Seifedine Kadry.


There are three primary methods for generating summaries:
a. Extractive approach: In this approach, important sentences from a document are picked and combined to generate a final summary. Major steps in an extractive approach include (a code sketch of this workflow follows the list below):
i. Document pre-processing
ii. Creating a provisional representation of the document
iii. Scoring sentences according to their retrieval value
iv. Selecting the sentences with the highest scores.
b. Abstractive approach: This strategy seeks a much deeper comprehension of the document. Instead of selecting meaningful sentences directly, it generates new sentences that convey the same information using natural language processing algorithms. Important steps in an abstractive approach include:
i. Preprocessing the document
ii. Making an intermediate representation of the document
iii. Generating new sentences based on the intermediate representation.
c. Hybrid approach: This approach combines both the abstractive and the extractive approaches to generate the summary.
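To make the four extractive steps above concrete, here is a minimal Python sketch under simple assumptions: sentences are split with a regular expression, the provisional representation is a corpus-wide word-frequency table, and the scoring heuristic (average word frequency) is illustrative rather than the method of any particular surveyed study.

```python
# A minimal sketch of the four extractive steps: pre-process, represent,
# score, select. Pure standard library; the frequency-based scorer is an
# illustrative assumption, not a specific published method.
import re
from collections import Counter

def extractive_summary(document: str, n_sentences: int = 3) -> str:
    # i. Document pre-processing: sentence splitting and tokenization
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    tokenize = lambda s: re.findall(r"[a-z']+", s.lower())

    # ii. Provisional representation: corpus-wide word frequencies
    freq = Counter(w for s in sentences for w in tokenize(s))

    # iii. Score each sentence by the average frequency of its words
    def score(sentence: str) -> float:
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / (len(words) or 1)

    # iv. Select the highest-scoring sentences, kept in document order
    best = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    return " ".join(s for s in sentences if s in best)
```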
Automatic text summarization is one of the most challenging areas of text and data mining. There are numerous obstacles associated with developing high-quality automated summaries, as mentioned below:
(i) Redundancy: Most ATS systems generate phrases with similar informational content. Because the size of the summaries is limited, more valuable and diverse information-carrying sentences may not be included in the summary. This may result in the loss of crucial information.
(ii) Time zones for multi-document summarization: Different documents in a dataset can belong to different time zones. Hence, they might use temporal words to convey different meanings. This is a big challenge in multi-document summarization.
(iii) Generating short summaries for very large documents like novels, books, etc.
(iv) Generated summaries may not maintain a proper flow. This is more significant in extractive text summarization.
These significant challenges in text summarization are the focus of intense research. Nevertheless, certain models perform better than others on certain criteria, such as abstractive summarizers' ability to maintain a decent flow and decrease repetition, but they cannot solve the remaining problems.
Numerous research articles have been published on this subject. Survey papers are vital for imparting concept knowledge to a novice audience and offering information on current trends and future horizons in a single document. Some survey papers covered a specific subdomain of text summarization: Jain et al. [1] surveyed legal document summarization; Al-Saleh and Menai [2], Arabic text summarization techniques; Kumar et al. [3], multilingual text summarization; and some studies ([4], [5]) attempted to provide an overview of the entire field of text summarization.
Existing survey articles, however, do have limitations. Either the information covered is minimal ([1], [2], [6]), the articles examined are outdated and do not address the most recent developments in this subject, or the information supplied is difficult to comprehend. By presenting a succinct, up-to-date, and comprehensible overview of the topic of text summarization, this survey paper overcomes these drawbacks of prior publications.
In this paper, we explore the various classifications of text summarization approaches based on several parameters, such as methodology, document count, language, etc. We also briefly address investigations undertaken within each classification. We list the outcomes, benefits, and drawbacks of each study. Finally, we present a comprehensive review of the performance of various approaches on prominent datasets. However, a comprehensive analysis of each study is outside the scope of this work. In addition, this paper discusses the most popular and effective methodologies, as a comprehensive treatment of all approaches would exceed its scope.
The main contributions of this study are as follows:
• Provides a tabular and comprehensive analysis of different studies, making it easy for the reader to compare and evaluate various methodologies.
• Describes the benefits and drawbacks of each study analysed in this paper.
• Offers a comprehensive analysis of numerous strategies and their performance on popular datasets.
• Provides a comprehensive discussion of future horizons, recommended methods, and research directions.
The flow of this paper is explained in the diagram in FIGURE 1.

FIGURE 1. Paper workflow.


FIGURE 3. Number of documents about ATS in Scopus and WoS in the period 2011-2021.

FIGURE 4. Number of citations obtained by the documents collected from Scopus and WoS in the period 2011-2021.

This paper's body is divided into numerous sections. The first section provides a quick bibliographic analysis of the growing interest in the topic and identified tendencies. The classification of text summarization algorithms based on various factors is discussed in Section 2. Section 3 enumerates the assessment criteria employed by various studies to compare and contrast their systems with those of others. Section 4 provides a listing of the significant datasets utilized in the research described in Section 2. Section 5 demonstrates alternative methods of classifying ATS. In Section 6, we conduct a comprehensive examination of the prevalent strategies for text summarization and provide some observations on the enhancements and results obtained by other investigations. Section 7 discusses the challenges of text summarization, followed by a conclusion in the last section.

II. A BRIEF BIBLIOMETRIC STUDY ON THE EVOLUTION OF THE FIELD
Following is a brief literature overview demonstrating how interest in the topic has progressed (FIGURE 2), followed by a classification of approaches according to the approach taken by the various summarization systems.
Regarding the academic interest generated by reputable publications, a study of the works published in the past few years is informative. FIGURE 2 depicts the primary approach used to classify and analyze scientific papers. This diagram's sequence is based on the principles provided in [7] and [8].

FIGURE 2. Procedure adopted to perform the study of scientific articles.
A search was conducted in the Web of Science (WoS) and Scopus databases to determine the evolution of the works published in the field. Their selection reflects the fact that they are the data sources with the most extensive coverage and the greatest prevalence in bibliometric research. Both resources are complementary because their geographical scopes and journal collections are distinct [9]. In addition, the journals included in these databases are chosen based on their quality and influence.


TABLE 1. Queries performed in Scopus and WoS Core Collection databases.

Given that we aim to study the current trajectory in computing, we restricted our search to the years 2011 through 2021. The executed queries are listed in TABLE 1.
Figures 3 and 4 reveal a strong upward trend that has become more evident since 2018. In the previous two years, there has been a slowdown, although this may be related to the time required to update the databases' publications. The trend of citations is quite progressive, indicating a focus on the achievements made in ATS during the period. In fact, the h-index in WoS is 39, while in Scopus it is 56.
Most systems are language-dependent, and the dearth of native speakers or digital resources in certain languages impedes study. Analyzing the summaries, titles and keywords in Scopus shows that most of the language studies concern the most spoken languages in the world (TABLE 2).

TABLE 2. Languages with more than 5 works in the Scopus collection (English excluded).

Several observations can be made regarding the number of works published in the various languages: 1) the number of works does not reflect the number of speakers; for example, Nigerian Pidgin is the 14th most spoken language, but it is not mentioned in the results; 2) there are languages among the 30 most spoken that have no study, such as Cantonese, Tagalog, Hausa, Swahili, Nigerian Pidgin, and Javanese; 3) Indian languages are well represented: Bengali (28), Hindi (17), Punjabi (8), Kannada (8), Telugu (7), Konkani (5), Assamese (4), Tamil (2), and Marathi (2). However, the representation of Hindi, the third most spoken language, is inadequate, and other languages, such as Nepali, are not mentioned.

III. EVALUATION METRICS
Automatic text summarization approaches, like all other methods, are evaluated using performance measurement metrics. These metrics are discussed in this section.

A. ROUGE (RECALL-ORIENTED UNDERSTUDY FOR GISTING EVALUATION)
It is the most popular evaluation metric used in the field of text summarization. ROUGE has four types:
a. ROUGE-N: In this metric, N stands for N-gram co-occurrence statistics. It measures the quality of a summary using n-gram recall between the summary and a set of manually generated summaries, as shown in Eq. (1):

\text{ROUGE-N} = \frac{\sum_{S \in \{\text{Reference summaries}\}} \sum_{\text{gram}_n \in S} \text{Count}_{\text{match}}(\text{gram}_n)}{\sum_{S \in \{\text{Reference summaries}\}} \sum_{\text{gram}_n \in S} \text{Count}(\text{gram}_n)} \quad (1)

b. ROUGE-L: Here, L stands for longest common subsequence (LCS). A sentence is represented as a sequence of words; the longer the LCS between our summary and the manual summary sentences, the better the quality of the summary.
c. ROUGE-W: Here, W stands for weighted LCS. It addresses the limitation of LCS that it cannot differentiate between LCSs with different spatial relations (consecutive versus scattered matches) within their sequences.
d. ROUGE-S: S stands for skip-bigram co-occurrence statistics. Skip-bigrams are bigrams that do not have to appear together in a sentence. For the sentence ''I am Ram'', the skip-bigrams generated will be {(''I'', ''am''), (''I'', ''Ram''), (''am'', ''Ram'')}. ROUGE-S uses skip-bigrams to compute the similarity between the generated summary and the manual summaries.
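As a concrete illustration, the following sketch mirrors the ROUGE-N recall of Eq. (1); real evaluations normally rely on a maintained package such as rouge-score rather than hand-rolled code.

```python
# A hand-rolled illustration of the ROUGE-N recall in Eq. (1).
from collections import Counter

def rouge_n_recall(candidate: str, references: list[str], n: int = 1) -> float:
    def ngrams(text: str) -> Counter:
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    cand = ngrams(candidate)
    matched = total = 0
    for ref in references:
        ref_counts = ngrams(ref)
        matched += sum(min(c, cand[g]) for g, c in ref_counts.items())  # clipped matches
        total += sum(ref_counts.values())
    return matched / total if total else 0.0

print(rouge_n_recall("the cat sat", ["the cat sat on the mat"]))  # 0.5
```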
B. GENERIC PERFORMANCE METRICS
1) PRECISION
It is computed by dividing the number of sentences common to the reference and candidate summaries by the number of sentences in the candidate summary, as shown in Eq. (2):

Precision = N(Sr ∩ Sc)/N(Sc)    (2)


where,
Sr := Reference summary
Sc := Candidate summary
N(S) := Number of sentences in summary S.

2) RECALL
It is computed by dividing the number of sentences common to the reference and candidate summaries by the number of sentences in the reference summary, as shown in Eq. (3):

Recall = N(Sr ∩ Sc)/N(Sr)    (3)

where the symbols are as defined for Eq. (2).

3) F-MEASURE
It is computed as the harmonic mean of precision and recall, as shown in Eq. (4):

F = 2(Precision)(Recall)/(Precision + Recall)    (4)
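The three sentence-level metrics of Eqs. (2)-(4) translate directly into code; the sketch below treats each summary as a set of sentences.

```python
# Direct transcription of Eqs. (2)-(4) over sentence sets.
def precision_recall_f(reference: set, candidate: set):
    common = len(reference & candidate)
    p = common / len(candidate) if candidate else 0.0
    r = common / len(reference) if reference else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f
```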
C. SUPERT
Summarization Evaluation with Pseudo references and BERT (SUPERT) is an unsupervised summary evaluation metric for evaluating multi-document summaries by measuring the semantic similarity between the summary and a pseudo-reference summary. SUPERT was introduced by [81]. The limitation of ROUGE is that it needs manual summaries to judge the quality of a summary; SUPERT, in contrast, can be used to evaluate summaries on a dataset that does not have manual summaries.
of XML documents on an annual basis. This dataset can be
IV. DATASETS FOR TEXT SUMMARIZATION
In this section, we discuss the datasets popular among researchers for text summarization.

A. DOCUMENT UNDERSTANDING CONFERENCES (DUC)
The National Institute of Standards and Technology (NIST) provides these groups of datasets. DUC is part of a Defense Advanced Research Projects Agency (DARPA) program, Translingual Information Detection, Extraction, and Summarization (TIDES), explicitly calling for major advances in summarization technology. The datasets consist of the following parts:
• Documents
• Summaries, results, etc.
  – manually created summaries
  – automatically created baseline summaries
  – submitted summaries created by the participating groups' systems
  – tables with the evaluation results
  – additional supporting data and software
DUC distributed seven datasets from 2001 to 2007. DUC 2002 is the most popular dataset for extractive summarization ([23], [38], [56], [70]). These datasets are available at https://duc.nist.gov/data.html.

B. CNN/DAILY MAIL
It contains over 300,000 articles from CNN and the Daily Mail. The dataset is generated using a Python script available at CNN [71]. The processed version of this dataset is available on GitHub [72]. It is a very popular dataset among extractive ([58], [73]) and abstractive summarization studies ([44], [66]).

C. OPINOSIS
It is a dataset constructed from user reviews on given topics. It is very suitable for semantic analysis and has been used by multiple studies for that purpose. It consists of 51 topics, with each topic having hundreds of review sentences. It also comes with gold-standard summaries and some scripts to evaluate the performance of a summarizer using the ROUGE metric. The dataset and related material can be downloaded from Opinosis [74]. This dataset was prepared by [45], [75], and [76] for their research.

D. GIGAWORD
This dataset consists of more than 4 million articles. It is part of the TensorFlow dataset collections and is highly popular among abstractive summarization studies [77]. The source code for this dataset is available at Gigaword [78].

E. MEDLINE CORPUS
The MEDLINE corpus is provided by the NLM (National Library of Medicine). NLM produces this dataset in the form of XML documents on an annual basis. This dataset can be downloaded from [79]. Shang et al. [59] used this dataset to develop an extractive summarizer.

F. LCSTS (LARGE SCALE CHINESE SHORT TEXT SUMMARIZATION DATASET)
It is a Chinese text summarization dataset consisting of 2 million short texts from the Chinese microblogging website Sina Weibo, together with short summaries for each post written by the post authors. It is a very suitable choice for Chinese abstractive summarization systems, as the dataset is large and can be used to train neural networks efficiently. Li et al. [77] used this dataset to develop an encoder-decoder based abstractive text summarizer.

G. BC3 (BRITISH COLUMBIA UNIVERSITY DATASET)
The corpus is composed of 40 email threads (3222 records) from the W3C corpus. Each thread is annotated by three different annotators. The dataset consists of:
(i) Extractive summaries
(ii) Abstractive summaries with linked sentences
Yousefi-Azar and Hamey [36] used this dataset to develop a deep learning based extractive text summarizer.
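Several of these corpora are programmatically accessible. For example, here is a hedged sketch of loading Gigaword through TensorFlow Datasets; the dataset name and the 'document'/'summary' fields follow the public TFDS catalog.

```python
# Hedged example: load Gigaword via TensorFlow Datasets and print one
# article/summary pair. Assumes tensorflow-datasets is installed.
import tensorflow_datasets as tfds

ds = tfds.load("gigaword", split="train")
for example in ds.take(1):
    print(example["document"].numpy().decode()[:100])
    print(example["summary"].numpy().decode())
```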


H. EASC (ESSEX ARABIC SUMMARY CORPUS)
This dataset consists of Arabic articles and extractive summaries generated for those articles. It is one of the most popular Arabic datasets used in text summarization. Alami et al. [37] and Elayeb et al. [80] used this dataset for Arabic text summarization.

I. GEOCLEF
GeoCLEF is used in geographical studies. It consists of 169,447 documents; each document consists of stories and newswires from the Los Angeles Times newspaper (1994) and the Glasgow Herald newspaper (1995). It was used by Perea-Ortega et al. [55] for developing a geographical information retrieval system.

TABLE 3. Papers mentioning the ATS type in the Scopus database (based on title and keyword sections).

V. CLASSIFICATION OF SUMMARIZATION APPROACHES
Based on the summarization approach, text summarization can be further divided into three main types:
a. Extractive approach
b. Abstractive approach
c. Hybrid approach
The impact of these summarization approaches in the study mentioned above shows a growth of the abstractive type in the last decade (TABLE 3).
A selection of relevant papers was made based on quality aspects. For each of the approaches described below, and for each technique applied with that approach, we have selected those articles that, mentioning the technique used in each approach, most clearly and illustratively describe its practical application.
In the remainder of this section, we discuss the classifications of text summarization methods based on different classification parameters. The different classifications of a text summarization system are represented in FIGURE 5.

FIGURE 5. Text summarization classification.

In the following subsections, each of these approaches is discussed. We first review the extractive text methods: statistical, topic-based, clustering, graph, semantic, machine learning, deep-learning, fuzzy-logic, and discourse-based (RST) methods. Next, we discuss the abstractive text methods: graph-based, tree-based, domain-specific, and deep-learning methods; and finally, the hybrid text methods.

A. EXTRACTIVE TEXT SUMMARIZATION
In this approach, the most important sentences are selected from documents and then assembled to produce the summary. The typical workflow of the extractive-based approach is:
i. Preprocessing
ii. Intermediate representation
iii. Sentence scoring
iv. Summary construction and post-processing
The preprocessing and summary construction stages are common to most extractive text summarizers; they mostly differ in their techniques for intermediate representation and sentence scoring. Most of the research on extractive text summarization is also focused on these steps. The main extractive text summarization methods are discussed in the following sections.

1) STATISTICAL-BASED METHODS
In these methods, statistical features are used to compute a sentence's importance. Statistical features may include sentence position [10], sentence length, the number of proper nouns in the sentence, and term frequency [10]; cosine similarity can also be used for computing sentence scores [11], as shown in TABLE 4.

2) TOPIC-BASED METHODS
In this approach, the main topics of a document are extracted. Then the sentences are scored based on their coverage of the document topics. TF-IDF [6], term frequency, and document titles [12] can be used to find document topics. Further, N-gram co-occurrence and semantic sentence similarity can also identify document topics [13], as shown in TABLE 5.

3) CLUSTERING-BASED TECHNIQUES
In this method, the sentences are clustered based on some similarity measure. Then a summarizer extracts the most central sentences from each cluster and processes them to generate a summary. Clustering algorithms like k-means ([14], [15], [16], [17]) and k-medoids [18] are used for sentence clustering, as shown in TABLE 6.
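A minimal sketch of this clustering-based extraction, assuming scikit-learn is available: TF-IDF sentence vectors are clustered with k-means, and the sentence nearest each centroid is kept. The feature choice is illustrative, not the exact setup of the studies cited above.

```python
# Clustering-based extraction sketch: one representative sentence per
# k-means cluster. Assumes scikit-learn; parameters are illustrative.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_summary(sentences: list[str], k: int = 3) -> list[str]:
    X = TfidfVectorizer().fit_transform(sentences)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    picked = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(X[members].toarray() - km.cluster_centers_[c], axis=1)
        picked.append(int(members[np.argmin(dists)]))          # closest to centroid
    return [sentences[i] for i in sorted(picked)]              # document order
```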

4) GRAPH-BASED TECHNIQUES
In these methods, the document is represented as a graph of sentences. The sentences represent the nodes, and the edges represent the similarity between the nodes. The similarity can be computed using measures like cosine similarity ([6], [19], [20]). Graph-based techniques are prevalent for extractive summarizers; popular summarizers such as TextRank [21], LexRank [19] and [22] use a graph-based approach. The sentences are then scored based on the properties of the graph. A summary of such methods is shown in TABLE 7.

5) SEMANTIC-BASED TECHNIQUES
In these methods, sentence semantics are also taken into consideration. LSA (Latent Semantic Analysis), ESA (Explicit Semantic Analysis) and SRL (Semantic Role Labeling) are some ways of doing semantic analysis of textual data. Of the three, LSA is the most common and is used by most of the studies ([12], [24], [25], [26], [27]), as shown in TABLE 8. Common steps in semantic analysis using LSA are:
• Creating a matrix representation of the input.
• Applying SVD (singular value decomposition) to capture the relationship between individual terms and sentences.

6) MACHINE-LEARNING-BASED TECHNIQUES
Machine learning approaches have gained popularity in recent years. These techniques convert the text summarization problem into a supervised classification problem, in which each sentence is classified as either a 'summary' or 'non-summary' sentence. In the end, the 'summary' sentences are collected to generate the summary.
Rather than defining rules manually, the model is trained on a training set consisting of documents and their respective human-generated summaries. Various classification techniques like SVM ([27], [28], [29]), Naive Bayes ([27], [29], [30]), decision trees [30], ensemble methods ([27], [31], [32]) and neural networks ([33], [34], [35]) have been used for text summarization, as shown in TABLE 9.

7) DEEP-LEARNING BASED METHODS
Deep learning techniques are getting more and more popular for text summarization. Seq2seq and encoder-decoder based models [36] are used for extractive text summarization. Alami et al. [37] developed a deep learning and clustering-based model for Arabic text summarization. Feed-forward neural networks are also being used for extractive summarization [33]. A brief overview of these methods is shown in TABLE 10.

8) OPTIMIZATION BASED METHODS
In these techniques, the summarization problem is formulated as an optimization problem. The steps involved in an optimization-based technique are as follows:
• Preprocessing and converting the document to an intermediate representation.
• Using an optimization algorithm to extract summary sentences from the intermediate representation.
The Multi-Objective Artificial Bee Colony algorithm (MOABC) is the most common optimization algorithm, used by many studies ([28], [38], [39], [40]), as discussed in TABLE 11.

9) FUZZY-LOGIC BASED TECHNIQUES
In these techniques, fuzzy-logic based systems are used to compute the sentence scores. Fuzzy-logic techniques are popular because scores can be represented more precisely. The typical workflow of a fuzzy-logic based system is as follows:
• Extracting meaningful features from a sentence, e.g., sentence length, term weight, etc.
• Using a fuzzy system to assign scores to those features. The score ranges between 0 and 1.
Babar and Patil [12], Abbasi-ghalehtaki et al. [28], Azhari and Jaya Kumar [41], and Goularte et al. [42] developed fuzzy-systems based text summarizers. Some studies even integrated different domains, like cellular learning algorithms [28] and neural networks [41], with the fuzzy systems to further improve the results, as shown in TABLE 12.

10) DISCOURSE BASED
Discourse based studies include analyzing bigger language structures, like lexemes, grammar and context, and their effect on sentence weights. Rhetorical structure theory (RST) has been used widely by multiple studies ([34], [43]) for discourse analysis and text summarization, as shown in TABLE 13.
In recent years, it has been observed that machine-learning, deep-learning, rhetorical structure theory and fuzzy-systems based techniques are getting more popular for extractive text summarization. Hence, these techniques can be explored extensively in future research.
The main advantages and disadvantages of extractive text summarization are pointed out below:
• Extractive summarizers are easier to implement than abstractive summarizers.
• They capture more accurate information, as sentences are directly extracted from the document without altering the contents.
• Generated summaries may read less naturally, as this is not how humans generate summaries.
• Multi-document extractive summarization suffers from sentence redundancy.
• Summaries can mix information from different timelines, resulting in wrong summaries.
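To ground the graph-based family above, here is a rough LexRank-style scorer, a sketch in the spirit of [19] rather than the exact published algorithm: cosine similarities between TF-IDF sentence vectors form the edge weights, and a damped power iteration yields centrality scores.

```python
# Graph-based sentence scoring sketch (LexRank-style centrality).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def graph_scores(sentences: list[str], d: float = 0.85, iters: int = 50):
    sim = cosine_similarity(TfidfVectorizer().fit_transform(sentences))
    np.fill_diagonal(sim, 0.0)                      # no self-loops
    row_sums = sim.sum(axis=1, keepdims=True)
    trans = np.divide(sim, row_sums, out=np.zeros_like(sim), where=row_sums != 0)
    n = len(sentences)
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):                          # damped power iteration
        scores = (1 - d) / n + d * (trans.T @ scores)
    return scores                                   # higher = more central
```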


TABLE 4. Studies on statistical-based methods for extractive text summarization.

TABLE 5. Studies on topic-based methods for extractive text summarization.

TABLE 6. Studies on clustering-based techniques for extractive text summarization.

B. ABSTRACTIVE TEXT SUMMARIZATION
In this approach, the summary is generated in the same way humans summarize documents. The summary does not consist of sentences from the documents; rather, new sentences are generated by paraphrasing and merging the sentences of the original document. Abstractive text summarization requires a deeper understanding of the input document, its context and its semantics, as well as a deeper understanding of NLP concepts.


TABLE 7. Studies on graph-based techniques for extractive text summarization.

The typical workflow for an abstractive text summarizer is:
a. Preprocessing
b. Creating an intermediate representation
c. Generating the summary from the intermediate representation.
In the following subsections, the techniques and methods used in abstractive text summarization are discussed.

1) GRAPH-BASED METHODS
In these methods, the individual words are taken as the graph's nodes, and the edges represent the structure of the sentence. AMR (Abstract Meaning Representation) graphs are popular graph-based text representation methods; various sentence generators are integrated with AMR graphs for abstractive text summarization [44]. Ganesan et al. [45] developed a popular text summarizer, Opinosis. A brief overview of these methods is shown in TABLE 14.
The processing steps of the Opinosis model are as follows:
• A path in the intermediate representation is considered a candidate summary.
• The goal is to find the best path.
• To do this, all the paths are ranked and sorted in decreasing order of score.
• A similarity metric (e.g., cosine similarity) is used to remove redundant paths.
• The best path is chosen for the summary.

C. TREE-BASED METHODS
In these techniques, parsers convert text documents to parse trees. Then various tree-processing methods, like pruning and linearization, are used to generate tree summaries. Deep learning models like encoder-decoder neural networks can also be used to generate meaningful information from the parse trees [46]. Techniques like sentence fusion are also used to eliminate redundancy in the generated summary [47]. Further details about these methods are shown in TABLE 15.

D. DOMAIN-SPECIFIC METHODS
Many studies focus on domain-specific text summarizers. These studies can benefit from knowledge dictionaries unique to each domain. In addition, sentences that do not hold much importance in general text summarization can be imperative depending on the domain. Sports news may contain sport-specific keywords that are important to convey the necessary information about a game; e.g., ''out'' in cricket is considered more significant than other words like ''high''. Okumura and Miura [48] developed a sports news summarization system utilizing these domain characteristics. Lee et al. [49] developed a text summarizer for Chinese news articles. Further details about these methods are shown in TABLE 16.

E. DEEP-LEARNING BASED METHODS
Advances in deep learning have made abstractive text summarization more approachable. Sequence-to-sequence models are being explored for abstractive text summarization ([50], [51]).
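The transformer-based variant of this approach is easiest to try through pre-trained checkpoints. Here is a hedged sketch using the Hugging Face pipeline API; the model name is an illustrative choice, not one evaluated by the surveyed studies.

```python
# Hedged example of transformer-based abstractive summarization.
from transformers import pipeline

summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
article = "The World Wide Web has become an immense information resource. ..."
print(summarizer(article, max_length=60, min_length=10)[0]["summary_text"])
```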

TABLE 8. Studies on semantic based techniques for extractive text summarization.

Pre-trained transformers are also used for abstractive text summarization [51], as shown in TABLE 17.
The main advantages and disadvantages of abstractive text summarization are pointed out below:
• Generates better quality summaries, as the sentences are not directly extracted from the document.
• Summaries are safe from plagiarism.
• More complex to implement than extractive summarizers.
• Captures less information, as some of the information can be lost while rephrasing the sentences.

1) HYBRID TEXT SUMMARIZATION
In this approach, a hybrid of extractive and abstractive summarizers generates the summary. Generally, hybrid text summarizers generate better quality summaries than extractive summarizers, and they are less complex than abstractive text summarizers. Lloret et al. [52] developed a hybrid summarization system called Compendium, Gupta and Kaur [53] developed a machine learning-based model, and Binwahlan et al. [54] developed a fuzzy-systems based hybrid text summarization model. Details about a few such methods are shown in TABLE 18. Some of the advantages and disadvantages of hybrid text summarization are shown below:
• Generates better quality summaries than pure extractive models.
• Easier to implement than abstractive text summarizers.
• The quality of summaries is lower than that of pure abstractive summarizers.

VI. OTHER CLASSIFICATION CRITERIA
The following classification shows other criteria for classifying the scientific papers:
a. Classification based on the number of documents: single or multiple.
b. Classification according to the summarization domain.
c. Classification based on the number of languages used.
d. Classification based on the nature of the output.
These classifications are discussed and exemplified below.

A. BASED ON THE NUMBER OF DOCUMENTS
The text summarization methods based on the number of documents are classified into different categories, as discussed in the sections below.


TABLE 9. Studies on machine learning based methods for extractive text summarization.

TABLE 10. Studies on deep learning-based methods for extractive text summarization.

1) SINGLE-DOCUMENT
In this type, the summary is generated for a single document. This is easier than multi-document text summarization, as a single document generally has only one topic and is written in a single period. It is also less prone to redundancy than multi-document text summarization. Perea-Ortega et al. [55], Sankarasubramaniam et al. [56], Abbasi-ghalehtaki et al. [28], and Alguliyev et al. [14] developed single-document text summarizers, as shown in TABLE 19.

2) MULTI-DOCUMENT
In this type, a single summary is generated for multiple documents. It is more complex than single-document text summarization.

TABLE 11. Studies on optimization-based methods for extractive text summarization.

TABLE 12. Studies on fuzzy systems-based methods for extractive text summarization.

TABLE 13. Studies on RST-based methods for extractive text summarization.


TABLE 14. Studies on graph-based methods for abstractive text summarization.

TABLE 15. Studies on tree-based methods for abstractive text summarization.

TABLE 16. Studies on domain-based methods for abstractive text summarization.

The documents may refer to different periods, and different documents may cover different topics, which makes multi-document text summarization more challenging. Ferreira et al. [23], Nguyen et al. [57], Barzilay and McKeown [47], Xu and Durrett [58], and Patel et al. [20] developed multi-document text summarizers, as discussed in TABLE 20.

B. BASED ON THE SUMMARIZATION DOMAIN
Based on the summarization domain, text summarization is of two types, generic domain-based and specific domain-based, as discussed below.

1) GENERIC DOMAIN TEXT SUMMARIZATION
This type of text summarization is not tied to a specific domain. Here, the importance of a sentence, keyword or key phrase depends on its grammatical properties; e.g., proper nouns, numerical terms and references can be given higher importance. It is more common than domain-specific summarization, as these algorithms tend to perform well in different domains, but they may end up losing some important domain information in the summary.


TABLE 17. Studies on deep learning-based methods for abstractive text summarization.

TABLE 18. Studies on hybrid text summarization.

Ferreira et al. [23], Babar and Patil [12], and Al-Maleh and Desouki [50] worked on generic text summarizers, as shown in TABLE 21.


TABLE 19. Studies on single document text summarization.

TABLE 20. Studies on multi-document text summarization.

2) SPECIFIC DOMAIN TEXT SUMMARIZATION
This type of text summarization is concerned with a specific domain. In this type, the importance of a sentence, keyword or key phrase depends not only on its grammatical properties but also on its relation to the domain of study. This approach can capture better domain-specific summaries, as some keywords and key phrases that are important in some domains may not hold much importance in others. Shang et al. [59], Reeve et al. [60], Rouane et al. [17] and Moradi et al. [61] worked on biomedical summarization; Farzindar and Lapalme [62] and Jain et al. [1] on legal documents; Perea-Ortega et al. [55] on geographical studies; and Lloret et al. [52] on scientific paper summarization, as shown in TABLE 22.

C. BASED ON LANGUAGE
Based on language, the text summarization methods are classified into different categories, as discussed in the sections below.


TABLE 21. Studies on generic domain-based summarization.

TABLE 22. Studies on domain-specific text summarization.

1) MONOLINGUAL
In this type of summarization, the document and the summary are in the same language. Perea-Ortega et al. [55] and Sankarasubramaniam et al. [56] worked on summarizers for the English language, and Al-Maleh and Desouki [50] worked on Arabic text summarization, as shown in TABLE 23.

2) MULTILINGUAL
In this type of summarization, the document and the summary are written in multiple languages. Rani and Lobiyal [63] made a summarizer for Hindi and English, as shown in TABLE 24.

3) CROSS-LINGUAL
In this type of summarization, the document is in one language and the summary is generated in another language. Linhares Pontes et al. [64] developed a French to English text summarizer, as shown in TABLE 25.

D. BASED ON NATURE OF OUTPUT SUMMARY
Based on the nature of the output summary, the summarization methods are classified into two categories, as discussed below.


TABLE 23. Studies on mono-lingual text summarization.

1) GENERIC
The output is not influenced by external factors; the generated summary is not controlled by external queries. Babar and Patil [12], Gupta and Kaur [53], Sankarasubramaniam et al. [56], and Chatterjee and Sahoo [65] developed non-query-based text summarizers, as shown in TABLE 26.

2) QUERY-BASED
The summary can be controlled using user-defined queries, i.e., it is generated based on the user requirements. This approach is prevalent among search engines: depending on the query, some sentences can have more importance than others. Shang et al. [59], He et al. [66], Salton et al. [67], and Van Lierde and Chow ([68], [69]) developed query-based models for text summarization, as shown in TABLE 27.

VII. ANALYSIS OF POPULAR TEXT SUMMARIZATION TECHNIQUES
In this section, we perform a detailed analysis of various popular text summarization techniques. These techniques have always been a popular choice among researchers, as they are well researched, efficient, and offer the most scope for improvement. We also analyze studies incorporating these techniques; their results and enhancement ideas are discussed as well.

A. K-MEANS CLUSTERING
In this algorithm, an unlabeled dataset is divided into 'k' clusters, where the items in each cluster have properties similar to each other. For text summarization, k-means can be used to cluster sentences containing similar information. This can be helpful in removing redundant sentences and improving overall summary quality.
Alguliyev et al. [15] used the k-means algorithm on the DUC 2002 dataset and got a ROUGE-1 score of 0.4727. Mohd et al. [16] employed a k-means-based model on the DUC 2007 dataset and got a ROUGE-1 score of 0.34. This clearly indicates that k-means is a promising technique in text summarization and can produce great results.

B. LSA (LATENT SEMANTIC ANALYSIS)
In this method, a document is first converted into a term-to-sentence matrix. This representation can then be used to collect information about the words that commonly occur together, which in turn can be used to generate quality summaries. The performance of LSA-based models is further improved using SVD (Singular Value Decomposition). Babar and Patil [12] used LSA with a fuzzy system model to get a precision of 0.8654. Priya and Umamaheswari [24] used LSA with TF-IDF on a hotel review dataset to get an accuracy of 0.54. Although LSA-based models can produce significant results, most modern studies are shifting towards neural network-based models. However, an LSA model alongside a neural network-based model can definitely achieve some interesting results.

C. TEXTRANK
In this method, a document is represented in the form of a graph. Each node of the graph represents a word, and the edges between two nodes represent the relationship between two words. It also applies a voting mechanism such that nodes having more incoming edges are given higher ranks. Also, while ranking a node, the ranks of the nodes casting the votes are taken into consideration [21].
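Circling back to the LSA method of subsection B, the sketch below builds the term-by-sentence count matrix, factorizes it with SVD, and scores each sentence by its weight in the k strongest latent topics. The scoring formula is one common variant, chosen purely for illustration.

```python
# Compact LSA sentence-scoring sketch (term-sentence matrix + SVD).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def lsa_scores(sentences: list[str], k: int = 2) -> np.ndarray:
    A = CountVectorizer().fit_transform(sentences).T.toarray().astype(float)
    U, S, Vt = np.linalg.svd(A, full_matrices=False)   # A = U S Vt
    k = min(k, len(S))
    # Combine the k strongest latent topics into one score per sentence
    return np.sqrt(((S[:k, None] * Vt[:k]) ** 2).sum(axis=0))
```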


TABLE 24. Studies on multilingual text summarization.

TABLE 25. Studies on cross-lingual text summarization.

TABLE 26. Studies on generic output-based text summarization.

D. LEXRANK
Like TextRank, it is also a graph-based voting algorithm. In this algorithm, the nodes of the graph are represented by the sentences of the document, and the edges represent the similarity between two sentences. It employs a recommendation-based mechanism to compute sentence ranks [19]. Unlike TextRank, the edge weights are computed based on some similarity metric (e.g., cosine similarity), producing better output in some scenarios.

E. MOABC
This algorithm is an enhancement of the popular ABC (Artificial Bee Colony) algorithm, which is inspired by the natural food-searching behavior of honeybees. In the ABC algorithm, the optimization is done in three phases:
i. Employed bees: These bees exploit the food source, return to the hive, and report to the onlooker bees.
ii. Onlooker bees: These bees gather data from the employed bees, then select a food source to exploit.
iii. Scout bees: These bees try to find random food sources for the employed bees to exploit.
This algorithm converts the text summarization problem into an optimization problem, with the best summary representing the global minimum.
Sanchez-Gomez et al. [40] used MOABC on the DUC 2002 dataset to get a 2.23% improvement in ROUGE-2 scores over state-of-the-art methods. Abbasi-Ghalehtaki et al. [28] implemented a MOABC + cellular automata theory-based algorithm on the DUC 2002 dataset to get significant results.

F. MACHINE LEARNING TECHNIQUES
1) LOGISTIC REGRESSION
Logistic regression is a classification algorithm that is very useful in binary classification, e.g., deciding whether the gender of an author is male or female. Unlike linear regression, it models the data using a non-linear function like the sigmoid function. It can also be used for classification problems where the number of output classes is more than two. The mathematical expression for the sigmoid function is given in Eq. (5).


TABLE 27. Studies on query-based text summarization.

\phi_{sig}(z) = \frac{1}{1 + e^{-z}} \quad (5)

Neto et al. [30] used the logistic regression classifier on the TIPSTER collection; they got a precision of 0.34 for the model. Neto et al. [32] used logistic regression on the EASC (Essex Arabic Summary Corpus) and got a ROUGE-1 score of 0.129.
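A hedged sketch of the supervised framing from Section V, with logistic regression as the classifier: each sentence receives simple features and a 'summary'/'non-summary' label. The two features and the toy training data are invented purely for illustration.

```python
# ML-based extractive summarization as binary sentence classification.
import numpy as np
from sklearn.linear_model import LogisticRegression

def features(sentences: list[str]) -> np.ndarray:
    n = max(len(sentences) - 1, 1)
    # Feature 1: sentence length; feature 2: relative position
    return np.array([[len(s.split()), i / n] for i, s in enumerate(sentences)])

train = ["First important sentence.", "Filler text here.",
         "Another key finding.", "A minor aside."]
labels = np.array([1, 0, 1, 0])                   # 1 = summary sentence
clf = LogisticRegression().fit(features(train), labels)
print(clf.predict_proba(features(train))[:, 1])   # P(summary) per sentence
```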
2) SVM
The main idea behind an SVM classifier is to choose a hyperplane that can segregate n-dimensional data into different classes with minimum overlap. Support vectors are used to create the hyperplane, hence the name ''support vector machines''. In an SVM model, the distance between a point x and the hyperplane, represented by (w, b), is

d(x) = |\langle w, x \rangle + b|, \quad \text{when } \|w\| = 1 \quad (6)

Shen et al. [27] used SVM on the LookSmart web directory along with LSA and achieved significant results. Neto et al. [30] used SVM on the TIPSTER collection, reaching a precision of 0.34.

3) RANDOM FOREST
Random forest classifiers are a part of ensemble-based learning methods. Their main features are ease of implementation, efficiency, and great output in a variety of domains. In the random forest approach, many decision trees are constructed during the training stage. Then, a majority voting method is used among those decision trees during the classification stage to get the final output. Alami et al. [32] used a random forest classifier on the EASC collection and got a ROUGE-1 score of 0.129. John and Wilscy [82] used random forest and Maximum Marginal Relevance (MMR), achieving significant results. The MMR coefficient selects the sentences that have the highest relevance and the least redundancy with respect to the rest of the sentences generated for the summary.
Machine learning-based methods achieved significant results in the text summarization domain; however, due to limited dataset sizes, the models could not learn efficiently and thus could not compete with the state-of-the-art graph-based models. Neural network-based models, in contrast, overcame the limitations of machine-learning-based models and produced even better results than the state-of-the-art graph-based models.

G. NEURAL NETWORK-BASED APPROACHES
Text summarization can be formalized as a seq2seq task, where the input sequence is the input document and the output sequence is the summary. Since the input size can vary, we cannot use a traditional feed-forward neural network for this task. Seq2seq models have become very popular in recent times; the most popular seq2seq models used for text summarization are RNN, LSTM and GRU.

1) RNN
RNNs (Recurrent Neural Networks) belong to a class of neural networks that can use the previous outputs as inputs for the next state. The structure of a basic RNN model is given in FIGURE 6.

FIGURE 6. RNN.

The activation vector a^{<t>} is computed as shown in Eq. (7):

a^{\langle t \rangle} = g_1\left(W_{aa}\, a^{\langle t-1 \rangle} + W_{ax}\, x^{\langle t \rangle} + b_a\right) \quad (7)

The output value y^{<t>} is computed as shown in Eq. (8):

y^{\langle t \rangle} = g_2\left(W_{ya}\, a^{\langle t \rangle} + b_y\right) \quad (8)
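Eqs. (7)-(8) transcribe directly into code. In the sketch below, g1 is tanh and g2 the identity, and all weight shapes are illustrative assumptions.

```python
# One forward step of a vanilla RNN cell, per Eqs. (7)-(8).
import numpy as np

def rnn_step(a_prev, x_t, Waa, Wax, Wya, ba, by):
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)   # Eq. (7)
    y_t = Wya @ a_t + by                           # Eq. (8)
    return a_t, y_t

rng = np.random.default_rng(0)
h, d, o = 4, 3, 2                                  # hidden/input/output sizes
a_t, y_t = rnn_step(np.zeros(h), rng.normal(size=d),
                    rng.normal(size=(h, h)), rng.normal(size=(h, d)),
                    rng.normal(size=(o, h)), np.zeros(h), np.zeros(o))
```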


2) LSTM
Although RNNs can generate significant results for text summarization, they suffer from the ''vanishing gradient'' problem during backpropagation. This limits the learning abilities of the model. To counter this, LSTM (Long Short-Term Memory) models were introduced. In an LSTM model, a gate-based mechanism is employed in each LSTM cell that is used to memorize the relevant information. This solves the vanishing gradient problem of RNNs. The cell of an LSTM model is shown in FIGURE 7.

FIGURE 7. LSTM Unit.

3) GRU
Gated Recurrent Units (GRU) are another modification over standard RNNs that can solve the vanishing gradient problem. Similar to LSTM units, GRU units have a gate-based mechanism to store the relevant data for backpropagation training. The construction of a GRU cell is given in FIGURE 8.

FIGURE 8. GRU Unit.
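To connect the cell diagrams above to practice, here is a minimal Keras sketch of the encoder half of an LSTM seq2seq summarizer; the vocabulary and layer sizes are arbitrary illustration values.

```python
# Encoder half of an LSTM seq2seq summarizer (sketch).
import tensorflow as tf
from tensorflow.keras import layers

vocab, embed_dim, hidden = 10_000, 128, 256
encoder_in = tf.keras.Input(shape=(None,), dtype="int32")
x = layers.Embedding(vocab, embed_dim)(encoder_in)
_, state_h, state_c = layers.LSTM(hidden, return_state=True)(x)
# state_h / state_c would initialize the decoder LSTM that emits the summary
encoder = tf.keras.Model(encoder_in, [state_h, state_c])
encoder.summary()
```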
VIII. CHALLENGES AND FUTURE SCOPES
Even with these advancements in text summarization, multiple challenges still exist, and researchers are working to overcome them. These challenges can also act as future research directions for new studies. They arise in many areas, like multi-document summarization, applications of text summarization and some user-specific summarization tasks. A few of the challenges are discussed below.

A. CHALLENGES RELATED TO MULTI-DOCUMENT SUMMARIZATION
Multi-document text summarization is more complex than single-document text summarization due to the following issues:
i. Redundancy
ii. Temporal dimension
iii. Co-references
iv. Sentence reordering
Some approaches for multi-document summarization can also generate improper references. For example, assume one sentence in a document contains a proper noun, and the following sentence contains a reference to that noun; if the summarizer ranks the second sentence higher than the first and does not include the first, it will create improper references to other sentences. This is a massive challenge in multi-document summarization.

B. CHALLENGES RELATED TO APPLICATIONS OF TEXT SUMMARIZATION
Most current studies focus on a specific text domain, i.e., news, biomedical documents, etc., and some of these domains do not have significant economic value. Focusing on long texts, such as essays, dissertation theses or reports, may be more economically profitable. However, since the processing of long text requires high computational power, it remains a major challenge.

C. CHALLENGES RELATED TO USER-SPECIFIC SUMMARIZATION TASKS
Summarizing semi-structured resources like web page databases is an important application of text summarization, since most textual data is present in a semi-structured format. This type of summarization is more complex than plain text summarization, since there is much more noise in the data. Hence, developing efficient summarizers for these domains is a massive challenge.

D. CHALLENGES RELATED TO FEATURE SELECTION, PREPROCESSING AND DATASETS
For any natural language processing problem, the performance of the selected methods dramatically depends on the selection of features, and the same is valid for text summarization techniques. Irrespective of the methods, such as machine learning, statistical, fuzzy, deep learning, etc., that have been used at a large scale in recent times for such problems, selecting appropriate features for the documents to be summarized is still a significant challenge for researchers.


optimizing the commonly used features, using features for [4] A. Nenkova and K. McKeown, ‘‘A survey of text summarization tech-
semantic, adding grammatical features, linguistics features niques,’’ in Mining Text Data, C. C. Aggarwal and C. Zhai, Eds. Boston,
MA, USA: Springer, 2012, pp. 43–76, doi: 10.1007/978-1-4614-3223-4_3.
etc. Preprocessing a dataset using appropriate methods also [5] W. S. El-Kassas, C. R. Salama, A. A. Rafea, and H. K. Mohamed, ‘‘Auto-
affects the performance of the summarization methods, so it matic text summarization: A comprehensive survey,’’ Expert Syst. Appl.,
also needs attention in the future. One can explore the appro- vol. 165, Mar. 2021, Art. no. 113679, doi: 10.1016/j.eswa.2020.113679.
[6] B. Polepalli Ramesh, R. J. Sethi, and H. Yu, ‘‘Figure-associated text
priate stemming approaches, stop word removal techniques, summarization and evaluation,’’ PLoS ONE, vol. 10, no. 2, Feb. 2015,
tokenizers, and suitable POS taggers to categorize token Art. no. e0115671, doi: 10.1371/journal.pone.0115671.
[7] B. Kitchenham, O. Pearl Brereton, D. Budgen, M. Turner, J. Bailey, and
classes among nouns, verbs, adjectives, adverbs, etc. The
S. Linkman, ‘‘Systematic literature reviews in software engineering—A
creation of a new dataset is also a demanding task. Many systematic literature review,’’ Inf. Softw. Technol., vol. 51, no. 1, pp. 7–15,
little-explored domains, such as legal, tourism, health, etc., Jan. 2009, doi: 10.1016/j.infsof.2008.09.009.
[8] D. Moher, A. Liberati, J. Tetzlaff, and D. G. Altman, ‘‘Preferred reporting
need new datasets to be created and used to expedite the items for systematic reviews and meta-analyses: The PRISMA statement,’’
summarization work at a different level. BMJ, vol. 339, p. b2535, Jul. 2009, doi: 10.1136/bmj.b2535.
IX. CONCLUSION
Text summarization is an exciting research topic in the NLP community that helps to produce concise information. This study presents the latest research and progress made in the field through a systematic review of relevant research articles. We consolidated research works from different repositories related to various text summarization methods, datasets, techniques, and evaluation metrics.
We also added a section on ‘‘Analysis of Popular Text Summarization Techniques’’, which articulates the most popular techniques in the text summarization domain, gives the strengths and limitations of each technique, and hints at future research directions. We presented this information in a tabular format, covering the advantages and disadvantages of each research paper, which makes it easier for readers to use this review as a base paper for text summarization domain knowledge. We presented a detailed discussion of the different types of text summarization studies based on approach (extractive, abstractive, and hybrid), number of documents (single-document and multi-document), summarization domain (generic and domain-specific), language (monolingual, multilingual, and cross-lingual), and nature of the output summary (generic and query-based). We also presented a detailed analysis of the various studies in a tabular format, which spares readers the effort of reading through long texts. We further gave a detailed review of the datasets used in this domain, with references to each, and discussed the standard evaluation metrics (ROUGE, F-measure, recall, precision, etc.) that can be used to measure the quality of a text summarization model; a minimal worked example of these metrics is sketched below. Finally, we discussed the various challenges in text summarization that can guide future studies in the domain.
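As a concrete illustration of these metrics, the sketch below (our own illustration, not code from any surveyed system) computes ROUGE-1 precision, recall, and F-measure between a candidate summary and a single reference. Production evaluations normally rely on an established implementation that adds stemming, n-gram and longest-common-subsequence variants, and support for multiple references; this example only makes the underlying arithmetic explicit.

```python
# Illustrative ROUGE-1: unigram overlap scored as precision, recall,
# and their harmonic mean (F-measure) against one reference summary.
from collections import Counter

def rouge_1(candidate: str, reference: str) -> tuple[float, float, float]:
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())   # clipped unigram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

p, r, f = rouge_1("the cat sat on the mat", "the cat lay on the mat")
print(f"P={p:.2f}, R={r:.2f}, F1={f:.2f}")  # P=0.83, R=0.83, F1=0.83
```

Because the F-measure is the harmonic mean of precision and recall, a summary scores well only if it both covers the reference content (recall) and avoids padding (precision).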
REFERENCES
[1] D. Jain, M. D. Borah, and A. Biswas, ‘‘Summarization of legal documents: Where are we now and the way forward,’’ Comput. Sci. Rev., vol. 40, May 2021, Art. no. 100388, doi: 10.1016/j.cosrev.2021.100388.
[2] A. B. Al-Saleh and M. E. B. Menai, ‘‘Automatic Arabic text summarization: A survey,’’ Artif. Intell. Rev., vol. 45, no. 2, pp. 203–234, Feb. 2016, doi: 10.1007/s10462-015-9442-x.
[3] Y. Kumar, K. Kaur, and S. Kaur, ‘‘Study of automatic text summarization approaches in different languages,’’ Artif. Intell. Rev., vol. 54, no. 8, pp. 5897–5929, Dec. 2021, doi: 10.1007/s10462-021-09964-4.
[5] W. S. El-Kassas, C. R. Salama, A. A. Rafea, and H. K. Mohamed, ‘‘Automatic text summarization: A comprehensive survey,’’ Expert Syst. Appl., vol. 165, Mar. 2021, Art. no. 113679, doi: 10.1016/j.eswa.2020.113679.
[6] B. Polepalli Ramesh, R. J. Sethi, and H. Yu, ‘‘Figure-associated text summarization and evaluation,’’ PLoS ONE, vol. 10, no. 2, Feb. 2015, Art. no. e0115671, doi: 10.1371/journal.pone.0115671.
[7] B. Kitchenham, O. Pearl Brereton, D. Budgen, M. Turner, J. Bailey, and S. Linkman, ‘‘Systematic literature reviews in software engineering—A systematic literature review,’’ Inf. Softw. Technol., vol. 51, no. 1, pp. 7–15, Jan. 2009, doi: 10.1016/j.infsof.2008.09.009.
[8] D. Moher, A. Liberati, J. Tetzlaff, and D. G. Altman, ‘‘Preferred reporting items for systematic reviews and meta-analyses: The PRISMA statement,’’ BMJ, vol. 339, p. b2535, Jul. 2009, doi: 10.1136/bmj.b2535.
[9] P. Mongeon and A. Paul-Hus, ‘‘The journal coverage of Web of Science and Scopus: A comparative analysis,’’ Scientometrics, vol. 106, no. 1, pp. 213–228, Jan. 2016, doi: 10.1007/s11192-015-1765-5.
[10] M. Kutlu, C. Cigir, and I. Cicekli, ‘‘Generic text summarization for Turkish,’’ Comput. J., vol. 53, no. 8, pp. 1315–1323, Jan. 2010, doi: 10.1093/comjnl/bxp124.
[11] C. Bouras and V. Tsogkas, ‘‘Noun retrieval effect on text summarization and delivery of personalized news articles to the user's desktop,’’ Data Knowl. Eng., vol. 69, no. 7, pp. 664–677, Jul. 2010, doi: 10.1016/j.datak.2010.02.005.
[12] S. A. Babar and P. D. Patil, ‘‘Improving performance of text summarization,’’ Proc. Comput. Sci., vol. 46, pp. 354–363, Jan. 2015, doi: 10.1016/j.procs.2015.02.031.
[13] M. A. Tayal, M. M. Raghuwanshi, and L. G. Malik, ‘‘ATSSC: Development of an approach based on soft computing for text summarization,’’ Comput. Speech Lang., vol. 41, pp. 214–235, Jan. 2017, doi: 10.1016/j.csl.2016.07.002.
[14] R. M. Alguliyev, R. M. Aliguliyev, N. R. Isazade, A. Abdi, and N. Idris, ‘‘A model for text summarization,’’ Int. J. Intell. Inf. Technol., vol. 13, no. 1, pp. 67–85, Jan. 2017, doi: 10.4018/IJIIT.2017010104.
[15] R. M. Alguliyev, R. M. Aliguliyev, N. R. Isazade, A. Abdi, and N. Idris, ‘‘COSUM: Text summarization based on clustering and optimization,’’ Expert Syst., vol. 36, no. 1, Feb. 2019, Art. no. e12340.
[16] M. Mohd, R. Jan, and M. Shah, ‘‘Text document summarization using word embedding,’’ Expert Syst. Appl., vol. 143, Apr. 2020, Art. no. 112958, doi: 10.1016/j.eswa.2019.112958.
[17] O. Rouane, H. Belhadef, and M. Bouakkaz, ‘‘Combine clustering and frequent itemsets mining to enhance biomedical text summarization,’’ Expert Syst. Appl., vol. 135, pp. 362–373, Nov. 2019, doi: 10.1016/j.eswa.2019.06.002.
[18] Y.-H. Hu, Y.-L. Chen, and H.-L. Chou, ‘‘Opinion mining from online hotel reviews—A text summarization approach,’’ Inf. Process. Manage., vol. 53, no. 2, pp. 436–449, Mar. 2017, doi: 10.1016/j.ipm.2016.12.002.
[19] G. Erkan and D. R. Radev, ‘‘LexRank: Graph-based lexical centrality as salience in text summarization,’’ J. Artif. Intell. Res., vol. 22, pp. 457–479, Dec. 2004.
[20] D. Patel, S. Shah, and H. Chhinkaniwala, ‘‘Fuzzy logic based multi document summarization with improved sentence scoring and redundancy removal technique,’’ Expert Syst. Appl., vol. 134, pp. 167–177, Nov. 2019, doi: 10.1016/j.eswa.2019.05.045.
[21] R. Mihalcea and P. Tarau, ‘‘TextRank: Bringing order into text,’’ in Proc. Conf. Empirical Methods Natural Lang. Process., Barcelona, Spain, Jul. 2004, pp. 404–411. [Online]. Available: https://aclanthology.org/W04-3252
[22] R. Elbarougy, G. Behery, and A. El Khatib, ‘‘Extractive Arabic text summarization using modified PageRank algorithm,’’ Egyptian Informat. J., vol. 21, no. 2, pp. 73–81, Jul. 2020, doi: 10.1016/j.eij.2019.11.001.
[23] R. Ferreira, L. de Souza Cabral, F. Freitas, R. D. Lins, G. de França Silva, S. J. Simske, and L. Favaro, ‘‘A multi-document summarization system based on statistics and linguistic treatment,’’ Expert Syst. Appl., vol. 41, no. 13, pp. 5780–5787, 2014, doi: 10.1016/j.eswa.2014.03.023.
[24] V. Priya and K. Umamaheswari, ‘‘Enhanced continuous and discrete multi objective particle swarm optimization for text summarization,’’ Cluster Comput., vol. 22, no. S1, pp. 229–240, Jan. 2019, doi: 10.1007/s10586-018-2674-1.
[25] I. V. Mashechkin, M. I. Petrovskiy, D. S. Popov, and D. V. Tsarev, ‘‘Automatic text summarization using latent semantic analysis,’’ Program. Comput. Softw., vol. 37, no. 6, pp. 299–305, Nov. 2011, doi: 10.1134/S0361768811060041.
[26] J.-Y. Yeh, H.-R. Ke, W.-P. Yang, and I.-H. Meng, ‘‘Text summarization using a trainable summarizer and latent semantic analysis,’’ Inf. Process. Manage., vol. 41, no. 1, pp. 75–95, Jan. 2005, doi: 10.1016/j.ipm.2004.04.003.
[27] D. Shen, Q. Yang, and Z. Chen, ‘‘Noise reduction through summarization for web-page classification,’’ Inf. Process. Manage., vol. 43, no. 6, pp. 1735–1747, Nov. 2007, doi: 10.1016/j.ipm.2007.01.013.
[28] R. Abbasi-ghalehtaki, H. Khotanlou, and M. Esmaeilpour, ‘‘Fuzzy evolutionary cellular learning automata model for text summarization,’’ Swarm Evol. Comput., vol. 30, pp. 11–26, Oct. 2016, doi: 10.1016/j.swevo.2016.03.004.
[29] Y. Ko, J. Park, and J. Seo, ‘‘Improving text categorization using the importance of sentences,’’ Inf. Process. Manage., vol. 40, no. 1, pp. 65–79, 2004, doi: 10.1016/S0306-4573(02)00056-0.
[30] J. L. Neto, A. A. Freitas, and C. A. A. Kaestner, ‘‘Automatic text summarization using a machine learning approach,’’ in Proc. 16th Brazilian Symp. Artif. Intell., Adv. Artif. Intell., Berlin, Germany, 2002, pp. 205–215.
[31] D. D. A. Bui, G. Del Fiol, J. F. Hurdle, and S. Jonnalagadda, ‘‘Extractive text summarization system to aid data extraction from full text in systematic review development,’’ J. Biomed. Informat., vol. 64, pp. 265–272, Dec. 2016, doi: 10.1016/j.jbi.2016.10.014.
[32] N. Alami, M. Meknassi, and N. En-Nahnahi, ‘‘Enhancing unsupervised neural networks based text summarization with word embedding and ensemble learning,’’ Expert Syst. Appl., vol. 123, pp. 195–211, Jun. 2019, doi: 10.1016/j.eswa.2019.01.037.
[33] A. Sinha, A. Yadav, and A. Gahlot, ‘‘Extractive text summarization using neural networks,’’ 2018, arXiv:1802.10137.
[34] J. Xu, Z. Gan, Y. Cheng, and J. Liu, ‘‘Discourse-aware neural extractive text summarization,’’ in Proc. 58th Annu. Meeting Assoc. Comput. Linguistics, Jul. 2020, pp. 5021–5031, doi: 10.18653/v1/2020.acl-main.451.
[35] B. Mutlu, E. A. Sezer, and M. A. Akcayol, ‘‘Candidate sentence selection for extractive text summarization,’’ Inf. Process. Manage., vol. 57, no. 6, Nov. 2020, Art. no. 102359, doi: 10.1016/j.ipm.2020.102359.
[36] M. Yousefi-Azar and L. Hamey, ‘‘Text summarization using unsupervised deep learning,’’ Expert Syst. Appl., vol. 68, pp. 93–105, Feb. 2017, doi: 10.1016/j.eswa.2016.10.017.
[37] N. Alami, M. E. Mallahi, H. Amakdouf, and H. Qjidaa, ‘‘Hybrid method for text summarization based on statistical and semantic treatment,’’ Multimedia Tools Appl., vol. 80, no. 13, pp. 19567–19600, May 2021, doi: 10.1007/s11042-021-10613-9.
[38] J. M. Sanchez-Gomez, M. A. Vega-Rodríguez, and C. J. Pérez, ‘‘Comparison of automatic methods for reducing the Pareto front to a single solution applied to multi-document text summarization,’’ Knowl.-Based Syst., vol. 174, pp. 123–136, Jun. 2019, doi: 10.1016/j.knosys.2019.03.002.
[39] J. M. Sanchez-Gomez, M. A. Vega-Rodríguez, and C. J. Pérez, ‘‘Parallelizing a multi-objective optimization approach for extractive multi-document text summarization,’’ J. Parallel Distrib. Comput., vol. 134, pp. 166–179, Dec. 2019, doi: 10.1016/j.jpdc.2019.09.001.
[40] J. M. Sanchez-Gomez, M. A. Vega-Rodríguez, and C. J. Pérez, ‘‘Experimental analysis of multiple criteria for extractive multi-document text summarization,’’ Expert Syst. Appl., vol. 140, Feb. 2020, Art. no. 112904, doi: 10.1016/j.eswa.2019.112904.
[41] M. Azhari and Y. Jaya Kumar, ‘‘Improving text summarization using neuro-fuzzy approach,’’ J. Inf. Telecommun., vol. 1, no. 4, pp. 367–379, Oct. 2017, doi: 10.1080/24751839.2017.1364040.
[42] F. B. Goularte, S. M. Nassar, R. Fileto, and H. Saggion, ‘‘A text summarization method based on fuzzy rules and applicable to automated assessment,’’ Expert Syst. Appl., vol. 115, pp. 264–275, Jan. 2019, doi: 10.1016/j.eswa.2018.07.047.
[43] S. Hou and R. Lu, ‘‘Knowledge-guided unsupervised rhetorical parsing for text summarization,’’ Inf. Syst., vol. 94, Dec. 2020, Art. no. 101615, doi: 10.1016/j.is.2020.101615.
[44] S. Dohare, H. Karnick, and V. Gupta, ‘‘Text summarization using abstract meaning representation,’’ 2017, arXiv:1706.01678.
[45] K. Ganesan, C. Zhai, and J. Han, ‘‘Opinosis: A graph based approach to abstractive summarization of highly redundant opinions,’’ in Proc. 23rd Int. Conf. Comput. Linguistics (Coling), Beijing, China, Aug. 2010, pp. 340–348. [Online]. Available: https://aclanthology.org/C10-1039
[46] K. Song, L. Lebanoff, Q. Guo, X. Qiu, X. Xue, and C. Li, ‘‘Joint parsing and generation for abstractive summarization,’’ in Proc. AAAI, Apr. 2020, vol. 34, no. 5, pp. 8894–8901, doi: 10.1609/aaai.v34i05.6419.
[47] R. Barzilay and K. R. McKeown, ‘‘Sentence fusion for multidocument news summarization,’’ Comput. Linguistics, vol. 31, no. 3, pp. 297–328, Sep. 2005.
[48] N. Okumura and T. Miura, ‘‘Automatic labelling of documents based on ontology,’’ in Proc. IEEE Pacific Rim Conf. Commun., Comput. Signal Process. (PACRIM), Victoria, BC, Canada, Aug. 2015, pp. 34–39, doi: 10.1109/PACRIM.2015.7334805.
[49] C.-S. Lee, Y.-J. Chen, and Z.-W. Jian, ‘‘Ontology-based fuzzy event extraction agent for Chinese e-news summarization,’’ Expert Syst. Appl., vol. 25, no. 3, pp. 431–447, Oct. 2003, doi: 10.1016/S0957-4174(03)00062-9.
[50] M. Al-Maleh and S. Desouki, ‘‘Arabic text summarization using deep learning approach,’’ J. Big Data, vol. 7, no. 1, p. 109, Dec. 2020, doi: 10.1186/s40537-020-00386-7.
[51] U. Khandelwal, K. Clark, D. Jurafsky, and L. Kaiser, ‘‘Sample efficient text summarization using a single pre-trained transformer,’’ 2019, arXiv:1905.08836.
[52] E. Lloret, M. T. Romá-Ferri, and M. Palomar, ‘‘COMPENDIUM: A text summarization system for generating abstracts of research papers,’’ Data Knowl. Eng., vol. 88, pp. 164–175, Nov. 2013, doi: 10.1016/j.datak.2013.08.005.
[53] V. Gupta and N. Kaur, ‘‘A novel hybrid text summarization system for Punjabi text,’’ Cognit. Comput., vol. 8, no. 2, pp. 261–277, Apr. 2016, doi: 10.1007/s12559-015-9359-3.
[54] M. S. Binwahlan, N. Salim, and L. Suanmali, ‘‘Fuzzy swarm diversity hybrid model for text summarization,’’ Inf. Process. Manage., vol. 46, no. 5, pp. 571–588, Sep. 2010, doi: 10.1016/j.ipm.2010.03.004.
[55] J. M. Perea-Ortega, E. Lloret, L. Alfonso Ureña-López, and M. Palomar, ‘‘Application of text summarization techniques to the geographical information retrieval task,’’ Expert Syst. Appl., vol. 40, no. 8, pp. 2966–2974, Jun. 2013, doi: 10.1016/j.eswa.2012.12.012.
[56] Y. Sankarasubramaniam, K. Ramanathan, and S. Ghosh, ‘‘Text summarization using Wikipedia,’’ Inf. Process. Manage., vol. 50, no. 3, pp. 443–461, May 2014, doi: 10.1016/j.ipm.2014.02.001.
[57] H. Nguyen, E. Santos, and J. Russell, ‘‘Evaluation of the impact of user-cognitive styles on the assessment of text summarization,’’ IEEE Trans. Syst., Man, Cybern. A, Syst. Humans, vol. 41, no. 6, pp. 1038–1051, Nov. 2011, doi: 10.1109/TSMCA.2011.2116001.
[58] J. Xu and G. Durrett, ‘‘Neural extractive text summarization with syntactic compression,’’ in Proc. Conf. Empirical Methods Natural Lang. Process. 9th Int. Joint Conf. Natural Lang. Process., Hong Kong, Nov. 2019, pp. 3292–3303, doi: 10.48550/ARXIV.1902.00863.
[59] Y. Shang, Y. Li, H. Lin, and Z. Yang, ‘‘Enhancing biomedical text summarization using semantic relation extraction,’’ PLoS ONE, vol. 6, no. 8, Aug. 2011, Art. no. e23862, doi: 10.1371/journal.pone.0023862.
[60] L. H. Reeve, H. Han, and A. D. Brooks, ‘‘The use of domain-specific concepts in biomedical text summarization,’’ Inf. Process. Manage., vol. 43, no. 6, pp. 1765–1776, Nov. 2007, doi: 10.1016/j.ipm.2007.01.026.
[61] M. Moradi, M. Dashti, and M. Samwald, ‘‘Summarization of biomedical articles using domain-specific word embeddings and graph ranking,’’ J. Biomed. Informat., vol. 107, Jul. 2020, Art. no. 103452, doi: 10.1016/j.jbi.2020.103452.
[62] A. Farzindar and G. Lapalme, ‘‘Legal text summarization by exploration of the thematic structure and argumentative roles,’’ in Proc. Text Summarization Branches Out, Barcelona, Spain, Jul. 2004, pp. 27–34. [Online]. Available: https://aclanthology.org/W04-1006
[63] R. Rani and D. K. Lobiyal, ‘‘An extractive text summarization approach using tagged-LDA based topic modeling,’’ Multimedia Tools Appl., vol. 80, no. 3, pp. 3275–3305, Jan. 2021, doi: 10.1007/s11042-020-09549-3.
[64] E. Linhares Pontes, S. Huet, J.-M. Torres-Moreno, and A. C. Linhares, ‘‘Compressive approaches for cross-language multi-document summarization,’’ Data Knowl. Eng., vol. 125, Jan. 2020, Art. no. 101763, doi: 10.1016/j.datak.2019.101763.
[65] N. Chatterjee and P. K. Sahoo, ‘‘Random indexing and modified random indexing based approach for extractive text summarization,’’ Comput. Speech Lang., vol. 29, no. 1, pp. 32–44, Jan. 2015, doi: 10.1016/j.csl.2014.07.001.
[66] J. He, W. Kryściński, B. McCann, N. Rajani, and C. Xiong, ‘‘CTRLsum: Towards generic controllable text summarization,’’ 2020, arXiv:2012.04281.
[67] G. Salton, J. Allan, C. Buckley, and A. Singhal, ‘‘Automatic analysis, theme generation, and summarization of machine-readable texts,’’ Science, vol. 264, no. 5164, pp. 1421–1426, Jun. 1994, doi: 10.1126/science.264.5164.1421.
[68] H. Van Lierde and T. W. S. Chow, ‘‘Query-oriented text summarization based on hypergraph transversals,’’ Inf. Process. Manage., vol. 56, no. 4, pp. 1317–1338, Jul. 2019, doi: 10.1016/j.ipm.2019.03.003.
[69] H. Van Lierde and T. W. S. Chow, ‘‘Learning with fuzzy hypergraphs: A topical approach to query-oriented text summarization,’’ Inf. Sci., vol. 496, pp. 212–224, Sep. 2019, doi: 10.1016/j.ins.2019.05.020.
[70] A. Joshi, E. Fidalgo, E. Alegre, and L. Fernández-Robles, ‘‘SummCoder: An unsupervised framework for extractive text summarization based on deep auto-encoders,’’ Expert Syst. Appl., vol. 129, pp. 200–215, Sep. 2019, doi: 10.1016/j.eswa.2019.03.045.
[71] CNN-Dailymail. GitHub. Accessed: Jun. 26, 2022. [Online]. Available: https://github.com/abisee/cnn-dailymail
[72] GitHub—JafferWilson/Process-Data-of-CNN-DailyMail: This Repository Holds the Output of the Repository. Accessed: Jun. 26, 2022. [Online]. Available: https://github.com/abisee/cnn-dailymail and https://github.com/JafferWilson/Process-Data-of-CNN-DailyMail
[73] H. M. Lynn, C. Choi, and P. Kim, ‘‘An improved method of automatic text summarization for web contents using lexical chain with semantic-related terms,’’ Soft Comput., vol. 22, no. 12, pp. 4013–4023, Jun. 2018, doi: 10.1007/s00500-017-2612-9.
[74] K. Ganesan. Opinosis Dataset—Topic Related Review Sentences. Accessed: Jun. 30, 2022. [Online]. Available: https://kavita-ganesan.com/opinosis-opinion-dataset/
[75] R. C. Belwal, S. Rai, and A. Gupta, ‘‘A new graph-based extractive text summarization using keywords or topic modeling,’’ J. Ambient Intell. Humanized Comput., vol. 12, no. 10, pp. 8975–8990, Oct. 2021, doi: 10.1007/s12652-020-02591-x.
[76] R. C. Belwal, S. Rai, and A. Gupta, ‘‘Text summarization using topic-based vector space model and semantic measure,’’ Inf. Process. Manage., vol. 58, no. 3, May 2021, Art. no. 102536, doi: 10.1016/j.ipm.2021.102536.
[77] P. Li, W. Lam, L. Bing, and Z. Wang, ‘‘Deep recurrent generative decoder for abstractive text summarization,’’ in Proc. Conf. Empirical Methods Natural Lang. Process., Copenhagen, Denmark, 2017, pp. 2091–2100, doi: 10.18653/v1/D17-1222.
[78] Datasets/Gigaword.py at Master · Tensorflow/Datasets. GitHub. Accessed: Jun. 26, 2022. [Online]. Available: https://github.com/tensorflow/datasets
[79] MEDLINE/PubMed Data. Accessed: Jun. 26, 2022. [Online]. Available: https://www.nlm.nih.gov/databases/download/pubmed_medline.html
[80] B. Elayeb, A. Chouigui, M. Bounhas, and O. B. Khiroun, ‘‘Automatic Arabic text summarization using analogical proportions,’’ Cognit. Comput., vol. 12, no. 5, pp. 1043–1069, Sep. 2020, doi: 10.1007/s12559-020-09748-y.
[81] Y. Gao, W. Zhao, and S. Eger, ‘‘SUPERT: Towards new frontiers in unsupervised evaluation metrics for multi-document summarization,’’ 2020, arXiv:2005.03724.
[82] A. John and M. Wilscy, ‘‘Random forest classifier based multi-document summarization system,’’ in Proc. IEEE Recent Adv. Intell. Comput. Syst. (RAICS), Trivandrum, India, Dec. 2013, pp. 31–36, doi: 10.1109/RAICS.2013.6745442.

DIVAKAR YADAV (Senior Member, IEEE) received the B.Tech. degree in computer science and engineering from IET Lucknow, in 1999, the M.Tech. degree in information technology from IIIT Allahabad, in 2005, and the Ph.D. degree in computer science and engineering, in 2010. He spent one year as a Postdoctoral Fellow with the Universidad Carlos III de Madrid, Spain, from 2011 to 2012, and has more than 22 years of teaching and research experience. Since September 2022, he has been working as a Professor with the School of Computer and Information Sciences, Indira Gandhi National Open University (IGNOU), New Delhi. Prior to joining IGNOU, he worked as an Associate Professor and the Head of the Department of Computer Science & Engineering, National Institute of Technology Hamirpur (NIT Hamirpur), an Institution of National Importance, from 2019 to 2022; at Madan Mohan Malaviya University of Technology (MMMUT), Gorakhpur (UP), from 2016 to 2019; and at the Jaypee Institute of Information Technology, Noida (UP), from 2005 to 2016. He has supervised eight Ph.D. theses and 31 M.Tech. dissertations and has published more than 125 research papers in reputed international journals and conference proceedings. His research interests include machine learning, soft computing, information retrieval, NLP, and e-learning. He is a member of ACM.

RISHABH KATNA received the bachelor's and master's degrees in computer science and engineering from the National Institute of Technology Hamirpur (NIT Hamirpur), Hamirpur, in 2021 and 2022, respectively. He worked with Standard Chartered GBS as an Intern Software Engineer and is currently working as a Software Engineer at Qualcomm. His research interests include social networking, automation for web developers, and natural language processing.

ARUN KUMAR YADAV received the Ph.D. degree in computer science and engineering, in 2016. He is currently an Assistant Professor with the Department of Computer Science and Engineering, National Institute of Technology Hamirpur (NIT Hamirpur). He is also working on government-sponsored funded projects and has supervised many students. He has published more than 20 research papers in reputed international/national journals and conference proceedings. His research interests include information retrieval, machine learning, and deep learning.

JORGE MORATO received the Ph.D. degree in library science from the Universidad Carlos III de Madrid, Spain, on the topic of knowledge information systems and their relationship with linguistics. He is currently a Professor of information science with the Department of Computer Science, Universidad Carlos III de Madrid. His research interests include NLP, information retrieval, web positioning, and knowledge organization systems.