Feature Based Automatic Text Summarization Methods: A Comprehensive State-of-the-Art Survey
ABSTRACT With the advent of the World Wide Web, there are numerous online platforms that generate
huge amounts of textual material, including social networks, online blogs, magazines, etc. This textual
content contains useful information that can be used to advance humanity. Text summarization has been
a significant area of research in natural language processing (NLP). With the expansion of the internet,
the amount of data in the world has exploded. Large volumes of data make locating the required and best
information time-consuming. It is impractical to manually summarize petabytes of data; hence, computerized
text summarization is rising in popularity. This study presents a comprehensive overview of the current status of text summarization approaches, techniques, standard datasets, assessment criteria, and future research directions. The summarization approaches are assessed on several characteristics, including the underlying approach, the number of documents, the summarization domain, the document language, and the nature of the output summary. This study concludes with a discussion of the many obstacles and research opportunities linked to text summarization research that may be relevant for future researchers in this field.
TABLE 1. Queries performed in Scopus and WoS Core Collection databases.

Several observations can be made regarding the number of works published in the various languages: 1) the number of works does not reflect the number of speakers; for example, Nigerian Pidgin is the 14th most spoken language, but it is not mentioned in the results; 2) there are languages among the 30 most spoken that have no study, such as Cantonese, Tagalog, Hausa, Swahili, Nigerian Pidgin, and Javanese; 3) Indian languages are well represented: Bengali (28), Hindi (17), Punjabi (8), Kannada (8), Telugu (7), Konkani (5), Assamese (4), Tamil (2), and Marathi (2). However, the representation of Hindi, the third most spoken language, is inadequate, and other languages, such as Nepali, are not mentioned.
ROUGE-N = ( Σ_{S ∈ {Reference summaries}} Σ_{gram_n ∈ S} Count_match(gram_n) ) / ( Σ_{S ∈ {Reference summaries}} Σ_{gram_n ∈ S} Count(gram_n) )   (1)
b. ROUGE-L: Here, L stands for the longest common subsequence (LCS). A sentence is represented as a sequence of words; the longer the LCS between a candidate summary sentence and a manual summary sentence, the better the quality of the summary.
c. ROUGE-W: Here, W stands for weighted LCS. It addresses a limitation of the basic LCS, which cannot differentiate between LCSs occupying different spatial relations within their embedding sequences; ROUGE-W therefore assigns higher weight to consecutive matches.
d. ROUGE-S: Here, S stands for skip-bigram co-occurrence statistics. Skip-bigrams are bigrams whose two words do not have to appear adjacently in a sentence. For the sentence "I am Ram", the skip-bigrams generated will be {("I", "am"), ("I", "Ram"), ("am", "Ram")}. ROUGE-S uses skip-bigrams to compute the similarity between the generated summary and the manual summaries.
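To make Eq. (1) concrete, the following is a minimal Python sketch of ROUGE-N counting (ours, for illustration only; the official ROUGE toolkit and packages such as rouge-score implement the full metric):

from collections import Counter

def ngrams(tokens, n):
    # Multiset of word n-grams for a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, references, n=2):
    # Eq. (1): clipped n-gram matches summed over all reference summaries,
    # divided by the total number of n-grams in the references.
    cand = ngrams(candidate.lower().split(), n)
    match = total = 0
    for ref in references:
        ref_counts = ngrams(ref.lower().split(), n)
        match += sum(min(cand[g], c) for g, c in ref_counts.items())
        total += sum(ref_counts.values())
    return match / total if total else 0.0

print(rouge_n("the cat sat on the mat", ["the cat lay on the mat"], n=2))  # 0.6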
Given that we aim to study the current trajectory in computing, we restricted our search to the years 2011 through 2021. The executed queries are listed in TABLE 1. Figures 3 and 4 reveal a strong upward trend that has become more evident since 2018. In the previous two years there has been a slowdown, although this may be related to the time required to update the databases' publication records. The trend of citations is quite progressive, indicating a sustained focus on the achievements made in ATS during the period. In fact, the h-index in WoS is 39, while in Scopus it is 56.

Most systems are language-dependent, and the dearth of native speakers or digital resources in certain languages impedes study. Analyzing the summaries, titles, and keywords in Scopus shows that most of the studied languages are amongst the most spoken languages in the world (TABLE 2).

B. GENERIC PERFORMANCE METRICS

1) PRECISION
It is computed by dividing the number of sentences common to the reference and candidate summaries by the number of sentences in the candidate summary, as shown in Eq. (2).

Precision = N(Sr ∩ Sc) / N(Sc)   (2)
where,
Sr := Reference summary
Sc := Candidate summary
N(S) := Number of sentences in summary S.

2) RECALL
It is computed by dividing the number of sentences common to the reference and candidate summaries by the number of sentences in the reference summary, as shown in Eq. (3).

Recall = N(Sr ∩ Sc) / N(Sr)   (3)

3) F-MEASURE
It is computed as the harmonic mean of precision and recall, as shown in Eq. (4).

F = 2(Precision)(Recall) / (Precision + Recall)   (4)
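A minimal sketch of Eqs. (2)-(4) (ours; summaries are treated as sets of sentences, with exact-match overlap):

def overlap_metrics(reference, candidate):
    # `reference` and `candidate` are lists of sentences.
    common = set(reference) & set(candidate)      # N(Sr ∩ Sc)
    precision = len(common) / len(candidate)      # Eq. (2)
    recall = len(common) / len(reference)         # Eq. (3)
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)          # Eq. (4)
    return precision, recall, f

ref = ["the cat sat", "it was sunny", "the mat was red"]
cand = ["the cat sat", "the mat was red", "dogs barked"]
print(overlap_metrics(ref, cand))  # (0.667, 0.667, 0.667)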
C. SUPERT
Summarization Evaluation with Pseudo references and BERT (SUPERT) is an unsupervised summary evaluation metric for evaluating multi-document summaries by measuring the semantic similarity between the summary and a pseudo-reference summary. SUPERT was proposed by [81]. A limitation of ROUGE is that it needs manual summaries to judge the quality of a summary; SUPERT can therefore be used to evaluate summaries on datasets that do not have manual summaries.

IV. DATASETS FOR TEXT SUMMARIZATION
In this section, we discuss the datasets for text summarization that are popular among researchers.
A. DOCUMENT UNDERSTANDING CONFERENCES (DUC)
The National Institute of Standards and Technology (NIST) provides these groups of datasets. DUC is part of a Defense Advanced Research Projects Agency (DARPA) program, Translingual Information Detection, Extraction, and Summarization (TIDES), which explicitly calls for major advances in summarization technology. The datasets consist of the following parts:
• Documents
• Summaries, results, etc.
– manually created summaries
– automatically created baseline summaries
– summaries submitted by the participating groups' systems
– tables with the evaluation results
– additional supporting data and software

C. OPINOSIS
It is a dataset constructed from user reviews on a given topic. It is very suitable for semantic analysis and has been used by multiple studies for that purpose. It consists of 51 topics, with each topic having hundreds of review sentences. It also comes with gold-standard summaries and scripts to evaluate the performance of a summarizer using the ROUGE metric. The dataset and related material can be downloaded from Opinosis [74]. This dataset was used by [45], [75], and [76] in their research.

D. GIGAWORD
This dataset consists of more than 4 million articles. It is part of the TensorFlow dataset collections and is highly popular among abstractive summarization studies [77]. The source code for this dataset is available at Gigaword [78].
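Programmatic access to Gigaword is straightforward; a minimal sketch (ours) using the tensorflow_datasets package referenced in [78]:

import tensorflow_datasets as tfds

# Load the Gigaword article/headline pairs (train split).
train = tfds.load("gigaword", split="train")

for example in train.take(1):
    # Each record pairs a source document with a headline-style summary.
    print(example["document"].numpy().decode("utf-8"))
    print(example["summary"].numpy().decode("utf-8"))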
E. MEDLINE CORPUS
The MEDLINE corpus is provided by the NLM (National Library of Medicine). NLM produces this dataset in the form of XML documents on an annual basis. This dataset can be downloaded from [79]. Shang et al. [59] used this dataset to develop an extractive summarizer.

F. LCSTS (LARGE SCALE CHINESE SHORT TEXT SUMMARIZATION DATASET)
It is a Chinese text summarization dataset consisting of 2 million short texts from the Chinese microblogging website Sina Weibo, each accompanied by a short summary written by the blog's author. It is a very suitable choice for Chinese abstractive summarization systems, as the dataset is large enough to train neural networks efficiently. Li et al. [77] used this dataset to develop an encoder-decoder based abstractive text summarizer.

G. BC3 (BRITISH COLUMBIA UNIVERSITY DATASET)
The corpus is composed of 40 email threads (3222 records) from the W3C corpus. Each thread is annotated by three different annotators. The dataset consists of:
(i) Extractive summaries
(ii) Abstractive summaries with linked sentences
TABLE 3. Papers mentioning the ATS type in the Scopus database (based on title and keyword sections).
k-means ([14], [15], [16], [17]), k-medoids [18], etc. are used for sentence clustering, as shown in TABLE 6.

4) GRAPH-BASED TECHNIQUES
In these methods, the document is represented as a graph of sentences. The sentences represent the nodes, and the edges represent the similarity between the nodes. The similarity between sentences can be computed using similarity measures such as cosine similarity ([6], [19], [20]). Graph-based techniques are prevalent for extractive summarizers; popular summarizers such as TextRank [21], LexRank [19], and [22] use a graph-based approach. The sentences are then scored based on the properties of the graph, as illustrated in the sketch below. A summary of such methods is shown in TABLE 7.
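A minimal sketch of graph-based sentence ranking in the spirit of TextRank/LexRank (ours; TF-IDF cosine similarity plus PageRank, assuming scikit-learn and networkx are available):

import numpy as np
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "Text summarization condenses a document into a short summary.",
    "Graph-based methods score sentences using centrality measures.",
    "The cat sat on the mat.",
]

# Nodes are sentences; edge weights are cosine similarities of TF-IDF vectors.
tfidf = TfidfVectorizer().fit_transform(sentences)
sim = cosine_similarity(tfidf)
np.fill_diagonal(sim, 0.0)          # drop self-similarity loops

graph = nx.from_numpy_array(sim)    # weighted, undirected sentence graph
scores = nx.pagerank(graph)         # PageRank-style sentence scores

best = max(scores, key=scores.get)  # top-ranked sentence = one-line extract
print(sentences[best])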
5) SEMANTIC-BASED TECHNIQUES
In these methods, sentence semantics are also taken into consideration. LSA (Latent Semantic Analysis), ESA (Explicit Semantic Analysis), and SRL (Semantic Role Labeling) are some ways of performing semantic analysis of textual data. Of the three, LSA is the most common and is used by most of the studies ([12], [24], [25], [26], [27]), as shown in TABLE 8. The common steps in semantic analysis using LSA (illustrated in the sketch after this list) are:
• Creating a matrix representation of the input.
• Applying SVD (Singular Value Decomposition) to capture the relationships between individual terms and sentences.
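A minimal sketch of these two steps (ours, assuming scikit-learn; a term-sentence TF-IDF matrix followed by truncated SVD):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

sentences = [
    "The economy grew last quarter.",
    "Stock markets rallied on the growth figures.",
    "The cat sat on the mat.",
]

# Step 1: matrix representation (rows = sentences, columns = terms).
matrix = TfidfVectorizer().fit_transform(sentences)

# Step 2: SVD projects sentences into a low-rank latent topic space.
svd = TruncatedSVD(n_components=2, random_state=0)
latent = svd.fit_transform(matrix)

# A simple LSA-style score: a sentence's strength in the top latent topic.
scores = latent[:, 0]
print(sentences[scores.argmax()])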
8) OPTIMIZATION-BASED METHODS
In these techniques, the summarization problem is formulated as an optimization problem. The steps involved in an optimization-based technique are as follows:
• Preprocessing and converting the document to an intermediate representation (IR).
• Using an optimization algorithm to extract summary sentences from the IR.
The Multi-Objective Artificial Bee Colony algorithm (MOABC) is the most common optimization algorithm, used by many studies ([28], [38], [39], [40]), as discussed in TABLE 11.

9) FUZZY-LOGIC BASED TECHNIQUES
In these techniques, fuzzy-logic based systems are used to compute the sentence scores. Fuzzy-logic techniques are popular because scores can be represented more precisely. The typical workflow of a fuzzy-logic based system (sketched in code after this subsection) is:
• Extracting meaningful features from a sentence, e.g., sentence length, term weight, etc.
• Using a fuzzy system to assign scores to those features; the scores range between 0 and 1.
Babar and Patil [12], Abbasi-ghalehtaki et al. [28], Azhari and Jaya Kumar [41], and Goularte et al. [42] developed fuzzy-systems based text summarizers. Some studies even integrated other paradigms, such as cellular learning automata [28] and neural networks [41], with the fuzzy systems to further improve the results, as shown in TABLE 12.
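A minimal sketch of that workflow (ours; a hand-rolled triangular membership function stands in for a full fuzzy inference system):

def triangular(x, a, b, c):
    # Triangular fuzzy membership: rises from a to peak b, falls to c.
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fuzzy_sentence_score(sentence, doc_max_len=30):
    words = sentence.split()
    # Feature 1: sentence length, fuzzified to prefer mid-length sentences.
    length_score = triangular(len(words), 2, 12, doc_max_len)
    # Feature 2: crude term weight - proportion of longer (content-like) words.
    term_score = sum(len(w) > 4 for w in words) / max(len(words), 1)
    # Aggregate the fuzzy scores (simple average) into a [0, 1] sentence score.
    return (length_score + term_score) / 2

print(fuzzy_sentence_score("Fuzzy systems assign graded sentence scores."))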
B. ABSTRACTIVE TEXT SUMMARIZATION
In this approach, the summary is generated in the same way humans summarize documents. The summary does not consist of sentences from the documents; rather, new sentences are generated by paraphrasing and merging the sentences of the original document. Abstractive text summarization requires a deeper understanding of the input document, its context, and its semantics. It also requires a deeper understanding of NLP concepts. The typical workflow for an abstractive text summarizer is:
a. Preprocessing
b. Creating an intermediate representation
c. Generating the summary from the intermediate representation.
In the following subsections, the techniques and methods used in abstractive text summarization are discussed.
1) GRAPH-BASED METHODS
In these methods, the individual words are taken as the graph's nodes, and the edges represent the structure of the sentence. AMR (Abstract Meaning Representation) graphs are popular graph-based text representation methods, and various sentence generators are integrated with AMR graphs for abstractive text summarization [44]. Ganesan et al. [45] developed a popular text summarizer, Opinosis. A brief overview of these methods is shown in TABLE 14. The processing steps of the OPINOSIS model (sketched after this list) are as follows:
• A path in the intermediate graph representation is considered as a candidate summary.
• The goal is to find the best path.
• To do this, rank all the paths and sort them in order of decreasing score.
• Use a similarity measure (e.g., cosine similarity) to remove redundant paths.
• The best path is chosen for the summary.
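A minimal sketch of the path-ranking-and-deduplication idea (ours, heavily simplified; the real Opinosis system derives the candidate paths from a word-adjacency graph):

def jaccard(a, b):
    # Word-overlap similarity between two candidate paths.
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

# Candidate paths through the word graph, with precomputed scores.
paths = [
    ("battery life is great", 0.9),
    ("the battery life is really great", 0.85),
    ("screen is too dim", 0.7),
]

# Rank paths by decreasing score, then drop near-duplicates.
summary = []
for text, score in sorted(paths, key=lambda p: p[1], reverse=True):
    if all(jaccard(text, kept) < 0.5 for kept in summary):
        summary.append(text)

print(summary)  # the best non-redundant paths form the summary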
C. TREE-BASED METHODS
In these techniques, parsers convert text documents into parse trees. Then various tree-processing methods, like pruning and linearization, are used to generate summaries. Deep learning models like encoder-decoder neural networks can also be used to generate meaningful information from the parse trees [46]. Techniques like sentence fusion are also used to eliminate redundancy in the generated summary [47]. Further details about these methods are shown in TABLE 15.

D. DOMAIN-SPECIFIC METHODS
Many studies focus on domain-specific text summarizers. These studies can benefit from knowledge dictionaries unique to each domain. In addition, sentences that do not hold much importance in normal text summarization can be imperative depending on the domain. Sports news may contain sport-specific keywords that are important to convey the necessary information about a game; e.g., "out" in cricket is considered an important word that is more significant than other words like "high". Okumura and Miura [48] developed a sports news summarization system utilizing the above domain characteristics. Lee et al. [49] developed a text summarizer for Chinese news articles. Further details about these methods are shown in TABLE 16.
E. DEEP-LEARNING BASED METHODS
Advances in deep learning have made abstractive text summarization more approachable. Sequence-to-sequence models are being explored for abstractive text summarization ([50], [51]), and pre-trained transformers are also used for abstractive text summarization [51], as shown in TABLE 17.
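For instance, a minimal sketch using a pre-trained seq2seq transformer through the Hugging Face transformers pipeline (our illustration; any pre-trained summarization checkpoint could be substituted for the default one):

from transformers import pipeline

# Load a pre-trained encoder-decoder summarization model.
summarizer = pipeline("summarization")

article = (
    "The expansion of the internet has produced huge volumes of text. "
    "Automatic summarizers condense such documents into short summaries, "
    "and pre-trained transformers now produce fluent abstractive output."
)

print(summarizer(article, max_length=30, min_length=10)[0]["summary_text"])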
The main advantages and disadvantages of abstractive text summarization are pointed out below:
• Generates better quality summaries, as the sentences are not directly extracted from the document.
• Summaries are safe from plagiarism.
• More complex to implement than extractive summarizers.
• Captures less information, as some information can be lost while rephrasing the sentences.
1) HYBRID TEXT SUMMARIZATION
In this approach, a hybrid of extractive and abstractive summarizers generates the summary. Generally, hybrid text summarizers generate better quality summaries than extractive summarizers, and they are less complex than abstractive text summarizers. Lloret et al. [52] developed a hybrid summarization system called COMPENDIUM, Gupta and Kaur [53] developed a machine learning-based model, and Binwahlan et al. [54] developed a fuzzy-systems based hybrid text summarization model. Details about a few such methods are shown in TABLE 18. Some of the advantages and disadvantages of hybrid text summarization are shown below:
• Generates better quality summaries than pure extractive models.
• It is easier to implement than pure abstractive text summarizers.
• The quality of summaries is lower than that of pure abstractive summarizers.

VI. OTHER CLASSIFICATION CRITERIA
The following are other criteria for classifying text summarization studies:
a. Classification based on the number of documents: single or multiple.
b. Classification according to the summarization domain.
c. Classification based on the number of languages used.
d. Classification based on the nature of the output.
These classifications are discussed and exemplified below.

A. BASED ON THE NUMBER OF DOCUMENTS
The text summarization methods based on the number of documents are classified into different categories, as discussed in the sections below.
TABLE 9. Studies on machine learning based methods for extractive text summarization.
TABLE 10. Studies on deep learning-based methods for extractive text summarization.
TABLE 12. Studies on fuzzy systems-based methods for extractive text summarization.
summarization, as the documents may refer to different periods. In addition, different documents may cover different topics, which makes multi-document text summarization more challenging. Ferreira et al. [23], Nguyen et al. [57], Barzilay and McKeown [47], Xu and Durrett [58], and Patel et al. [20] developed multi-document text summarizers, as discussed in TABLE 20.

TABLE 17. Studies on deep learning-based methods for abstractive text summarization.

B. BASED ON THE SUMMARIZATION DOMAIN
Based on the summarization domain, text summarization is of two types: generic domain-based and specific domain-based text summarization, as discussed below:

1) GENERIC DOMAIN TEXT SUMMARIZATION
This type of text summarization is not tied to a specific domain. In this type of summarization, the importance of a sentence, keyword, or key phrase depends on its grammatical properties; e.g., proper nouns, numerical terms, and references can be given higher importance. It is more common than domain-specific summarization, as these algorithms tend to perform well across different domains but may end up losing some important domain information in the summary. Ferreira et al. [23], Babar and Patil [12], and Al-Maleh and Desouki [50] worked on generic text summarizers, as shown in TABLE 21.
represent the similarity between two sentences. It employs a recommendation-based mechanism to compute sentence ranks [19]. Unlike TextRank, the edge weights are computed based on some similarity metric (e.g., cosine similarity), producing better output in some scenarios.
E. MOABC
This algorithm is an enhancement of the popular ABC (Artificial Bee Colony) algorithm. The ABC algorithm is inspired by the natural food-searching behavior of honeybees. In the ABC algorithm, the optimization is done in three phases:
i. Employed bees: These bees exploit the food source, return to the hive, and report to the onlooker bees.
ii. Onlooker bees: These bees gather data from the employed bees, then select the food source to exploit.
iii. Scout bees: These bees try to find random food sources for the employed bees to exploit.
MOABC converts the text summarization problem into an optimization problem, with the best summary representing the global minimum. Sanchez-Gomez et al. [40] used MOABC on the DUC 2002 dataset to obtain a 2.23% improvement in ROUGE-2 scores over state-of-the-art methods. Abbasi-Ghalehtaki et al. [28] implemented a MOABC + cellular automata theory-based algorithm on the DUC 2002 dataset and obtained significant results.

F. MACHINE LEARNING TECHNIQUES
1) LOGISTIC REGRESSION
Logistic regression is a classification algorithm that is very useful in binary classification, e.g., deciding whether the gender of an author is male or female. Unlike linear regression, it models the data using a non-linear function such as the sigmoid function. It can also be used for classification problems where the number of output classes is more than two. The mathematical expression for the sigmoid function is given in the following equation.

φ_sig(z) = 1 / (1 + e^(−z))   (5)
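A minimal sketch of sigmoid-based sentence scoring (ours; two toy features fed through Eq. (5), with hypothetical hand-set weights rather than trained ones):

import math

def sigmoid(z):
    # Eq. (5): squashes any real score into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def sentence_probability(sentence, w_len=0.3, w_lead=1.5, bias=-2.0):
    # Toy features: sentence length and a crude "leading sentence" indicator.
    words = len(sentence.split())
    is_lead = sentence.startswith("Text summarization")
    z = w_len * words + w_lead * is_lead + bias
    return sigmoid(z)  # probability the sentence belongs in the summary

print(sentence_probability("Text summarization condenses long documents."))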
Neto et al. [30] used the logistic regression classifier on the TIPSTER collection; they obtained a precision of 0.34 for the model. Alami et al. [32] used logistic regression on the EASC (Essex Arabic Summaries Corpus) and obtained a ROUGE-1 score of 0.129.
The main idea behind an SVM classifier is to choose a hyper-
plane that can segregate n-dimensional data into different G. NEURAL NETWORK-BASED APPROACHES
classes with minimum overlapping. Support vectors are used Task summarization can be formalized as a seq2seq model,
to create the hyperplane, hence the name ‘Support vector where input sequence is the input document, the output
machines’. In an SVM model, the distance between a point sequence is the output summary. Since the input size can keep
x and the hyperplane, represented by (w, b), where, varying, we cannot use a traditional neural network for this
task. These seq2seq models are getting very popular in recent
||W || = 1if| < w, x > +b| (6)
times. The most popular seq2seq models being used in for text
Shen et al. [27] used SVM on the LookSmart web directory summarization are RNN, LSTM anGRU.
along with LSA and achieved significant results. Neto et al.
[30] used SVM on the TIPSTER collection to a precision 1) RNN
of 0.34. RNN (Recurrent Neural Networks) belong to a class of neu-
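A minimal numeric sketch of Eq. (6) (ours; normalizing w to unit length before taking the distance):

import numpy as np

w = np.array([3.0, 4.0])   # hyperplane normal (not yet unit length)
b = -2.0
x = np.array([1.0, 1.0])   # query point

# Normalize so ||w|| = 1, as Eq. (6) assumes; rescale b accordingly.
norm = np.linalg.norm(w)
w_unit, b_unit = w / norm, b / norm

distance = abs(w_unit @ x + b_unit)   # Eq. (6): |<w, x> + b|
print(distance)                       # 1.0 for this toy hyperplane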
3) RANDOM FOREST
Random forest classifiers are a part of ensemble-based learning methods. Their main features are ease of implementation, efficiency, and great output in a variety of domains. In the random forest approach, many decision trees are constructed during the training stage; then, a majority-voting method is used among those decision trees during the classification stage to get the final output. Alami et al. [32] used a random forest classifier on the EASC collection and got a ROUGE-1 score of 0.129. John and Wilscy [82] used random forest and Maximum Marginal Relevance (MMR), achieving significant results. The MMR coefficient selects the sentences that have the highest relevance, with the least redundancy with respect to the rest of the sentences generated for the summary.

Machine learning-based methods achieved significant results in the text summarization domain; however, due to limited dataset sizes, the models could not learn efficiently, and thus they could not compete with the state-of-the-art graph-based models. However, neural network-based models overcame the limitations of machine learning-based models and produced even better results than the state-of-the-art graph-based models.

G. NEURAL NETWORK-BASED APPROACHES
Text summarization can be formalized as a seq2seq task, where the input sequence is the input document and the output sequence is the summary. Since the input size can keep varying, we cannot use a traditional neural network for this task. These seq2seq models have become very popular in recent times. The most popular seq2seq models used for text summarization are RNN, LSTM, and GRU.

1) RNN
RNNs (Recurrent Neural Networks) belong to a class of neural networks that can use the previous outputs as inputs for the next state. The structure of a basic RNN model is given in FIGURE 6.

FIGURE 6. RNN.

The activation vector a is computed as shown in Eq. (7):

a⟨t⟩ = g1(Waa a⟨t−1⟩ + Wax x⟨t⟩ + ba)   (7)

The output value y⟨t⟩ is computed as shown in Eq. (8):

y⟨t⟩ = g2(Wya a⟨t⟩ + by)   (8)
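A minimal numpy sketch of one recurrence step, Eqs. (7)-(8) (ours; tanh and the identity stand in for g1 and g2, with tiny random weights):

import numpy as np

rng = np.random.default_rng(0)
hidden, inp, out = 4, 3, 2

# Weight matrices and biases from Eqs. (7)-(8).
Waa = rng.normal(size=(hidden, hidden))
Wax = rng.normal(size=(hidden, inp))
Wya = rng.normal(size=(out, hidden))
ba, by = np.zeros(hidden), np.zeros(out)

def rnn_step(a_prev, x_t):
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)   # Eq. (7), g1 = tanh
    y_t = Wya @ a_t + by                           # Eq. (8), g2 = identity
    return a_t, y_t

a = np.zeros(hidden)
for x_t in rng.normal(size=(5, inp)):   # a toy 5-step input sequence
    a, y = rnn_step(a, x_t)
print(y)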
2) LSTM
Although RNNs can generate significant results for text summarization, they suffer from the 'vanishing gradient' problem during backpropagation, which limits the learning ability of the model. To counter this, LSTM (Long Short-Term Memory) models were introduced. In an LSTM model, a gate-based mechanism is employed in each LSTM cell to memorize the relevant information. This solves the vanishing gradient problem of RNNs. The cell of an LSTM model is shown in FIGURE 7.

3) GRU
Gated Recurrent Units (GRU) are another modification of standard RNNs that can solve the vanishing gradient problem. Similar to LSTM units, GRU units have a gate-based mechanism to store the relevant data for backpropagation training. The construction of a GRU cell is given in FIGURE 8.
VIII. CHALLENGES AND FUTURE SCOPES
Even with these advancements in text summarization, multiple challenges still exist, and researchers are working to overcome them. These challenges can also act as future research directions for new studies. They span many areas, such as multi-document summarization, applications of text summarization, and some user-specific summarization tasks. A few of the challenges are discussed below:

A. CHALLENGES RELATED TO MULTI-DOCUMENT SUMMARIZATION
Multi-document text summarization is more complex than single-document text summarization due to the following issues:
i. Redundancy
ii. Temporal dimension
iii. Co-references
iv. Sentence reordering

B. CHALLENGES RELATED TO APPLICATIONS OF TEXT SUMMARIZATION
Since most current studies focus on a specific text domain, i.e., news, biomedical documents, etc., some of these domains do not have significant economic value. Focusing on long texts, such as essays, dissertations, theses, or reports, may be more economically profitable. However, since the processing of long texts requires high computational power, it remains a major challenge.

C. CHALLENGES RELATED TO USER-SPECIFIC SUMMARIZATION TASKS
Summarizing semi-structured resources like web pages and databases is an important application of text summarization, since most textual data is present in a semi-structured format. This type of summarization is more complex than plain text summarization, since there is much more noise in the data. Hence, developing efficient summarizers for these domains is a massive challenge.

D. CHALLENGES RELATED TO FEATURE SELECTION, PREPROCESSING AND DATASETS
For any natural language processing problem, the performance of the selected methods dramatically depends on the selection of the features, and the same is valid for text summarization techniques. Irrespective of the methods, such as machine learning, statistical, fuzzy, and deep learning approaches, that have been used at a large scale in recent times for such problems, selecting appropriate features for the documents to be summarized is still a significant challenge for researchers. So, there is much scope in solving the feature selection problem, such as determining the most appropriate features to summarize the dataset, discovering new features, optimizing the commonly used features, using features for semantics, and adding grammatical and linguistic features. Preprocessing a dataset using appropriate methods also affects the performance of the summarization methods, so it also needs attention in the future. One can explore appropriate stemming approaches, stop-word removal techniques, tokenizers, and suitable POS taggers to categorize token classes among nouns, verbs, adjectives, adverbs, etc. The creation of new datasets is also a demanding task. Many little-explored domains, such as legal, tourism, and health, need new datasets to be created and used to expedite the summarization work at a different level.
IX. CONCLUSION
Text summarization is an exciting research topic in the NLP community that helps to produce concise information. The idea of this study is to present the latest research and progress made in this field through a systematic review of relevant research articles. In this study, we consolidated research works from different repositories related to various text summarization methods, datasets, techniques, and evaluation metrics.

We have also added a section on "Analysis of Popular Text Summarization Techniques", which articulates the most popular techniques in the text summarization domain, gives the strengths and limitations of each technique, and hints at future research directions. We have presented the information in a tabular format, covering the advantages and disadvantages of each research paper, which can make it easier for readers to use our review paper as a base paper for text summarization domain knowledge. We presented a detailed discussion of the different types of text summarization studies based on approach (extractive, abstractive, and hybrid), the number of documents (single-document and multi-document), summarization domain (generic-domain and domain-specific summarization), language (monolingual, multilingual, and cross-lingual), and the nature of the output summary (generic and query-based summarizers). We also presented a detailed analysis of various studies in a tabular format, which will save readers the hassle of reading through long texts and save their time. We also gave a detailed review of various datasets used in this domain and provided references to them. We discussed various standard evaluation metrics (ROUGE, F-measure, recall, precision, etc.), which can be used to measure the quality of a text summarization model. Finally, we discussed various challenges faced in text summarization that can guide future studies in the domain.
REFERENCES
[1] D. Jain, M. D. Borah, and A. Biswas, "Summarization of legal documents: Where are we now and the way forward," Comput. Sci. Rev., vol. 40, May 2021, Art. no. 100388, doi: 10.1016/j.cosrev.2021.100388.
[2] A. B. Al-Saleh and M. E. B. Menai, "Automatic Arabic text summarization: A survey," Artif. Intell. Rev., vol. 45, no. 2, pp. 203–234, Feb. 2016, doi: 10.1007/s10462-015-9442-x.
[3] Y. Kumar, K. Kaur, and S. Kaur, "Study of automatic text summarization approaches in different languages," Artif. Intell. Rev., vol. 54, no. 8, pp. 5897–5929, Dec. 2021, doi: 10.1007/s10462-021-09964-4.
[4] A. Nenkova and K. McKeown, "A survey of text summarization techniques," in Mining Text Data, C. C. Aggarwal and C. Zhai, Eds. Boston, MA, USA: Springer, 2012, pp. 43–76, doi: 10.1007/978-1-4614-3223-4_3.
[5] W. S. El-Kassas, C. R. Salama, A. A. Rafea, and H. K. Mohamed, "Automatic text summarization: A comprehensive survey," Expert Syst. Appl., vol. 165, Mar. 2021, Art. no. 113679, doi: 10.1016/j.eswa.2020.113679.
[6] B. Polepalli Ramesh, R. J. Sethi, and H. Yu, "Figure-associated text summarization and evaluation," PLoS ONE, vol. 10, no. 2, Feb. 2015, Art. no. e0115671, doi: 10.1371/journal.pone.0115671.
[7] B. Kitchenham, O. Pearl Brereton, D. Budgen, M. Turner, J. Bailey, and S. Linkman, "Systematic literature reviews in software engineering—A systematic literature review," Inf. Softw. Technol., vol. 51, no. 1, pp. 7–15, Jan. 2009, doi: 10.1016/j.infsof.2008.09.009.
[8] D. Moher, A. Liberati, J. Tetzlaff, and D. G. Altman, "Preferred reporting items for systematic reviews and meta-analyses: The PRISMA statement," BMJ, vol. 339, p. b2535, Jul. 2009, doi: 10.1136/bmj.b2535.
[9] P. Mongeon and A. Paul-Hus, "The journal coverage of Web of Science and Scopus: A comparative analysis," Scientometrics, vol. 106, no. 1, pp. 213–228, Jan. 2016, doi: 10.1007/s11192-015-1765-5.
[10] M. Kutlu, C. Cigir, and I. Cicekli, "Generic text summarization for Turkish," Comput. J., vol. 53, no. 8, pp. 1315–1323, Jan. 2010, doi: 10.1093/comjnl/bxp124.
[11] C. Bouras and V. Tsogkas, "Noun retrieval effect on text summarization and delivery of personalized news articles to the user's desktop," Data Knowl. Eng., vol. 69, no. 7, pp. 664–677, Jul. 2010, doi: 10.1016/j.datak.2010.02.005.
[12] S. A. Babar and P. D. Patil, "Improving performance of text summarization," Proc. Comput. Sci., vol. 46, pp. 354–363, Jan. 2015, doi: 10.1016/j.procs.2015.02.031.
[13] M. A. Tayal, M. M. Raghuwanshi, and L. G. Malik, "ATSSC: Development of an approach based on soft computing for text summarization," Comput. Speech Lang., vol. 41, pp. 214–235, Jan. 2017, doi: 10.1016/j.csl.2016.07.002.
[14] R. M. Alguliyev, R. M. Aliguliyev, N. R. Isazade, A. Abdi, and N. Idris, "A model for text summarization," Int. J. Intell. Inf. Technol., vol. 13, no. 1, pp. 67–85, Jan. 2017, doi: 10.4018/IJIIT.2017010104.
[15] R. M. Alguliyev, R. M. Aliguliyev, N. R. Isazade, A. Abdi, and N. Idris, "COSUM: Text summarization based on clustering and optimization," Expert Syst., vol. 36, no. 1, Feb. 2019, Art. no. e12340.
[16] M. Mohd, R. Jan, and M. Shah, "Text document summarization using word embedding," Expert Syst. Appl., vol. 143, Apr. 2020, Art. no. 112958, doi: 10.1016/j.eswa.2019.112958.
[17] O. Rouane, H. Belhadef, and M. Bouakkaz, "Combine clustering and frequent itemsets mining to enhance biomedical text summarization," Expert Syst. Appl., vol. 135, pp. 362–373, Nov. 2019, doi: 10.1016/j.eswa.2019.06.002.
[18] Y.-H. Hu, Y.-L. Chen, and H.-L. Chou, "Opinion mining from online hotel reviews—A text summarization approach," Inf. Process. Manag., vol. 53, no. 2, pp. 436–449, Mar. 2017, doi: 10.1016/j.ipm.2016.12.002.
[19] G. Erkan and D. R. Radev, "LexRank: Graph-based lexical centrality as salience in text summarization," J. Artif. Intell. Res., vol. 22, pp. 457–479, Dec. 2004.
[20] D. Patel, S. Shah, and H. Chhinkaniwala, "Fuzzy logic based multi document summarization with improved sentence scoring and redundancy removal technique," Expert Syst. Appl., vol. 134, pp. 167–177, Nov. 2019, doi: 10.1016/j.eswa.2019.05.045.
[21] R. Mihalcea and P. Tarau, "TextRank: Bringing order into text," in Proc. Conf. Empirical Methods Natural Lang. Process., Barcelona, Spain, Jul. 2004, pp. 404–411. [Online]. Available: https://aclanthology.org/W04-3252
[22] R. Elbarougy, G. Behery, and A. El Khatib, "Extractive Arabic text summarization using modified PageRank algorithm," Egyptian Informat. J., vol. 21, no. 2, pp. 73–81, Jul. 2020, doi: 10.1016/j.eij.2019.11.001.
[23] R. Ferreira, L. de Souza Cabral, F. Freitas, R. D. Lins, G. de França Silva, S. J. Simske, and L. Favaro, "A multi-document summarization system based on statistics and linguistic treatment," Expert Syst. Appl., vol. 41, no. 13, pp. 5780–5787, 2014, doi: 10.1016/j.eswa.2014.03.023.
[24] V. Priya and K. Umamaheswari, "Enhanced continuous and discrete multi objective particle swarm optimization for text summarization," Cluster Comput., vol. 22, no. S1, pp. 229–240, Jan. 2019, doi: 10.1007/s10586-018-2674-1.
[25] I. V. Mashechkin, M. I. Petrovskiy, D. S. Popov, and D. V. Tsarev, "Automatic text summarization using latent semantic analysis," Program. Comput. Softw., vol. 37, no. 6, pp. 299–305, Nov. 2011, doi: 10.1134/S0361768811060041.
[26] J.-Y. Yeh, H.-R. Ke, W.-P. Yang, and I.-H. Meng, "Text summarization using a trainable summarizer and latent semantic analysis," Inf. Process. Manage., vol. 41, no. 1, pp. 75–95, Jan. 2005, doi: 10.1016/j.ipm.2004.04.003.
[27] D. Shen, Q. Yang, and Z. Chen, "Noise reduction through summarization for web-page classification," Inf. Process. Manage., vol. 43, no. 6, pp. 1735–1747, Nov. 2007, doi: 10.1016/j.ipm.2007.01.013.
[28] R. Abbasi-ghalehtaki, H. Khotanlou, and M. Esmaeilpour, "Fuzzy evolutionary cellular learning automata model for text summarization," Swarm Evol. Comput., vol. 30, pp. 11–26, Oct. 2016, doi: 10.1016/j.swevo.2016.03.004.
[29] Y. Ko, J. Park, and J. Seo, "Improving text categorization using the importance of sentences," Inf. Process. Manage., vol. 40, no. 1, pp. 65–79, 2004, doi: 10.1016/S0306-4573(02)00056-0.
[30] J. L. Neto, A. A. Freitas, and C. A. A. Kaestner, "Automatic text summarization using a machine learning approach," in Proc. 16th Brazilian Symp. Artif. Intell., Adv. Artif. Intell., Berlin, Germany, 2002, pp. 205–215.
[31] D. D. A. Bui, G. Del Fiol, J. F. Hurdle, and S. Jonnalagadda, "Extractive text summarization system to aid data extraction from full text in systematic review development," J. Biomed. Informat., vol. 64, pp. 265–272, Dec. 2016, doi: 10.1016/j.jbi.2016.10.014.
[32] N. Alami, M. Meknassi, and N. En-Nahnahi, "Enhancing unsupervised neural networks based text summarization with word embedding and ensemble learning," Expert Syst. Appl., vol. 123, pp. 195–211, Jun. 2019, doi: 10.1016/j.eswa.2019.01.037.
[33] A. Sinha, A. Yadav, and A. Gahlot, "Extractive text summarization using neural networks," 2018, arXiv:1802.10137.
[34] J. Xu, Z. Gan, Y. Cheng, and J. Liu, "Discourse-aware neural extractive text summarization," in Proc. 58th Annu. Meeting Assoc. Comput. Linguistics, Jul. 2020, pp. 5021–5031, doi: 10.18653/v1/2020.acl-main.451.
[35] B. Mutlu, E. A. Sezer, and M. A. Akcayol, "Candidate sentence selection for extractive text summarization," Inf. Process. Manage., vol. 57, no. 6, Nov. 2020, Art. no. 102359, doi: 10.1016/j.ipm.2020.102359.
[36] M. Yousefi-Azar and L. Hamey, "Text summarization using unsupervised deep learning," Expert Syst. Appl., vol. 68, pp. 93–105, Feb. 2017, doi: 10.1016/j.eswa.2016.10.017.
[37] N. Alami, M. E. Mallahi, H. Amakdouf, and H. Qjidaa, "Hybrid method for text summarization based on statistical and semantic treatment," Multimedia Tools Appl., vol. 80, no. 13, pp. 19567–19600, May 2021, doi: 10.1007/s11042-021-10613-9.
[38] J. M. Sanchez-Gomez, M. A. Vega-Rodríguez, and C. J. Pérez, "Comparison of automatic methods for reducing the Pareto front to a single solution applied to multi-document text summarization," Knowl.-Based Syst., vol. 174, pp. 123–136, Jun. 2019, doi: 10.1016/j.knosys.2019.03.002.
[39] J. M. Sanchez-Gomez, M. A. Vega-Rodríguez, and C. J. Pérez, "Parallelizing a multi-objective optimization approach for extractive multi-document text summarization," J. Parallel Distrib. Comput., vol. 134, pp. 166–179, Dec. 2019, doi: 10.1016/j.jpdc.2019.09.001.
[40] J. M. Sanchez-Gomez, M. A. Vega-Rodríguez, and C. J. Pérez, "Experimental analysis of multiple criteria for extractive multi-document text summarization," Expert Syst. Appl., vol. 140, Feb. 2020, Art. no. 112904, doi: 10.1016/j.eswa.2019.112904.
[41] M. Azhari and Y. Jaya Kumar, "Improving text summarization using neuro-fuzzy approach," J. Inf. Telecommun., vol. 1, no. 4, pp. 367–379, Oct. 2017, doi: 10.1080/24751839.2017.1364040.
[42] F. B. Goularte, S. M. Nassar, R. Fileto, and H. Saggion, "A text summarization method based on fuzzy rules and applicable to automated assessment," Expert Syst. Appl., vol. 115, pp. 264–275, Jan. 2019, doi: 10.1016/j.eswa.2018.07.047.
[43] S. Hou and R. Lu, "Knowledge-guided unsupervised rhetorical parsing for text summarization," Inf. Syst., vol. 94, Dec. 2020, Art. no. 101615, doi: 10.1016/j.is.2020.101615.
[44] S. Dohare, H. Karnick, and V. Gupta, "Text summarization using abstract meaning representation," 2017, arXiv:1706.01678.
[45] K. Ganesan, C. Zhai, and J. Han, "Opinosis: A graph based approach to abstractive summarization of highly redundant opinions," in Proc. 23rd Int. Conf. Comput. Linguistics (Coling), Beijing, China, Aug. 2010, pp. 340–348. [Online]. Available: https://aclanthology.org/C10-1039
[46] K. Song, L. Lebanoff, Q. Guo, X. Qiu, X. Xue, and C. Li, "Joint parsing and generation for abstractive summarization," in Proc. AAAI, Apr. 2020, vol. 34, no. 5, pp. 8894–8901, doi: 10.1609/aaai.v34i05.6419.
[47] R. Barzilay and K. R. McKeown, "Sentence fusion for multidocument news summarization," Comput. Linguistics, vol. 31, no. 3, pp. 297–328, Sep. 2005.
[48] N. Okumura and T. Miura, "Automatic labelling of documents based on ontology," in Proc. IEEE Pacific Rim Conf. Commun., Comput. Signal Process. (PACRIM), Victoria, BC, Canada, Aug. 2015, pp. 34–39, doi: 10.1109/PACRIM.2015.7334805.
[49] C.-S. Lee, Y.-J. Chen, and Z.-W. Jian, "Ontology-based fuzzy event extraction agent for Chinese e-news summarization," Expert Syst. Appl., vol. 25, no. 3, pp. 431–447, Oct. 2003, doi: 10.1016/S0957-4174(03)00062-9.
[50] M. Al-Maleh and S. Desouki, "Arabic text summarization using deep learning approach," J. Big Data, vol. 7, no. 1, p. 109, Dec. 2020, doi: 10.1186/s40537-020-00386-7.
[51] U. Khandelwal, K. Clark, D. Jurafsky, and L. Kaiser, "Sample efficient text summarization using a single pre-trained transformer," 2019, arXiv:1905.08836.
[52] E. Lloret, M. T. Romá-Ferri, and M. Palomar, "COMPENDIUM: A text summarization system for generating abstracts of research papers," Data Knowl. Eng., vol. 88, pp. 164–175, Nov. 2013, doi: 10.1016/j.datak.2013.08.005.
[53] V. Gupta and N. Kaur, "A novel hybrid text summarization system for Punjabi text," Cognit. Comput., vol. 8, no. 2, pp. 261–277, Apr. 2016, doi: 10.1007/s12559-015-9359-3.
[54] M. S. Binwahlan, N. Salim, and L. Suanmali, "Fuzzy swarm diversity hybrid model for text summarization," Inf. Process. Manage., vol. 46, no. 5, pp. 571–588, Sep. 2010, doi: 10.1016/j.ipm.2010.03.004.
[55] J. M. Perea-Ortega, E. Lloret, L. Alfonso Ureña-López, and M. Palomar, "Application of text summarization techniques to the geographical information retrieval task," Expert Syst. Appl., vol. 40, no. 8, pp. 2966–2974, Jun. 2013, doi: 10.1016/j.eswa.2012.12.012.
[56] Y. Sankarasubramaniam, K. Ramanathan, and S. Ghosh, "Text summarization using Wikipedia," Inf. Process. Manage., vol. 50, no. 3, pp. 443–461, May 2014, doi: 10.1016/j.ipm.2014.02.001.
[57] H. Nguyen, E. Santos, and J. Russell, "Evaluation of the impact of user-cognitive styles on the assessment of text summarization," IEEE Trans. Syst., Man, Cybern., A, Syst. Humans, vol. 41, no. 6, pp. 1038–1051, Nov. 2011, doi: 10.1109/TSMCA.2011.2116001.
[58] J. Xu and G. Durrett, "Neural extractive text summarization with syntactic compression," in Proc. Conf. Empirical Methods Natural Lang. Process. 9th Int. Joint Conf. Natural Lang. Process., Hong Kong, Nov. 2019, pp. 3292–3303, doi: 10.48550/ARXIV.1902.00863.
[59] Y. Shang, Y. Li, H. Lin, and Z. Yang, "Enhancing biomedical text summarization using semantic relation extraction," PLoS ONE, vol. 6, no. 8, Aug. 2011, Art. no. e23862, doi: 10.1371/journal.pone.0023862.
[60] L. H. Reeve, H. Han, and A. D. Brooks, "The use of domain-specific concepts in biomedical text summarization," Inf. Process. Manage., vol. 43, no. 6, pp. 1765–1776, Nov. 2007, doi: 10.1016/j.ipm.2007.01.026.
[61] M. Moradi, M. Dashti, and M. Samwald, "Summarization of biomedical articles using domain-specific word embeddings and graph ranking," J. Biomed. Informat., vol. 107, Jul. 2020, Art. no. 103452, doi: 10.1016/j.jbi.2020.103452.
[62] A. Farzindar and G. Lapalme, "Legal text summarization by exploration of the thematic structure and argumentative roles," in Proc. Text Summarization Branches Out, Barcelona, Spain, Jul. 2004, pp. 27–34. [Online]. Available: https://aclanthology.org/W04-1006
[63] R. Rani and D. K. Lobiyal, "An extractive text summarization approach using tagged-LDA based topic modeling," Multimedia Tools Appl., vol. 80, no. 3, pp. 3275–3305, Jan. 2021, doi: 10.1007/s11042-020-09549-3.
[64] E. Linhares Pontes, S. Huet, J.-M. Torres-Moreno, and A. C. Linhares, "Compressive approaches for cross-language multi-document summarization," Data Knowl. Eng., vol. 125, Jan. 2020, Art. no. 101763, doi: 10.1016/j.datak.2019.101763.
[65] N. Chatterjee and P. K. Sahoo, "Random indexing and modified random indexing based approach for extractive text summarization," Comput. Speech Lang., vol. 29, no. 1, pp. 32–44, Jan. 2015, doi: 10.1016/j.csl.2014.07.001.
[66] J. He, W. Kryściński, B. McCann, N. Rajani, and C. Xiong, "CTRLsum: Towards generic controllable text summarization," 2020, arXiv:2012.04281.
[67] G. Salton, J. Allan, C. Buckley, and A. Singhal, "Automatic analysis, theme generation, and summarization of machine-readable texts," Science, vol. 264, no. 5164, pp. 1421–1426, Jun. 1994, doi: 10.1126/science.264.5164.1421.
[68] H. Van Lierde and T. W. S. Chow, "Query-oriented text summarization based on hypergraph transversals," Inf. Process. Manage., vol. 56, no. 4, pp. 1317–1338, Jul. 2019, doi: 10.1016/j.ipm.2019.03.003.
[69] H. Van Lierde and T. W. S. Chow, "Learning with fuzzy hypergraphs: A topical approach to query-oriented text summarization," Inf. Sci., vol. 496, pp. 212–224, Sep. 2019, doi: 10.1016/j.ins.2019.05.020.
[70] A. Joshi, E. Fidalgo, E. Alegre, and L. Fernández-Robles, "SummCoder: An unsupervised framework for extractive text summarization based on deep auto-encoders," Expert Syst. Appl., vol. 129, pp. 200–215, Sep. 2019, doi: 10.1016/j.eswa.2019.03.045.
[71] CNN-DailyMail. GitHub. Accessed: Jun. 26, 2022. [Online]. Available: https://github.com/abisee/cnn-dailymail
[72] GitHub—JafferWilson/Process-Data-of-CNN-DailyMail: This Repository Holds the Output of the Repository. Accessed: Jun. 26, 2022. [Online]. Available: https://github.com/abisee/cnn-dailymail and https://github.com/JafferWilson/Process-Data-of-CNN-DailyMail
[73] H. M. Lynn, C. Choi, and P. Kim, "An improved method of automatic text summarization for web contents using lexical chain with semantic-related terms," Soft Comput., vol. 22, no. 12, pp. 4013–4023, Jun. 2018, doi: 10.1007/s00500-017-2612-9.
[74] K. Ganesan. Opinosis Dataset—Topic Related Review Sentences. Accessed: Jun. 30, 2022. [Online]. Available: https://kavita-ganesan.com/opinosis-opinion-dataset/
[75] R. C. Belwal, S. Rai, and A. Gupta, "A new graph-based extractive text summarization using keywords or topic modeling," J. Ambient Intell. Humanized Comput., vol. 12, no. 10, pp. 8975–8990, Oct. 2021, doi: 10.1007/s12652-020-02591-x.
[76] R. C. Belwal, S. Rai, and A. Gupta, "Text summarization using topic-based vector space model and semantic measure," Inf. Process. Manage., vol. 58, no. 3, May 2021, Art. no. 102536, doi: 10.1016/j.ipm.2021.102536.
[77] P. Li, W. Lam, L. Bing, and Z. Wang, "Deep recurrent generative decoder for abstractive text summarization," in Proc. Conf. Empirical Methods Natural Lang. Process., Copenhagen, Denmark, 2017, pp. 2091–2100, doi: 10.18653/v1/D17-1222.
[78] Datasets/Gigaword.py at Master · Tensorflow/Datasets. GitHub. Accessed: Jun. 26, 2022. [Online]. Available: https://github.com/tensorflow/datasets
[79] MEDLINE/PubMed Data. Accessed: Jun. 26, 2022. [Online]. Available: https://www.nlm.nih.gov/databases/download/pubmed_medline.html
[80] B. Elayeb, A. Chouigui, M. Bounhas, and O. B. Khiroun, "Automatic Arabic text summarization using analogical proportions," Cognit. Comput., vol. 12, no. 5, pp. 1043–1069, Sep. 2020, doi: 10.1007/s12559-020-09748-y.
[81] Y. Gao, W. Zhao, and S. Eger, "SUPERT: Towards new frontiers in unsupervised evaluation metrics for multi-document summarization," 2020, arXiv:2005.03724.
[82] A. John and M. Wilscy, "Random forest classifier based multi-document summarization system," in Proc. IEEE Recent Adv. Intell. Comput. Syst. (RAICS), Trivandrum, India, Dec. 2013, pp. 31–36, doi: 10.1109/RAICS.2013.6745442.

DIVAKAR YADAV (Senior Member, IEEE) received the B.Tech. degree in computer science and engineering from IET Lucknow, in 1999, the M.Tech. degree in information technology from IIIT Allahabad, in 2005, and the Ph.D. degree in computer science and engineering, in 2010. He spent one year, from 2011 to 2012, as a Postdoctoral Fellow at the Universidad Carlos III de Madrid, Spain. He has more than 22 years of teaching and research experience. Since September 2022, he has been working as a Professor at the School of Computer and Information Science, Indira Gandhi National Open University (IGNOU), New Delhi. Prior to joining IGNOU, he worked as an Associate Professor and the Head of the Department of Computer Science and Engineering, National Institute of Technology Hamirpur (NIT Hamirpur) (an Institution of National Importance), from 2019 to 2022; at Madan Mohan Malaviya University of Technology (MMMUT), Gorakhpur, from 2016 to 2019; and at the Jaypee Institute of Information Technology, Noida, from 2005 to 2016. He has supervised eight Ph.D. theses and 31 M.Tech. dissertations. He has published more than 125 research papers in international journals and conference proceedings of repute. His research interests include machine learning, soft computing, information retrieval, NLP, and e-learning. He is a member of the ACM.

RISHABH KATNA received the bachelor's and master's degrees in computer science and engineering from the National Institute of Technology Hamirpur (NIT Hamirpur), Hamirpur, in 2021 and 2022, respectively. He has worked with Standard Chartered GBS as an Intern Software Engineer. He is currently working as a Software Engineer at Qualcomm. His research interests include social networking, automation for web developers, and natural language processing.

ARUN KUMAR YADAV received the Ph.D. degree in computer science and engineering, in 2016. He is currently an Assistant Professor with the Department of Computer Science and Engineering, National Institute of Technology Hamirpur (NIT Hamirpur). He is also working on government-sponsored funded projects and has supervised many students. He has published more than 20 research papers in reputed international/national journals and conference proceedings. His research interests include information retrieval, machine learning, and deep learning.

JORGE MORATO received the Ph.D. degree in library science from the Universidad Carlos III de Madrid, Spain, on the topic of knowledge information systems and their relationship with linguistics. He is currently a Professor of information science with the Department of Computer Science, Universidad Carlos III de Madrid. His research interests include NLP, information retrieval, web positioning, and knowledge organization systems.