T5-Based Model for Abstractive Summarization:
A Semi-Supervised Learning Approach with Consistency
Loss Functions
Mingye Wang 1,*, Pan Xie 1, Yao Du 1 and Xiaohui Hu 2

1 School of Automation Science and Electrical Engineering, Beihang University, Beijing 100191, China;
[email protected] (P.X.); [email protected] (Y.D.)
2 Science and Technology on Integrated Information System Laboratory, Institute of Software,
Chinese Academy of Sciences, Beijing 100045, China; [email protected]
* Correspondence: [email protected]

Abstract: Text summarization is a prominent task in natural language processing (NLP) that condenses lengthy texts into concise summaries. Despite the success of existing supervised models, they often rely on datasets of well-constructed text pairs, which can be insufficient for languages with limited annotated data, such as Chinese. To address this issue, we propose a semi-supervised learning method for text summarization. Our method is inspired by the cycle-consistent adversarial network (CycleGAN) and treats text summarization as a style transfer task. The model is trained with a procedure and loss functions similar to those of CycleGAN and learns to transfer the style of a document to its summary and vice versa. Our method can be applied to multiple languages, but this paper focuses on its performance on Chinese documents. We trained a T5-based model and evaluated it on two datasets, CSL and LCSTS, and the results demonstrate the effectiveness of the proposed method.

Keywords: natural language processing; automatic text summarization; abstractive summarization; semi-supervised learning; consistency loss function

Citation: Wang, M.; Xie, P.; Du, Y.; Hu, X. T5-Based Model for Abstractive Summarization: A Semi-Supervised Learning Approach with Consistency Loss Functions. Appl. Sci. 2023, 13, 7111. https://doi.org/10.3390/app13127111

Academic Editor: Alessandro Di Nuovo

Received: 20 April 2023; Revised: 1 June 2023; Accepted: 9 June 2023; Published: 14 June 2023

Copyright: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

Automatic text summarization is a crucial task in natural language processing (NLP) that aims to condense the core information of a given corpus into a brief summary. With the exponential growth of textual data, including documents, articles, and news, automatic summarization has become increasingly important.

Text summarization methods can be classified into two categories: extractive and abstractive. Extractive summarization selects the most important sentences from the original corpus based on statistical or linguistic features, whereas abstractive summarization generates a summary by semantically understanding the text and expressing it in a new way [1]. Abstractive summarization is more challenging than extractive summarization, but it is also considered superior, as it avoids the coherence and consistency issues of summaries generated with extractive methods.

Deep learning has achieved state-of-the-art results in NLP, and more researchers have shifted their focus to abstractive summarization. The sequence-to-sequence (seq2seq) model [2] combined with an attention mechanism has become a benchmark in abstractive summarization [3–5]. However, these methods require well-constructed datasets, which can be difficult and costly to build.

In this paper, we propose a semi-supervised learning method for text summarization that treats summarization as a style transfer task. Our approach uses a transfer text-to-text transformer (T5) model as the text generator and trains it with loss functions from the cycle-consistent adversarial network (CycleGAN) for semantic transfer.

The remainder of this paper is structured as follows. In Section 2, we review previous research related to our work. Section 3 describes our method of text summarization in detail. Section 4 presents the experimental setup and results of our proposed model, including an ablation study that validates the effectiveness of our design. Finally, we conclude our work in Section 5 and discuss limitations and future work in Section 6.

2. Related Works
2.1. Automatic Text Summarization
Automatic text summarization is a crucial task in the field of natural language processing (NLP), and it has received a significant amount of attention from researchers in recent years. Over the years, a range of methods and models have been proposed to improve the quality of automatic text summaries. In the early days of NLP research, traditional approaches to text summarization were based on sentence ranking algorithms that evaluated the importance of sentences in a given text. These methods used statistical features, such as frequency and centrality, to rank sentences and select the most important ones to form a summary [6–8].
With the advent of machine learning techniques in the 1990s, researchers began applying these methods to NLP to improve the quality of summaries. In automatic text summarization, the problem is mostly treated as sequence classification: models are trained to differentiate summary sentences from non-summary sentences [9–12]. These methods are referred to as extractive, as they essentially extract important phrases or sentences from the text without fully understanding their meaning. Thanks to the tremendous success of deep learning techniques, many extractive summarization studies have been proposed based on techniques including the encoder–decoder classifier [13], recurrent neural networks (RNNs) [14], sentence embeddings [15], reinforcement learning, and long short-term memory (LSTM) networks [16].
Moreover, the development of deep learning has given rise to a method called abstractive summarization. Abstractive summarization has improved significantly and has become a crucial area of research in the NLP field. Researchers have made remarkable progress in this field by leveraging deep learning techniques, such as RNNs [3], LSTMs [17], and classic seq2seq models [4,5].
With the introduction of the transformer architecture in 2017 [18], transformer-based
models have significantly outperformed other models in many NLP tasks. This architecture
has been naturally applied to the text summarization task, leading to the development of
several models based on pre-trained language models, including BERT [19], BART [20],
and T5 [21]. These models have demonstrated remarkable performance on various NLP
tasks, including text summarization.

2.2. Text Style Transfer


Text style transfer is a task in the field of NLP that focuses on modifying the style
of a text without altering its content. This task has received considerable attention from
researchers due to its potential applications in many areas, such as creative writing, machine
translation, and sentiment analysis.
The early methods for text style transfer mainly focused on rule-based approaches,
where linguistic patterns and attributes were manually defined and applied to modify the
style of text [22]. These methods, though simple and effective, are limited by the fixed set
of rules that they rely on, which may not adapt well to changing styles and genres.
With the advent of deep learning, several machine-learning-based approaches have
been proposed. The most well-known method is the sequence-to-sequence (seq2seq)
model [2]. Seq2seq models have been used in various NLP tasks, such as text summarization
and machine translation, due to their ability to encode the source text and generate a
target text.
Recently, generative adversarial networks (GANs) [23] were applied to the task of
text style transfer. The idea of GANs is to train two neural networks: a generator and a

discriminator. The generator tries to generate text that is indistinguishable from the target
style, while the discriminator tries to differentiate between the generated text and the real
target text.

2.3. Cycle-Consistent Adversarial Network


The cycle-consistent adversarial network (CycleGAN) is a generative adversarial
network (GAN) architecture for image-to-image translation tasks. This approach has been
widely used in various domains, including but not limited to image style transfer, domain
adaptation, and super-resolution. The key idea of CycleGAN is to train two generator–discriminator pairs. One generator translates an image from the source domain to the target domain, while the other translates an image from the target domain back to the source domain. The discriminator in each pair is trained to distinguish translated images from real images in the corresponding domain. The cycle consistency loss is introduced to force a translated image to map back to the original image when passed through the reverse generator.
Figure 1 illustrates how CycleGAN works in one direction.

Figure 1. Working principle of CycleGAN.

CycleGAN is focused on the application of style transfer in computer vision. For example, Zhu et al. [24] originally proposed CycleGAN for unpaired image-to-image translation, where there was no one-to-one mapping between the source and target domains. This method has been widely used in tasks such as colorization, super-resolution, and style transfer. Based on CycleGAN, different models have been proposed for face transfer [25], Chinese handwritten character generation [26], image generation from text [27], image correction [28], and tasks in the audio field [29–31].
One of the highlights of CycleGAN is the implementation of two consistency losses in addition to the original GAN loss: the identity mapping loss and the cycle consistency loss. The identity mapping loss implies that the source data should not be changed during transformation if they are already in the target domain. The cycle consistency loss comes from the idea of back translation: the result of back translation should be the same as the original source. These two loss functions enforce strong consistency throughout the transfer procedure, which makes it possible to handle unpaired images and still achieve outstanding results.
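To make these two losses concrete, the toy sketch below illustrates how they are computed for a pair of mapping networks G: X→Y and F: Y→X. It is a hypothetical, minimal illustration using L1 distance and linear layers as stand-ins for CycleGAN's convolutional generators, not the original implementation.

```python
import torch
import torch.nn as nn

# Toy stand-ins for CycleGAN's mapping networks G: X -> Y and F: Y -> X.
G = nn.Linear(16, 16)   # source -> target
F = nn.Linear(16, 16)   # target -> source
l1 = nn.L1Loss()

x = torch.randn(8, 16)  # batch of "source domain" samples
y = torch.randn(8, 16)  # batch of "target domain" samples

# Identity mapping loss: a sample already in the target domain
# should pass through the corresponding generator unchanged.
loss_idt = l1(G(y), y) + l1(F(x), x)

# Cycle consistency loss: translating to the other domain and back
# should reconstruct the original sample.
loss_cyc = l1(F(G(x)), x) + l1(G(F(y)), y)

total = loss_idt + loss_cyc  # the adversarial GAN losses would be added here in a full CycleGAN
```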

2.4. Transfer Text-to-Text Transformer


The transfer text-to-text transformer (T5) [21] is a state-of-the-art pre-trained language
model based on the transformer architecture. It adopts a unified text-to-text framework
that can handle any natural language processing (NLP) task by converting both the input
and output into natural language texts. T5 can be easily scaled up by varying the number
of parameters (from 60M to 11B), which enables it to achieve superior performance on
various NLP benchmarks. Moreover, T5 employs a full-attention mechanism that allows it
to capture long-range dependencies and complex semantic relations in natural language
texts. T5 has been successfully applied to many NLP tasks, such as machine translation,
text summarization, question answering, and sentiment analysis [21].
The T5 model follows the typical encoder–decoder structure, and its architecture is
shown in Figure 2.

Figure 2. Architecture of the T5 model.

One of the key features of T5’s text-to-text framework is the use of different prefixes to
indicate different tasks, thus transforming all NLP problems into text generation problems.
For example, to perform sentiment analysis on a given sentence, T5 simply adds the prefix
“sentiment:” before the sentence and generates either “positive” or “negative” as the output.
This feature makes it possible to train a single model that can perform multiple tasks
without changing its architecture or objective function.
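As a concrete illustration, the snippet below shows how changing the prefix switches the task performed by one and the same T5 model. It is a minimal sketch assuming the Hugging Face transformers library and the public English t5-small checkpoint, rather than the Chinese model used later in this paper.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def run_t5(prefix: str, text: str) -> str:
    """Prepend a task prefix and let the same model generate the answer as text."""
    inputs = tokenizer(prefix + text, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

article = "Researchers propose a semi-supervised method that treats summarization as style transfer ..."
print(run_t5("summarize: ", article))                                      # summarization task
print(run_t5("translate English to German: ", "The house is wonderful."))  # different task, same model
```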

3. Proposed Methodology
3.1. Overall
This section presents the foundation of our semi-supervised method for automatic text
summarization. Unlike existing models, which rely heavily on paired text for supervised
training, our approach leverages a small paired dataset followed by a semi-supervised
training process with unpaired corpora. The algorithm used in our method is illustrated in
Algorithm 1, where L denotes the loss incurred by comparing two texts.
Our approach is inspired by the CycleGAN architecture, which uses two generators to
facilitate style transfer in two respective directions. The first part of our method comprises
a warm-up step that employs real text pairs to clarify the tasks of the style transferers
Ta2s and Ts2a and generate basic outputs. The subscripts a2s and s2a, which represent
“article-to-summary” and vice versa, are employed to clarify the transfer direction. The
second part adopts a similar training procedure to that of CycleGAN with consistency loss
functions to further train the models without supervision.
Specifically, the identity mapping loss ensures that a text should not be summarized if
it is already a summary and vice versa. The corresponding training procedure is based on
calling the model to re-generate an identity of the input text. The loss is then calculated by
measuring the difference between the original text and the generated identity. This part is
designed to train the model to be capable of identifying the characteristics of two distinct
text domains. In the following sections of the paper, a superscript idt is used to indicate
re-generated identity texts.

In contrast, the cycle consistency loss trains the model to reconstruct a summary after
expanding it or vice versa. The corresponding training procedure follows a cyclical process:
For a real summary s, the model Ts2a first expands it and generates a fake article. The term
“fake” indicates that it is generated by our model, rather than a real example from datasets.
Next, the fake article is sent to Ta2s to re-generate its summary. For real articles, the same
cycle steps are utilized. This part is designed to train the model to be capable of transferring
texts between two domains. In the following, a superscript fake is used to indicate the fake
texts generated by the models, and a superscript cyc is used to indicate the final outputs
after such a cycle procedure.

Algorithm 1 Semi-supervised automatic text summarization.

1:  for each batch ∈ gold_batches do
2:      fine-tune Ta2s and Ts2a with batch                          ▷ Fine-tune with real text pairs
3:  end for
4:  for epoch ∈ [1, nb_epochs] do
5:      for all (ai, si) such that ai ∈ Articles and si ∈ Summaries do
6:          (ai^idt, si^idt) ← (Ts2a(ai), Ta2s(si))                  ▷ Re-expand and re-summarize
7:          (La^idt, Ls^idt) ← (L(ai, ai^idt), L(si, si^idt))        ▷ Identity mapping loss
8:          (si^fake, ai^fake) ← (Ta2s(ai), Ts2a(si))                ▷ Generate fake summary and article
9:          (ai^cyc, si^cyc) ← (Ts2a(si^fake), Ta2s(ai^fake))        ▷ Restore article and summary
10:         (La^cyc, Ls^cyc) ← (L(ai, ai^cyc), L(si, si^cyc))        ▷ Cycle consistency loss
11:         Loss ← La^idt + Ls^idt + La^cyc + Ls^cyc                 ▷ Total loss
12:         Back-propagation of Loss
13:     end for
14: end for

As observed, despite the integration of the CycleGAN loss functions, we refrain from
constructing a GAN architecture for our task. This decision arises from two factors: firstly,
the challenge involved in the back-propagation phase of discrete sampling during text
generation; secondly, the lack of discernible improvement vis-à-vis our method during
development and the inherent instability in the training process.
The back-propagation of gradients for text generation in a GAN framework presents an
arduous problem, which is primarily due to the discrete nature of text data. Consequently,
the GAN model for text generation often entails the adoption of reinforcement learning or
the use of Gumbel–softmax approximation. These techniques are complicated and may
render the training process unstable, leading to the production of sub-optimal summaries.
Moreover, we found no clear evidence of improved performance through the use of
GAN-based models in our task in comparison with our semi-supervised method with
CycleGAN loss functions. Therefore, we conclude that our approach presents a promising solution for automatic text summarization and is better suited to our task given its simplicity and effectiveness.

3.2. Style Transfer Model


As mentioned previously, we view the summarization task as a style transfer problem.
To accomplish this, we employ a T5 model, which offers several advantages over alternative
models. Firstly, the native tasks of the T5 model align well with the requirements of the
style transfer task. Secondly, by modifying the prefix of the input text, a T5 model can
perform tasks in both directions, i.e., from text to summary and vice versa.
As illustrated in Figure 3, a single T5 model can perform the tasks of Ta2s and Ts2a
outlined in Algorithm 1 by changing the prefix of the input text. Therefore, we only require
one generator for both directions, unlike in the original CycleGAN architecture.

Figure 3. T5 model with different prefixes.

The versatility of the T5 model in undertaking various natural language processing tasks has been well documented in recent research. The model's pre-training process
enables it to perform a wide range of tasks, including question answering, text classification,
and text generation. By leveraging the strengths of the T5 model, our approach provides
an effective solution to the problem of automatic text summarization.

3.3. Training with the T5 Model


Our training procedure consists of two parts: a supervised part and an unsupervised part. In the supervised part, we use a small amount of labeled data for warm-up while following the same procedure as in the original T5 model. In this part, we fine-tune the T5 model with pairs of articles and summaries, using different prefixes to indicate the generation direction. The loss function for the supervised part is cross-entropy, the same loss as that used in the original T5 model.
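A condensed sketch of this warm-up step is shown below. It assumes the Hugging Face transformers interface with a generic English checkpoint, and the prefix strings and learning rate are illustrative placeholders rather than the exact experimental settings.

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")        # stand-in for the Chinese T5 checkpoint
model = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

P_SUM, P_EXP = "summarize: ", "expand: "                      # illustrative prefixes Ps and Pe

def supervised_step(article: str, summary: str) -> float:
    """One warm-up step: fine-tune the single model in both directions with cross-entropy."""
    total_loss = 0.0
    for src, tgt in [(P_SUM + article, summary), (P_EXP + summary, article)]:
        enc = tokenizer(src, return_tensors="pt", truncation=True, max_length=512)
        labels = tokenizer(tgt, return_tensors="pt", truncation=True, max_length=512).input_ids
        total_loss = total_loss + model(**enc, labels=labels).loss   # token-level cross-entropy
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    return float(total_loss)
```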
In the unsupervised part, we adopt a training procedure inspired by the CycleGAN
architecture, thus incorporating identity mapping loss and cycle consistency loss. The
identity mapping loss deters the model from re-summarizing a summary or expanding a
full article by minimizing the difference between the input and output texts. Meanwhile,
the cycle consistency loss ensures that the model preserves the source text after a cyclical
transfer by minimizing the difference between the input and reconstructed texts. Figure 4
illustrates these two processes.

(a) Identity mapping loss (b) Cycle consistency loss

Figure 4. CycleGAN losses of the proposed model.

We propose a novel training procedure that uses a single T5 model for both generation tasks with different prefixes. Given an article a and its summary s, we use the T5 model to generate a fake summary s^fake from a and a fake article a^fake from s. To indicate the desired
task, we prepend a prefix string to the input text. The generation process can be formulated
as follows:

s^fake = Ts(a) = T(Ps ⊕ a),    a^fake = Te(s) = T(Pe ⊕ s)        (1)

where Ts(·) and Te(·) denote the T5 model with the summary prefix and the expansion prefix, respectively.
The training process follows a typical supervised paradigm: a cross-entropy loss [32] is calculated to measure the difference between two texts, and the model is trained via back-propagation:
L(x, x^fake) = − Σ_{i=1}^{C} p_i(x) · log p_i(x^fake)        (2)

where C is the vocabulary size and p_i(·) is the probability of the i-th word in the vocabulary.
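In implementation terms, when the reference distribution p_i(x) is the one-hot encoding of the reference tokens, Equation (2) reduces to the standard token-level cross-entropy over the decoder logits. A minimal sketch, assuming PyTorch and logits of shape [batch, length, C], is:

```python
import torch
import torch.nn.functional as F

def seq_cross_entropy(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """Token-level cross-entropy between predicted distributions and reference tokens.

    logits:     [batch, seq_len, C]  unnormalized scores over the vocabulary
    target_ids: [batch, seq_len]     reference token indices (padding marked as -100)
    """
    vocab_size = logits.size(-1)
    return F.cross_entropy(
        logits.view(-1, vocab_size),   # flatten batch and time dimensions
        target_ids.view(-1),
        ignore_index=-100,             # skip padding positions
    )
```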
For the rest of the dataset, where an article a and a summary s are not paired, we calculate the two consistency losses. The identity mapping loss is calculated by re-summarizing a summary or re-expanding an article as follows:

a^idt = Te(a),    s^idt = Ts(s)
L_a^idt = L(a, a^idt),    L_s^idt = L(s, s^idt)        (3)

As for the cycle consistency loss, the model first generates s^fake and a^fake as stated above; it then regenerates a^cyc and s^cyc from s^fake and a^fake. After such a cycle, the losses are calculated as follows:

s^fake = Ts(a),    a^fake = Te(s)
a^cyc = Te(s^fake),    s^cyc = Ts(a^fake)        (4)
L_a^cyc = L(a, a^cyc),    L_s^cyc = L(s, s^cyc)
The training algorithm is thus adapted as in Algorithm 2 (T denotes the T5 model, and ⊕ denotes the concatenation of texts). We use Ps and Pe to denote prefix_summarize and prefix_expand, respectively.

Algorithm 2 Semi-supervised automatic text summarization with T5.

1:  Set prefix_summarize and prefix_expand as Ps and Pe
2:  for each batch ∈ gold_batches do
3:      (article, summary) ← batch
4:      fine-tune T with (Ps ⊕ article, summary) and (Pe ⊕ summary, article)      ▷ Fine-tune with real text pairs
5:  end for
6:  for epoch ∈ [1, nb_epochs] do
7:      for all (ai, si) such that ai ∈ Articles and si ∈ Summaries do
8:          (ai^idt, si^idt) ← (T(Pe ⊕ ai), T(Ps ⊕ si))                           ▷ Re-expand and re-summarize
9:          (La^idt, Ls^idt) ← (L(ai, ai^idt), L(si, si^idt))                     ▷ Identity mapping loss
10:         (si^fake, ai^fake) ← (T(Ps ⊕ ai), T(Pe ⊕ si))                         ▷ Generate fake summary and article
11:         (ai^cyc, si^cyc) ← (T(Pe ⊕ si^fake), T(Ps ⊕ ai^fake))                 ▷ Restore article and summary
12:         (La^cyc, Ls^cyc) ← (L(ai, ai^cyc), L(si, si^cyc))                     ▷ Cycle consistency loss
13:         Loss ← λ_idt·(La^idt + Ls^idt) + λ_cyc·(La^cyc + Ls^cyc)              ▷ Total loss
14:         Back-propagation of Loss
15:     end for
16: end for

Here, the hyperparameters λ_idt and λ_cyc control the weights of the two types of losses.
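For readers who prefer code, a compact sketch of one step of the unpaired phase of Algorithm 2 is given below. It is a simplified illustration assuming the Hugging Face transformers interface; the prefixes and checkpoint are placeholders, generation is treated as non-differentiable (as in back-translation), and the comparison L(·,·) is realized as teacher-forced cross-entropy.

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")   # stand-in for the Chinese T5 checkpoint
model = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
P_SUM, P_EXP = "summarize: ", "expand: "                # illustrative prefixes Ps and Pe
LAMBDA_IDT, LAMBDA_CYC = 0.1, 0.2                       # loss weights λ_idt, λ_cyc (values used later in Section 4.2)

def generate(prefix: str, text: str) -> str:
    ids = tokenizer(prefix + text, return_tensors="pt", truncation=True, max_length=512)
    out = model.generate(**ids, max_new_tokens=256)
    return tokenizer.decode(out[0], skip_special_tokens=True)

def ce_loss(prefix: str, src: str, tgt: str) -> torch.Tensor:
    """Cross-entropy of producing tgt from prefix + src (teacher forcing)."""
    enc = tokenizer(prefix + src, return_tensors="pt", truncation=True, max_length=512)
    labels = tokenizer(tgt, return_tensors="pt", truncation=True, max_length=512).input_ids
    return model(**enc, labels=labels).loss

def unsupervised_step(article: str, summary: str) -> float:
    """One step on unpaired data: `article` and `summary` are NOT a matched pair."""
    # Identity mapping loss: re-expanding an article (or re-summarizing a summary)
    # should reproduce the input unchanged.
    loss_idt = ce_loss(P_EXP, article, article) + ce_loss(P_SUM, summary, summary)

    # Cycle consistency loss: summarize-then-expand (and expand-then-summarize)
    # should reconstruct the original text.
    fake_summary = generate(P_SUM, article)
    fake_article = generate(P_EXP, summary)
    loss_cyc = ce_loss(P_EXP, fake_summary, article) + ce_loss(P_SUM, fake_article, summary)

    loss = LAMBDA_IDT * loss_idt + LAMBDA_CYC * loss_cyc
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```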

4. Experiments
This section presents the experimental details for evaluating the performance of
our method.

4.1. Datasets
We conducted experiments on two datasets: CSL (Chinese Scientific Literature Dataset) [33]
and LCSTS (Large Scale Chinese Short Text Summarization Dataset) [34].
The CSL is the first scientific document dataset in Chinese consisting of 396,209 papers’
meta-information obtained from the National Engineering Research Center for Science and
Technology Resources Sharing Service (NSTR) and spanning from 2010 to 2020. In our
experiments, we used the paper titles and abstracts to generate summary–article pairs for
training and evaluation purposes. To facilitate evaluation and comparison, we chose the
subset of CSL used in the Chinese Language Generation Evaluation (CLGE) [35] for our
experiments. This sub-dataset comprised 3500 computer science papers.
The LCSTS is a large dataset containing 2,108,915 Chinese news articles published on Weibo, the most popular Chinese microblogging website. The data in LCSTS include news titles and contents posted by verified media accounts. As with CSL, we used the news titles and contents to create summary–article pairs for our experiments.
Examples from these datasets can be viewed in Figures A1 and A2.
For the unsupervised training part, our model did not have access to the matched
summary–article pairs. Instead, we intentionally broke the pairs and randomly shuffled the
data, ensuring that the model did not receive matched data during this part of the training.
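Concretely, the unpaired pool can be produced by shuffling the article side and the summary side independently. A minimal sketch of this pre-processing step (function and field names are illustrative) is:

```python
import random

def break_pairs(pairs, seed=42):
    """Turn matched (article, summary) pairs into two independently shuffled pools."""
    articles = [article for article, _ in pairs]
    summaries = [summary for _, summary in pairs]
    rng = random.Random(seed)
    rng.shuffle(articles)
    rng.shuffle(summaries)
    # After shuffling, index i no longer corresponds to a true article-summary pair.
    return list(zip(articles, summaries))
```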

4.2. Implementation Details


The original datasets contained well-paired texts. We used only a fraction of the
paired data during the warm-up stage. The unsupervised part used text samples of the
corresponding dataset without pair information.
Since the original T5 model does not support the Chinese language, we chose Mengzi [36], a high-performing, lightweight (103M parameters) pre-trained language model for Chinese, for our experiments (Mengzi includes a family of pre-trained models, among which we used the T5-based one).
We used the AdamW optimizer to train the model, with the learning rate, β1, β2, ε, and weight decay set to 5 × 10^-5, 0.9, 0.999, 1 × 10^-6, and 0.01, respectively. Moreover, we used a cosine decay schedule for the learning rate. We restricted the length of sentences in each batch to a maximum of 512 tokens and set the batch size to 8. The two consistency losses were weighted with factors of 0.1 for the identity mapping loss and 0.2 for the cycle consistency loss. The higher weight for the cycle consistency loss was due to its direct contribution to the model's ability to transfer texts, which was the primary objective of the task. In contrast, the identity mapping loss helped preserve the characteristics of the input texts but did not directly contribute to the summarization process. All of the experiments were conducted using Python 3.7.12 with PaddlePaddle 2.3 and PyTorch 1.11, running on an NVIDIA Tesla V100 GPU with 32 GB of memory. For clarity, the hyperparameter settings used in our experiments are presented in Table 1.

Table 1. Hyperparameters used to train the model.

Hyperparameter                   Value
Optimizer                        AdamW
Learning rate                    5 × 10^-5
β1                               0.9
β2                               0.999
ε                                1 × 10^-6
Weight decay                     0.01
Learning rate schedule           Cosine decay
Sentence length                  512 tokens
Batch size                       8
Identity mapping loss weight     0.1
Cycle consistency loss weight    0.2
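The optimizer and schedule in Table 1 can be instantiated roughly as follows. This is a sketch assuming PyTorch and the transformers scheduler helpers; the checkpoint name, warm-up steps, and total number of training steps are placeholders not specified above.

```python
import torch
from transformers import T5ForConditionalGeneration, get_cosine_schedule_with_warmup

model = T5ForConditionalGeneration.from_pretrained("t5-small")   # stand-in for the Chinese T5 checkpoint

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=5e-5,                 # learning rate
    betas=(0.9, 0.999),      # β1, β2
    eps=1e-6,                # ε
    weight_decay=0.01,
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,        # warm-up steps are not specified in the paper
    num_training_steps=10_000, # placeholder; depends on dataset size and number of epochs
)

# Inside the training loop, after computing the loss:
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```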

4.3. Results
In this section, we present the results of our proposed approach for automatic text
summarization and compare its performance with baselines on four commonly used evaluation metrics: the ROUGE-1, ROUGE-2, ROUGE-L [37], and BLEU [38] scores. ROUGE
is the acronym for Recall-Oriented Understudy for Gisting Evaluation, and BLEU is the
acronym for BiLingual Evaluation Understudy.
The evaluation metrics play a critical role in assessing the effectiveness of a summarization model. The ROUGE and BLEU scores are widely used to evaluate the quality of
generated summaries. ROUGE measures the overlap between the generated summary
and the reference summary at the n-gram level, whereas BLEU assesses the quality of
the summary by computing the n-gram precision between the generated summary and
the reference summary. By comparing the performance of our proposed model with the
baselines on these four metrics, we can determine the effectiveness of our approach in
automatic text summarization. To provide clarity, we present the formal definitions of these
metrics as follows:

ROUGE-N = [ Σ_{S ∈ {ReferenceSummaries}} Σ_{gram_n ∈ S} Count_match(gram_n) ] / [ Σ_{S ∈ {ReferenceSummaries}} Σ_{gram_n ∈ S} Count(gram_n) ]        (5)

where n stands for the length of the n-gram, gram_n, and Count_match(gram_n) is the maximum number of n-grams co-occurring in a candidate summary and a set of reference summaries. By switching the roles of the reference and the candidate summary, we obtain the recall and precision values; the final ROUGE-N score is then the F1 score. We used ROUGE-1 and ROUGE-2 in our experiments. ROUGE-L is based on the longest common subsequence (LCS) and is calculated in the same way as ROUGE-N, but with the n-gram match replaced by the LCS.
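A simple reference implementation of the clipped n-gram overlap at the core of ROUGE-N is sketched below for a single candidate/reference pair of pre-tokenized texts; it is an illustration rather than the official ROUGE toolkit.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(candidate_tokens, reference_tokens, n=1):
    """ROUGE-N F1 for one candidate against one reference (clipped n-gram overlap)."""
    cand = Counter(ngrams(candidate_tokens, n))
    ref = Counter(ngrams(reference_tokens, n))
    overlap = sum((cand & ref).values())            # Count_match: clipped co-occurrences
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)   # F1, as reported in our tables

print(rouge_n("the cat sat on the mat".split(), "the cat lay on the mat".split(), n=1))
```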
BLEU = BP · exp( Σ_{n=1}^{N} w_n · log p_n )        (6)

where p_n is the proportion of correctly predicted n-grams among all predicted n-grams. Typically, N = 4 orders of n-grams are used with uniform weights w_n = 1/N. BP is the brevity penalty, which penalizes sentences that are too short:

BP = 1 if c > r, and BP = e^(1 − r/c) if c ≤ r        (7)

where c is the predicted length and r is the target length.
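The sketch below implements Equations (6) and (7) directly for a single sentence pair with uniform weights w_n = 1/4. It omits the smoothing used in practice and is only illustrative; production evaluations typically rely on an established package such as sacrebleu.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate_tokens, reference_tokens, max_n=4):
    """Sentence-level BLEU following Equations (6) and (7), without smoothing."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate_tokens, n))
        ref = Counter(ngrams(reference_tokens, n))
        matched = sum((cand & ref).values())            # clipped n-gram matches
        total = max(sum(cand.values()), 1)
        if matched == 0:                                # any zero precision gives BLEU = 0 here
            return 0.0
        log_precisions.append(math.log(matched / total))
    c, r = len(candidate_tokens), len(reference_tokens)
    bp = 1.0 if c > r else math.exp(1 - r / c)          # brevity penalty, Equation (7)
    return bp * math.exp(sum(log_precisions) / max_n)   # uniform weights w_n = 1 / max_n

print(bleu("the cat sat on the mat".split(), "the cat is on the mat".split()))
```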



We conducted experiments on two Chinese datasets: CSL [33], which consists of abstracts from the scientific literature and their corresponding titles, and LCSTS [34], which
consists of Chinese news articles and their corresponding human-written summaries. Due
to the lack of research on semi-supervised Chinese summarization, all baselines used in this
study were fully supervised models and were proposed by the organizers of the original
corresponding datasets. For the CSL dataset, we conducted the supervised part of the
experiment with two fractions of the original dataset: one using 50 paired samples, and the
other using 250, while the remaining data were used for the unsupervised part of our
method. For the LCSTS dataset, which was larger than CSL, we conducted the experiments
with 200 and 1000 paired samples.
We also performed an ablation study in comparison with the T5 model trained with
labeled data only and without our proposed loss functions. The T5 models in Table 2 refer
to the results obtained in these cases.
Table 2 illustrates the performance of the baselines and our proposed approach on the
CSL dataset, while Table 3 shows the results on the LCSTS dataset.

Table 2. CSL results.

Models ROUGE-1 ROUGE-2 ROUGE-L BLEU


ALBERT-tiny 52.75 37.96 48.11 21.63
BERT-base 63.83 51.29 59.76 41.45
BERT-wwm-ext 63.44 51 59.4 41.19
RoBERTa-wwm-ext 63.23 50.74 58.99 41.31
LSTM-seq2seq 46.48 30.48 41.8 22
Original T5 50 34.82 19.93 32.62 3.85
T5 50 with CL (ours) 53.13 41.03 50.85 33.95
Original T5 250 56.45 45.01 53.96 37.48
T5 250 with CL (ours) 59.41 47.93 56.16 38.91

Table 3. LCSTS results.

Models ROUGE-1 ROUGE-2 ROUGE-L BLEU


RNN-Word 17.7 8.5 15.8 -
RNN-Char 21.5 8.9 18.6 -
RNN-context-Word 26.8 16.1 24.1 -
RNN-context-Char 29.9 17.4 27.2 -
mT5 - - 34.8 -
CPM-2 - - 35.88 -
Original T5 200 23.61 12.00 21.80 3.99
T5 200 with CL (ours) 28.28 15.48 25.84 10.56
Original T5 1000 28.01 15.59 25.66 9.51
T5 1000 with CL (ours) 30.09 18.59 29.00 14.74

The results presented in Tables 2 and 3 demonstrate that our method achieved comparable performance to that of early supervised large models and even outperformed them on several metrics, despite using only a lightweight model and a limited amount of data. However, the performance of recent supervised models was still better than that of our semi-supervised method. For instance, on CSL, our best results achieved over 93% of the fully supervised BERT-base's performance on every metric, significantly outperforming LSTM-seq2seq and ALBERT-tiny. Regarding LCSTS, our model achieved better results than the best early fully supervised model, RNN-context-Char, by about 6%, and it reached approximately 81% of the ROUGE-L of recent models, such as mT5 and CPM-2. The experimental results confirm the effectiveness of our proposed approach in automatic text summarization.
In addition to comparing our results with those of other models, it is important to
highlight the comparison between the results of our models and that of the original T5
models without unsupervised learning. This comparison sheds light on the effectiveness
of incorporating unsupervised learning techniques in our approach, as evidenced by
the improved summarization performance, particularly when well-paired data or “gold
batches” were limited. Our semi-supervised method notably improved the performance
across every metric compared to the fully supervised T5 model trained on a limited
amount of labeled data. When labeled text pairs were extremely rare, the proposed method significantly improved the performance on every metric, especially the BLEU score (from 3.85 to 33.95 on CSL and from 3.99 to 10.56 on LCSTS). As the number of gold batches increased, the original T5 achieved better results, while our method still improved its performance. This demonstrates the effectiveness of our approach in leveraging the information contained in unlabeled data.
The present study showcases a portion of the experimental findings, which are visually
presented in Figures A1 and A2.

5. Conclusions
This study presents a novel semi-supervised learning method for abstractive summarization. To achieve this, we employed a T5-based model to process texts and utilized an identity mapping constraint and a cycle consistency constraint to exploit the information contained in unlabeled data. The identity mapping constraint ensures that the input and output of the model have a similar representation, whereas the cycle consistency constraint ensures that the input text can be reconstructed from the output summary. Through this approach, we aim to improve the generalization ability of the model by leveraging unlabeled data while requiring only a limited number of labeled examples.
A key contribution of this study is the successful application of CycleGAN’s training
process and loss functions to NLP tasks, particularly text summarization. Our method
demonstrates significant advantages in addressing the problem of limited annotated data
and showcases its potential for wide applicability in a multilingual context, especially when
handling Chinese documents. Despite not modifying the model architecture, our approach
effectively leverages the strengths of the original T5 model while incorporating the benefits
of semi-supervised learning.
Our proposed method was evaluated on various datasets, and the experimental results
demonstrate its effectiveness in generating high-quality summaries with a limited number
of labeled examples. In addition, our method employs lightweight models, making it
computationally efficient and practical for real-world applications.
Our approach can be particularly useful in scenarios where obtaining large amounts of
labeled data is challenging, such as when working with rare languages or specialized domains.
It is worth noting that our proposed method can be further improved by using more advanced pre-training techniques or by fine-tuning on larger datasets. Additionally, exploring different loss functions and architectures could also lead to better performance.
In summary, our study introduces a novel semi-supervised learning approach for
abstractive summarization, which leverages the information contained in unlabeled data
and requires only a few labeled examples. The proposed approach offers a practical and
efficient method for generating high-quality summaries, and the experimental results
demonstrate its effectiveness on various datasets.

6. Limitations and Future Work


In this section, we discuss the limitations of our proposed T5-based abstractive summarization method and suggest directions for future work to address these limitations.
Semi-supervised training requirement: Our model cannot be trained entirely in an
unsupervised manner. Instead, it requires a small amount of labeled data for a “warm-up”
in a semi-supervised training setting. In our experiments, we found that the performance
of the model trained in a completely unsupervised fashion was inferior to that of the
semi-supervised approach. Future work could explore ways to reduce the reliance on
labeled data or investigate alternative unsupervised training techniques to improve the model's performance.
Room for improvement in model performance: Although our model can match the
performance of some earlier supervised training models, there is still a gap between its
performance and that of more recent state-of-the-art models. Future research could focus on
refining the model architecture, incorporating additional contextual information, or exploring novel training strategies to further enhance the performance of our proposed method.
Domain adaptability: The adaptability of our model to other domains remains to
be tested through further experimentation. Our current results demonstrate the model’s
effectiveness on specific datasets, but its generalizability to different contexts and domains
is still an open question. Future work could involve testing the model on a diverse range
of datasets and languages, as well as developing techniques for domain adaptation to
improve its applicability across various settings.

Author Contributions: Conceptualization, M.W.; methodology, M.W.; software, M.W.; validation, P.X. and Y.D.; formal analysis, M.W.; investigation, M.W.; resources, P.X.; data curation, Y.D.; writing—original draft preparation, M.W.; writing—review and editing, M.W. and X.H.; visualization, M.W.; supervision, X.H.; project administration, X.H.; funding acquisition, X.H. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the key R&D project of the Ministry of Science and Technology
of the People’s Republic of China with grant number 2020-JCJQ-ZD-079-00.
Institutional Review Board Statement: Not applicable.
Data Availability Statement: The datasets and baselines utilized in our experiments are available at the following URLs: https://github.com/ydli-ai/CSL and http://icrc.hitsz.edu.cn/Article/show/139.html. The codes and outputs of our proposed model can also be accessed at https://github.com/StarsMoon/ATS (accessed on 20 April 2023).
Conflicts of Interest: The authors declare no conflict of interest.

Appendix A

Figure A1. Some experimental results on CSL with human translation.



Figure A2. Some experimental results on LCSTS with human translation.

References
1. Yao, K.; Zhang, L.; Luo, T.; Wu, Y. Deep reinforcement learning for extractive document summarization. Neurocomputing 2018,
284, 52–62. [CrossRef]
2. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. Adv. Neural Inf. Process. Syst. 2014,
27, 3104–3112.
3. Chopra, S.; Auli, M.; Rush, A.M. Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings
of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, San Diego, CA, USA, 12–17 June 2016; pp. 93–98.
4. Hou, L.; Hu, P.; Bei, C. Abstractive document summarization via neural model with joint attention. In Proceedings of the National
CCF Conference on Natural Language Processing and Chinese Computing, Dalian, China, 8–12 November 2017; Springer:
Berlin/Heidelberg, Germany, 2017; pp. 329–338.
5. Nayeem, M.T.; Fuad, T.A.; Chali, Y. Neural diverse abstractive sentence compression generation. In Proceedings of the European
Conference on Information Retrieval, Cologne, Germany, 14–18 April 2019; pp. 109–116.
6. Ferreira, R.; Cabral, L.; Lins, R.D.; Silva, G.; Favaro, L. Assessing sentence scoring techniques for extractive text summarization.
Expert Syst. Appl. 2013, 40, 5755–5764. [CrossRef]
7. Radev, D.R. LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. J. Qiqihar Jr. Teach. Coll. 2004, 22, 2004.
8. Alguliev, R.M.; Aliguliyev, R.M.; Isazade, N.R. Multiple documents summarization based on evolutionary optimization algorithm.
Expert Syst. Appl. 2013, 40, 1675–1689. [CrossRef]
9. Conroy, J.M.; O’Leary, D.P. Text summarization via hidden Markov models. In Proceedings of the 24th Annual International
ACM SIGIR Conference on Research and Development in Information Retrieval, New Orleans, LA, USA, 13 September 2001.
10. Mihalcea, R.; Tarau, P. TextRank: Bringing Order into Texts. In Proceedings of the 2004 Conference on Empirical Methods in
Natural Language Processing, 20 October 2004.
11. Bollegala, D.T.; Okazaki, N.; Ishizuka, M. A machine learning approach to sentence ordering for multidocument summarization
and its evaluation. In Proceedings of the International Conference on Natural Language Processing, Jeju Island, Republic of
Korea, 11–13 October 2005.
12. Baralis, E.; Cagliero, L.; Mahoto, N.; Fiori, A. GRAPHSUM: Discovering correlations among multiple terms for graph-based
summarization. Inf. Sci. 2013, 249, 96–109. [CrossRef]
13. Cheng, J.; Lapata, M. Neural Summarization by Extracting Sentences and Words. arXiv 2016, arXiv:1603.07252.
14. Nallapati, R.; Zhai, F.; Zhou, B. SummaRuNNer: A Recurrent Neural Network based Sequence Model for Extractive Summa-
rization of Documents. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February
2016.
15. Anand, D.; Wagh, R. Effective Deep Learning Approaches for Summarization of Legal Texts. J. King Saud Univ.-Comput. Inf. Sci.
2019, 34, 2141–2150. [CrossRef]
16. Mohsen, F.; Wang, J.; Al-Sabahi, K. A hierarchical self-attentive neural extractive summarizer via reinforcement learning
(HSASRL). Appl. Intell. 2020, 50, 2633–2646. [CrossRef]
17. Rush, A.M.; Chopra, S.; Weston, J. A Neural Attention Model for Abstractive Sentence Summarization. arXiv 2015,
arXiv:1509.00685.
18. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need.
arXiv 2017, 30, 5998–6008.
19. Zhang, H.; Gong, Y.; Yan, Y.; Duan, N.; Xu, J.; Wang, J.; Gong, M.; Zhou, M. Pretraining-Based Natural Language Generation for
Text Summarization. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), Hong
Kong, China, 21 November 2019.
20. Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-training for Natural
Language Generation, Translation, and Comprehension. arXiv 2019, arXiv:1910.13461.
21. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer
Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 5485–5551.
22. Ban, H. Stylistic Characteristics of English News. In Proceedings of the Japan-Korea Joint Symposium on Emotion & Sensibility,
Daejeon, Republic of Korea, 4–5 June 2004.
23. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial
Nets. In Proceedings of the Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014.
24. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In
Proceedings of the International Conference on Computer Vision, Venice, Italy, 22–29 October 2017.
25. Wu, R.; Gu, X.; Tao, X.; Shen, X.; Tai, Y.W.; Jia, J.I. Landmark Assisted CycleGAN for Cartoon Face Generation. arXiv 2019,
arXiv:1907.01424.
26. Bo, C.; Zhang, Q.; Pan, S.; Meng, L. Generating Handwritten Chinese Characters using CycleGAN. In Proceedings of the 2018
IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018.
27. Gorti, S.K.; Ma, J. Text-to-Image-to-Text Translation using Cycle Consistent Adversarial Networks. arXiv 2018, arXiv:1808.04538.
28. Harms, J.; Lei, Y.; Wang, T.; Zhang, R.; Zhou, J.; Tang, X.; Curran, W.J.; Liu, T.; Yang, X. Paired cycle-GAN-based image correction
for quantitative cone-beam computed tomography. Med. Phys. 2019, 46, 3998–4009. [CrossRef] [PubMed]
29. Kaneko, T.; Kameoka, H. CycleGAN-VC: Non-parallel Voice Conversion Using Cycle-Consistent Adversarial Networks. In
Proceedings of the 2018 26th European Signal Processing Conference (EUSIPCO), Roma, Italy, 3–7 September 2018.
30. Kaneko, T.; Kameoka, H.; Tanaka, K.; Hojo, N. CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion.
In ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 9
April 2019.

31. Kaneko, T.; Kameoka, H.; Tanaka, K.; Hojo, N. CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-spectrogram
Conversion. arXiv 2020, arXiv:2010.11672.
32. Bishop, C. Pattern Recognition and Machine Learning; Stat Sci; Springer: Berlin/Heidelberg, Germany, 2006.
33. Li, Y.; Zhang, Y.; Zhao, Z.; Shen, L.; Liu, W.; Mao, W.; Zhang, H. CSL: A Large-scale Chinese Scientific Literature Dataset. In
Proceedings of the 29th International Conference on Computational Linguistics, Gyeongju, Republic of Korea, 12–17 October
2022; pp. 3917–3923.
34. Hu, B.; Chen, Q.; Zhu, F. LCSTS: A Large Scale Chinese Short Text Summarization Dataset. arXiv 2015, arXiv:1506.05865.
35. CLUEbenchmark. Chinese Language Generation Evaluation. 2020. Available online: https://github.com/CLUEbenchmark/CLGE (accessed on 8 June 2023).
36. Zhang, Z.; Zhang, H.; Chen, K.; Guo, Y.; Hua, J.; Wang, Y.; Zhou, M. Mengzi: Towards Lightweight Yet Ingenious Pre-Trained Models for Chinese. 2021. Available online: http://xxx.lanl.gov/abs/2110.06696 (accessed on 8 June 2023).
37. Lin, C.Y. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out; Association for
Computational Linguistics: Barcelona, Spain, 2004; pp. 74–81.
38. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A Method for Automatic Evaluation of Machine Translation. In Proceedings of
the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002; pp. 311–318.
[CrossRef]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
