


Open-ended Long Text Generation via Masked Language Modeling

Xiaobo Liang∗ Zecheng Tang∗ Juntao Li† Min Zhang


Soochow University
{xbliang3, zctang}@stu.suda.edu.cn,
{ljt,minzhang}@suda.edu.cn

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 223–241, July 9–14, 2023. ©2023 Association for Computational Linguistics.

∗ Equal Contribution    † Corresponding Author

Abstract

Pre-trained autoregressive (AR) language models such as BART and GPTs have dominated Open-ended Long Text Generation (Open-LTG). However, the AR nature decreases inference efficiency as the generation length grows, which hinders their application in Open-LTG. To improve inference efficiency, we alternatively explore the potential of pre-trained masked language models (MLMs) together with a representative iterative non-autoregressive (NAR) decoding strategy for Open-LTG. Our preliminary study shows that pre-trained MLMs can merely generate short text and will collapse for long text modeling. To enhance the long text generation capability of MLMs, we introduce two simple yet effective strategies for the iterative NAR model: dynamic sliding window attention (DSWA) and linear temperature decay (LTD). They alleviate the long-distance collapse problem and enable longer text generation with a flexible trade-off between performance and inference speedup. Experiments on storytelling and multi-paragraph opinionated article writing tasks show that pre-trained MLMs can achieve a 3× to 13× speedup with better performance than strong AR models. Our code is available at https://github.com/dropreg/OpenLTG-MLM.

Model                   Type  Iter  Tokens/s
BART base               AR    -     151.3
BART base + Planning†   AR    -     5.8
BERT-CRF†               NAR   0     2,597.4
RoBERTa base            NAR   0     1,561.2
                              1     1,068.9
                              4     505.2

Table 1: Inference speed of each model with a single GPU (NVIDIA A100 40GB). For a fair comparison, we force all models to generate 200 tokens. The models labeled with † are implemented with the Hugging Face platform, while the rest are implemented with Fairseq.

1 Introduction

Pre-trained language models (PLMs) like BART (Lewis et al., 2020) and GPTs (Radford et al.; Brown et al., 2020) have achieved remarkable progress in Open-LTG. By modeling language from left to right, they can autoregressively "create" fluent and grammatical content. With the further enhancement of planning strategies (Hua and Wang, 2020; Hu et al., 2022) or high-level representation learning (Guan et al., 2021a), pre-trained AR language models can achieve promising Open-LTG. However, the low inference efficiency of AR decoding impedes their usability in real-world applications. Table 1 presents the inference speed of a few typical AR language models. BART (Lewis et al., 2020) requires at least 1.3 seconds to generate a story with 200 tokens on a powerful NVIDIA A100 GPU, and extra planning (Hua and Wang, 2020) makes the inference process even slower (more than 30 seconds to create a 200-token story). In great contrast with AR models, NAR models (e.g., BERT-CRF (Su et al., 2021)) can generate more than 12 stories of the same length within one second, but their effectiveness for open-ended long text generation has not been proven yet.

The high inference efficiency of NAR models comes at the sacrifice of output dependency modeling, since all positions are generated in parallel (Xiao et al., 2022). Thus, NAR models are mainly explored and utilized for text generation tasks that provide adequate input information for predicting each output token at different positions, plus extra correlations to constrain the generation process, e.g., neural machine translation (Gu et al., 2018; Huang et al., 2022), summarization (Qi et al., 2021; Agrawal and Carpuat, 2022), sentence compression (Su et al., 2021), dialogue generation (Zou et al., 2021), and constrained story-ending generation (Yang et al., 2021).
To the best of our knowledge, none of the existing research explores Open-LTG with NAR models, particularly based on pre-trained MLMs.

We fill this gap by first conducting a preliminary study to calibrate the potential and limitations of a pre-trained MLM, i.e., RoBERTa (Liu et al., 2019)†, on two story generation corpora, i.e., ROCStories (ROC) (Mostafazadeh et al., 2016) and WritingPrompts (WP) (Fan et al., 2018). To achieve conditional generation, we simply use RoBERTa as both the encoder and the decoder, with mixed attention (He et al., 2018) serving as the encoder-decoder cross-attention. Through experiments, we found that: (1) pre-trained MLMs can achieve competitive performance in the iterative NAR fashion for open-ended short text generation (e.g., a paragraph with around 40 tokens); (2) pre-trained MLMs fail to model Open-LTG (with about 140 tokens on average) and generate uninformative content consisting of high-frequency and repeated tokens (e.g., "." and ","). Furthermore, we offer three possible reasons, concerning the attention mechanism of MLMs and the inference strategy, to explain the collapse of the iterative NAR model based on pre-trained MLMs in the Open-LTG scenario.

Inspired by the above observations, we introduce two improvement strategies, Dynamic Sliding Window Attention (DSWA) and Linear Temperature Decay (LTD), to maintain more informative context in iterative NAR generation. As a result, iterative NAR models based on pre-trained MLMs can generate much longer text than under the vanilla setting. Experiments on two Open-LTG tasks (i.e., storytelling and multi-paragraph opinionated article writing) with four widely-used datasets demonstrate that the pre-trained MLM can achieve better performance (BLEU score, ROUGE score, BERT score, and perplexity) than multiple strong AR models without extra post-training, structure modification, or additional model parameters. Importantly, our approach speeds up the inference process thanks to its non-autoregressive properties, making pre-trained MLMs a promising candidate for the Open-LTG community. RoBERTa base achieves a 3× to 13× speedup with better performance than the competitive BART.

† MLMs can achieve iterative NAR generation with the mask-predict inference strategy (Ghazvininejad et al., 2019).

2 Related Work

Long Text Generation  Text generation tasks can be classified into two categories: directed generation and open-ended generation. Directed generation (Sutskever et al., 2014; Li et al., 2015; Vaswani et al., 2017) in long text scenarios has a source longer than the target and is constrained by the source sequence, e.g., neural machine translation and summarization. Work on these tasks mainly aims to address the quadratic growth of the memory and computation requirements of the self-attention mechanism. The open-ended generation task (Guo et al., 2018; Tan et al., 2020; Goldfarb-Tarrant et al., 2020; Hua and Wang, 2020; Orbach and Goldberg, 2020; Hu et al., 2022) aims to generate content with more freedom and has recently become a promising research direction. Previous works have explored multiple generation strategies to produce high-quality and fluent text, e.g., planning then generating (Guo et al., 2018; Tan et al., 2020; Goldfarb-Tarrant et al., 2020; Hua and Wang, 2020; Orbach and Goldberg, 2020; Hu et al., 2022) and introducing external knowledge (Guan et al., 2020; Xu et al., 2020). Although these strategies enable the model to achieve significant advances, time consumption remains a critical issue that hinders their usage in real-world applications (Guan et al., 2021a; Tan et al., 2020).

Iterative Non-autoregressive Generation  Non-autoregressive (NAR) models break the left-to-right sequential dependency for parallel text generation (Gu et al., 2018; Guo et al., 2020; Saharia et al., 2020). Furthermore, iterative NAR models (Lee et al., 2018; Gu et al., 2019; Chi et al., 2021) can achieve performance comparable to AR models. The typical CMLM model (Ghazvininejad et al., 2019) generates fluent results conditioned on the predictions of the previous iteration instead of on previous tokens:

    P(Y_t | X) = P(Y_t | Y_{t-1}, X)    (1)

Benefiting from this, the iterative NAR model is more flexible than the AR model and can easily generate consistent and controllable text at each iteration step. To the best of our knowledge, the iterative NAR model has never been used to solve open-ended generation. In particular, we investigate its usability in the long text scenario, i.e., target lengths between 100 and 400, which is still under-explored even in directed generation tasks.
Figure 1: The overview of MLM for text generation. (We concatenate the hidden states of X and Y as the key and value of the mixed-attention mechanism.)

3 Preliminary Study

We first present the training and inference paradigm for utilizing pre-trained MLMs, e.g., BERT or RoBERTa, for Open-LTG (§3.1). Then, we study the significant collapse problem in the long text generation scenario by conducting preliminary experiments on two datasets with different target lengths (§3.2). Finally, we investigate the reasons for the above issues with an exhaustive case study and exploration tests to motivate our method design (§3.3), where the model generates text in a non-autoregressive manner to speed up inference.

3.1 Text Generation via Pre-trained MLMs

Pre-trained MLMs are typically used as encoders to extract sentence representations rather than to generate texts. Previous works (Dong et al., 2019; Wang et al., 2019) have indicated that the MLM encoder can support text generation tasks via attention masks or Gibbs sampling. In contrast, we introduce mixed attention and parameter sharing to the encoder-based model to solve sequence-to-sequence tasks, as shown in Figure 1.

Model Training  Given the parallel text generation dataset D = {(X, Y)}^{|D|}, we feed the source X into the MLM encoder to obtain the representation H^l_src of the l-th layer. Concretely, each layer comprises two sub-layers: one self-attention layer and one feed-forward layer:

    H̄^l_src = Self-ATTN(H^{l-1}_src) + H^{l-1}_src,
    H^l_src = FFN(H̄^l_src) + H̄^l_src.    (2)

Then, we randomly mask Y = {y_1, y_2, · · · , y_{|Y|}} to obtain the corrupted target Y_M = {y_1, m_2, · · · , m_{|Y|}} (m is the mask token "<mask>"). As before, we obtain the representation H^l_tgt with the shared-parameter MLM encoder and then try to recover the masked sequence, where the mixed-attention mechanism (He et al., 2018) is applied to aggregate the source H^L_src and the target H^l_tgt:

    H̄^l_tgt = Mixed-ATTN(H^{l-1}_tgt, H^L_src) + H^{l-1}_tgt,
    H^l_tgt = FFN(H̄^l_tgt) + H̄^l_tgt.    (3)

Mixed attention does not break the original attention mechanism: it only uses the target hidden states as the query vector and the concatenation of the source and target hidden states as the key and value. It is worth noting that this approach is applicable to Transformer encoder models without additional parameters.

Specifically, we uniformly mask 1 to n (the target length) tokens of Y for model training. The training objective is to minimize the conditional MLM loss, as in the pre-training stage:

    L_MLM = − Σ_{i=1}^{M} log P(y_i | X, Y_M),
    P(y_j | X, Y_M) = exp(u_tgt / T) / Σ_{|u_tgt|} exp(u_tgt / T),    (4)

where M is the number of masked tokens, u_tgt is the output logit, and T is the temperature used to re-estimate the final probability.
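To make the mixed-attention computation concrete, the sketch below shows one way to realize Eq. (2)–(4) in PyTorch. It is a minimal illustration under our own assumptions, not the released implementation: the class name MixedAttentionLayer, the tensor names h_tgt and h_src_top, and the layer-norm placement are ours, and the parameter sharing with the pre-trained RoBERTa weights is omitted.

    import torch
    import torch.nn as nn

    class MixedAttentionLayer(nn.Module):
        """Hypothetical target-side layer (cf. Eq. 3): the query comes from the
        target hidden states; key/value are the concatenation of the top-layer
        source states and the current target states."""

        def __init__(self, d_model: int = 768, n_heads: int = 12, d_ffn: int = 3072):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ffn = nn.Sequential(nn.Linear(d_model, d_ffn), nn.GELU(),
                                     nn.Linear(d_ffn, d_model))
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)

        def forward(self, h_tgt, h_src_top, attn_mask=None):
            # Key/value = [source (last layer) ; target (current layer)].
            kv = torch.cat([h_src_top, h_tgt], dim=1)
            attn_out, _ = self.attn(h_tgt, kv, kv, attn_mask=attn_mask)
            h = self.norm1(h_tgt + attn_out)      # residual branch of Eq. (3)
            return self.norm2(h + self.ffn(h))    # FFN + residual of Eq. (3)

    # Toy usage: h_src_top plays the role of H^L_src, h_tgt the masked target states.
    layer = MixedAttentionLayer()
    h_src_top = torch.randn(2, 30, 768)   # (batch, source length, hidden size)
    h_tgt = torch.randn(2, 50, 768)       # (batch, target length, hidden size)
    out = layer(h_tgt, h_src_top)         # (2, 50, 768)

During training, the final-layer target states would be projected to the vocabulary and the temperature-scaled cross-entropy of Eq. (4) applied only on the masked positions.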
Model Inference  We use an iterative refinement strategy to generate text, like CMLM (Ghazvininejad et al., 2019). In particular, we use the fully masked sequence {m_1, m_2, · · · , m_n} to initialize the target sequence and predict all masked tokens at the first step. Then, we iteratively regenerate the low-confidence tokens at the subsequent iteration steps to obtain better performance. For Open-LTG, we utilize the nucleus sampling (Holtzman et al., 2019) decoding strategy instead of beam search.
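The refinement loop can be sketched as follows. This is a simplified mask-predict procedure under our own assumptions: model is a placeholder callable returning per-position vocabulary logits given the source and the partially filled target, the linear re-masking schedule follows CMLM, and the SMART-style update of all tokens described in Section 4.3 is reduced here to updating only the masked positions.

    import torch

    def top_p_sample(logits, top_p=0.9, temperature=1.0):
        """Nucleus sampling over the last dimension; returns (token ids, their probs)."""
        probs = torch.softmax(logits / temperature, dim=-1)
        sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
        cumulative = sorted_probs.cumsum(dim=-1)
        keep = cumulative - sorted_probs < top_p            # nucleus (always keeps top-1)
        sorted_probs = sorted_probs * keep
        sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
        choice = torch.multinomial(sorted_probs.view(-1, sorted_probs.size(-1)), 1)
        choice = choice.view(*sorted_probs.shape[:-1], 1)
        tokens = sorted_idx.gather(-1, choice).squeeze(-1)
        confidences = probs.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)
        return tokens, confidences

    @torch.no_grad()
    def mask_predict(model, src_tokens, tgt_len, mask_id, iterations=8, top_p=0.9):
        tgt = torch.full((src_tokens.size(0), tgt_len), mask_id, device=src_tokens.device)
        confidences = torch.zeros_like(tgt, dtype=torch.float)
        for t in range(iterations):
            logits = model(src_tokens, tgt)                 # (batch, tgt_len, vocab)
            new_tokens, new_conf = top_p_sample(logits, top_p)
            still_masked = tgt.eq(mask_id)
            tgt = torch.where(still_masked, new_tokens, tgt)
            confidences = torch.where(still_masked, new_conf, confidences)
            if t < iterations - 1:
                # Re-mask the lowest-confidence positions for the next pass.
                n_mask = int(tgt_len * (iterations - t - 1) / iterations)
                if n_mask > 0:
                    remask = confidences.topk(n_mask, dim=-1, largest=False).indices
                    tgt.scatter_(1, remask, mask_id)
                    confidences.scatter_(1, remask, 0.0)
        return tgt

The helper top_p_sample is reused later when linear temperature decay rescales the logits (Section 4.2).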
Length Prediction  It is necessary to obtain the target length to initialize the fully masked sequence as model input before inference. Specifically, we provide two strategies: 1) Fixed Length, which initializes the target length according to the average length of the validation set or human experience; 2) Prediction Module, which applies a mean-pooling layer followed by one classification layer to H^L_src to predict the target length:

    P(L_tgt | X) = Softmax(W_L(Mean-Pooling(H^L_src))),    (5)

where L_tgt is the target length and W_L is a learnable parameter. We further adjust L_tgt by a specific offset, which is a parameter tuned on the validation dataset.
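A minimal sketch of such a prediction module is given below, treating the target length as a class index up to a maximum length. The class name LengthPredictor and the padding-mask handling are our own choices, and the offset correction mirrors the validation-set adjustment described above.

    import torch
    import torch.nn as nn

    class LengthPredictor(nn.Module):
        """Mean-pool the top-layer source states and classify the target length (Eq. 5)."""

        def __init__(self, d_model: int = 768, max_len: int = 512):
            super().__init__()
            self.proj = nn.Linear(d_model, max_len)   # plays the role of W_L

        def forward(self, h_src_top, src_padding_mask=None):
            # src_padding_mask: bool tensor, True at padded positions (optional).
            if src_padding_mask is not None:
                h_src_top = h_src_top.masked_fill(src_padding_mask.unsqueeze(-1), 0.0)
                denom = (~src_padding_mask).sum(dim=1, keepdim=True).clamp(min=1)
                pooled = h_src_top.sum(dim=1) / denom
            else:
                pooled = h_src_top.mean(dim=1)
            return torch.log_softmax(self.proj(pooled), dim=-1)

    # At inference time the argmax length can be shifted by a validation-set offset,
    # e.g. roughly -20 for WP in the paper's setting:
    # length = length_log_probs.argmax(dim=-1) + offset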
Figure 2: The iterative inference process of typical good and bad cases, randomly sampled from ROC and WP. The histogram refers to the output distribution (Iter=1) across candidate tokens for a randomly picked position.

Data  Model    B-1    B-2    R-1    R-2   Dist  Rep
ROC   BART     30.06  14.37  22.37  2.42  3.93  79.07
      RoBERTa  30.89  14.36  25.01  3.48  5.24  73.42
WP    BART     29.69  10.26  24.34  2.20  0.47  90.15
      RoBERTa  15.80  5.21   10.08  0.84  8.48  17.08

Table 2: The performance on WP and ROC.

3.2 Extensive Trials

Study Settings  We use the WritingPrompts (WP) and ROCStories (ROC) datasets to validate whether pre-trained MLMs can work well on Open-LTG tasks. In particular, these two datasets have different target lengths: the average target length of WP is 140 tokens and that of ROC is 40 tokens; more details are given in Section 5 and Appendix A. We choose RoBERTa base (Liu et al., 2019) as our backbone model and use BLEU, ROUGE, Distinct, and Lexical Repetition metrics for evaluation. During inference, we set the nucleus sampling hyper-parameter top-p=0.9 and temperature T=1.0, and limit the maximum iteration steps to 6 for ROC and 8 for WP.

Results  As shown in Table 2, the RoBERTa base model obtains performance comparable to BART on the ROC dataset. However, the generation quality significantly decreases on the WP dataset, which involves much longer targets. Specifically, most of the generated results are made up of duplicated function words or punctuation, e.g., "it", "to", "the", and ".", which makes the model outputs unreadable and meaningless. One intuitive question is: what causes the collapse problem in Open-LTG when using pre-trained MLMs?

3.3 Analysis and Possible Improvements

We show a typical good case and a bad case in Figure 2, randomly selected from the ROC and WP datasets respectively, to demonstrate the generation process. At each iterative refinement step of the bad case, the informative tokens are replaced by the placeholder token "<mask>" and then by function words at subsequent steps, so the model is unable to generate fluent results like the good case. According to this observation, we provide some possible explanations for the aforementioned collapse issue:

1) The most intuitive reason is that function words often sit at the front of the output distribution and dominate the high-probability region, making informative tokens hard to sample. The output distribution trained with the ROC dataset contains more prompt-related tokens than WP, e.g., "swim" and "water" appear in the top-50 candidates of the ROC output, as shown in Figure 2 (distribution histogram). Worse still, function words dominate the high-probability region (from 35% to 45%) for the bad case and lead to a terrible initialization at the first iteration step.

2) The iterative refinement mechanism depends on the token confidence of the generated sequences, and it is easier for low-confidence but informative tokens to be masked. In fact, the iterative refinement mechanism was designed for directed generation tasks, e.g., neural machine translation or summarization, which usually apply the argmax operation to sample results, so comparing confidences across iterations is reasonable there. Nevertheless, we use the nucleus sampling strategy for inference in Open-LTG, which causes low-confidence tokens to be masked with high priority.
3) The massive number of absent context tokens leads to a more serious multi-modality problem in long text generation at early iteration steps. As a result, the model is inclined to generate duplicated tokens due to the multi-modal output distribution. Although iterative refinement can provide additional context to alleviate this issue, the model still cannot generate the expected results. A possible explanation is that the self-attention layer needs context tokens as key-value pairs to calculate each token's representation. Unfortunately, the massive number of uninformative mask tokens ("<mask>") in the context makes the model collapse steadily worsen in the following iteration steps. Thus, we utilize a recurrent generation mechanism for model training and inference to reduce the context dependency, which can also flexibly control the maximum length of the generated sequence (please refer to Appendix B for more details about the model architecture and experiments). The results are shown in Table 3. We observe that the model gradually improves its performance as the number of recurrent steps increases, demonstrating that informative context dependency is the implicit reason for the model collapse.

Data  Recurrent  B-1    B-2    R-1    R-2   Dist   Rep
WP    1          15.80  5.21   10.08  0.84  17.08  94.25
      2          22.42  8.70   16.81  2.14  34.82  83.87
      4          26.91  10.67  21.32  2.81  50.32  35.93

Table 3: The performance of different recurrent steps.

Improvements  Based on the above analysis and findings, we categorize these critical factors into two types: defects of the attention mechanism and inappropriate inference strategies. In particular, we believe that each token should not pay attention to all context information; most tokens only need their neighbor tokens' information to represent the hidden states and predict the results. Therefore, we change the self-attention mechanism of the pre-trained MLMs so that each token attends only to restricted neighbors. Besides, we adjust the confidence scores of the output distributions to keep informative tokens in subsequent iteration steps instead of having them masked.

4 Method

In this section, we propose two simple yet effective strategies, one for the attention mechanism and one for inference, to mitigate the model collapse problem: Dynamic Sliding Window Attention (DSWA) and Linear Temperature Decay (LTD). These designs do not break the MLM paradigm, so they can flexibly adapt to pre-trained models.

4.1 Dynamic Sliding Window Attention

Figure 3: The overview of sliding window attention.

We first introduce the sliding window mechanism (Beltagy et al., 2020) into the self-attention layer to adjust each token's attention pattern, which also ensures that the top layer's token representations have a large receptive field, similar to a CNN (Wu et al., 2018). Figure 3 illustrates the attention mask of the mixed-attention layer of pre-trained MLMs. It is worth noting that the key-value pairs consist of two parts: the source representation of the last layer (green background in the figure) and the target representation of the current layer (yellow background):

    H̄^l_tgt = Mixed-ATTN(Win(H^{l-1}_tgt), H^L_src) + H^{l-1}_tgt,
    H^l_tgt = FFN(H̄^l_tgt) + H̄^l_tgt,    (6)

where the operation Win(·) employs a fixed-size window to select the neighbor token representations. Meanwhile, the query can attend to all source-sequence hidden states and to the target-sequence hidden states inside the window, stemming the impact of the massive absent context.
Dynamic Schedule  Intuitively, it is not essential to use a fixed receptive field for every layer; e.g., the top layer may need a reduced receptive field to perform prediction. Thus, we propose a dynamic schedule strategy for the inference stage to adjust the window size S_win of each layer:

    S_win = max(α_min, ((L − i) / L) ∗ α_max) ∗ S_fix,    (7)

where i is the current layer number, L is the number of layers of the pre-trained MLM encoder, S_fix is the fixed window size used for model training, and α_min and α_max are the lower and upper bounds of the coefficient hyper-parameter, selected from [0, 1]. With this strategy, we alleviate the multi-modality problem by restricting the model to attend to tokens in the window instead of the whole sequence, thus degenerating the multi-modal distribution into a uni-modal one. As a bonus, the top-p candidates of the output distribution contain more informative tokens.
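A sketch of the windowed mixed-attention mask and the per-layer schedule of Eq. (7) is shown below; the helper names are hypothetical, and the mask is written in the additive form expected by the attention layer sketched in Section 3.1.

    import torch

    def dynamic_window_size(layer_idx: int, num_layers: int, s_fix: int = 64,
                            alpha_min: float = 0.125, alpha_max: float = 0.75) -> int:
        """Per-layer window size following Eq. (7): lower layers see a wider window."""
        coeff = max(alpha_min, (num_layers - layer_idx) / num_layers * alpha_max)
        return max(1, int(coeff * s_fix))

    def mixed_attention_mask(src_len: int, tgt_len: int, window: int) -> torch.Tensor:
        """Additive mask over key/value positions ordered as [source ; target].

        Every target query attends to all source positions but only to target
        positions inside the sliding window (Eq. 6).
        """
        mask = torch.full((tgt_len, src_len + tgt_len), float("-inf"))
        mask[:, :src_len] = 0.0                      # unrestricted access to the source
        pos = torch.arange(tgt_len)
        inside = (pos[None, :] - pos[:, None]).abs() <= window // 2
        tgt_block = torch.full((tgt_len, tgt_len), float("-inf"))
        tgt_block[inside] = 0.0
        mask[:, src_len:] = tgt_block
        return mask

    # Example: window sizes for the 12 layers of RoBERTa base at inference time.
    sizes = [dynamic_window_size(i, num_layers=12) for i in range(1, 13)]
    attn_mask = mixed_attention_mask(src_len=30, tgt_len=50, window=sizes[0])

At training time a single fixed window S_fix is used for every layer; the dynamic schedule only shrinks the window towards the top layers during inference.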
4.2 Linear Temperature Decay

To further improve the effectiveness of sampling, we use confidence-based iterative refinement and adjust the temperature with a linear schedule:

    P(y_i | X, Win(Y_M)) = exp(u_l / T) / Σ_{l′} exp(u_{l′} / T),
    T = β ∗ (1 − t / T_max),    (8)

where β is a hyper-parameter, t ∈ {0, · · · , T_max} is the current iteration step, and T_max is the maximum iteration step. The output distribution is flattened when T > 1 and becomes sharp when T < 1. Therefore, by applying this strategy, we push the distribution from peaked to flat in the earlier iteration steps and let it return from flat to peaked in the later steps. This process is illustrated in Figure 4.

Figure 4: Re-estimate the output distribution by LTD.
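The schedule itself is a one-liner; the sketch below shows how it would plug into the sampling step of the refinement loop from Section 3.1. The small floor value is our addition so the temperature never reaches exactly zero at the final step.

    def linear_temperature(step: int, max_steps: int, beta: float = 1.8) -> float:
        """Linear temperature decay (Eq. 8): flat distributions early, peaked ones late."""
        return max(beta * (1.0 - step / max_steps), 1e-3)

    # Inside the refinement loop (cf. the mask-predict sketch in Section 3.1):
    # for t in range(max_steps):
    #     T = linear_temperature(t, max_steps, beta=1.8)   # 1.8 is the WP setting
    #     tokens, conf = top_p_sample(logits, top_p=0.9, temperature=T)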
4.3 Training and Inference

Given the parallel data, during training we use vanilla self-attention to obtain the source sentence representation and sliding-window mixed-attention with a fixed window size to generate the target. During inference, we apply DSWA in the mixed-attention layer and LTD when sampling the results from the probability distributions.

Besides, the model uses the ground-truth tokens as context to predict the masked tokens during training, but uses randomly sampled tokens as context during inference. This discrepancy makes the model refine only a fraction of the low-confidence tokens, which causes degeneration in practice. Thus, we update all target tokens according to the model predictions at each iteration step by utilizing the SMART mechanism (Ghazvininejad et al., 2020).

5 Experiments

5.1 Settings

Datasets  We conduct experiments on two Open-LTG tasks with four datasets: storytelling (ROC (Mostafazadeh et al., 2016), WP (Fan et al., 2018), and WikiPlots) and multi-paragraph opinionated article writing (OPINION (Hua and Wang, 2020)). For the ROC dataset, we follow Guan et al. (2021b) and mask all names with specific placeholders to improve the generation ability. We fine-tune the model with our approach without any additional corpus. More details are given in Appendix A.

Implementation & Baselines  We utilize the pre-trained RoBERTa base‡ as our backbone model and implement all experiments with the open-source Fairseq toolkit§ (Ott et al., 2019). In addition, we compare our method with strong baselines, e.g., the widely-used AR models BART (Lewis et al., 2020) and HINT (Guan et al., 2021b) for the storytelling tasks, and PAIR (Hua and Wang, 2020) for the multi-paragraph-level text generation task. It is worth noting that the number of layers and parameters of RoBERTa (125M) is close to that of BART (140M), so the inference speed can be compared directly. For the inference stage, we set the maximum iteration step to 6 for ROC and 8 for the others. We set the hyper-parameters α_min=0.125 and α_max=0.75, and the window size to 64. We set top-p=0.9 for all baseline models, β=1.6 for ROC, β=1.8 for WP and WikiPlots, and β=1.5 for OPINION.

‡ https://dl.fbaipublicfiles.com/fairseq/models/roberta.base.tar.gz
§ https://github.com/facebookresearch/fairseq
Data       Model         B-1(↑)  B-2(↑)  R-1(↑)  R-2(↑)  R-L(↑)  LR-n(↓)  SR-n   SR-m   D-4(↑)  BS-P(↑)  BS-R(↑)  BS-F1(↑)  PPL    Speedup
ROC        BERT-CRF      18.90   7.04    14.98   1.73    12.26   36.60    -      -      33.11   74.07    71.32    72.65     -      -
           HINT          32.97   16.91   25.54   3.87    18.48   5.96     73.93  45.27  57.93   78.40    77.14    77.74     26.16  -
           BART          30.06   14.37   22.37   2.42    15.52   3.93     69.53  40.04  79.07   76.34    76.83    76.57     65.21  1.0×
           Ours          33.22   17.08   26.82   3.91    18.22   3.28     70.52  43.71  68.93   77.86    78.23    78.03     53.00  2.9×
           Ground-Truth  -       -       -       -       -       2.50     70.74  40.99  46.46   -        -        -         53.35  -
WP         BERT-CRF      18.50   7.42    17.70   2.30    12.91   83.80    -      -      8.58    71.50    66.38    68.82     -      -
           HINT          22.44   8.38    18.66   1.69    11.71   26.05    80.56  46.50  36.92   71.23    67.72    69.38     14.18  -
           BART          29.29   9.96    23.57   1.98    12.04   0.73     74.92  33.82  90.38   71.64    71.38    71.50     88.74  1.0×
           Ours          32.80   11.65   26.67   2.43    12.97   0.73     78.67  35.29  86.70   72.17    72.09    72.12     85.88  6.4×
           Ground-Truth  -       -       -       -       -       0.45     80.23  34.36  49.23   -        -        -         55.39  -
WikiPlots  BERT-CRF      16.33   6.42    18.41   1.64    12.24   78.28    -      -      29.80   63.27    65.53    64.37     -      -
           HINT          19.86   8.61    19.36   2.14    10.98   9.86     70.42  50.49  55.16   72.28    68.36    70.18     15.63  -
           BART          27.15   10.51   22.63   2.45    11.42   1.58     75.88  44.41  92.60   71.24    73.61    72.36     68.63  1.0×
           Ours          30.06   12.39   25.88   3.55    12.62   4.50     79.06  41.16  83.97   71.74    73.64    72.63     61.36  13.3×
           Ground-Truth  -       -       -       -       -       0.98     75.13  46.72  91.71   -        -        -         40.88  -

Table 4: Performance on ROC Stories, WritingPrompts, and WikiPlots. B-n: BLEU; R-n: ROUGE; LR-n, SR-n, SR-m: lexical and semantic repetition; D-4: Distinct; BS-P/R/F1: BERTScore precision/recall/F1.

Evaluation Metrics  We utilize BLEU (B-n) (Papineni et al., 2002), ROUGE (R-n) (Lin, 2004), Lexical Repetition (LR-n, 4-gram repetition occurring n times) (Shao et al., 2019), Semantic Repetition (SR-n, average top-n semantic similarity between any two sentences) (Guan et al., 2021b)¶, average semantic overlap (SR-m, average semantic similarity over all sentences), Distinct (D-n) (Li et al., 2016), and BERTScore (Zhang et al., 2019) for the storytelling task. For multi-paragraph opinionated article writing, we utilize B-n, R-n, and METEOR (Banerjee and Lavie, 2005) to evaluate the results. The settings of n mainly depend on the length of the generated text, and details are given in each subsection below. We report LR-2 and SR-1 for ROC Stories and LR-5 and SR-10 for WP to reflect the lexical and semantic repetition of the generated texts. We also report the Repetition and Distinct scores of the ground truth as a reference. We calculate the perplexity (PPL) with GPT-2 (Radford et al.) for each model, which is the most common fluency metric.

¶ https://huggingface.co/sentence-transformers/bert-base-nli-mean-tokens
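For reference, Distinct-n and Lexical Repetition can be computed roughly as below. This is a minimal re-implementation of the general definitions (unique n-gram ratio, and the share of outputs that repeat some n-gram at least a given number of times); the exact scripts and tokenization behind the reported numbers follow the cited works and may differ in detail.

    from collections import Counter
    from typing import List

    def ngrams(tokens: List[str], n: int) -> List[tuple]:
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def distinct_n(texts: List[List[str]], n: int = 4) -> float:
        """Distinct-n: ratio of unique n-grams to all n-grams over the corpus."""
        all_ngrams = [g for t in texts for g in ngrams(t, n)]
        return 100.0 * len(set(all_ngrams)) / max(len(all_ngrams), 1)

    def lexical_repetition(texts: List[List[str]], n: int = 4, times: int = 2) -> float:
        """LR-n: percentage of outputs repeating at least one n-gram `times` times."""
        repeated = 0
        for t in texts:
            counts = Counter(ngrams(t, n))
            if counts and max(counts.values()) >= times:
                repeated += 1
        return 100.0 * repeated / max(len(texts), 1)

    # Example on whitespace-tokenized outputs:
    outputs = [s.split() for s in ["the man made music . he played it again and again ."]]
    print(distinct_n(outputs, 4), lexical_repetition(outputs, 4, times=2))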
5.2 Main Results

Table 4 summarizes the evaluation results on each storytelling test set. We choose the appropriate checkpoint based on the repetition and distinct comparison with the ground truth of the validation set. We observe that our approach achieves better performance on all datasets than the strong baseline models. In particular, the text generated by the RoBERTa model is of high quality and fluent, with high BLEU, ROUGE, and BERT scores and lower perplexity, demonstrating the effectiveness of our model.

For the OPINION dataset, we use the specific plans to initialize the model input and then generate the missing text following the PAIR_full setting, where these special plans are extracted from the ground truth. The results are shown in Table 5. The PAIR results are based on BART, an AR model, so they are of high quality even without refinement. Our model achieves better results than PAIR when using iterative refinement, demonstrating that, as a masked language model, RoBERTa is more suitable for completing the planning sequence than an AR model. In addition, we found that the model works better without dynamic sliding window attention here, because the additional context information provides a good initialization to the model.

Model      Refine  BLEU-4        ROUGE-L       METEOR
PAIR_full  ✗       34.09/32.59*  55.42/49.39*  32.74/50.63*
PAIR_full  ✓       36.09/34.42*  56.86/50.82*  33.30/51.39*
Ours       ✗       31.42         53.55         55.58
Ours       ✓       37.76         59.24         59.70

Table 5: Results on the OPINION (ARGGEN) dataset. The numbers noted with * represent our implementation.
5.3 Ablation Results

We conduct an ablation study in Table 6 to evaluate the effectiveness of each inference strategy. We observe that performance drops when either strategy is removed, and the drop is significant for the longer WP dataset. In particular, the results stay more in tune with the current prompt thanks to DSWA (e.g., better BLEU and ROUGE), while the model generates more repetitive text without LTD. Thus, DSWA and LTD are crucial for Open-LTG: they reduce the context dependencies needed to predict the output distribution and improve the confidence scores at each iteration step to fit open-ended scenarios.

Data  Model     B-1    R-L    Rep    Dist   PPL
ROC   Ours      33.22  18.22  3.28   68.93  53.00
      w/o DSWA  32.12  17.67  3.71   68.53  48.87
      w/o LTD   33.04  17.73  11.29  69.66  78.07
      w/o ALL   31.86  16.96  14.49  67.30  67.75
WP    Ours      32.80  12.97  0.73   86.70  85.88
      w/o DSWA  29.37  12.31  0.90   86.07  86.95
      w/o LTD   29.80  13.88  17.80  64.53  63.08
      w/o ALL   12.95  6.60   90.58  32.15  17.69

Table 6: Ablation study of different inference strategies.

6 Analysis and Discussion

6.1 Speedup for Inference

Figure 5: Inference speed for different datasets.

Figure 5 illustrates the generation speed on an NVIDIA A100 GPU, with all models run with batch size 1 on each test dataset. Our model speeds up inference by 3× to 13× across different target lengths, i.e., from 133 tokens/s to 391 tokens/s on the ROC dataset, from 137 tokens/s to 882 tokens/s on the WP dataset, and from 132 tokens/s to 1,753 tokens/s on the WikiPlots dataset. Although a smaller iteration step can further accelerate the speed, the perplexity drops significantly.

6.2 Length Prediction

We validate the different length prediction strategies on the WP dataset, as shown in Table 7. As a reference, we also initialize the fully masked sequence with the ground-truth length. For the prediction method, we select a specific offset according to the validation set, e.g., −20 for WP and −100 for WikiPlots; the prediction module works better for the short-text dataset ROC with offset 0. We also find that the fixed strategy obtains comparable performance with only a slight drop, and the prediction module is likewise a viable choice for the inference stage.

Strategy      Length  B-1    R-L    LR-n  D-4    PPL
Ground-Truth  157.42  33.21  13.17  0.67  86.92  86.86
Fixed         153.51  32.80  12.97  0.90  86.70  85.88
Prediction    155.55  31.96  12.94  0.63  86.53  85.56

Table 7: Length prediction of different strategies.

6.3 Human Evaluation

For human evaluation, we compare our method with the strong baseline BART. We sample 100 cases in total from the model outputs on three different datasets. We hire three annotators to give their preferences (win, loss, and tie) for three evaluation criteria: fluency, coherence, and relevance, which reflect the intra-sentence linguistic quality (Xu et al., 2020), the inter-sentence relatedness and causal dependency, and the consistency of the generation results, respectively. More details are illustrated in Appendix C. We apply Fleiss' kappa (Fleiss, 1971) to measure the agreement among the three annotators, and the results are listed in Table 8, where we report the percentage (%) of each preference when comparing with the BART model. We observe that our method achieves better performance on all three criteria compared with BART, especially for the relevance criterion, which indicates that such a NAR generation paradigm can mitigate the inconsistency issues of long text generation tasks. It is worth noting that all inter-annotator agreements are either moderate (κ ∈ [0.4, 0.6]) or substantial (κ ∈ [0.6, 0.8]). Besides, we also plot the detailed percentages for ROC, WP, and WikiPlots in Figure 6, which clearly exhibits the discrete distributions across the three datasets. The fluency and coherence of the sentences generated by our model obviously decrease as the length increases, similar to the BART model. We will improve text quality and overall fluency and address these problems for Open-LTG scenarios in future work.

Metrics    Win   Loss  Tie   κ
Fluency    38.0  35.0  27.0  0.55
Coherence  39.5  30.5  30.0  0.44
Relevance  47.5  23.5  29.0  0.61

Table 8: Human evaluation results on the mixed dataset. κ denotes Fleiss' kappa value.
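As a side note, the reported agreement can be reproduced from raw annotator choices with an off-the-shelf Fleiss' kappa implementation; the toy ratings below are illustrative only, not the study's annotations.

    import numpy as np
    from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

    # ratings[i, j] = category chosen by annotator j for case i
    # (0 = win, 1 = loss, 2 = tie in this toy example).
    ratings = np.array([
        [0, 0, 2],
        [1, 1, 1],
        [0, 2, 0],
        [2, 2, 1],
    ])
    table, _ = aggregate_raters(ratings)     # (n_cases, n_categories) count table
    print(fleiss_kappa(table))               # agreement value in [-1, 1]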
Figure 6: Discrete distribution for different datasets.

7 Conclusion

This paper explores Open-LTG with NAR models based on pre-trained MLMs. We first examined the potential and limitations of MLMs with iterative NAR inference for open-ended text generation and observed that MLMs collapse for Open-LTG. Through extensive study and analysis, we found that the reasons are an inappropriate attention mechanism and inference strategy, and we introduced two simple strategies to alleviate the problem, i.e., dynamic sliding window attention and linear temperature decay. Experiments demonstrate that our model achieves competitive performance and a significant speedup. We hope our research can make pre-trained MLMs new candidates for the Open-LTG community.

8 Limitation

Although our NAR approach can generate fluent and meaningful text, it inevitably suffers from the typical generation problems also seen in the AR fashion: (1) off-prompt: the provided prompt is very short, so the model cannot focus on meaningful content and generate reasonable text. Besides, the model often simply copies prompt text into the results instead of planning reasonable content, as in case 3 shown in Table 13 in Appendix D. (2) incoherence between sentences: when the model is initialized, it does not consider the logical order between sentences, so it can only rely on the training data to learn it automatically. We will consider how to generate a suitable initialization that helps the model produce coherent results. Our paper's primary concern is accelerating the generation speed, and we leave these problems to future work.

Ethics Statement

Our method heavily relies on pre-trained language models, e.g., RoBERTa, which may inherit problematic biases (Radford et al.). We have attempted to mitigate these issues by conducting experiments on comparatively innocuous story generation and opinion generation tasks. Furthermore, we have replaced all the names in those corpora with special placeholders. Although some measures are taken to mitigate the problematic biases, such issues cannot be solved completely. Thus, we urge users to carefully examine the generation results and cautiously apply our method in real-world applications. Additionally, it is worth noting that all the corpora used in our experiments are only for scientific research.

As for the human evaluation process, we resort to the open-source web library Django|| to build our own human evaluation interface. Before releasing the human evaluation cases, we carefully checked that there is no private information or other problematic bias in the cases. Besides, we did not collect personal information or ask the annotators about their private information during the annotation process. We hired three annotators and paid each of them $0.29 for each case comparison. The payment is reasonable since there are only 100 cases for annotation, and it takes on average 4 hours for one annotator to finish all the comparisons.

|| https://www.djangoproject.com

Acknowledgements

This work is supported by the National Science Foundation of China (NSFC No. 62206194), the Natural Science Foundation of Jiangsu Province, China (Grant No. BK20220488), and JSSCBS20210661. This work is also supported by the Beijing Academy of Artificial Intelligence (BAAI).

References

Sweta Agrawal and Marine Carpuat. 2022. An imitation learning curriculum for text editing with non-autoregressive models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7550–7563.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.

Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

Ethan A Chi, Julian Salazar, and Katrin Kirchhoff. 2021. Align-refine: Non-autoregressive speech recognition via iterative realignment. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1920–1927.

Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. Advances in Neural Information Processing Systems, 32.

Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898.

Joseph L Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378.

Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. 2019. Mask-predict: Parallel decoding of conditional masked language models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6112–6121.

Marjan Ghazvininejad, Omer Levy, and Luke Zettlemoyer. 2020. Semi-autoregressive training improves mask-predict decoding. arXiv preprint arXiv:2001.08785.

Seraphina Goldfarb-Tarrant, Tuhin Chakrabarty, Ralph Weischedel, and Nanyun Peng. 2020. Content planning for neural story generation with Aristotelian rescoring. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4319–4338.

Jiatao Gu, James Bradbury, Caiming Xiong, Victor OK Li, and Richard Socher. 2018. Non-autoregressive neural machine translation. In International Conference on Learning Representations.

Jiatao Gu, Changhan Wang, and Junbo Zhao. 2019. Levenshtein transformer. Advances in Neural Information Processing Systems, 32.

Jian Guan, Fei Huang, Zhihao Zhao, Xiaoyan Zhu, and Minlie Huang. 2020. A knowledge-enhanced pre-training model for commonsense story generation. Transactions of the Association for Computational Linguistics, 8:93–108.

Jian Guan, Xiaoxi Mao, Changjie Fan, Zitao Liu, Wenbiao Ding, and Minlie Huang. 2021a. Long text generation by modeling sentence-level and discourse-level coherence. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6379–6393, Online. Association for Computational Linguistics.

Jian Guan, Xiaoxi Mao, Changjie Fan, Zitao Liu, Wenbiao Ding, and Minlie Huang. 2021b. Long text generation by modeling sentence-level and discourse-level coherence. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6379–6393.

Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, and Jun Wang. 2018. Long text generation via adversarial training with leaked information. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.

Junliang Guo, Linli Xu, and Enhong Chen. 2020. Jointly masked sequence-to-sequence model for non-autoregressive neural machine translation. In Annual Meeting of the Association for Computational Linguistics.

Tianyu He, Xu Tan, Yingce Xia, Di He, Tao Qin, Zhibo Chen, and Tie-Yan Liu. 2018. Layer-wise coordination between encoder and decoder for neural machine translation. Advances in Neural Information Processing Systems, 31.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. In International Conference on Learning Representations.

Zhe Hu, Hou Pong Chan, Jiachen Liu, Xinyan Xiao, Hua Wu, and Lifu Huang. 2022. PLANET: Dynamic content planning in autoregressive transformers for long-form text generation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2288–2305.

Xinyu Hua and Lu Wang. 2020. PAIR: Planning and iterative refinement in pre-trained transformers for long text generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 781–793.

Xiao Shi Huang, Felipe Perez, and Maksims Volkovs. 2022. Improving non-autoregressive translation models without distillation. In International Conference on Learning Representations.

Jason Lee, Elman Mansimov, and Kyunghyun Cho. 2018. Deterministic non-autoregressive neural sequence modeling by iterative refinement. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1173–1182.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and William B Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119.

Jiwei Li, Minh-Thang Luong, and Dan Jurafsky. 2015. A hierarchical neural autoencoder for paragraphs and documents. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1106–1115.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 839–849.

Eyal Orbach and Yoav Goldberg. 2020. Facts2Story: Controlling text generation by key facts. In Proceedings of the 28th International Conference on Computational Linguistics, pages 2329–2345.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Weizhen Qi, Yeyun Gong, Jian Jiao, Yu Yan, Weizhu Chen, Dayiheng Liu, Kewen Tang, Houqiang Li, Jiusheng Chen, Ruofei Zhang, et al. 2021. BANG: Bridging autoregressive and non-autoregressive generation with large scale pretraining. In International Conference on Machine Learning, pages 8630–8639. PMLR.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners.

Chitwan Saharia, William Chan, Saurabh Saxena, and Mohammad Norouzi. 2020. Non-autoregressive machine translation with latent alignments. In Empirical Methods in Natural Language Processing.

Zhihong Shao, Minlie Huang, Jiangtao Wen, Wenfei Xu, and Xiaoyan Zhu. 2019. Long and diverse text generation with planning-based hierarchical variational model. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3257–3268.

Yixuan Su, Deng Cai, Yan Wang, David Vandyke, Simon Baker, Piji Li, and Nigel Collier. 2021. Non-autoregressive text generation with pre-trained language models. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 234–243.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 27.

Bowen Tan, Zichao Yang, Maruan Al-Shedivat, Eric P Xing, and Zhiting Hu. 2020. Progressive generation of long text.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.

Alex Wang, Kyunghyun Cho, and CIFAR Azrieli Global Scholar. 2019. BERT has a mouth, and it must speak: BERT as a Markov random field language model. NAACL HLT 2019, page 30.

Felix Wu, Angela Fan, Alexei Baevski, Yann Dauphin, and Michael Auli. 2018. Pay less attention with lightweight and dynamic convolutions. In International Conference on Learning Representations.

Yisheng Xiao, Lijun Wu, Junliang Guo, Juntao Li, Min Zhang, Tao Qin, and Tie-yan Liu. 2022. A survey on non-autoregressive generation for neural machine translation and beyond. arXiv preprint arXiv:2204.09269.

Peng Xu, Mostofa Patwary, Mohammad Shoeybi, Raul Puri, Pascale Fung, Animashree Anandkumar, and Bryan Catanzaro. 2020. MEGATRON-CNTRL: Controllable story generation with external knowledge using large-scale language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2831–2845.
Kexin Yang, Wenqiang Lei, Dayiheng Liu, Weizhen Qi, and Jiancheng Lv. 2021. POS-constrained parallel decoding for non-autoregressive generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5990–6000.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.

Yicheng Zou, Zhihua Liu, Xingwu Hu, and Qi Zhang. 2021. Thinking clearly, talking fast: Concept-guided non-autoregressive generation for open-domain dialogue systems. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 2215–2226.
A Dataset and Pre-processing

Dataset    Input  Reference  #Train   #Valid  #Test
ROC        9.01   37.66      176,688  9,816   9,818
WP         25.51  141.60     53,516   4,000   4,000
WikiPlots  3.41   354.8      102,936  5,000   5,000
OPINION    17.88  104.36     42,462   6,480   7,562

Table 9: Statistics of the datasets.

The statistics of each dataset are shown in Table 9, and we provide the download addresses of OPINION**, ROCStories and WritingPrompts††, and WikiPlots‡‡. In particular, we have to pre-process the datasets to ensure RoBERTa can handle each sample. We first use the NLTK tokenizer to split each sample into individual sentences, generally using punctuation as the separator. Then, we collect segments with a pre-defined segment number K so that the different pieces hold comparable lengths. Finally, we truncate samples whose sequence length exceeds 512 to satisfy the BERT maximum length limitation. Furthermore, we also list the library versions and links used in our paper: Transformers == v4.0.0, NLTK == v3.5, and the evaluation scripts§§.

** https://drive.google.com/file/d/1gs_4fJj3U6Mrt8ekNIoDHRwSUc9WQbzp/view
†† https://drive.google.com/drive/folders/1i_2YfzpDnfuLyyctOyDabn3Br0OcK1Tj?usp=sharing
‡‡ https://github.com/markriedl/WikiPlots
§§ https://huggingface.co/evaluate-metric
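The pre-processing can be approximated with a few lines of NLTK, sketched below under our own assumptions (whitespace-level truncation instead of subword counting, and a simple even split into K segments); newer NLTK versions may additionally require the "punkt_tab" resource.

    import nltk
    from nltk.tokenize import sent_tokenize

    nltk.download("punkt", quiet=True)
    MAX_TOKENS = 512   # BERT/RoBERTa positional limit

    def preprocess(sample: str, k_segments: int):
        """Split a target into sentences, group them into K comparable segments,
        and truncate over-long samples (a simplified version of Appendix A)."""
        sentences = sent_tokenize(sample)
        per_segment = max(1, (len(sentences) + k_segments - 1) // k_segments)
        segments = [" ".join(sentences[i:i + per_segment])
                    for i in range(0, len(sentences), per_segment)]
        # Crude whitespace-level truncation; the real pipeline would count subwords.
        tokens = " ".join(segments).split()
        if len(tokens) > MAX_TOKENS:
            segments = [" ".join(tokens[:MAX_TOKENS])]
        return segments

    print(preprocess("First sentence. Second one! A third, longer sentence?", k_segments=2))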
B Recurrent Segment Generation

Figure 7: The overview of recurrent segment generation. The green blocks and arrows represent the generation results and the corresponding flow directions.

As shown in Figure 7, to gradually increase the context during the decoding stage, we divide the one-pass parallel decoding into multiple decoding steps. Specifically, we split the target Y into multiple segments {S^1, S^2, · · · , S^K}, where each segment consists of multiple tokens/sentences, by specifying the length of each segment. Then, the model generates those segments incrementally, ensuring that each decoding step depends on the previously generated context to provide adequate information. In other words, we introduce NAR generation for each segment and use recurrent segment generation to keep segment-level coherence. Meanwhile, the model obtains a flexible decoding paradigm by manipulating the length of the segments; e.g., the model achieves one-pass decoding when the segment is set to the whole target sequence and AR decoding (same as BART) when the segment is set to a single token.

Concretely, we feed the input text X into the BERT model to obtain the representation H_src. We then concatenate the hidden states of the input and the previously generated context segments and feed them into the decoder mixed-attention layer to generate the k-th segment:

    H̄^l_{S^k} = Mixed-ATTN(H^{l-1}_{S^k}, H^L_src, H̃^L_{S^{<k}}) + H^{l-1}_{S^k},
    H^l_{S^k} = FFN(H̄^l_{S^k}) + H̄^l_{S^k},    (9)

where H̃^L_{S^{<k}} is the representation of the previous segments computed from the ground truth instead of from the generation results H^L_{S^{<k}}, to guarantee the reliability of the context information. The model recovers the k-th masked segment and calculates the cross-entropy of those masked tokens S_M as the MLM loss:

    L_MLM = − Σ_{k=1}^{K} Σ_{j=1}^{|S_M|} log P(S^k_j | X, S^{<k}, S^k_{\M}),    (10)

where S^k_{\M} denotes the observed (unmasked) tokens of the k-th segment. Besides, we select a segment number before model training and then use it to split the training data, ensuring the same number of segments for training and inference in the experiment.
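A compact sketch of this segment-wise decoding loop is given below; mask_predict refers to the inference sketch in Section 3.1, and folding finished segments into the conditioning sequence is a simplification of how Eq. (9) injects them through mixed attention.

    import torch

    @torch.no_grad()
    def recurrent_segment_generation(model, src_tokens, segment_lens, mask_id, iterations=8):
        """Generate the target segment by segment (Appendix B), each segment with
        iterative NAR refinement conditioned on the source plus earlier segments."""
        context = torch.empty((src_tokens.size(0), 0), dtype=src_tokens.dtype,
                              device=src_tokens.device)
        for seg_len in segment_lens:
            conditioned_src = torch.cat([src_tokens, context], dim=1)
            segment = mask_predict(model, conditioned_src, seg_len, mask_id, iterations)
            context = torch.cat([context, segment], dim=1)
        return context

    # Setting segment_lens = [full_target_len] recovers one-pass parallel decoding,
    # while segment_lens = [1] * target_len degenerates to left-to-right decoding.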
C Human Evaluation

Dataset    #Num  Length
ROC        40    40
WP         35    140
WikiPlots  25    350

Table 10: Statistics of the human evaluation data, where #Num denotes the number of cases in the human evaluation dataset.

Figure 8: Human Evaluation Interface.

We show the human evaluation interface in Figure 8, which was built using the Python web library Django¶¶. To test the generation ability of our method against the strong AR model (BART) on different generation tasks, we sample cases for each task. The statistics of the sampled evaluation datasets are shown in Table 10. In each comparison, each annotator is shown one model input (prompt) and two outputs generated by the two models, namely "Text A" and "Text B". The annotators are then asked to select the better text in each comparison in terms of fluency, coherence, and relevance. In cases where annotators think the two texts are hard to distinguish, the "Tie" choice is allowed. We ensure that each annotator is independent during the annotation process and that the overall annotation process is fair.

¶¶ https://www.djangoproject.com

D Case Study

We randomly selected some cases from different datasets to facilitate the evaluation, generated by BART and our model. Table 11 illustrates the results on the ROC dataset; we can see that our model's results stay close to the prompt text, benefiting from the NAR fashion. For example, the topic of case 2 is "candy": BART generates a sentence with the fruit "grapes", whereas our model generates "chocolate", and the whole sentence stays close to the topic candy. Furthermore, our model can generate highly correlated content across different sentences, such as "plants, seeds and watered. finally, i had a beautiful garden." in case 11. We also provide the results of WP and WikiPlots in Table 12 and Table 13. Although these results are relatively ungrammatical and incoherent, the pre-trained MLM (RoBERTa) achieves competitive performance with BART.

Besides, the results from our models contain some grammar errors, e.g., "when i got home i went to the kitchen." in case 10. A possible explanation is that the non-autoregressive model may generate grammatically incorrect sentences during the iterative refinement procedure due to multi-modality problems. We will add grammar correction for each iteration in future work to help the model produce better results.

Case Type Text
Prompt the man made music .
1 BART he put his name on a paperpha pamphlet . someone subscribed . he sold it . his name got popular .
it was for a dance to a his friends . they invited him to the studio to play his music .
Ours
it played over and over reminding the man of his past . the new song did n’t become a success , but very popular .
Prompt i had an intense craving for candy .
she wanted me to buy grapes . some of the ingredients did not fall in pretty . i could not make sauces however .
2 BART
i have decided to take up cooking .
i went to the grocery store to buy some chocolate . i went to the store and found one very empty bag of candy .
Ours
he took care of it for a few minutes . when i got home , i was out of candy .
Prompt i sat down in the quiet room .
i took a turn for the head held high . i felt ill around my shoulders . i closed my eyes and got out of bed .
3 BART
i had flung smoke at reality .
there was a very clean room . i could n’t find my phone . i was scared and felt like something was going to happen !
Ours
i stood up and took a breath and closed the door . i was home and the i was happy .
Prompt i went to the store to buy breakfast food .
i returned to the store for breakfast and the cost for food was much too much . he asked me to try in the big eggs .
4 BART
to my dismay , it was expired . in the Mangrove , the eggs in the pan were all burnt .
i bought cereal and oatmeal eggs , but i was sad because they were expensive . then i thought about how much i cared about breakfast .
Ours
i brought it home to my wife . when i got back home , i gave her breakfast .
Prompt the court date was set for earlier this month .
i managed to get my client into court on time , so my dad ordered [FEMALE] to get up early . the day of the release ,
5 BART i went to sleep and woke up early . i missed getting to sit with him over breakfast .
however , it turns out she was finally able to get back to court this week .
the clerk announced the date had been postponed for the week . he went to the courthouse the day before the court .
Ours
he so told her to stop smoking that she called the police for help . it now appeared that the trial was postponed .
Prompt i decided to ride my bike to the park .
i immediately rode my bike across the road . i picked up the first cyclist to get on my bike .
6 BART
i waited two minutes and then waited five minutes . i decided that it was time to turn my bike .
as i got on my bike i made a plans for the day . after going to the park , i decided to ride my bike home .
Ours
i rode around the park until i a my home . i pulled over and put my bike back .
Prompt i opened up a new bank account last month .
one of the accounts i opened would not have any cash . when i opened the account , i had fun with some old stock .
7 BART
i couldnt wait to check out the shares in subsequent months . i also realized how busy i was without new passwords .
i was able to put almost everything i need in my account . i looked a website of how to make money now .
Ours
i intend to pay out to that account once a while . i have made a lot of money with this account .
Prompt when we were young my brother and i made spears out of newspapers .
we must have been 3 or 4 in the Arist . unable to get my spears down i put them into a ball .
8 BART
we used fuses to pick screws . [MALE] liked his five foot tall six foot Luther beast .
one year we went to a field in an open field . i told us to would sell them when we got home early .
Ours
we practiced the spears until it was dark . afterwards , my mother took me to the hospital to treat my wounds .
Prompt i had a nearly new keyboard .
i was torturing it with it . i was torturing it with it .
9 BART
then i was so excited i ran to the bathroom . i rushed my new keyboard away and it barely fit my needs .
it had a great keyboard . when i got my new keyboard , there were black stains all over it .
Ours
i was so upset that i decided to throw my keyboard away . when i put it back on , it was working great at work .
Prompt one day i was really hungry .
it was almost time to eat so i checked the fridge to see what was there . i saw a bag of raw steak and wondered how went there .
10 BART
the cheapest way to eat was to come up with a replacement . i picked a different restaurant later that day .
i got dressed and went to something to eat . i was very hungry and went to the store and bought a lot of fruit .
Ours
when i got home i went to the kitchen . i got some cheese and bread and ate the turkey .
Prompt i decided to start a garden .
i researched the crop and the dates . i grew tomatoes . he had just for most of his meals ,
11 BART
so he was out of milk . [FEMALE] garden is excellent at thinking about the future .
i planted some seeds online from the local garden store . i planted the seedlings in the soil and started planting .
Ours
i gave in with what to do of the plants , and seeds and watered . finally , i had a beautiful garden .

Table 11: Representative ROC examples for BART and Pre-trained MLMs (RoBERTa)

Case Type Text
1 Prompt caffeine is made illegal and is now rated schedule i by the dea .
with ’conservatives ’ is for ’conservatives ’ . that is basis of the term . the name is the republic . remnants of the religion were
wiped from the lips of most . hegemony . democracy was a jungle . every entity was put to a test . to put it gently ,
a ban would be imposed on the world , forcing every person who had taken part to create some form of protein to join the majority
BART
of their species . food was a main reason why the republic was a flourishing , independent nation .
governments rushed to clamp down on caffeine , showing how virtually any government could kill them .
we feed billions of people who needed one of their pills through cigarettes . protests had come and gone but nothing remained for bobby .
" i went to the coffee shop last month ago and saw that . i was a kid at the time . sometimes i would daydream about my older brother ,
at least male , alive as rain fell on the roof . and a can of coke . thanks . he gave this to me to have a copy of every book i ever picked up .
Ours i were in the parking section , and as i looked down , i smelled it ! no wonder what my headphones do ...
i grabbed some plastic cups and placed them on the coffee table , along with a white sheet on the way to the office . i ordered a joe .
now it had been time to go . jesus christ when one of my friends said to me , " come on . just save me the milk ! "
2 Prompt a magical mirror shows your reflection and your future soulmate . you only see your reflection .
night would always come first . that ’s how every man usually looks . one point is this ." wait a minute !
where do amateur artists live ? " " in outer space , " the mirror roamed through the neighborhood , staring intently at the road-lights .
he moved his feet a few feet so that even the hardest parts would stay put . " i ’m sitting in your shit hole ,
BART
found a pot in your room ." said you ." i sure am ." ACL looked at me with a cheery smile ,
" i could n’t tell . i think my dad was an atheist . they got acquainted the same way because he always wore glasses .
people ca n’t wrap their wrists around idiots or say they ’re complaining .
she ’s staring me by the lenses of a painting a gagged man tells me the only noise that comes to my mind is question .
question . my head burns with fear . i wonder when i will walk into the mirror , how long my a full of hope and love
Ours has faded back and forth ? do you tell you a lie , because i ’ll find you out . and if you ask you three i ’ll have to look at light brown .
i look ca n’t really help it . being able to empathize with what i ’m going through i can a myself . she i must have be having children .
but what i want is that ? the first girl since last time came around . do n’t care because i love her , the way i love her .
3 Prompt it ’s 3016 , cryogenic people are being revived for the first time .
" knock ! knock ! " i gasped as the glass door cracked open , revealing a thin version of me . a lot of blood , especially from the infection .
i quickly grabbed my utensil and held it close against my mouth , thumb plungingikhail ’s face back down toward his mouth .
the kitchen was booked and locked , all but impossible to see . smitherin turned to me , his hand outstretched ." fix anything ,
BART
" i sighed , slowly closing the deal . my fingers were too large , uncomfortable , and my left arm sank behind his neck .
i pressed a silent raise , still gripped tight , until my ankles buckled beneath me ." fine , we have her back ,
" i managed , leaning closer for another variation .
the massive tremors of the ringing in my s ears , beating furiously against my heart for a moment . i leaned back and against the floor ,
i was here , finally . the last one left these days behind still wonder at night , as to what this all meant . i felt cruel at best ,
but least in a place like this we could talk . the , the people were sitting on an space vessel in space and had at least days left .
Ours
not good enough to write the process in a journal . i felt as comfortable as i could as they ’ new body was brought back for a while .
everything was normal , no important , unknown secrets . but the thing about cryogenic people , they took you in and out .
we knew the ways , but still some were left .
4 Prompt you are visited by your guardian angel but he/she is an asshole .
con the morning , there was a clear wind coming through thecloud ( um.. sunny ? is thatreally what it is ! ) down into the woods .
the last thing i remember was climbing up the steps of my home . my eyelids began to flutter nervously , i rubbed my eyes ,
moaning beneath my own robotic fingers . the line rose from a distant object , inviting me to focus .
BART
my reflexive positioning prevented me from climbing back down . again , i thought i had recieved enough therapy to ensure such
things had not happened . would it be the last ? or was it the previous trick we played ?
i sighed and climbed up into my favorite bed of the house ; the big one that i recently occupied .
her eyes met mine cold metal waiting eyes . with a blink a moment later i was replaced with a deep , dark , earthy smell filled the room .
in it was that glorious smile on my face as i stared at my beautiful toes and joined in a whirl when there was eternity .
her blue , emerald green eyes and foot gave away those evil , sly grin on the face .
Ours
what little thing i remember was all the smells that tasted like steaming water . little touch did make me turn as white as a bird of heart .
make or worse . a tear ran its way down my body , taking note of my surroundings . i looked to i .
i was facing a woman with a pink dress . my body seemed to lower itself into a numb state of the moment .
5 Prompt wayne enterprises goes bankrupt , forcing batman to pursue justice through more economical means .
fire batman watched that shutter of his fist hit the ground , shattering him . his furious scream sounded as he shifted a small pile of broken
paint on the side of the building . wallet , phone , controller , and was buttoned down , lying upon the floor in the middle of his .
dabble , dabble . he closed his eyes and attempted to try and make sense of the implications . to splinter himself forever ,
BART
he needed to pull together a means of escape , happiness , and serenity that would bring him back to reality . his floor was exposed to
the current rut in which the cash machine and carton of barks had gone , making him fix his entire apartment . questions about
the earlier charges ? complying ? no . get assistance ! james felt himself shaking and looking around , like the sky was blue
after 30 hours of work was out again . batman sighed , walking down the street . he was in the hesian era and ,
in fact , the father of batman, after all . a friend a friend and big money ? eh ? " waggin ’ death , a bitch . the last batman of the world
Ours sat at his table with the latest edition of " the slash of day . " was the last show he ever watched .
when he looked into that camera , batman called ready excited . his life had been locked behind race car running lights .
twirling " do n’t go out " out to ground batman was him out of the room . " roger .

Table 12: Representative WP examples for BART and Pre-trained MLMs (RoBERTa)

Case Type Text
1 Prompt Magic Wonderland
During in 2021, Gavin Fleming (Marshall Hall), his son Duff Jackson (Hayden Farther), and his and Maggie Peterson (McCullough)
explore mythical lands, and come across an ancient legend concerning the Wall, an artifact that may be more mythical than what
the descendants of the Elemental Box, purchased and locked in deep holes in the human body. As the Dunder Mifflin alternate
universe endlessly and alone, Gavin meets Belle Duncan (Brassman), a beautiful young woman who haunts Slater’s dreams
and constantly challenges him to battle her, so aided by a new creature called the X-Rex (Amaradurorylum), Gavin is advised to fight
beast-like creatures around the wall and defend the Wall from destruction when the beast enters. The beast grows into a giant male humanoid,
BART
known in his form as the Castle Unbound, a winged humanoid among the fantastical creatures. Gavin adores Belle,
and tries to spend time with her after she has recovered. Belle suggests to Gavin that he turn his back on the villain and encounter the
Wolfrunner. He reluctantly agrees. Gavin does not agree, and Emma begins to welcome him only after a battle of wits. Throughout an attack
of beasts, Liam Kane (Lyle Marvin) bursts through the defences of the Wall, following the guard staff, and scares everyone away.
He forces Gavin to alert the guards to the camp’s presence. However, those he has encountered are unaware that he is a friend of Emma
Peterson, a Japanese-Australian employee who suffers from nightmares and dreams of superpowers and past matches
The story follows the aia and his best friend playing the game Magic Wonderland, a game they play in " Under the Tree " .
According to the legend, magic creatures also exist in a parallel world called the Wonderland. Coming out of which is a large house considered
to be haunted, but he is allowed to stay. After his magic over he goes for a while and does strange things about it. The world he went to is
about to happen something that happens to everyone. While waiting at the French manor he (Derek) learns of the kingdom of Aiaia and all of
the other characters were left with a little guide on how to use the magic of all the the creatures in the world. When the new year came,
all the people on the island decided that that would be A lot of fun for the two boys to explore the Ahaia’s fairy world. They decide to
Ours
blow up the castle. On their way to school, they foolishly (missed) the black magic, and was had to move in. But Aia and his help make Aia
a magical man and a beast, making it difficult to figure out what he is about to do, and why. But now this evil plan has been set up,
the story remains the same. Aia is more or less lost in a quest to find a world similar like his sister’s, too. They had their pet bird, a green
dragon. Then, so that everyone did perfectly normal. When theia ran out of money, Aia decided to the home. The events are the as to if it was
in Aia’s life his father did not love him, after all, leaving him money to look after him. Also, he was to find out who is behind the magical fairy
world, and they will be in love until the end of their lives, and then the world from there. During the game, Aia giving one another a kiss.
2 Prompt Beyond Apollo
Savika (Saurashtra Prakash) is a demon hunter of Lore Love (Madhuravalli who has set off for Chennai) and doesn’t want to interact with women.
Tensely wanting to save her own girl, he approaches her in a customized john vehicle. Upon hearing about the coming of the eye,
he enlists Glyndar (Urba Rao), the last man he knows and a high society man called Ramesh (Isha Kher). They meet in the limelight After feeling
sick when he asks her to go out, he decides to travel to Seta village by car as his long distance companions. There, he enlists the help of an attractive
BART woman named Kadeb (Jaswini Gopal), and is immediately attracted to her for her beauty. At Seta, Gadeb unwittingly breaks into Kadeb’s cell
and steals money from him. A quarrel ensues between Gadeb and Skylady, an official in Seta who is in charge of the operation, over the case.
During the meeting, Skylady and Gadeb beat Gadeb up and gave him her pocket money and dancing lessons. Gadec sees this and flees while Skylady
takes a cab in a hurry. She then steals weapons and goes off with Kadeb. Nightfall starts and Gadec runs into Kadeb, who secretly intends to steal
the money that Gadeb gave her to sell to her heir. He is shocked when Gadeb offers her a way out. Skylady
The crew of Apollo is one person after another living in the O’ Beel family. That is, from the time they, on the planet Bumblebee, 12-year-old Roxi
is about to be the pilot of an orphon-based spaceship. So, the crew of Apollo decides to be a rescue mission. Back at Earth, the crew is ready to leave
for the moon. On the station are OX-O-s- that, like themselves, can travel, using the help of space suits from the isle, stored in special’s as year 3031,
a hundred years away, when the moon is built, so he and Fifi decide to see if they can find out about the ship. At the same time, a new member
is inducted into the crew, and completes the planning for further exploration. Then the nanobots appear and begin saying " Enter into space.
Ours This I’ll do she replies in Just but not only Number One even after the end of Apollo, that is, not yet. . the planeto has been (andarently) transferred
to a planet we came from called Dusty. They must go back to Apollo. Who cannot and why they did not abandon her. One’s afraid that one
is coming. Soon they decide to join at first for friends, but erupts start to be the ship. The Zesti wants to take Shoxi home, tries to stop him.
But they refuse to see him again until one of them becomes a crew member and, he says, it was the only time he went out to take to work.
However, it turns out that it was nothing but a very old man called R. who is tired of himself having a affair with their and their beau.
However, without them all„ Six must deal with his very father, Olaf, and being a space pilot, who as a result has plans for the future.
3 Prompt Macbett
YoungRecently released gangster, Ronnie Abbett, pairs up with Jake, an older lieutenant in the Marines. Instead of killing him by torpedo,
he eventually exposes him in the hands of an army of locals who want to hold him prisoner even for one night. The drug lord is especially
antagonistic as, near the end of the film, the "likely" blood of the terrorist murders in a bar kills him. The gang tries to punish the gangsters,
making them excited over the pretense of love. Frankantly, the gangsters’ leader tries to coerced Ronnie to help them, while alcohol,
drugs andreedness win him over by tricking him into accepting his debt. Adoption of drugs greatly affects Ronnie, and he complains
BART to his alcoholic brother about Daniel, who promptly kills him and tells Ronnie’s mother to stay away from him. Ronnie tries to be supportive
of Mike, who is working at Seagraves. The rest of the gang, including Mike, are led by a man named Dan, who is actually Ronnie’s adoptive father
and enjoys side-play with Dan when they go out. However, Danny and Mike are against the most recent gang activities. Hell saves Mike’s life
and Jim, a family friend, helps him out. The meanwhile, the new shift surgeons start robbing the bars and poor performers practice hollows,
sending mugs on the streets, hitting people who cried out Loud at the climax so much as collapsing. They later see Mond Roger Lewis
(Bruce Mancini), the bartender’s brother who supposedly does coconut liquor in a bar fight, surrounded by relatives
Macbett runs a small coffee shop on the grounds of his father’s farm. Mary and her are go to Scotland where John Macbett had his first meeting
with Sir Andrew Macbett’s family and other things. Macbett, however, has a lot of respect for the character of " " Macbett " . In the plot
an man, a woman, and the manor, " Teneggi. Macbett. Macbett at the funeral, and we learns that Mac’s father, Nail, Sr. died in an accident.
John Pendleton was rich but he had nothing worth good for, but not even Macbett’s distant relatives, one of them Mary.
Mary both do want to go see Jack Nelly and Celia’s father a little man (John Macbett). Later on, Mary and everyone, including
Macbett and Mary, in. They sell the house and sell it to Servant’s the next. who, after having watching the news; had been a party called for
Ours
Macbett Macbett, who came here, while a other people get killed inside. Macbett decides that meeting with Clint and Denegan has started
a new life Mac. S. Eton, who was Macbett’s old friend, and fell in love with her. Macbett used to not fallen in love with Mary and that because
he was in so much that was Keley’s land. In the end, Mary died when he was a child. We also find that his wife, Carol, doesn’t want to get married
any more, after having a child. Macbett had a son, Macbett. Macbett. Macbett and Scenein time with Mary and the rest of their family,
except for Mary who is up with Macbett. Mac Macbett saysI don’t know what to do " . Jack replies " int " .
Overly without any memory of who he was is really dead not only in but but but but his two brothers.

Table 13: Representative WikiPlots examples for BART and Pre-trained MLMs (RoBERTa)

ACL 2023 Responsible NLP Checklist

A For every submission:

A1. Did you describe the limitations of your work?
Left blank.

A2. Did you discuss any potential risks of your work?
Left blank.

A3. Do the abstract and introduction summarize the paper's main claims?
Left blank.

A4. Have you used AI writing assistants when working on this paper?
Left blank.

B Did you use or create scientific artifacts?
Left blank.

B1. Did you cite the creators of artifacts you used?
Left blank.

B2. Did you discuss the license or terms for use and / or distribution of any artifacts?
Left blank.

B3. Did you discuss if your use of existing artifact(s) was consistent with their intended use, provided that it was specified? For the artifacts you create, do you specify intended use and whether that is compatible with the original access conditions (in particular, derivatives of data accessed for research purposes should not be used outside of research contexts)?
Left blank.

B4. Did you discuss the steps taken to check whether the data that was collected / used contains any information that names or uniquely identifies individual people or offensive content, and the steps taken to protect / anonymize it?
Left blank.

B5. Did you provide documentation of the artifacts, e.g., coverage of domains, languages, and linguistic phenomena, demographic groups represented, etc.?
Left blank.

B6. Did you report relevant statistics like the number of examples, details of train / test / dev splits, etc. for the data that you used / created? Even for commonly-used benchmark datasets, include the number of examples in train / validation / test splits, as these provide necessary context for a reader to understand experimental results. For example, small differences in accuracy on large test sets may be significant, while on small test sets they may not be.
Left blank.

C Did you run computational experiments?
Left blank.

C1. Did you report the number of parameters in the models used, the total computational budget (e.g., GPU hours), and computing infrastructure used?
Left blank.

C2. Did you discuss the experimental setup, including hyperparameter search and best-found hyperparameter values?
Left blank.

C3. Did you report descriptive statistics about your results (e.g., error bars around results, summary statistics from sets of experiments), and is it transparent whether you are reporting the max, mean, etc. or just a single run?
Left blank.

C4. If you used existing packages (e.g., for preprocessing, for normalization, or for evaluation), did you report the implementation, model, and parameter settings used (e.g., NLTK, Spacy, ROUGE, etc.)?
Left blank.

D Did you use human annotators (e.g., crowdworkers) or research with human participants?
Left blank.

D1. Did you report the full text of instructions given to participants, including e.g., screenshots, disclaimers of any risks to participants or annotators, etc.?
Left blank.

D2. Did you report information about how you recruited (e.g., crowdsourcing platform, students) and paid participants, and discuss if such payment is adequate given the participants' demographic (e.g., country of residence)?
Left blank.

D3. Did you discuss whether and how consent was obtained from people whose data you're using/curating? For example, if you collected data via crowdsourcing, did your instructions to crowdworkers explain how the data would be used?
Left blank.

D4. Was the data collection protocol approved (or determined exempt) by an ethics review board?
Left blank.

D5. Did you report the basic demographic and geographic characteristics of the annotator population that is the source of the data?
Left blank.

The Responsible NLP Checklist used at ACL 2023 is adopted from NAACL 2022, with the addition of a question on AI writing assistance.
