Japanese Abstractive Summarization
Abstract - In this study, we developed an automatic abstractive text summarization algorithm for Japanese using a neural network. We used a sequence-to-sequence encoder-decoder model for experimentation purposes. The encoder obtained a feature-based input vector of sentences using BERT, and a Transformer-based decoder generated the summary sentence from the encoder output. The experiment was performed on the livedoor news corpus with the above model. However, an issue arose in that the same phrases were repeated in the generated summary sentences.

Keywords - abstractive text summarization; BERT; livedoor news corpus

I. INTRODUCTION

Text summarization is the task of condensing a long text into a short one that conveys its key points. Text summarization algorithms in machine learning are mainly divided into two types: extractive and abstractive summarization. In the former type, the input text is split into smaller sentences and a summary is generated by combining the important ones. The latter, abstractive type interprets the input text and generates a corresponding summary by itself. Both summarize the key points of the input text; however, an extractive summary can only reuse sentences gathered from the input, while an abstractive summary generates summary sentences by itself and is therefore more flexible. In this study, we focus on abstractive summarization.

In recent years, various models have been proposed for abstractive text summarization [1] [3]. Zhang et al. [1] proposed an abstractive text summarization model using Bidirectional Encoder Representations from Transformers (BERT). Experimental results reported in the paper revealed that their model achieved new state-of-the-art performance on both the CNN/Daily Mail and New York Times datasets. Viswani et al. [2] proposed an abstractive text summarization model named Pointer-Generator Network based on CopyNet; the Pointer-Generator Network model has advantages in both abstractive and extractive text summarization.

The model developed in this study was constructed with reference to the text summarization model using BERT [1] and has two stages. In the first stage, the input text is encoded into context representations using BERT, and a Transformer-based decoder generates a draft summary text from these representations. In the second stage, the draft summary text is refined using BERT to obtain a crisper summary. In our experiment, we employ only the first stage of this text summarization model.

In this study, we build a text summarization model using BERT and evaluate the model. We also highlight future issues based on the text generated as a result of training on a Japanese corpus.
II. RELATED WORKS

A. BERT
Recently, pre-trained language models such as BERT have been widely incorporated into neural network models. In particular, models built on BERT have achieved state-of-the-art performance on natural language processing tasks. BERT is pre-trained on a huge unlabeled corpus and can achieve better performance by fine-tuning on another corpus.
We briefly describe the structure of the BERT model with reference to [4]. BERT consists of several layers; each layer has a Multi-Head Attention sub-layer and a linear affine sub-layer, each with a residual connection. In our experiment, we utilize the BERT-base model, which has 12 layers and a hidden size of 768.
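As a concrete illustration of this structure, the short sketch below instantiates a BERT-base-shaped encoder and inspects its configuration. It assumes the Hugging Face transformers library, which the paper does not name, and loads no pretrained weights.

# Minimal sketch, assuming the Hugging Face `transformers` library (an
# assumption; the paper does not specify an implementation).
from transformers import BertConfig, BertModel

config = BertConfig()                                  # BERT-base defaults
print(config.num_hidden_layers, config.hidden_size)    # 12 768

model = BertModel(config)                              # randomly initialised, BERT-base shape
# Each encoder layer contains Multi-Head Attention followed by a position-wise
# linear (feed-forward) sub-layer, both wrapped with residual connections.
print(model.encoder.layer[0])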
B. Multi-Head Attention
BERT and the Transformer primarily comprise Multi-Head Attention layers. Multi-Head Attention divides the attention input into multiple parts (heads), computes attention on each, and concatenates the resulting outputs. Multi-Head Attention is more accurate than single-head attention. It can be calculated as shown in equations (1), (2), and (3), as described in reference [2].
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V        (1)

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)            (2)

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O     (3)

Equation (1) represents the attention formula for a single input query. In Multi-Head Attention, the input is split into head units as shown in (2), and each head's output is concatenated as shown in (3) to generate the overall output. In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix Q. The keys and values are also packed together into matrices K and V. W_i^Q, W_i^K, W_i^V, and W^O are parameter matrices, and d_k in Equation (1) is the dimension of the keys K.
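A short sketch of equations (1)-(3) follows. The paper provides no implementation, so the framework (PyTorch), the tensor shapes, and the fusing of the per-head projections into single linear layers are our own assumptions.

import math
import torch
import torch.nn as nn

def attention(q, k, v):
    # Equation (1): Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ v

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=768, num_heads=12):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_head = num_heads, d_model // num_heads
        # W_i^Q, W_i^K, W_i^V for all heads, fused into single projections
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # W^O

    def forward(self, q, k, v):
        batch = q.size(0)
        def split(x, proj):
            # Equation (2): project, then split into h heads of size d_head
            return proj(x).view(batch, -1, self.h, self.d_head).transpose(1, 2)
        heads = attention(split(q, self.w_q), split(k, self.w_k), split(v, self.w_v))
        # Equation (3): concatenate the heads and apply W^O
        concat = heads.transpose(1, 2).contiguous().view(batch, -1, self.h * self.d_head)
        return self.w_o(concat)

# Example: self-attention over a batch of 2 sequences of length 10
x = torch.randn(2, 10, 768)
print(MultiHeadAttention()(x, x, x).shape)   # torch.Size([2, 10, 768])

Because each head projects to d_model / h dimensions, the h heads together cost roughly the same as a single full-width attention while attending to different subspaces.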
III. MODEL

Fig. 1 shows an overview of the model used in our experiment.
A. Encoder
Pre-trained models such as BERT are widely utilized in encoder-decoder models. Because BERT is efficient to train and highly precise, incorporating it improves the accuracy of existing models. In the study conducted by Zhang et al. [1], BERT is used as the encoder to achieve state-of-the-art performance in the abstractive text summarization task. We applied the pre-trained BERT model as the encoder of our model in this experiment.
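As a rough illustration of this encoding step, the sketch below feeds a toy Japanese input through BERT and takes the final hidden states as the context representations H. The Hugging Face transformers library and the multilingual BERT-base checkpoint are our assumptions; the paper does not state which implementation or pretrained weights were used.

import torch
from transformers import BertModel, BertTokenizer

name = "bert-base-multilingual-cased"    # assumed checkpoint, not from the paper
tokenizer = BertTokenizer.from_pretrained(name)
bert = BertModel.from_pretrained(name)

text = "今日は良い天気です。明日は雨が降るそうです。"   # toy two-sentence input
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    H = bert(**inputs).last_hidden_state   # context representations H
print(H.shape)    # (1, number of tokens, 768)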
B. Decoder
The output generated from the encoder is fed into the input of the decoder; however, it was not the proposed choice of decoder. Furthermore, we opted for a Transformer-based decoder as opposed to using Recurrent Neural Networks (RNNs) such as LSTM and GRU for the following reasons: (1) Training time: Transformer-based decoders are built from Multi-Head Attention, which can perform parallel computations and is therefore faster. (2) Accuracy: Transformers are far more accurate than RNNs on machine translation tasks. (3) Long-range dependency: the attention used in Transformers learns long-range dependencies more easily than RNNs such as LSTM.
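A minimal sketch of such a Transformer-based decoder is shown below, using torch.nn.TransformerDecoder attending to the encoder output H. The number of decoder layers, the vocabulary size, and the use of this particular module are our assumptions rather than details given in the paper; the causal mask lets all summary positions be scored in one parallel forward pass, which is the training-time advantage of reason (1).

import torch
import torch.nn as nn

d_model, vocab_size = 768, 32000     # assumed sizes matching BERT-base

layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=12, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=6)   # assumed depth
embed = nn.Embedding(vocab_size, d_model)
to_vocab = nn.Linear(d_model, vocab_size)

H = torch.randn(1, 128, d_model)                  # encoder output (stand-in)
y_prev = torch.randint(0, vocab_size, (1, 20))    # summary tokens decoded so far

# Causal mask: each position may only attend to earlier summary tokens, yet all
# 20 positions are computed in parallel during training.
t = y_prev.size(1)
mask = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)

logits = to_vocab(decoder(embed(y_prev), memory=H, tgt_mask=mask))
print(logits.shape)   # torch.Size([1, 20, 32000])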
C. Abstractive summarization model
The input is denoted as X = {x_1, x_2, ..., x_m}, the sequence representing sentence breaks as S = {s_1, s_2, ..., s_m}, and the corresponding summary as A = {a_1, a_2, ..., a_n}. We start by entering X and S into BERT. If f_sen(x) is taken to be the sentence number of x, the sequence S is computed as S = f_sen(X) mod 2. The resulting BERT encoder output is denoted as H. Next, H and the decoder outputs produced so far are fed into the decoder at the t-th time step. The probability over the vocabulary at the t-th time step is then obtained as shown in (4); it is conditioned on the decoder output up to the t-th time step and on the output of the encoder, H.

P_t(w) = f_decoder(w | H, Y_<t)        (4)

The training loss, L, is calculated as shown in (5) using the vocabulary probability P_t(w).
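The sketch below illustrates these quantities with toy values: the alternating sentence-break sequence S = f_sen(X) mod 2, the per-step vocabulary distribution of equation (4), and a token-level negative log-likelihood as an assumed form of the training loss in (5). The decoder producing the logits is the one sketched in Section III-B, so a random tensor stands in for its output here.

import torch
import torch.nn.functional as F

def sentence_breaks(sentence_numbers):
    # S = f_sen(X) mod 2: the segment id of each token alternates with the
    # sentence number of that token
    return [n % 2 for n in sentence_numbers]

# toy input of six tokens: three from sentence 1, two from sentence 2, one from 3
print(sentence_breaks([1, 1, 1, 2, 2, 3]))   # [1, 1, 1, 0, 0, 1]

vocab_size, steps = 32000, 5
logits = torch.randn(1, steps, vocab_size)        # stand-in decoder scores
gold = torch.randint(0, vocab_size, (1, steps))   # reference summary tokens (toy)

# Equation (4): P_t(w) = f_decoder(w | H, Y_<t), a softmax over the vocabulary
P_t = F.softmax(logits[:, -1, :], dim=-1)

# Assumed form of the loss L in (5): negative log-likelihood (cross-entropy)
# of the reference summary tokens under the predicted distributions
L = F.cross_entropy(logits.reshape(-1, vocab_size), gold.reshape(-1))
print(P_t.shape, L.item())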
Figure 2. Result of training with the livedoor news corpus (left: training loss; right: training accuracy; horizontal axis: training step).