
Japanese abstractive text summarization using BERT

Yuuki Iwasaki
Department of Computer Science
National Institute of Technology, Tokyo College
Tokyo, Japan
[email protected]

Akihiro Yamashita
Department of Computer Science
National Institute of Technology, Tokyo College
Tokyo, Japan
[email protected]

Yoko Konno
CHOWA GIKEN Corporation
Hokkaido, Japan
[email protected]

Katsushi Matsubayashi
Department of Computer Science
National Institute of Technology, Tokyo College
Tokyo, Japan
[email protected]

Abstract - In this study, we developed an automatic abstractive text summarization algorithm for Japanese using a neural network. We used a sequence-to-sequence encoder-decoder model for our experiments. The encoder obtains feature-based vector representations of the input sentences using BERT, and a Transformer-based decoder generates the summary sentence from the encoder output. The experiment was performed on the livedoor news corpus with this model. However, an issue arose in that the same text was repeated within the generated summaries.

Keywords: abstractive text summarization; BERT; livedoor news corpus

I. INTRODUCTION

Text summarization is the task of concisely restating the content of a long text. Machine learning approaches to text summarization are mainly divided into two types: extractive and abstractive summarization. In the former, the input text is split into sentences and a summary is generated by selecting and combining the important ones. The latter, abstractive approach interprets the input text and generates a corresponding summary by itself. Both aim to capture the key points of the input; however, extractive summarization can only reuse sentences taken from the input, whereas abstractive summarization generates summary sentences on its own and is therefore more flexible. In this study, we focus on abstractive summarization.

In recent years, various models have been proposed for abstractive text summarization [1][3]. Zhang et al. [1] proposed an abstractive text summarization model using Bidirectional Encoder Representations from Transformers (BERT). The experimental results reported in that paper show that their model achieved new state-of-the-art performance on both the CNN/Daily Mail and New York Times datasets. See et al. [3] proposed an abstractive text summarization model, the Pointer-Generator Network, based on CopyNet; the Pointer-Generator Network has advantages in both abstractive and extractive text summarization.

The model developed in this study was constructed with reference to the text summarization model using BERT [1] and has two stages. In the first stage, the input text is encoded into context representations using BERT, and a Transformer-based decoder generates a draft summary from these representations. In the second stage, the draft summary is re-verified using BERT to produce a crisper summary. In our experiment, we employ only the first stage of this text summarization model.

In this study, we build a text summarization model using BERT and evaluate it. We also highlight remaining issues observed in the text generated after training on a Japanese corpus.

II. RELATED WORKS

A. BERT
Recently, pre-trained models such as BERT have been widely incorporated into neural network models. In particular, models built on BERT have achieved state-of-the-art performance on many natural language processing tasks. BERT is pre-trained on a huge unlabeled corpus and can achieve better performance by fine-tuning on another corpus.
We briefly describe the structure of the BERT model with reference to [4]. BERT consists of several layers; each layer has a Multi-Head Attention sub-layer and a linear affine sub-layer with residual connections. In our experiment, we utilize the BERT-base model, which has 12 layers and a hidden size of 768.

B. Multi-Head Attention
BERT and the Transformer primarily comprise Multi-Head Attention layers. Multi-Head Attention splits the attention input into multiple parts (heads), applies attention to each part, and concatenates the resulting outputs. Multi-Head Attention is more accurate than single-head attention. It can be calculated as shown in equations (1), (2), and (3), as described in [2].



Figure 1. Overview of our text summarization model

Attention(Q, K, V) = softmax( (Q K^T) / \sqrt{d_k} ) V    (1)

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    (2)

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O    (3)

Equation (1) is the attention function for a single query. In Multi-Head Attention, the input is split into heads as shown in (2), and the outputs of all heads are concatenated as shown in (3) to produce the overall output. In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix Q; the keys and values are also packed together into matrices K and V. W_i^Q, W_i^K, W_i^V, and W^O are parameter matrices, and d_k in equation (1) is the dimension of the keys K.
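To make equations (1)-(3) concrete, the following is a minimal PyTorch sketch of scaled dot-product attention and Multi-Head Attention. The function names, the shape conventions, and the absence of masking and dropout are our own simplifications and not the implementation used in this paper.

```python
import math
import torch

def attention(Q, K, V):
    # Equation (1): scaled dot-product attention.
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)     # (..., q_len, k_len)
    return torch.softmax(scores, dim=-1) @ V              # (..., q_len, d_k)

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o, num_heads):
    # Equations (2)-(3): project, split into heads, attend, concatenate, project back.
    batch, q_len, d_model = Q.shape
    d_head = d_model // num_heads

    def split_heads(x, W):
        x = x @ W                                          # (batch, len, d_model)
        return x.view(batch, -1, num_heads, d_head).transpose(1, 2)

    heads = attention(split_heads(Q, W_q), split_heads(K, W_k), split_heads(V, W_v))
    concat = heads.transpose(1, 2).reshape(batch, q_len, d_model)   # Concat(head_1, ..., head_h)
    return concat @ W_o                                    # apply W^O
```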
III. MODEL

Fig. 1 shows an overview of the model used in our experiment.

A. Encoder
Several pre-trained models, such as BERT, are widely utilized in encoder-decoder models. Because BERT enables fast training with high precision, it improves the accuracy of existing models. In the study by Zhang et al. [1], BERT is used as the encoder to achieve state-of-the-art performance on the abstractive text summarization task. We therefore applied the pre-trained BERT model as the encoder of our model in this experiment.

B. Decoder
A Transformer-based decoder was configured as the decoder in our model. Because BERT cannot simultaneously feed the output generated by the encoder into the input of the decoder, it was not our choice of decoder. Furthermore, we opted for a Transformer-based decoder rather than Recurrent Neural Networks (RNNs) such as LSTM and GRU for the following reasons: (1) training time: Transformer-based decoders are built with Multi-Head Attention, which allows parallel computation and is therefore faster; (2) accuracy: Transformers are far more accurate than RNNs on machine translation tasks; (3) long-range dependency: the attention used in Transformers learns long-range dependencies more easily than RNNs such as LSTM.

C. Abstractive summarization model
The input is denoted as X = {x_1, x_2, ..., x_n}, the sequence representing sentence breaks as S = {s_1, s_2, ..., s_m}, and the corresponding summary as A = {a_1, a_2, ..., a_k}. We start by entering X and S into BERT. If f_sen(x) is the sentence number of token x, the sequence S is computed as S = f_sen(X) mod 2. The resulting BERT encoder output is denoted as H. Next, we feed H and the decoder output up to the t-th time step into the decoder. The probability of each vocabulary item at the t-th time step is obtained as shown in (4); this probability is conditioned on the decoder output before the t-th time step and the encoder output H.

P_t(w) = f_decoder(w | H, Y_{<t})    (4)

The training loss L is calculated as shown in (5) using the vocabulary probability P_t(w).

L = - \sum_t \log P(y_t | H, a_{<t})    (5)
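As an illustration of how the sentence-break sequence S and the loss in (5) can be computed, the following is a small sketch. The helper names, the padding id, and the use of cross-entropy (which averages rather than sums the per-token negative log-likelihood) are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def sentence_segments(sentence_numbers):
    # S = f_sen(X) mod 2: the segment id alternates between 0 and 1 whenever a
    # new sentence starts, which signals sentence breaks to the BERT encoder.
    return [n % 2 for n in sentence_numbers]

def summarization_loss(decoder_logits, target_ids, pad_id=0):
    # decoder_logits: (batch, summary_len, vocab_size), the decoder scores that
    # define P_t(w) in (4); target_ids: (batch, summary_len) reference tokens.
    # Equation (5) as a negative log-likelihood, averaged over non-padding tokens.
    vocab = decoder_logits.size(-1)
    return F.cross_entropy(decoder_logits.reshape(-1, vocab),
                           target_ids.reshape(-1),
                           ignore_index=pad_id)

# Tokens 0-3 belong to sentence 0, tokens 4-6 to sentence 1, tokens 7-8 to sentence 2.
print(sentence_segments([0, 0, 0, 0, 1, 1, 1, 2, 2]))   # [0, 0, 0, 0, 1, 1, 1, 0, 0]
```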

Figure 2. Result of training with livedoor news corpus (left: the loss of training, right: the accuracy of training)

IV. EXPERIMENT

A. Setting
In this experiment, we used a BERT model pre-trained at the Kurohashi-Kawahara Laboratory of Kyoto University. Most of the BERT hyperparameters were the same as those of BERT-Base (i.e., 12 layers, 768 hidden units, 12 heads) in [4]. The model was trained for 30 epochs on a Japanese Wikipedia corpus of about 1.8 billion words. The input text was divided into sub-words with byte pair encoding (BPE) using the morphological analysis system Juman++ [5]. The vocabulary size was 32,000 words. The decoder in our model comprised 8 Multi-Head Attention layers, a feed-forward hidden size of 3,072, and an embedding size of 768, the same as the encoder.
We used Adam as the optimizer, with parameters β_1 = 0.9, β_2 = 0.999, and ε = 1e-9. The maximum learning rate was set to 1e-4. A dynamic learning rate following [2] was adopted; it is computed as shown in equation (6).

lr = max_learning_rate * min(cs^{-0.5}, ws^{-1.5} * cs) / ws^{-0.5}    (6)

In principle, the learning rate increases linearly up to the warmup step (ws). Once the current step (cs) exceeds the warmup step, the learning rate gradually decreases. The learning rate peaks when the current step equals the warmup step; at that point, lr = max_learning_rate. For our experiment, we set the warmup step (ws) to 4,000 and max_learning_rate to 0.0001. The model contains a BERT encoder with 12 Multi-Head Attention layers and a Transformer-based decoder with 8 Multi-Head Attention layers. As the maximum input sequence length is set to 512, the batch size is limited to 4 by GPU memory.
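The schedule in (6), together with the Adam settings above, can be expressed in a few lines. Only the constants come from the paper; the function name and the (commented-out) pairing with a PyTorch optimizer and LambdaLR scheduler are our own assumptions.

```python
MAX_LR = 1e-4     # max_learning_rate
WARMUP = 4000     # warmup step (ws)

def learning_rate(step, ws=WARMUP, max_lr=MAX_LR):
    # Equation (6): linear warmup up to `ws` steps, then decay proportional to step^-0.5.
    step = max(step, 1)                        # guard against step 0
    return max_lr * min(step ** -0.5, step * ws ** -1.5) / ws ** -0.5

print(learning_rate(1))       # ~2.5e-08: near zero at the start of warmup
print(learning_rate(WARMUP))  # 1e-04:    peaks exactly at the warmup step
print(learning_rate(40000))   # ~3.2e-05: decays afterwards

# With PyTorch, the schedule can be attached to the Adam optimizer described above:
# optimizer = torch.optim.Adam(model.parameters(), lr=MAX_LR, betas=(0.9, 0.999), eps=1e-9)
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda s: learning_rate(s) / MAX_LR)
```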
B. Dataset
The livedoor news corpus contains 130,000 Japanese news articles from Livedoor News; each article is accompanied by a three-line summary. The article body and the summary are used as the input and the target output of the experiment, respectively. From the dataset, 100,000 datapoints were used for training and 30,000 for verification. The maximum length of the input sequence was set to 512 tokens; however, some articles exceeded this limit. In such cases, only the first 512 tokens were entered into the model. Because, in news summarization, the key points are usually stated at the beginning of the article, this truncation was deemed appropriate.
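A minimal sketch of the truncation described above; whether the [CLS] and [SEP] markers are counted inside the 512-token budget is our own assumption.

```python
MAX_LEN = 512

def truncate_article(tokens, max_len=MAX_LEN):
    # Keep only the first tokens of an over-long article; the lead of a news
    # article usually carries the key points, so the tail is simply discarded.
    body = tokens[:max_len - 2]           # reserve two positions for the markers
    return ["[CLS]"] + body + ["[SEP]"]

article = ["w%d" % i for i in range(600)]
print(len(truncate_article(article)))     # 512
```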
C. Training
When the model was trained for 15 epochs on the 100,000 livedoor news corpus datapoints, the per-word loss on the validation data was approximately 6.5, and the accuracy of predicting the correct vocabulary item was approximately 67%. Fig. 2 shows the training curves. The graphics card used for training was a Titan X (Pascal) with 12 GB of memory; training took approximately three days.

V. RESULT AND ANALYSIS

Fig. 3 and Fig. 4 show example results (translated into English) of text summarization on the livedoor news corpus, produced by the model trained for 15 epochs on the 100,000 training datapoints. The text under "Output text" is the model output in Japanese. When generating the text, a beam search with a width of 4 was used. Prior to training, WordPiece was used to further divide out-of-vocabulary words into multiple sub-words. The "##" identifier, as shown in Fig. 3, marks a piece that continues a word that has been divided into sub-words. For example, the word "スモー##ク" ("smoke") in Fig. 3 is not present in the vocabulary, so it is divided into "スモー" and "##ク" to fit the vocabulary. A word is divided only as long as it can be represented by other entries in the vocabulary; otherwise it is treated as an unknown word.
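The splitting behaviour described above corresponds to the standard greedy longest-match-first WordPiece procedure; the sketch below illustrates it with the "スモーク" example. The tiny vocabulary and the function name are hypothetical, and the tokenizer actually used in the experiment may differ in detail.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    # Greedy longest-match-first sub-word splitting in the WordPiece style:
    # non-initial pieces carry the "##" prefix, and a word with no match
    # anywhere becomes a single unknown token.
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:          # no sub-word matched: treat the whole word as unknown
            return [unk]
        pieces.append(piece)
        start = end
    return pieces

vocab = {"スモー", "##ク", "映画"}
print(wordpiece("スモーク", vocab))  # ['スモー', '##ク']
print(wordpiece("未知語", vocab))    # ['[UNK]']
```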
Input text
[CLS] Director Wayne Wang from Hong Kong, who has won the Silver Bear Award of the Berlin International Film Festival, one of the world's three largest film festivals, in the film "Smoke" ([UNK]), and Japan's world-famous Beat Takeshi. A movie titled "When a Woman Sleeps" (Hide on the [UNK] month [UNK] day). Starring outside his own director's work, since [UNK] "Blood and Bones", Takeshi, who was actually [UNK] for the first time in years, said, "I was anxious about my performance enough to hold my head." I made it. The reason is that the boundary between dreams and delusions and reality is vague and mysterious. [UNK] People talked to each other. This film, which was a short story by Spanish writer Javier Marias, was filmed in a quiet resort hotel. [UNK]'s mystery, staring at a writer (Nishijima) who is obsessed with his mysterious relationship and runs out of curiosity. "I thought that the star was Nishijima-kun," Takeshi, who smiled, read the script and read, "I thought it was a bad idea. He said he had a hard time understanding a complicated story. Nishijima, who admires that he is more familiar with film expression than himself, relieved, "If you say "good", there will be no mistake." "I was watching only acting, and when I finished watching, I didn't remember what kind of movie it was", revealing an undiscovered side called "Kitano in the world". "I think it will come out because Wayne Wang's film is taken by Beat Takeshi." (Laughs), Nishijima decided to appear immediately. I was really shocked and really moved [SEP]

Output text
映画「スモー##ク」で世界三大映画祭祭に数えられる映画「女が眠るとき」が公開された「自分の演技が不安だし、完成作も不安で頭を抱えながら自分の演技ばかりをずっと観ていて。自分の演技ばかりをずっと観ていたと##い##う[SEP]
(The movie "Smoke", the movie "When a woman sleeps" that was counted at the world's three largest film festivals was released. I was always watching my performance.[SEP])

Figure 3. The generated summary text by the trained model
Fig. 3 demonstrates that the model was able to learn correctly to some extent, as the summary text retains the key points of the input text.
It can also be seen in Fig. 3 that the phrase "自分の演技ばかりを観ていた" ("I was only watching my performance.") appears twice, which is not desirable in a summary. Cases in which the model repeated a phrase in the summary occurred 3 to 4 times out of 10. The characteristics of the input texts that are likely to cause this could not be identified.
Fig. 4 contains some words, such as "[UNK]," that were not present in the vocabulary. Such cases were handled by WordPiece: words were split until a match was found in the vocabulary, and chunks for which no vocabulary match was found were processed as unknown words, as it is not possible to find matches for every unknown word.
Figure 4 shows another summary example.
Input text
[CLS] On this month [UNK], the Agency for Cultural Affairs, an external office of the Ministry of Education, Culture, Sports, Science and Technology, put together a draft guideline on "Kanji handwritten characters, which are widely accepted, not stuck, and not splashed". As a result, various kanji characters will be accepted, but various opinions such as voices of doubt and support are on the internet. According to the Agency for Cultural Affairs, if there are parts of the kanji that correspond to the framework, try to tolerate small differences in character shape. For example, the angle of the first stroke of "word" may be diagonal, horizontal, or vertical. "Click here for an article with images." This news has ripples on the internet. Will it disappear? Why isn't the Kanji breaking down more and more? · What does it mean? There are many comments, such as "I'm too ugly" and what was the strict instruction in elementary school ... The kanji stop, Haneno [UNK], is one of the first strict instructions for elementary school students, and many people are surprised that it will disappear. However, more surprisingly, because there are various ways of writing Kanji handwritten characters, it is originally written in the regular Kanji table established by the Ministry of Education, Culture, Sports, Science and Technology. Usually, the "example" has been "guidance guide" and "printing letters", not "correct letter shapes". Some people have heard about this for the first time in this case. I didn't know that I didn't have the right shape. I was surprised and convinced, but there was also a comment that there were various ways of writing. By the way, the confusing "stroke order" is the same on the right, and although there is a guide, there is nothing officially established. In recent years, there has been an increase in confusion about kanji at entrance examinations and financial institution windows, so this time, we are trying to spread this recognition socially.[SEP]

Output text
文部科学省の外##局となる文化##庁が、漢字の手書き文字に##ついて語##った「は##ねにはこだ##わらず広く許容する」と##い##う指針##案をまとめたネットでは「学校に##よ##るものがない」「質##んな書き##方がある」などの声があ##が##った[SEP]
(The Agency for Cultural Affairs, the Ministry of Education, Culture, Sports, Science and Technology (MEXT) spoke about the handwritten characters of kanji, and the voice that says "There are various ways of writing" on the Internet, which summarizes the draft guidelines of "tolerate widely without splashing"[SEP])

Figure 4. Another example of generated summary text

Another limitation of this study is that the model sometimes produced incorrect words in the output summary; for example, in Fig. 4, "質##んな" should be "色んな," which means "various."
As described above, problems such as repeated content, the inability to handle unknown words, and simple word mistakes occurred in this study. The following subsections discuss mechanisms that could address each of these problems.
A. Coverage mechanism
In sequence-to-sequence neural summarization models, the content of the summary sentence is often repeated, as noted in [1]. This can be avoided by implementing a "coverage mechanism," which adds an additional loss for words that have already been attended to multiple times. The loss is calculated by (7), (8), and (9), with reference to [3].

c^t = \sum_{t'=0}^{t-1} a^{t'}    (7)

where c^t is the sum of the attention distributions over all decoder steps before step t, a^{t'} is the attention distribution at step t', and c^t is additionally fed into the decoder.

covloss_t = \sum_i \min(a_i^t, c_i^t)    (8)

Equation (8) represents the loss added to the loss function by the coverage mechanism. The overall loss of the model is therefore as shown in (9).

loss_t = L + \lambda * covloss_t    (9)

where L is the loss computed in (5) and \lambda is a hyperparameter.
With this method, the loss increases when a word is attended to repeatedly, so the model learns to avoid producing the same word as much as possible. This can help our model minimize content repetition in the summary sentence.
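A minimal sketch of equations (7)-(9); the tensor layout (one attention distribution per decoder step) and the loop-based accumulation are illustrative choices, not the paper's implementation.

```python
import torch

def coverage_loss(attn_dists):
    # attn_dists: tensor of shape (steps, src_len); row t is the attention
    # distribution a^t over the source at decoder step t.
    # Equation (7): c^t accumulates the attention of all previous steps.
    # Equation (8): covloss_t = sum_i min(a_i^t, c_i^t), which grows whenever
    # the decoder re-attends to source positions it has already covered.
    coverage = torch.zeros_like(attn_dists[0])
    total = torch.zeros(())
    for a_t in attn_dists:
        total = total + torch.minimum(a_t, coverage).sum()
        coverage = coverage + a_t
    return total

# Equation (9): overall objective, with lam standing for the hyperparameter lambda.
# loss_t = nll_loss + lam * coverage_loss(attn_dists)
```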
B. Copy mechanism
In human text summarization, words and phrases from the original text are often reused as-is in the summary. A neural summarization model, however, generates from a fixed vocabulary, and words outside of the vocabulary cannot be produced; for instance, the names of people appearing in the text may not be usable. This has a considerable impact on the accuracy of the summary. It can be handled by implementing the copy mechanism defined in [3]. The copy mechanism assigns a temporary index to each out-of-vocabulary word in the input sentence so that it can be copied into the summary sentence. With reference to [2], the equations for the copy mechanism are shown in (10), (11), (12), and (13).

u_i^t = o_t W_c h_i    (10)

a_i^t = \exp(u_i^t) / \sum_j \exp(u_j^t)    (11)

We start by calculating the attention probability distribution using the encoder output h_i and the decoder output up to the t-th step, o_t; equation (11) applies a softmax to u^t. Then, we compute a gate g_t from the encoder output h and the decoder output up to the t-th step, o_t. The gate helps to select words from the input text to be added to the output text. The gate is a real number between 0 and 1, where g_t and (1 - g_t) are the coefficients used to copy and to generate words, respectively.

g_t = sigmoid(W_g [o_t, h] + b_g)    (12)

We calculate the copy and generation probabilities using the gate g_t.

P_t(w) = (1 - g_t) P_t^{vocab}(w) + g_t \sum_{i: w_i = w} a_i^t    (13)

The right-hand-side terms of (13) are defined as follows: (1 - g_t) P_t^{vocab}(w) is the generation probability, where P_t^{vocab}(w) is the vocabulary probability P_t(w) shown in (4), and g_t \sum_{i: w_i = w} a_i^t is the copy probability.
A sequence-to-sequence model inherently maps the decoder output to a distribution over the fixed vocabulary and uses that as the final probability. The copy mechanism, however, calculates the final probability by computing a probability for each word of the encoder input and adding it to the decoder's output probability. In addition, an identifier is temporarily assigned to each out-of-vocabulary word, which can then be used as-is when generating the summary sentence. This would solve the problem of being unable to use words that are not in the vocabulary.
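A minimal sketch of how the final distribution in (13) can be assembled. The extended-vocabulary bookkeeping, in which out-of-vocabulary source words carry temporary ids at positions vocab_size and above, follows the pointer-generator convention of [3]; the function signature is our own assumption.

```python
import torch

def final_distribution(p_vocab, attn, src_ids, g, vocab_size, num_oov):
    # Equation (13): P_t(w) = (1 - g_t) * P_t^vocab(w) + g_t * sum_{i: w_i = w} a_i^t
    # p_vocab : (batch, vocab_size)  generation distribution from (4)
    # attn    : (batch, src_len)     attention distribution a^t from (11)
    # src_ids : (batch, src_len)     source token ids (int64); out-of-vocabulary
    #                                words carry temporary ids >= vocab_size
    # g       : (batch, 1)           copy gate from (12), in [0, 1]
    extended = torch.zeros(p_vocab.size(0), vocab_size + num_oov)
    extended[:, :vocab_size] = (1.0 - g) * p_vocab        # generation probability
    extended.scatter_add_(1, src_ids, g * attn)           # copy probability
    return extended
```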
C. Model improvement
The third problem is simple word mistakes. In this study, as explained in Section I, we used only the first of the two stages of the model developed in [1]. The second stage verifies each word of the summary sentence using BERT. When the draft summary sentence and the encoder output are entered into the second-stage decoder, the draft summary is masked word by word and fed into BERT to obtain a context vector. A probability for each masked word is then computed using a Transformer-based decoder configured as in stage one. The purpose of the second stage is to improve the accuracy of the summary by checking the correctness of the draft summary word by word. Therefore, the simple word mistakes encountered in our output sentences could be eliminated by adding this refining BERT and Transformer-based decoder to the summarization model.
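The word-by-word verification of the second stage can be sketched as follows. Since this stage was not implemented in our experiment, `bert`, `decoder`, `mask_id`, and the greedy argmax replacement are assumed placeholders rather than the procedure of [1].

```python
import torch

def refine_draft(draft_ids, H, bert, decoder, mask_id):
    # draft_ids: (batch, draft_len) token ids of the stage-one draft summary.
    # Each position is masked in turn, the masked draft is encoded with BERT,
    # and a Transformer-based decoder conditioned on the encoder output H
    # predicts the masked word; the prediction replaces the draft token.
    refined = draft_ids.clone()
    for i in range(draft_ids.size(1)):
        masked = draft_ids.clone()
        masked[:, i] = mask_id
        context = bert(masked)                        # context vectors of the masked draft
        logits = decoder(context, H)                  # (batch, draft_len, vocab_size)
        refined[:, i] = logits[:, i].argmax(dim=-1)   # most probable word at position i
    return refined
```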
VI. CONCLUSION AND FUTURE WORK

We conducted an experiment on Japanese abstractive text summarization with a neural network model using BERT. Our model comprised a BERT encoder and a Transformer-based decoder. The dataset used in this paper was the livedoor news corpus, consisting of 130,000 datapoints, of which 100,000 were used for training.
The results of the experiment revealed that the model was able to learn correctly to some extent, as the summary sentences captured the key points of the text. However, content was repeated within the summaries, and the model was unable to handle unknown words. Additionally, there was a problem of simple word mistakes. We believe that these problems could be addressed by utilizing the coverage and copy mechanisms and by improving the model.
In future work, we will explore these recommendations with new experiments and compare the results.

ACKNOWLEDGMENT
This work was supported by JSPS KAKENHI Grant Number 19K12906.

REFERENCES
[1] Haoyu Zhang, Jianjun Xu, Ji Wang, "Pretraining-Based Natural Language Generation for Text Summarization", arXiv preprint arXiv:1902.09243.
[2] Ashish Vaswani et al., "Attention Is All You Need", arXiv preprint arXiv:1706.03762.
[3] Abigail See, Peter J. Liu, Christopher D. Manning, "Get To The Point: Summarization with Pointer-Generator Networks", arXiv preprint arXiv:1704.04368.
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", arXiv preprint arXiv:1810.04805.
[5] Kyoto University, Morphological Analysis System JUMAN++, http://nlp.ist.i.kyoto-u.ac.jp/index.php?JUMAN++
