Transformer-Based Korean Pretrained Language Models
Abstract—With the advent of the Transformer, introduced for translation models in 2017, attention-based architectures began to attract attention. After the emergence of BERT, which strengthened the NLU-oriented encoder part of the Transformer, and of the GPT architecture, which strengthened the NLG-oriented decoder part, various methodologies, datasets, and models for training Pretrained Language Models (PLMs) began to appear. Moreover, in the past three years, various Pretrained Language Models specialized for Korean have been released. In this paper, we numerically and qualitatively compare and analyze the Korean PLMs that have been released to the public.
1 INTRODUCTION
Fig. 2. Three main types of PLM: Encoder-Centric Models (left), trained with the MLM task for language understanding; Decoder-Centric Models (center), trained with the next-token prediction task for language generation; and Seq2Seq Models (right), trained with various objectives and tasks (NMT, summarization, etc.) using next-token prediction for generation grounded in understanding. Detailed objectives and architectures can differ across individual models.
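As a concrete, non-authoritative illustration of these three model types, the sketch below loads one model of each kind with the Hugging Face transformers auto classes. The checkpoint names are assumptions chosen only for illustration and are not prescribed by this paper.

# A minimal sketch: one encoder-centric, one decoder-centric, and one
# seq2seq Korean PLM loaded via the transformers auto classes.
# The checkpoint names below are illustrative assumptions.
from transformers import (
    AutoModelForMaskedLM,    # encoder-centric, MLM objective (BERT-style)
    AutoModelForCausalLM,    # decoder-centric, next-token prediction (GPT-style)
    AutoModelForSeq2SeqLM,   # encoder-decoder, seq2seq objectives (BART/T5-style)
)

encoder = AutoModelForMaskedLM.from_pretrained("beomi/kcbert-base")        # assumed checkpoint
decoder = AutoModelForCausalLM.from_pretrained("skt/kogpt2-base-v2")       # assumed checkpoint
seq2seq = AutoModelForSeq2SeqLM.from_pretrained("gogamza/kobart-base-v2")  # assumed checkpoint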
Fig. 3. Objective of Encoder-Centric Models. The objective is to predict the tokens at the masked positions of the input; it is designed for Natural Language Understanding.

Encoder-centric models focus on extracting the features of language. Tasks such as classification, clustering, and tagging can use this type of model as a PLM.

3.1.1 KoBERT
KoBERT8 is the first Korean pretrained model shared on Hugging Face, released by SKT-Brain. Its configuration is mostly the same as BERT's, but the tokenizer uses SentencePiece9 rather than the WordPiece tokenizer used in BERT. For pretraining, 5 million sentences and 54 million words from the Korean Wikipedia were used.

3.1.2 HanBERT
HanBERT10 is a BERT model trained on about 150 GB of Korean corpus (General Domain: 70 GB, Patent Domain: 75 GB) comprising 700 million sentences. The tokenizer is a private tokenizer called the Moran Tokenizer, and the vocabulary size is 54,000.

3.1.4 KcBERT
KcBERT [25] is a Korean BERT model trained on about 12 GB of Naver politics news comments. The tokenizer uses WordPiece [26] BPE, and the data is preprocessed to handle emojis and special characters.

3.1.5 SoongsilBERT (KcBERT2)
SoongsilBERT14 is a language model pretrained on community data from Soongsil University and the Modu Corpus in addition to the news comments data used in KcBERT. Most of the settings are identical to KcBERT, except that it is based on the RoBERTa model and uses a byte-level BPE tokenizer. SoongsilBERT fits community terminology better; in other words, it does not perform well in non-community domains.

3.1.6 KcELECTRA
KcELECTRA15 is a model trained by collecting additional data (mainly comments) on top of the data used for KcBERT. On the NSMC task, the model currently achieves state-of-the-art results.

8. https://github.com/SKTBrain/KoBERT
9. https://github.com/google/sentencepiece
10. https://github.com/tbai2019/HanBert-54k-N
11. https://github.com/monologg/KoELECTRA
12. https://corpus.korean.go.kr/
13. Large-scale Korean open domain encyclopedia.
14. https://github.com/jason9693/Soongsil-BERT
15. https://github.com/Beomi/KcELECTRA
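To make the masked-token objective of the encoder-centric models above concrete, the following sketch runs a fill-mask query with the transformers pipeline. The checkpoint name is an assumption for illustration; any of the BERT-family models discussed here could be substituted.

# A minimal sketch of the MLM objective: predict the token at the masked position.
# The checkpoint name is an illustrative assumption.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="beomi/kcbert-base")

# "트랜스포머는 자연어 처리 [MASK]의 일종이다." -- "A Transformer is a kind of NLP [MASK]."
for candidate in fill_mask("트랜스포머는 자연어 처리 [MASK]의 일종이다."):
    print(candidate["token_str"], candidate["score"])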
TABLE 1
Results of Single-Sentence Tasks.
TABLE 2
Results of Multiple-Sentence & Agent Tasks.
4.2.1 KorNLI Result
The NLI task classifies the relationship between two sentences as "entailment", "contradiction", or "neutral". The KorNLI dataset has 942,854 examples (pairs) for training, 2,490 examples for evaluation, and 5,010 examples for testing. KoELECTRA achieved state-of-the-art results on this task. However, most Korean PLMs scored lower than XLM, which is not a Korean-centered model.

... Wikipedia articles, 60,407 Q&A pairs for the training set, and 5,774 Q&A pairs for the dev set. On this task, KoBigBird scored the highest (87.08 EM / 94.71 F1). On the other hand, KcBERT and KoBERT did not perform well, scoring even lower than XLM. It seems the sentence length of the corpus used for pretraining is too short for these models to handle long sequences.
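As a rough sketch of the sentence-pair setup behind the KorNLI task, the code below encodes a premise/hypothesis pair and reads off a three-way prediction. The checkpoint name and label order are assumptions; a meaningful prediction requires a classification head actually fine-tuned on KorNLI.

# A minimal sketch of three-way NLI classification over a sentence pair.
# The checkpoint and label order are assumptions; the classification head
# here is untrained unless replaced by a KorNLI fine-tuned model.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

ckpt = "monologg/koelectra-base-v3-discriminator"  # assumed backbone
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=3)

premise = "저는 어제 영화를 봤습니다."        # "I watched a movie yesterday."
hypothesis = "저는 어제 집에만 있었습니다."    # "I stayed home all day yesterday."

inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
labels = ["entailment", "neutral", "contradiction"]  # assumed label order
print(labels[logits.argmax(dim=-1).item()])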
REFERENCES

[2] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Bert: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.
[3] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, "Improving language understanding by generative pre-training."
[4] H. Sak, A. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling."
[5] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz et al., "Huggingface's transformers: State-of-the-art natural language processing," arXiv preprint arXiv:1910.03771, 2019.
[6] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., "Language models are few-shot learners," arXiv preprint arXiv:2005.14165, 2020.
[7] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., "An image is worth 16x16 words: Transformers for image recognition at scale," in International Conference on Learning Representations, 2020.
[8] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, "Zero-shot text-to-image generation," CoRR, vol. abs/2102.12092, 2021. [Online]. Available: https://arxiv.org/abs/2102.12092
[9] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using rnn encoder–decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1724–1734.
[10] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
[11] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
[12] M.-T. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1412–1421.
[13] M. Ott, S. Edunov, D. Grangier, and M. Auli, "Scaling neural machine translation," in Proceedings of the Third Conference on Machine Translation: Research Papers, 2018, pp. 1–9.
[14] D. So, Q. Le, and C. Liang, "The evolved transformer," in International Conference on Machine Learning, 2019, pp. 5877–5886.
[15] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "Roberta: A robustly optimized bert pretraining approach," arXiv preprint arXiv:1907.11692, 2019.
[16] K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning, "ELECTRA: Pre-training text encoders as discriminators rather than generators," in ICLR, 2020. [Online]. Available: https://openreview.net/pdf?id=r1xMH1BtvB
[17] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial networks," Communications of the ACM, vol. 63, no. 11, pp. 139–144, 2020.
[18] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, "Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension."
[19] J. Ham, Y. J. Choe, K. Park, I. Choi, and H. Soh, "Kornli and korsts: New benchmark datasets for korean natural language understanding," CoRR, vol. abs/2004.03289, 2020. [Online]. Available: https://arxiv.org/abs/2004.03289
[20] J. Moon, W. I. Cho, and J. Lee, "BEEP! Korean corpus of online news comments for toxic speech detection," in Proceedings of the Eighth International Workshop on Natural Language Processing for Social Media. Online: Association for Computational Linguistics, Jul. 2020, pp. 25–31. [Online]. Available: https://www.aclweb.org/anthology/2020.socialnlp-1.4
[21] S. Park, J. Moon, S. Kim, W. I. Cho, J. Han, J. Park, C. Song, J. Kim, Y. Song, T. Oh, J. Lee, J. Oh, S. Lyu, Y. Jeong, I. Lee, S. Seo, D. Lee, H. Kim, M. Lee, S. Jang, S. Do, S. Kim, K. Lim, J. Lee, K. Park, J. Shin, S. Kim, L. Park, A. Oh, J.-W. Ha, and K. Cho, "Klue: Korean language understanding evaluation," 2021.
[22] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, "Glue: A multi-task benchmark and analysis platform for natural language understanding," 2019.
[23] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, "Exploring the limits of transfer learning with a unified text-to-text transformer," 2020.
[24] K. Song, X. Tan, T. Qin, J. Lu, and T.-Y. Liu, "Mass: Masked sequence to sequence pre-training for language generation," in ICML, 2019.
[25] J. Lee, "Kcbert: Korean comments bert," in Proceedings of the 32nd Annual Conference on Human and Cognitive Language Technology, 2020, pp. 437–440.
[26] M. Schuster and K. Nakajima, "Japanese and korean voice search," in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2012, pp. 5149–5152.
[27] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, "Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter," arXiv preprint arXiv:1910.01108, 2019.
[28] J. Park and D. Kim, "Kobigbird: Pretrained bigbird model for korean," Nov. 2021. [Online]. Available: https://doi.org/10.5281/zenodo.5654154
[29] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, "Language models are unsupervised multitask learners," 2019.
[30] B. Kim, H. Kim, S.-W. Lee, G. Lee, D. Kwak, J. D. Hyeon, S. Park, S. Kim, S. Kim, D. Seo et al., "What changes can large-scale language models bring? intensive study on hyperclova: Billions-scale korean generative pretrained transformers," in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 3405–3424.
[31] I. Kim, G. Han, J. Ham, and W. Baek, "Kogpt: Kakaobrain korean(hangul) generative pre-trained transformer," https://github.com/kakaobrain/kogpt, 2021.
[32] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov, "Unsupervised cross-lingual representation learning at scale," CoRR, vol. abs/1911.02116, 2019. [Online]. Available: http://arxiv.org/abs/1911.02116