Transformer-based Korean Pretrained Language Models: A Survey on Three Years of Progress
Kichang Yang, Undergraduate Student, Soongsil University

Abstract—With the advent of the Transformer, introduced for translation models in 2017, attention-based architectures began to attract wide attention. After the emergence of BERT, which strengthens the NLU-oriented encoder part of the Transformer, and of the GPT architecture, which strengthens the NLG-oriented decoder part, a variety of methodologies, datasets, and models for training Pretrained Language Models (PLMs) appeared. In the past three years, various PLMs specialized for Korean have also been released. In this paper, we quantitatively and qualitatively compare and analyze the Korean PLMs that have been released to the public.

Index Terms—Computational Linguistics, Natural Language Processing, Machine Learning, AI

1 INTRODUCTION

The hot keyword in Natural Language Processing, and in Machine Learning more broadly, over the last three years has been the Transformer [1]-based BERT [2] and GPT [3] models built on the attention algorithm. The Transformer was originally proposed for Neural Machine Translation (NMT) to avoid the gradient bottleneck that occurs when training RNNs [4] for translation. BERT later specialized the encoder part of the Transformer for NLU, while GPT specialized the decoder part for NLG. With the advent of these two models, a wide range of models, algorithms, and data pre-processing methods appeared. In addition, with the release of the pretrained-model sharing platform "Transformers" [5] created by huggingface1, the NLP/AI field achieved unprecedented growth in both academia and industry. Most recently, large-scale models such as GPT-3 [6], which scales up the parameters and data of GPT by hundreds of times or more, and models that expand (ViT [7]) or mix (DALL-E [8]) modalities have appeared, seemingly bringing the field a little closer to AGI. On top of this huggingface platform and the Transformer family of models, research and development of models specialized for the Korean domain has also been actively conducted by companies, schools, and individuals. Accordingly, we conduct a comprehensive survey combining the research results of individual researchers and developers and of Korean companies such as Naver2, Kakao3, and SKT4. The contributions of this paper are as follows.

• Introduction and summary of the types of Korean models that have been released so far
• Introduction and arrangement of the Korean benchmark datasets that have been released so far
• Comprehensive score analysis of the published models

contact: [email protected]
1. https://huggingface.co/
2. https://www.navercorp.com
3. https://www.kakaocorp.com
4. https://www.sktelecom.com

2 RELATED WORKS

2.1 Neural Machine Translation

The most popular framework for NMT is the encoder-decoder model [1], [9], [10], [11], [12]. Adopting an attention module greatly improved the performance of the encoder-decoder model by using a context vector instead of a fixed-length vector [11], [12]. By exploiting multiple attention heads, the Transformer has become the de-facto standard model in NMT [1], [13], [14].

Fig. 1. The emergence of huggingface provided a platform for sharing pretrained models in the natural language processing and machine learning fields, and at the same time brought a revival of attention-based Transformer models, leading to tremendous growth in both industry and academia in AI and NLP.

2.2 Pretraining with Unsupervised Feature-based Approaches

Recently, several mainstream pretraining approaches that use feature-based objectives have emerged. OpenAI GPT [3] uses the decoder of the Transformer architecture with a next-token prediction (auto-regressive) objective. On the other side, BERT [2] uses the encoder submodule of the Transformer with Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) objectives for pretraining. RoBERTa [15] is similar to the BERT architecture, except that it is trained without the NSP objective and without static masking during pretraining. ELECTRA [16] combines the MLM objective with an adversarial objective inspired by GAN [17] training, and unlike a GAN, only the discriminator is used for finetuning. BART [18] uses both the encoder and the decoder (i.e., the full Transformer architecture) with several permutation and deletion objectives. In the finetuning step, the task-specific inputs and outputs are simply plugged into each PLM, and all parameters are finetuned end-to-end. However, recent research on large-scale PLMs such as GPT-3 [6] shows that no finetuning step is needed when the model and data are large enough to memorize the tasks and information in the training data. Since very few large-scale Korean PLMs exist, our survey does not include this type of PLM in its benchmark comparison.
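To make the MLM objective concrete, the following sketch (our illustration, not the preprocessing code of any particular Korean PLM) applies BERT-style dynamic masking: about 15% of positions are selected, of which 80% are replaced with a [MASK] token, 10% with a random token, and 10% are left unchanged, and the model is trained to recover the original tokens at those positions.

import random

def mask_tokens(token_ids, mask_id, vocab_size, mlm_prob=0.15):
    """BERT-style dynamic masking (80% [MASK] / 10% random / 10% unchanged)."""
    inputs = list(token_ids)
    labels = [-100] * len(inputs)          # -100 = position ignored by the loss
    for i, tok in enumerate(token_ids):
        if random.random() < mlm_prob:     # select roughly 15% of positions
            labels[i] = tok                # the model must predict the original token here
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_id        # replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(vocab_size)  # replace with a random token
            # else: keep the original token unchanged
    return inputs, labels

# Toy usage with made-up token ids (mask_id and vocab_size are illustrative).
masked, labels = mask_tokens([101, 9521, 11102, 102], mask_id=4, vocab_size=32000)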

2.3 Korean NLP Benchmarks

Various finetuning datasets and test sets for measuring the performance of Korean natural language tasks have been released. The NSMC5 dataset is a sentiment analysis dataset labeled on Naver movie review comments. Naver and Changwon University unveiled NaverNER6, a Korean NER dataset, at a competition held jointly in 2018. Kakao Brain released the KorNLI and KorSTS [19] datasets for measuring Korean NLU performance in 2020. In 2019, LG CNS released KorQuAD7, a SQuAD-style dataset for measuring the performance of Korean question answering. In 2020, the BEEP! [20] dataset for Korean hate speech classification was released. Most recently, KLUE [21], a Korean counterpart of the GLUE [22] benchmark, was released. However, we do not report results on KLUE because many models have not yet been evaluated on it.

5. https://github.com/e9t/nsmc
6. https://github.com/naver/nlp-challenge
7. https://korquad.github.io/

3 KOREAN PLM ARCHITECTURES

Language models after 2018 can be classified into three major types according to their pretraining method (Fig. 2). (1) The first group, Encoder-Centric Models, focuses on language understanding (NLU) by using objective functions such as predicting masked tokens (MLM) after inserting MASK tokens into the input sentence. These models are later finetuned for tasks such as classification or used for feature extraction. The BERT family of PLMs is representative. (2) The second group uses an objective function that predicts the next token at each input position. Since these models are optimized for auto-regressive inference, they are mainly used for downstream tasks corresponding to language generation (NLG), such as chatbots or lyric generation. This mainly applies to GPT-based PLMs. (3) The third group uses the entire Transformer architecture and has recently been explored in many ways. Models such as T5 [23], BART [18], and MASS [24] are representative. Models trained in this way show significant performance improvements not only in NLU and NLG, but also in tasks where the effect of pretraining has been hard to see, such as NMT. In this section, we introduce the tokenizers and parameters of the Korean pretrained models released so far, organized by the three categories above.
Fig. 2. Three main types of PLM. Encoder-Centric Models (left) are trained with the MLM task for language understanding; Decoder-Centric Models (center) are trained with a next-token prediction task for language generation; Seq2Seq Models (right) are trained with various objectives and tasks (NMT, summarization, etc.) using next-token prediction for generation grounded in understanding. Detailed objectives and architectures may differ across individual models.
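As a concrete illustration of the three categories, the sketch below loads one publicly released Korean PLM of each type through the huggingface transformers Auto classes. The hub identifiers are examples drawn from the models discussed in this section and may change over time; this is a usage sketch rather than official documentation of any model.

from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          AutoModelForCausalLM, AutoModelForSeq2SeqLM)

# Encoder-centric (MLM objective): KcBERT-style model for NLU / feature extraction.
enc_tok = AutoTokenizer.from_pretrained("beomi/kcbert-base")
encoder = AutoModelForMaskedLM.from_pretrained("beomi/kcbert-base")

# Decoder-centric (next-token prediction): KoGPT2-style model for NLG.
dec_tok = AutoTokenizer.from_pretrained("skt/kogpt2-base-v2")
decoder = AutoModelForCausalLM.from_pretrained("skt/kogpt2-base-v2")

# Seq2Seq (full Transformer): KoBART-style model for generation with understanding.
s2s_tok = AutoTokenizer.from_pretrained("gogamza/kobart-base-v2")
seq2seq = AutoModelForSeq2SeqLM.from_pretrained("gogamza/kobart-base-v2")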

3.1 Encoder-Centric Models

Fig. 3. Objective of Encoder-Centric Models. The objective is to predict the tokens at the masked input positions, and it is designed for Natural Language Understanding.

Encoder-centric models focus on extracting the features of language. Tasks such as classification, clustering, and tagging can use this type of model as a PLM.
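Because encoder-centric models expose an MLM head, a quick way to probe them is the fill-mask pipeline of the transformers library; their hidden states can likewise be pooled as sentence features for classification or clustering. The snippet below is a minimal sketch; the hub identifier is an example, and the mask token string depends on the tokenizer.

from transformers import pipeline

# Probe an encoder-centric Korean PLM through its MLM head.
# The model id is an example; any Korean encoder PLM that exposes an MLM head works.
fill = pipeline("fill-mask", model="beomi/kcbert-base")

# The mask token string is tokenizer-specific; read it from the pipeline's tokenizer.
mask = fill.tokenizer.mask_token
for pred in fill(f"영화가 정말 {mask}."):   # "The movie is really {MASK}."
    print(pred["token_str"], round(pred["score"], 3))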
3.1.1 KoBERT
KoBERT8 is the first Korean pretrained model shared on huggingface, released by SKT-Brain. Its configuration is mostly the same as BERT's, but the tokenizer uses SentencePiece9 instead of the WordPiece tokenizer used in BERT. For pretraining, 5 million sentences and 54 million words from the Korean Wikipedia were used.

3.1.2 HanBERT
HanBERT10 is a BERT model trained on about 150 GB of Korean corpus (general domain: 70 GB, patent domain: 75 GB) comprising 700 million sentences. The tokenizer is a private tokenizer called Moran, and the vocabulary size is 54,000.

3.1.3 KoELECTRA
KoELECTRA11 is an ELECTRA-based language model trained on the 'Modu Corpus'12 released by the National Institute of Korean Language (NIKL), the Korean Wikipedia, NamuWiki13, and various news data.

3.1.4 KcBERT
KcBERT [25] is a Korean BERT model trained on about 12 GB of comments on Naver politics news articles. The tokenizer uses WordPiece [26] BPE, and the data are preprocessed to handle emojis and special characters.

3.1.5 SoongsilBERT (KcBERT2)
SoongsilBERT14 is a language model pretrained on community data of Soongsil University and the Modu Corpus in addition to the news comment data used for KcBERT. Most settings are identical to KcBERT, except that it is trained on the RoBERTa architecture and uses a byte-level BPE tokenizer. SoongsilBERT fits community terminology better; in other words, it does not perform as well in non-community domains.

3.1.6 KcELECTRA
KcELECTRA15 is a model trained by collecting additional data (mainly comments) on top of the data used for KcBERT. It currently records state-of-the-art results on the NSMC task.

8. https://github.com/SKTBrain/KoBERT
9. https://github.com/google/sentencepiece
10. https://github.com/tbai2019/HanBert-54k-N
11. https://github.com/monologg/KoELECTRA
12. https://corpus.korean.go.kr/
13. Large-scale Korean open-domain encyclopedia.
14. https://github.com/jason9693/Soongsil-BERT
15. https://github.com/Beomi/KcELECTRA
3.1.7 DistilKoBERT
DistilKoBERT16 is a lightweight version of KoBERT, distilled following huggingface's DistilBERT [27] approach. The teacher model and the tokenizer are the same as KoBERT's.

3.1.8 KoBigBird
KoBigBird [28] was released for long-range understanding of the Korean language. It covers sequences more than 8 times longer than the usual 512-token limit of BERT models.

3.2 Decoder-Centric Models

Fig. 4. Objective of Decoder-Centric Models. The objective is simply to predict the next token at each input position, and it is designed for auto-regressive inference, in which the model is run iteratively until an eos (end-of-sequence) token is predicted.

Decoder-centric models focus on the generation of language. Tasks that "generate" language, such as dialog (chatbots) or lyric generation, can use this type of model as a PLM. As in Fig. 4, the objective function of decoder-centric models is very simple: just predict the next token over the whole sequence. Unfortunately, only a small number of Korean models of this type have been released, as most Korean PLMs focus on NLU rather than NLG.
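The auto-regressive loop described above is what the transformers generate() method implements: starting from a prompt, the model repeatedly predicts the next token until it emits the end-of-sequence token or reaches a length limit. A minimal sketch with a KoGPT2-style checkpoint is shown below; the hub identifier is an example.

from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("skt/kogpt2-base-v2")   # example hub id
model = AutoModelForCausalLM.from_pretrained("skt/kogpt2-base-v2")

prompt = "근육이 커지기 위해서는"                 # "In order for muscles to grow, ..."
inputs = tok(prompt, return_tensors="pt")

# Iteratively predict next tokens until <eos> or max_new_tokens is reached.
output_ids = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,
    top_p=0.92,
    eos_token_id=tok.eos_token_id,
)
print(tok.decode(output_ids[0], skip_special_tokens=True))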
3.2.1 SKT-AI KoGPT2
KoGPT217 is a GPT-2 [29]-based PLM for Korean natural language generation, the first of its kind, released by SKT-AI. The Korean Wikipedia, the Modu Corpus, the Blue House National Petition data18, and private data such as news were used for training. A character-level BPE tokenizer is used for tokenization, with additional custom (unused) tokens reserved for downstream-task training.

3.2.2 Large-Scale PLM
As mentioned above, we briefly introduce the large-scale LMs for Korean, but we do not evaluate these models because of computational limitations.

• HyperCLOVA [30]: HyperCLOVA is the first Korean large-scale PLM. Its parameter count is up to 82B, but the models (i.e., the parameters) have not been published.
• SKT KoGPT-trinity: KoGPT-trinity (which we will call SKGPT) is the first public large-scale Korean PLM. It has 1.2B parameters and was trained on Ko-DATA, an internally refined corpus of SKT built for training the model.
• KakaoBrain KoGPT [31]19: Kakao Brain's KoGPT (which we will call KakaoGPT to avoid confusion with KoGPT2 released by SKT) is the largest public Korean PLM by model size, with 6B parameters.

SKGPT and KakaoGPT were announced with downstream-task results obtained by finetuning, unlike HyperCLOVA, which was reported in a prompt-tuning setting.

3.3 Seq2Seq-Centric Models

Fig. 5. Objective of Seq2Seq-Centric Models. This type's objective usually mixes NLU and NLG functions.

Seq2Seq [10]-centric models use the seq2seq Transformer architecture for both NLU and NLG. Many pretraining methods are available because many seq2seq tasks exist. Unfortunately, only a few Korean PLMs trained with this method have been released.
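As an illustration of this category, the sketch below runs a KoBART-style checkpoint as a conditional generator. The hub identifier is an example, and in practice a checkpoint finetuned for a concrete seq2seq task (e.g., summarization) would be used before the generations are meaningful.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Example hub id for a Korean seq2seq PLM; a summarization-finetuned
# checkpoint would be substituted here in practice.
tok = AutoTokenizer.from_pretrained("gogamza/kobart-base-v2")
model = AutoModelForSeq2SeqLM.from_pretrained("gogamza/kobart-base-v2")

text = "자연어 처리는 컴퓨터가 인간의 언어를 이해하고 생성하도록 하는 기술이다."
inputs = tok(text, return_tensors="pt")

# The encoder reads the input (NLU side); the decoder generates tokens
# auto-regressively conditioned on it (NLG side).
summary_ids = model.generate(**inputs, max_new_tokens=32, num_beams=4)
print(tok.decode(summary_ids[0], skip_special_tokens=True))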
3.3.1 KoBART
KoBART20 is a Seq2Seq PLM based on the BART model, with two training objectives: text infilling (for NLU) and auto-regressive generation (for NLG). It was trained on a corpus of more than 40 GB.

16. https://github.com/monologg/DistilKoBERT
17. https://github.com/SKT-AI/KoGPT2
18. https://github.com/akngs/petitions
19. https://github.com/kakaobrain/kogpt
20. https://github.com/SKT-AI/KoBART

4 EXPERIMENT

We aggregate downstream-task benchmark results for the pretrained models discussed above. Using the benchmark datasets introduced in the related works, we report results from two aspects: (1) tasks that deal with only a single sentence (TABLE 1), and (2) tasks that deal with multiple sentences or involve interactions between multiple agents (TABLE 2).
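Results such as those in Tables 1 and 2 are typically produced by finetuning each PLM on a benchmark's training split and scoring the held-out split. The sketch below outlines this pipeline with the transformers Trainer for a single-sentence classification task; the model identifier, the toy data, and the hyperparameters are placeholders rather than the exact setup used in the individual model reports.

from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model_id = "beomi/kcbert-base"          # placeholder: any Korean encoder PLM works here
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Toy stand-in for a benchmark split (e.g. NSMC); real runs load the full dataset.
raw = Dataset.from_dict({
    "text": ["정말 재미있어요", "시간 낭비였다"],   # "really fun", "a waste of time"
    "label": [1, 0],
})

def encode(batch):
    return tok(batch["text"], truncation=True, padding="max_length", max_length=64)

ds = raw.map(encode, batched=True)

args = TrainingArguments(output_dir="out", num_train_epochs=1,
                         per_device_train_batch_size=2, learning_rate=5e-5)
trainer = Trainer(model=model, args=args, train_dataset=ds, eval_dataset=ds)
trainer.train()
print(trainer.evaluate())               # benchmark runs would score the held-out test split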

TABLE 1
A Result of Single Sentence Tasks.

Models               NSMC1   BEEP! (Dev)2   Naver NER3   Size (MB)
KoELECTRA (Small)    89.36   63.07          85.4         54
KoELECTRA (Base)     90.63   67.61          88.11        431
DistilKoBERT         88.6    60.72          84.65        108
KoBERT               89.59   66.21          87.92        351
SoongsilBERT (Small) 90.7    66             84           213
SoongsilBERT (Base)  91.2    69             85.2         370
KcBERT (Base)        89.62   68.78          84.34        417
KcBERT (Large)       90.68   69.91          85.53        1200
KoBigBird (Base)     91.18   -              -            436
KoBART               90.24   -              -            473
KoGPT2               91.13   -              -            490
HanBERT              90.06   68.32          87.70        614
XLM-RoBERTa (Base)   89.03   64.06          86.65        1030
KcELECTRA (Base)     91.71   74.05          86.90        475

1 Measured by accuracy.
2, 3 Measured by F1 score.

4.1 Single Sentence Tasks

Korean benchmarks with a single sentence mainly focus on classification or tagging tasks. NSMC is a Korean sentiment classification benchmark with binary classes, labeled on NAVER movie review comments; BEEP! is a Korean hate-speech classification benchmark labeled with "Hate", "Offensive", and "None" classes; and Naver NER is a Korean Named Entity Recognition benchmark released by NAVER Corp.
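Table 1 reports these tasks with accuracy (NSMC) and F1 score (BEEP!, Naver NER). A minimal sketch of how such scores are computed from model predictions is shown below; the exact F1 averaging (e.g., macro F1 vs. entity-level span F1 for NER) varies between the original model reports, so the snippet is illustrative only.

from sklearn.metrics import accuracy_score, f1_score

# Toy gold labels and model predictions for a 3-class task like BEEP!
# (0 = none, 1 = offensive, 2 = hate); real runs use the benchmark test split.
y_true = [0, 2, 1, 0, 1, 2, 0]
y_pred = [0, 2, 1, 1, 1, 2, 0]

print("accuracy:", accuracy_score(y_true, y_pred))             # metric used for NSMC
print("macro F1:", f1_score(y_true, y_pred, average="macro"))  # one common F1 variant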

4.1.1 NSMC Result
NSMC is a benchmark dataset for classifying whether sentiment is positive or negative. All sentences come from NAVER movie review comments. The dataset contains 150k sentences for training and 50k sentences for testing. KcELECTRA records the state-of-the-art (SOTA) on this task with 91.71 accuracy.

4.1.2 BEEP! Result
BEEP! is a human-annotated corpus in which the intensity of hate speech is tagged with the labels 'hate', 'offensive', and 'none', built upon celebrity news comments from a Korean online news platform. KcELECTRA achieves the highest score among the models, with an F1 score of 74.05. An interesting observation is that DistilKoBERT, the lightweight version of KoBERT, degrades by more than 5 points on this task, even though the NSMC scores of the two models are nearly the same.

4.1.3 Naver NER Result
The Naver NER dataset was published by processing Korean Wikipedia into text form. The training set contains 90,000 examples in total. The KoELECTRA (Base) model achieves the state-of-the-art on this task. Interestingly, KcBERT and SoongsilBERT, unlike in the NSMC and BEEP! results, do not perform well here, scoring even worse than the general multilingual model XLM [32], which is not specialized for Korean.

4.2 Multiple Sentence and Agent Tasks

The results of these tasks show different patterns than before. KoELECTRA and KoBigBird show the best results, whereas KcBERT and SoongsilBERT, which did better previously, fall behind. In these tasks, the texts are much longer, and there are interactions between sentences (NLI, STS) or agents (QA). KorNLI and KorSTS are Korean NLI and STS datasets released by Kakao Brain, and the Question Pair (Korean) dataset is a paraphrase detection benchmark that measures the similarity between two question sentences. Unfortunately, the Question Pair dataset can no longer be accessed because its repository has been removed. Finally, the KorQuAD dataset is a Korean version of the SQuAD (QA) dataset. Although the latest version of this dataset is 2.0, we use 1.0 because most models report results on this version.

TABLE 2
A Result of Multiple Sentence & Agent Tasks.

Models               KorNLI1   KorSTS2   Question Pair3   KorQuAD (Dev)4   Size (MB)
KoELECTRA (Small)    78.6      80.79     94.85            82.11 / 91.13    54
KoELECTRA (Base)     82.24     85.53     95.25            84.83 / 93.45    431
DistilKoBERT         72        72.59     92.48            54.40 / 77.97    108
KoBERT               79.62     81.59     94.85            51.75 / 79.15    351
SoongsilBERT (Small) 76        74.2      92               -                213
SoongsilBERT (Base)  78.3      76        94               -                370
KcBERT (Base)        74.85     75.57     93.93            60.25 / 84.39    417
KcBERT (Large)       76.99     77.49     94.06            62.16 / 86.64    1200
KoBigBird (Base)     -         -         -                87.08 / 94.71    436
KoBART               -         81.66     94.34            -                473
KoGPT2               -         78.4      -                -                490
HanBERT              80.32     82.73     94.72            78.74 / 92.02    614
XLM-RoBERTa (Base)   80.23     78.45     93.8             64.70 / 88.94    1030
KcELECTRA (Base)     81.65     82.65     95.78            70.60 / 90.11    475

1, 3 Measured by accuracy.
2 Measured by Spearman correlation.
4 Measured by (1) EM score and (2) F1 score.
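For reference, the sketch below shows simplified versions of the metrics used in Table 2: Spearman correlation for KorSTS and SQuAD-style exact match (EM) and token-level F1 for KorQuAD. The official KorQuAD evaluation script applies additional answer normalization; this snippet only illustrates the scoring logic.

from collections import Counter
from scipy.stats import spearmanr

# KorSTS: Spearman correlation between predicted and gold similarity scores.
gold_sts = [4.5, 1.0, 3.2, 2.8]
pred_sts = [4.1, 0.8, 3.5, 2.0]
rho, pvalue = spearmanr(gold_sts, pred_sts)
print("Spearman:", rho)

# KorQuAD: simplified EM and token-level F1 between a predicted and a gold answer span.
def em_f1(prediction, gold):
    pred_toks, gold_toks = prediction.split(), gold.split()
    em = float(prediction == gold)
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return em, 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return em, 2 * precision * recall / (precision + recall)

print(em_f1("서울 특별시", "서울 특별시"))   # exact match -> (1.0, 1.0)
print(em_f1("서울", "서울 특별시"))          # partial overlap -> (0.0, ~0.67)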

4.2.1 KorNLI Result
The NLI task classifies the relationship between two sentences as "entailment", "contradiction", or "neutral". The KorNLI dataset has 942,854 example pairs for training, 2,490 examples for evaluation, and 5,010 examples for testing. KoELECTRA scores the state-of-the-art on this task. However, most Korean PLMs score lower than XLM, which is not a Korean-centered model.

4.2.2 KorSTS Result
The STS task is similar to NLI except for the scoring metric: the dataset scores the similarity between two sentences from 1 (not similar) to 5 (identical). KorSTS has 5,749 examples for training, 1,500 examples for evaluation, and 1,379 examples for testing. As in 4.2.1, KoELECTRA records the best score on this task.

4.2.3 Question Pair Result
The Question Pair dataset has 6,888 training examples and 688 test examples. The KcELECTRA model records the best result on this task. However, the Question Pair dataset is currently unavailable because the repository hosting the task and dataset has vanished.

4.2.4 KorQuAD Result
The KorQuAD data are divided into 10,645 paragraphs and 66,181 Q&A pairs over 1,560 Wikipedia articles, with 60,407 Q&A pairs for the training set and 5,774 Q&A pairs for the dev set. On this task, KoBigBird scores the highest (87.08 EM / 94.71 F1). On the other hand, KcBERT and KoBERT do not perform well, scoring even lower than XLM. It seems that the sentence length of the corpora used for their pretraining is too short to capture long-range dependencies.

5 CONCLUSION

In this survey, we discussed several Korean pretrained language models and benchmarks and compared the models on them. Of course, there are many more publicly available Korean language models beyond those we introduced, but we could not include all of them, for reasons such as the length of this paper or because benchmark results have not been reported widely enough. In future work, we expect that the latest Korean benchmarks such as KLUE and further surveys will appear and promote the development of Korean NLP and, furthermore, Computational Linguistics.

REFERENCES

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, pp. 5998–6008, 2017.

[2] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Bert: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.
[3] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, "Improving language understanding by generative pre-training."
[4] H. Sak, A. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling."
[5] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz et al., "Huggingface's transformers: State-of-the-art natural language processing," arXiv preprint arXiv:1910.03771, 2019.
[6] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., "Language models are few-shot learners," arXiv preprint arXiv:2005.14165, 2020.
[7] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., "An image is worth 16x16 words: Transformers for image recognition at scale," in International Conference on Learning Representations, 2020.
[8] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, "Zero-shot text-to-image generation," CoRR, vol. abs/2102.12092, 2021. [Online]. Available: https://arxiv.org/abs/2102.12092
[9] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using rnn encoder–decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1724–1734.
[10] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
[11] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
[12] M.-T. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1412–1421.
[13] M. Ott, S. Edunov, D. Grangier, and M. Auli, "Scaling neural machine translation," in Proceedings of the Third Conference on Machine Translation: Research Papers, 2018, pp. 1–9.
[14] D. So, Q. Le, and C. Liang, "The evolved transformer," in International Conference on Machine Learning, 2019, pp. 5877–5886.
[15] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "Roberta: A robustly optimized bert pretraining approach," arXiv preprint arXiv:1907.11692, 2019.
[16] K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning, "ELECTRA: Pre-training text encoders as discriminators rather than generators," in ICLR, 2020. [Online]. Available: https://openreview.net/pdf?id=r1xMH1BtvB
[17] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial networks," Communications of the ACM, vol. 63, no. 11, pp. 139–144, 2020.
[18] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, "Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension."
[19] J. Ham, Y. J. Choe, K. Park, I. Choi, and H. Soh, "Kornli and korsts: New benchmark datasets for korean natural language understanding," CoRR, vol. abs/2004.03289, 2020. [Online]. Available: https://arxiv.org/abs/2004.03289
[20] J. Moon, W. I. Cho, and J. Lee, "BEEP! Korean corpus of online news comments for toxic speech detection," in Proceedings of the Eighth International Workshop on Natural Language Processing for Social Media. Online: Association for Computational Linguistics, Jul. 2020, pp. 25–31. [Online]. Available: https://www.aclweb.org/anthology/2020.socialnlp-1.4
[21] S. Park, J. Moon, S. Kim, W. I. Cho, J. Han, J. Park, C. Song, J. Kim, Y. Song, T. Oh, J. Lee, J. Oh, S. Lyu, Y. Jeong, I. Lee, S. Seo, D. Lee, H. Kim, M. Lee, S. Jang, S. Do, S. Kim, K. Lim, J. Lee, K. Park, J. Shin, S. Kim, L. Park, A. Oh, J.-W. Ha, and K. Cho, "Klue: Korean language understanding evaluation," 2021.
[22] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, "Glue: A multi-task benchmark and analysis platform for natural language understanding," 2019.
[23] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, "Exploring the limits of transfer learning with a unified text-to-text transformer," 2020.
[24] K. Song, X. Tan, T. Qin, J. Lu, and T.-Y. Liu, "Mass: Masked sequence to sequence pre-training for language generation," in ICML, 2019.
[25] J. Lee, "Kcbert: Korean comments bert," in Proceedings of the 32nd Annual Conference on Human and Cognitive Language Technology, 2020, pp. 437–440.
[26] M. Schuster and K. Nakajima, "Japanese and korean voice search," in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2012, pp. 5149–5152.
[27] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, "Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter," arXiv preprint arXiv:1910.01108, 2019.
[28] J. Park and D. Kim, "Kobigbird: Pretrained bigbird model for korean," Nov. 2021. [Online]. Available: https://doi.org/10.5281/zenodo.5654154
[29] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, "Language models are unsupervised multitask learners," 2019.
[30] B. Kim, H. Kim, S.-W. Lee, G. Lee, D. Kwak, J. D. Hyeon, S. Park, S. Kim, S. Kim, D. Seo et al., "What changes can large-scale language models bring? intensive study on hyperclova: Billions-scale korean generative pretrained transformers," in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 3405–3424.
[31] I. Kim, G. Han, J. Ham, and W. Baek, "Kogpt: Kakaobrain korean(hangul) generative pre-trained transformer," https://github.com/kakaobrain/kogpt, 2021.
[32] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov, "Unsupervised cross-lingual representation learning at scale," CoRR, vol. abs/1911.02116, 2019. [Online]. Available: http://arxiv.org/abs/1911.02116
