2018.11.15.
AI Labs
NL.K team
김성현 (Seonghyun Kim)
BERT: Pre-training of Deep Bidirectional Transformers
for Language Understanding
Overview - BERT?
• Bidirectional Encoder Representations from Transformers (BERT) [1] is a language model that pre-trains deep word representations using a bi-directional Transformer [2]
• Pre-trained BERT language model + transfer learning (fine-tuning) → NLP applications!
….
James Marshall "Jimi" Hendrix
was an American rock guitarist,
singer, and songwriter.
….
Who is Jimi Hendrix?
Pre-trained BERT
1 classification layer
"Jimi" Hendrix was an American rock guitarist, singer, and
songwriter.
[1] Devlin et al., 2018, arXiv [2] Vaswani et al., 2017, arXiv [3] Rajpurkar et al., 2016, arXiv
SQuAD v1.1 dataset leaderboard [3]
Introduction – Language Model
• How do we encode and decode natural language? → A language model
[Figure: a language model encodes natural language into a machine representation and decodes it back; applications include machine translation, named entity recognition, TTS, STT, MRC, and POS tagging]
James Marshall "Jimi" Hendrix
was an American rock guitarist,
singer, and songwriter.
Although his mainstream career
spanned only four years,
…
Who is Jimi Hendrix?
"Jimi" Hendrix was an American
rock guitarist, singer, and songwriter.
Language model
Introduction – Word Embedding
• Word embedding is a language-modeling technique in which words or phrases from a large unlabeled corpus are mapped to vectors of real numbers
• The skip-gram word embedding model vectorizes a word by using the target word to predict its surrounding words (see the sketch below)
“Duct tape may work anywhere”
“duct”
“may”
“tape”
“work”
“anywhere”
Word One-hot-vector
“duct” [1, 0, 0, 0, 0]
“tape” [0, 1, 0, 0, 0]
“may” [0, 0, 1, 0, 0]
“work” [0, 0, 0, 1, 0]
“anywhere” [0, 0, 0, 0, 1]
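As a minimal illustration (not from the slides), the sketch below builds the one-hot vectors from the table above and generates skip-gram (target, context) training pairs for the toy sentence; the window size of 1 and the vocabulary ordering are assumptions.

```python
# Minimal sketch (illustration only): one-hot vectors as in the table above and
# skip-gram (target, context) training pairs for the toy sentence.
# The window size of 1 and the vocabulary ordering are assumptions.
import numpy as np

tokens = ["duct", "tape", "may", "work", "anywhere"]
vocab = {word: idx for idx, word in enumerate(tokens)}

def one_hot(word):
    """Return the one-hot row vector for a word, matching the slide's table."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[vocab[word]] = 1
    return vec

# Skip-gram: each target word is used to predict its surrounding words.
window = 1
pairs = []
for i, target in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            pairs.append((target, tokens[j]))

print(one_hot("duct"))  # [1 0 0 0 0]
print(pairs[:3])        # [('duct', 'tape'), ('tape', 'duct'), ('tape', 'may')]
```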
Introduction – Word Embedding
• Visualization of word embedding: https://ronxin.github.io/wevi/
• However, word-embedding algorithms alone cannot represent the ‘context’ of natural language
Introduction – Markov Model
• A Markov model captures local context in natural language
• A bi-gram language model can calculate the probability of a sentence based on the Markov chain (see the sketch below)
Bi-gram transition probabilities estimated from the training corpus:
        | I | don’t | like | rabbits | turtles | snails
I       | 0 | 0.33  | 0.66 | 0       | 0       | 0
don’t   | 0 | 0     | 1.0  | 0       | 0       | 0
like    | 0 | 0     | 0    | 0.33    | 0.33    | 0.33
rabbits | 0 | 0     | 0    | 0       | 0       | 0
turtles | 0 | 0     | 0    | 0       | 0       | 0
snails  | 0 | 0     | 0    | 0       | 0       | 0
Training corpus: “I like rabbits”, “I like turtles”, “I don’t like snails”
P(“I like rabbits”) = 0.66 * 0.33 = 0.22
P(“I like turtles”) = 0.66 * 0.33 = 0.22
P(“I don’t like snails”) = 0.33 * 1.0 * 0.33 = 0.11
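A minimal sketch of the bi-gram calculation above: the transition probabilities are estimated from the three training sentences, and a sentence's probability is the product of its transitions (matching the slide's numbers after rounding).

```python
# Minimal sketch: estimate bi-gram transition probabilities from the three
# training sentences and score a sentence as the product of its transitions.
from collections import Counter, defaultdict

corpus = ["I like rabbits", "I like turtles", "I don't like snails"]

bigram_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, cur in zip(words, words[1:]):
        bigram_counts[prev][cur] += 1

def p(cur, prev):
    """Maximum-likelihood estimate of P(cur | prev)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][cur] / total if total else 0.0

def sentence_prob(sentence):
    words = sentence.split()
    prob = 1.0
    for prev, cur in zip(words, words[1:]):
        prob *= p(cur, prev)
    return prob

print(round(sentence_prob("I like rabbits"), 2))       # 0.22
print(round(sentence_prob("I don't like snails"), 2))  # 0.11
```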
Introduction – Recurrent Neural Network
• A recurrent neural network (RNN) contains nodes with directed cycles
[Figure: basic RNN model architecture and an example of next-character prediction; the one-hot input of the current step and the previous step's hidden layer feed the current step's hidden layer, which predicts the next-step output]
• However, an RNN must compute its outputs in consecutive order
• Long-distance dependency problem (a sketch of the recurrence follows)
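A minimal numpy sketch of the recurrence described above (not the exact model in the figure); the sizes and random weights are illustrative.

```python
# Minimal sketch of the RNN recurrence: the one-hot input of the current step
# and the previous step's hidden state produce the current hidden state, which
# predicts a distribution over the next token. Sizes and weights are illustrative.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 5, 8
W_xh = rng.normal(scale=0.1, size=(hidden_size, vocab_size))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden (the directed cycle)
W_hy = rng.normal(scale=0.1, size=(vocab_size, hidden_size))   # hidden -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(x_onehot, h_prev):
    """h_t = tanh(W_xh x_t + W_hh h_{t-1}); y_t = softmax(W_hy h_t)."""
    h = np.tanh(W_xh @ x_onehot + W_hh @ h_prev)
    y = softmax(W_hy @ h)
    return h, y

h = np.zeros(hidden_size)
for idx in [0, 1, 2]:              # a toy sequence of token indices, processed in order
    x = np.eye(vocab_size)[idx]
    h, y = rnn_step(x, h)          # y: predicted distribution over the next token
```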
Introduction – Attention Mechanism
• Attention is motivated by how we pay attention to different regions of an image or correlate words in one sentence (a sketch of scaled dot-product attention follows the figure)
[Figure: attention mechanism; a hidden state serves as the Query, which is scored against the Keys, and a softmax over the scores weights the corresponding Values]
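A minimal numpy sketch of Query/Key/Value attention as used in the Transformer (scaled dot-product attention); the shapes here are illustrative assumptions.

```python
# Minimal sketch of Query/Key/Value attention (scaled dot-product attention):
# scores = softmax(Q K^T / sqrt(d)); output = score-weighted sum of the Values.
import numpy as np

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                                # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)      # softmax over the keys
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))   # 2 query positions, dimension 4
K = rng.normal(size=(3, 4))   # 3 key/value positions
V = rng.normal(size=(3, 4))
context, weights = attention(Q, K, V)   # context: (2, 4), weights: (2, 3)
```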
Introduction – Attention Mechanism
• Attention for neural machine translation (NMT): https://distill.pub/2016/augmented-rnns/#attentional-interfaces
• Attention for speech to text (STT)
Introduction – Transformer [1]
• Transformer architecture
[1] Vaswani et al., 2017, arXiv
[Figure: the Transformer architecture, shown step by step across four slides]
• The Transformer is trained to predict the next word, minimizing the difference between the label and the output (a minimal encoder-block sketch follows)
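For reference, a minimal PyTorch sketch of one Transformer encoder block of the kind shown in the architecture figures (multi-head self-attention plus a position-wise feed-forward network, each with a residual connection and layer normalization); the default dimensions are illustrative, not the original implementation.

```python
# Minimal PyTorch sketch of one Transformer encoder block: multi-head
# self-attention and a position-wise feed-forward network, each followed by a
# residual connection and layer normalization. Illustrative defaults only.
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ff=3072, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)            # self-attention: Q = K = V = x
        x = self.norm1(x + self.drop(attn_out))     # residual + layer norm
        x = self.norm2(x + self.drop(self.ff(x)))   # feed-forward + residual + layer norm
        return x

x = torch.randn(2, 16, 768)      # (batch, sequence length, hidden size)
out = EncoderBlock()(x)          # output has the same shape as the input
```

BERT stacks 12 or 24 such encoder blocks, as listed in the model architecture table below.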
Research Aims
• To improve Transformer-based language models by proposing BERT
• To show that fine-tuning a language model built on pre-trained representations achieves state-of-the-art performance on many natural language processing tasks
[Figure: BERT model architecture; an input embedding layer, stacked Transformer layers, and a contextual representation for each token]
Methods
• Model architecture
                     | BERT-BASE | BERT-LARGE
Transformer layers   | 12        | 24
Hidden state size    | 768       | 1024
Self-attention heads | 12        | 16
Total parameters     | 110M      | 340M
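The two configurations above, collected as plain Python dictionaries for reference; the field names are my own, not official configuration keys.

```python
# The two model sizes from the table, as simple Python dictionaries.
# Field names are illustrative, not official configuration keys.
BERT_BASE = {
    "transformer_layers": 12,
    "hidden_size": 768,
    "attention_heads": 12,
    "total_parameters": 110_000_000,  # ~110M
}
BERT_LARGE = {
    "transformer_layers": 24,
    "hidden_size": 1024,
    "attention_heads": 16,
    "total_parameters": 340_000_000,  # ~340M
}
```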
Methods
• Corpus data for pre-training the language model
• BooksCorpus (800M words)
• English Wikipedia (2,500M words, without lists, tables, and headers)
• 30,000-token vocabulary
• Data preprocessing for pre-training
• Words are split with the WordPiece tokenizer [1-2], e.g. He likes playing → He likes play ##ing
• A ‘token sequence’ for pre-training packs two sentences together
• The second sentence is either the actual next sentence or a randomly chosen sentence (50% each; see the sketch after this slide)
[Figure: example of a two-sentence token sequence; a classification label token ([CLS]) at the front, and the second sentence is either the actual next sentence or a randomly chosen sentence (50%)]
[1] Sennrich et al., 2016, ACL [2] Kudo, 2018, ACL
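A minimal sketch of how such a two-sentence token sequence could be built for pre-training, following the 50% next-sentence / 50% random-sentence rule on the slide; the helper name and toy documents are assumptions.

```python
# Minimal sketch of building a two-sentence token sequence for pre-training:
# 50% of the time sentence B is the actual next sentence (IsNext), otherwise a
# randomly chosen sentence (NotNext). The helper name and toy data are assumptions.
import random

def make_pretraining_pair(documents, doc_idx, sent_idx):
    sentence_a = documents[doc_idx][sent_idx]
    if random.random() < 0.5 and sent_idx + 1 < len(documents[doc_idx]):
        sentence_b, label = documents[doc_idx][sent_idx + 1], "IsNext"
    else:
        # A real implementation would avoid sampling from the same document.
        sentence_b, label = random.choice(random.choice(documents)), "NotNext"
    tokens = ["[CLS]"] + sentence_a + ["[SEP]"] + sentence_b + ["[SEP]"]
    return tokens, label

documents = [
    [["he", "likes", "play", "##ing"], ["he", "plays", "every", "day"]],  # WordPiece tokens
    [["duct", "tape", "works", "anywhere"]],
]
tokens, label = make_pretraining_pair(documents, doc_idx=0, sent_idx=0)
```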
Methods
• Data preprocessing for pre-training
• The masked language model (MLM) masks some percentage of the input tokens at random (a sketch of the 80/10/10 rule follows the figure)
[Figure: from the original token sequence, 15% of the tokens are chosen at random; of these, 80% are replaced with [MASK], 10% are replaced with a random word (e.g. ‘hairy’), and 10% are left unchanged]
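A minimal sketch of the masking rule in the figure; the function name and toy vocabulary are assumptions.

```python
# Minimal sketch of the masking rule above: choose 15% of the input tokens at
# random; of those, 80% become [MASK], 10% are replaced with a random word
# (e.g. "hairy"), and 10% are left unchanged. Names and toy vocabulary are assumptions.
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    masked, labels = list(tokens), {}
    for i, token in enumerate(tokens):
        if token in ("[CLS]", "[SEP]") or random.random() >= mask_prob:
            continue
        labels[i] = token                      # the model must predict the original token here
        r = random.random()
        if r < 0.8:
            masked[i] = "[MASK]"               # 80%: masking
        elif r < 0.9:
            masked[i] = random.choice(vocab)   # 10%: random replacement
        # else: 10%: keep the token unchanged
    return masked, labels

vocab = ["hairy", "dog", "likes", "play", "##ing"]
masked, labels = mask_tokens(["[CLS]", "my", "dog", "is", "hairy", "[SEP]"], vocab)
```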
Methods
• The input embedding is the sum of the token embeddings, the segment embeddings, and the position embeddings (see the sketch below)
[Figure: BERT input representation; positions run up to 512 tokens, and the [CLS] output is used to predict IsNext or NotNext]
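A minimal PyTorch sketch of the input representation described above; the example token ids are illustrative only.

```python
# Minimal PyTorch sketch of the input representation: the input embedding is the
# element-wise sum of token, segment, and position embeddings (max length 512).
# The token ids below are illustrative only.
import torch
import torch.nn as nn

vocab_size, hidden_size, max_len = 30000, 768, 512
token_emb = nn.Embedding(vocab_size, hidden_size)
segment_emb = nn.Embedding(2, hidden_size)      # sentence A = 0, sentence B = 1
position_emb = nn.Embedding(max_len, hidden_size)

token_ids = torch.tensor([[101, 7592, 2088, 103, 102]])   # illustrative ids
segment_ids = torch.tensor([[0, 0, 0, 0, 0]])
positions = torch.arange(token_ids.size(1)).unsqueeze(0)

input_embeddings = token_emb(token_ids) + segment_emb(segment_ids) + position_emb(positions)
# shape: (batch, sequence length, hidden size) = (1, 5, 768)
```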
Methods
• Training options (collected into a config sketch below)
• Training batch size: 256 sequences (256 sequences * 512 tokens = 128,000 tokens/batch)
• Steps: 1M
• Epochs: 40
• Adam learning rate: 1e-4
• Weight decay: 0.01
• Dropout probability: 0.1
• Activation function: GELU
• Environment setup
• BERTBASE: 4 Cloud TPUs (16 TPU chips total)
• BERTLARGE: 16 Cloud TPUs (64 TPU chips total) ≈ 72 P100 GPUs
• Training time: 4 days
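The training options above, collected into a single configuration sketch (not the original training script); the learning-rate warmup and decay schedule used in the paper is omitted.

```python
# The pre-training options above, collected into one configuration sketch
# (not the original training script). The learning-rate warmup/decay schedule
# used in the paper is omitted here.
pretrain_config = {
    "batch_size_sequences": 256,   # 256 sequences * 512 tokens = 128,000 tokens/batch
    "max_sequence_length": 512,
    "train_steps": 1_000_000,      # roughly 40 epochs over the 3.3B-word corpus
    "adam_learning_rate": 1e-4,
    "weight_decay": 0.01,
    "dropout": 0.1,
    "activation": "gelu",
}
# e.g. with PyTorch and a model already defined:
# optimizer = torch.optim.AdamW(model.parameters(),
#                               lr=pretrain_config["adam_learning_rate"],
#                               weight_decay=pretrain_config["weight_decay"])
```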
Methods
• Experiments (11 NLP tasks in total)
• GLUE datasets
‒ MNLI: Multi-Genre Natural Language Inference
‒ To predict whether the second sentence is an entailment, a contradiction, or neutral with respect to the first
‒ QQP: Quora Question Pairs
‒ To predict whether two questions are semantically equivalent
‒ QNLI: Question Natural Language Inference
‒ A question-answering dataset converted to a binary sentence-pair classification task
‒ SST-2: The Stanford Sentiment Treebank
‒ Single-sentence classification over movie reviews with human annotations of their sentiment
‒ CoLA: The Corpus of Linguistic Acceptability
‒ Single-sentence classification to predict whether an English sentence is linguistically acceptable or not
‒ STS-B: The Semantic Textual Similarity Benchmark
‒ News-headline sentence pairs annotated with a score from 1 to 5 denoting how similar the two sentences are in semantic meaning
‒ MRPC: Microsoft Research Paraphrase Corpus
‒ Sentence pairs from online news sources with human annotations for whether the sentences in each pair are semantically equivalent
‒ RTE: Recognizing Textual Entailment
‒ Similar to MNLI, but with much less training data
‒ WNLI: Winograd NLI
‒ A small natural language inference dataset (sentence-pair classification)
• SQuAD v1.1
• CoNLL-2003 Named Entity Recognition dataset
• SWAG: Situations With Adversarial Generations
‒ To choose the most plausible continuation among four candidate sentences
Methods
• Experiments (11 NLP tasks in total)
[Figure: the four fine-tuning setups: (a) sentence-pair classification, (b) single-sentence classification, (c) question answering (SQuAD v1.1), and (d) single-sentence tagging; a sketch of the classification head follows]
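A minimal PyTorch sketch of the sentence(-pair) classification setup: a single classification layer on top of the final hidden state of the [CLS] token. `pretrained_bert` is a stand-in for a pre-trained encoder, and `num_labels=3` assumes an MNLI-style task.

```python
# Minimal PyTorch sketch of fine-tuning for (single- or pair-) sentence
# classification: one linear layer on top of the final hidden state of the
# [CLS] token. `pretrained_bert` is a stand-in for a pre-trained encoder;
# num_labels=3 assumes an MNLI-style task.
import torch
import torch.nn as nn

class BertClassifier(nn.Module):
    def __init__(self, pretrained_bert, hidden_size=768, num_labels=3):
        super().__init__()
        self.bert = pretrained_bert                 # returns (batch, seq_len, hidden_size)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, token_ids, segment_ids):
        hidden_states = self.bert(token_ids, segment_ids)
        cls_state = hidden_states[:, 0]             # representation of the [CLS] token
        return self.classifier(cls_state)           # logits over the task labels
```

For SQuAD the added layer instead predicts answer start and end positions over the token representations, and for single-sentence tagging (e.g. NER) a classifier is applied to every token's representation.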
Results
• GLUE test results
• SQuAD v1.1
Results
• Named Entity Recognition (CoNLL-2003)
• SWAG
Conclusion
• BERT is undoubtedly a breakthrough in the use of machine learning for natural language processing
• The bi-directional Transformer architecture improves natural language processing performance
Discussion
• English SQuAD v1.1 test
• Korean BERT training
[Figure: Korean BERT training; BERT English vocabulary vs. BERT Korean vocabulary, and the resulting BERT Korean model]
Thank you