Incorporating BERT into Neural Machine Translation

Jinhua Zhu, Yingce Xia, Lijun Wu, Di He, Tao Qin, Wengang Zhou, Houqiang Li, Tie-Yan Liu

Ramakrishna Mission Vivekananda Educational and Research Institute

Presenters: Ayan Maity, Debanjan Nanda, Debayan Datta


November 22, 2024
Contents

1 Introduction

2 Preliminary Explorations

3 BERT-fused Model

4 Algorithm

5 Results

6 Conclusion

Introduction

- Neural Machine Translation (NMT) translates text from a source language to a target language, using an encoder to generate hidden representations and a decoder to produce target sentences.
- BERT excels at language understanding, but integrating it with NMT is challenging.
- Initial experiments showed that contextual embeddings extracted from BERT improved results.
- The authors propose a BERT-fused model that integrates BERT-extracted representations into each layer of the NMT encoder and decoder using attention mechanisms.

Preliminary Explorations

- Two approaches for using BERT in NMT:
  1. Initialize the NMT encoder with a pre-trained BERT model, then fine-tune on the translation task.
  2. Use BERT to provide contextual embeddings as input to the NMT model.
- Observation: contextual embeddings outperformed fine-tuning.

Preliminary Explorations cont.

Different ways to utilize pre-trained models:

1. Use pre-trained models to initialize the NMT model:
   - Initialize the NMT encoder with pre-trained BERT.
   - Initialize the encoder/decoder with XLM.
2. Use pre-trained models as inputs to the NMT model:
   - Feed the outputs of BERT's last layer to the NMT model as inputs.

Challenge:
- Pre-training large models for each language pair is resource-intensive.

Preliminary Explorations cont.

Their approach:
- Feeding the output of BERT into the encoder as context-aware embeddings outperforms the initialization approaches.

Figure: Preliminary explorations on IWSLT'14 English→German translation

Proposed Algorithm: BERT-fused Model

- The BERT-fused model exploits representations from BERT by feeding them into all layers, rather than using them only as input embeddings.
- An attention mechanism adaptively controls how each layer interacts with the BERT representations.
- Two extra attention modules (see the forward-pass sketch below):
  - BERT-encoder attention: enables NMT encoder layers to interact with BERT representations, producing fused outputs.
  - BERT-decoder attention: combines BERT representations with NMT encoder outputs during decoding.
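
The overall data flow can be summarized as a short forward-pass sketch in Python. This is a minimal illustration under assumed module names (bert, src_embed, tgt_embed, fused_encoder_layers, fused_decoder_layers, output_proj), not the authors' implementation:

# Sketch of the BERT-fused forward pass (illustrative, not the paper's code).
# x: source token ids, y_in: shifted target token ids.
def bert_fused_forward(x, y_in, bert, src_embed, tgt_embed,
                       fused_encoder_layers, fused_decoder_layers, output_proj):
    # Step 1: BERT encodes the source sentence; only the last layer H_B is kept.
    h_bert = bert(x)                      # H_B, shape (batch, l_x, d_bert)

    # Step 2: each encoder layer attends to its own input and to H_B.
    h_enc = src_embed(x)                  # H_E^0
    for layer in fused_encoder_layers:
        h_enc = layer(h_enc, h_bert)      # H_E^l

    # Step 3: each decoder layer attends to itself, to H_B, and to H_E^L.
    s = tgt_embed(y_in)                   # S^0
    for layer in fused_decoder_layers:
        s = layer(s, h_bert, h_enc)       # S^l

    # A linear map plus softmax over the vocabulary gives the predictions.
    return output_proj(s)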

Model Architecture

Figure: BERT-fused Model Architecture

Notations

- Let X and Y denote the source-language domain and the target-language domain, respectively:
  - X: collection of sentences in the source language.
  - Y: collection of sentences in the target language.
- For any sentences x ∈ X and y ∈ Y:
  - l_x: number of units (e.g., words or sub-words) in x.
  - l_y: number of units in y.
  - x_i and y_i: the i-th unit in x and y, respectively.
- Modules:
  - Enc: encoder.
  - Dec: decoder.
  - BERT: pre-trained language model.
- The encoder and decoder together form the NMT module.

Step-1: BERT Encoding

- Given any input x ∈ X, BERT first encodes it into a representation H_B = BERT(x).
- H_B is the output of the last layer of BERT.
- h_{B,i} ∈ H_B is the representation of the i-th wordpiece in x (see the sketch below).
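
As a concrete illustration of this step, the snippet below extracts the last-layer representation H_B with the Hugging Face transformers library. The checkpoint name bert-base-uncased is an illustrative assumption, not necessarily the model used in the paper:

import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint; the paper pairs BERT models with each language direction.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
bert.eval()

x = "BERT representations are fed into every NMT layer."
inputs = tokenizer(x, return_tensors="pt")

with torch.no_grad():
    outputs = bert(**inputs)

# H_B: last-layer hidden states, one vector h_{B,i} per wordpiece.
H_B = outputs.last_hidden_state   # shape (1, l_x, hidden_size)
print(H_B.shape)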

Step-2: Encoder Representation
- Let H_E^l denote the hidden representation of the l-th encoder layer, and let H_E^0 denote the word embeddings of the sequence x.
- Denote the i-th element of H_E^l as h_i^l, for any i ∈ [l_x].
- In the l-th layer, l ∈ [L], we have

  \tilde{h}_i^l = \frac{1}{2}\Big(\mathrm{attn}_S(h_i^{l-1}, H_E^{l-1}, H_E^{l-1}) + \mathrm{attn}_B(h_i^{l-1}, H_B, H_B)\Big), \quad \forall i \in [l_x],

  where attn_S (self-attention) and attn_B (BERT-encoder attention) are attention models with different parameters.
- Each \tilde{h}_i^l is then further processed by a feed-forward network FFN(·), giving the output of the l-th layer:

  H_E^l = \big(\mathrm{FFN}(\tilde{h}_1^l), \cdots, \mathrm{FFN}(\tilde{h}_{l_x}^l)\big).

- The encoder eventually outputs H_E^L from the last layer; a minimal PyTorch sketch of one such layer follows.
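
The per-layer computation above can be sketched in PyTorch as follows. This is a simplified illustration that assumes a shared model dimension d_model for the NMT encoder and the (projected) BERT output, and it omits residual connections, layer normalization, dropout, padding masks, and the drop-net trick; it is not the authors' implementation:

import torch
import torch.nn as nn

class BertFusedEncoderLayer(nn.Module):
    """One encoder layer: averages self-attention and BERT-encoder attention, then FFN."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn_s = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # attn_S
        self.attn_b = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # attn_B
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, h_enc: torch.Tensor, h_bert: torch.Tensor) -> torch.Tensor:
        # h_enc: H_E^{l-1}, shape (batch, l_x, d_model); h_bert: H_B, shape (batch, l_x', d_model)
        self_out, _ = self.attn_s(h_enc, h_enc, h_enc)     # attn_S(h_i^{l-1}, H_E^{l-1}, H_E^{l-1})
        bert_out, _ = self.attn_b(h_enc, h_bert, h_bert)   # attn_B(h_i^{l-1}, H_B, H_B)
        h_tilde = 0.5 * (self_out + bert_out)              # average the two attention outputs
        return self.ffn(h_tilde)                           # H_E^l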
Step-3: Decoder Representation

- Let S_{<t}^l denote the hidden states of the l-th decoder layer preceding time step t, i.e.

  S_{<t}^l = (s_1^l, \ldots, s_{t-1}^l).

- Note that s_1^0 is the embedding of a special token indicating <bos>, and s_t^0 is the embedding of the word predicted at time step t − 1.
- At the l-th layer, we have

  \hat{s}_t^l = \mathrm{attn}_S(s_t^{l-1}, S_{<t+1}^{l-1}, S_{<t+1}^{l-1}),
  \tilde{s}_t^l = \tfrac{1}{2}\Big(\mathrm{attn}_B(\hat{s}_t^l, H_B, H_B) + \mathrm{attn}_E(\hat{s}_t^l, H_E^L, H_E^L)\Big),
  s_t^l = \mathrm{FFN}(\tilde{s}_t^l),

  where attn_S, attn_B, and attn_E represent the self-attention model, the BERT-decoder attention model, and the encoder-decoder attention model, respectively. A PyTorch sketch of one such decoder layer follows.
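
The corresponding decoder layer can be sketched as follows; this is again a simplified illustration (no layer normalization, residuals, padding masks, or drop-net), not the authors' implementation:

import torch
import torch.nn as nn

class BertFusedDecoderLayer(nn.Module):
    """One decoder layer: masked self-attention, then averaged BERT- and encoder-attention, then FFN."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn_s = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # attn_S
        self.attn_b = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # attn_B
        self.attn_e = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # attn_E
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, s, h_bert, h_enc, causal_mask=None):
        # s: S^{l-1}, shape (batch, t, d_model); h_bert: H_B; h_enc: H_E^L
        s_hat, _ = self.attn_s(s, s, s, attn_mask=causal_mask)   # masked self-attention
        bert_out, _ = self.attn_b(s_hat, h_bert, h_bert)         # attn_B(s_hat, H_B, H_B)
        enc_out, _ = self.attn_e(s_hat, h_enc, h_enc)            # attn_E(s_hat, H_E^L, H_E^L)
        s_tilde = 0.5 * (bert_out + enc_out)                     # average BERT and encoder attention
        return self.ffn(s_tilde)                                 # S^l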

Step-3: Decoder Representation cont.

- Finally, s_t^L is mapped via a linear transformation and a softmax to obtain the t-th predicted word ŷ_t.
- The decoding process continues until the <eos> token is produced (a minimal greedy-decoding sketch follows).
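
The stopping rule can be illustrated with a greedy decoding loop. The helpers model_step (returning S^L for a target prefix) and output_proj are illustrative assumptions, and the paper's actual decoding setup (e.g., beam search) is not shown here:

import torch

def greedy_decode(model_step, output_proj, h_bert, h_enc, bos_id, eos_id, max_len=128):
    """Greedy decoding sketch: map s_t^L through a linear layer + softmax, stop at <eos>."""
    ys = [bos_id]
    for _ in range(max_len):
        y_in = torch.tensor([ys])                    # (1, t) target prefix
        s_top = model_step(y_in, h_bert, h_enc)      # S^L, shape (1, t, d_model)
        logits = output_proj(s_top[:, -1])           # logits for position t
        next_word = int(torch.softmax(logits, dim=-1).argmax(dim=-1))
        ys.append(next_word)
        if next_word == eos_id:                      # stop once <eos> is produced
            break
    return ys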

Drop-Net Trick

- The authors propose a drop-net trick to ensure that the features output by BERT and by the conventional encoder are fully utilized.
- This technique also prevents the network from overfitting.
- Let the drop-net rate be p_net ∈ [0, 1]; it is a hyperparameter.
- At each training iteration, for any layer l, a random variable U^l is sampled uniformly from [0, 1].

Drop-Net Trick cont.
- In the encoder, \tilde{h}_i^l can be written as:

  \tilde{h}_i^l = \mathbb{I}\big(U^l \le \tfrac{p_{net}}{2}\big)\,\mathrm{attn}_S(h_i^{l-1}, H_E^{l-1}, H_E^{l-1})
                + \mathbb{I}\big(U^l > 1 - \tfrac{p_{net}}{2}\big)\,\mathrm{attn}_B(h_i^{l-1}, H_B, H_B)
                + \mathbb{I}\big(\tfrac{p_{net}}{2} < U^l \le 1 - \tfrac{p_{net}}{2}\big)\,\tfrac{1}{2}\big(\mathrm{attn}_S(\cdot) + \mathrm{attn}_B(\cdot)\big)

- For any layer, with probability p_net/2 only the self-attention (attn_S) is used, and with probability p_net/2 only the BERT-encoder attention (attn_B) is used.
- With probability (1 − p_net), both attention models are averaged as in Step 2.
- The same trick is applied to attn_B and attn_E in the decoder; a sketch of the sampling logic follows.
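
The sampling logic can be sketched as follows (illustrative only, with attention outputs assumed to be tensors of the same shape); at training time one of the three branches is selected per layer, and at inference time both attention outputs are averaged:

import random

def drop_net_combine(self_out, bert_out, p_net: float, training: bool):
    """Drop-net: randomly keep only one of the two attention outputs during training.

    self_out: output of attn_S; bert_out: output of attn_B (same-shape tensors).
    p_net in [0, 1] is the drop-net rate (a hyperparameter).
    """
    if training:
        u = random.uniform(0.0, 1.0)          # U^l ~ Uniform[0, 1], drawn per layer per iteration
        if u <= p_net / 2:                    # with probability p_net/2: self-attention only
            return self_out
        if u > 1.0 - p_net / 2:               # with probability p_net/2: BERT attention only
            return bert_out
        return 0.5 * (self_out + bert_out)    # with probability 1 - p_net: average both
    # At inference time, both attention outputs are combined with weight 1/2.
    return 0.5 * (self_out + bert_out)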

Supervised NMT Results

Figure: BLEU scores for low-resource translation

Figure: BLEU scores for high-resource translation

Supervised NMT Results Cont.

Figure: BLEU scores for document-level translation

Semi-Supervised and Unsupervised NMT Results

Figure: BLEU scores for semi-supervised translation

Figure: BLEU scores for unsupervised translation

Conclusion

- The BERT-fused model effectively integrates pre-trained BERT representations into NMT.
- It achieved state-of-the-art results on supervised and semi-supervised tasks.
- Future work:
  - Improve inference speed.
  - Explore lightweight model alternatives.


Thank You

