Incorporating BERT into Neural Machine Translation

Jinhua Zhu, Yingce Xia, Lijun Wu, Di He, Tao Qin, Wengang Zhou, Houqiang Li, Tie-Yan Liu

Ramakrishna Mission Vivekananda Educational and Research Institute

Presenters: Ayan Maity, Debanjan Nanda, Debayan Datta


November 22, 2024
Contents

1 Introduction

2 Preliminary Explorations

3 BERT-fused Model

4 Algorithm

5 Results

6 Conclusion

Introduction

- Neural Machine Translation (NMT) translates text from a source language to a target language, using an encoder to generate hidden representations and a decoder to produce target sentences.
- BERT excels at language understanding, but integrating it with NMT is challenging.
- Initial experiments showed that contextual embeddings extracted from BERT improved results.
- The authors propose a BERT-fused model that integrates BERT-extracted representations into each layer of the NMT encoder and decoder using attention mechanisms.

Preliminary Explorations

- Two approaches for using BERT in NMT:
  1. Initialize the NMT encoder with a pre-trained BERT model, then fine-tune on the translation task.
  2. Use BERT to provide contextual embeddings as input to the NMT model.
- Observation: contextual embeddings outperformed fine-tuning.

Preliminary Explorations cont.

Different ways to utilize pre-trained models:

1. Use pre-trained models to initialize the NMT model:
   - Initialize the NMT encoder with pre-trained BERT.
   - Initialize the encoder/decoder with XLM.
2. Use pre-trained models as inputs to the NMT model:
   - Feed the outputs of BERT's last layer to the NMT model as inputs.

Challenge:
- Pre-training large models for each language pair is resource-intensive.

Preliminary Explorations cont.

Their approach:
- Feeding the output of BERT into the encoder as context-aware embeddings outperforms the initialization approaches.

Figure: Preliminary explorations on IWSLT'14 English→German translation

Proposed Algorithm: BERT-fused Model

- The BERT-fused model exploits representations from BERT by feeding them into all layers, rather than using them only as input embeddings.
- An attention mechanism adaptively controls how each layer interacts with the BERT representations.
- Two extra attention modules (see the forward-pass sketch below):
  - BERT-encoder attention: enables NMT encoder layers to interact with BERT representations, producing fused outputs.
  - BERT-decoder attention: combines BERT representations with NMT encoder outputs during decoding.
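
The overall data flow can be summarized as a short forward-pass sketch in Python. This is a minimal illustration under assumed module names (bert, src_embed, tgt_embed, fused_encoder_layers, fused_decoder_layers, output_proj), not the authors' implementation:

# Sketch of the BERT-fused forward pass (illustrative, not the paper's code).
# x: source token ids, y_in: shifted target token ids.
def bert_fused_forward(x, y_in, bert, src_embed, tgt_embed,
                       fused_encoder_layers, fused_decoder_layers, output_proj):
    # Step 1: BERT encodes the source sentence; only the last layer H_B is kept.
    h_bert = bert(x)                      # H_B, shape (batch, l_x, d_bert)

    # Step 2: each encoder layer attends to its own input and to H_B.
    h_enc = src_embed(x)                  # H_E^0
    for layer in fused_encoder_layers:
        h_enc = layer(h_enc, h_bert)      # H_E^l

    # Step 3: each decoder layer attends to itself, to H_B, and to H_E^L.
    s = tgt_embed(y_in)                   # S^0
    for layer in fused_decoder_layers:
        s = layer(s, h_bert, h_enc)       # S^l

    # A linear map plus softmax over the vocabulary gives the predictions.
    return output_proj(s)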

Model Architecture

Figure: BERT-fused Model Architecture

Notations

- Let X and Y denote the source-language domain and the target-language domain, respectively:
  - X: collection of sentences in the source language.
  - Y: collection of sentences in the target language.
- For any sentences x ∈ X and y ∈ Y:
  - l_x: number of units (e.g., words or sub-words) in x.
  - l_y: number of units in y.
  - x_i and y_i: the i-th unit in x and y, respectively.
- Modules:
  - Enc: encoder.
  - Dec: decoder.
  - BERT: pre-trained language model.
- The encoder and decoder together form the NMT module.

Step-1: BERT Encoding

- Given any input x ∈ X, BERT first encodes it into a representation H_B = BERT(x).
- H_B is the output of the last layer of BERT.
- h_{B,i} ∈ H_B is the representation of the i-th wordpiece in x (see the sketch below).
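
As a concrete illustration of this step, the snippet below extracts the last-layer representation H_B with the Hugging Face transformers library. The checkpoint name bert-base-uncased is an illustrative assumption, not necessarily the model used in the paper:

import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint; the paper pairs BERT models with each language direction.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
bert.eval()

x = "BERT representations are fed into every NMT layer."
inputs = tokenizer(x, return_tensors="pt")

with torch.no_grad():
    outputs = bert(**inputs)

# H_B: last-layer hidden states, one vector h_{B,i} per wordpiece.
H_B = outputs.last_hidden_state   # shape (1, l_x, hidden_size)
print(H_B.shape)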

Step-2: Encoder Representation
- Let H_E^l denote the hidden representation of the l-th encoder layer, and let H_E^0 denote the word embeddings of the sequence x.
- Denote the i-th element of H_E^l as h_i^l, for any i ∈ [l_x].
- In the l-th layer, l ∈ [L], we have

  \tilde{h}_i^l = \frac{1}{2}\Big(\mathrm{attn}_S(h_i^{l-1}, H_E^{l-1}, H_E^{l-1}) + \mathrm{attn}_B(h_i^{l-1}, H_B, H_B)\Big), \quad \forall i \in [l_x],

  where attn_S (self-attention) and attn_B (BERT-encoder attention) are attention models with different parameters.
- Each \tilde{h}_i^l is then further processed by a feed-forward network FFN(·), giving the output of the l-th layer:

  H_E^l = \big(\mathrm{FFN}(\tilde{h}_1^l), \cdots, \mathrm{FFN}(\tilde{h}_{l_x}^l)\big).

- The encoder eventually outputs H_E^L from the last layer; a minimal PyTorch sketch of one such layer follows.
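
The per-layer computation above can be sketched in PyTorch as follows. This is a simplified illustration that assumes a shared model dimension d_model for the NMT encoder and the (projected) BERT output, and it omits residual connections, layer normalization, dropout, padding masks, and the drop-net trick; it is not the authors' implementation:

import torch
import torch.nn as nn

class BertFusedEncoderLayer(nn.Module):
    """One encoder layer: averages self-attention and BERT-encoder attention, then FFN."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn_s = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # attn_S
        self.attn_b = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # attn_B
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, h_enc: torch.Tensor, h_bert: torch.Tensor) -> torch.Tensor:
        # h_enc: H_E^{l-1}, shape (batch, l_x, d_model); h_bert: H_B, shape (batch, l_x', d_model)
        self_out, _ = self.attn_s(h_enc, h_enc, h_enc)     # attn_S(h_i^{l-1}, H_E^{l-1}, H_E^{l-1})
        bert_out, _ = self.attn_b(h_enc, h_bert, h_bert)   # attn_B(h_i^{l-1}, H_B, H_B)
        h_tilde = 0.5 * (self_out + bert_out)              # average the two attention outputs
        return self.ffn(h_tilde)                           # H_E^l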
Step-3: Decoder Representation

- Let S_{<t}^l denote the hidden states of the l-th decoder layer preceding time step t, i.e.

  S_{<t}^l = (s_1^l, \ldots, s_{t-1}^l).

- Note that s_1^0 is the embedding of a special token indicating <bos>, and s_t^0 is the embedding of the word predicted at time step t − 1.
- At the l-th layer, we have

  \hat{s}_t^l = \mathrm{attn}_S(s_t^{l-1}, S_{<t+1}^{l-1}, S_{<t+1}^{l-1}),
  \tilde{s}_t^l = \tfrac{1}{2}\Big(\mathrm{attn}_B(\hat{s}_t^l, H_B, H_B) + \mathrm{attn}_E(\hat{s}_t^l, H_E^L, H_E^L)\Big),
  s_t^l = \mathrm{FFN}(\tilde{s}_t^l),

  where attn_S, attn_B, and attn_E represent the self-attention model, the BERT-decoder attention model, and the encoder-decoder attention model, respectively. A PyTorch sketch of one such decoder layer follows.
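
The corresponding decoder layer can be sketched as follows; this is again a simplified illustration (no layer normalization, residuals, padding masks, or drop-net), not the authors' implementation:

import torch
import torch.nn as nn

class BertFusedDecoderLayer(nn.Module):
    """One decoder layer: masked self-attention, then averaged BERT- and encoder-attention, then FFN."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn_s = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # attn_S
        self.attn_b = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # attn_B
        self.attn_e = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # attn_E
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, s, h_bert, h_enc, causal_mask=None):
        # s: S^{l-1}, shape (batch, t, d_model); h_bert: H_B; h_enc: H_E^L
        s_hat, _ = self.attn_s(s, s, s, attn_mask=causal_mask)   # masked self-attention
        bert_out, _ = self.attn_b(s_hat, h_bert, h_bert)         # attn_B(s_hat, H_B, H_B)
        enc_out, _ = self.attn_e(s_hat, h_enc, h_enc)            # attn_E(s_hat, H_E^L, H_E^L)
        s_tilde = 0.5 * (bert_out + enc_out)                     # average BERT and encoder attention
        return self.ffn(s_tilde)                                 # S^l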

Step-3: Decoder Representation cont.

- Finally, s_t^L is mapped via a linear transformation and a softmax to obtain the t-th predicted word ŷ_t.
- The decoding process continues until the <eos> token is produced (a minimal greedy-decoding sketch follows).
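
The stopping rule can be illustrated with a greedy decoding loop. The helpers model_step (returning S^L for a target prefix) and output_proj are illustrative assumptions, and the paper's actual decoding setup (e.g., beam search) is not shown here:

import torch

def greedy_decode(model_step, output_proj, h_bert, h_enc, bos_id, eos_id, max_len=128):
    """Greedy decoding sketch: map s_t^L through a linear layer + softmax, stop at <eos>."""
    ys = [bos_id]
    for _ in range(max_len):
        y_in = torch.tensor([ys])                    # (1, t) target prefix
        s_top = model_step(y_in, h_bert, h_enc)      # S^L, shape (1, t, d_model)
        logits = output_proj(s_top[:, -1])           # logits for position t
        next_word = int(torch.softmax(logits, dim=-1).argmax(dim=-1))
        ys.append(next_word)
        if next_word == eos_id:                      # stop once <eos> is produced
            break
    return ys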

Drop-Net Trick

- The authors propose a drop-net trick to ensure that the features output by BERT and by the conventional encoder are fully utilized.
- This technique also prevents the network from overfitting.
- Let the drop-net rate be p_net ∈ [0, 1]; it is a hyperparameter.
- At each training iteration, for any layer l, a random variable U^l is sampled uniformly from [0, 1].

Drop-Net Trick cont.
- In the encoder, \tilde{h}_i^l can be written as:

  \tilde{h}_i^l = \mathbb{I}\big(U^l \le \tfrac{p_{net}}{2}\big)\,\mathrm{attn}_S(h_i^{l-1}, H_E^{l-1}, H_E^{l-1})
                + \mathbb{I}\big(U^l > 1 - \tfrac{p_{net}}{2}\big)\,\mathrm{attn}_B(h_i^{l-1}, H_B, H_B)
                + \mathbb{I}\big(\tfrac{p_{net}}{2} < U^l \le 1 - \tfrac{p_{net}}{2}\big)\,\tfrac{1}{2}\big(\mathrm{attn}_S(\cdot) + \mathrm{attn}_B(\cdot)\big)

- For any layer, with probability p_net/2 only the self-attention (attn_S) is used, and with probability p_net/2 only the BERT-encoder attention (attn_B) is used.
- With probability (1 − p_net), both attention models are averaged as in Step 2.
- The same trick is applied to attn_B and attn_E in the decoder; a sketch of the sampling logic follows.
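
The sampling logic can be sketched as follows (illustrative only, with attention outputs assumed to be tensors of the same shape); at training time one of the three branches is selected per layer, and at inference time both attention outputs are averaged:

import random

def drop_net_combine(self_out, bert_out, p_net: float, training: bool):
    """Drop-net: randomly keep only one of the two attention outputs during training.

    self_out: output of attn_S; bert_out: output of attn_B (same-shape tensors).
    p_net in [0, 1] is the drop-net rate (a hyperparameter).
    """
    if training:
        u = random.uniform(0.0, 1.0)          # U^l ~ Uniform[0, 1], drawn per layer per iteration
        if u <= p_net / 2:                    # with probability p_net/2: self-attention only
            return self_out
        if u > 1.0 - p_net / 2:               # with probability p_net/2: BERT attention only
            return bert_out
        return 0.5 * (self_out + bert_out)    # with probability 1 - p_net: average both
    # At inference time, both attention outputs are combined with weight 1/2.
    return 0.5 * (self_out + bert_out)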

Supervised NMT Results

Figure: BLEU scores for low-resource translation

Figure: BLEU scores for high-resource translation

Supervised NMT Results Cont.

Figure: BLEU scores for document-level translation

Semi-Supervised and Unsupervised NMT Results

Figure: BLEU scores for semi-supervised translation

Figure: BLEU scores for unsupervised translation

Conclusion

- The BERT-fused model effectively integrates pre-trained BERT representations into NMT.
- It achieved state-of-the-art results on supervised and semi-supervised tasks.
- Future work:
  - Improve inference speed.
  - Explore lightweight model alternatives.


Thank You

