Algorithmic Intelligence Laboratory
EE807:	Recent	Advances	in	Deep	Learning
Lecture	19
Slide	made	by	
Sangwoo	Mo
KAIST	EE
Advanced	Models	for	Language
1
Algorithmic	Intelligence	Laboratory
1. Introduction
• Why	deep	learning	for	NLP?
• Overview	of	the	lecture
2. Network	Architecture
• Learning	long-term	dependencies
• Improve	softmax	layers
3. Training	Methods
• Reduce	exposure	bias
• Reduce	loss/evaluation	mismatch
• Extension	to	unsupervised	setting
Table	of	Contents
2
Algorithmic	Intelligence	Laboratory
1. Introduction
• Why	deep	learning	for	NLP?
• Overview	of	the	lecture
2. Network	Architecture
• Learning	long-term	dependencies
• Improve	softmax	layers
3. Training	Methods
• Reduce	exposure	bias
• Reduce	loss/evaluation	mismatch
• Extension	to	unsupervised	setting
Table	of	Contents
3
Algorithmic	Intelligence	Laboratory
Why	Deep	Learning	for	Natural	Language	Processing	(NLP)?
• Deep	learning is	now	commonly	used in	natural	language	processing	(NLP)
*Source:	Young	et	al.	“Recent	Trends	in	Deep	Learning	Based	Natural	Language	Processing”,	arXiv	2017 4
Algorithmic	Intelligence	Laboratory
Recap:	RNN	&	CNN	for	Sequence	Modeling
• Language is	sequential: It	is	natural	to	use	RNN	architectures
• RNN (or	LSTM	variants)	is	a	natural	choice	for	sequence	modelling
• Language is	translation-invariant: It	is	natural	to	use	CNN	architectures
• One	can	use	CNN [Gehring	et	al.,	2017]	for	parallelization
*Source:	https://towardsdatascience.com/introduction-to-recurrent-neural-network-27202c3945f3
Gehring	et	al.	“Convolutional	Sequence	to	Sequence	Learning”,	ICML	2017 5
Algorithmic	Intelligence	Laboratory
Limitations	of	prior	works
• However,	prior	works have	several	limitations…
• Network	architecture
• Long-term	dependencies:	Network	forgets previous	information	as	it	summarizes	
inputs	into	a	single feature	vector
• Limitations of softmax: Computation increases linearly with the vocabulary size,
and expressivity is bounded by the feature dimension
• Training	methods
• Exposure	bias:	Model	only	sees	true tokens	at	training,	but	it	sees	generated
tokens	at	inference	(and	noise	accumulates	sequentially)
• Loss/evaluation mismatch: Model uses the MLE objective at training, but is evaluated with
other metrics (e.g., BLEU score [Papineni et al., 2002]) at inference
• Unsupervised setting: How can we train models when no paired data is available?
6
Algorithmic	Intelligence	Laboratory
1. Introduction
• Why	deep	learning	for	NLP?
• Overview	of	the	lecture
2. Network	Architecture
• Learning	long-term	dependencies
• Improve	softmax	layers
3. Training	Methods
• Reduce	exposure	bias
• Reduce	loss/evaluation	mismatch
• Extension	to	unsupervised	setting
Table	of	Contents
7
Algorithmic	Intelligence	Laboratory
Attention	[Bahdanau	et	al.,	2015]
• Motivation:
• Previous	models	summarize inputs	into	a	single feature	vector
• Hence,	the	model	forgets old	inputs,	especially	for	long sequences
• Idea:
• Keep the input features, but attend to the most important ones
• Example) Translate “Ich möchte ein Bier” ⇔ “I’d like a beer”
• Here, when the model generates “beer”, it should attend to “Bier”
8*Source:	https://ratsgo.github.io/from%20frequency%20to%20semantics/2017/10/06/attention/
Algorithmic	Intelligence	Laboratory
Attention	[Bahdanau	et	al.,	2015]
• Method:
• Task: Translate a source sequence $[x_1, \dots, x_T]$ to a target sequence $[y_1, \dots, y_{T'}]$
• Now the decoder hidden state $s_t$ is a function of the previous state $s_{t-1}$, the current input
$\hat{y}_{t-1}$, and the context vector $c_t$, i.e., $s_t = f(s_{t-1}, \hat{y}_{t-1}, c_t)$
9*Source:	https://medium.com/syncedreview/a-brief-overview-of-attention-mechanism-13c578ba9129
[Figure: one decoder step, showing the context vector $c_t$, states $s_{t-1}, s_t$, and previous output $\hat{y}_{t-1}$]
Algorithmic	Intelligence	Laboratory
Attention	[Bahdanau	et	al.,	2015]
• Method:
• Task: Translate a source sequence $[x_1, \dots, x_T]$ to a target sequence $[y_1, \dots, y_{T'}]$
• Now the decoder hidden state $s_t$ is a function of the previous state $s_{t-1}$, the current input
$\hat{y}_{t-1}$, and the context vector $c_t$, i.e., $s_t = f(s_{t-1}, \hat{y}_{t-1}, c_t)$
• The context vector $c_t$ is a linear combination of the input hidden features $[h_1, \dots, h_T]$: $c_t = \sum_{i=1}^{T} \alpha_{t,i} h_i$
• Here, the weight $\alpha_{t,i}$ is the alignment score of the two words $y_t$ and $x_i$,
where the score function is also jointly trained, e.g., $\alpha_{t,i} = \mathrm{softmax}_i\big(\mathrm{score}(s_{t-1}, h_i)\big)$ with an additive score $\mathrm{score}(s_{t-1}, h_i) = v^\top \tanh(W_s s_{t-1} + W_h h_i)$
10
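A minimal NumPy sketch of one additive-attention decoder step, following the formulas above. The parameter names (W_s, W_h, v) and the random toy inputs are illustrative, not the paper's exact parameterization.

```python
# Minimal sketch of additive (Bahdanau-style) attention for one decoder step.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(s_prev, H, W_s, W_h, v):
    """s_prev: previous decoder state (d,); H: encoder hidden states (T, d)."""
    # Alignment scores e_{t,i} = v^T tanh(W_s s_{t-1} + W_h h_i), one per source position.
    scores = np.tanh(s_prev @ W_s + H @ W_h) @ v   # (T,)
    alpha = softmax(scores)                         # attention weights alpha_{t,i}
    c_t = alpha @ H                                 # context vector: weighted sum of h_i
    return c_t, alpha

# Toy usage: T=4 source positions, d=8 features.
T, d = 4, 8
rng = np.random.default_rng(0)
c, a = attention_step(rng.normal(size=d), rng.normal(size=(T, d)),
                      rng.normal(size=(d, d)), rng.normal(size=(d, d)),
                      rng.normal(size=d))
print(a.sum())  # attention weights sum to 1
```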
Algorithmic	Intelligence	Laboratory
Attention	[Bahdanau	et	al.,	2015]
• Results:	Attention	shows	good	correlation between	source	and	target
11
Algorithmic	Intelligence	Laboratory
Attention	[Bahdanau	et	al.,	2015]
• Results:	Attention	improves machine	translation	performance
• RNNenc:	no	attention	/	RNNsearch:	with	attention	/	#:	max	length	of	train	data
12
No	UNK:	omit	unknown	words
*: trained longer, until convergence
Algorithmic	Intelligence	Laboratory
Show,	Attend,	and	Tell	[Xu	et	al.,	2015]
• Motivation: Can we apply attention to image captioning?
• Task: Translate a source image $x$ to a target sequence $[y_1, \dots, y_{T'}]$
• Now attend to specific locations on the image, rather than to words
• Idea: Apply attention to convolutional features $[h_1, \dots, h_L]$ (with $K$ channels)
• Apply deterministic soft attention (as in the previous slides) and stochastic hard attention
(pick one $h_i$ by sampling a multinomial distribution with parameter $\alpha$)
• Hard attention picks a more specific area and shows better results, but training is
less stable due to the stochasticity and non-differentiability
13
Up:	hard	attention	/	Down:	soft	attention
Algorithmic	Intelligence	Laboratory
Show,	Attend,	and	Tell	[Xu	et	al.,	2015]
• Results:	Attention	picks	visually	plausible	locations
14
Algorithmic	Intelligence	Laboratory
Show,	Attend,	and	Tell	[Xu	et	al.,	2015]
• Results:	Attention	improves the	image	captioning	performance
15
Algorithmic	Intelligence	Laboratory
Transformer	[Vaswani	et	al.,	2017]	
• Motivation:
• Prior	works	use	RNN/CNN	to	solve	sequence-to-sequence problems
• Attention already handles sequences of arbitrary length, is easy to parallelize, and
does not suffer from forgetting problems… so why use RNN/CNN modules at all?
• Idea:
• Design	architecture	only	using attention modules
• To extract features, the authors use self-attention, where features attend to themselves
• Self-attention	has	many	advantages	over	RNN/CNN	blocks
16
𝑛: sequence	length,	𝑑:	feature	dimension,	𝑘:	(conv)	kernel	size,	𝑟:	window	size	to	consider
Maximum	path	length: maximum	traversal	between	any	two	input/outputs	(lower	is	better)
*Cf.	Now	self-attention	is	widely	used	in	other	architectures,	e.g.,	CNN	[Wang	et	al.,	2018]	or	GAN	[Zhang	et	al.,	2018]
Algorithmic	Intelligence	Laboratory
Transformer	[Vaswani	et	al.,	2017]	
• Multi-head	attention:	The	building	block	of	the	Transformer
• In the previous slides, we introduced additive attention [Bahdanau et al., 2015]
• There, the context vector is a linear combination of
• weights $\alpha_{t,i}$, a function of the inputs $[x_i]$ and the output $y_t$
• and the input hidden states $[h_i]$
• In general, attention is a function of a key $K$, a value $V$, and a query $Q$
• The key $[x_i]$ and the query $y_t$ define the weights $\alpha_{t,i}$, which are applied to the value $[h_i]$
• For sequence length $T$ and feature dimension $d$, $(K, V, Q)$ are $T \times d$, $T \times d$, and $1 \times d$ matrices
• The Transformer uses scaled dot-product attention: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^\top / \sqrt{d_k})\, V$
• In addition, the Transformer uses multi-head attention,
an ensemble of attention heads (see the sketch below)
17
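A minimal NumPy sketch of scaled dot-product attention and a naive multi-head version (several heads whose outputs are concatenated). The projection matrices here are random placeholders, not trained weights, and the output projection of the real Transformer is omitted.

```python
# Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V, plus a naive multi-head wrapper.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sdp_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (T_q, T_k) similarity of each query to each key
    return softmax(scores, axis=-1) @ V   # (T_q, d_v) weighted sum of values

def multi_head(Q, K, V, heads):
    # heads: list of (W_q, W_k, W_v) projection tuples; head outputs are concatenated.
    return np.concatenate([sdp_attention(Q @ Wq, K @ Wk, V @ Wv)
                           for Wq, Wk, Wv in heads], axis=-1)

T, d, h = 5, 16, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d))               # self-attention: Q = K = V = X
heads = [tuple(rng.normal(size=(d, d // h)) for _ in range(3)) for _ in range(h)]
print(multi_head(X, X, X, heads).shape)   # (5, 16)
```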
Algorithmic	Intelligence	Laboratory
Transformer	[Vaswani	et	al.,	2017]	
• Transformer:
• The	final	transformer model	is	built	upon	the	(multi-head)	attention	blocks
• First,	extract	features	with	self-attention	(see	lower	part	of	the	block)
• Then decode features with the usual attention (see the middle part of the block)
• Since the model does not have a sequential structure,
the authors add a position embedding (a handcrafted
feature that represents the position in the sequence)
18
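A small NumPy sketch of the sinusoidal position embedding used by the Transformer, PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)); the array is simply added to the token embeddings.

```python
# Sinusoidal positional encoding table of shape (max_len, d_model).
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                 # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]             # (1, d_model/2) even dimensions
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dims: sine
    pe[:, 1::2] = np.cos(angles)                      # odd dims: cosine
    return pe                                         # added to token embeddings before the first block

print(positional_encoding(max_len=50, d_model=8).shape)  # (50, 8)
```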
Algorithmic	Intelligence	Laboratory
Transformer	[Vaswani	et	al.,	2017]	
• Results: The Transformer architecture shows good performance on language tasks
19
Algorithmic	Intelligence	Laboratory
BERT [Devlin et al., 2018]
• Motivation:
• Much of CNNs' success comes from ImageNet-pretrained networks
• Can we train a universal encoder for natural language?
• Method:
• BERT (bidirectional encoder representations from Transformers): Design a neural
network based on a bidirectional Transformer, and use it as a pretrained model
• Pretrain with two tasks (masked language model, next sentence prediction)
• Use the pretrained BERT encoder, and fine-tune a simple 1-layer output head for each task
20
Sentence	classification Question	answering
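A simplified sketch of the masked-LM input corruption used for BERT pretraining: roughly 15% of positions are selected; of those, 80% become [MASK], 10% are replaced with a random token, and 10% are left unchanged. The token IDs and vocabulary size below are made-up illustrative values.

```python
# Masked-LM corruption: targets are the original tokens at the selected positions (-1 = not predicted).
import random

MASK_ID, VOCAB_SIZE = 103, 30000  # illustrative values, not BERT's real vocabulary

def mask_tokens(token_ids, mask_prob=0.15, rng=random.Random(0)):
    inputs, targets = list(token_ids), [-1] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            targets[i] = tok
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK_ID                 # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: replace with a random token
            # else: 10%: keep the original token
    return inputs, targets

print(mask_tokens([7, 42, 1988, 5, 311, 64]))
```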
Algorithmic	Intelligence	Laboratory
BERT [Devlin et al., 2018]
• Results:
• Even	without task-specific	complex	architectures,	BERT	achieves	SOTA	for	11	NLP	
tasks,	including	classification,	question	answering,	tagging,	etc.
21
Algorithmic	Intelligence	Laboratory
1. Introduction
• Why	deep	learning	for	NLP?
• Overview	of	the	lecture
2. Network	Architecture
• Learning	long-term	dependencies
• Improve	softmax	layers
3. Training	Methods
• Reduce	exposure	bias
• Reduce	loss/evaluation	mismatch
• Extension	to	unsupervised	setting
Table	of	Contents
22
Algorithmic	Intelligence	Laboratory
Adaptive	Softmax	[Grave	et	al.,	2017]
• Motivation:	
• Computation	of	softmax is	expensive,	especially	for	large	vocabularies
• Hierarchical	softmax	[Mnih	&	Hinton,	2009]:
• Cluster the $k$ words into $\sqrt{k}$ balanced groups, which reduces the complexity to $O(\sqrt{k})$
• For hidden state $h$, word $w$, and cluster $C(w)$: $P(w \mid h) = P(C(w) \mid h)\, P(w \mid C(w), h)$
• One can repeat clustering for subtrees (i.e., build a balanced $n$-ary tree), which
reduces the complexity to $O(\log k)$
23*Source:	http://opendatastructures.org/versions/edition-0.1d/ods-java/node40.html
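A minimal NumPy sketch of the two-level factorization $P(w \mid h) = P(C(w) \mid h)\, P(w \mid C(w), h)$: only the cluster softmax and the within-cluster softmax are computed, roughly $\sqrt{k}$ logits each. The weight matrices and the word-to-cluster assignment are random placeholders.

```python
# Two-level hierarchical softmax probability of a single word given a hidden state h.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def two_level_prob(h, word, cluster_of, W_cluster, W_words):
    c = cluster_of[word]
    p_cluster = softmax(h @ W_cluster)[c]                       # P(C(w) | h)
    in_cluster = [w for w in range(len(cluster_of)) if cluster_of[w] == c]
    p_word = softmax(h @ W_words[:, in_cluster])[in_cluster.index(word)]  # P(w | C(w), h)
    return p_cluster * p_word

k, d = 9, 4                                     # 9 words split into 3 balanced clusters of 3
rng = np.random.default_rng(0)
cluster_of = np.repeat(np.arange(3), 3)
p = two_level_prob(rng.normal(size=d), word=4, cluster_of=cluster_of,
                   W_cluster=rng.normal(size=(d, 3)), W_words=rng.normal(size=(d, k)))
print(p)
```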
Algorithmic	Intelligence	Laboratory
Adaptive	Softmax	[Grave	et	al.,	2017]
• Limitation	of	prior	works	&	Proposed	idea:
• Cluster the $k$ words into $\sqrt{k}$ balanced groups, which reduces the complexity to $O(\sqrt{k})$
• One can repeat clustering for subtrees, which reduces the complexity to $O(\log k)$
• However, putting all words at the leaves drops the performance (by around 5-10%)
• Instead, one can put frequent words in front (similar to Huffman coding)
• Put the top $k_h$ words ($p_h$ of the frequency mass) and tokens “NEXT-$i$” in the first layer, and
put $k_i$ words ($p_i$ of frequencies) in the next layers
24
Algorithmic	Intelligence	Laboratory
Adaptive	Softmax	[Grave	et	al.,	2017]
• Limitation	of	prior	works	&	Proposed	idea:
• Put the top $k_h$ words ($p_h$ of the frequency mass) and tokens “NEXT-$i$” in the first layer, and
put $k_i$ words ($p_i$ of frequencies) in the next layers
• Let $g(k, B)$ be the computation time for $k$ words and batch size $B$
• Then the computation time of adaptive softmax (with $J$ clusters) is $C = g(k_h + J, B) + \sum_{i=1}^{J} g(k_i, p_i B)$
• For $k, B$ larger than some threshold, one can simply assume $g(k, B) = kB$ (see the paper for details)
• By solving the optimization problem (for $k_i$ and $J$), the model is 3-5x faster than
the original softmax (in practice, $J = 5$ works well)
25
Algorithmic	Intelligence	Laboratory
Adaptive	Softmax	[Grave	et	al.,	2017]
• Results:	Adaptive	softmax	shows	comparable	results to	the	original	softmax	
(while	much	faster)
26
ppl:	perplexity	(lower	is	better)
Algorithmic	Intelligence	Laboratory
Mixture	of	Softmax	[Yang	et	al.,	2018]
• Motivation:
• Rank of the softmax layer is bounded by the feature dimension $d$
• Recall: By the definition of softmax, $P(x \mid c) = \dfrac{\exp(h_c^\top w_x)}{\sum_{x'} \exp(h_c^\top w_{x'})}$,
we have $\log P(x \mid c) = h_c^\top w_x - \log \sum_{x'} \exp(h_c^\top w_{x'})$ (where $h_c^\top w_x$ is called the logit)
• Let $N$ be the number of possible contexts and $M$ the vocabulary size; then the logit matrix $A = H W^\top$ is $N \times M$ with $\mathrm{rank}(A) \le d$,
which implies that softmax can represent at most rank $d$ (while the rank of the true log-probability matrix can be larger)
27*Source:	https://www.facebook.com/iclr.cc/videos/2127071060655282/
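A tiny NumPy check of this bottleneck: the logit matrix $A = H W^\top$ over $N$ contexts and $M$ words has rank at most $d$, however large $N$ and $M$ are. The sizes below are arbitrary.

```python
# Rank of the softmax logit matrix is capped by the feature dimension d.
import numpy as np

N, M, d = 200, 500, 32
rng = np.random.default_rng(0)
H = rng.normal(size=(N, d))        # context features h_c
W = rng.normal(size=(M, d))        # word embeddings w_x
A = H @ W.T                        # logits for every (context, word) pair
print(np.linalg.matrix_rank(A))    # 32 = d, far below min(N, M)
```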
Algorithmic	Intelligence	Laboratory
Mixture	of	Softmax	[Yang	et	al.,	2018]
• Motivation:
• Rank of the softmax layer is bounded by the feature dimension $d$
• Naïvely increasing the dimension $d$ to the vocab size $M$ is inefficient
• Idea:
• Use a mixture of softmaxes (MoS): $P(x \mid c) = \sum_{k=1}^{K} \pi_{c,k}\, \dfrac{\exp(h_{c,k}^\top w_x)}{\sum_{x'} \exp(h_{c,k}^\top w_{x'})}$
• It is easily implemented by defining $\pi_{c,k}$ and $h_{c,k}$ as functions of the original $h$
• Note that now $\log P(x \mid c)$
is a nonlinear (log-sum-exp) function of $h$ and $w$, hence can represent a high-rank matrix
28
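A minimal NumPy sketch of the MoS output layer: the final distribution is a weighted sum of $K$ softmaxes, each with its own projected context. The projections (W_pi, W_hk) stand in for the small networks that compute $\pi_{c,k}$ and $h_{c,k}$ from $h$; all weights here are random placeholders.

```python
# Mixture of softmaxes: P(x|c) = sum_k pi_k * softmax(h_k^T W_vocab).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mos(h, W_pi, W_hk, W_vocab):
    pi = softmax(h @ W_pi)                          # (K,) mixture weights pi_k
    Hk = np.tanh(h @ W_hk).reshape(len(pi), -1)     # (K, d) per-component contexts h_k
    probs = softmax(Hk @ W_vocab, axis=-1)          # (K, M) one softmax per component
    return pi @ probs                               # mixture: sum_k pi_k * softmax_k

d, K, M = 16, 3, 100
rng = np.random.default_rng(0)
p = mos(rng.normal(size=d), rng.normal(size=(d, K)),
        rng.normal(size=(d, K * d)), rng.normal(size=(d, M)))
print(p.shape, p.sum())                             # (100,) 1.0
```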
Algorithmic	Intelligence	Laboratory
Mixture	of	Softmax	[Yang	et	al.,	2018]
• Results: MoS learns full rank (= vocab size) while softmax is bounded by $d$
• Empirical rank, measured by collecting all contexts & outputs in the data
29
MoC:	mixture	of	contexts
(mixture	before softmax)
𝑑 = 400, 280, 280 for
Softmax,	MoC,	MoS,	respectively
Note	that	9981	is	full	rank
as	vocab	size	=	9981
Algorithmic	Intelligence	Laboratory
Mixture	of	Softmax	[Yang	et	al.,	2018]
• Results:	Simply	changing	Softmax	to	MoS	improves the	performance
• By	applying	MoS	to	SOTA	models,	the	authors	achieved	new	SOTA	records
30
Algorithmic	Intelligence	Laboratory
1. Introduction
• Why	deep	learning	for	NLP?
• Overview	of	the	lecture
2. Network	Architecture
• Learning	long-term	dependencies
• Improve	softmax	layers
3. Training	Methods
• Reduce	exposure	bias
• Reduce	loss/evaluation	mismatch
• Extension	to	unsupervised	setting
Table	of	Contents
31
Algorithmic	Intelligence	Laboratory
Scheduled	Sampling	[Bengio	et	al.,	2015]
• Motivation:	
• Teacher	forcing [Williams	et	al.,	1989]	is	widely	used	for	sequential	training
• It uses the real previous token and the current state to predict the current output
32
*Source:	https://satopirka.com/2018/02/encoder-decoder%E3%83%A2%E3%83%87%E3%83%AB%E3%81%A8
teacher-forcingscheduled-samplingprofessor-forcing/
Algorithmic	Intelligence	Laboratory
Scheduled	Sampling	[Bengio	et	al.,	2015]
• Motivation:	
• Teacher	forcing [Williams	et	al.,	1989]	is	widely	used	for	sequential	training
• It uses the real previous token and the current state to predict the current output
• However, the model uses predicted tokens at inference (a.k.a. exposure bias)
33
*Source:	https://satopirka.com/2018/02/encoder-decoder%E3%83%A2%E3%83%87%E3%83%AB%E3%81%A8
teacher-forcingscheduled-samplingprofessor-forcing/
Algorithmic	Intelligence	Laboratory
Scheduled	Sampling	[Bengio	et	al.,	2015]
• Motivation:	
• Teacher	forcing [Williams	et	al.,	1989]	is	widely	used	for	sequential	training
• It uses the real previous token and the current state to predict the current output
• However, the model uses predicted tokens at inference (a.k.a. exposure bias)
• Training with predicted tokens is not trivial, since (a) training is unstable, and (b) when the
previous token changes, the target should also change
• Idea: Apply curriculum learning
• At the beginning, use real tokens, and slowly move to predicted tokens
34
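A sketch of scheduled sampling inside a decoder loop (PyTorch): with probability eps feed the ground-truth previous token (teacher forcing), otherwise feed the model's own prediction, and decay eps from 1 toward 0 over training. The tiny GRU model and vocabulary below are purely illustrative.

```python
# Scheduled sampling: mix gold and predicted previous tokens during training.
import torch, torch.nn as nn

vocab, emb_dim, hid = 50, 32, 64
embed = nn.Embedding(vocab, emb_dim)
cell = nn.GRUCell(emb_dim, hid)
out = nn.Linear(hid, vocab)

def decode(targets, eps):
    """targets: (T,) gold token ids; eps: probability of using the gold previous token."""
    h = torch.zeros(1, hid)
    prev = targets[:1]                        # start from the first gold token
    loss = 0.0
    for t in range(1, len(targets)):
        h = cell(embed(prev), h)
        logits = out(h)
        loss = loss + nn.functional.cross_entropy(logits, targets[t:t+1])
        pred = logits.argmax(dim=-1)
        # Scheduled sampling: flip a coin to choose the next input token.
        prev = targets[t:t+1] if torch.rand(1).item() < eps else pred
    return loss / (len(targets) - 1)

print(decode(torch.randint(0, vocab, (10,)), eps=0.75).item())
```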
Algorithmic	Intelligence	Laboratory
Scheduled	Sampling	[Bengio	et	al.,	2015]
• Results: Scheduled sampling improves over the baseline on many tasks
35
Image	captioning
Constituency	parsing
Algorithmic	Intelligence	Laboratory
Professor	Forcing	[Lamb	et	al.,	2016]
• Motivation:
• Scheduled sampling (SS) is known to optimize a wrong objective [Huszár et al., 2015]
• Idea:
• Make the features under predicted tokens similar to the features under true tokens
• To this end, train a discriminator that classifies features of true vs. predicted runs
• Teacher	forcing: use	real	tokens	/	Free	running: use	predicted	tokens
36
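A shape-level PyTorch sketch of the professor-forcing losses, given hidden-state sequences collected in teacher-forcing mode (real tokens) and free-running mode (predicted tokens). The discriminator and the random hidden states below are stand-ins; the real discriminator reads whole behavior sequences.

```python
# Professor forcing: adversarial matching of teacher-forced and free-running dynamics.
import torch, torch.nn as nn

hid = 64
D = nn.Sequential(nn.Linear(hid, 32), nn.ReLU(), nn.Linear(32, 1))  # illustrative discriminator
bce = nn.BCEWithLogitsLoss()

h_teacher = torch.randn(20, hid)   # hidden states under teacher forcing (placeholder)
h_free = torch.randn(20, hid)      # hidden states under free running (placeholder)

# Discriminator: tell the two behaviors apart.
d_loss = bce(D(h_teacher), torch.ones(20, 1)) + bce(D(h_free), torch.zeros(20, 1))
# Generator (the RNN): make free-running dynamics look like teacher-forced ones,
# on top of the usual MLE loss on real tokens.
g_match_loss = bce(D(h_free), torch.ones(20, 1))
print(d_loss.item(), g_match_loss.item())
```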
Algorithmic	Intelligence	Laboratory
Professor	Forcing	[Lamb	et	al.,	2016]
• Results:
• Professor forcing improves the generalization performance, especially for
long sequences (test samples are much longer than training samples)
37
NLL	for	MNIST
generation
Human	evaluation
for	handwriting
generation
Algorithmic	Intelligence	Laboratory
1. Introduction
• Why	deep	learning	for	NLP?
• Overview	of	the	lecture
2. Network	Architecture
• Learning	long-term	dependencies
• Improve	softmax	layers
3. Training	Methods
• Reduce	exposure	bias
• Reduce	loss/evaluation	mismatch
• Extension	to	unsupervised	setting
Table	of	Contents
38
Algorithmic	Intelligence	Laboratory
MIXER	[Ranzato	et	al.,	2016]
• Motivation:
• Prior	works	use	word-level objectives	(e.g.,	cross-entropy)	for	training,	but	use	
sequence-level objectives	(e.g.,	BLEU	[Papineni	et	al.,	2002])	for	evaluation
• Idea:	Directly	optimize model	with	sequence-level objective	(e.g.,	BLEU)
• Q. How can we backprop through a (usually non-differentiable) sequence-level objective?
• Sequence generation is a kind of RL problem
• state: hidden state, action: output, policy: generation algorithm
• The sequence-level objective is the reward of the current policy
• Hence, one can use a policy gradient (e.g., REINFORCE) algorithm
• However, the gradient estimator of REINFORCE has high variance
• To reduce variance, MIXER (mixed incremental cross-entropy reinforce) uses
MLE for the first $T'$ steps and REINFORCE for the remaining $T - T'$ steps ($T'$ is annealed to zero)
• Cf.	One	can	also	use	other	variance	reduction	techniques,	e.g.,	actor-critic	[Bahdanau	et	al.,	2017]
39
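A shape-level PyTorch sketch of the REINFORCE part of MIXER: sample a continuation, score the whole sequence with a sequence-level reward (BLEU in the paper; a random stand-in here), and weight the log-likelihood of the sampled tokens by (reward - baseline). The stand-in policy and decoder states are illustrative; in MIXER the first $T'$ positions still use cross-entropy on gold tokens.

```python
# REINFORCE update for a sequence-level reward (variance reduced with a baseline).
import torch, torch.nn as nn

vocab, hid, T = 50, 64, 8
policy = nn.Linear(hid, vocab)                  # stand-in for the decoder output layer
states = torch.randn(T, hid)                    # pretend decoder states for T generated steps

logits = policy(states)                         # (T, vocab)
dist = torch.distributions.Categorical(logits=logits)
samples = dist.sample()                         # sampled output tokens
log_probs = dist.log_prob(samples)              # (T,) log-probabilities of the samples

reward = torch.rand(())                         # placeholder for BLEU(sampled, reference)
baseline = 0.5                                  # variance-reduction baseline (learned in the paper)
reinforce_loss = -(reward - baseline) * log_probs.sum()
reinforce_loss.backward()                       # gradient flows into the policy parameters
print(reinforce_loss.item())
```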
Algorithmic	Intelligence	Laboratory
MIXER	[Ranzato	et	al.,	2016]
• Results:
• MIXER shows	better	performance than	other	baselines
• XENT (= cross entropy): another name for maximum likelihood estimation (MLE)
• DAD (= data as demonstrator): another name for scheduled sampling
• E2D (= end-to-end backprop): uses the top-K vector as input (approximates beam search)
40
Algorithmic	Intelligence	Laboratory
SeqGAN	[Yu	et	al.,	2017]
• Motivation:
• RL-based methods still rely on a handcrafted objective (e.g., BLEU)
• Instead, one can use a GAN loss to generate realistic sequences
• However, it is not trivial to apply GANs to natural language, since the data is discrete
(hence not differentiable) and sequential (hence a new architecture is needed)
• Idea: Backprop the discriminator’s output with policy gradients
• Similar to actor-critic; the only difference is that the reward is now the discriminator’s output
• Use	LSTM-generator	and	CNN	(or	Bi-LSTM)-discriminator	architectures
41
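A small PyTorch sketch of SeqGAN's reward: instead of a handcrafted metric, the reward for a generated sequence is the discriminator's "realness" score, plugged into the same policy-gradient update as above. The discriminator below is a crude mean-pooled stand-in; the paper uses a CNN over token embeddings and Monte Carlo rollouts to reward partial sequences.

```python
# SeqGAN-style reward: discriminator output on a generated token sequence.
import torch, torch.nn as nn

vocab, emb_dim, T = 50, 32, 8
embed = nn.Embedding(vocab, emb_dim)
D = nn.Sequential(nn.Linear(emb_dim, 32), nn.ReLU(), nn.Linear(32, 1))  # illustrative discriminator

generated = torch.randint(0, vocab, (T,))        # tokens sampled from the generator
seq_feat = embed(generated).mean(dim=0)          # crude pooled sequence feature
reward = torch.sigmoid(D(seq_feat))              # D's realness probability used as the reward
print(reward.item())
```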
Algorithmic	Intelligence	Laboratory
SeqGAN	[Yu	et	al.,	2017]
• Results:
• SeqGAN shows	better	performance	than	prior	methods
42
Synthetic	generation
(follow	the	oracle)
Chinese	poem	generation Obama	speech	generation
Algorithmic	Intelligence	Laboratory
1. Introduction
• Why	deep	learning	for	NLP?
• Overview	of	the	lecture
2. Network	Architecture
• Learning	long-term	dependencies
• Improve	softmax	layers
3. Training	Methods
• Reduce	exposure	bias
• Reduce	loss/evaluation	mismatch
• Extension	to	unsupervised	setting
Table	of	Contents
43
Algorithmic	Intelligence	Laboratory
UNMT	[Artetxe	et	al.,	2018]
• Motivation:
• Can we train neural machine translation models in an unsupervised way?
• Idea: Apply the idea of domain transfer from Lecture 12
• Combine two losses: a reconstruction loss and a cycle-consistency loss
• Recall: The cycle-consistency loss forces data translated across domains and back (e.g., L1→L2→L1) to
match the original data
44*Source:	Lample	et	al.	“Unsupervised	Machine	Translation	Using	Monolingual	Corpora	Only”,	ICLR	2018.
Model	architecture	(L1/L2:	language	1,	2)
reconstruction
cross-domain	generation
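A structural PyTorch sketch of the two losses on monolingual data. The encoder/decoders are linear stand-ins for the shared sequence-to-sequence model, Gaussian noise stands in for the paper's word drop/shuffle, and MSE stands in for token-level cross-entropy.

```python
# Unsupervised NMT losses: denoising reconstruction + cycle (back-translation) consistency.
import torch, torch.nn as nn

d = 16
enc = nn.Linear(d, d)                                 # shared encoder (stand-in)
dec_l1, dec_l2 = nn.Linear(d, d), nn.Linear(d, d)     # per-language decoders (stand-ins)
mse = nn.MSELoss()                                    # stand-in for token-level cross-entropy

x_l1 = torch.randn(10, d)                             # a batch of "sentences" in language 1 (placeholder)
noisy = x_l1 + 0.1 * torch.randn_like(x_l1)           # noise model (word drop/shuffle in the paper)

recon_loss = mse(dec_l1(enc(noisy)), x_l1)            # reconstruct L1 from its noisy version
x_l2 = dec_l2(enc(x_l1)).detach()                     # current model's L1 -> L2 translation, treated as data
cycle_loss = mse(dec_l1(enc(x_l2)), x_l1)             # translate back: L1 -> L2 -> L1 must match x_l1
print((recon_loss + cycle_loss).item())
```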
Algorithmic	Intelligence	Laboratory
UNMT	[Artetxe	et	al.,	2018]
• Results:	UNMT	produces	good translation	results
45
BPE	(byte	pair	encoding),
a	preprocessing	method
Algorithmic	Intelligence	Laboratory
Conclusion
• Deep	learning	is	widely	used	for	natural	language	processing	(NLP)
• RNN	and	CNN	were	popular	in	2014-2017
• Recently,	self-attention	based	methods	are	widely	used
• Many	new	ideas	are	proposed	to	solve	language	problems
• New	architectures	(e.g.,	self-attention,	softmax)
• New	training	methods	(e.g.,	loss,	algorithm,	unsupervised)
• Research on natural language with deep learning has only just begun
• Deep learning (especially GANs) is not as widely used in NLP as in computer vision
• The Transformer and BERT were only published in 2017-2018
• There	are	still	many	research	opportunities	in	NLP
46
Algorithmic	Intelligence	Laboratory
Introduction
• [Papineni	et	al.,	2002]	BLEU:	a	method	for	automatic	evaluation	of	machine	translation.	ACL	2002.
link	:	https://dl.acm.org/citation.cfm?id=1073135
• [Cho	et	al.,	2014]	Learning	Phrase	Representations	using	RNN	Encoder-Decoder	for	Statistical...	EMNLP	2014.
link	:	https://arxiv.org/abs/1406.1078
• [Sutskever	et	al.,	2014]	Sequence	to	Sequence	Learning	with	Neural	Networks.	NIPS	2014.
link	:	https://arxiv.org/abs/1409.3215
• [Gehring	et	al.,	2017]	Convolutional	Sequence	to	Sequence	Learning.	ICML	2017.
link	:	https://arxiv.org/abs/1705.03122
• [Young	et	al.,	2017]	Recent	Trends	in	Deep	Learning	Based	Natural	Language	Processing.	arXiv	2017.
link	:	https://arxiv.org/abs/1708.02709
Extension	to	unsupervised	setting
• [Artetxe	et	al.,	2018]	Unsupervised	Neural	Machine	Translation.	ICLR	2018.
link	:	https://arxiv.org/abs/1710.11041
• [Lample	et	al.,	2018]	Unsupervised	Machine	Translation	Using	Monolingual	Corpora	Only.	ICLR	2018.
link	:	https://arxiv.org/abs/1711.00043
References
47
Algorithmic	Intelligence	Laboratory
Learning	long-term	dependencies
• [Bahdanau	et	al.,	2015]	Neural	Machine	Translation	by	Jointly	Learning	to	Align	and	Translate.	ICLR	2015.
link	:	https://arxiv.org/abs/1409.0473
• [Weston	et	al.,	2015]	Memory	Networks.	ICLR	2015.
link	:	https://arxiv.org/abs/1410.3916
• [Xu	et	al.,	2015]	Show,	Attend	and	Tell:	Neural	Image	Caption	Generation	with	Visual	Attention.	ICML	2015.
link	:	https://arxiv.org/abs/1502.03044
• [Sukhbaatar	et	al.,	2015]	End-To-End	Memory	Networks.	NIPS	2015.
link	:	https://arxiv.org/abs/1503.08895
• [Kumar	et	al.,	2016]	Ask	Me	Anything:	Dynamic	Memory	Networks	for	Natural	Language	Processing.	ICML	2016.
link	:	https://arxiv.org/abs/1506.07285
• [Vaswani	et	al.,	2017]	Attention	Is	All	You	Need.	NIPS	2017.
link	:	https://arxiv.org/abs/1706.03762
• [Wang	et	al.,	2018]	Non-local	Neural	Networks.	CVPR	2018.
link	:	https://arxiv.org/abs/1711.07971
• [Zhang	et	al.,	2018]	Self-Attention	Generative	Adversarial	Networks.	arXiv	2018.
link	:	https://arxiv.org/abs/1805.08318
• [Peters	et	al.,	2018]	Deep	contextualized	word	representations.	NAACL	2018.
link	:	https://arxiv.org/abs/1802.05365
• [Devlin et al., 2018] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018.
link	:	https://arxiv.org/abs/1810.04805
References
48
Algorithmic	Intelligence	Laboratory
Improve	softmax	layers
• [Mnih	&	Hinton,	2009]	A	Scalable	Hierarchical	Distributed	Language	Model.	NIPS	2009.
link	:	https://papers.nips.cc/paper/3583-a-scalable-hierarchical-distributed-language-model
• [Grave	et	al.,	2017]	Efficient	softmax	approximation	for	GPUs.	ICML	2017.
link	:	https://arxiv.org/abs/1609.04309
• [Yang	et	al.,	2018]	Breaking	the	Softmax	Bottleneck:	A	High-Rank	RNN	Language	Model.	ICLR	2018.
link	:	https://arxiv.org/abs/1711.03953
Reduce	exposure	bias
• [Williams	et	al.,	1989]	A	Learning	Algorithm	for	Continually	Running	Fully	Recurrent...	Neural	Computation	1989.
link	:	https://ieeexplore.ieee.org/document/6795228
• [Bengio	et	al.,	2015]	Scheduled	Sampling	for	Sequence	Prediction	with	Recurrent	Neural	Networks.	NIPS	2015.
link	:	https://arxiv.org/abs/1506.03099
• [Huszár	et	al.,	2015]	How	(not)	to	Train	your	Generative	Model:	Scheduled	Sampling,	Likelihood...	arXiv	2015.
link	:	https://arxiv.org/abs/1511.05101
• [Lamb	et	al.,	2016]	Professor	Forcing:	A	New	Algorithm	for	Training	Recurrent	Networks.	NIPS	2016.
link	:	https://arxiv.org/abs/1610.09038
References
49
Algorithmic	Intelligence	Laboratory
Reduce	loss/evaluation	mismatch
• [Ranzato	et	al.,	2016]	Sequence	Level	Training	with	Recurrent	Neural	Networks.	ICLR	2016.
link	:	https://arxiv.org/abs/1511.06732
• [Bahdanau	et	al.,	2017]	An	Actor-Critic	Algorithm	for	Sequence	Prediction.	ICLR	2017.
link	:	https://arxiv.org/abs/1607.07086
• [Yu	et	al.,	2017]	SeqGAN:	Sequence	Generative	Adversarial	Nets	with	Policy	Gradient.	AAAI	2017.
link	:	https://arxiv.org/abs/1609.05473
• [Rajeswar	et	al.,	2017]	Adversarial	Generation	of	Natural	Language.	arXiv	2017.
link	:	https://arxiv.org/abs/1705.10929
• [Maddison	et	al.,	2017]	The	Concrete	Distribution:	A	Continuous	Relaxation	of	Discrete	Random...	ICLR	2017.
link	:	https://arxiv.org/abs/1611.00712
• [Jang	et	al.,	2017]	Categorical	Reparameterization	with	Gumbel-Softmax.	ICLR	2017.
link	:	https://arxiv.org/abs/1611.01144
• [Kusner	et	al.,	2016]	GANS	for	Sequences	of	Discrete	Elements	with	the	Gumbel-softmax...	NIPS	Workshop	2016.
link	:	https://arxiv.org/abs/1611.04051
• [Tucker	et	al.,	2017]	REBAR:	Low-variance,	unbiased	gradient	estimates	for	discrete	latent	variable...	NIPS	2017.
link	:	https://arxiv.org/abs/1703.07370
• [Hjelm	et	al.,	2018]	Boundary-Seeking	Generative	Adversarial	Networks.	ICLR	2018.
link	:	https://arxiv.org/abs/1702.08431
• [Zhao	et	al.,	2018]	Adversarially	Regularized	Autoencoders.	ICML	2018.
link	:	https://arxiv.org/abs/1706.04223
References
50
Algorithmic	Intelligence	Laboratory
Transformer	[Vaswani	et	al.,	2017]	
• Method:
• (Scaled dot-product) attention is given by $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^\top / \sqrt{d_k})\, V$
• Use multi-head attention (i.e., an ensemble of attention heads)
• The	final	transformer model	is	built	upon	the	attention	blocks
• First,	extract	features	with	self-attention
• Then	decode	feature	with	usual	attention
• Since the model does not have a sequential structure,
the authors add a position embedding (a handcrafted
feature that represents the position in the sequence)
51
*Notation: $(K, V)$ is the (key, value) pair, and $Q$ is the query
Algorithmic	Intelligence	Laboratory
Adaptive	Softmax	[Grave	et	al.,	2017]
• Limitation	of	prior	works	&	Proposed	idea:
• Put the top $k_h$ words ($p_h$ of frequencies) and a token “NEXT” in the first layer, and
put $k_t = k - k_h$ words ($p_t = 1 - p_h$ of frequencies) in the next layer
• Let $g(k, B)$ be the computation time for $k$ words and batch size $B$
• Then the computation time of the proposed method is $C = g(k_h + 1, B) + g(k_t, p_t B)$
• Here, $g(k, B)$ behaves like a threshold function (due to the initial setup cost of the GPU)
52
Algorithmic	Intelligence	Laboratory
Adaptive	Softmax	[Grave	et	al.,	2017]
• Limitation	of	prior	works	&	Proposed	idea:
• The computation time of the proposed method is $C = g(k_h + 1, B) + g(k_t, p_t B)$
• Hence, give a constraint that $kB \ge k_0 B_0$ (for efficient usage of the GPU)
• Also, extend the model to the multi-cluster setting (with $J$ clusters): $C = g(k_h + J, B) + \sum_{i=1}^{J} g(k_i, p_i B)$
• By solving the optimization problem (for $k_i$ and $J$), the model is 3-5x faster than
the original softmax (in practice, $J = 5$ shows a good computation/performance trade-off)
53
Algorithmic	Intelligence	Laboratory
Professor	Forcing	[Lamb	et	al.,	2016]
• Motivation:
• Scheduled	sampling	(SS)	is	known	to	optimize	wrong	objective [Huszár	et	al.,	2015]
• Let $P$ and $Q$ be the data and model distributions, respectively
• Assume a length-2 sequence $x_1 x_2$, and let $\epsilon$ be the ratio of real samples
• Then the objective of scheduled sampling is
• If $\epsilon = 1$, it is the usual MLE objective, but as $\epsilon \to 0$, it pushes the conditional
distribution $Q_{x_2 \mid x_1}$ toward the marginal distribution $P_{x_2}$ instead of $P_{x_2 \mid x_1}$
• Hence, the factorized $Q^* = P_{x_1} P_{x_2}$ can minimize the objective
54
Algorithmic	Intelligence	Laboratory
More	Methods	for	Discrete	GAN
• Gumbel-Softmax	(a.k.a.	concrete	distribution):
• Gradient	estimator	of	REINFORCE	has	high	variance
• One can apply the reparameterization trick… but how, for discrete variables?
• One can use the Gumbel-softmax trick [Jang et al., 2017; Maddison et al., 2017] to
obtain a biased but low-variance gradient estimator
• One can also get an unbiased estimator by using the Gumbel-softmax estimator as a control
variate for REINFORCE, called REBAR [Tucker et al., 2017]
• Discrete	GAN	is	still	an	active	research	area
• BSGAN	[Hjelm	et	al.,	2018],	ARAE	[Zhao	et	al.,	2018],	etc.
• However, GANs are not yet as popular for sequences (natural language) as for images
55
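A minimal NumPy sketch of the Gumbel-softmax trick: add Gumbel(0,1) noise to the logits and take a temperature-controlled softmax, giving a differentiable relaxation of a categorical sample (the argmax of the same noisy logits recovers an exact Gumbel-max sample).

```python
# Gumbel-softmax sample: a "soft one-hot" vector that sharpens as the temperature tau -> 0.
import numpy as np

def gumbel_softmax(logits, tau=0.5, rng=np.random.default_rng(0)):
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))   # Gumbel(0,1) noise
    y = (logits + g) / tau
    e = np.exp(y - y.max())
    return e / e.sum()

probs = gumbel_softmax(np.log(np.array([0.1, 0.2, 0.7])))
print(probs, probs.argmax())
```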