1
Chapter 9: Deep Learning in NLP
Kamath, U., Liu, J., & Whitaker, J. (2019). Deep learning for NLP
and speech recognition (Vol. 84). Cham, Switzerland: Springer.
3
Introduction
• Neural nets:
• Feedforward network
• Recurrent neural networks
• Transformer
• etc.
• Applications of neural nets in
• Classification
• Language Modeling
• Other NLP tasks
Slide Reference:
+ Jurafsky, Daniel, and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing,
Computational Linguistics, and Speech Recognition
+ CS224N: Natural Language Processing with Deep Learning, Stanford / Winter 2023
4
9.1. Feedforward Neural Networks
5
Building block of a Neural Network
• Computational unit:
• A set of weights w = w1…wn and a bias b
• Input: a vector x
• Output: y = a = f(z) = f(w·x + b)
  (a binary logistic regression unit is somewhat similar to a neuron)
• f: activation function
• a: activation value
• Activation functions:
• Sigmoid
• Tanh
• ReLU, Leaky ReLU, GELU
• Etc.
• A neural network:
• runs several logistic regressions at the same time (a minimal sketch of a single unit follows below)
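As a concrete illustration, here is a minimal NumPy sketch of one such computational unit; the sigmoid activation and the toy weight, input, and bias values are arbitrary choices for illustration, not part of the slide:

```python
import numpy as np

def sigmoid(z):
    # squashes z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def unit(x, w, b, f=sigmoid):
    # one computational unit: weighted sum plus bias, passed through an activation f
    z = np.dot(w, x) + b          # z = w . x + b
    a = f(z)                      # activation value a = f(z)
    return a

# toy example with made-up numbers
x = np.array([0.5, 0.6, 0.1])
w = np.array([0.2, 0.3, 0.9])
b = 0.5
print(unit(x, w, b))              # y = sigmoid(w . x + b)
```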
6
9.1.1 Feedforward Neural Networks
• Input layer: x
• A hidden layer:
• Weight matrix W (of shape d_h × d_in)
• Bias vector b
• Output vector h = g(W.x + b):
forms a representation of the input
• Output layer:
• Weight matrix U (of shape d_out × d_h)
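A minimal NumPy sketch of this forward pass, assuming a ReLU hidden activation g and a softmax output layer (these choices and all dimensions are illustrative assumptions, not fixed by the slide):

```python
import numpy as np

def softmax(z):
    z = z - z.max()               # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def forward(x, W, b, U):
    # hidden layer: h = g(W x + b), here with a ReLU non-linearity g
    h = np.maximum(0, W @ x + b)
    # output layer: a probability distribution over the output classes
    return softmax(U @ h)

# toy dimensions: d_in = 4, d_h = 3, d_out = 2; random untrained parameters
rng = np.random.default_rng(0)
x = rng.normal(size=4)
W = rng.normal(size=(3, 4)); b = np.zeros(3)
U = rng.normal(size=(2, 3))
print(forward(x, W, b, U))        # a probability distribution over 2 classes
```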
7
9.1.1 Feedforward Neural Networks
• Training
• Supervised machine learning:
• y: the true value for input x; ŷ: the value estimated by the network
• Learn parameters W[i] and b[i] for each layer i that make ŷ for each training observation as close as
  possible to the true y.
• The cross-entropy loss:
• Binary classifier: L_CE(ŷ, y) = −log P(y|x) = −[y log ŷ + (1 − y) log(1 − ŷ)]
• Multi-class classifier: L_CE(ŷ, y) = −Σ_k y_k log ŷ_k = −log ŷ_c = −log p̂(y_c = 1|x)
  = −log( exp(z_c) / Σ_k exp(z_k) )
  (where c is the correct class; also called the negative log likelihood loss)
• Gradient of the loss function in a deep network
• Computing the gradient of the loss with respect to each of the (many) parameters
• (Error) backpropagation algorithm on the computation graph:
• Based on backward differentiation on computation graphs
• Makes use of the chain rule to do backward computation of the gradients: derivatives
  computed for higher layers are re-used in computing derivatives for lower layers, to minimize computation
• Propagates back to all the weight nodes.
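A minimal NumPy sketch of the two cross-entropy formulas above, using toy probabilities only (the gradient computation itself is shown later via backpropagation examples):

```python
import numpy as np

def binary_cross_entropy(y_hat, y):
    # L_CE(y_hat, y) = -[ y log y_hat + (1 - y) log(1 - y_hat) ]
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def multiclass_cross_entropy(y_hat, c):
    # y_hat: softmax output over K classes, c: index of the correct class
    # L_CE = -log y_hat[c]  (negative log likelihood of the correct class)
    return -np.log(y_hat[c])

print(binary_cross_entropy(0.8, 1))                             # -log 0.8
print(multiclass_cross_entropy(np.array([0.1, 0.7, 0.2]), 1))   # -log 0.7
```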
9
Training Feedforward Neural Network
• Training:
• Forward propagation: calculate ŷ given an input x (saving intermediate values)
• Backward propagation: calculate the prediction error ŷ − y, recursively apply the chain
  rule along the computation graph to compute gradients, and update the weight matrices to
  minimize the prediction error.
• Optimization in neural networks is more complex than for logistic regression
• Need to initialize the weights with small random numbers
• Forms of regularization to prevent overfitting
• Dropout
• Tuning of hyper-parameters (chosen by the algorithm designer) on devset
• Learning rate η
• Mini-batch size
• The model architecture (the number of layers, the number of hidden nodes per layer, the choice of
activation functions)
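A minimal training-loop sketch, assuming PyTorch; the data are random toy tensors, and every hyper-parameter below (hidden size, dropout rate, learning rate, mini-batch size, number of epochs) is an arbitrary illustration rather than a recommendation:

```python
import torch
import torch.nn as nn

# toy data: 256 examples, 10 features, 3 classes (random, for illustration only)
X = torch.randn(256, 10)
y = torch.randint(0, 3, (256,))

model = nn.Sequential(
    nn.Linear(10, 32), nn.ReLU(),
    nn.Dropout(p=0.5),                      # dropout as a form of regularization
    nn.Linear(32, 3),
)
loss_fn = nn.CrossEntropyLoss()             # cross-entropy loss over the output logits
opt = torch.optim.SGD(model.parameters(), lr=0.1)   # learning rate eta

for epoch in range(5):
    for i in range(0, len(X), 32):          # mini-batch size 32
        xb, yb = X[i:i+32], y[i:i+32]
        y_hat = model(xb)                   # forward propagation
        loss = loss_fn(y_hat, yb)
        opt.zero_grad()
        loss.backward()                     # backward propagation: gradients via backprop
        opt.step()                          # update the weights
```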
10
Word2Vec
Skip gram neural network architecture
• (Figure: skip-gram architecture with an input weight matrix W_input and an output weight matrix W_output)
• W_input is used as the word embedding matrix
• The hidden layer size is the size of the embedding vector
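A minimal NumPy sketch of the skip-gram forward pass with toy sizes; training (e.g., with negative sampling) is not shown, and the matrices here are random stand-ins for learned parameters:

```python
import numpy as np

V, d = 10, 4                        # toy vocabulary size and embedding size
rng = np.random.default_rng(0)
W_input = rng.normal(size=(V, d))   # input (embedding) matrix
W_output = rng.normal(size=(d, V))  # output matrix

def skipgram_probs(center_id):
    # the hidden layer is just the embedding (row of W_input) of the center word
    h = W_input[center_id]
    scores = h @ W_output           # one score per context-word candidate
    e = np.exp(scores - scores.max())
    return e / e.sum()              # P(context word | center word)

print(skipgram_probs(3))
# after training, W_input is kept as the word embedding matrix
```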
11
9.1.2 Feedforward Neural Networks in Text classification
• Sentiment classifier
• Feedforward Neural Networks:
• Using traditional hand-built features of the input text
12
9.1.2 Feedforward Neural Networks in Text classification
• Sentiment classifier
• Feedforward Neural Networks:
• Learning features from the data:
• Using pretrained embedding representations
• Apply some sort of pooling function to the embeddings of all the words in the input.
• E.g. taking the element-wise max; mean pooling: x = (1/n) Σ_i embedding(w_i)
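A minimal NumPy sketch of mean pooling over pretrained embeddings followed by a small feedforward classifier; the embedding dimension, hidden size, ReLU activation, and the random stand-ins for "pretrained" vectors are all toy assumptions:

```python
import numpy as np

def mean_pool(embeddings):
    # embeddings: one row per word in the input text
    return embeddings.mean(axis=0)          # x = (1/n) * sum_i embedding(w_i)

def classify(x, W, b, U):
    h = np.maximum(0, W @ x + b)            # hidden layer
    z = U @ h
    e = np.exp(z - z.max())
    return e / e.sum()                      # P(class | text)

# toy: 5 words, 8-dimensional "pretrained" embeddings, 2 sentiment classes
rng = np.random.default_rng(0)
doc = rng.normal(size=(5, 8))
W = rng.normal(size=(16, 8)); b = np.zeros(16); U = rng.normal(size=(2, 16))
print(classify(mean_pool(doc), W, b, U))
```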
13
9.1.3 Feedforward Neural Networks as Language Model
14
9.1.3 Feedforward Neural Networks as Language Model
• Forward inference/decoding:
• At time t−1, given an input w_{t−N}…w_{t−2}w_{t−1}, estimate the probability distribution over all
  possible outputs for the next word w_t: P(w_t = i | w_{t−N}…w_{t−2}w_{t−1}); i = 1..|V|
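A minimal NumPy sketch of this forward inference step, assuming the context-word embeddings are concatenated and fed through a tanh hidden layer; all parameters are random, untrained toys:

```python
import numpy as np

V, d, N, d_h = 12, 4, 3, 8         # toy vocab size, embedding size, context length, hidden size
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))        # embedding matrix
W = rng.normal(size=(d_h, N * d)); b = np.zeros(d_h)
U = rng.normal(size=(V, d_h))

def next_word_distribution(context_ids):
    # concatenate the embeddings of the N context words
    e = np.concatenate([E[i] for i in context_ids])
    h = np.tanh(W @ e + b)                       # hidden layer
    z = U @ h
    p = np.exp(z - z.max()); p /= p.sum()
    return p                                     # P(w_t = i | context), i = 1..|V|

print(next_word_distribution([2, 7, 5]))
```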
17
9.1.3 Feedforward Neural Networks as Language Model
18
9.1.3 Feedforward Neural Networks as Language Model
19
9.2. Recurrent neural networks
20
9.2.1 Recurrent neural networks
21
9.2.1 Recurrent neural networks
• Inference
• at time t, compute an output yt for an input xt
• ht = g(U.ht-1 + W.xt)
• yt= f(V.ht)
• x_t, h_t, y_t: the input, hidden, and output vectors at time t
• W (d_h × d_in), U (d_h × d_h), V (d_out × d_h)
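A minimal NumPy sketch of this recurrence, assuming tanh for g and a softmax for f (toy, untrained parameters):

```python
import numpy as np

d_in, d_h, d_out = 4, 6, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(d_h, d_in))    # input weights
U = rng.normal(size=(d_h, d_h))     # recurrent weights
V = rng.normal(size=(d_out, d_h))   # output weights

def rnn_step(x_t, h_prev):
    h_t = np.tanh(U @ h_prev + W @ x_t)            # h_t = g(U h_{t-1} + W x_t)
    z = V @ h_t
    y_t = np.exp(z - z.max()); y_t /= y_t.sum()    # y_t = f(V h_t), here a softmax
    return h_t, y_t

h = np.zeros(d_h)                                  # h_0
for x_t in rng.normal(size=(5, d_in)):             # a toy 5-step input sequence
    h, y = rnn_step(x_t, h)
print(y)
```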
22
9.2.1 Recurrent neural networks
• Training
• Use backpropagation through time
• The first pass:
• perform forward inference, computing and saving ht, yt at each step
• accumulating the loss at each step
• The second pass:
• for each step backward i = t, …, 0, compute the required gradients by summing gradients
  as you go: ∂J^(t)/∂U = Σ_{i=0..t} ∂J^(t)/∂U |_(i)
• saving the gradients for the next use
• => the vanishing gradient problem can occur
• the gradient gets smaller and smaller as it backpropagates further
• model weights are updated only with respect to near effects, not long-term effects
• Unrolling a recurrent network into a feedforward computational graph
• for longer input sequences, unroll the input into manageable fixed-length segments and
treat each segment as a distinct training item
• training via ordinary backpropagation
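A minimal sketch of this segment-wise (truncated) training scheme, assuming PyTorch: the long input is cut into fixed-length segments, each segment is trained with ordinary backpropagation, and detaching the hidden state stops gradients from flowing across segment boundaries. All sizes and the toy data are arbitrary:

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 4)                     # toy per-step output layer (4 classes)
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# one long toy sequence of 100 steps, cut into fixed-length segments of 20 steps
x = torch.randn(1, 100, 8)
y = torch.randint(0, 4, (1, 100))
h = None
for start in range(0, 100, 20):
    xb, yb = x[:, start:start+20], y[:, start:start+20]
    out, h = rnn(xb, h)                     # forward inference over the segment
    loss = loss_fn(head(out).reshape(-1, 4), yb.reshape(-1))
    opt.zero_grad()
    loss.backward()                         # backpropagation through (this segment of) time
    opt.step()
    h = h.detach()                          # stop gradients from flowing into earlier segments
```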
23
9.2.2 Common applications for RNNs in NLP
24
Recurrent Neural Networks as Language Model
• Forward Inference
• Input sequence X = [x1;...;xt;...;xN]
• at time t:
• et = E.xt
• ht = g(U.ht-1 + W.et)
• y_t = softmax(V·h_t)
  y_t[i] = P(w_{t+1} = i | w_1,…,w_t); i = 1..|V|: the probability that a particular word i in the vocabulary is the next word
• Training an RNN LM
• Self-supervision algorithm from a corpus of text (without extra labels)
• Cross-entropy loss:
• At time t: L_CE(ŷ_t, y_t) = −log ŷ_t[w_{t+1}] = −log P(w_{t+1} | w_1, …, w_t)
• The final loss = the average L_CE over the training sequence
• SGD: give the model the correct history sequence to predict the next word: "teacher forcing"
• At each word position t of the input, take as input the correct sequence w_{1:t}, estimate the
  probability of token w_{t+1} => compute the model's loss for the next token w_{t+1}
• Ignore what the model predicted for w_{t+1}; use the correct sequence w_{1:t+1} to estimate the
  probability of token w_{t+2}
• etc.
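A minimal teacher-forcing training step for an RNN LM, assuming PyTorch and random toy token ids; shifting the token sequence by one position yields the self-supervised targets:

```python
import torch
import torch.nn as nn

V, d, d_h = 100, 32, 64                     # toy vocabulary and layer sizes
embed = nn.Embedding(V, d)                  # embedding matrix E
rnn = nn.RNN(d, d_h, batch_first=True)
out = nn.Linear(d_h, V)                     # hidden-to-vocabulary weights V
loss_fn = nn.CrossEntropyLoss()
params = list(embed.parameters()) + list(rnn.parameters()) + list(out.parameters())
opt = torch.optim.SGD(params, lr=0.1)

tokens = torch.randint(0, V, (4, 21))            # a toy batch of token-id sequences
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # teacher forcing: the correct history
                                                 # w_{1:t} is the input, w_{t+1} the target
h, _ = rnn(embed(inputs))
logits = out(h)                                  # one distribution over V per position
loss = loss_fn(logits.reshape(-1, V), targets.reshape(-1))   # average cross-entropy
opt.zero_grad(); loss.backward(); opt.step()
```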
27
Recurrent Neural Networks as Language Model
• Training an RNN LM
• …
• Weight tying :
• Use the embedding matrix E as the weights from the hidden layer to the output layer:
• e_t = E·x_t
• h_t = g(U·h_{t−1} + W·e_t)
• y_t = softmax(Eᵀ·h_t)
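A minimal PyTorch sketch of weight tying, assuming the hidden size equals the embedding size so the shapes line up:

```python
import torch.nn as nn

V, d = 100, 64                       # tying requires hidden size d_h == embedding size d
embed = nn.Embedding(V, d)           # E, stored as a |V| x d matrix
out = nn.Linear(d, V, bias=False)    # hidden-to-output layer
out.weight = embed.weight            # weight tying: the output layer reuses E,
                                     # so the logits for y_t are computed from E and h_t
```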
28
RNNs for other NLP tasks
• Sequence Labeling
• Inputs: pre-trained word embeddings
• Outputs: tag probabilities
• run forward inference over the input sequence and select the most likely tag from
the softmax at each step
29
RNNs for other NLP tasks
• Text Classification
• Use a simple RNN combined with a feedforward network
• Pass the text through the RNN a word at a time generating a new hidden layer at
each time step.
• Construct a compressed representation of the entire sequence:
• By taking the hidden layer for the last token hn
• Or pooling of all the hidden states hi for each word i in the sequence.
• Pass the entire sequence representation to a feedforward network that makes a
  prediction via a softmax over the possible classes.
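A minimal PyTorch sketch of this classifier, using the last hidden state as the sequence representation (mean pooling is noted as an alternative); the vocabulary size, layer sizes, and class count are toy assumptions:

```python
import torch
import torch.nn as nn

class RNNClassifier(nn.Module):
    def __init__(self, vocab_size=100, d=32, d_h=64, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        self.rnn = nn.RNN(d, d_h, batch_first=True)
        self.ffn = nn.Linear(d_h, n_classes)       # feedforward "head"

    def forward(self, token_ids):
        h_all, _ = self.rnn(self.embed(token_ids))
        h_last = h_all[:, -1, :]        # hidden state for the last token, h_n
        # (alternative: pool all hidden states, e.g. h_all.mean(dim=1))
        return self.ffn(h_last)         # logits; a softmax gives class probabilities

model = RNNClassifier()
print(model(torch.randint(0, 100, (3, 12))).shape)   # (3 texts, 2 classes)
```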
30
RNNs for other NLP tasks
• Text Generation
• Tasks: question answering, machine translation, text summarization,
grammar correction, story generation, conversational dialogue, etc.
• Autoregressive generation using a language model
• Start from the beginning-of-sentence marker <s>
• and/or use additional task-appropriate context
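A minimal sketch of the autoregressive loop, assuming a PyTorch language model that maps a (1, t) tensor of token ids to per-position logits of shape (1, t, |V|); `bos_id` and `eos_id` are hypothetical ids for the <s> and end-of-sequence markers:

```python
import torch

def generate(model, bos_id, eos_id, max_len=50):
    """Autoregressive generation with a trained language model.

    `model(token_ids)` is assumed to return one logit vector per position."""
    tokens = [bos_id]                                   # start from <s>
    for _ in range(max_len):
        logits = model(torch.tensor([tokens]))[0, -1]   # distribution over the next word
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, 1).item()    # sample the next token
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]
```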
31
9.3. Other architectures
32
9.3.1 The LSTM
“When she tried to print her tickets, she found that the printer was out
of toner. She went to the stationery store to buy more toner. It was
very overpriced. After installing the toner into the printer, she finally
printed her ________ “
33
9.3.1 The LSTM
• Takes into account both long-term memory and short-term memory
• Helps avoid the vanishing gradient problem
• Has become the standard unit for modern systems that make use of recurrent networks
• Modularity enables the widespread applicability of different neural units in different architectures
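A minimal NumPy sketch of one LSTM step in the standard gated formulation (bias terms are omitted for brevity, and the parameter shapes are toy choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step in the standard gated formulation (biases omitted)."""
    (Wf, Uf, Wi, Ui, Wo, Uo, Wg, Ug) = params
    f = sigmoid(Uf @ h_prev + Wf @ x_t)      # forget gate: what to erase from memory
    i = sigmoid(Ui @ h_prev + Wi @ x_t)      # input gate: what to add to memory
    o = sigmoid(Uo @ h_prev + Wo @ x_t)      # output gate: what to expose
    g = np.tanh(Ug @ h_prev + Wg @ x_t)      # candidate memory content
    c_t = f * c_prev + i * g                 # long-term memory (cell state)
    h_t = o * np.tanh(c_t)                   # short-term memory (hidden state)
    return h_t, c_t

d_in, d_h = 4, 6
rng = np.random.default_rng(0)
# even-indexed params are W (d_h x d_in), odd-indexed are U (d_h x d_h)
params = tuple(rng.normal(size=(d_h, d_in if k % 2 == 0 else d_h)) for k in range(8))
h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_step(rng.normal(size=d_in), h, c, params)
print(h)
```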
35
9.3.1 The LSTM
36
9.3.1 The LSTM
37
9.3.2. Advanced RNN architectures
• Bidirectional RNNs
• Take advantage of context to the right of the current input
• Combine the outputs: concatenation, element-wise addition, or multiplication
38
9.3.2. Advanced RNN architectures
• Stacked (deep) RNNs: differing levels of feature abstraction, from low to high
39
9.3.3 Encoder-Decoder Model
(Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014)
41
The Encoder-Decoder Model with RNNs
• Encoder: perform forward inference to generate hidden states until the end of the source is reached;
  the decoder then begins autoregressive generation until an end-of-sequence marker is generated
• c = h^e_n
• h^d_0 = c
• h^d_t = g(ŷ_{t−1}, h^d_{t−1}, c)
• z_t = f(h^d_t)
• y_t = softmax(z_t)
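A minimal NumPy sketch of these equations with greedy decoding, assuming tanh RNN cells, a separate matrix C_d for conditioning on c, and toy random parameters (a real model would of course be trained):

```python
import numpy as np

def encode(src_embeddings, U_e, W_e):
    # run the encoder RNN over the source; the final hidden state becomes c
    h = np.zeros(U_e.shape[0])
    for e_t in src_embeddings:
        h = np.tanh(U_e @ h + W_e @ e_t)
    return h                                  # c = h^e_n

def decode(c, E_dec, U_d, W_d, C_d, V_out, bos_id, eos_id, max_len=20):
    # h^d_0 = c; h^d_t depends on the previous output, h^d_{t-1}, and c
    h, y_prev, out = c, bos_id, []
    for _ in range(max_len):
        h = np.tanh(U_d @ h + W_d @ E_dec[y_prev] + C_d @ c)
        z = V_out @ h
        p = np.exp(z - z.max()); p /= p.sum()  # y_t = softmax(z_t)
        y_prev = int(p.argmax())               # greedy choice of the next token
        if y_prev == eos_id:
            break
        out.append(y_prev)
    return out

rng = np.random.default_rng(0)
d, d_h, V = 4, 6, 10
src = rng.normal(size=(5, d))                  # toy source embeddings
c = encode(src, rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d)))
print(decode(c, rng.normal(size=(V, d)), rng.normal(size=(d_h, d_h)),
             rng.normal(size=(d_h, d)), rng.normal(size=(d_h, d_h)),
             rng.normal(size=(V, d_h)), bos_id=1, eos_id=0))
```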
42
The Encoder-Decoder Model with RNNs
• Attention mechanism:
• Allows the decoder to get information from all the hidden states of the
  encoder: c = f(h^e_1, …, h^e_n)
• Or the weights can "attend to" a particular part of the source text, with a
  context vector c_i for each decoding step i. Then: h^d_i = g(ŷ_{i−1}, h^d_{i−1}, c_i)
• c_i can be different, dynamic, for each step
• Attention mechanism:
• Computing c_i:
• How relevant each encoder state is to the decoder state
• Using a score(h^d_{i−1}, h^e_j) for each encoder state j
• Dot-product attention:
• score(h^d_{i−1}, h^e_j) = h^d_{i−1} · h^e_j
• Normalize with a softmax to create a vector of weights:
  α_{ij} = softmax(score(h^d_{i−1}, h^e_j)  ∀ j ∈ e)
• c_i = Σ_j α_{ij} h^e_j
• Scoring functions for attention:
• score(h^d_{i−1}, h^e_j) = h^d_{i−1} W_s h^e_j
• …
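A minimal NumPy sketch of dot-product attention for one decoder step (toy random states):

```python
import numpy as np

def dot_product_attention(h_dec_prev, H_enc):
    """h_dec_prev: decoder state h^d_{i-1}, shape (d_h,); H_enc: all encoder states, shape (n, d_h)."""
    scores = H_enc @ h_dec_prev               # score(h^d_{i-1}, h^e_j) = h^d_{i-1} . h^e_j
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                      # alpha_ij = softmax over the encoder positions j
    c_i = alpha @ H_enc                       # c_i = sum_j alpha_ij h^e_j
    return c_i, alpha

rng = np.random.default_rng(0)
c_i, alpha = dot_product_attention(rng.normal(size=6), rng.normal(size=(5, 6)))
print(alpha)                                  # attention weights over the 5 encoder states
print(c_i)                                    # the resulting context vector
```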
44
• end of Chapter 9
45