
Natural Language Processing

AC3110E

1
Chapter 9: Deep Learning in NLP

Lecturer: PhD. DO Thi Ngoc Diep


SCHOOL OF ELECTRICAL AND ELECTRONIC ENGINEERING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
Introduction

• Natural language processing has historically focused on linear classification


• Later, rapid advances in deep learning made nonlinear classifiers more
popular; they are now the default approach for many NLP tasks

Highlights in natural language processing research

Kamath, U., Liu, J., & Whitaker, J. (2019). Deep learning for NLP
and speech recognition (Vol. 84). Cham, Switzerland: Springer.

3
Introduction

• Neural nets:
• Feedforward network
• Recurrent neural networks
• Transformer
• etc.
• Neural nets applications in
• Classification
• Language Modeling
• Other NLP tasks

Slide Reference:
+ Jurafsky, Daniel, and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing,
Computational Linguistics, and Speech Recognition
+ CS224N: Natural Language Processing with Deep Learning, Stanford / Winter 2023

4
9.1. Feedforward Neural Networks

5
Building block of a Neural Network

• Computational unit:
• set of corresponding weights w=w1...wn and a bias b
• Input: a vector x
• Output: y = a = f(z) = f(w.x + b)
(A binary logistic regression unit is similar to a neuron.)
• f: activation function
• a: activation value
• Activation functions:
• Sigmoid
• Tanh
• ReLU, Leaky ReLU, GELU
• Etc.

• A neural network
• running several logistic regressions at the same time

6
9.1.1 Feedforward Neural Networks

• The simplest kind of neural network


• Multilayer network
• Outputs from units in each layer are passed to units in the next higher layer
• No outputs are passed back to lower layers (no cycles)
• fully-connected

• Input layer: x
• A hidden layer:
• Weight matrix W of shape [dh × d0]
• Bias vector b
• Output vector h = g(W.x + b):
forms a representation of the input
• Output layer:
• Weight matrix U of shape [dout × dh]
• Output vector y = softmax(z); z = U.h


• y : probability distribution
across the output nodes.
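
A minimal sketch of this forward pass, h = g(W.x + b) followed by y = softmax(U.h), assuming arbitrary layer sizes and random weights:

import numpy as np

def softmax(z):
    # Numerically stable softmax: subtract the max before exponentiating
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
d_in, d_h, d_out = 4, 5, 3           # input, hidden, output dimensions (arbitrary)

x = rng.normal(size=d_in)            # input vector
W = rng.normal(size=(d_h, d_in))     # hidden-layer weights
b = np.zeros(d_h)                    # hidden-layer bias
U = rng.normal(size=(d_out, d_h))    # output-layer weights

h = np.maximum(0.0, W @ x + b)       # hidden representation (ReLU as g)
z = U @ h
y = softmax(z)                       # probability distribution over output nodes
print(y, y.sum())                    # the probabilities sum to 1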

7
9.1.1 Feedforward Neural Networks

• Deeper networks => create a highly non-linear classifier
in terms of the original inputs
• Input layer: x=a[0]
• Each layer i:
• Weight matrix W[i]
• Bias vector b[i]
• Output from previous layer = a[i-1]
• z[i]=W[i].a[i-1]+ b[i]
• Output from this layer: a[i]=g[i](z[i])
• Output :
• Output vector y = a[n]
• Activation functions g(·): non-linear functions
• Internal layers: might be ReLU or tanh
• Output layer:
• sigmoid for binary classification
• softmax for multinomial classification

Replacing the bias node with x0
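
The layer recursion z[i] = W[i].a[i-1] + b[i], a[i] = g[i](z[i]) can be written as a short loop; a minimal sketch with randomly initialized toy layers:

import numpy as np

rng = np.random.default_rng(1)
sizes = [4, 8, 8, 3]                           # [input, hidden..., output] sizes (arbitrary)
layers = [(rng.normal(size=(m, n)) * 0.1, np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

a = rng.normal(size=sizes[0])                  # a[0] = x
for i, (W, b) in enumerate(layers):
    z = W @ a + b                              # z[i] = W[i].a[i-1] + b[i]
    if i < len(layers) - 1:
        a = np.maximum(0.0, z)                 # ReLU for the internal layers
    else:
        a = np.exp(z - z.max()); a /= a.sum()  # softmax at the output layer
print(a)                                       # y = a[n], a distribution over classes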


8
Training Feedforward Neural Network

• Training
• Supervised machine learning:
• y: true value for input x; ŷ: value estimated by the network
• Learn parameters W[i] and b[i] for each layer i that make ŷ for each training observation as close as
possible to the true y.
• The cross-entropy loss:
• Binary classifier: LCE(ŷ, y) = −log P(y|x) = −[y log ŷ + (1−y) log(1−ŷ)]
• Multi-class classifier: LCE(ŷ, y) = −Σk yk log ŷk = −log ŷc = −log p̂(yc = 1|x) = −log( exp(zc) / Σj exp(zj) )
(where c is the correct class; also called the negative log-likelihood loss; a small code sketch of these losses follows this list)
• Gradient of the loss function in a deep network
• Computing the gradient of the loss with respect to each of the many parameters
• => Error backpropagation algorithm on the computation graph:
• Based on Backward differentiation on computation graphs
• Makes use of the chain rule to do backward computation of the gradients: re-use derivatives
computed for higher layers in computing derivatives for lower layers to minimize computation
• Propagate gradients back to all the weight nodes.
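
A small sketch of the two cross-entropy losses above, with hypothetical predicted probabilities:

import numpy as np

def binary_ce(y_hat, y):
    # LCE = -[y log(y_hat) + (1 - y) log(1 - y_hat)]
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def multiclass_ce(y_hat, c):
    # LCE = -log(y_hat[c]), where c is the index of the correct class
    return -np.log(y_hat[c])

print(binary_ce(0.8, 1))                          # small loss: confident and correct
print(binary_ce(0.8, 0))                          # large loss: confident and wrong
print(multiclass_ce(np.array([0.1, 0.7, 0.2]), c=1))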

9
Training Feedforward Neural Network

• Training:
• Forward propagation: calculate ŷ given an input x (save intermediate values)
• Backward propagation: calculate the prediction error ŷ − y, recursively apply the chain
rule along the computation graph to compute gradients, and update the weight matrices to
minimize the prediction error.
• Optimization in neural networks is more complex than for logistic regression
• Need to initialize the weights with small random numbers
• Forms of regularization to prevent overfitting
• Dropout
• Tuning of hyper-parameters (chosen by the algorithm designer) on devset
• Learning rate η
• Mini-batch size
• The model architecture (the number of layers, the number of hidden nodes per layer, the choice of
activation functions)
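
A minimal sketch of one such training step for a single one-hidden-layer example, with the gradients written out by hand (illustrative sizes, random initialization, and a made-up training example):

import numpy as np

rng = np.random.default_rng(2)
d_in, d_h, d_out, lr = 4, 6, 3, 0.1              # illustrative sizes and learning rate
W1, b1 = rng.normal(size=(d_h, d_in)) * 0.1, np.zeros(d_h)
W2, b2 = rng.normal(size=(d_out, d_h)) * 0.1, np.zeros(d_out)

x, c = rng.normal(size=d_in), 2                  # one training example, gold class c

# Forward pass (save intermediate values for the backward pass)
h = np.maximum(0.0, W1 @ x + b1)
z = W2 @ h + b2
y_hat = np.exp(z - z.max()); y_hat /= y_hat.sum()
loss = -np.log(y_hat[c])                         # cross-entropy loss

# Backward pass (chain rule; dL/dz = y_hat - one_hot(c) for softmax + CE)
dz = y_hat.copy(); dz[c] -= 1.0
dW2, db2 = np.outer(dz, h), dz
dh = W2.T @ dz
dh[h <= 0.0] = 0.0                               # ReLU gradient
dW1, db1 = np.outer(dh, x), dh

# SGD parameter update
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2
print(loss)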

10
Word2Vec
Skip gram neural network architecture

• Input word: a one-hot vector


• Output: single vector containing the probability distribution for target words
• 1 hidden layer with no activation function
• Output layer uses softmax (vanilla Skip gram), sigmoid (negative sampling)

(Figure: skip-gram architecture with the two weight matrices Winput and Woutput; Winput is used as the word-embedding matrix, and one of its dimensions is the size of the embedding vector.)
11
9.1.2 Feedforward Neural Networks in Text classification

• Sentiment classifier
• Feedforward Neural Networks :
• Using traditional hand-built features of the input text

12
9.1.2 Feedforward Neural Networks in Text classification

• Sentiment classifier
• Feedforward Neural Networks :
• Learning features from the data:
• Using pretrained embedding representations
• Apply some sort of pooling function to the embeddings of all the words in the input.
• E.g., taking the element-wise max, or mean pooling: x_mean = (1/n) Σ_{i=1..n} embedding(wi)
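
A minimal sketch of mean and element-wise-max pooling; random vectors stand in for real pretrained embeddings:

import numpy as np

rng = np.random.default_rng(4)
# Stand-in for looking up pretrained embeddings of the words in one input text
embeddings = rng.normal(size=(7, 50))     # 7 words, 50-dimensional embeddings

x_mean = embeddings.mean(axis=0)          # mean pooling: average over the words
x_max = embeddings.max(axis=0)            # element-wise max pooling
print(x_mean.shape, x_max.shape)          # both are single 50-d feature vectors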

13
9.1.3 Feedforward Neural Networks as Language Model

• Neural language models:


• Pros: Handle much longer histories, can generalize better over contexts of similar
words, more accurate at word-prediction
• Use word embeddings, rather than word identity, allows neural language models to
generalize better to unseen data
• Ex:
• In training data: have “make sure that the cat gets fed” but none of “dog gets fed”
• In test: “make sure that the dog gets ...”. Predict the next word ?
• An n-gram LM cannot predict (or assigns a very low probability to) “fed”, because “dog
gets fed” never appeared in the training data
• A neural LM: “cat” and “dog” have similar embeddings => the neural LM can generalize
from the “cat” context to assign a high enough probability to “fed”.
• Cons: much more complex, slower, need more energy to train, less interpretable than
n-gram models

14
9.1.3 Feedforward Neural Networks as Language Model

• A fixed-window neural language model: the Neural Probabilistic Language Model (NPLM; Bengio et al., 2003)

(Figure: NPLM predicting the next word of “thanks for all the ...”: input words / one-hot vectors → concatenated word embeddings → hidden layer → softmax layer producing the output distribution.)
Bengio et al. (2003) 15
9.1.3 Feedforward Neural Networks as Language Model

• Forward inference/decoding:
• At time t-1, given an input wt-N...wt-2wt-1, estimate the probability distribution over all
possible outputs for the next word wt: P(wt = i|wt-N...wt-2wt-1 ); i = 1..|V|

• One-hot vector for each word wi: xi, of shape [|V| × 1]
• Embedding weight matrix E of shape [d × |V|]: one column per word
• ei: embedding for wi, ei = E.xi
• e: concatenation of the N context embeddings ei, of shape [N·d × 1]
• Output vector y of shape [|V| × 1]

16
9.1.3 Feedforward Neural Networks as Language Model

• Forward inference/decoding:
• At time t-1, given an input wt-N...wt-2wt-1, estimate the probability distribution over all
possible outputs for the next word wt: P(wt = i|wt-N...wt-2wt-1 ); i = 1..|V|

• e=[E.x[t-N],..., E.x[t-2], E.x[t-1]]


• h = σ(W.e+b)
• z = U.h
• y = softmax(z)
y[i] = P(wt=i|wt-N...wt-2wt-1)
i = 1..|V|
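
A minimal sketch of this forward pass (concatenated embeddings e, then h = σ(W.e + b), y = softmax(U.h)), with toy sizes and random parameters:

import numpy as np

rng = np.random.default_rng(5)
V, d, N, d_h = 20, 8, 3, 16                      # toy vocab, embedding size, window, hidden size
E = rng.normal(size=(d, V)) * 0.1                # embedding matrix, one column per word
W = rng.normal(size=(d_h, N * d)) * 0.1
b = np.zeros(d_h)
U = rng.normal(size=(V, d_h)) * 0.1

context = [4, 11, 2]                             # indices of w_{t-3}, w_{t-2}, w_{t-1}
e = np.concatenate([E[:, i] for i in context])   # e = [E.x[t-N]; ...; E.x[t-1]]
h = 1.0 / (1.0 + np.exp(-(W @ e + b)))           # h = sigma(W.e + b)
z = U @ h
y = np.exp(z - z.max()); y /= y.sum()            # y[i] = P(wt = i | context)
print(y.argmax(), y.sum())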

17
9.1.3 Feedforward Neural Networks as Language Model

• Training the neural language model


• Freeze the embedding layer E: only modify W, U, and b
• or Learn the embeddings simultaneously with training: θ = E,W,U,b.
• as predicting upcoming words, learn the embeddings E for each word that
best predict upcoming words
• the embedding matrix E is shared among the context words
• Take input as a very long text concatenating all the sentences
• Start with random weights
• Iteratively move through the text to predict each word wt
• At each word wt, update the parameters using stochastic gradient descent
• Loss function: cross-entropy (CE), as usual for language modeling
• Parameter update (stochastic gradient descent):
θ ← θ − η ∂[−log p(wt | wt-N, …, wt-1)] / ∂θ ,  θ = E, W, U, b

18
9.1.3 Feedforward Neural Networks as Language Model

• Improvements over n-gram LM:


• No sparsity problem
• Don’t need to store all observed n-grams
• Problems:
• Fixed window is too small
• Window can never be large enough!
• Each input is multiplied by completely different weights: No symmetry
• => need a neural architecture that can process any length input

19
9.2. Recurrent neural networks

20
9.2.1 Recurrent neural networks

• RNN contains a cycle within its network connections


• The hidden layer includes a recurrent connection as part of its input
• The activation value of the hidden layer depends on the current input as well as the
activation value of the hidden layer from the previous time step
• Can handle the temporal nature of language (long context)
• The prior (very long, even back to the beginning of the sequence) context can be
represented by recurrent connections

New set of weights U: connect the hidden layer from the


previous time step to the current hidden layer

21
9.2.1 Recurrent neural networks

• Inference
• at time t, compute an output yt for an input xt
• ht = g(U.ht-1 + W.xt)
• yt= f(V.ht)
• xt, ht, yt: vectors of dimensions din, dh, dout respectively
• W of shape [dh × din], U of shape [dh × dh], V of shape [dout × dh]

function FORWARDRNN(x, network) returns output sequence y


h0 ←0
for i←1 to LENGTH(x) do // U, V and W are shared across time
hi ← g(U.hi−1 + W.xi) // Model size doesn’t increase for longer
yi ← f(V.hi) // input context
return y
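
The same forward pass in plain NumPy, assuming tanh for g and softmax for f (toy dimensions, random shared weights):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward_rnn(X, U, W, V):
    # X: sequence of input vectors; U, W, V are shared across time steps
    h = np.zeros(U.shape[0])                  # h0 = 0
    ys = []
    for x_t in X:
        h = np.tanh(U @ h + W @ x_t)          # ht = g(U.ht-1 + W.xt)
        ys.append(softmax(V @ h))             # yt = f(V.ht)
    return ys

rng = np.random.default_rng(6)
d_in, d_h, d_out, T = 5, 8, 4, 6              # toy dimensions and sequence length
U = rng.normal(size=(d_h, d_h)) * 0.1
W = rng.normal(size=(d_h, d_in)) * 0.1
V = rng.normal(size=(d_out, d_h)) * 0.1
X = rng.normal(size=(T, d_in))
print(len(forward_rnn(X, U, W, V)))           # one output distribution per time step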

22
9.2.1 Recurrent neural networks

• Training
• Use backpropagation through time
• The first pass:
• perform forward inference, computing and saving ht, yt at each step
• accumulating the loss at each step
• The second pass:
• for each step backward i = t, …, 0, compute the required gradients by summing gradients as you go:
∂J(t)/∂U = Σ_{i=1..t} ∂J(t)/∂U |(i)
• saving the gradients for the next use
• => vanishing gradients can occur
• the gradient gets smaller and smaller as it backpropagates further
• model weights are updated only with respect to near effects, not long-term effects
• Unrolling a recurrent network into a feedforward computational graph
• for longer input sequences, unroll the input into manageable fixed-length segments and
treat each segment as a distinct training item
• training via ordinary backpropagation

23
9.2.2 Common applications for RNNs in NLP

• Probabilistic language modeling


• Assigning a probability to a sequence, or to the next element of a sequence given the
preceding words.
• Prediction problems
• Auto-regressive generation
• Text generation
• Sequence labeling
• Each element of a sequence is assigned a label.
• POS tagging, NER, etc.
• Sequence classification
• An entire text is assigned to a category
• Spam detection, Sentiment analysis, Topic classification, etc.
• Encoder-decoder architectures
• An input is mapped to an output of different length and alignment.
• Machine translation, Text summarization, etc.

24
Recurrent Neural Networks as Language Model

• RNN language models


• predict the next word from the current word and the previous hidden state
• hidden state from the previous time step can represent information about all of the
preceding words (even from the beginning of the sequence).

Apply the same weights W, U at every timestep

Can process any length input


(Mikolov et al., 2010) Chris Manning, CS224N 25
Recurrent Neural Networks as Language Model

• Forward Inference
• Input sequence X = [x1;...;xt;...;xN]
• at time t:
• et = E.xt
• ht = g(U.ht-1 + W.et)
• yt = softmax(V.ht)
yt[i] = P(wt+1 = i | w1, …, wt), i = 1..|V|   (the probability that word i in the vocabulary is the next word)

(Mikolov et al., 2010) 26


Recurrent Neural Networks as Language Model

• Training an RNN LM
• Self-supervision algorithm from a corpus of text (without extra labels)
• Cross-entropy loss:
• At time t: LCE(ŷt, yt) = −log ŷt[wt+1] = −log P(wt+1 | w1, …, wt)
• The final loss = average LCE over the training sequence
• SGD: give the model the correct history sequence to predict the next word: “Teacher
forcing”
• At each word position t of the input, takes as input the correct sequence w1:t, estimate the
probability of token wt+1 => compute the model’s loss for the next token wt+1
• Ignore what the model predicted for wt+1, use the correct sequence w1:t+1 to estimate the
probability of token wt+2.
• etc.
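
A minimal sketch of teacher forcing: at every position the model is conditioned on the correct history, and the loss is the negative log probability it assigns to the correct next token. The predict_next function is a hypothetical stand-in for a trained RNN LM:

import numpy as np

def teacher_forcing_loss(tokens, predict_next):
    # tokens: gold token ids w_1 ... w_T
    # predict_next(history): distribution over the vocabulary for the next token
    losses = []
    for t in range(len(tokens) - 1):
        y_hat = predict_next(tokens[:t + 1])          # condition on the CORRECT history w_1:t
        losses.append(-np.log(y_hat[tokens[t + 1]]))  # -log P(w_{t+1} | w_1..w_t)
    return np.mean(losses)                            # final loss = average LCE over the sequence

# Toy stand-in LM: uniform distribution over a vocabulary of 10 tokens
uniform_lm = lambda history: np.full(10, 0.1)
print(teacher_forcing_loss([3, 1, 4, 1, 5], uniform_lm))  # = -log(0.1)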

27
Recurrent Neural Networks as Language Model

• Training an RNN LM
• …
• Weight tying :
• Use Embedding matrix as the weights from the hidden layer to the output layer.
• et = E.xt
• ht = g(U.ht-1 + W.et)
• yt= softmax(ET.ht)
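
A minimal sketch of weight tying: the same matrix E embeds the input (E.xt) and, transposed, produces the output scores (ET.ht), so no separate output matrix V is needed (toy sizes, random weights):

import numpy as np

rng = np.random.default_rng(7)
V_size, d = 12, 6                       # tying requires hidden size = embedding size d
E = rng.normal(size=(d, V_size)) * 0.1  # shared input/output embedding matrix
U = rng.normal(size=(d, d)) * 0.1
W = rng.normal(size=(d, d)) * 0.1

h = np.zeros(d)
for w_t in [2, 9, 5]:                   # toy word-id sequence
    e_t = E[:, w_t]                     # et = E.xt (column lookup)
    h = np.tanh(U @ h + W @ e_t)        # ht = g(U.ht-1 + W.et)
    z = E.T @ h                         # tied output layer: scores = ET.ht
    y = np.exp(z - z.max()); y /= y.sum()
print(y.shape)                          # distribution over the |V| vocabulary words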

28
RNNs for other NLP tasks

• Sequence Labeling
• Inputs: pre-trained word embeddings
• Outputs: tag probabilities
• run forward inference over the input sequence and select the most likely tag from
the softmax at each step

29
RNNs for other NLP tasks

• Text Classification
• Use a simple RNN combined with a feedforward network
• Pass the text through the RNN a word at a time generating a new hidden layer at
each time step.
• Form a compressed representation of the entire sequence:
• By taking the hidden layer for the last token hn
• Or pooling of all the hidden states hi for each word i in the sequence.
• Pass the sequence representation to a feedforward network that makes a prediction via a
softmax over the possible classes.
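
A minimal sketch of this pipeline: run the RNN over the word embeddings, take the last hidden state (or a mean over all hidden states) as the sequence representation, and feed it to a small feedforward classifier (toy sizes, random weights):

import numpy as np

rng = np.random.default_rng(8)
d_emb, d_h, n_classes, T = 10, 16, 3, 9        # toy dimensions
W = rng.normal(size=(d_h, d_emb)) * 0.1
U = rng.normal(size=(d_h, d_h)) * 0.1
W_out = rng.normal(size=(n_classes, d_h)) * 0.1

embeddings = rng.normal(size=(T, d_emb))       # pre-trained embeddings of the input words
h = np.zeros(d_h)
hs = []
for e_t in embeddings:
    h = np.tanh(U @ h + W @ e_t)               # new hidden layer at each time step
    hs.append(h)

x_seq = hs[-1]                                 # last hidden state hn ...
# x_seq = np.mean(hs, axis=0)                  # ... or mean pooling over all hi

z = W_out @ x_seq                              # feedforward classifier on top
probs = np.exp(z - z.max()); probs /= probs.sum()
print(probs.argmax())                          # predicted class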

30
RNNs for other NLP tasks

• Text Generation
• Tasks: question answering, machine translation, text summarization,
grammar correction, story generation, conversational dialogue, etc.
• Autoregressive generation using a language model
• Start from the beginning of sentence marker <s>
• and/or use additional task-appropriate context
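
A minimal sketch of autoregressive (greedy) generation with a language model: start from <s>, repeatedly take the most probable next word, and stop at </s> or a length limit. The next_word_distribution function is a hypothetical stand-in for a trained LM:

import numpy as np

def generate(next_word_distribution, bos, eos, max_len=20):
    # Greedy autoregressive generation: each generated word is fed back as input
    tokens = [bos]
    while len(tokens) < max_len:
        probs = next_word_distribution(tokens)   # P(next word | tokens so far)
        nxt = int(np.argmax(probs))              # greedy choice (could also sample)
        tokens.append(nxt)
        if nxt == eos:
            break
    return tokens

# Toy stand-in LM over a 6-word vocabulary; id 5 plays the role of </s>
rng = np.random.default_rng(9)
toy_lm = lambda tokens: rng.dirichlet(np.ones(6))
print(generate(toy_lm, bos=0, eos=5))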

31
9.3. Other architectures

32
9.3.1 The LSTM

“When she tried to print her tickets, she found that the printer was out
of toner. She went to the stationery store to buy more toner. It was
very overpriced. After installing the toner into the printer, she finally
printed her ________ “

• Vanishing gradient problem:


• model weights are updated only with respect to near effects, not long-term effects
• unable to predict long-distance dependencies at test time
• => Difficult for the RNN to learn to preserve information over many timesteps.
• In a vanilla RNN, the hidden state is constantly being rewritten
• Design an RNN with separate memory which is added to?

33
9.3.1 The LSTM

• Long short-term memory (LSTM) network:


• maintains relevant context over time:
• to forget/remove information that is no longer needed
• to remember/add information required for decisions still to come
• adds an explicit context to the architecture
• LSTM neural units

(Figure: a computational unit in a feedforward network, in a simple RNN, and an LSTM unit.)

• Take into account both long term memory and short term memory
• Avoid the vanishing gradient
• Become the standard unit for modern system that makes use of recurrent networks
• Modularity enables the widespread applicability of different neural units in different architectures

(Hochreiter and Schmidhuber, 1997), (Gers et al., 2000) 34


9.3.1 The LSTM

• A single LSTM unit:


• On step t: a hidden state ht and a cell state ct
• The cell stores long-term information
• The LSTM can read, erase, and write information from the cell
• The gates: control which information is erased/written/read

35
9.3.1 The LSTM

• A single LSTM unit


• Actual information (new content to be written to the cell):
gt = tanh(Ug.ht-1 + Wg.xt)
• Forget gate: what is kept vs. forgotten from the previous cell state:
ft = σ(Uf.ht-1 + Wf.xt)
• Input gate: which parts of the new cell content are written to the cell:
it = σ(Ui.ht-1 + Wi.xt)
• Output gate: which parts of the cell are output to the hidden state:
ot = σ(Uo.ht-1 + Wo.xt)
(All the gates are vectors of the same length.)
• New context (cell) vector: erase (“forget”) some content from the last cell state and write (“input”) some new content:
ct = ft ⊙ ct-1 + it ⊙ gt
• Hidden layer value: read (“output”) some content from the cell:
ht = ot ⊙ tanh(ct)
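
A minimal sketch of one LSTM step implementing the four gates and the cell/hidden updates above (toy sizes, random weights, biases omitted as in the equations):

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    Ug, Wg, Uf, Wf, Ui, Wi, Uo, Wo = params
    g = np.tanh(Ug @ h_prev + Wg @ x_t)      # candidate new cell content
    f = sigmoid(Uf @ h_prev + Wf @ x_t)      # forget gate
    i = sigmoid(Ui @ h_prev + Wi @ x_t)      # input gate
    o = sigmoid(Uo @ h_prev + Wo @ x_t)      # output gate
    c = f * c_prev + i * g                   # ct: erase some old, write some new content
    h = o * np.tanh(c)                       # ht: read some content from the cell
    return h, c

rng = np.random.default_rng(10)
d_in, d_h = 5, 8                             # toy dimensions
params = tuple(rng.normal(size=(d_h, d)) * 0.1
               for d in (d_h, d_in) * 4)     # (Ug, Wg, Uf, Wf, Ui, Wi, Uo, Wo)
h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_step(rng.normal(size=d_in), h, c, params)
print(h.shape, c.shape)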

36
9.3.1 The LSTM

• LSTMs with pre-trained word-embeddings applied many common tasks:


• part-of-speech tagging (Ling et al., 2015)
• syntactic chunking (Søgaard and Goldberg, 2016)
• named entity recognition (Chiu and Nichols, 2016; Ma and Hovy, 2016)
• opinion mining (Irsoy and Cardie, 2014)
• semantic role labeling (Zhou and Xu, 2015a)
• etc...

37
9.3.2. Advanced RNN architectures

• Bidirectional RNNs
• Take advantage of context to the right of the current input
• Combine the output: concatenate,
element-wise addition or multiplication

• Combine the hidden layer values:

• Effective for sequence classification


• Only applicable if have access to the
entire input sequence
• not applicable to Language Modeling
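
A minimal sketch of combining the two directions: run one RNN left-to-right and another right-to-left over the same inputs, then concatenate the hidden states at each position (a hypothetical run_rnn helper with random weights):

import numpy as np

def run_rnn(X, U, W):
    # Simple tanh RNN; returns the hidden state for every position
    h, hs = np.zeros(U.shape[0]), []
    for x_t in X:
        h = np.tanh(U @ h + W @ x_t)
        hs.append(h)
    return np.stack(hs)

rng = np.random.default_rng(11)
d_in, d_h, T = 5, 8, 7
Uf, Wf = rng.normal(size=(d_h, d_h)) * 0.1, rng.normal(size=(d_h, d_in)) * 0.1
Ub, Wb = rng.normal(size=(d_h, d_h)) * 0.1, rng.normal(size=(d_h, d_in)) * 0.1
X = rng.normal(size=(T, d_in))

h_fwd = run_rnn(X, Uf, Wf)                     # left-to-right pass
h_bwd = run_rnn(X[::-1], Ub, Wb)[::-1]         # right-to-left pass, re-aligned
h_bi = np.concatenate([h_fwd, h_bwd], axis=1)  # concatenated representation per position
print(h_bi.shape)                              # (T, 2 * d_h)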

38
9.3.2. Advanced RNN architectures

• Stacked (Multi-layer) RNNs


• using the entire sequence of outputs from one RNN as an input sequence to another
one

differing
levels of
feature
abstraction
from low to
high

39
9.3.3 Encoder-Decoder Model

• Encoder-Decoder Model (sequence-to-sequence networks)


• Encoder network takes an input sequence and creates a contextualized representation
of input.
• Context: function of contextualized representation, or whole sequence representation
• Context is then passed to a Decoder network which generates a task-specific output
sequence.
• => Can be different length

(Kalchbrenner and Blunsom, 2013); (Cho et al. 2014), (Sutskever et al. 2014)
40
9.3.3 Encoder-Decoder Model

• Encoder-Decoder Model (sequence-to-sequence networks)


• Sequence-to-sequence tasks in NLP:
• Machine Translation (text → text)
• Summarization (long text → short text)
• Dialogue (previous utterances → next utterance)
• Parsing (input text → parsed sequence)
• Code generation (natural language → Python code)
• etc.

(Kalchbrenner and Blunsom, 2013); (Cho et al. 2014), (Sutskever et al. 2014)
41
The Encoder-Decoder Model with RNNs

• Encoder-decoder networks for autoregressive generation


• Encoder RNN produces an encoding of the input sequence.
• Decoder RNN autoregressively generates output sequence, conditioned on encoding.

(Figure: perform forward inference with the encoder, generating hidden states until the end of the source is reached; then begin autoregressive generation with the decoder until an end-of-sequence marker is generated.)
• c = hᵉn
• hᵈ0 = c
• hᵈt = g(ŷt-1, hᵈt-1, c)
• zt = f(hᵈt)
• yt = softmax(zt)
42
The Encoder-Decoder Model with RNNs

• Attention mechanism:
• Allowing the decoder to get information from all the hidden states of the
encoder: c = f(hᵉ1, …, hᵉn)
• Or the weights can ‘attend to’ a particular part of the source text, with a
context vector ci for each decoding step i. Then: hᵈi = g(ŷi-1, hᵈi-1, ci)
• ci can be different (dynamic) for each step

(Bahdanau et al. 2015) 43


The Encoder-Decoder Model with RNNs

• Attention mechanism:
• Computing ci:
• How relevant each encoder state is to the current decoder state
• Using a score(hᵈi-1, hᵉj) for each encoder state j
• Dot-product attention:
• score(hᵈi-1, hᵉj) = hᵈi-1 · hᵉj
• normalize with a softmax to create a vector of weights:
αij = softmax(score(hᵈi-1, hᵉj))  for all j in the encoder
• ci = Σj αij hᵉj
• Other scoring functions for attention:
• score(hᵈi-1, hᵉj) = hᵈi-1 · Ws · hᵉj
• …
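
A minimal sketch of dot-product attention for one decoder step: score every encoder state against the previous decoder state, softmax the scores into weights α, and take the weighted sum as ci (random vectors stand in for real hidden states):

import numpy as np

def dot_product_attention(h_dec_prev, H_enc):
    # H_enc: (n, d) matrix of encoder hidden states; h_dec_prev: (d,) decoder state
    scores = H_enc @ h_dec_prev                     # score(hd_{i-1}, he_j) for every j
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                            # softmax over the encoder positions
    c_i = alpha @ H_enc                             # c_i = sum_j alpha_ij * he_j
    return c_i, alpha

rng = np.random.default_rng(12)
H_enc = rng.normal(size=(6, 8))                     # 6 encoder states of dimension 8
h_dec_prev = rng.normal(size=8)
c_i, alpha = dot_product_attention(h_dec_prev, H_enc)
print(c_i.shape, alpha.sum())                       # (8,), weights sum to 1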

44
• end of Chapter 9

45
