
Natural Language Processing

AC3110E

1
Chapter 9: Deep Learning in NLP

Lecturer: PhD. DO Thi Ngoc Diep


SCHOOL OF ELECTRICAL AND ELECTRONIC ENGINEERING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
Introduction

• Natural language processing has historically focused on linear classification


• Later, rapid advances in deep learning made nonlinear classifiers more
popular; they are now the default approach for many NLP tasks

Highlights in natural language processing research

Kamath, U., Liu, J., & Whitaker, J. (2019). Deep learning for NLP
and speech recognition (Vol. 84). Cham, Switzerland: Springer.

3
Introduction

• Neural nets:
• Feedforward network
• Recurrent neural networks
• Transformer
• etc.
• Neural nets applications in
• Classification
• Language Modeling
• Other NLP tasks

Slide Reference:
+ Jurafsky, Daniel, and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing,
Computational Linguistics, and Speech Recognition
+ CS224N: Natural Language Processing with Deep Learning, Stanford / Winter 2023

4
9.1. Feedforward Neural Networks

5
Building block of a Neural Network

• Computational unit:
• set of corresponding weights w=w1...wn and a bias b
• Input: a vector x
• Output: y = a = f(z) = f(w.x + b)
(A binary logistic regression unit is similar to a neuron.)
• f: activation function
• a: activation value
• Activation functions:
• Sigmoid
• Tanh
• ReLU, Leaky ReLU, GELU
• Etc.

• A neural network
• running several logistic regressions at the same time

6
9.1.1 Feedforward Neural Networks

• The simplest kind of neural network


• Multilayer network
• Outputs from units in each layer are passed to units in the next higher layer
• No outputs are passed back to lower layers (no cycles)
• fully-connected

• Input layer: x
• A hidden layer:
• Weight matrix W of shape [dh × d0]
• Bias vector b
• Output vector h = g(W.x + b):
forms a representation of the input
• Output layer:
• Weight matrix U of shape [dout × dh]
• Output vector y = softmax(z); z = U.h


• y : probability distribution
across the output nodes.
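
A minimal sketch of this forward pass, h = g(W.x + b) followed by y = softmax(U.h), assuming arbitrary layer sizes and random weights:

import numpy as np

def softmax(z):
    # Numerically stable softmax: subtract the max before exponentiating
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
d_in, d_h, d_out = 4, 5, 3           # input, hidden, output dimensions (arbitrary)

x = rng.normal(size=d_in)            # input vector
W = rng.normal(size=(d_h, d_in))     # hidden-layer weights
b = np.zeros(d_h)                    # hidden-layer bias
U = rng.normal(size=(d_out, d_h))    # output-layer weights

h = np.maximum(0.0, W @ x + b)       # hidden representation (ReLU as g)
z = U @ h
y = softmax(z)                       # probability distribution over output nodes
print(y, y.sum())                    # the probabilities sum to 1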

7
9.1.1 Feedforward Neural Networks

• Deeper networks => create a highly non-linear classifier
in terms of the original inputs
• Input layer: x=a[0]
• Each layer i:
• Weight matrix W[i]
• Bias vector b[i]
• Output from previous layer = a[i-1]
• z[i]=W[i].a[i-1]+ b[i]
• Output from this layer: a[i]=g[i](z[i])
• Output :
• Output vector y = a[n]
• Activation functions g(·): non-linear functions
• Internal layers: might be ReLU or tanh
• Output layer:
• sigmoid for binary classification
• softmax for multinomial classification

Replacing the bias node with x0
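
The layer recursion z[i] = W[i].a[i-1] + b[i], a[i] = g[i](z[i]) can be written as a short loop; a minimal sketch with randomly initialized toy layers:

import numpy as np

rng = np.random.default_rng(1)
sizes = [4, 8, 8, 3]                           # [input, hidden..., output] sizes (arbitrary)
layers = [(rng.normal(size=(m, n)) * 0.1, np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

a = rng.normal(size=sizes[0])                  # a[0] = x
for i, (W, b) in enumerate(layers):
    z = W @ a + b                              # z[i] = W[i].a[i-1] + b[i]
    if i < len(layers) - 1:
        a = np.maximum(0.0, z)                 # ReLU for the internal layers
    else:
        a = np.exp(z - z.max()); a /= a.sum()  # softmax at the output layer
print(a)                                       # y = a[n], a distribution over classes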


8
Training Feedforward Neural Network

• Training
• Supervised machine learning:
• y: true value for input x; ŷ: value estimated by the network
• Learn parameters W[i] and b[i] for each layer i that make ŷ for each training observation as close as
possible to the true y.
• The cross-entropy loss:
• Binary classifier: LCE(ŷ, y) = −log P(y|x) = −[y log ŷ + (1−y) log(1−ŷ)]
• Multi-class classifier: LCE(ŷ, y) = −Σk yk log ŷk = −log ŷc = −log p̂(yc = 1|x) = −log( exp(zc) / Σj exp(zj) )
(where c is the correct class; also called the negative log-likelihood loss; a small code sketch of these losses follows this list)
• Gradient of the loss function in a deep network
• Computing the gradient of the loss with respect to each of the many parameters
• => Error backpropagation algorithm on the computation graph:
• Based on Backward differentiation on computation graphs
• Makes use of the chain rule to do backward computation of the gradients: re-use derivatives
computed for higher layers in computing derivatives for lower layers to minimize computation
• Propagate gradients back to all the weight nodes.
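
A small sketch of the two cross-entropy losses above, with hypothetical predicted probabilities:

import numpy as np

def binary_ce(y_hat, y):
    # LCE = -[y log(y_hat) + (1 - y) log(1 - y_hat)]
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def multiclass_ce(y_hat, c):
    # LCE = -log(y_hat[c]), where c is the index of the correct class
    return -np.log(y_hat[c])

print(binary_ce(0.8, 1))                          # small loss: confident and correct
print(binary_ce(0.8, 0))                          # large loss: confident and wrong
print(multiclass_ce(np.array([0.1, 0.7, 0.2]), c=1))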

9
Training Feedforward Neural Network

• Training:
• Forward propagation: calculate ŷ given an input x (save intermediate values)
• Backward propagation: calculate the prediction error ŷ − y, recursively apply the chain
rule along the computation graph to compute gradients, and update the weight matrices to
minimize the prediction error.
• Optimization in neural networks is more complex than for logistic regression
• Need to initialize the weights with small random numbers
• Forms of regularization to prevent overfitting
• Dropout
• Tuning of hyper-parameters (chosen by the algorithm designer) on devset
• Learning rate η
• Mini-batch size
• The model architecture (the number of layers, the number of hidden nodes per layer, the choice of
activation functions)
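
A minimal sketch of one such training step for a single one-hidden-layer example, with the gradients written out by hand (illustrative sizes, random initialization, and a made-up training example):

import numpy as np

rng = np.random.default_rng(2)
d_in, d_h, d_out, lr = 4, 6, 3, 0.1              # illustrative sizes and learning rate
W1, b1 = rng.normal(size=(d_h, d_in)) * 0.1, np.zeros(d_h)
W2, b2 = rng.normal(size=(d_out, d_h)) * 0.1, np.zeros(d_out)

x, c = rng.normal(size=d_in), 2                  # one training example, gold class c

# Forward pass (save intermediate values for the backward pass)
h = np.maximum(0.0, W1 @ x + b1)
z = W2 @ h + b2
y_hat = np.exp(z - z.max()); y_hat /= y_hat.sum()
loss = -np.log(y_hat[c])                         # cross-entropy loss

# Backward pass (chain rule; dL/dz = y_hat - one_hot(c) for softmax + CE)
dz = y_hat.copy(); dz[c] -= 1.0
dW2, db2 = np.outer(dz, h), dz
dh = W2.T @ dz
dh[h <= 0.0] = 0.0                               # ReLU gradient
dW1, db1 = np.outer(dh, x), dh

# SGD parameter update
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2
print(loss)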

10
Word2Vec
Skip gram neural network architecture

• Input word: a one-hot vector


• Output: single vector containing the probability distribution for target words
• 1 hidden layer with no activation function
• Output layer uses softmax (vanilla Skip gram), sigmoid (negative sampling)

(Figure: skip-gram architecture with the two weight matrices Winput and Woutput; Winput is used as the word-embedding matrix, and one of its dimensions is the size of the embedding vector.)
11
9.1.2 Feedforward Neural Networks in Text classification

• Sentiment classifier
• Feedforward Neural Networks :
• Using traditional hand-built features of the input text

12
9.1.2 Feedforward Neural Networks in Text classification

• Sentiment classifier
• Feedforward Neural Networks :
• Learning features from the data:
• Using pretrained embedding representations
• Apply some sort of pooling function to the embeddings of all the words in the input.
• E.g., taking the element-wise max, or mean pooling: x_mean = (1/n) Σ_{i=1..n} embedding(wi)
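
A minimal sketch of mean and element-wise-max pooling; random vectors stand in for real pretrained embeddings:

import numpy as np

rng = np.random.default_rng(4)
# Stand-in for looking up pretrained embeddings of the words in one input text
embeddings = rng.normal(size=(7, 50))     # 7 words, 50-dimensional embeddings

x_mean = embeddings.mean(axis=0)          # mean pooling: average over the words
x_max = embeddings.max(axis=0)            # element-wise max pooling
print(x_mean.shape, x_max.shape)          # both are single 50-d feature vectors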

13
9.1.3 Feedforward Neural Networks as Language Model

• Neural language models:


• Pros: Handle much longer histories, can generalize better over contexts of similar
words, more accurate at word-prediction
• Use word embeddings, rather than word identity, allows neural language models to
generalize better to unseen data
• Ex:
• In training data: have “make sure that the cat gets fed” but none of “dog gets fed”
• In test: “make sure that the dog gets ...”. Predict the next word ?
• An n-gram LM cannot predict (or assigns a very low probability to) “fed”, because “dog
gets fed” never appeared in the training data
• A neural LM: “cat” and “dog” have similar embeddings => the neural LM can generalize
from the “cat” context to assign a high enough probability to “fed”.
• Cons: much more complex, slower, need more energy to train, less interpretable than
n-gram models

14
9.1.3 Feedforward Neural Networks as Language Model

• A fixed-window neural language model: the Neural Probabilistic Language Model (NPLM; Bengio et al., 2003)

(Figure: NPLM predicting the next word of “thanks for all the ...”: input words / one-hot vectors → concatenated word embeddings → hidden layer → softmax layer producing the output distribution.)
Bengio et al. (2003) 15
9.1.3 Feedforward Neural Networks as Language Model

• Forward inference/decoding:
• At time t-1, given an input wt-N...wt-2wt-1, estimate the probability distribution over all
possible outputs for the next word wt: P(wt = i|wt-N...wt-2wt-1 ); i = 1..|V|

• One-hot vector for each word wi: xi, of shape [|V| × 1]
• Embedding weight matrix E of shape [d × |V|]: one column per word
• ei: embedding for wi, ei = E.xi
• e: concatenation of the N context embeddings ei, of shape [N·d × 1]
• Output vector y of shape [|V| × 1]

16
9.1.3 Feedforward Neural Networks as Language Model

• Forward inference/decoding:
• At time t-1, given an input wt-N...wt-2wt-1, estimate the probability distribution over all
possible outputs for the next word wt: P(wt = i|wt-N...wt-2wt-1 ); i = 1..|V|

• e=[E.x[t-N],..., E.x[t-2], E.x[t-1]]


• h = σ(W.e+b)
• z = U.h
• y = softmax(z)
y[i] = P(wt=i|wt-N...wt-2wt-1)
i = 1..|V|
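
A minimal sketch of this forward pass (concatenated embeddings e, then h = σ(W.e + b), y = softmax(U.h)), with toy sizes and random parameters:

import numpy as np

rng = np.random.default_rng(5)
V, d, N, d_h = 20, 8, 3, 16                      # toy vocab, embedding size, window, hidden size
E = rng.normal(size=(d, V)) * 0.1                # embedding matrix, one column per word
W = rng.normal(size=(d_h, N * d)) * 0.1
b = np.zeros(d_h)
U = rng.normal(size=(V, d_h)) * 0.1

context = [4, 11, 2]                             # indices of w_{t-3}, w_{t-2}, w_{t-1}
e = np.concatenate([E[:, i] for i in context])   # e = [E.x[t-N]; ...; E.x[t-1]]
h = 1.0 / (1.0 + np.exp(-(W @ e + b)))           # h = sigma(W.e + b)
z = U @ h
y = np.exp(z - z.max()); y /= y.sum()            # y[i] = P(wt = i | context)
print(y.argmax(), y.sum())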

17
9.1.3 Feedforward Neural Networks as Language Model

• Training the neural language model


• Freeze the embedding layer E: only modify W, U, and b
• or Learn the embeddings simultaneously with training: θ = E,W,U,b.
• as predicting upcoming words, learn the embeddings E for each word that
best predict upcoming words
• the embedding matrix E is shared among the context words
• Take input as a very long text concatenating all the sentences
• Start with random weights
• Iteratively move through the text to predict each word wt
• At each word wt, update the parameters using stochastic gradient descent
• Loss function: cross-entropy (CE), as usual for language modeling
• Parameter update (stochastic gradient descent):
θ ← θ − η ∂[−log p(wt | wt-N, …, wt-1)] / ∂θ ,  θ = E, W, U, b

18
9.1.3 Feedforward Neural Networks as Language Model

• Improvements over n-gram LM:


• No sparsity problem
• Don’t need to store all observed n-grams
• Problems:
• Fixed window is too small
• Window can never be large enough!
• Each input is multiplied by completely different weights: No symmetry
• => need a neural architecture that can process any length input

19
9.2. Recurrent neural networks

20
9.2.1 Recurrent neural networks

• RNN contains a cycle within its network connections


• The hidden layer includes a recurrent connection as part of its input
• The activation value of the hidden layer depends on the current input as well as the
activation value of the hidden layer from the previous time step
• Can handle the temporal nature of language (long context)
• The prior (very long, even back to the beginning of the sequence) context can be
represented by recurrent connections

New set of weights U: connect the hidden layer from the


previous time step to the current hidden layer

21
9.2.1 Recurrent neural networks

• Inference
• at time t, compute an output yt for an input xt
• ht = g(U.ht-1 + W.xt)
• yt= f(V.ht)
• xt, ht, yt: vectors of dimensions din, dh, dout respectively
• W of shape [dh × din], U of shape [dh × dh], V of shape [dout × dh]

function FORWARDRNN(x, network) returns output sequence y


h0 ←0
for i←1 to LENGTH(x) do // U, V and W are shared across time
hi ← g(U.hi−1 + W.xi) // Model size doesn’t increase for longer
yi ← f(V.hi) // input context
return y
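
The same forward pass in plain NumPy, assuming tanh for g and softmax for f (toy dimensions, random shared weights):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward_rnn(X, U, W, V):
    # X: sequence of input vectors; U, W, V are shared across time steps
    h = np.zeros(U.shape[0])                  # h0 = 0
    ys = []
    for x_t in X:
        h = np.tanh(U @ h + W @ x_t)          # ht = g(U.ht-1 + W.xt)
        ys.append(softmax(V @ h))             # yt = f(V.ht)
    return ys

rng = np.random.default_rng(6)
d_in, d_h, d_out, T = 5, 8, 4, 6              # toy dimensions and sequence length
U = rng.normal(size=(d_h, d_h)) * 0.1
W = rng.normal(size=(d_h, d_in)) * 0.1
V = rng.normal(size=(d_out, d_h)) * 0.1
X = rng.normal(size=(T, d_in))
print(len(forward_rnn(X, U, W, V)))           # one output distribution per time step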

22
9.2.1 Recurrent neural networks

• Training
• Use backpropagation through time
• The first pass:
• perform forward inference, computing and saving ht, yt at each step
• accumulating the loss at each step
• The second pass:
• for each step backward i = t, …, 0, compute the required gradients by summing gradients as you go:
∂J(t)/∂U = Σ_{i=1..t} ∂J(t)/∂U |(i)
• saving the gradients for the next use
• => vanishing gradients can occur
• the gradient gets smaller and smaller as it backpropagates further
• model weights are updated only with respect to near effects, not long-term effects
• Unrolling a recurrent network into a feedforward computational graph
• for longer input sequences, unroll the input into manageable fixed-length segments and
treat each segment as a distinct training item
• training via ordinary backpropagation

23
9.2.2 Common applications for RNNs in NLP

• Probabilistic language modeling


• Assigning a probability to a sequence, or to the next element of a sequence given the
preceding words.
• Prediction problems
• Auto-regressive generation
• Text generation
• Sequence labeling
• Each element of a sequence is assigned a label.
• POS tagging, NER, etc.
• Sequence classification
• An entire text is assigned to a category
• Spam detection, Sentiment analysis, Topic classification, etc.
• Encoder-decoder architectures
• An input is mapped to an output of different length and alignment.
• Machine translation, Text summarization, etc.

24
Recurrent Neural Networks as Language Model

• RNN language models


• predict the next word from the current word and the previous hidden state
• hidden state from the previous time step can represent information about all of the
preceding words (even from the beginning of the sequence).

Apply the same weights W, U at every timestep

Can process any length input


(Mikolov et al., 2010) Chris Manning, CS224N 25
Recurrent Neural Networks as Language Model

• Forward Inference
• Input sequence X = [x1;...;xt;...;xN]
• at time t:
• et = E.xt
• ht = g(U.ht-1 + W.et)
• yt = softmax(V.ht)
yt[i] = P(wt+1 = i | w1, …, wt), i = 1..|V|   (the probability that word i in the vocabulary is the next word)

(Mikolov et al., 2010) 26


Recurrent Neural Networks as Language Model

• Training an RNN LM
• Self-supervision algorithm from a corpus of text (without extra labels)
• Cross-entropy loss:
• At time t: LCE(ŷt, yt) = −log ŷt[wt+1] = −log P(wt+1 | w1, …, wt)
• The final loss = average LCE over the training sequence
• SGD: give the model the correct history sequence to predict the next word: “Teacher
forcing”
• At each word position t of the input, takes as input the correct sequence w1:t, estimate the
probability of token wt+1 => compute the model’s loss for the next token wt+1
• Ignore what the model predicted for wt+1, use the correct sequence w1:t+1 to estimate the
probability of token wt+2.
• etc.
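
A minimal sketch of teacher forcing: at every position the model is conditioned on the correct history, and the loss is the negative log probability it assigns to the correct next token. The predict_next function is a hypothetical stand-in for a trained RNN LM:

import numpy as np

def teacher_forcing_loss(tokens, predict_next):
    # tokens: gold token ids w_1 ... w_T
    # predict_next(history): distribution over the vocabulary for the next token
    losses = []
    for t in range(len(tokens) - 1):
        y_hat = predict_next(tokens[:t + 1])          # condition on the CORRECT history w_1:t
        losses.append(-np.log(y_hat[tokens[t + 1]]))  # -log P(w_{t+1} | w_1..w_t)
    return np.mean(losses)                            # final loss = average LCE over the sequence

# Toy stand-in LM: uniform distribution over a vocabulary of 10 tokens
uniform_lm = lambda history: np.full(10, 0.1)
print(teacher_forcing_loss([3, 1, 4, 1, 5], uniform_lm))  # = -log(0.1)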

27
Recurrent Neural Networks as Language Model

• Training an RNN LM
• …
• Weight tying :
• Use Embedding matrix as the weights from the hidden layer to the output layer.
• et = E.xt
• ht = g(U.ht-1 + W.et)
• yt= softmax(ET.ht)
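
A minimal sketch of weight tying: the same matrix E embeds the input (E.xt) and, transposed, produces the output scores (ET.ht), so no separate output matrix V is needed (toy sizes, random weights):

import numpy as np

rng = np.random.default_rng(7)
V_size, d = 12, 6                       # tying requires hidden size = embedding size d
E = rng.normal(size=(d, V_size)) * 0.1  # shared input/output embedding matrix
U = rng.normal(size=(d, d)) * 0.1
W = rng.normal(size=(d, d)) * 0.1

h = np.zeros(d)
for w_t in [2, 9, 5]:                   # toy word-id sequence
    e_t = E[:, w_t]                     # et = E.xt (column lookup)
    h = np.tanh(U @ h + W @ e_t)        # ht = g(U.ht-1 + W.et)
    z = E.T @ h                         # tied output layer: scores = ET.ht
    y = np.exp(z - z.max()); y /= y.sum()
print(y.shape)                          # distribution over the |V| vocabulary words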

28
RNNs for other NLP tasks

• Sequence Labeling
• Inputs: pre-trained word embeddings
• Outputs: tag probabilities
• run forward inference over the input sequence and select the most likely tag from
the softmax at each step

29
RNNs for other NLP tasks

• Text Classification
• Use a simple RNN combined with a feedforward network
• Pass the text through the RNN a word at a time generating a new hidden layer at
each time step.
• Form a compressed representation of the entire sequence:
• By taking the hidden layer for the last token hn
• Or pooling of all the hidden states hi for each word i in the sequence.
• Pass the sequence representation to a feedforward network that makes a prediction via a
softmax over the possible classes.
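
A minimal sketch of this pipeline: run the RNN over the word embeddings, take the last hidden state (or a mean over all hidden states) as the sequence representation, and feed it to a small feedforward classifier (toy sizes, random weights):

import numpy as np

rng = np.random.default_rng(8)
d_emb, d_h, n_classes, T = 10, 16, 3, 9        # toy dimensions
W = rng.normal(size=(d_h, d_emb)) * 0.1
U = rng.normal(size=(d_h, d_h)) * 0.1
W_out = rng.normal(size=(n_classes, d_h)) * 0.1

embeddings = rng.normal(size=(T, d_emb))       # pre-trained embeddings of the input words
h = np.zeros(d_h)
hs = []
for e_t in embeddings:
    h = np.tanh(U @ h + W @ e_t)               # new hidden layer at each time step
    hs.append(h)

x_seq = hs[-1]                                 # last hidden state hn ...
# x_seq = np.mean(hs, axis=0)                  # ... or mean pooling over all hi

z = W_out @ x_seq                              # feedforward classifier on top
probs = np.exp(z - z.max()); probs /= probs.sum()
print(probs.argmax())                          # predicted class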

30
RNNs for other NLP tasks

• Text Generation
• Tasks: question answering, machine translation, text summarization,
grammar correction, story generation, conversational dialogue, etc.
• Autoregressive generation using a language model
• Start from the beginning of sentence marker <s>
• and/or use additional task-appropriate context
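
A minimal sketch of autoregressive (greedy) generation with a language model: start from <s>, repeatedly take the most probable next word, and stop at </s> or a length limit. The next_word_distribution function is a hypothetical stand-in for a trained LM:

import numpy as np

def generate(next_word_distribution, bos, eos, max_len=20):
    # Greedy autoregressive generation: each generated word is fed back as input
    tokens = [bos]
    while len(tokens) < max_len:
        probs = next_word_distribution(tokens)   # P(next word | tokens so far)
        nxt = int(np.argmax(probs))              # greedy choice (could also sample)
        tokens.append(nxt)
        if nxt == eos:
            break
    return tokens

# Toy stand-in LM over a 6-word vocabulary; id 5 plays the role of </s>
rng = np.random.default_rng(9)
toy_lm = lambda tokens: rng.dirichlet(np.ones(6))
print(generate(toy_lm, bos=0, eos=5))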

31
9.3. Other architectures

32
9.3.1 The LSTM

“When she tried to print her tickets, she found that the printer was out
of toner. She went to the stationery store to buy more toner. It was
very overpriced. After installing the toner into the printer, she finally
printed her ________ “

• Vanishing gradient problem:


• model weights are updated only with respect to near effects, not long-term effects
• unable to predict long-distance dependencies at test time
• => Difficult for the RNN to learn to preserve information over many timesteps.
• In a vanilla RNN, the hidden state is constantly being rewritten
• Design an RNN with separate memory which is added to?

33
9.3.1 The LSTM

• Long short-term memory (LSTM) network:


• maintains relevant context over time:
• to forget/remove information that is no longer needed
• to remember/add information required for decisions still to come
• adds an explicit context to the architecture
• LSTM neural units

(Figure: a computational unit in a feedforward network, in a simple RNN, and an LSTM unit.)

• Take into account both long term memory and short term memory
• Avoid the vanishing gradient
• Become the standard unit for modern system that makes use of recurrent networks
• Modularity enables the widespread applicability of different neural units in different architectures

(Hochreiter and Schmidhuber, 1997), (Gers et al., 2000) 34


9.3.1 The LSTM

• A single LSTM unit:


• On step t: a hidden state ht and a cell state ct
• The cell stores long-term information
• The LSTM can read, erase, and write information from the cell
• The gates: control which information is erased/written/read

35
9.3.1 The LSTM

• A single LSTM unit


• Actual information (new content to be written to the cell):
gt = tanh(Ug.ht-1 + Wg.xt)
• Forget gate: what is kept vs. forgotten from the previous cell state:
ft = σ(Uf.ht-1 + Wf.xt)
• Input gate: which parts of the new cell content are written to the cell:
it = σ(Ui.ht-1 + Wi.xt)
• Output gate: which parts of the cell are output to the hidden state:
ot = σ(Uo.ht-1 + Wo.xt)
(All the gates are vectors of the same length.)
• New context (cell) vector: erase (“forget”) some content from the last cell state and write (“input”) some new content:
ct = ft ⊙ ct-1 + it ⊙ gt
• Hidden layer value: read (“output”) some content from the cell:
ht = ot ⊙ tanh(ct)
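
A minimal sketch of one LSTM step implementing the four gates and the cell/hidden updates above (toy sizes, random weights, biases omitted as in the equations):

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    Ug, Wg, Uf, Wf, Ui, Wi, Uo, Wo = params
    g = np.tanh(Ug @ h_prev + Wg @ x_t)      # candidate new cell content
    f = sigmoid(Uf @ h_prev + Wf @ x_t)      # forget gate
    i = sigmoid(Ui @ h_prev + Wi @ x_t)      # input gate
    o = sigmoid(Uo @ h_prev + Wo @ x_t)      # output gate
    c = f * c_prev + i * g                   # ct: erase some old, write some new content
    h = o * np.tanh(c)                       # ht: read some content from the cell
    return h, c

rng = np.random.default_rng(10)
d_in, d_h = 5, 8                             # toy dimensions
params = tuple(rng.normal(size=(d_h, d)) * 0.1
               for d in (d_h, d_in) * 4)     # (Ug, Wg, Uf, Wf, Ui, Wi, Uo, Wo)
h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_step(rng.normal(size=d_in), h, c, params)
print(h.shape, c.shape)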

36
9.3.1 The LSTM

• LSTMs with pre-trained word-embeddings applied many common tasks:


• part-of-speech tagging (Ling et al., 2015)
• syntactic chunking (Søgaard and Goldberg, 2016)
• named entity recognition (Chiu and Nichols, 2016; Ma and Hovy, 2016)
• opinion mining (Irsoy and Cardie, 2014)
• semantic role labeling (Zhou and Xu, 2015a)
• etc...

37
9.3.2. Advanced RNN architectures

• Bidirectional RNNs
• Take advantage of context to the right of the current input
• Combine the output: concatenate,
element-wise addition or multiplication

• Combine the hidden layer values:

• Effective for sequence classification


• Only applicable if have access to the
entire input sequence
• not applicable to Language Modeling
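
A minimal sketch of combining the two directions: run one RNN left-to-right and another right-to-left over the same inputs, then concatenate the hidden states at each position (a hypothetical run_rnn helper with random weights):

import numpy as np

def run_rnn(X, U, W):
    # Simple tanh RNN; returns the hidden state for every position
    h, hs = np.zeros(U.shape[0]), []
    for x_t in X:
        h = np.tanh(U @ h + W @ x_t)
        hs.append(h)
    return np.stack(hs)

rng = np.random.default_rng(11)
d_in, d_h, T = 5, 8, 7
Uf, Wf = rng.normal(size=(d_h, d_h)) * 0.1, rng.normal(size=(d_h, d_in)) * 0.1
Ub, Wb = rng.normal(size=(d_h, d_h)) * 0.1, rng.normal(size=(d_h, d_in)) * 0.1
X = rng.normal(size=(T, d_in))

h_fwd = run_rnn(X, Uf, Wf)                     # left-to-right pass
h_bwd = run_rnn(X[::-1], Ub, Wb)[::-1]         # right-to-left pass, re-aligned
h_bi = np.concatenate([h_fwd, h_bwd], axis=1)  # concatenated representation per position
print(h_bi.shape)                              # (T, 2 * d_h)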

38
9.3.2. Advanced RNN architectures

• Stacked (Multi-layer) RNNs


• using the entire sequence of outputs from one RNN as an input sequence to another
one

differing
levels of
feature
abstraction
from low to
high

39
9.3.3 Encoder-Decoder Model

• Encoder-Decoder Model (sequence-to-sequence networks)


• Encoder network takes an input sequence and creates a contextualized representation
of input.
• Context: function of contextualized representation, or whole sequence representation
• Context is then passed to a Decoder network which generates a task-specific output
sequence.
• => Can be different length

(Kalchbrenner and Blunsom, 2013); (Cho et al. 2014), (Sutskever et al. 2014)
40
9.3.3 Encoder-Decoder Model

• Encoder-Decoder Model (sequence-to-sequence networks)


• Sequence-to-sequence tasks in NLP:
• Machine Translation (text → text)
• Summarization (long text → short text)
• Dialogue (previous utterances → next utterance)
• Parsing (input text → parsed sequence)
• Code generation (natural language → Python code)
• etc.

(Kalchbrenner and Blunsom, 2013); (Cho et al. 2014), (Sutskever et al. 2014)
41
The Encoder-Decoder Model with RNNs

• Encoder-decoder networks for autoregressive generation


• Encoder RNN produces an encoding of the input sequence.
• Decoder RNN autoregressively generates output sequence, conditioned on encoding.

(Figure: perform forward inference with the encoder, generating hidden states until the end of the source is reached; then begin autoregressive generation with the decoder until an end-of-sequence marker is generated.)
• c = hᵉn
• hᵈ0 = c
• hᵈt = g(ŷt-1, hᵈt-1, c)
• zt = f(hᵈt)
• yt = softmax(zt)
42
The Encoder-Decoder Model with RNNs

• Attention mechanism:
• Allowing the decoder to get information from all the hidden states of the
encoder: c = f(hᵉ1, …, hᵉn)
• Or the weights can ‘attend to’ a particular part of the source text, with a
context vector ci for each decoding step i. Then: hᵈi = g(ŷi-1, hᵈi-1, ci)
• ci can be different (dynamic) for each step

(Bahdanau et al. 2015) 43


The Encoder-Decoder Model with RNNs

• Attention mechanism:
• Computing ci:
• How relevant each encoder state is to the current decoder state
• Using a score(hᵈi-1, hᵉj) for each encoder state j
• Dot-product attention:
• score(hᵈi-1, hᵉj) = hᵈi-1 · hᵉj
• normalize with a softmax to create a vector of weights:
αij = softmax(score(hᵈi-1, hᵉj))  for all j in the encoder
• ci = Σj αij hᵉj
• Other scoring functions for attention:
• score(hᵈi-1, hᵉj) = hᵈi-1 · Ws · hᵉj
• …
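
A minimal sketch of dot-product attention for one decoder step: score every encoder state against the previous decoder state, softmax the scores into weights α, and take the weighted sum as ci (random vectors stand in for real hidden states):

import numpy as np

def dot_product_attention(h_dec_prev, H_enc):
    # H_enc: (n, d) matrix of encoder hidden states; h_dec_prev: (d,) decoder state
    scores = H_enc @ h_dec_prev                     # score(hd_{i-1}, he_j) for every j
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                            # softmax over the encoder positions
    c_i = alpha @ H_enc                             # c_i = sum_j alpha_ij * he_j
    return c_i, alpha

rng = np.random.default_rng(12)
H_enc = rng.normal(size=(6, 8))                     # 6 encoder states of dimension 8
h_dec_prev = rng.normal(size=8)
c_i, alpha = dot_product_attention(h_dec_prev, H_enc)
print(c_i.shape, alpha.sum())                       # (8,), weights sum to 1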

44
• end of Chapter 9

45
