DEEP
LEARNING
WORKSHOP
Dublin City University
27-28 April 2017
Xavier Giro-i-Nieto
xavier.giro@upc.edu
Associate Professor
Universitat Politècnica de Catalunya
Technical University of Catalonia
Neural Machine Translation
Day 2 Lecture 10
#InsightDL2017
2
Acknowledgments
Antonio
Bonafonte
Santiago
Pascual
3
Acknowledgments
Marta R. Costa-jussà
4
Acknowledgments
Kyunghyun Cho
5
Precedents of Neural Machine Translation
Neco, R.P. and Forcada, M.L. "Asynchronous translations with recurrent neural nets." International Conference on Neural Networks, 1997, Vol. 4, pp. 2535-2540. IEEE.
6
Encoder-Decoder
Representation or
Embedding
Cho, Kyunghyun, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. "On the properties of
neural machine translation: Encoder-decoder approaches." SSST-8 (2014).
7
Encoder-Decoder
Front View Side View
Representation of the sentence
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger
Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical
machine translation." arXiv preprint arXiv:1406.1078 (2014).
Encoder
9
Encoder in three steps
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
(1) One-hot encoding
(2) Continuous-space word representation (word embedding)
(3) Sequence summarization
10
Step 1: One-hot encoding
Example: letters. |V| = 30
‘a’: x = 1
‘b’: x = 2
‘c’: x = 3
.
.
.
‘.’: x = 30
11
Step 1: One-hot encoding
Example: letters. |V| = 30
‘a’: x = 1
‘b’: x = 2
‘c’: x = 3
.
.
.
‘.’: x = 30
We impose a fake ordering (and distances) between symbols
12
Step 1: One-hot encoding
Example: letters. |V| = 30
‘a’: x^T = [1,0,0, ..., 0]
‘b’: x^T = [0,1,0, ..., 0]
‘c’: x^T = [0,0,1, ..., 0]
.
.
.
‘.’: x^T = [0,0,0, ..., 1]
13
Step 1: One-hot encoding
Example: words.
cat: x^T = [1,0,0, ..., 0]
dog: x^T = [0,1,0, ..., 0]
.
.
house: x^T = [0,0,0, …,0,1,0,...,0]
.
.
.
Number of words, |V| ?
B2: 5K
C2: 18K
LVSR: 50-100K
Wikipedia (1.6B): 400K
Crawl data (42B): 2M
14
Step 1: One-hot encoding
● Large dimensionality
● Sparse representation (mostly zeros)
● Blind representation
○ The only meaningful operators are ‘==’ and ‘!=’ (see the sketch below)
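To make this concrete, here is a minimal Python sketch of one-hot encoding; the toy vocabulary and indices are made up purely for illustration.

```python
import numpy as np

# Toy vocabulary for illustration; real systems use tens of thousands of words.
vocab = ["cat", "dog", "house", "tree", "<EOS>"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a |V|-dimensional vector with a single 1 at the word's index."""
    x = np.zeros(len(vocab))
    x[word_to_index[word]] = 1.0
    return x

print(one_hot("cat"))    # [1. 0. 0. 0. 0.]
print(one_hot("house"))  # [0. 0. 1. 0. 0.]
# Only '==' / '!=' are meaningful: all pairs of distinct one-hot vectors are equally far apart.
```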
15
Step 2: Projection to word embedding
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
The one-hot vector w_i is linearly projected to an embedding space of lower dimension with a matrix E of learned weights (equivalent to a fully connected layer):
s_i = E w_i
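A minimal numpy sketch of this projection; the sizes (|V| = 5, d = 3) and the random E are illustrative only, since in a real model E is learned.

```python
import numpy as np

V, d = 5, 3                       # toy vocabulary size and embedding dimension
E = np.random.randn(d, V)         # embedding matrix (learned during training)

w_i = np.zeros(V)
w_i[2] = 1.0                      # one-hot vector for the 3rd word of the vocabulary
s_i = E @ w_i                     # projection s_i = E w_i
assert np.allclose(s_i, E[:, 2])  # multiplying by a one-hot simply selects a column of E
print(s_i)                        # the word's continuous-space representation
```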
16
Word embeddings
Figure: Christopher Olah, Visualizing Representations
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. "Distributed representations of
words and phrases and their compositionality." NIPS 2013
Step 2: Projection to word embedding
17
Step 2: Projection to word embedding
GloVe (Stanford)
18
Step 2: Projection to word embedding
GloVe (Stanford)
19
Step 2: Projection to word embedding
Figure:
TensorFlow tutorial
Bengio, Yoshua, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. "A neural probabilistic language
model." Journal of machine learning research 3, no. Feb (2003): 1137-1155.
20
Step 2: Projection to word embedding
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. "Distributed representations of
words and phrases and their compositionality." NIPS 2013
Word2Vec: Continuous Bag of Words (CBOW)
Example sentence: “the cat climbed a tree”
Given the context (a, cat, the, tree), estimate the probability of the center word: climbed
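A toy sketch of how CBOW training examples could be extracted from the example sentence; the window size of 2 is an arbitrary illustrative choice.

```python
sentence = "the cat climbed a tree".split()
window = 2  # illustrative context size

# CBOW: predict the center word from the surrounding context words.
for i, target in enumerate(sentence):
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    print(context, "->", target)
# e.g. ['the', 'cat', 'a', 'tree'] -> climbed
```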
21
Step 2: Projection to word embedding
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. "Distributed representations of
words and phrases and their compositionality." NIPS 2013
Word2Vec: Skip-gram
Example sentence: “the cat climbed a tree”
Given the word climbed, estimate the probability of the context words: a, cat, the, tree
(The context length is sampled at random, up to a maximum of 10 words to the left and 10 to the right.)
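A toy sketch of skip-gram pair extraction with the randomly sampled context length described above; all names and sizes are illustrative.

```python
import random

sentence = "the cat climbed a tree".split()
max_window = 10  # as on the slide: up to 10 words on each side

# Skip-gram: predict each context word from the center word.
# Word2Vec draws the effective window size at random for every center word.
pairs = []
for i, center in enumerate(sentence):
    w = random.randint(1, max_window)
    for j in range(max(0, i - w), min(len(sentence), i + w + 1)):
        if j != i:
            pairs.append((center, sentence[j]))
print(pairs)  # e.g. ('climbed', 'the'), ('climbed', 'cat'), ('climbed', 'a'), ...
```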
22
Step 2: Projection to word embedding
● Represent words using vectors of dimension d
(~100 - 500)
● Meaningful (semantic, syntactic) distances
● A dominant research topic of recent years at NLP conferences (e.g. EMNLP)
● Good embeddings are useful for many other tasks
23
Pre-trained word embeddings for 90 languages, trained with fastText on Wikipedia.
Step 2: Projection to word embedding
24
Step 3: Recurrence
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
h_T (summary vector of the sentence)
25
Step 3: Recurrence
Sequence
Figure: Cristopher Olah, “Understanding LSTM Networks” (2015)
The recurrent unit can be an LSTM, GRU, QRNN, pLSTM, ...
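Putting the three encoder steps together, a minimal numpy sketch; a vanilla tanh RNN stands in here for the LSTM/GRU variants mentioned above, and all sizes and weights (W_x, W_h) are illustrative placeholders.

```python
import numpy as np

V, d, n = 5, 3, 4              # toy vocabulary, embedding and hidden-state sizes
E = np.random.randn(d, V)      # word embedding matrix (learned)
W_x = np.random.randn(n, d)    # input-to-hidden weights (learned)
W_h = np.random.randn(n, n)    # hidden-to-hidden weights (learned)

def encode(word_indices):
    """Summarize a source sentence into a single vector h_T."""
    h_t = np.zeros(n)
    for idx in word_indices:
        w_t = np.zeros(V)
        w_t[idx] = 1.0                          # step 1: one-hot encoding
        s_t = E @ w_t                           # step 2: word embedding
        h_t = np.tanh(W_x @ s_t + W_h @ h_t)    # step 3: recurrence
    return h_t

h_T = encode([0, 2, 4])  # indices of a toy source sentence
print(h_T)               # the sentence summary passed to the decoder
```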
Decoder
27
Decoder
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
The recurrent state z_i of the decoder is determined by:
1) the summary vector h_T
2) the previous output word u_{i-1}
3) the previous state z_{i-1}
28
Decoder
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
With z_i ready, we can compute a score e(k) for each word k in the vocabulary as a dot product between the neuron weights for word k (output layer, w_k) and the recurrent state z_i:
e(k) = w_k · z_i
29
Decoder
A score e(k) is higher if the neuron weights for word k (w_k) and the decoder’s internal state z_i are similar to each other.
Reminder: a dot product computes the length of the projection of one vector onto another. For similar vectors (nearly parallel) the projection is longer than for very different ones (nearly perpendicular).
30
Decoder
Bridle, John S. "Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum
Mutual Information Estimation of Parameters." NIPS 1989
Given the score e(k) for word k, we can finally normalize to word probabilities with a softmax:
p(w_i = k | w_1, ..., w_{i-1}, h_T) = exp(e(k)) / Σ_j exp(e(j))
This is the probability that the output word at timestep i (w_i) is word k, given the previous words and the hidden state.
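A small numpy sketch of the scoring and softmax steps; W_out and z_i are random placeholders for the learned output layer and the decoder state.

```python
import numpy as np

V, n = 5, 4                        # toy vocabulary size and decoder state size
W_out = np.random.randn(V, n)      # one weight vector w_k per vocabulary word (learned)
z_i = np.random.randn(n)           # current decoder state

e = W_out @ z_i                    # scores: e(k) = w_k · z_i for every word k at once
p = np.exp(e - e.max())            # softmax, shifted by max(e) for numerical stability
p /= p.sum()
print(p, p.sum())                  # word probabilities at timestep i; they sum to 1
```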
31
Decoder
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
Once an output word sample u_i is predicted at timestep i, the process is iterated:
(1) update the decoder’s internal state z_{i+1}
(2) compute scores and probabilities p_{i+1} for all possible target words
(3) predict the word sample u_{i+1}
...
32
Decoder
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
More words of the decoded sentence are generated until an <EOS> (End Of Sentence) “word” is predicted.
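Putting the decoder slides together, a rough sketch of greedy decoding; the update_state function and the weights E_out, W_z, W_out are hypothetical random placeholders standing in for the learned decoder RNN and output layer.

```python
import numpy as np

V, n = 5, 4
EOS = 4                                     # index of the <EOS> "word" in the toy vocabulary
W_out = np.random.randn(V, n)               # output layer (learned in practice)
E_out = np.random.randn(n, V)               # target-side word embeddings (learned)
W_z = np.random.randn(n, 3 * n)             # decoder recurrence weights (learned)
h_T = np.random.randn(n)                    # summary vector from the encoder

def update_state(z_prev, u_prev):
    """Hypothetical decoder step: z_i = f(z_{i-1}, u_{i-1}, h_T)."""
    u_emb = E_out[:, u_prev] if u_prev is not None else np.zeros(n)
    return np.tanh(W_z @ np.concatenate([z_prev, u_emb, h_T]))

z, u, output = np.zeros(n), None, []
for _ in range(20):                         # hard cap in case <EOS> is never produced
    z = update_state(z, u)                  # (1) update the decoder state
    e = W_out @ z                           # (2) scores for all target words...
    p = np.exp(e - e.max()); p /= p.sum()   #     ...normalized into probabilities
    u = int(p.argmax())                     # (3) pick the next word (greedy; sampling also works)
    output.append(u)
    if u == EOS:                            # stop when <EOS> is predicted
        break
print(output)
```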
33
Encoder-Decoder
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
Representation or
Embedding
34
Representation or
Embedding
Encoder Decoder
35
Encoder-Decoder: Training
Training requires a large parallel corpus: pairs of sentences in the source and target languages.
Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." EMNLP 2014.
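The slides do not spell out the training objective; a common choice, consistent with the referenced papers, is to maximize the log-probability of the reference translation, i.e. minimize its negative log-likelihood. A minimal sketch, assuming per-step softmax outputs computed while feeding the reference words back in (teacher forcing):

```python
import numpy as np

def sentence_nll(probs_per_step, target_indices):
    """Negative log-likelihood of one reference translation.

    probs_per_step: softmax outputs of the decoder, one per target position,
    computed while feeding the reference words back in (teacher forcing).
    """
    return -sum(np.log(p[k]) for p, k in zip(probs_per_step, target_indices))

# Toy usage with made-up probabilities for a 3-word reference translation:
dummy_probs = [np.array([0.7, 0.1, 0.1, 0.1]),
               np.array([0.2, 0.6, 0.1, 0.1]),
               np.array([0.1, 0.1, 0.1, 0.7])]
print(sentence_nll(dummy_probs, [0, 1, 3]))

# Training loops over the parallel corpus of (source, target) sentence pairs and
# minimizes this loss, e.g. with stochastic gradient descent on all the weights.
```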
36
Encoder-Decoder: Seq2Seq
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks."
NIPS 2014.
The Seq2Seq variation:
● triggers the output generation with an input <go> symbol.
● feeds the predicted word at timestep t as the input at t+1 (see the toy example below).
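A toy illustration of these two bullets: generation is triggered by a <go> symbol and each predicted word becomes the next input. The example output sentence is arbitrary.

```python
# Toy illustration: the decoder's input at t+1 is its own prediction at t.
predicted = ["the", "cat", "climbed", "a", "tree", "<EOS>"]   # arbitrary example output
decoder_inputs = ["<go>"] + predicted[:-1]
for t, (x_t, y_t) in enumerate(zip(decoder_inputs, predicted)):
    print(f"t={t}: input {x_t:>8} -> predicted {y_t}")
```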
37
Encoder-Decoder: Seq2Seq
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks."
NIPS 2014.
38
Thanks ! Q&A ?
Follow me at
https://imatge.upc.edu/web/people/xavier-giro
@DocXavi
/ProfessorXavi