Drawing a "map of words": word embeddings and
applications to machine translation and sentiment
analysis
Mostapha Benhenda
Artificial Intelligence Club, Kyiv
mostaphabenhenda@gmail.com
February 19, 2015
Overview
1 Introduction
2 Applications
Machine translation
Sentiment analysis of movie reviews
Vector averaging (Kaggle tutorial)
Convolutional Neural Networks (deep learning)
Other applications of word embeddings
3 Example of word embedding algorithm: GloVe
Build the co-occurrence matrix X
Matrix factorization
4 Future work
5 References
Introduction
We want to compute a "map of words", i.e. a representation:
R : Words = {w_1, ..., w_N} → Vectors = {R(w_1), ..., R(w_N)} ⊂ R^d
such that:
w_i ≈ w_j (similar meaning of words)
is equivalent to:
R(w_i) ≈ R(w_j) (small distance between vectors)
These vectors are interesting because:
the computer can "grasp" the meaning of words, just by looking at
the distance between vectors.
We can feed prediction algorithms (linear regression,...) with these
vectors, and hopefully get good accuracy, because the representation
is "faithful" to the meaning of words.
For example, if we have the list of words:
cat, dog, mouse, house
We expect the most distant word to be: ?
(The figure is a 2-dimensional visualization of a 300-dimensional space, using
t-SNE, a visualization technique that preserves clusters of points.)
Even better, we can sometimes make additions and subtractions of
vectors, and draw parallelisms. For example,
king + woman - man ≈ ?
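For illustration (not part of the original slides), such an analogy query can be run with the Gensim package used later in these experiments, assuming the pretrained Google News vectors file has been downloaded locally:

    # Minimal sketch (assumes gensim >= 4.x and the pretrained file
    # 'GoogleNews-vectors-negative300.bin' available in the working directory).
    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True)

    # king + woman - man ≈ ?
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
    # 'queen' is expected to appear near the top of the list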
(The figure is a 2-dimensional visualization of a 300-dimensional space, using PCA, a
visualization technique that preserves parallelisms.)
Idea behind all algorithms (Word2vec, GloVe,...): "You shall know a
word by the company it keeps" (John R. Firth, 1957)
The more often 2 words are near each other in a text (the training
data), the closer their vectors will be.
We hope that 2 words have close meanings if, statistically, they are
often near each other.
So we need quite big datasets:
Google News English: 100 Billion words
Wikipedia French or Spanish: 100 Million words
MT 11 French: 200 Million words
MT 11 Spanish: 84 Million words
Application to machine translation
Idea: we compute maps of all English words and of all French words, we
"superpose" the 2 maps, and we should get an English/French translator.
[Mikolov et al. 2013] (from Google) made an English → Spanish
translator (among other languages).
I tried to reproduce their results for French → English,
and French → Spanish.
Using their algorithm Word2vec, they trained their vectors on the
dataset MT 11, and on Google News.
I did the same on MT 11 and Wikipedias for French and Spanish
(trained with Gensim package in Python), and I used their Google
News-trained vectors for English.
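For reference, a minimal Gensim training sketch; the corpus file name and hyperparameters below are illustrative assumptions, not the exact settings of these experiments:

    # Sketch: train Word2vec vectors on a tokenized corpus with Gensim (>= 4.x API).
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    corpus = LineSentence("wiki_fr_tokenized.txt")  # one sentence per line, whitespace-tokenized (assumed file)
    model = Word2Vec(
        sentences=corpus,
        vector_size=300,  # dimension d of the embedding space
        window=10,        # context window size
        min_count=5,      # ignore rare words
        workers=4,
    )
    model.wv.save_word2vec_format("wiki_fr_vectors.bin", binary=True)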
Then they took the list of the 5000 most frequent English words, and
their Google translation in Spanish.
They trained a linear transformation W that approximates the English
→ Spanish translation, i.e. they take the W that minimizes:
Σ_{i=1}^{5000} ‖W(v_i) − G(v_i)‖²
where G(v_i) is (the vector of) the Google translation of v_i.
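A sketch of one way to fit such a W, using ordinary least squares in NumPy rather than the stochastic gradient descent of the paper (the array names are assumptions):

    # Sketch: learn a linear map W between two embedding spaces by least squares.
    import numpy as np

    def fit_translation_matrix(X_src, Y_tgt):
        # X_src: (5000, d_src) source-language vectors of the dictionary words
        # Y_tgt: (5000, d_tgt) vectors of their Google translations G(v_i)
        # Solve min_W sum_i ||W v_i - G(v_i)||^2, i.e. X_src @ W.T ≈ Y_tgt
        W_T, _, _, _ = np.linalg.lstsq(X_src, Y_tgt, rcond=None)
        return W_T.T  # W has shape (d_tgt, d_src); W @ v maps a source vector to the target space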
They test the accuracy of W on the next 1000 most common words
of the list (1-accuracy). They also test the accuracy up to 5
attempts, i.e. they test if G(v_i) belongs to the 5 nearest neighbors of
W(v_i) (5-accuracy).
I did the same for French → English, and French → Spanish.
Code available here: https://drive.google.com/open?id=0B86WKpvkt66BY09TSHJoekRqZjg&authuser=0
This is a computer-intensive task; I recommend using Amazon Web Services
or similar.
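A rough sketch of the k-accuracy computation on held-out pairs, using cosine similarity against the whole target vocabulary (variable names are assumptions):

    # Sketch: k-accuracy of the learned map W on test pairs.
    import numpy as np

    def k_accuracy(W, X_test, tgt_matrix, gold_idx, k=5):
        # W: (d_tgt, d_src); X_test: (n, d_src) held-out source vectors
        # tgt_matrix: (V, d_tgt) all target-language vectors
        # gold_idx: (n,) row index in tgt_matrix of the reference translation
        gold_idx = np.asarray(gold_idx)
        pred = X_test @ W.T
        pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
        tgt = tgt_matrix / np.linalg.norm(tgt_matrix, axis=1, keepdims=True)
        sims = pred @ tgt.T                       # cosine similarities, shape (n, V)
        topk = np.argsort(-sims, axis=1)[:, :k]   # indices of the k nearest target words
        hits = (topk == gold_idx[:, None]).any(axis=1)
        return hits.mean()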
Results
Mikolov, English → Spanish:
Training data 1-accuracy 5-accuracy
Google News 50% 75%
MT 11 33% 51%
Me, French → Spanish:
Training data 1-accuracy 5-accuracy
Wikipedia n/a <10%
MT 11 25% 37%
French (Wikipedia) → English (Google News) 10-accuracy: < 10%.
My conclusion: Wikipedia = a nightmare!
Sentiment analysis of movie reviews
Goal: we want to determine if a movie review is positive or negative.
A classic toy problem in machine learning, but still hard: reviews are ambiguous,
emotional, full of sarcasm,...
Long-term goal: computer can understand emotions.
Commercial applications of sentiment analysis:
marketing: customer satisfaction,...
finance: predict market trends,...
Example of review: "This movie contains everything you'd expect,
but nothing more".
Vector averaging (Kaggle tutorial)
We work on the IMDB dataset. To predict the sentiment (+ or -) of a
review:
We average the vectors of all the words of the review.
We use these average vectors as input to a supervised learning
algorithm (e.g. SVM, random forests...).
We get 83% accuracy.
Limitation of the method: the order of words is lost, because addition
is a commutative operation.
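A compact sketch of this averaging pipeline (word vectors and labelled reviews are assumed to be loaded already; the classifier choice is illustrative):

    # Sketch: average word vectors per review, then train a standard classifier.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def review_to_vector(tokens, word_vectors, dim=300):
        # Average the vectors of the review's words (word order is lost here).
        vecs = [word_vectors[w] for w in tokens if w in word_vectors]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    def train_sentiment(train_reviews, train_labels, word_vectors):
        # train_reviews: list of token lists; train_labels: 0/1 sentiment labels
        X = np.vstack([review_to_vector(r, word_vectors) for r in train_reviews])
        clf = RandomForestClassifier(n_estimators=100)
        clf.fit(X, train_labels)
        return clf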
Convolutional Neural Networks (deep learning) (work in
progress)
CNNs are biology-inspired neural network models, initially introduced
for image processing.
They preserve spatial structure: the input is an n × m matrix (the pixels
of the image).
But here, the input is a sentence (with its "spatial structure"
preserved):
Figure: source [Collobert et al. 2011]
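For orientation only, a minimal Keras-style sketch of a 1D convolutional classifier over word sequences; this is not the architecture of [Collobert et al. 2011], and the vocabulary size, filter count and other hyperparameters are assumptions:

    # Sketch: a 1D CNN over word indices for binary sentiment classification.
    from tensorflow.keras import layers, models

    vocab_size, embed_dim = 20000, 300   # assumed vocabulary size and vector dimension

    model = models.Sequential([
        layers.Embedding(vocab_size, embed_dim),                       # word index -> word vector
        layers.Conv1D(filters=100, kernel_size=5, activation="relu"),  # slide filters over word windows
        layers.GlobalMaxPooling1D(),                                   # keep the strongest response per filter
        layers.Dense(1, activation="sigmoid"),                         # positive / negative
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])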
Practical problem: reviews in the Kaggle (IMDB) dataset can have 2000
words/review, far more than the ~50 words per example in [Collobert et al. 2011]
→ training is too slow!
There are tricks to speed up training, but the benefits are uncertain...
Other applications of word embeddings
Innovative search engine (ThisPlusThat), where we can subtract
queries, for example:
pizza + Japan - Italy → sushi
Recommendation systems for online shops
Example of word embedding algorithm: GloVe
GloVe: "Global Vectors", an algorithm developed at Stanford by
[Pennington, Socher, Manning, 2014]. There are 2 steps:
1 Build the co-occurrence matrix X from the training text
2 Factorize the matrix X to get vectors
1. Build the co-occurrence matrix X:
For the first step, we apply the principle "you shall know a word by the
company it keeps":
The context window C(w) of size 2 of the word w = Ukraine (for
example) is given by:
The national flag of Ukraine is yellow and blue.
(the window contains the 2 words on each side: "flag", "of", "is", "yellow";
in practice, the context window size is around 10)
The number of times 2 words i and j lie in the same context window
is denoted by X_{i,j}.
The symmetric matrix X = (X_{i,j})_{1≤i,j≤N} is the co-occurrence matrix.
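A small sketch of building such a matrix X from a tokenized corpus (plain unweighted counts; GloVe itself additionally down-weights distant co-occurrences):

    # Sketch: symmetric word-word co-occurrence counts with a fixed window.
    from collections import defaultdict

    def cooccurrence_matrix(sentences, window=10):
        # sentences: iterable of token lists; returns {(word_i, word_j): count}
        X = defaultdict(float)
        for tokens in sentences:
            for pos, w in enumerate(tokens):
                start = max(0, pos - window)
                for ctx in tokens[start:pos]:  # left context only, then symmetrize
                    X[(w, ctx)] += 1.0
                    X[(ctx, w)] += 1.0
        return X

    # example: X = cooccurrence_matrix([["the", "national", "flag", "of", "ukraine",
    #                                     "is", "yellow", "and", "blue"]], window=2)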
2. Factorize the co-occurrence matrix X:
To extract vectors from a symmetric matrix, we can write:
X_{i,j} = ⟨v_i, v_j⟩, with v_i ∈ R^d,
where d is an integer fixed by us a priori (a hyperparameter), i.e. we
can write:
X = Gram(v_1, ..., v_N)
This formula does not give good empirical performance, but:
for any scalar function f, (f(X_{i,j}))_{1≤i,j≤N} is still a symmetric matrix.
Let's find an f that works!
Let i, j be 2 words, for example: i = fruit, j = house. The third word
k = apple has a meaning closer to fruit than to house, so X_{i,k}/X_{j,k} is
large.
If k = room (closer to "house" than to "fruit"), then X_{i,k}/X_{j,k} is
small.
If k = sky (far from both "house" and "fruit"), then X_{i,k}/X_{j,k} ≈ 1.
→ the ratio of co-occurrences X_{i,k}/X_{j,k} is important to capture the meaning
of words. So we should look at f(X_{i,k}/X_{j,k}).
On the other hand, if we want to combine 3 vectors and the scalar
product in a "natural" way, we do not have much choice:
⟨v_i − v_j, v_k⟩ = f(X_{i,k}/X_{j,k})
We also have, writing ⟨v_i, v_k⟩ = f(X_{i,k}):
⟨v_i − v_j, v_k⟩ = ⟨v_i, v_k⟩ − ⟨v_j, v_k⟩ = f(X_{i,k}) − f(X_{j,k})
So f must turn ratios into differences, f(a/b) = f(a) − f(b): f = log.
We cannot factorize the matrix log X explicitly, i.e. we cannot directly
compute v_i, v_j such that:
⟨v_i, v_j⟩ = log X_{i,j}
but we compute an approximation, by minimizing a cost function J:
min_{v_1,...,v_N} J(v_1, ..., v_N), with J(v_1, ..., v_N) ∼ Σ_{i,j=1}^{N} (⟨v_i, v_j⟩ − log X_{i,j})²
To do that, we use gradient descent (standard method).
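A toy sketch of this factorization by plain gradient descent on a small dense matrix; the real GloVe objective also includes bias terms and a weighting function that down-weights rare and very frequent pairs, which are omitted here:

    # Sketch: fit vectors v_i so that <v_i, v_j> ~ log X_ij, by gradient descent.
    import numpy as np

    def factorize_log_cooc(X, d=50, lr=0.01, epochs=200, eps=1e-8):
        # X: (N, N) dense co-occurrence matrix; returns V with rows v_i in R^d
        N = X.shape[0]
        V = 0.1 * np.random.randn(N, d)
        target = np.log(X + eps)          # avoid log(0); GloVe instead skips zero entries
        for _ in range(epochs):
            R = V @ V.T - target          # residuals <v_i, v_j> - log X_ij
            grad = 4.0 * R @ V / (N * N)  # gradient of the mean squared residual w.r.t. V
            V = V - lr * grad
        return V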
Looks like cooking, but:
good empirical performance (at least similar to Word2vec)
cost function J similar to Word2vec's
the GloVe model is analogous to Latent Semantic Analysis (LSA). In
LSA:
the co-occurrence matrix is a word-document matrix X (in GloVe: a
word-word matrix)
we factorize ∼ log(1 + X_{i,j}) (by SVD, not as a Gram matrix)
Conclusion: the GloVe model is worth studying!
Future work
Deep learning (Convolutional Neural Networks): Natural Language
Processing (almost) from Scratch [Collobert et al. 2011]
Not related to word embeddings:
Boltzmann machines / Dynamical systems
Suggestion: start a study group on deep learning in Kyiv!
Applications of deep learning: NLP, images, videos, speech, fraud
detection...
→ money!!
Prerequisites:
1 coding (Python or Matlab or Java...)
2 linear algebra (matrix multiplication)
3 calculus (chain rule)
4 enthusiasm!
References
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., ...
& Bengio, Y. (2010)
Theano: a CPU and GPU math expression compiler
Proceedings of the Python for scientific computing conference (SciPy), (Vol. 4, p.
3)
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P.
(2011)
Natural language processing (almost) from scratch
The Journal of Machine Learning Research, 12, 2493-2537
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013)
Distributed representations of words and phrases and their compositionality
Advances in Neural Information Processing Systems, pp. 3111-3119
Mikolov, T., Le, Q. V., & Sutskever, I. (2013)
Exploiting similarities among languages for machine translation
arXiv preprint arXiv:1309.4168.
Le, Q. V., & Mikolov, T. (2014)
Distributed representations of sentences and documents
arXiv preprint arXiv:1405.4053.
Pennington, J., Socher, R., & Manning, C. D. (2014)
Glove: Global vectors for word representation
Proceedings of the Empirical Methods in Natural Language Processing (EMNLP
2014), 12.
Řehůřek, R., & Sojka, P. (2010)
Software framework for topic modelling with large corpora
Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks,
pp. 45-50.