Word Embeddings
The important idea: a model of meaning that focuses on similarity
Why word meaning in NLP models?
Content
• Representing words
• Word2Vec
• Applications of Word2Vec
How do we represent the meaning of a word?
Definition: meaning (Webster dictionary)
• the idea that is represented by a word, phrase, etc.
• the idea that a person wants to express by using
words, signs, etc.
• the idea that is expressed in a work of writing, art,
etc.
How do we have usable meaning in a computer?
Problems with resources like WordNet
Representing words as discrete symbols
Problem with words as discrete symbols
Representing words by their context
• Distributional semantics: a word’s meaning is given by the words that frequently appear close by (a counting sketch follows below)
• “You shall know a word by the company it keeps” (J. R. Firth 1957: 11)
• “Words which frequently appear in similar contexts have similar meaning”
• One of the most successful ideas of modern statistical NLP!
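A minimal counting sketch of this idea, assuming a toy two-sentence corpus and a ±2-word context window (both are illustrative assumptions):

```python
# Minimal sketch: build a word-by-word co-occurrence count table from a tiny
# toy corpus (the corpus and the window size are illustrative assumptions).
from collections import defaultdict

corpus = ["the cat sat on the mat", "the dog sat on the floor"]
window = 2  # words within +/- 2 positions count as "context"

counts = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    tokens = sentence.split()
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                counts[w][tokens[j]] += 1

# Words with similar count rows appear in similar contexts:
# "cat" and "dog" both co-occur with "the", "sat", "on".
print(dict(counts["cat"]), dict(counts["dog"]))
```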
Word vectors
• We will build a dense vector for each word, chosen so that it
is similar to vectors of words that appear in similar contexts
https://skymind.ai/wiki/word2vec
Skip gram/CBOW intuition
• Two words with similar “contexts” (that is, the words that are likely to appear around them) should end up with similar embeddings.
• One way for the network to output similar context predictions for two words is for their word vectors to be similar. So, if two words have similar contexts, the network is motivated to learn similar word vectors for those two words!
http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
CBOW
• Bag of words
• Discards word order; in the discrete case, a text is represented by the counts of the words that appear in it.
• CBOW
• Takes the vector embeddings of the n words before and the n words after the target and adds them (as vectors).
• Word order is still discarded, but the vector sum is meaningful enough to deduce the missing word (see the sketch below).
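A minimal numpy sketch of the CBOW computation just described; the toy vocabulary, the embedding size, and the random weights are all illustrative assumptions:

```python
# Minimal sketch of the CBOW forward pass: sum the context embeddings,
# score every vocabulary word, and take a softmax (toy sizes assumed).
import numpy as np

vocab = ["the", "cat", "sat", "on", "floor"]
V, N = len(vocab), 4                      # vocabulary size, embedding size
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, N))            # input embeddings, one row per word
W_out = rng.normal(size=(N, V))           # output weights

def cbow_predict(context_words):
    idx = [vocab.index(w) for w in context_words]
    h = W_in[idx].sum(axis=0)             # sum (or average) of context vectors
    z = h @ W_out                         # one score per vocabulary word
    p = np.exp(z - z.max())
    return p / p.sum()                    # softmax: probability of each word

# With untrained random weights the prediction is arbitrary; training would
# push the distribution toward the true center word "sat".
probs = cbow_predict(["the", "cat", "on", "floor"])
print(vocab[int(np.argmax(probs))], probs)
```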
Word2vec – Continuous Bag of Words
[Figure: CBOW example on the sentence “the cat sat on the floor” — the context words “the”, “cat”, “on”, “floor” are used to predict the center word “sat”.]
www.cs.ucr.edu/~vagelis/classes/CS242/slides/word2vec.pptx
[Figure: CBOW network. Each context word (e.g. “cat”, “on”) enters the input layer as a V-dimensional one-hot vector and is multiplied by the input weight matrix W_{V×N} to give an N-dimensional hidden vector; the hidden layer is then multiplied by the output weight matrix W′_{N×V} to predict the center word “sat”. N is the size of the word vector.]
W^T_{V×N} × x_cat = v_cat
[Figure: multiplying W^T by the one-hot input vector x_cat picks out the column of W^T (equivalently, the row of W) for “cat”, which is its embedding v_cat.]
W^T_{V×N} × x_on = v_on
[Figure: the same lookup applied to the one-hot vector x_on yields the embedding v_on.]
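A minimal sketch of what these two slides show — multiplying W^T by a one-hot vector is just a row lookup (the toy matrix and the index assumed for “cat” are illustrative):

```python
# Sketch: multiplying the (transposed) input weight matrix by a one-hot vector
# simply selects that word's row of W, i.e. its embedding (toy values assumed).
import numpy as np

V, N = 5, 3
W = np.arange(V * N, dtype=float).reshape(V, N)   # stand-in for the learned W_{VxN}
x_cat = np.zeros(V)
x_cat[1] = 1.0                                    # one-hot vector for "cat" (index 1 assumed)

v_cat = W.T @ x_cat                 # the matrix-vector product ...
assert np.allclose(v_cat, W[1])     # ... is exactly row 1 of W: the embedding of "cat"
print(v_cat)
```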
[Figure: the N-dimensional hidden vector built from the context words is multiplied by W′_{N×V} to produce a V-dimensional output ŷ_sat, which should match the one-hot vector for the center word “sat”. N is the size of the word vector.]
[Figure: the output scores z are turned into probabilities with ŷ = softmax(z), a V-dimensional distribution (e.g. 0.01, 0.02, …, 0.7, …, 0.00). We would prefer ŷ to be close to y_sat, the one-hot vector for the true center word “sat”.]
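A small numeric sketch of this step: the scores z are squashed with a softmax and compared to the one-hot target (the toy score vector and the index of “sat” are assumptions):

```python
# Sketch: softmax turns scores into probabilities, and training minimises the
# cross-entropy between y_hat and the one-hot target y_sat (toy numbers assumed).
import numpy as np

z = np.array([0.5, -1.0, 3.0, 0.1, 0.0])            # scores for a 5-word toy vocabulary
y_hat = np.exp(z - z.max())
y_hat /= y_hat.sum()                                 # y_hat = softmax(z)

sat_index = 2                                        # assumed position of "sat"
loss = -np.log(y_hat[sat_index])                     # cross-entropy with one-hot y_sat
print(y_hat, loss)                                   # loss is low only if y_hat[sat] is near 1
```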
W^T_{V×N}
[Figure: the learned input weight matrix contains the word vectors — each column of W^T (each row of W_{V×N}) is the embedding of one vocabulary word; this is the table we keep after training.]
Word analogies
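One hedged way to try the classic analogy (king − man + woman ≈ queen) is with gensim’s KeyedVectors; the pretrained-vector file name below is an assumption (any word2vec-format vectors will do):

```python
# Sketch of the analogy test using gensim's KeyedVectors (the file path is an
# assumption; download or substitute any word2vec-format pretrained vectors).
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# king - man + woman should land near "queen" in a well-trained embedding space.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```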
Word embeddings: properties
• We need a function W(word) that returns a vector encoding of that word (a toy lookup sketch follows below).
http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/
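A toy sketch of this interface, assuming a made-up vocabulary and embedding matrix:

```python
# Sketch of W(word): a lookup table from word to its row in an embedding
# matrix (the vocabulary and the random matrix are illustrative assumptions).
import numpy as np

vocab = {"cat": 0, "dog": 1, "mat": 2}
E = np.random.default_rng(0).normal(size=(len(vocab), 4))   # one 4-d vector per word

def W(word):
    return E[vocab[word]]

print(W("cat"))
```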
Skip gram
• Skip-gram – an alternative to CBOW
• Start from the embedding of a single (center) word and try to predict the context words (the surrounding words).
• Skip-gram works well with small datasets and can represent less frequent words better.
Skip gram
• Maps from a center word to a probability distribution over the surrounding words; the figure shows one input/output unit.
• There is no activation function on the hidden-layer neurons, but the output neurons use softmax (a forward-pass sketch follows below).
http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
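A minimal sketch of one skip-gram forward pass as described above, with toy sizes and random weights as assumptions:

```python
# Sketch of one skip-gram forward pass: the hidden layer is just the center
# word's embedding (no activation), the output layer is a softmax over the
# vocabulary (toy sizes and random weights are illustrative assumptions).
import numpy as np

V, N = 10, 4
rng = np.random.default_rng(1)
W_in = rng.normal(size=(V, N))    # center-word embeddings, one row per word
W_out = rng.normal(size=(N, V))   # output weights

def skipgram_context_probs(center_index):
    h = W_in[center_index]        # hidden layer = embedding lookup, no activation
    z = h @ W_out                 # one score per vocabulary word
    p = np.exp(z - z.max())
    return p / p.sum()            # softmax: P(context word | center word)

print(skipgram_context_probs(3))
```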
Skip gram example
• Vocabulary of 10,000 words.
• Embedding vectors with 300 features.
• So the hidden layer is represented by a weight matrix with 10,000 rows and 300 columns; the one-hot input vector, multiplied on the left, selects a single row (a quick size check follows below).
http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
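A quick size check of the numbers quoted above; the factor of two for the input and output matrices follows the usual two-matrix word2vec setup:

```python
# With a 10,000-word vocabulary and 300-dimensional embeddings, each weight
# matrix has 10,000 x 300 entries; there is one input and one output matrix.
V, N = 10_000, 300
per_matrix = V * N            # 3,000,000 weights per matrix
total = 2 * per_matrix        # 6,000,000 weights overall
print(per_matrix, total)
```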
Word2vec shortcomings
• Problem: 10,000 words and a 300-dimensional embedding give a large parameter space to learn, and 10K words is minimal for real applications.
[Figure: for the sentence “The quick brown fox jumps over the lazy dog.”, the center word “jumps” yields the training pairs (jumps, fox), (jumps, brown), (jumps, over), (jumps, the); with negative sampling, real context pairs get label 1 and randomly sampled words get label 0.]
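A sketch of how such labelled pairs could be generated, assuming a ±2-word window and, for brevity, drawing negatives from the sentence itself (real word2vec samples negatives from a smoothed unigram distribution over the whole vocabulary):

```python
# Sketch: build positive (center, context, 1) pairs from a window around the
# center word, plus a few random (center, word, 0) negatives (simplified).
import random

sentence = "the quick brown fox jumps over the lazy dog".split()
window, num_neg = 2, 2
center_i = sentence.index("jumps")

positive = [(sentence[center_i], sentence[j], 1)              # label 1: real context
            for j in range(center_i - window, center_i + window + 1)
            if j != center_i]
negative = [(sentence[center_i], random.choice(sentence), 0)  # label 0: random word
            for _ in range(num_neg)]

print(positive + negative)
```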
Word2vec improvements: selective updates
• Negative sampling: selecting 5-20 negative words works well for smaller datasets, and you can get away with only 2-5 words for large datasets.
• We only update the weights for our positive word (“jumps”), plus the weights for 5 other words that we want to output 0. That is a total of 6 output neurons, and 1,800 weight values in total — only 0.06% of the 3M weights in the output layer! (A sketch of one such update follows below.)
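A sketch of a single negative-sampling update under the sizes above; the learning rate and the sampled word indices are illustrative assumptions:

```python
# Sketch: one negative-sampling SGD step touches only the output vectors of
# the positive word and the sampled negatives (6 rows here), plus the single
# input row of the center word (indices and learning rate are assumptions).
import numpy as np

V, N = 10_000, 300
rng = np.random.default_rng(2)
W_in = rng.normal(scale=0.1, size=(V, N))    # input (center-word) vectors
W_out = rng.normal(scale=0.1, size=(V, N))   # one output vector per word

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

center, positive, negatives = 17, 42, [5, 99, 1234, 7, 2048]   # assumed indices
lr = 0.025
h = W_in[center]

grad_h = np.zeros(N)
for word, label in [(positive, 1.0)] + [(n, 0.0) for n in negatives]:
    g = sigmoid(h @ W_out[word]) - label     # gradient of the logistic loss
    grad_h += g * W_out[word]
    W_out[word] -= lr * g * h                # only these 6 output rows change
W_in[center] -= lr * grad_h                  # plus the single center-word row

print(6 * N)                                 # 1,800 output weights touched, as on the slide
```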
The Glove Model (2014)
• The advantage of GloVe is that, unlike Word2vec, it does not rely only on local statistics (the local context windows of words) but also incorporates global statistics (corpus-wide word co-occurrence counts) to obtain word vectors.
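For reference, the weighted least-squares objective from the GloVe paper (Pennington et al., 2014), which ties the word vectors to the global co-occurrence counts X_ij:

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
 (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
 1 & \text{otherwise}
\end{cases}
```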
FastText vs. Word2Vec