Word embedding
• Natural language processing (NLP) models do not work with plain text, so a numerical
representation is required.
• Word embedding is a class of techniques in which a word is represented as a vector of real
values.
• It is a representation of a word in a continuous vector space.
• It is a dense representation in a vector space.
• It can be represented in fewer dimensions than sparse representations like one-hot
encoding.
• Most word embedding methods are based on the "distributional hypothesis" of Zellig
Harris.
What is word embedding? continued
• The distributional hypothesis states that words that occur in the same contexts tend to have
similar meanings (Harris, 1954).
• Word embeddings are designed to capture similarities between words, such as meaning,
morphology, and context.
• The captured relationships help with downstream NLP tasks like chatbots, text
summarization, and information retrieval.
• Embeddings are generated using co-occurrence matrices, dimensionality reduction, and neural networks.
• They can be broadly categorized into two groups: frequency-based embeddings and prediction-
based embeddings.
• The earliest work to give words a vector representation was the vector space model used in
information retrieval.
Vector space model
        Term 1  Term 2  Term 3
Doc 1     0       5       5
Doc 2     2       0       1
Doc 3     3       3       0
• Each document gets a numerical vector representation in a vector space whose dimensions correspond to the terms (words).
• E.g.
• Doc 1 -> [0, 5, 5]
• Doc 2 -> [2, 0, 1]
• This representation is sparse in nature, because in real-life scenarios the vocabulary of a corpus
runs into millions of terms.
• It is based on term frequency.
• TF-IDF normalization is applied to reduce the weight of frequent words like "the" and "are" (a sketch follows this list).
• One-hot encoding is a similar technique to represent a sentence/document in vector space.
• This representation gathers limited information and fails to capture the context of a word.
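A minimal sketch of the term-frequency and TF-IDF document vectors described above, assuming scikit-learn is available; the two toy documents are illustrative:

```python
# Term-frequency and TF-IDF document vectors (scikit-learn; toy documents are illustrative).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["India won the match", "I like the match"]

counts = CountVectorizer().fit_transform(docs)   # sparse term-frequency matrix (docs x terms)
tfidf = TfidfVectorizer().fit_transform(docs)    # down-weights frequent words like "the"

print(counts.toarray())   # each row is a document vector over the vocabulary
print(tfidf.toarray())
```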
Co-occurrence matrix
• It captures the neighbouring words that appear with the word under
consideration. A context window is used to calculate co-occurrence.
• E.g.:
• India won the match. I like the match.
• The co-occurrence matrix for the above two sentences with a context window of 1 can be built as in the sketch below.
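A minimal sketch of building such a co-occurrence matrix, assuming simple whitespace tokenization; the window size and sentences follow the example above:

```python
# Build a word-word co-occurrence matrix with a context window of 1.
from collections import defaultdict

sentences = [["india", "won", "the", "match"], ["i", "like", "the", "match"]]
window = 1

cooc = defaultdict(int)
for sent in sentences:
    for i, word in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                cooc[(word, sent[j])] += 1

print(cooc[("the", "match")])   # "the" and "match" are neighbours in both sentences -> 2
```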
• Representations like one-hot encoding, count-based methods, and co-occurrence-matrix-
based methods are very sparse in nature.
• The context is either limited or absent altogether.
• They give a single representation for a word in every context.
• Relations between two words, such as semantic reasoning, are not possible with these
representations.
Figure: a word's one-hot encoded representation, e.g. India = [0, 1, 0, ..., 0], is fed into a word
embedding model, which produces a dense vector V(India) = [0.1, 2.3, -2.1, ..., 0.1].
• In this case, h is the representation from the hidden layer and V_{W_i} is the embedding of the word W_i.
• The inner product h^T V'_{W_i} generates the log probability of the word W_i.
Classical Neural language model
• Context is the co-occurrence of words: a sliding window around the word under
consideration.
Window size = 2; yellow patches are the words under consideration, orange boxes are the context window. A sketch of generating such windows is shown below.
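A minimal sketch of generating (centre word, context words) pairs with a sliding window, assuming a tokenized sentence; window size 2 as in the figure:

```python
# Generate (centre, context) pairs with a sliding window of size 2.
def context_windows(tokens, window=2):
    pairs = []
    for i, centre in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        pairs.append((centre, context))
    return pairs

print(context_windows(["india", "won", "the", "match"]))
# [('india', ['won', 'the']), ('won', ['india', 'the', 'match']), ...]
```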
CBOW continued
• The one-hot encodings of the context words W_{t-2}, ..., W_{t+2} are the input to the model.
• A projection matrix of shape V x D, where V is the total number of unique words in the corpus and
D is the dimension of the dense representation, projects each one-hot encoded vector into a D-
dimensional vector.
• The averaged context vector is projected back into V-dimensional space. A softmax layer converts this
representation into probabilities for W_t.
• The model is trained using the cross-entropy loss between the softmax output and the one-
hot encoded representation of W_t. A sketch of such a model is given below.
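A minimal CBOW sketch following the description above, assuming PyTorch; the vocabulary size, embedding dimension, and data pipeline are illustrative:

```python
# CBOW: average the context-word embeddings, project back to the vocabulary, train with cross-entropy.
import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.projection = nn.Embedding(vocab_size, embed_dim)  # the V x D projection matrix
        self.output = nn.Linear(embed_dim, vocab_size)         # project back to V dimensions

    def forward(self, context_ids):                    # context_ids: (batch, 2 * window)
        h = self.projection(context_ids).mean(dim=1)   # averaged context vector (batch, D)
        return self.output(h)                          # logits over the vocabulary

model = CBOW(vocab_size=10_000, embed_dim=300)
loss_fn = nn.CrossEntropyLoss()   # softmax + cross-entropy against the target word W_t
```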
Skip-gram model: High level
• Goal: to predict the context words W_{t-2}, ..., W_{t+2} given the word W_t.
Figure: the one-hot vector of W_t is multiplied by the projection matrix P to obtain a dense vector;
a context matrix of dimension D x V (shared across the C context positions) maps it back to vocabulary
scores, and a softmax layer with cross-entropy loss predicts the context words W_{t-2}, W_{t-1}, W_{t+1}, W_{t+2}.
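A minimal skip-gram sketch mirroring the figure, assuming PyTorch; names and dimensions are illustrative:

```python
# Skip-gram: embed the centre word, score every vocabulary word as a possible context word.
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.projection = nn.Embedding(vocab_size, embed_dim)          # centre-word embeddings (P)
        self.context = nn.Linear(embed_dim, vocab_size, bias=False)    # context matrix (D -> V)

    def forward(self, centre_ids):           # centre_ids: (batch,)
        v_c = self.projection(centre_ids)    # (batch, D)
        return self.context(v_c)             # logits over the vocabulary, reused for each context slot

model = SkipGram(vocab_size=10_000, embed_dim=300)
loss_fn = nn.CrossEntropyLoss()  # applied once per context position (W_{t-2}, ..., W_{t+2})
```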
Skip-gram continued
• An end-to-end flow of training:
Figure: the one-hot vector of W_t (V x 1) is multiplied by the transposed embedding matrix (D x V) to
obtain the centre-word vector v_c (D x 1); the context matrix u^T (V x D), shared across all context-word
predictions, gives scores u_o^T v_c (V x 1); a softmax turns these scores into probabilities, which are
compared with the ground-truth one-hot vector of the context word (e.g. W_{t-1}).
This illustration is taken from Manning's lectures on YouTube.
Skip-gram continued
• It focuses on optimizing the following loss for each word (a sketch follows the formulas):

P(o \mid c) = \frac{\exp(u_o^T v_c)}{\sum_{w=1}^{V} \exp(u_w^T v_c)}

J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\; j \ne 0} \log p(W_{t+j} \mid W_t)
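A minimal numpy sketch of the probability and loss above; the embedding matrices U, V and the (centre, context) pairs are illustrative assumptions:

```python
# Skip-gram softmax probability P(o|c) and the averaged negative log-likelihood J(theta).
import numpy as np

def p_o_given_c(o, c, U, V):
    """U[w] is the context vector u_w, V[c] is the centre vector v_c."""
    scores = U @ V[c]                 # u_w^T v_c for every word w in the vocabulary
    scores -= scores.max()            # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]

def skipgram_loss(pairs, U, V):
    """pairs: list of (centre index c, context index o) taken from the sliding windows."""
    return -np.mean([np.log(p_o_given_c(o, c, U, V)) for c, o in pairs])
```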
Example of nearest-neighbour results (word, similarity score):
  queen          0.7118
  monarch        0.619
  princess       0.5902
  crown_prince   0.55
  prince         0.54
GloVe
Figure: the word co-occurrence matrix (words x words) is factorized into a word-feature matrix (the
embedding matrix, words x features).
GloVe continued
• The authors presented a relation using "steam" and "ice" as target words.
• It is common for "steam" to occur with "gas" and for "ice" to occur with "solid".
• Other co-occurring words are "water" and "fashion": "water" shares properties with both, while "fashion" is
irrelevant to both.
• Only the ratio of probabilities cancels out noisy words like "water" and "fashion".
• As presented in the paper's table, the ratio of probabilities P(k|ice) / P(k|steam) is
high for k = solid and small for k = gas.
GloVe continued
F(W_i, W_j, \tilde{W}_k) = \frac{P_{ik}}{P_{jk}}

which leads to GloVe's weighted least-squares cost:

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( W_i^T \tilde{W}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2

• Where V is the size of the vocabulary, W_i and b_i are the vector and bias of the word W_i,
and \tilde{W}_j and \tilde{b}_j are the context vector and its bias. The last term, \log X_{ij}, is the
logarithm of the count of word i occurring in the context of word j, and f(X_{ij}) is a weighting
function that limits the influence of very frequent co-occurrences.
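A minimal numpy sketch of evaluating the GloVe cost above; the co-occurrence counts X and the parameter matrices are illustrative assumptions:

```python
# GloVe weighted least-squares cost over non-zero co-occurrence counts X[i, j].
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(X_ij) that caps the influence of very frequent pairs."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_cost(X, W, W_tilde, b, b_tilde):
    cost = 0.0
    for i, j in zip(*np.nonzero(X)):
        diff = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])
        cost += weight(X[i, j]) * diff ** 2
    return cost
```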
• Neither embedding can deal with out-of-vocabulary words.
• Both can capture context, but only in a limited sense.
• They always produce a single embedding for the word in consideration.
• They can't distinguish:
• "I went to a bank." and "I was standing at a river bank."
• They will always produce a single representation for both contexts.
• Both give better performance than encodings like TF-IDF and count vectors.
• Do pretrained models help the case?
• Pretrained models trained on a huge corpus show better performance than those trained on a small corpus.
• Pretrained Word2vec models [2] are available from Google and GloVe models [1] from Stanford's
website (a loading sketch is shown after the links below).
1. https://nlp.stanford.edu/projects/glove/
2. https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing
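A minimal sketch of loading pretrained embeddings, assuming gensim is installed; the downloader model name and the local word2vec file path are assumptions:

```python
# Load pretrained GloVe vectors via gensim's downloader and query nearest neighbours.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")        # assumed gensim-data model name
print(glove.most_similar("king", topn=5))

# For the Google News word2vec binary (downloaded separately from the link above):
from gensim.models import KeyedVectors
w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
```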
Fasttext
S(w, c) = \sum_{g \in G_w} z_g^T v_c

• Where G_w is the set of n-grams appearing in the word w and z_g is the vector representation of the
n-gram g.
• Learning n-gram vectors enables the model to build representations for out-of-
vocabulary words as well (see the sketch below).
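A minimal sketch of training FastText and querying an out-of-vocabulary word, assuming gensim; the toy corpus and hyperparameters are illustrative:

```python
# FastText builds word vectors from character n-grams, so unseen words still get embeddings.
from gensim.models import FastText

sentences = [["india", "won", "the", "match"], ["i", "like", "the", "match"]]
model = FastText(sentences, vector_size=50, window=1, min_count=1, min_n=3, max_n=5)

print(model.wv["matches"].shape)                # "matches" was never seen, but its n-grams were
print(model.wv.most_similar("match", topn=3))
```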
Fasttext results
Figure: top-5 most similar word lists.
• None of these representations can capture contextual representations, i.e. representations
based on the usage of a word.
• They are based on a dictionary look-up to get the embeddings.
• They show limited performance on tasks such as question answering and summarization compared to
current state-of-the-art models like ELMo (LSTM-based) and BERT (transformer-based).