
Word embedding

What is Word Embedding?

• Natural language processing (NLP) models do not work with plain text, so a numerical representation of words is required.
• Word embedding is a class of techniques in which each word is represented as a real-valued vector.
• It is a representation of words in a continuous vector space.
• It is a dense representation in that vector space.
• It needs far fewer dimensions than sparse representations such as one-hot encoding.
• Most word embedding methods are based on the "distributional hypothesis" of Zellig Harris.
What is word embedding? continued

• The distributional hypothesis states that words that occur in the same contexts tend to have similar meanings (Harris, 1954).
• Word embeddings are designed to capture similarity between words in aspects such as meaning, morphology, and context.
• The captured relationships help with downstream NLP tasks such as chatbots, text summarization, and information retrieval.
• Embeddings are generated via co-occurrence matrices, dimensionality reduction, and neural networks.
• They can be broadly categorized into two groups: frequency-based embeddings and prediction-based embeddings.
• The earliest work to give words a vector representation was the vector space model used in information retrieval.
Vector space model

• A document is represented as a vector in a vector space.

• The dimensionality of the vector space equals the number of unique words in the corpus.

              Term 1   Term 2   Term 3
    Doc 1       0        5        5
    Doc 2       2        0        1
    Doc 3       3        3        0

• Hypothetical corpus with three words represented as dimensions.
• The three documents are projected into the vector space according to their term frequencies.

[Figure: Doc 1, Doc 2 and Doc 3 plotted as points in the three-dimensional space spanned by Term 1, Term 2 and Term 3]
Vector space model continued

• Each document gets a numerical vector representation in a vector space whose dimensions are words.
• E.g.
• Doc 1 -> [0, 5, 5]
• Doc 2 -> [2, 0, 1]

• This representation is sparse in nature, because in real-life scenarios the dimensionality of a corpus runs into millions.
• It is based on term frequency.
• TF-IDF normalization is applied to reduce the weight of frequent words such as 'the' and 'are' (see the sketch below).
• One-hot encoding is a similar technique for representing a sentence or document in vector space.
• This representation gathers limited information and fails to capture the context of a word.
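As a concrete illustration, here is a minimal sketch of term-frequency and TF-IDF document vectors using scikit-learn; the library choice and the toy corpus are assumptions for illustration, not part of the original slides.

```python
# A minimal sketch of term-frequency and TF-IDF document vectors.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "india won the match",
    "i like the match",
]

# Raw term-frequency vectors: one dimension per unique word in the corpus.
tf = CountVectorizer()
tf_matrix = tf.fit_transform(docs)            # shape: (n_docs, vocabulary_size)
print(tf.get_feature_names_out())
print(tf_matrix.toarray())

# TF-IDF down-weights words that occur in many documents (e.g. "the").
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(docs).toarray())
```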
Co-occurrence matrix

• It captures the neighbouring words that appear with the word under consideration. A context window is used to compute co-occurrence (a construction sketch follows after the matrix).
• E.g.:
• India won the match. I like the match.
• Co-occurrence matrix for the above two sentences with a context window of 1:

            India   won   the   match   I   like
    India     1      1     0      0     0    0
    won       1      1     1      0     0    0
    the       0      1     1      1     0    1
    match     0      0     1      1     0    0
    I         0      0     0      0     1    1
    like      0      0     1      0     1    1
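A small pure-Python sketch of building such a window-1 co-occurrence matrix; unlike the table above it counts only neighbouring words, not a word with itself, and the tokenization is simplified for illustration.

```python
# Build a co-occurrence matrix over a toy corpus with a symmetric window.
sentences = [["india", "won", "the", "match"], ["i", "like", "the", "match"]]
window = 1

vocab = sorted({w for s in sentences for w in s})
index = {w: i for i, w in enumerate(vocab)}
cooc = [[0] * len(vocab) for _ in vocab]

for sent in sentences:
    for pos, word in enumerate(sent):
        lo, hi = max(0, pos - window), min(len(sent), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:                      # skip the word itself
                cooc[index[word]][index[sent[ctx_pos]]] += 1

print(vocab)
for row in cooc:
    print(row)
```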
Co-occurrence matrix continued

• Representations such as one-hot encoding, count-based methods, and co-occurrence-matrix-based methods are very sparse in nature.
• Context is either limited or absent altogether.
• There is a single representation for a word in every context.
• Relationships between two words, such as semantic reasoning, are not possible with these representations.

• Context is limited and predetermined.

• Long-term dependencies are not captured.
Prediction based word embeddings

• It is a method to learn a dense representation of a word from a very high-dimensional sparse representation.

[Figure: the one-hot encoded representation of "India" = [0, 1, 0, ..., 0] is fed into a word embedding model, which outputs the dense vector V(India) = [0.1, 2.3, -2.1, ..., 0.1]]

• It is a modular representation, where a sparse vector is fed in to generate a dense representation.
Language modelling

• Word embedding models are very closely related to language modelling.

• Language modelling tries to learn a probability distribution over the words in a vocabulary V.
• The prime task of a language model is to calculate the probability of a word W_i given the previous (n-1) words, mathematically P(W_i | W_{i-1}, ..., W_{i-n+1}).
• Probabilities over n-grams are calculated from the frequencies of their constituent n-grams.
• In a neural network we achieve the same thing using a softmax layer.
• We compute a score for W_i and normalize it by the sum of the exponentiated scores over all words (a small numerical sketch follows below):

    P(W_i | W_{i-1}, ..., W_{i-n+1}) = exp(h^T V'_{W_i}) / Σ_{w ∈ V} exp(h^T V'_w)

• Here h is the representation from the hidden layer and V'_{W_i} is the output embedding of word W_i.
• The inner product h^T V'_{W_i} gives the (unnormalized) log probability of word W_i.
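To make the normalization concrete, here is a small numpy sketch of how the scores h^T V'_w are turned into a probability distribution over the vocabulary; the dimensions and random values are illustrative assumptions.

```python
# Softmax over the vocabulary given a hidden-layer representation h.
import numpy as np

vocab_size, hidden_dim = 10, 4
h = np.random.randn(hidden_dim)                  # hidden-layer representation
V_out = np.random.randn(vocab_size, hidden_dim)  # output embeddings V'

scores = V_out @ h                               # one score per word in V
scores -= scores.max()                           # subtract max for numerical stability
probs = np.exp(scores) / np.exp(scores).sum()    # softmax
print(probs.sum())                               # 1.0: a valid distribution over V
```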
Classical Neural language model

• It was proposed by Bengio et al., 2003.

• It consists of a one-layer feed-forward neural network that predicts the next word in a sequence.
• The model tries to maximize the probability as computed by the softmax.

• Bengio et al. introduced three concepts:

• Embedding layer: a layer that generates word embeddings by multiplying an index vector with a word embedding matrix.
Classical Neural language model continued

• Intermediate layers: one or more layers that produce an intermediate representation of the input, e.g. a fully-connected layer that applies a non-linearity to the concatenation of the word embeddings of the n previous words.
• Softmax layer: the final layer that produces a probability distribution over the words in V.
• The intermediate layer can be replaced with an LSTM.
• The network has a computational bottleneck in the softmax layer, where a probability over the entire vocabulary must be computed. (A rough sketch of such a network follows below.)
• Neural word embedding models made significant progress with the Word2vec model proposed by Mikolov et al. in 2013.
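The following is a rough PyTorch sketch of such a feed-forward language model; the layer sizes, names, and the tanh non-linearity are illustrative assumptions rather than the exact configuration of Bengio et al.

```python
# A Bengio-style feed-forward neural language model sketch.
import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, context_size=3, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)       # embedding layer
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)  # intermediate layer
        self.out = nn.Linear(hidden_dim, vocab_size)            # scores fed to softmax

    def forward(self, context_ids):                             # (batch, context_size)
        e = self.embed(context_ids).flatten(1)                  # concatenate context embeddings
        h = torch.tanh(self.hidden(e))                          # non-linearity
        return self.out(h)                                      # logits over V

model = FeedForwardLM(vocab_size=1000)
logits = model(torch.randint(0, 1000, (8, 3)))                  # batch of 8 contexts
loss = nn.functional.cross_entropy(logits, torch.randint(0, 1000, (8,)))
```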
Word2Vec

• It was proposed by Mikolov et al. in 2013.

• It is a two-layer shallow neural network trained to learn contextual relationships between words.
• It places contextually similar words near each other.
• It is a co-occurrence-based model.
• Two variants of the model were proposed (a minimal gensim sketch of both follows below):
• Continuous bag of words model (CBOW)
• Given the context words, predict the center word.
• The order of the context words is not considered, so this representation is similar to BOW.
• Skip-gram model
• Given the center word, predict the context words.
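A minimal gensim sketch of the two variants; the toy corpus and hyperparameters are assumptions for illustration. In gensim's Word2Vec, sg=0 selects CBOW and sg=1 selects Skip-gram.

```python
# Train tiny CBOW and Skip-gram models on a toy corpus with gensim.
from gensim.models import Word2Vec

sentences = [["india", "won", "the", "match"], ["i", "like", "the", "match"]]

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)      # CBOW
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # Skip-gram

print(cbow.wv["match"][:5])                    # dense 50-dimensional vector (first 5 values)
print(skipgram.wv.most_similar("match", topn=2))
```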
What does context mean?

• Context means co-occurrence of words: a sliding window around the word under consideration.

[Figure: the sentence "India is now inching towards a self reliant state" shown eight times, once for each position of the sliding context window]

Window size = 2; the highlighted word is the word under consideration and the box around it is its context window.
CBOW continued

• Goal: Predict the center word, given the context words.


[Figure: CBOW architecture. The one-hot vectors of the context words W_{t-2}, W_{t-1}, W_{t+1}, W_{t+2} are each multiplied by a projection matrix P of shape V x D (to be learned); the projected context vectors are averaged into C; C is multiplied by an output projection matrix M of dimension D x V; a softmax layer over C.M is compared with the one-hot vector of W_t using the cross-entropy loss]
CBOW continued

• One-hot encodings of the context words W_{t-2}, ..., W_{t+2} are the input to the model.
• A projection matrix of shape V x D, where V is the total number of unique words in the corpus and D is the dimension of the dense representation, projects each one-hot encoded vector into a D-dimensional vector.
• The averaged context vector is projected back into the V-dimensional space, and a softmax layer converts this representation into probabilities for W_t (a numpy sketch of this forward pass follows below).
• The model is trained using the cross-entropy loss between the softmax output and the one-hot encoded representation of W_t.
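A numpy sketch of one CBOW forward pass under the notation above; the matrices here are randomly initialised purely for illustration, not learned values.

```python
# One CBOW forward pass: average projected context vectors, project back, softmax.
import numpy as np

V, D = 6, 4                                   # vocabulary size, embedding dimension
P = np.random.randn(V, D)                     # projection matrix (V x D), to be learned
M = np.random.randn(D, V)                     # output projection matrix (D x V)

context_ids = [0, 1, 3, 4]                    # indices of W_{t-2}, W_{t-1}, W_{t+1}, W_{t+2}
target_id = 2                                 # index of the center word W_t

C = P[context_ids].mean(axis=0)               # one-hot x P = row lookup; then average
scores = C @ M                                # back to the V-dimensional space
probs = np.exp(scores - scores.max())
probs /= probs.sum()                          # softmax over the vocabulary

loss = -np.log(probs[target_id])              # cross-entropy against the one-hot W_t
print(loss)
```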
Skip-gram model: High level

• Goal: predict the context words W_{t-2}, ..., W_{t+2} given the center word W_t.

[Figure: Skip-gram architecture. The one-hot vector of W_t is multiplied by a projection matrix P of shape V x D (to be learned), giving the center vector C; C is multiplied by an output projection matrix M of dimension D x V; a softmax layer over C.M is compared with the one-hot vectors of W_{t-2}, W_{t-1}, W_{t+1}, W_{t+2} using the cross-entropy loss]
Skip-gram continued
• An end-to-end flow of training:

[Figure: one Skip-gram training step. The one-hot vector of W_t (V x 1) selects a column of the transposed embedding matrix (D x V), giving the center-word vector v_c (D x 1); the context matrix u_o^T (V x D), shared across all context-word predictions, multiplies v_c to give the scores u_o^T v_c (V x 1); softmax(u_o^T v_c) is compared against the ground-truth one-hot vector of a context word, e.g. W_{t-1}]

This representation is taken from the lectures of Manning on YouTube.
Skip-gram continued
• It focuses on optimizing the loss for each word:

    P(o | c) = exp(u_o^T v_c) / Σ_{w=1}^{V} exp(u_w^T v_c)

• It calculates the probability of an output context word o given the center word c.

• The loss function it tries to minimize is:

    J(θ) = -(1/T) Σ_{t=1}^{T} Σ_{-m ≤ j ≤ m, j ≠ 0} log p(W_{t+j} | W_t)

• The log probability is calculated as in the first equation above.

• Naive training is costly because each gradient calculation is of order V.
• Two computationally efficient methods were proposed:
• Hierarchical softmax
• Negative sampling
Skip-gram continued

• Training with negative sampling is the more prevalent approach.

• In the earlier sliding-window example, tuples like (India, is) and (India, now) are examples of true cases.
• Any corrupted tuple is called a negative sample, e.g. (India, reliant), (India, state).
• With a modified objective function, training becomes a logistic regression that classifies a tuple as a true combination or a corrupt one (a sketch of this objective follows below).
• Corrupt tuples are generated by sampling such that less frequent words are picked more often than their raw frequency would suggest.
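A numpy sketch of the negative-sampling objective for one (center, context) pair; the vectors and the number of negative samples are illustrative assumptions.

```python
# Negative-sampling loss for a single training pair.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

D = 8
v_c = np.random.randn(D)                      # center-word vector, e.g. "india"
u_true = np.random.randn(D)                   # true context vector, e.g. "won"
u_neg = np.random.randn(5, D)                 # 5 sampled negative context vectors

# Push the true pair's score up and the corrupt pairs' scores down,
# exactly like a binary logistic-regression classifier over tuples.
loss = -np.log(sigmoid(u_true @ v_c)) - np.log(sigmoid(-u_neg @ v_c)).sum()
print(loss)
```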
Word Embedding visualization

Top 5 similar words:

    crude     barrel      0.548
    crude     oteiba      0.464
    crude     netback     0.45
    crude     refinery    0.438
    crude     pipeline    0.421

    ship      vessel      0.623
    ship      port        0.575
    ship      tanker      0.496
    ship      navigation  0.471
    ship      crane       0.463

    computer  software    0.602
    computer  micro       0.559
    computer  printer     0.542
    computer  mainframe   0.538
    computer  hemdale     0.527

• Even with a smaller corpus it can capture semantically relevant words.

t-SNE 2D projection of Word2vec (gensim implementation) embeddings of the top 10 similar words, trained for 50 epochs on the Reuters news corpus from NLTK, with context length 15 and vector dimension 100.
Word2vec results

• The top 5 similar words when:

    Context length is 30             Embedding dimension is 300
    crude     barrel       0.475     crude     refinery      0.27
    crude     refinery     0.438     crude     stockdraws    0.254
    crude     stockdraws   0.427     crude     barrel        0.244
    crude     yates        0.408     crude     utilized      0.242
    crude     utilized     0.382     crude     liquefied     0.239
    ship      vessel       0.557     ship      vessel        0.468
    ship      tanker       0.506     ship      crew          0.318
    ship      port         0.5       ship      tanker        0.308
    ship      icebreaker   0.461     ship      shipbuilder   0.302
    ship      loaded       0.453     ship      yard          0.288
    computer  software     0.569     computer  software      0.441
    computer  micro        0.517     computer  disk          0.345
    computer  memory       0.498     computer  printer       0.345
    computer  disk         0.495     computer  uccel         0.338
    computer  printer      0.476     computer  scientific    0.335
Analogies
• Representation of an analogy in the vector space of word2vec vectors:

• The vector representation of "King - man + woman" is roughly equivalent to the vector representation of "queen".
• Using gensim and a pretrained word2vec model, the 5 words closest to the analogy vector "King - man + woman" are listed below (a query sketch follows after the table):
queen 0.7118
monarch 0.619
princess 0.5902
crown_prince 0.55
prince 0.54

Image taken from https://jalammar.github.io/illustrated-word2vec/
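A gensim sketch of the query described above, assuming the pretrained Google News vectors available through gensim's downloader; the model identifier "word2vec-google-news-300" is the gensim-data name and the download is large.

```python
# Query the "king - man + woman" analogy against pretrained Word2vec vectors.
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")     # pretrained Google News Word2vec vectors
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=5))
# Expected to rank words like "queen" and "monarch" near the top.
```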


GloVe: Global Vectors for Word Representation

• It is an unsupervised learning method for learning word representations.

• It is based on a co-occurrence matrix.
• The co-occurrence matrix is built over the whole corpus.
• It is therefore able to capture global context.
• It combines the best of two model families:
• Local context window methods
• Global matrix factorization
• Earlier, matrix factorization methods such as LSA were used to reduce dimensionality.
• Two things that the GloVe model captures:
• A statistical measure, via the co-occurrence matrix
• Context, by considering the neighbouring words
Glove continued

• It moves away from plain matrix factorization: by considering relational reasoning (semantic and syntactic), GloVe tries to learn the representation of words.
• It can be represented as a factorization of the co-occurrence matrix:

[Figure: word co-occurrence matrix (words x contexts) = word-feature matrix, i.e. the embedding matrix (words x features), multiplied by a feature-context matrix (features x contexts)]
GloVe continued

• How does GloVe learn embeddings?

• It treats word-word co-occurrence probabilities as the signal for the relationship between words.

• The authors present an example with "ice" and "steam" as target words.
• It is common for "steam" to co-occur with "gas" and for "ice" to co-occur with "solid".
• Other co-occurring words are "water" and "fashion": "water" shares properties with both, while "fashion" is irrelevant to both.
• Only the ratio of probabilities cancels out noisy words like "water" and "fashion".
• The ratio P(k|ice) / P(k|steam) is large for k = solid and small for k = gas.
GloVe continued

• What is the optimization function for GloVe?

• In a co-occurrence matrix X, the entry X_ij is the number of times word j co-occurs with word i.
• X_i = Σ_k X_ik is the total number of times any word appears in the context of word i.
• P_ij = P(j|i) = X_ij / X_i is the probability of word j appearing in the context of word i.
• For a combination of three words i, j, k, a general form of the model is:

    F(W_i, W_j, ˜W_k) = P_ik / P_jk

• The optimization function proposed by the authors is:

    J = Σ_{i,j=1}^{V} f(X_ij) (W_i^T ˜W_j + b_i + ˜b_j - log X_ij)^2
Glove Continued

• Here V is the size of the vocabulary, W_i and b_i are the vector and bias of word i, and ˜W_j and ˜b_j are the context vector and its bias. The last term, log X_ij, is the log of the co-occurrence count of word j in the context of word i.

• The weighting function f(X) should have the following properties:

• It tends to zero as X → 0.
• It should be non-decreasing, so that rare co-occurrences are not overweighted.
• It should not overweight frequent co-occurrences.
• The chosen f(X) is (a small sketch follows below):

    f(X) = (X / X_max)^α   if X < X_max,   otherwise 1
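A small Python sketch of the weighting function and a single term of the objective J; the values x_max = 100 and alpha = 0.75 follow the choices reported in the GloVe paper, and the random vectors are purely illustrative.

```python
# GloVe weighting function f(X) and one pairwise term of the objective J.
import numpy as np

def f(x, x_max=100.0, alpha=0.75):
    # Goes to 0 as x -> 0, caps at 1 for frequent co-occurrences.
    return (x / x_max) ** alpha if x < x_max else 1.0

def pair_loss(w_i, w_tilde_j, b_i, b_j, x_ij):
    # f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2
    return f(x_ij) * (w_i @ w_tilde_j + b_i + b_j - np.log(x_ij)) ** 2

w_i, w_tilde_j = np.random.randn(50), np.random.randn(50)   # word and context vectors
print(pair_loss(w_i, w_tilde_j, 0.0, 0.0, x_ij=12.0))
```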


Glove Continued

• The model has the following computational bottlenecks:

• It requires building a large co-occurrence matrix of size V x V.
• The model's computational complexity depends on the number of non-zero elements in the matrix.
• During training the context window needs to be sufficiently large, and the model distinguishes left context from right context.
• Words that are more distant from each other contribute less to the count, because distant words contribute less to the relationship between the words.
• The model generates two sets of vectors, W and ˜W. Their average is used as the final representation of the words.
Glove results

Top 5 most relevant word list:

    crude     barrel      0.752
    crude     posting     0.58
    crude     raise       0.537
    crude     light       0.505
    crude     sour        0.502

    ship      loading     0.58
    ship      kuwaiti     0.54
    ship      missile     0.537
    ship      vessel      0.522
    ship      flag        0.522

    computer  wallace     0.595
    computer  software    0.592
    computer  microfilm   0.559
    computer  microchip   0.536
    computer  technology  0.52

t-SNE 2D projection of GloVe embeddings of the top 5 similar words, trained for 50 epochs on the Reuters news corpus from NLTK, with context length 15 and vector dimension 100.
Are Word2vec and GloVe enough?

• Neither embedding can deal with out-of-vocabulary words.
• Both capture context, but only in a limited sense.
• They always produce a single embedding for the word under consideration.
• They cannot distinguish:
• "I went to a bank." and "I was standing at a river bank."
• They will always produce a single representation for both contexts.
• Both give better performance than encodings such as TF-IDF and count vectors.
• Do pretrained models help?
• Models pretrained on a huge corpus show better performance than those trained on a small corpus.
• Pretrained Word2vec models2 are available from Google and GloVe models1 are available on Stanford's website.

1. https://nlp.stanford.edu/projects/glove/
2. https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing
Fasttext

• It was proposed by the Facebook AI team.

• It was primarily meant to handle the out-of-vocabulary issue of GloVe and Word2vec.
• It is an extension of Word2vec.
• The model relies on character n-grams rather than whole words to generate the embeddings.
• The model exploits the morphological features of a word.
• The character n-grams of a word can be represented as below:
• For the word <where> and n = 3, the character n-grams are:
• <wh, whe, her, ere, re>
• The final representation of the word "where" is the sum of the vector representations of <wh, whe, her, ere, re>. (A small n-gram extraction sketch follows below.)
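A small sketch of this character n-gram extraction; the "<" and ">" boundary markers follow the convention shown above.

```python
# Extract fastText-style character n-grams from a word.
def char_ngrams(word, n=3):
    padded = f"<{word}>"                               # add boundary markers
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("where"))   # ['<wh', 'whe', 'her', 'ere', 're>']
```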
Fasttext continued

• The modified scoring function is:

    s(w, c) = Σ_{g ∈ G_w} z_g^T v_c

• Here G_w is the set of n-grams of word w, z_g is the vector representation of n-gram g, and v_c is the context-word vector.
• Learning n-gram vectors enables the model to produce representations for out-of-vocabulary words as well (a minimal gensim sketch follows below).
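A minimal gensim FastText sketch of this out-of-vocabulary behaviour; the toy corpus and hyperparameters are assumptions for illustration.

```python
# Train a tiny FastText model and query a word that is not in its vocabulary.
from gensim.models import FastText

sentences = [["the", "ship", "reached", "the", "port"],
             ["a", "vessel", "left", "the", "shipyard"]]

model = FastText(sentences, vector_size=50, window=3, min_count=1, min_n=3, max_n=5)

# "shipping" never appears in the corpus, but its character n-grams overlap
# with words that do, so a vector can still be composed for it.
print(model.wv["shipping"][:5])
print("shipping" in model.wv.key_to_index)    # False: not in the trained vocabulary
```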
Fasttext results
Top 5 most relevant word list:

    crude     cruz           0.582
    crude     barrel         0.561
    crude     cruise         0.501
    crude     crumble        0.433
    crude     jude           0.41

    ship      shipyard       0.714
    ship      steamship      0.703
    ship      shipowner      0.688
    ship      shipper        0.668
    ship      vessel         0.667

    computer  supercomputer  0.843
    computer  computerized   0.823
    computer  computerland   0.773
    computer  software       0.54
    computer  microfilm      0.52

t-SNE 2D projection of fastText embeddings (gensim implementation) of the top 15 similar words, trained for 50 epochs on the Reuters news corpus from NLTK, with context length 15 and vector dimension 100.
Observation

• None of these representations can capture a contextual representation, i.e. a representation that depends on how the word is used.
• They rely on a dictionary lookup to get the embeddings.
• They show limited performance on tasks such as question answering and summarization compared to current state-of-the-art models like ELMo (LSTM-based) and BERT (transformer-based).
Reference:
word2vec | Text | TensorFlow
