Word Embeddings

Word2Vec is a model that produces word embeddings by learning vector representations of words from large amounts of text, such that words sharing common contexts in the text are located close to one another in the vector space. It works by training a shallow neural network to predict a word from its surrounding context words, or the context words from the word. There are two main architectures: Continuous Bag-of-Words (CBOW) and Skip-gram. CBOW predicts a target word from the surrounding context words, while Skip-gram predicts the surrounding context words from a target word. The resulting word vector representations are useful for a variety of natural language processing tasks.


Word Embedding

Nguyen Van Vinh


VNU-UET

The important idea: a model of meaning that focuses on similarity.
Why model word meaning in NLP?
Content
• Representing words
• Word2Vec
• Application of Word2Vec
How do we represent the meaning of a word?
Definition: meaning (Webster dictionary)
• the idea that is represented by a word, phrase, etc.
• the idea that a person wants to express by using
words, signs, etc.
• the idea that is expressed in a work of writing, art,
etc.
How do we have usable meaning in a computer?
Problems with resources like WordNet
Representing words as discrete symbols
Problem with words as discrete symbols
Representing words by their context
• Distributional semantics: A word’s meaning is given by the words
that frequently appear close-by
• “You shall know a word by the company it keeps” (J. R. Firth 1957: 11)
• “Words which frequently appear in similar contexts have similar meaning”
• One of the most successful ideas of modern statistical NLP!

• When a word w appears in a text, its context is the set of words that appear nearby (within a fixed-size window)
• Use the many contexts of w to build up a representation of w
• Example: the context words observed around banking will come to represent banking


Distributed representation

• A vector representation that encodes information about the distribution of contexts a word appears in
• Words that appear in similar contexts have similar representations
• There are several different ways to encode the notion of “context”
Term-document matrix
• Context = appearing in the same document.
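To make this concrete, here is a minimal sketch (not from the slides; the toy documents and the use of scikit-learn's CountVectorizer, assuming scikit-learn 1.x, are my own additions) of building a term-document matrix, where each word's vector is its row of counts across documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three toy documents, invented purely for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stocks fell as the bank raised rates",
]

vectorizer = CountVectorizer()
doc_term = vectorizer.fit_transform(docs)        # shape: (number of documents, |V|)

# Transpose to get the term-document matrix: one row of counts per word.
term_doc = doc_term.T.toarray()
for word, row in zip(vectorizer.get_feature_names_out(), term_doc):
    print(f"{word:>8}: {row}")
```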
Measuring similarity

• We can calculate the cosine similarity of two vectors to judge the degree of their similarity [Salton 1971]
• Cosine similarity measures the orientation of the vectors, not their magnitude
• A common similarity metric: the cosine of the angle between the two vectors (the larger it is, the more similar the two vectors are)
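A small NumPy sketch of cosine similarity (the context-count vectors below are invented for illustration):

```python
import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (||u|| * ||v||): depends only on orientation, not length.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy context-count vectors for three words (made-up numbers).
v_cat = np.array([2.0, 1.0, 0.0])
v_dog = np.array([1.0, 1.0, 0.0])
v_bank = np.array([0.0, 0.0, 3.0])

print(cosine_similarity(v_cat, v_dog))    # close to 1: similar contexts
print(cosine_similarity(v_cat, v_bank))   # 0 here: no shared contexts
```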
Sparse vs dense vectors

• The vectors in the word-word occurrence matrix are:
  • Long: length equal to the vocabulary size
  • Sparse: most entries are 0
• Alternative: we want to represent words as short (50-300 dimensional) & dense (real-valued) vectors
• These dense vectors are the basis for modern NLP systems
Word vectors
• We will build a dense vector for each word, chosen so that it is similar to vectors of words that appear in similar contexts

• Note: word vectors are sometimes called word embeddings or word representations. They are a distributed representation.
Word embeddings: idea

• Idea: we have to put information about contexts into word vectors
• How: learn word vectors by teaching them to predict contexts
• Prior work:
• Learning representations by back-propagating errors. (Rumelhart et al., 1986)
• A neural probabilistic language model (Bengio et al., 2003)
• NLP (almost) from Scratch (Collobert & Weston, 2008)
• A recent, even simpler and faster model: word2vec (Mikolov et al. 2013)
Language model based on neural networks (Bengio et al., 2003)
Source: https://cs230.stanford.edu/files/C5M2.pdf
Word embeddings: similarity

• Hope to have similar words nearby

Word embeddings: relationships

• Hope to preserve some language structure (relationships between words).
Word embeddings: questions

• How big should the embedding space be?
  • Trade-offs like any other machine learning problem: greater capacity versus efficiency and overfitting.
• How do we find the W matrix (the word vectors)?
  • Often as part of a prediction or classification task involving neighboring words.
Word2vec: Overview

• Word2vec (Mikolov et al. 2013) is a framework for learning word vectors
• Example windows and process for computing P(w_{t+j} | w_t): slide a fixed-size window over the text, take the center word w_t, and predict each context word w_{t+j} within the window
word2vec

• Predict words using context
• Two versions: CBOW (continuous bag of words) and Skip-gram

https://skymind.ai/wiki/word2vec
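As a rough sketch (not part of the slides), both variants are implemented in the gensim library; assuming gensim 4.x, where the relevant parameters are vector_size and sg, training looks roughly like this:

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences, invented for illustration.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["cats", "and", "dogs", "are", "animals"],
]

# sg=0 -> CBOW (predict the center word from its context);
# sg=1 -> Skip-gram (predict the context words from the center word).
cbow = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1)

print(skipgram.wv["cat"][:5])             # first few dimensions of the learned vector
print(skipgram.wv.most_similar("cat"))    # nearest neighbours by cosine similarity
```

A real run would of course use a much larger corpus than these toy sentences.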
Skip gram/CBOW intuition

• Two words with similar “contexts” (that is, the words that are likely to appear around them) should end up with similar embeddings.
• One way for the network to output similar context predictions for these two words is if the word vectors are similar. So, if two words have similar contexts, the network is motivated to learn similar word vectors for these two words!

https://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
CBOW

• Bag of words
  • Discards word order. Used in the discrete case with counts of the words that appear.
• CBOW
  • Takes the vector embeddings of the n words before the target and the n words after, and adds them together (as vectors).
  • Also discards word order, but the vector sum is meaningful enough to deduce the missing word.
Word2vec – Continuous Bag-of-Words

• Example: “The cat sat on floor”
• Window size = 2: to predict the center word “sat”, the context words within the window are “the”, “cat”, “on”, and “floor” (the diagrams that follow show two of them, “cat” and “on”, for simplicity)

Source: www.cs.ucr.edu/~vagelis/classes/CS242/slides/word2vec.pptx
[Figure: CBOW network for this example. The context words “cat” and “on” enter as V-dimensional one-hot input vectors, are mapped to a hidden layer, and the output layer produces a prediction for the one-hot vector of “sat”.]

We must learn W and W’
[Figure: the same CBOW network with dimensions. Each V-dimensional one-hot input is multiplied by the input weight matrix W (V×N) to give the N-dimensional hidden layer; the output weight matrix W′ (N×V) maps the hidden layer back to a V-dimensional output for “sat”. N will be the size of the word vector.]
Wᵀ × x_cat = v_cat

[Figure: because x_cat is one-hot, multiplying the N×V matrix Wᵀ by x_cat simply selects the column of Wᵀ (i.e., the row of W) that stores the word vector v_cat. The hidden layer is the average of the context word vectors: v̂ = (v_cat + v_on) / 2.]
Wᵀ × x_on = v_on

[Figure: the same lookup for the second context word: multiplying Wᵀ by the one-hot vector x_on selects the column that stores v_on. Again, the hidden layer is v̂ = (v_cat + v_on) / 2.]
[Figure: from the N-dimensional hidden vector v̂, the output layer computes the score vector z = W′ᵀ × v̂ and the prediction ŷ = softmax(z), a V-dimensional probability distribution over the vocabulary. N will be the size of the word vector.]
We would prefer ŷ to be close to y_sat (the one-hot vector for “sat”).

[Figure: the softmax output ŷ, with most of its probability mass (0.7 in the illustration) on the entry for “sat” and small values elsewhere.]
[Figure: after training, the matrix W (V×N) contains the word vectors: each row of W is the vector of one vocabulary word. W′ likewise associates a second vector with each word.]

• We can consider either W or W′ as the word’s representation, or even take the average of the two.
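To tie the walkthrough above together, here is a minimal NumPy sketch of one CBOW forward pass for the “The cat sat on floor” example (the random weights, tiny vocabulary, and cross-entropy loss are my additions; gradient updates are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "cat", "sat", "on", "floor"]
word2id = {w: i for i, w in enumerate(vocab)}
V, N = len(vocab), 3                          # vocabulary size, word-vector size

W = rng.normal(scale=0.1, size=(V, N))        # input weights  (V x N): rows are word vectors
W_prime = rng.normal(scale=0.1, size=(N, V))  # output weights (N x V)

def one_hot(word):
    x = np.zeros(V)
    x[word2id[word]] = 1.0
    return x

context, target = ["the", "cat", "on", "floor"], "sat"

# Hidden layer: average of the context word vectors (W.T @ x is just a row lookup in W).
v_hat = np.mean([W.T @ one_hot(w) for w in context], axis=0)     # shape (N,)

# Output layer: one score per vocabulary word, then softmax.
z = W_prime.T @ v_hat                                            # shape (V,)
y_hat = np.exp(z - z.max()) / np.exp(z - z.max()).sum()

# Cross-entropy loss against the true center word "sat".
loss = -np.log(y_hat[word2id[target]])
print(y_hat.round(3), float(loss))
```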
Some interesting results

Word analogies

Word embeddings: properties
• Need to have a function W(word) that returns a vector encoding that word.

• Similarity of words corresponds to nearby vectors.
  • director – chairman, scratched – scraped
• Relationships between words correspond to differences between vectors.
  • big – bigger, small – smaller
Word embeddings: properties

• Relationships between words correspond to differences between vectors.

https://colah.github.io/posts/2014-07-NLP-RNNs-Representations/
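These properties can be probed directly with pretrained vectors; a hedged sketch using gensim's downloader (assuming the "glove-wiki-gigaword-100" model name is available and can be fetched):

```python
import gensim.downloader as api

# Downloads pretrained GloVe vectors on first use (requires internet access).
wv = api.load("glove-wiki-gigaword-100")

# Similar words -> nearby vectors (high cosine similarity).
print(wv.similarity("director", "chairman"))

# Relationships as vector differences: big - bigger should mirror small - smaller.
print(wv.most_similar(positive=["small", "bigger"], negative=["big"], topn=3))
```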
Skip gram
• Skip gram – alternative to CBOW
• Start with a single word embedding and try to predict the
context words (surrounding words).
• Skip-Gram works well with small datasets, and can better
represent less frequent words.
Skip gram
• Map from the center word to a probability distribution over the surrounding words (the original figure shows one input/output unit).
• There is no activation function on the hidden-layer neurons, but the output neurons use softmax.

https://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
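A minimal NumPy sketch of that architecture (my own illustration, with made-up random weights and the dimensions from the next slide): the hidden layer is just an embedding lookup with no activation, and the output layer is a softmax over the vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 10_000, 300                               # vocabulary size, embedding size

W = rng.normal(scale=0.01, size=(V, N))          # center-word (input) embeddings
W_prime = rng.normal(scale=0.01, size=(N, V))    # output weights

def skipgram_forward(center_id):
    """Probability distribution over the vocabulary for one context position."""
    v_c = W[center_id]                  # hidden layer: plain lookup, no activation
    z = W_prime.T @ v_c                 # one score per vocabulary word
    z -= z.max()                        # numerical stability for softmax
    return np.exp(z) / np.exp(z).sum()

probs = skipgram_forward(center_id=42)  # arbitrary word id, for illustration only
print(probs.shape, round(float(probs.sum()), 6))
```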
Skip gram example
• Vocabulary of 10,000 words.
• Embedding vectors with 300 features.
• So the hidden layer is represented by a weight matrix with 10,000 rows and 300 columns; multiplying it on the left by a one-hot row vector simply selects the corresponding word’s row.

https://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
Word2vec shortcomings
• Problem: 10,000 words and 300 dim embedding gives a large
parameter space to learn. And 10K words is minimal for real
applications.

• Slow to train, and needs lots of data, particularly to learn uncommon words.
• Both of these layers would have a weight matrix with 300 × 10,000 = 3 million weights each.
Calculating all gradients!
• We went through the gradient for each center vector v in a
window
• We also need gradients for outside vectors u
• Generally, in each window we will compute updates for all parameters that are being used in that window (for example, both the center vectors v and the outside vectors u that appear in it).
Word2vec improvements: selective updates
• Idea: Use “Negative Sampling”, which causes each
training sample to update only a small percentage of
the model’s weights.
• Every unique word in the vocabulary has:
• Embedding vector (D-dimensional)
• Training vector (D-dimensional)
• Positive samples: Words that appear close to each
other (window size).
• Negative samples: Randomly picked words from
vocabulary.
• Training task:
• Input: A word from training data corpus
• Task: Predict the neighboring words
Defining a new learning problem
Word2vec improvements: selective updates

Window size: 2
Source text: “The quick brown fox jumps over the lazy dog.”

Center word “fox”:
  Positive training samples (label: 1): (fox, jumps), (fox, over), (fox, brown), (fox, quick)
  Negative training samples (label: 0): (fox, words), (fox, cat), (fox, pen), (fox, chat), …

Center word “jumps”:
  Positive training samples (label: 1): (jumps, fox), (jumps, brown), (jumps, over), (jumps, the)
  Negative training samples (label: 0): …
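A small sketch (invented for illustration) of how such positive pairs and random negative samples could be generated from the example sentence:

```python
import random

random.seed(0)

sentence = "the quick brown fox jumps over the lazy dog".split()
vocabulary = sorted(set(sentence) | {"words", "cat", "pen", "chat"})
window, num_negatives = 2, 4

pairs = []  # (center word, other word, label)
for i, center in enumerate(sentence):
    # Positive samples: words within the window around the center word (label 1).
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((center, sentence[j], 1))
    # Negative samples: random vocabulary words (label 0). Real word2vec draws them
    # from a unigram^(3/4) distribution and avoids the true context words.
    for _ in range(num_negatives):
        pairs.append((center, random.choice(vocabulary), 0))

print([p for p in pairs if p[0] == "fox"])
```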
Word2vec improvements: selective updates

• Selecting 5-20 negative words works well for smaller datasets, and you can get away with only 2-5 words for large datasets.
• We update the weights for our positive word (“jumps”), plus the weights for 5 other (negative) words that we want to output 0. That is a total of 6 output neurons, and 6 × 300 = 1,800 weight values in total, which is only 0.06% of the 3M weights in the output layer!
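For reference, the commonly used skip-gram negative-sampling objective for one (center, context) pair with K negative samples can be written as follows (notation is mine, not the slides'): v_c is the center word's embedding vector and u are the output ("training") vectors.

L = -log σ(u_oᵀ v_c) - Σ_{k=1..K} log σ(-u_{w_k}ᵀ v_c)

Minimizing L pushes the positive pair's dot product up (label 1) and the K negative pairs' dot products down (label 0), so only K+1 output vectors and one input vector are updated per training pair.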
The GloVe Model (2014)

• GloVe observes that ratios of word-word co-occurrence probabilities have the potential for encoding some form of meaning. Consider the co-occurrence probabilities for the target words ice and steam with various probe words from the vocabulary:
  • As one might expect, ice co-occurs more frequently with solid than it does with gas, whereas steam co-occurs more frequently with gas than it does with solid.
  • Both words co-occur with their shared property water frequently, and both co-occur with the unrelated word fashion infrequently.
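For reference (not stated on the slide), GloVe as published by Pennington et al. (2014) turns this observation into a weighted least-squares objective over the co-occurrence counts X_ij:

J = Σ_{i,j} f(X_ij) · (w_iᵀ w̃_j + b_i + b̃_j - log X_ij)²

where w_i and w̃_j are the word and context vectors, b_i and b̃_j are biases, and f is a weighting function that down-weights very rare and very frequent co-occurrences.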
Word2Vec vs. GloVe

• The advantage of GloVe is that, unlike Word2vec, GloVe does not rely
just on local statistics (local context information of words), but
incorporates global statistics (word co-occurrence) to obtain word
vectors.
FastText vs. Word2Vec

• FastText was introduced by Facebook (2016).
• FastText addresses the unknown (out-of-vocabulary) word problem.
• FastText operates at a more granular level, with character n-grams: words are represented by the sum of their character n-gram vectors.
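A quick sketch (my own illustration, using the conventional < and > word-boundary markers) of the character n-grams FastText would decompose a word into:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word, with boundary markers as in FastText."""
    padded = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams

# The word's vector is the sum of its n-gram vectors (plus the full word's own vector),
# so an unseen word can still be composed from n-grams seen during training.
print(char_ngrams("where", 3, 4))
# ['<wh', 'whe', 'her', 'ere', 're>', '<whe', 'wher', 'here', 'ere>']
```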
Word embedding applications
• The use of word representations… has become a key
“secret sauce” for the success of many NLP systems in
recent years, across tasks including named entity
recognition, part-of-speech tagging, parsing, and
semantic role labeling. (Luong et al. (2013))
• Learning a good representation on a task A and then
using it on a task B is one of the major tricks in the
Deep Learning toolbox.
• Pretraining, transfer learning, multi-task learning, and large language models.
• Can allow the representation to learn from more than one
kind of data.
• StarSpace: Embed All The Things (Ledell Wu et al, 2017)
Word embedding applications
• Can apply to get a joint embedding of words and images or other
multi-modal data sets.
• New classes map near similar existing classes: e.g., if ‘cat’ is unknown,
cat images map near dog.
Conclusion
• Representing words
• Word2vec
• Skip-gram
• CBOW
• Application of word2vec
References

• Slides of the NLP course, Stanford, 2022
• Slides of the NLP course, Princeton, 2023
• Other slides from the Internet
