
Distributed Word Representation

Md Shad Akhtar
[email protected]

A few key terms
● Neurons
● Layers
○ Input, Output and Hidden
● Activation functions
○ Sigmoid, Tanh, ReLU
● Softmax
● Weight matrices
○ Input → Hidden, Hidden → Output
● Backpropagation
○ Optimizers
■ Gradient Descent (GD), Stochastic Gradient Descent (SGD), etc.
○ Error (Loss) functions
■ Mean-Squared Error, Cross-Entropy etc.
○ Gradient of error
○ Passes: Forward pass and Backward pass (see the sketch below)
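A minimal NumPy sketch (my illustration, not from the slides) that ties these terms together: a one-hidden-layer network with a tanh activation, a softmax output, a cross-entropy loss, and a plain gradient-descent update over forward and backward passes. All data and sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 4 samples, 3 input features, 2 classes (one-hot targets).
X = rng.normal(size=(4, 3))
Y = np.eye(2)[[0, 1, 1, 0]]

# Weight matrices: Input -> Hidden and Hidden -> Output.
W1 = rng.normal(scale=0.1, size=(3, 5))
W2 = rng.normal(scale=0.1, size=(5, 2))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

for step in range(100):
    # Forward pass: input layer -> hidden layer (tanh) -> output layer (softmax).
    H = np.tanh(X @ W1)
    P = softmax(H @ W2)

    # Cross-entropy loss.
    loss = -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))

    # Backward pass: gradients of the loss w.r.t. both weight matrices.
    dZ2 = (P - Y) / len(X)       # gradient at the output pre-activation
    dW2 = H.T @ dZ2
    dH = dZ2 @ W2.T
    dZ1 = dH * (1 - H ** 2)      # tanh derivative
    dW1 = X.T @ dZ1

    # Gradient Descent update.
    W1 -= 0.5 * dW1
    W2 -= 0.5 * dW2
```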

Neural Language Model and Distributed Word Representation

Input to the neural models
● Image
○ Pixel values

● Speech
○ Acoustic features, e.g., MFCCs, extracted with standard tools

● Text
○ Word embeddings
■ Cat [1, 0, 0]
■ Dog [0, 1, 0]
■ Lamp [0, 0, 1]

Word Embeddings
• Word Embedding, Word Vector, or Word Representation
• A vector/numeric representation of a word.

• Why do we need it?


• Some techniques do not operate on text
• SVM, Neural Network, etc.

• Objective
• Semantically-rich word representations
• Preserve the meaning of a word (in context), e.g.,
• e_good should convey the sense of a pleasant scenario
• e_good should lie at the opposite end of the spectrum, far away from unpleasant words such as e_bad.

Types of embeddings
– Local Representation
• One-hot encoding, e.g., for Cat, Dog, Man, Table
– Sparse
– No semantics
– Curse of Dimensionality
Types of embeddings
– Distributed Representation
• Co-occurrence vector
– Partially dense
– Low degree of semantics
– Curse of Dimensionality

Example corpus:
• I like deep learning.
• I like NLP.
• I enjoy flying.

Co-occurrence counts (context window of 1):

counts     I  like  enjoy  deep  learning  NLP  flying  .
I          0   2     1      0      0        0     0     0
like       2   0     0      1      0        1     0     0
enjoy      1   0     0      0      0        0     1     0
deep       0   1     0      0      1        0     0     0
learning   0   0     0      1      0        0     0     1
NLP        0   1     0      0      0        0     0     1
flying     0   0     1      0      0        0     0     1
.          0   0     0      0      1        1     1     0

e_like  = [2, 0, 0, 1, 0, 1, 0, 0]
e_enjoy = [1, 0, 0, 0, 0, 0, 1, 0]
Cosine(e_like, e_enjoy) != 0
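A small Python sketch (illustrative, not part of the slides) that builds this co-occurrence matrix for the three sentences and shows that e_like and e_enjoy have a non-zero cosine similarity:

```python
import numpy as np

corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
vocab = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]
idx = {w: i for i, w in enumerate(vocab)}

# Count co-occurrences of immediate neighbours (context window of 1).
counts = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in corpus:
    tokens = sent.split()
    for i, w in enumerate(tokens):
        for j in (i - 1, i + 1):
            if 0 <= j < len(tokens):
                counts[idx[w], idx[tokens[j]]] += 1

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

e_like, e_enjoy = counts[idx["like"]], counts[idx["enjoy"]]
print(e_like)                    # [2 0 0 1 0 1 0 0]
print(cosine(e_like, e_enjoy))   # non-zero: "like" and "enjoy" share the context "I"
```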
Types of embeddings
– Distributed Representation
• Word Embeddings
– Dense
– Semantically rich
– Dimension is not a function of the vocabulary size.

e_cat   = [1.5, 0.5, 3.2, 5.7]
e_table = [4.3, 1.7, 2.5, 1.9]
e_man   = [0.3, 5.6, 1.0, 3.9]

These representations are good (efficient) iff y < z < x.

*Matrix has hypothetical values
https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf

Probabilistic Language Modeling - [Bengio et al., 2003]


● Input layer
○ n previous words
● Projection layer (no non-linearity)
○ Projection matrix C of size |V| x m
● Hidden layer (tanh)
● Output layer
● Skip-connection (optional)

x = [C(w_{t-1}), C(w_{t-2}), ..., C(w_{t-n+1})]

y = softmax(b + Wx + U tanh(Hx + d))

Example: input context "Clouds in the" → predicted word "sky"
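A rough NumPy sketch of this forward pass, y = softmax(b + Wx + U tanh(Hx + d)); the sizes and random parameters are illustrative assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

V, m, n_ctx, h = 10, 4, 3, 8      # vocab size, embedding dim, #context words, hidden size
C = rng.normal(size=(V, m))       # projection matrix C of size |V| x m

# Parameters of y = softmax(b + Wx + U tanh(Hx + d))
H = rng.normal(size=(h, n_ctx * m))
d = np.zeros(h)
U = rng.normal(size=(V, h))
W = rng.normal(size=(V, n_ctx * m))   # optional skip-connection from x to the output
b = np.zeros(V)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def next_word_probs(context_ids):
    # x = [C(w_{t-1}), C(w_{t-2}), ...]: concatenation of the context word embeddings
    x = np.concatenate([C[i] for i in context_ids])
    return softmax(b + W @ x + U @ np.tanh(H @ x + d))

probs = next_word_probs([3, 7, 1])    # e.g., the ids of "Clouds", "in", "the"
print(probs.argmax())                 # index of the predicted next word (e.g., "sky")
```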
https://arxiv.org/pdf/1301.3781.pdf

Efficient Word Representation - [Mikolov et al., 2013a]


● Word2Vec (Mikolov et al.) inherited the idea from the neural language model and used it to capture word semantics.
○ Used previous as well as future context
○ Removed the hidden layer
○ Used the shared projection layer

• Offers remarkable performance

• Foundation stone of DL-based NLP

https://arxiv.org/pdf/1301.3781.pdf

Efficient Word Representation - [Mikolov et al., 2013a]

[Figure: the learned vector space captures analogies — the same "royal vs. ordinary" and gender directions relate king↔queen and man↔woman (e.g., king − man + woman ≈ queen).]
“Linguistics is the eye, computation is the body”: Word Embeddings

● “A word is known by the company it keeps” - [Firth, 1950s]

● Words with similar distributional properties have similar meanings. - [Harris, 1970s]

● Model differences/similarities in meaning rather than the proper meaning itself.


○ What is the meaning of the word “cat”?

• A word can be defined by the properties it possesses


– Cat: {mew, animal, mice, furry, purr, carnivore, pet, milk, curious}
– Dog: {bark, animal, bone, police, faithful, pet, carnivore, milk}
– Lamp: {candle, light, flash, stand, shade, bulb}

Distributional Properties

Cat  {mew, animal, mice, furry, purr, carnivore, pet, milk, curious}
  ≈
Dog  {bark, animal, bone, police, faithful, pet, carnivore, milk}
  ≠
Lamp {candle, light, flash, stand, shade, bulb}
https://arxiv.org/pdf/1301.3781.pdf

Efficient Word Representation - [Mikolov et al., 2013a]

[Figure: Continuous Bag-of-Words (CBOW) — the context words pass through a shared linear projection, a softmax over the vocabulary predicts the centre word, and the loss is computed on that prediction.]
https://arxiv.org/pdf/1301.3781.pdf

Efficient Word Representation - [Mikolov et al., 2013a]

[Figure: Continuous Skip-gram — the centre word passes through a linear projection, a softmax over the vocabulary predicts each surrounding context word, and the loss is computed on these predictions.]
Word2Vec: Preprocessing
Sentence: The domestic cat likes milk.

Context window size |C| = 2

For the target word "cat", the context is the previous |C| tokens ("The domestic") and the next |C| tokens ("likes milk").

One-Hot encoding of the vocabulary (size |V|) extracted from the corpus:
Word-1  [1 0 0 0 … 0]
Word-2  [0 1 0 0 … 0]
...
Word-N  [0 0 0 0 … 1]
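A short illustrative sketch (not the slides' code) of this preprocessing step: sliding a window of size |C| = 2 over the sentence to produce (target, context) pairs and one-hot vectors:

```python
tokens = "The domestic cat likes milk .".split()
vocab = sorted(set(tokens))
one_hot = {w: [1 if i == j else 0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}

C = 2  # context window size
pairs = []
for t, target in enumerate(tokens):
    for j in range(t - C, t + C + 1):
        if j != t and 0 <= j < len(tokens):
            pairs.append((target, tokens[j]))

# e.g., for target "cat": ("cat", "The"), ("cat", "domestic"), ("cat", "likes"), ("cat", "milk")
print([p for p in pairs if p[0] == "cat"])
print(one_hot["cat"])   # the one-hot input vector for "cat"
```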
https://arxiv.org/pdf/1301.3781.pdf

Efficient Word Representation - [Mikolov et al., 2013a]

One-hot vectors for the sentence (|V| = 6):
The       [1 0 0 0 0 0]
Domestic  [0 1 0 0 0 0]
Likes     [0 0 1 0 0 0]
Cat       [0 0 0 0 1 0]
Milk      [0 0 0 0 0 1]

[Figure: the one-hot vectors are wired through the linear projection and softmax layers of the word2vec architecture, and the loss is computed on the softmax output.]
https://arxiv.org/pdf/1301.3781.pdf

Efficient Word Representation - [Mikolov et al., 2013a]

W ∈ R^{|V| × |h|}

Input (target) word: Cat, with one-hot vector [0 0 0 0 1 0]

Projection: h = W^T · [0, 0, 0, 0, 1, 0]
(i.e., h is simply the row of W that corresponds to "Cat")

[Figure: h then passes through the linear and softmax layers to predict the surrounding words The, Domestic, Likes, and Milk, on which the loss is computed.]
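A quick numerical illustration (toy sizes, assumed values) of why this projection is cheap: multiplying W^T by a one-hot vector simply selects the corresponding row of W, i.e., an embedding lookup:

```python
import numpy as np

V, h = 6, 4                                        # vocabulary size and embedding dimension
W = np.arange(V * h, dtype=float).reshape(V, h)    # W ∈ R^{|V| × |h|}

x_cat = np.array([0, 0, 0, 0, 1, 0], dtype=float)  # one-hot for "Cat" (index 4)
hidden = W.T @ x_cat

assert np.allclose(hidden, W[4])   # identical to directly reading row 4 of W
```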
Objective function

Sentence: The domestic cat likes milk.
Context window size |C| = 2

J = − (1/T) Σ_{t=1..T} Σ_{−c ≤ j ≤ c, j ≠ 0} log p(w_{t+j} | w_t)

If w_t = cat, then
w_{t+j} = w_{t−2} = 'The',      for j = −2
        = w_{t−1} = 'domestic', for j = −1
        = w_{t+1} = 'likes',    for j = +1
        = w_{t+2} = 'milk',     for j = +2

Inner sum for w_t = cat:
log p(the | cat) + log p(domestic | cat) + log p(likes | cat) + log p(milk | cat)

p(milk | cat) = exp(v_milk · v_cat) / Σ_{i∈V} exp(v_i · v_cat)
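A toy numeric sketch (random, untrained vectors) of the softmax probability p(c | w) and of the loss contribution for the target word "cat":

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "domestic", "cat", "likes", "milk", "."]
dim = 4
V = {w: rng.normal(size=dim) for w in vocab}   # one vector per word

def p(context, word):
    # p(c | w) = exp(v_c · v_w) / Σ_i exp(v_i · v_w)
    scores = np.array([V[i] @ V[word] for i in vocab])
    scores -= scores.max()                     # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[vocab.index(context)]

context_of_cat = ["the", "domestic", "likes", "milk"]
loss_cat = -sum(np.log(p(c, "cat")) for c in context_of_cat)
print(loss_cat)   # − Σ_j log p(w_{t+j} | cat)
```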
Objective function

J = − (1/T) Σ_{t=1..T} Σ_{−c ≤ j ≤ c, j ≠ 0} log p(w_{t+j} | w_t)

p(w_{t+j} | w_t) = p(c | w) = exp(v_c · v_w) / Σ_{i∈V} exp(v_i · v_w)

Gradient with respect to v_w:

∂ log p(c | w) / ∂v_w
  = ∂/∂v_w [ log exp(v_c · v_w) − log Σ_{i∈V} exp(v_i · v_w) ]
  = ∂(v_c · v_w)/∂v_w − ∂/∂v_w log Σ_{i∈V} exp(v_i · v_w)
  = v_c − (1 / Σ_{i∈V} exp(v_i · v_w)) · Σ_{l∈V} exp(v_l · v_w) · ∂(v_l · v_w)/∂v_w
  = v_c − Σ_{l∈V} ( exp(v_l · v_w) / Σ_{i∈V} exp(v_i · v_w) ) · v_l
  = v_c − Σ_{l∈V} p(l | w) · v_l

The denominator (a sum over the entire vocabulary) is a costly operation.
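A quick finite-difference check (toy random vectors; v_w treated as the only variable, the other vectors held fixed) that the derived gradient v_c − Σ_l p(l | w) · v_l is correct:

```python
import numpy as np

rng = np.random.default_rng(1)
Vmat = rng.normal(size=(6, 4))   # 6 word vectors of dimension 4
c, w = 1, 2                      # indices of a context word and a target word

def log_p(vw):
    # log p(c | w) with the other vectors held fixed
    scores = Vmat @ vw
    return scores[c] - np.log(np.exp(scores).sum())

# Analytical gradient from the derivation above: v_c − Σ_l p(l | w) · v_l
probs = np.exp(Vmat @ Vmat[w])
probs /= probs.sum()
analytic = Vmat[c] - probs @ Vmat

# Central finite-difference gradient.
eps, numeric = 1e-6, np.zeros(4)
for k in range(4):
    e = np.zeros(4); e[k] = eps
    numeric[k] = (log_p(Vmat[w] + e) - log_p(Vmat[w] - e)) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-5)
```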
https://arxiv.org/pdf/1310.4546.pdf

Negative Sampling - [Mikolov et al., 2013b]

• The first version of word2vec had a computational issue

– One of the operations (the softmax denominator) in the loss computation was expensive.
– A word has a closed set of related/contextual words, usually small in number.
– A cat is related to mew, animal, mice, furry, purr, carnivore, pet, milk, curious, etc.
– On the other hand, a word has a huge number of unrelated/non-contextual words.
– A cat is unrelated to every word except the ones it is related to.

• Negative sampling
– To reduce the complexity, the unrelatedness of a word is computed against only a small, fixed set of sampled words.

https://arxiv.org/pdf/1310.4546.pdf

Negative Sampling - [Mikolov et al., 2013b]

Negative-sampling objective (to be maximised over the observed pairs D and the sampled noise pairs P_n):

  Σ_{(c,w)∈D} log P(D=1 | c, w) + Σ_{(i,w)∈P_n} log P(D=0 | i, w)

= Σ_{(c,w)∈D} log P(D=1 | c, w) + Σ_{(i,w)∈P_n} log (1 − P(D=1 | i, w))

= Σ_{(c,w)∈D} log σ(v_c · v_w) + Σ_{(i,w)∈P_n} log (1 − σ(v_i · v_w))

= Σ_{(c,w)∈D} log σ(v_c · v_w) + Σ_{(i,w)∈P_n} log σ(−v_i · v_w)
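An illustrative sketch (assumed toy vectors, k = 5 negative samples) of this objective for a single (word, context) pair, showing that only k + 1 dot products are needed instead of a sum over |V|:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

dim, k = 4, 5                          # embedding size, number of negative samples
v_w = rng.normal(size=dim)             # target word, e.g., "cat"
v_c = rng.normal(size=dim)             # one true context word, e.g., "milk"
v_neg = rng.normal(size=(k, dim))      # k sampled unrelated words

# Positive term pulls v_c and v_w together; negative terms push the samples away.
objective = np.log(sigmoid(v_c @ v_w)) + np.sum(np.log(sigmoid(-v_neg @ v_w)))
print(objective)   # to be maximised; only k + 1 dot products instead of |V|
```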
https://arxiv.org/pdf/1310.4546.pdf

Subsampling - [Mikolov et al., 2013b]


• The word2vec model learns semantics through co-occurrence.
• Cooccurrence("England", "London") is more informative than cooccurrence("England", "The").
• Frequent words ('a', 'an', 'the', etc.) usually provide less information value than rare words.

• Each word w_i in the training set is discarded with probability

P(w_i) = 1 − sqrt( t / f(w_i) )

where f(w_i) is the frequency of word w_i and t is a chosen threshold (around 10^-5).
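A toy sketch of the subsampling rule (the counts and the threshold t are invented for illustration; on large corpora t is typically around 10^-5):

```python
import math, random

counts = {"the": 50_000, "england": 120, "london": 90}
total = sum(counts.values())
t = 1e-3   # threshold for this tiny toy corpus

def discard_prob(word):
    f = counts[word] / total              # relative frequency f(w)
    return max(0.0, 1.0 - math.sqrt(t / f))

for w in counts:
    print(w, round(discard_prob(w), 3))   # "the" is discarded far more often

def keep(word):
    # Randomly drop frequent words according to P(w) = 1 - sqrt(t / f(w)).
    return random.random() >= discard_prob(word)
```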
Word Embedding Models
• Non-contextual
◦ Word2vec [Mikolov et al., 2013] - https://code.google.com/archive/p/word2vec/
▪ Two variants
▪ Skip-gram
▪ Continuous Bag-of-Words (CBOW)

◦ GloVe [Pennington et al., 2014] - https://nlp.stanford.edu/projects/glove/


▪ Co-occurrence matrix

◦ FastText [Bojanowski et al., 2016] - https://fasttext.cc/docs/en/unsupervised-tutorial.html


▪ Similar to word2vec
▪ Works at the sub-word level

• Contextual: The representation for each word depends on the context in which it is used.
◦ Embeddings from Language Models (ELMo) [Peters et al., 2018] - https://allennlp.org/elmo
◦ OpenAI’s Generative Pre-trained Transformer (GPT) [Radford et al., 2018] - https://openai.com/blog/better-language-models/
◦ Bidirectional Encoder Representations from Transformers (BERT) [Devlin et al., 2018] - https://huggingface.co/transformers/
◦ Many more based on BERT…

Gensim “word2vec” package.
● Explore https://radimrehurek.com/gensim/models/word2vec.html

● A toy example:
https://colab.research.google.com/drive/1CixWE8AR-tckgfV5fAzO9KqOqUwQBSIC#scrollTo=vR0upm_rUEF5
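A minimal usage sketch with the Gensim package (parameter names as in gensim 4.x; the toy corpus is invented):

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "domestic", "cat", "likes", "milk"],
    ["the", "dog", "likes", "bones"],
    ["i", "like", "deep", "learning"],
]

model = Word2Vec(
    sentences=sentences,
    vector_size=50,   # embedding dimension
    window=2,         # context window size |C|
    min_count=1,      # keep every word in this tiny corpus
    sg=1,             # 1 = skip-gram, 0 = CBOW
    negative=5,       # number of negative samples
)

print(model.wv["cat"])                 # the learned vector for "cat"
print(model.wv.most_similar("cat"))    # nearest neighbours by cosine similarity
```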

Thanks

