4. Word Representation
Md Shad Akhtar
[email protected]
A few key terms
● Neurons
● Layers
○ Input, Output and Hidden
● Activation functions
○ Sigmoid, Tanh, ReLU (see the code sketch after this list)
● Softmax
● Weight matrices
○ Input → Hidden, Hidden → Output
● Backpropagation
○ Optimizers
■ Gradient Descent (GD), Stochastic Gradient Descent (SGD), etc.
○ Error (Loss) functions
■ Mean-Squared Error, Cross-Entropy, etc.
○ Gradient of error
○ Passes: Forward pass and Backward pass
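A minimal NumPy sketch (my own toy code, not from the slides) of the activation functions and softmax listed above:

```python
# Toy NumPy implementations of the listed activations and softmax.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    e = np.exp(z - z.max())      # subtract the max for numerical stability
    return e / e.sum()           # outputs sum to 1, like a probability distribution

z = np.array([1.0, -2.0, 0.5])
print(sigmoid(z), np.tanh(z), relu(z), softmax(z))
```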
Neural Language Model and
Distributed Word Representation
Input to the neural models
● Image
○ Pixel values
● Speech
○ Acoustic features, e.g., MFCCs, extracted with standard feature-extraction tools
● Text
○ Word embeddings, e.g., one-hot vectors (sketched below)
■ Cat → [1, 0, 0]
■ Dog → [0, 1, 0]
■ Lamp → [0, 0, 1]
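A minimal sketch (my own toy code; the three-word vocabulary is just the example above) of how such one-hot inputs can be built:

```python
# Toy one-hot encoder for the three-word vocabulary above.
import numpy as np

vocab = ["cat", "dog", "lamp"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word: str) -> np.ndarray:
    """Return a |V|-dimensional one-hot vector for `word`."""
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0
    return vec

print(one_hot("dog"))   # [0. 1. 0.]
```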
Word Embeddings
• Word Embedding, Word Vector, or Word Representation
• A vector/numeric representation of a word.
• Objective
  • Semantically-rich word representations
  • Preserve the meaning of a word (in context), e.g.,
    • e_good should convey the sense of a pleasant scenario.
    • e_good should be on the opposite end of the spectrum, i.e., far away from embeddings of unpleasant words such as e_bad.
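A minimal sketch (with made-up 3-d vectors) of how "far away" can be measured, here using cosine similarity; e_great is a hypothetical third embedding added for contrast:

```python
# Cosine similarity between made-up embeddings; higher means more similar in this sense.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

e_good = np.array([0.9, 0.8, 0.1])     # hypothetical embedding
e_great = np.array([0.85, 0.75, 0.2])  # hypothetical embedding
e_bad = np.array([-0.8, -0.7, 0.1])    # hypothetical embedding

print(cosine(e_good, e_great))   # high -> similar sense
print(cosine(e_good, e_bad))     # low/negative -> opposite end of the spectrum
```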
Types of embeddings
– Local Representation
  • One-hot
    – Sparse
    – No semantics
    – Curse of Dimensionality
[Figure: one-hot vectors for the words Cat, Dog, Man, and Table]
Types of embeddings
• I like deep learning.
• I like NLP.
• I enjoy flying.

Co-occurrence count matrix* (rows shown for enjoy, deep, learning):
  enjoy     1 0 0 0 0 0 1 0
  deep      0 1 0 0 1 0 0 0
  learning  0 0 0 1 0 0 0 1
*Matrix has hypothetical values
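A minimal sketch (my own construction; window size 1 is an arbitrary choice) of building such a count matrix from the three example sentences:

```python
# Window-1 co-occurrence counts for the three example sentences.
from collections import defaultdict

sentences = [
    "i like deep learning".split(),
    "i like nlp".split(),
    "i enjoy flying".split(),
]

vocab = sorted({w for s in sentences for w in s})
idx = {w: i for i, w in enumerate(vocab)}
counts = defaultdict(int)

window = 1
for s in sentences:
    for i, w in enumerate(s):
        for j in range(max(0, i - window), min(len(s), i + window + 1)):
            if j != i:
                counts[(idx[w], idx[s[j]])] += 1

# Each row of `counts` (viewed as a matrix) is a count-based vector for a word.
print(vocab)
print([counts[(idx["like"], idx[w])] for w in vocab])
```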
https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf
[Figure: neural language model (Bengio et al., 2003) predicting the next word of the context "Clouds in the …"]
https://arxiv.org/pdf/1301.3781.pdf
[Figure: word vectors arranged along royal/ordinary and male axes, e.g., the vectors for man and woman]
“Linguistics is the eye, computation is the body”: Word Embeddings
● Words with similar distributional properties have similar meanings. - [Harris, 1970s]
Distributional Properties
• Context words observed for three words:
  • cat:  {mew, animal, mice, furry, purr, carnivore, pet, milk, curious}
  • dog:  {bark, animal, bone, police, faithful, pet, carnivore, milk}
  • lamp: {candle, light, flash, stand, shade, bulb}
• The first two sets overlap heavily (≈), while the third shares almost nothing with them (≠): cat and dog are distributionally similar; lamp is not.
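A minimal sketch (Jaccard overlap is my choice of measure, not the slides') showing that the cat and dog context sets overlap while the lamp set does not:

```python
# Overlap of context-word sets as a crude distributional similarity.
cat_ctx  = {"mew", "animal", "mice", "furry", "purr", "carnivore", "pet", "milk", "curious"}
dog_ctx  = {"bark", "animal", "bone", "police", "faithful", "pet", "carnivore", "milk"}
lamp_ctx = {"candle", "light", "flash", "stand", "shade", "bulb"}

def jaccard(a, b):
    return len(a & b) / len(a | b)

print(jaccard(cat_ctx, dog_ctx))    # relatively high -> cat and dog appear in similar contexts
print(jaccard(cat_ctx, lamp_ctx))   # 0.0 -> no shared contexts
```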
https://arxiv.org/pdf/1301.3781.pdf
Continuous Bag-of-Words (CBOW)
[Figure: CBOW architecture: the context words are projected and combined, then passed through a linear layer and softmax to predict the centre word; a loss is computed on this prediction]
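A minimal sketch of the CBOW forward pass under assumed toy sizes (V = 6, H = 3) and randomly initialised matrices; averaging the context projections is one common choice:

```python
# Minimal CBOW sketch: average the projected context words, then a linear layer + softmax.
import numpy as np

V, H = 6, 3                       # toy vocabulary size and hidden size
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, H))    # input (projection) matrix
W_out = rng.normal(size=(H, V))   # output matrix

def softmax(z):
    e = np.exp(z - z.max())       # subtract the max for numerical stability
    return e / e.sum()

context_ids = [0, 1, 2, 5]            # e.g., indices of the context words
h = W_in[context_ids].mean(axis=0)    # combine the context projections
p = softmax(h @ W_out)                # predicted distribution over the vocabulary
loss = -np.log(p[4])                  # cross-entropy if the centre word has index 4
print(p, loss)
```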
https://arxiv.org/pdf/1301.3781.pdf
Skip-gram
[Figure: Skip-gram architecture: the centre word is projected and passed through a softmax to predict each surrounding context word; a loss is computed on these predictions]
Word2Vec: Preprocessing
Sentence: The domestic cat likes milk.
One-hot encoding over the vocabulary (|V| words):
  Word-1 → [1 0 0 0 … 0]
  Word-2 → [0 1 0 0 … 0]
  ...
https://arxiv.org/pdf/1301.3781.pdf
The      → [100000]
Domestic → [010000]
Likes    → [001000]
Cat      → [000010]
Milk     → [000001]
[Figure: Skip-gram on "The domestic cat likes milk": the one-hot vector for 'Cat' goes through a linear layer and softmax; the loss compares the output with the context words The, Domestic, Likes, and Milk]
https://arxiv.org/pdf/1301.3781.pdf
$W \in \mathbb{R}^{|V| \times |h|}$  (projection matrix)
$h = W^\top \cdot [0, 0, 0, 0, 1, 0]^\top$  (projection of the one-hot vector for 'Cat')
[Figure: Skip-gram: h is passed through a linear layer and softmax; the loss is computed against the context words The [100000], Domestic [010000], Likes [001000], and Milk [000001]]
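A minimal sketch (toy sizes and random weights, my own code) of this projection-and-prediction step for the centre word 'Cat':

```python
# Minimal skip-gram sketch for "The domestic cat likes milk" with 'Cat' as the centre word.
import numpy as np

V, H = 6, 3                        # toy vocabulary and hidden sizes
rng = np.random.default_rng(0)
W = rng.normal(size=(V, H))        # W in R^{|V| x |h|}
W_out = rng.normal(size=(H, V))    # output-side matrix

x_cat = np.zeros(V)
x_cat[4] = 1.0                     # one-hot vector for 'Cat'
h = W.T @ x_cat                    # h = W^T x, i.e. row 4 of W

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

p = softmax(h @ W_out)                 # distribution over the vocabulary
context_ids = [0, 1, 2, 5]             # The, Domestic, Likes, Milk
loss = -np.log(p[context_ids]).sum()   # sum of -log p(context | cat)
print(loss)
```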
Context window size |C| = 2. Skip-gram loss:

$$\frac{1}{T}\sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} -\log p(w_{t+j} \mid w_t)$$

If $w_t$ = cat, then
  $w_{t+j} = w_{t-2}$ = 'The', for $j = -2$
  $w_{t+j} = w_{t-1}$ = 'domestic', for $j = -1$
  $w_{t+j} = w_{t+1}$ = 'likes', for $j = +1$
  $w_{t+j} = w_{t+2}$ = 'milk', for $j = +2$

so this position contributes
$$\log p(\text{the} \mid \text{cat}) + \log p(\text{domestic} \mid \text{cat}) + \log p(\text{likes} \mid \text{cat}) + \log p(\text{milk} \mid \text{cat})$$

where, e.g.,
$$p(\text{milk} \mid \text{cat}) = \frac{\exp(v_{\text{milk}}^\top v_{\text{cat}})}{\sum_{i \in V} \exp(v_i^\top v_{\text{cat}})}$$
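A toy computation (made-up 2-d vectors) of $p(\text{milk} \mid \text{cat})$ with the softmax above:

```python
# Toy softmax computation of p(milk | cat) from made-up 2-d word vectors.
import numpy as np

vecs = {
    "the":      np.array([0.1, 0.0]),
    "domestic": np.array([0.4, 0.2]),
    "cat":      np.array([0.9, 0.5]),
    "likes":    np.array([0.3, 0.1]),
    "milk":     np.array([0.8, 0.4]),
}

v_cat = vecs["cat"]
scores = {w: np.exp(v @ v_cat) for w, v in vecs.items()}   # exp(v_i . v_cat)
Z = sum(scores.values())                                   # normalising denominator
print(scores["milk"] / Z)                                  # p(milk | cat)
```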
Objective function:
$$\frac{1}{T}\sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} -\log p(w_{t+j} \mid w_t), \qquad
p(w_{t+j} \mid w_t) = p(c \mid w) = \frac{\exp(v_c^\top v_w)}{\sum_{i \in V} \exp(v_i^\top v_w)}$$

Gradient with respect to $v_w$:
$$\frac{\partial \log p(c \mid w)}{\partial v_w}
= \frac{\partial}{\partial v_w} \log \frac{\exp(v_c^\top v_w)}{\sum_{i \in V} \exp(v_i^\top v_w)}$$
$$= \frac{\partial}{\partial v_w} \log \exp(v_c^\top v_w) - \frac{\partial}{\partial v_w} \log \sum_{i \in V} \exp(v_i^\top v_w)$$
$$= \frac{\partial\, v_c^\top v_w}{\partial v_w} - \frac{\partial}{\partial v_w} \log \sum_{i \in V} \exp(v_i^\top v_w)
= v_c - \frac{\partial}{\partial v_w} \log \sum_{i \in V} \exp(v_i^\top v_w)$$
$$= v_c - \frac{1}{\sum_{i \in V} \exp(v_i^\top v_w)} \sum_{l \in V} \exp(v_l^\top v_w)\, \frac{\partial\, v_l^\top v_w}{\partial v_w}$$
$$= v_c - \frac{1}{\sum_{i \in V} \exp(v_i^\top v_w)} \sum_{l \in V} \exp(v_l^\top v_w)\, v_l$$
$$= v_c - \sum_{l \in V} \frac{\exp(v_l^\top v_w)}{\sum_{i \in V} \exp(v_i^\top v_w)}\, v_l$$
$$= v_c - \sum_{l \in V} p(l \mid w)\, v_l$$

The denominator (a sum over the entire vocabulary) is a costly operation.
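A minimal NumPy sketch (toy vectors and hypothetical indices) of the final expression $v_c - \sum_l p(l \mid w)\, v_l$:

```python
# Gradient v_c - sum_l p(l|w) v_l for toy word vectors.
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 4
E = rng.normal(size=(V, d))        # one toy word vector v_i per vocabulary entry

w, c = 4, 3                        # hypothetical centre-word and context-word indices
scores = E @ E[w]                  # v_i . v_w for every i in V
p = np.exp(scores - scores.max())
p /= p.sum()                       # softmax: p(l | w) for every l

grad = E[c] - p @ E                # v_c - sum_l p(l|w) v_l
print(grad)
```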
https://arxiv.org/pdf/1310.4546.pdf
• Negative sampling
  – To reduce this complexity, the unrelatedness of a word is computed against only a small, fixed number of sampled (negative) words rather than the whole vocabulary.
https://arxiv.org/pdf/1310.4546.pdf
$$\sum_{(c,w) \in D} \log P(D = 1 \mid c, w) \; + \sum_{(i,w) \in P_n} \log P(D = 0 \mid i, w)$$
$$= \sum_{(c,w) \in D} \log \sigma(v_c^\top v_w) \; + \sum_{(i,w) \in P_n} \log \sigma(-v_i^\top v_w)$$
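A minimal sketch of this objective for a single (centre, context) pair; the sizes, indices, and the uniform negative sampler are my assumptions (the paper draws negatives from a smoothed unigram distribution), and the code returns the negated objective as a loss:

```python
# Negative-sampling loss for one (centre, context) pair with k sampled negatives.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
V, d, k = 1000, 50, 5                       # toy vocabulary size, dimension, negatives
E_in = rng.normal(size=(V, d)) * 0.1        # centre-word vectors
E_out = rng.normal(size=(V, d)) * 0.1       # context-word vectors

w, c = 10, 42                               # hypothetical centre and true-context indices
neg = rng.integers(0, V, size=k)            # negatives (uniform here, for simplicity)

# Negative of the objective above, so it can be minimised as a loss.
loss = -np.log(sigmoid(E_out[c] @ E_in[w]))
loss += -np.log(sigmoid(-(E_out[neg] @ E_in[w]))).sum()
print(loss)
```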
https://arxiv.org/pdf/1310.4546.pdf
Subsampling of frequent words
• Each word $w_i$ in the training set is discarded with probability
$$P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$$
where $f(w_i)$ is the frequency of $w_i$ and $t$ is a chosen threshold (around $10^{-5}$ in the paper).
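A minimal sketch of the discard probability, assuming the commonly cited threshold t = 1e-5:

```python
# Discard probability for subsampling; t = 1e-5 is an assumed threshold.
import math

def discard_prob(freq: float, t: float = 1e-5) -> float:
    """freq: the word's relative frequency f(w_i) in the corpus."""
    return max(0.0, 1.0 - math.sqrt(t / freq))

print(discard_prob(0.05))   # very frequent word (e.g., 'the') is dropped most of the time
print(discard_prob(1e-6))   # rare word is never dropped (clamped at 0)
```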
Word Embedding Models
• Non-contextual
◦ Word2vec [Mikolov et al., 2013] - https://code.google.com/archive/p/word2vec/
▪ Two variants
▪ Skip-gram
▪ Continuous Bag-of-Words (CBOW)
• Contextual: The representation for each word depends on the context in which it is used.
◦ Embeddings from Language Models (ELMo) [Peters et al., 2018] - https://allennlp.org/elmo
◦ OpenAI's Generative Pre-trained Transformer (GPT) [Radford et al., 2018] - https://openai.com/blog/better-language-models/
◦ Bidirectional Encoder Representations from Transformers (BERT) [Devlin et al., 2018] - https://huggingface.co/transformers/
◦ Many more based on BERT…
Gensim “word2vec” package.
● Explore https://radimrehurek.com/gensim/models/word2vec.html
● A toy example:
https://colab.research.google.com/drive/1CixWE8AR-tckgfV5fAzO9KqOqUwQBSIC#scrollTo=vR0upm_rUEF5
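A minimal usage sketch with the Gensim 4.x API (the corpus and parameter values are illustrative only):

```python
# Minimal Gensim word2vec sketch (Gensim 4.x API); corpus and parameters are illustrative.
from gensim.models import Word2Vec

corpus = [
    ["the", "domestic", "cat", "likes", "milk"],
    ["the", "dog", "likes", "bones"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=50,   # embedding dimension
    window=2,         # context window size
    min_count=1,      # keep every word in this toy corpus
    sg=1,             # 1 = skip-gram, 0 = CBOW
    negative=5,       # number of negative samples
)

print(model.wv["cat"][:5])            # first few dimensions of the 'cat' vector
print(model.wv.most_similar("cat"))   # nearest neighbours by cosine similarity
```

Setting sg=1 selects skip-gram; sg=0 gives CBOW.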
Thanks