
COMP5046

Natural Language Processing


Lecture 2: Word Embeddings and Representation

Dr. Caren Han


Semester 1, 2021
School of Computer Science,
University of Sydney
0 LECTURE PLAN

Lecture 2: Word Embeddings and Representation

1. Lab Info
2. Previous Lecture Review
1. Word Meaning and WordNet
2. Count based Word Representation

3. Prediction based Word Representation


1. Introduction to the concept ‘Prediction’
2. Word2Vec
3. FastText
4. GloVe

4. Next Week Preview


1 Info: Lab Exercise

What do we do during Labs?


In labs, students will use Google Colab.
Colaboratory is a free Jupyter notebook environment that requires no setup and runs entirely
in the cloud. With Colaboratory you can write and execute code, save and share your
analyses, and access powerful computing resources, all for free from your browser.
1 Info: Lab Exercise

Submissions
How to Submit
Students should submit their .ipynb file (download it from “File” >
“Download .ipynb”) to Canvas.

When and Where to Submit


Students must submit Lab 1 (for Week 2) by Monday of Week 3, 11:59 PM.
0 LECTURE PLAN

Lecture 2: Word Embeddings and Representation

1. Lab Info
2. Count-based Word Representation
1. Word Meaning
2. Limitations

3. Prediction based Word Representation


1. Introduction to the concept ‘Prediction’
2. Word2Vec
3. FastText
4. GloVe

4. Next Week Preview


2 WORD REPRESENTATION

How to represent the meaning of a word?

Definition: meaning (Collins dictionary).


• the idea that it represents, and which can be explained using other words.
• the thoughts or ideas that are intended to be expressed by it.

Commonest linguistic way of thinking of meaning:


signifier (symbol) ⟺ signified (idea or thing) = denotation

“Computer” “Apple”
2 COUNT based WORD REPRESENTATION

Problem with one-hot vectors


Problem #1. No word similarity representation
Example: in web search, if user searches for “Sydney motel”, we would like to
match documents containing “Sydney Inn”

hotel motel Inn


motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 … 0]

hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 … 0]

Inn = [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 … 1]

There is no natural notion of similarity for one-hot vectors!

Problem #2. Inefficiency


Vector dimension = number of words in vocabulary
Each representation has only a single ‘1’ with all remaining 0s.
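
(Added illustration, not from the slides.) A minimal Python sketch of Problem #1: the dot product, and hence the cosine similarity, of any two distinct one-hot vectors is zero, so one-hot representations carry no notion of word similarity. The vocabulary size and word positions below are made-up assumptions.

import numpy as np

def one_hot(index, vocab_size):
    # One-hot vector: a single 1 at `index`, 0 everywhere else.
    v = np.zeros(vocab_size)
    v[index] = 1.0
    return v

vocab_size = 16                  # assumed toy vocabulary size
motel = one_hot(10, vocab_size)  # word positions are arbitrary for the example
hotel = one_hot(7, vocab_size)
inn   = one_hot(15, vocab_size)

# The dot product of two different one-hot vectors is always 0.
print(np.dot(motel, hotel))      # 0.0
print(np.dot(motel, inn))        # 0.0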
2 COUNT based WORD REPRESENTATION

Problem with BoW (Bag of Words)


• The intuition is that documents are similar if they have similar content, and
that from the content alone we can learn something about the meaning of the
document.
• Discarding word order ignores context, and in turn the meaning of words in
the document (semantics). If context and meaning were modelled, the model
could tell the difference between the same words arranged differently
(“this is interesting” vs “is this interesting”).

S1= I love you but you hate me

S2= I hate you but you love me


2 COUNT based WORD REPRESENTATION

Limitations of Term Frequency–Inverse Document Frequency (TF-IDF)

w_{i,j} = tf_{i,j} × log( N / (1 + df_i) )

wi,j = weight of term i in document j

tfi,j = number of occurrences of term i in document j

N = total number of documents

dfi = number of documents containing term i

• It computes document similarity directly in the word-count space, which may


be slow for large vocabularies.
• It assumes that the counts of different words provide independent evidence of
similarity.
• It makes no use of semantic similarities between words.
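
(Added illustration, not from the slides.) A minimal sketch of the TF-IDF weighting above, assuming the smoothed form w_{i,j} = tf_{i,j} × log(N / (1 + df_i)); the three toy documents are invented for the example.

import math
from collections import Counter

docs = [
    "sydney is the state capital of nsw",
    "sydney motel near the harbour",
    "cheap hotel and motel deals",
]
tokenised = [d.split() for d in docs]
N = len(tokenised)                      # total number of documents

df = Counter()                          # df_i: number of documents containing term i
for tokens in tokenised:
    for term in set(tokens):
        df[term] += 1

def tfidf(term, tokens):
    # w_{i,j} = tf_{i,j} * log(N / (1 + df_i))
    tf = tokens.count(term)
    return tf * math.log(N / (1 + df[term]))

print(tfidf("motel", tokenised[1]))     # df=2 -> log(3/3)=0, so weight 0.0
print(tfidf("harbour", tokenised[1]))   # df=1 -> log(3/2), a higher weight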
2 COUNT based WORD REPRESENTATION

Sparse Representation
With COUNT based word representation (especially, one-hot vector),
linguistic information was represented with sparse representations (high-
dimensional features)

hotel motel Inn


motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 … 0]

hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 … 0]

Inn = [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 … 1]
A Significant Improvement Required!


1. How to obtain a low-dimensional vector representation
2. How to represent word similarity

Maybe with a low-dimensional vector?

Can we use a list of fixed numbers (properties) to represent a word?
0 LECTURE PLAN

Lecture 2: Word Embeddings and Representation

1. Lab Info
2. Previous Lecture Review
1. Word Meaning and WordNet
2. Count based Word Representation

3. Prediction based Word Representation

1. Word Embedding
2. Word2Vec
3. FastText
4. GloVe

4. Next Week Preview

3 Prediction based Word representation

How to Represent the Word Similarity!


• How to represent word similarity with a dense vector

• Try this with word2vec

Reference: http://turbomaze.github.io/word2vecjson/
3 Prediction based Word representation

Let’s get familiar with using vectors to represent things


Assume that you are taking a personality test (the Big Five Personality Traits test)
1)Openness, 2)Agreeableness, 3)Conscientiousness, 4)Negative emotionality, 5)Extraversion
[Figure: Jane's personality scores plotted on 0–100 axes — Openness 40, Agreeableness 70]
3 Prediction based Word representation

Let’s get familiar with using vectors to represent things



Assume that you are taking a personality test (the Big Five Personality Traits test)
1)Openness, 2)Agreeableness, 3)Conscientiousness, 4)Negative emotionality, 5)Extraversion

Scores rescaled to a 0–1 range (Openness, Agreeableness):

Jane: 0.4, 0.7
Mark: 0.3, 0.2
Eve: 0.4, 0.6

[Figure: Jane, Mark and Eve plotted on the Openness–Agreeableness plane]
3 Prediction based Word representation

Let’s get familiar with using vectors to represent things


Which of two people (Mark or Eve) is more similar to Jane?
Cosine Similarity
A measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them.

[Figure: Jane, Eve and Mark as vectors in the Openness–Agreeableness plane]
3 Prediction based Word representation

Let’s get familiar with using vectors to represent things


Which of two people (Mark or Eve) is more similar to Jane?

(Openness, Agreeableness)

Jane: 0.4, 0.7
Mark: 0.3, 0.2
Eve: 0.4, 0.6

cos(Jane, Mark) ≈ 0.89
cos(Jane, Eve) ≈ 0.99
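
(Added illustration, not from the slides.) The two similarity values can be reproduced with a few lines of Python:

import numpy as np

def cosine(a, b):
    # Cosine of the angle between two vectors.
    a, b = np.asarray(a, float), np.asarray(b, float)
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

jane, mark, eve = [0.4, 0.7], [0.3, 0.2], [0.4, 0.6]
print(cosine(jane, mark))   # ~0.89
print(cosine(jane, eve))    # ~0.998 -> Eve is more similar to Jane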
3 Prediction based Word representation

Let’s get familiar with using vectors to represent things


We need all five major factors to represent the personality:

Jane: [0.4, 0.7, 0.5, 0.2, 0.1]
Mark: [0.3, 0.2, 0.3, 0.7, 0.2]
Eve: [0.4, 0.6, 0.4, 0.3, 0.5]

With these embeddings, we can:

1. Represent things as vectors of fixed numbers!
2. Easily calculate the similarity between vectors
3 Prediction based Word representation

Remember? The Word2Vec Demo!

This is a word embedding for the word “king”


3 Prediction based Word representation

Remember? The Word2Vec Demo!

This is a word embedding for the word “king”


* Trained on Wikipedia data, 50-dimensional GloVe vector

king = [0.50451, 0.68607, -0.59517, -0.022801, 0.60046, …]
(a 50-dimensional vector of real numbers; the remaining values are omitted here)
3 Prediction based Word representation

Remember? The Word2Vec Demo!

This is a word embedding for the word “king”


* Trained on Wikipedia data, 50-dimensional GloVe vector

[Figure: visualisation of the 50-dimensional vector for “king”]
3 Prediction based Word representation

Remember? The Word2Vec Demo!

Compare with Woman, Man, King, and Queen

[Figure: visual comparison of the vectors for “woman”, “man”, “king”, and “queen”]
3 Prediction based Word representation

Remember? The Word2Vec Demo!

Compare with Woman, Man, King, Queen, and Water

[Figure: visual comparison of the vectors for “woman”, “man”, “king”, “queen”, and “water”]
3 Prediction based Word representation

Remember? The Word2Vec Demo!

king – man + woman ≈ queen?

Word Algebra

[Figure: the vector arithmetic king − man + woman yields a vector closest to “queen”]
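
(Added illustration, not from the slides.) The same word algebra can be tried with gensim's pretrained vectors; the model name below, "glove-wiki-gigaword-50", is an assumption and not necessarily the exact vectors used in the demo.

import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")   # 50-dimensional GloVe vectors

# king - man + woman ~= queen ?
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))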
3 Prediction based Word representation

How to make dense vectors for word representation


“You shall know a word by the company it keeps”


— (Firth, J. R. 1957:11)

Prof. Firth is noted for drawing attention to the


context-dependent nature of meaning with his
notion of 'context of situation', and his work on
collocational meaning is widely acknowledged in the
field of distributional semantics.
Prof. John Rupert Firth
3 Prediction based Word representation

Word Representations in Context


When a word w appears in a text, its context is the set of words that appear nearby
• Use the surrounding contexts of w to build up a representation of w

These context words will represent Sydney


3 Prediction based Word representation

How can we train the word representation with a machine?


Neural Network! (Machine Learning)

These context words will represent Sydney


# Brief in Machine Learning!

Machine Learning
How to classify this with your machine?

Object: CAT
# Brief in Machine Learning!

Computer System

def prediction(image as input):


…program…
return result
Data Result
# Brief in Machine Learning!

Can we classify this with the computer system?

Object: ??? Object: ??? Object: ???


# Brief in Machine Learning!

Computer System VS Machine Learning

def prediction(image as input):


…program…
return result
Data Result

Data → Result
Image 1: Dog
Image 2: Cat
Image 3: Dog
Image 4: Cat
Image 5: Dog

Data + Result → training → Pattern

xi: input — words (indices or vectors), sentences, documents, etc.

yi: class — what we try to classify/predict


# Brief in Machine Learning!

Neural Network and Deep Learning

Neuron and Perceptron

[Figure: a biological neuron, and a perceptron with weighted inputs producing an output]
The detailed neural network and deep learning concepts will be covered in Lecture 3.
3 Prediction based Word representation

Neural Network and Deep Learning in Word Representation

“You shall know a word by the company it keeps” (Firth, J. R. 1957:11)


Why don’t we train a word by the company it keeps?
Why don’t we represent a word by the company it keeps?

[Diagram: the company it keeps (input) → perceptron → a word (output)]
3 Prediction based Word representation

Neural Network and Deep Learning in Word Representation


Wikipedia: “Sydney is the state capital of NSW…”

[Diagram: the company it keeps (input) → perceptron with weights → a word (output)]
3 Prediction based Word representation

Neural Network and Deep Learning in Word Representation


Wikipedia: “Sydney is the state capital of NSW…”

[Diagram: the company it keeps → a word; the learned weights become the word representation]
3 Prediction based Word representation

Neural Network and Deep Learning in Word Representation


Wikipedia: “Sydney is the state capital of NSW…”

[Diagram: context words → centre word; the learned weights are the word representation — this is Word2Vec]
3 Prediction based Word representation

Word2Vec
Word2vec can utilize either of two model architectures
to produce a distributed representation of words:

1. Continuous Bag of Words (CBOW)


Predict center word from (bag of) context words

Context words → Centre word

2. Continuous Skip-gram
Predict context (“outside”) words given center word

Centre word → Context words
3 Prediction based Word representation

Word2Vec with Continuous Bag of Words (CBOW)


Predict center word from (bag of) context words
Sentence: “Sydney is the state capital of NSW”

Aim
• Predict the center word

Setup
• Window size
• Assume that the window size is 2

[Figure: a window of size 2 slides over “Sydney is the state capital of NSW”, marking the center word and its context (“outside”) words at each position]
3 Prediction based Word representation

Word2Vec with Continuous Bag of Words (CBOW)


Predict center word from (bag of) context words
Sentence: “Sydney is the state capital of NSW”
Using window slicing, develop the training data

Center word → Context (“outside”) words
(one-hot positions follow the word order: Sydney, is, the, state, capital, of, NSW)

Sydney [1,0,0,0,0,0,0] → [0,1,0,0,0,0,0], [0,0,1,0,0,0,0]
is [0,1,0,0,0,0,0] → [1,0,0,0,0,0,0], [0,0,1,0,0,0,0], [0,0,0,1,0,0,0]
the [0,0,1,0,0,0,0] → [1,0,0,0,0,0,0], [0,1,0,0,0,0,0], [0,0,0,1,0,0,0], [0,0,0,0,1,0,0]
state [0,0,0,1,0,0,0] → [0,1,0,0,0,0,0], [0,0,1,0,0,0,0], [0,0,0,0,1,0,0], [0,0,0,0,0,1,0]
capital [0,0,0,0,1,0,0] → [0,0,1,0,0,0,0], [0,0,0,1,0,0,0], [0,0,0,0,0,1,0], [0,0,0,0,0,0,1]
of [0,0,0,0,0,1,0] → [0,0,0,1,0,0,0], [0,0,0,0,1,0,0], [0,0,0,0,0,0,1]
NSW [0,0,0,0,0,0,1] → [0,0,0,0,1,0,0], [0,0,0,0,0,1,0]
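
(Added illustration, not from the slides.) A sketch of the window-slicing step that produces these (context, center) training pairs; the word-to-index mapping simply follows the word order of the sentence.

sentence = "Sydney is the state capital of NSW".split()
vocab = {w: i for i, w in enumerate(sentence)}   # each word appears once here

def one_hot(word):
    v = [0] * len(vocab)
    v[vocab[word]] = 1
    return v

def cbow_pairs(tokens, window=2):
    # Build (context one-hot vectors, center one-hot vector) pairs.
    pairs = []
    for c in range(len(tokens)):
        context = [tokens[j]
                   for j in range(max(0, c - window), min(len(tokens), c + window + 1))
                   if j != c]
        pairs.append(([one_hot(w) for w in context], one_hot(tokens[c])))
    return pairs

for context, center in cbow_pairs(sentence, window=2):
    print(center, "<-", context)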
3 Prediction based Word representation

CBOW – Neural Network Architecture


Predict center word from (bag of) context words
Sentence: “Sydney is the state capital of NSW”

[Figure: CBOW architecture — the one-hot vectors of the context words “is”, “the”, “capital”, “of” enter the input layer, are combined in the projection layer, and the output layer predicts the center word “state”]
3 Prediction based Word representation

CBOW – Neural Network Architecture


Predict center word from (bag of) context words
Sentence: “Sydney is the state capital of NSW”

[Figure: CBOW architecture — each context one-hot vector is multiplied by W (V×N) to reach the N-dimensional projection layer, and W′ (N×V) maps the projection layer to the V-dimensional output layer]

N-Dimension: depends on the dimension of word representation you would like to set up
3 Prediction based Word representation

CBOW – Neural Network Architecture


Predict center word from (bag of) context words
Sentence: “Sydney is the state capital of NSW”

[Figure: multiplying the one-hot vector x_the by W (V×N) gives the embedded vector v_the]

N = Dimension of the Word Embedding (Representation)
3 Prediction based Word representation

CBOW – Neural Network Architecture


Predict center word from (bag of) context words
Sentence: “Sydney is the state capital of NSW”

[Figure: the embedded context vectors are averaged in the projection layer]

v̂ = (v_is + v_the + v_capital + v_of) / 2m   (m = window size)

N = Dimension of the Word Embedding (Representation)
3 Prediction based Word representation

CBOW – Neural Network Architecture


Predict center word from (bag of) context words
Sentence: “Sydney is the state capital of NSW”

[Figure: the projection-layer vector v̂ is multiplied by W′ (N×V) to produce the score vector z, from which the center word “state” is predicted]

z = W′ × v̂

Softmax: outputs a vector that represents a probability distribution (summing to 1) over a list of potential outcomes
3 Prediction based Word representation

CBOW – Neural Network Architecture


Predict center word from (bag of) context words
Sentence: “Sydney is the state capital of NSW”

[Figure: the predicted distribution ŷ = softmax(z) is compared with the one-hot vector of the true center word]

Cross Entropy: can be used as a loss function when optimising a classification model
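
(Added illustration, not from the slides.) A minimal sketch of the two pieces used at the output layer, softmax and the cross-entropy loss; the score vector below is an assumed example for a 4-word vocabulary.

import numpy as np

def softmax(z):
    # Turn a score vector into a probability distribution that sums to 1.
    e = np.exp(z - np.max(z))          # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(y_true, y_pred):
    # H(y_hat, y) = -sum_j y_j * log(y_hat_j), with y_true a one-hot vector.
    return -np.sum(y_true * np.log(y_pred + 1e-12))

z = np.array([1.2, 0.3, -0.8, 2.0])    # example scores over a 4-word vocabulary
y = np.array([0, 0, 0, 1])             # ground truth: the 4th word is the center word
p = softmax(z)
print(p, cross_entropy(y, p))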
3 Prediction based Word representation

CBOW – Neural Network Architecture


Predict center word from (bag of) context words
Sentence: “Sydney is the state capital of NSW”

[Figure: the cross-entropy loss is propagated back through W′ and W to update both parameter matrices]

BACK PROPAGATION
3 Prediction based Word representation

CBOW – Neural Network Architecture


Predict center word from (bag of) context words.
Summary of CBOW Training (Review your understanding with equations)

1. Initialise each word in a one-hot vector form.


𝒙𝒌 = [0,...,0,1,0,...,0]

2. Use context words (2m, based on window size =m)


as input of the Word2Vec-CBOW model.
(𝒙𝒄−𝒎 , 𝒙𝒄−𝒎+𝟏 , … , 𝒙𝒄−𝟏 , 𝒙𝒄+𝟏 , … , 𝒙𝒄+𝒎−𝟏 , 𝒙𝒄+𝒎 ) ∈ ℝ|𝑽|

3. Has two Parameter Matrices:


1) Parameter Matrix (from Input Layer to Hidden/Projection Layer)
𝐖 ∈ ℝ𝑉x𝑁
2) Parameter Matrix (to Output Layer)
𝐖′ ∈ ℝ𝑁x𝑉
3 Prediction based Word representation

CBOW – Neural Network Architecture


Predict center word from (bag of) context words.
Summary of CBOW Training (Review your understanding with equations)

4. Initial words are represented in one-hot vector form, so multiplying a
one-hot vector by W (V×N) gives a 1×N (embedded word) vector.

e.g. [0 1 0 0] × [[10, 2, 18], [15, 22, 3], [25, 11, 19], [4, 7, 22]] = [15, 22, 3]

(𝒗𝒄−𝒎 = 𝐖𝑥 𝑐−𝑚 , … , 𝒗𝒄+𝒎 = 𝐖𝑥 𝑐+𝑚 ) ∈ ℝ𝒏

5. Average those 2m embedded vectors to calculate


the value of the Hidden Layer.

v̂ = (v_{c−m} + v_{c−m+1} + … + v_{c+m}) / 2m
3 Prediction based Word representation

CBOW – Neural Network Architecture


Predict center word from (bag of) context words.
Summary of CBOW Training (Review your understanding with equations)

6. Calculate the score value for the output layer. A higher score is produced
when words are closer.
𝒛 = 𝐖 ′ × 𝑣ො ∈ ℝ|𝑽|

7. Calculate the probability using softmax


𝑦ො = 𝑠𝑜𝑓𝑡𝑚𝑎𝑥 (𝒛) ∈ ℝ|𝑽|

8. Train the parameter matrices using the objective function.


H(ŷ, y) = − Σ_{j=1}^{|V|} y_j log(ŷ_j)

* Focus on minimising this value.

Since y is a one-hot vector (a single 1, the rest 0s), only one term of the sum is non-zero:

H(ŷ, y) = − y_j log(ŷ_j)
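
(Added illustration, not from the slides.) A minimal NumPy sketch of one CBOW forward pass following steps 1–8; the dimensions, random initialisation and word indices are assumptions for illustration, and a real implementation would then backpropagate the loss to update W and W′.

import numpy as np

V, N, m = 7, 5, 2                        # vocab size, embedding dim, window size (assumed)
rng = np.random.default_rng(0)
W  = rng.normal(scale=0.1, size=(V, N))  # input layer -> projection layer
W2 = rng.normal(scale=0.1, size=(N, V))  # projection layer -> output layer (W')

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

context_ids = [1, 2, 4, 5]               # "is", "the", "capital", "of"
center_id = 3                            # "state"

v_hat = W[context_ids].mean(axis=0)      # steps 4-5: look up and average the 2m context vectors
z = v_hat @ W2                           # step 6: scores over the vocabulary
y_hat = softmax(z)                       # step 7: predicted probabilities
loss = -np.log(y_hat[center_id])         # step 8: cross-entropy against the one-hot target
print(loss)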
3 Prediction based Word representation

CBOW – Neural Network Architecture


Predict center word from (bag of) context words.
Summary of CBOW Training (Review your understanding with equations)

8-1. The optimisation objective function can be presented as minimising this
cross-entropy loss over all training (context, center) pairs.

*This optimisation objective will be covered in more detail in Lecture 3.

3 Prediction based Word representation

Skip Gram
Predict context (“outside”) words (position independent) given center word
Sentence: “Sydney is the state capital of NSW”

[Figure: from the center word w_t in “… Sydney is the state capital of NSW …”, predict P(w_{t−2}|w_t), P(w_{t−1}|w_t), P(w_{t+1}|w_t), P(w_{t+2}|w_t)]


3 Prediction based Word representation

Skip Gram
Predict context (“outside”) words (position independent) given center word
Sentence: “Sydney is the state capital of NSW”

[Figure: skip-gram architecture — the one-hot vector of the center word “state” enters the input layer, and the output layer predicts the context words “is”, “the”, “capital”, “of”]
3 Prediction based Word representation

Skip Gram – Neural Network Architecture


Predict context (“outside”) words (position independent) given center word
Summary of Skip Gram Training (Review your understanding with equations)

1. Initialise the centre word in a one-hot vector form.


𝒙𝒌 = [0,...,0,1,0,...,0]
𝒙 ∈ ℝ|𝑽|

2. Has two Parameter Matrices:


1) Parameter Matrix (from Input Layer to Hidden/Projection Layer)
𝐖 ∈ ℝ𝑉x𝑁
2) Parameter Matrix (to Output Layer)
𝐖′ ∈ ℝ𝑁x𝑉
3 Prediction based Word representation

Skip Gram – Neural Network Architecture


Predict context (“outside”) words (position independent) given center word
Summary of Skip Gram Training (Review your understanding with equations)

3. Initial words are represented in one-hot vector form, so multiplying a
one-hot vector by W (V×N) gives a 1×N (embedded word) vector.

e.g. [0 1 0 0] × [[10, 2, 18], [15, 22, 3], [25, 11, 19], [4, 7, 22]] = [15, 22, 3]

𝒗𝒄 = 𝐖𝒙 ∈ ℝ𝒏 (as there is only one input)

4. Calculate the score value for the output layer by


multiplying the parameter matrix W’
𝒛 = 𝐖′𝒗𝒄
3 Prediction based Word representation

Skip Gram – Neural Network Architecture


Predict context (“outside”) words (position independent) given center word
Summary of Skip Gram Training (Review your understanding with equations)

5. Calculate the probability using softmax


𝑦ො = 𝑠𝑜𝑓𝑡𝑚𝑎𝑥 (𝒛)

6. Calculate 2m probabilities as we need to predict 2m


context words.
𝑦ො𝒄−𝒎 , … , 𝑦ො𝒄−𝟏 , 𝑦ො𝒄+𝟏 , … , 𝑦ො𝒄+𝒎

and compare with the ground truth (one-hot vector)


𝑦 (𝑐−𝑚) , … , 𝑦 (𝑐−1) , 𝑦 (𝑐+1) , … , 𝑦 (𝑐+𝑚)
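
(Added illustration, not from the slides.) A matching sketch for skip-gram with the same assumed dimensions: the center word is embedded once, and the single predicted distribution is compared against each of the 2m context words.

import numpy as np

V, N = 7, 5                              # vocab size and embedding dim (assumed)
rng = np.random.default_rng(0)
W  = rng.normal(scale=0.1, size=(V, N))  # input layer -> projection layer
W2 = rng.normal(scale=0.1, size=(N, V))  # projection layer -> output layer (W')

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

center_id = 3                            # "state"
context_ids = [1, 2, 4, 5]               # "is", "the", "capital", "of"

v_c = W[center_id]                       # step 3: embed the center word
y_hat = softmax(v_c @ W2)                # steps 4-5: scores, then softmax
loss = -sum(np.log(y_hat[o]) for o in context_ids)   # step 6+: one loss term per context word
print(loss)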
3 Prediction based Word representation

Skip Gram – Neural Network Architecture


Predict context (“outside”) words (position independent) given center word
Summary of Skip Gram Training (Review your understanding with equations)
8. As in CBOW, use an objective function to evaluate the model. A key
difference is that we invoke a Naïve Bayes assumption to factorise the
probabilities. It is a strong conditional independence assumption: given the
center word, all output words are completely independent.
3 Prediction based Word representation

Skip Gram – Neural Network Architecture


Predict context (“outside”) words (position independent) given center word
Summary of Skip Gram Training (Review your understanding with equations)


8-1. With this objective function, we can compute the


gradients with respect to the unknown parameters
and at each iteration update them via Stochastic
Gradient Descent

*Stochastic Gradient Descent will be covered in more detail in Lecture 3.
3 Prediction based Word representation

CBOW vs Skip Gram Overview

CBOW: predict the center word from (a bag of) context words
Skip-gram: predict context words given the center word
3 Prediction based Word representation

Key Parameter (1) for Training methods: Window Size


Different tasks are served better by different window sizes.

Smaller window sizes (2–15) lead to embeddings where a high similarity score
between two embeddings indicates that the words are interchangeable.

Larger window sizes (15–50, or even more) lead to embeddings where
similarity is more indicative of the relatedness of the words.
3 Prediction based Word representation

Key Parameter (2) for Training methods: Negative Samples


The number of negative samples is another factor of the training process.
Negative samples are samples of words that are not neighbours of the input word, added to our dataset.

Negative samples: 2
Input word | Output word | Target
eat | mango | 1
eat | exam | 0
eat | tobacco | 0

Negative samples: 5
Input word | Output word | Target
eat | mango | 1
eat | exam | 0
eat | tobacco | 0
eat | pool | 0
eat | supervisor | 0

*1 = appeared (a true neighbour), 0 = did not appear

The original paper prescribes 5-20 as being a good number of negative
samples. It also states that 2-5 seems to be enough when you have a large
enough dataset.
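
(Added illustration, not from the slides.) A sketch of how a table like the one above can be generated; real word2vec implementations draw negatives from a smoothed unigram distribution rather than uniformly, and the vocabulary here is invented.

import random

vocab = ["mango", "exam", "tobacco", "pool", "supervisor", "rice", "lecture"]

def make_samples(input_word, true_context, k, seed=0):
    # One positive (target 1) pair plus k negative (target 0) pairs.
    rng = random.Random(seed)
    negatives = rng.sample([w for w in vocab if w != true_context], k)
    rows = [(input_word, true_context, 1)]
    rows += [(input_word, w, 0) for w in negatives]
    return rows

for row in make_samples("eat", "mango", k=5):
    print(row)   # (input word, output word, target)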
3 Prediction based Word representation

Word2Vec Overview
Word2vec (Mikolov et al. 2013) is a framework for learning word vectors

Idea:

• Have a large corpus of text

• Every word in a fixed vocabulary is represented by a vector


• Go through each position t in the text, which has a center word c and context
(“outside”) words o
• Use the similarity of the word vectors for c and o to calculate the probability of
o given c (or vice versa)
• Keep adjusting the word vectors to maximize this probability
3 Prediction based Word representation

Let’s try some Word2Vec!

Gensim: https://radimrehurek.com/gensim/models/word2vec.html
Resources: https://wit3.fbk.eu/
https://github.com/3Top/word2vec-api#where-to-get-a-pretrained-models
3 Prediction based Word representation

Limitation of Word2Vec
Issue#1: Cannot cover the morphological similarity
• Word2vec represents every word as an independent vector, even though
many words are morphologically similar, like: teach, teacher, teaching

Issue#2: Hard to conduct embedding for rare words


• Word2vec is based on the distributional hypothesis. It works well for
frequent words but does not embed rare words well.
(the same concept as under-fitting in machine learning)


Issue#3: Cannot handle the Out-of-Vocabulary (OOV)


• Word2vec does not work at all if the word is not included in the vocabulary
3 Prediction based Word representation

FastText
• Deals with these Word2Vec limitations
• Another way to transform words into vectors

• FastText is a library for learning word embeddings and text classification
created by Facebook's AI Research lab. The model allows creating an
unsupervised or supervised learning algorithm for obtaining vector
representations of words.

• Extension to Word2Vec
• Instead of feeding individual words into the Neural Network, FastText breaks words
into several n-grams (sub-words)

https://fasttext.cc/
3 Prediction based Word representation


FastText with N-gram Embeddings


• N-grams are simply all combinations of adjacent words or letters of length n
that you can find in your source text. For example, given the word apple, all 2-
grams (or “bigrams”) are ap, pp, pl, and le

• The tri-grams (n=3) for the word apple are app, ppl, and ple (ignoring the
starting and ending boundaries of words). The word embedding vector for
apple will be the sum of the vectors of all these n-grams.

apple → 2-grams: ap, pp, pl, le;  3-grams: app, ppl, ple

• After training the Neural Network (either with skip-gram or CBOW), we will
have word embeddings for all the n-grams given the training dataset.

• Rare words can now be properly represented since it is highly likely that some
of their n-grams also appears in other words.
https://fasttext.cc/
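
(Added illustration, not from the slides.) The sub-word decomposition is easy to sketch; note that the real FastText also wraps each word in "<" and ">" boundary symbols before extracting n-grams, which is ignored here as in the slide.

def char_ngrams(word, n):
    # All character n-grams of `word`, ignoring word boundaries.
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("apple", 2))   # ['ap', 'pp', 'pl', 'le']
print(char_ngrams("apple", 3))   # ['app', 'ppl', 'ple']

# The FastText vector for "apple" is (conceptually) the sum of the vectors
# of its n-grams, which is why rare or unseen words can still be represented.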
3 Prediction based Word representation

Word2Vec VS FastText
Find synonyms with Word2Vec

import pprint
from gensim.models import Word2Vec

# `result` is assumed to be a pre-tokenised corpus: a list of lists of words.
# Note: gensim 4.x renamed the `size` parameter to `vector_size`.
cbow_model = Word2Vec(sentences=result, size=100, window=5, min_count=5, workers=4, sg=0)

# Raises a KeyError if "electrofishing" is not in the vocabulary (the OOV limitation).
a = cbow_model.wv.most_similar("electrofishing")
pprint.pprint(a)

Find synonyms with FastText

import pprint
from gensim.models import FastText

FT_model = FastText(sentences=result, size=100, window=5, min_count=5, workers=4, sg=0)

# FastText can build a vector for "electrofishing" from its sub-word n-grams,
# so this works even when the word itself is rare or unseen.
a = FT_model.wv.most_similar("electrofishing")
pprint.pprint(a)

https://fasttext.cc/
3 Prediction based Word representation

Global Vectors (GloVe)


• Deals with this Word2Vec limitation

“Methods like skip-gram may do better on the analogy task, but they poorly utilize
the statistics of the corpus since they train on separate local context windows
instead of on global co-occurrence counts.”
(Pennington et al., 2014)

• Focus on global co-occurrence counts

https://nlp.stanford.edu/projects/glove/
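
(Added illustration, not from the slides.) A small sketch of what "global co-occurrence counts" means: a word–word co-occurrence matrix accumulated over the whole corpus with a symmetric window; GloVe then fits word vectors to these (weighted, logged) counts rather than to individual local windows. The two sentences are invented.

from collections import defaultdict

corpus = [
    "sydney is the state capital of nsw",
    "canberra is the capital of australia",
]

window = 2
cooc = defaultdict(float)
for sentence in corpus:
    tokens = sentence.split()
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                cooc[(w, tokens[j])] += 1.0   # accumulated globally over the corpus

print(cooc[("capital", "of")])   # how often "capital" and "of" co-occur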
3 Prediction based Word representation

Limitation of Prediction based Word Representation

• “I like ___” — apple / banana / fruit

• The training dataset is reflected in the resulting word representations

• The similarity neighbourhood of the word ‘software’ learned from the Google News
corpus can be different from the one learned from Twitter.

https://nlp.stanford.edu/projects/glove/
4 NEXT WEEK PREVIEW…

Word Embeddings
• Finalisation!

Machine Learning/ Deep Learning for Natural Language Processing


Reference

Reference for this lecture


• Deng, L., & Liu, Y. (Eds.). (2018). Deep Learning in Natural Language Processing. Springer.
• Rao, D., & McMahan, B. (2019). Natural Language Processing with PyTorch: Build Intelligent Language
Applications Using Deep Learning. O'Reilly Media, Inc.
• Manning, C. D., Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language
processing. MIT press.
• Manning, C 2017, Introduction and Word Vectors, Natural Language Processing with Deep Learning,
lecture notes, Stanford University
• Images: http://jalammar.github.io/illustrated-word2vec/
Word2vec
• Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in
vector space. arXiv preprint arXiv:1301.3781.
• Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of
words and phrases and their compositionality. In Advances in neural information processing systems (pp.
3111-3119).

FastText
• Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword
information. Transactions of the Association for Computational Linguistics, 5, 135-146.
• Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., & Joulin, A. (2017). Advances in pre-training
distributed word representations. arXiv preprint arXiv:1712.09405.
