
COMP5046

Natural Language Processing


Lecture 2: Word Embeddings and Representation

Dr. Caren Han


Semester 1, 2021
School of Computer Science,
University of Sydney
0 LECTURE PLAN

Lecture 2: Word Embeddings and Representation

1. Lab Info
2. Previous Lecture Review
1. Word Meaning and WordNet
2. Count based Word Representation

3. Prediction based Word Representation


1. Introduction to the concept ‘Prediction’
2. Word2Vec
3. FastText
4. GloVe

4. Next Week Preview


1 Info: Lab Exercise

What do we do during Labs?


In labs, students will use Google Colab.
Colaboratory is a free Jupyter notebook environment that requires no setup and runs entirely
in the cloud. With Colaboratory you can write and execute code, save and share your
analyses, and access powerful computing resources, all for free from your browser.
1 Info: Lab Exercise

Submissions
How to Submit
Students should submit their .ipynb file (download it from “File” >
“Download .ipynb”) to Canvas.

When and Where to Submit


Students must submit Lab 1 (for Week 2) by Monday of Week 3, 11:59 PM.
0 LECTURE PLAN

Lecture 2: Word Embeddings and Representation

1. Lab Info
2. Count-based Word Representation
1. Word Meaning
2. Limitations

3. Prediction based Word Representation


1. Introduction to the concept ‘Prediction’
2. Word2Vec
3. FastText
4. GloVe

4. Next Week Preview


2 WORD REPRESENTATION

How to represent the meaning of a word?

Definition: meaning (Collins dictionary).


• the idea that it represents, and which can be explained using other words.
• the thoughts or ideas that are intended to be expressed by it.

Commonest linguistic way of thinking of meaning:


signifier (symbol) ⟺ signified (idea or thing) = denotation

“Computer” “Apple”
2 COUNT based WORD REPRESENTATION

Problem with one-hot vectors


Problem #1. No word similarity representation
Example: in web search, if user searches for “Sydney motel”, we would like to
match documents containing “Sydney Inn”

hotel motel Inn


motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 … 0]

hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 … 0]

Inn = [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 … 1]

There is no natural notion of similarity for one-hot vectors!

Problem #2. Inefficiency


Vector dimension = number of words in vocabulary
Each representation has only a single ‘1’ with all remaining 0s.
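
(Added illustration, not from the slides.) A minimal Python sketch of Problem #1: the dot product, and hence the cosine similarity, of any two distinct one-hot vectors is zero, so one-hot representations carry no notion of word similarity. The vocabulary size and word positions below are made-up assumptions.

import numpy as np

def one_hot(index, vocab_size):
    # One-hot vector: a single 1 at `index`, 0 everywhere else.
    v = np.zeros(vocab_size)
    v[index] = 1.0
    return v

vocab_size = 16                  # assumed toy vocabulary size
motel = one_hot(10, vocab_size)  # word positions are arbitrary for the example
hotel = one_hot(7, vocab_size)
inn   = one_hot(15, vocab_size)

# The dot product of two different one-hot vectors is always 0.
print(np.dot(motel, hotel))      # 0.0
print(np.dot(motel, inn))        # 0.0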
2 COUNT based WORD REPRESENTATION

Problem with BoW (Bag of Words)


• The intuition is that documents are similar if they have similar content, and
that from the content alone we can learn something about the meaning of the
document.
• Discarding word order ignores context, and in turn the meaning of words in
the document (semantics). If context and meaning were modelled, the model
could tell the difference between the same words arranged differently
(“this is interesting” vs “is this interesting”).

S1= I love you but you hate me

S2= I hate you but you love me


2 COUNT based WORD REPRESENTATION

Limitations of Term Frequency–Inverse Document Frequency (TF-IDF)

w_{i,j} = tf_{i,j} × log( N / (1 + df_i) )

wi,j = weight of term i in document j

tfi,j = number of occurrences of term i in document j

N = total number of documents

dfi = number of documents containing term i

• It computes document similarity directly in the word-count space, which may


be slow for large vocabularies.
• It assumes that the counts of different words provide independent evidence of
similarity.
• It makes no use of semantic similarities between words.
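
(Added illustration, not from the slides.) A minimal sketch of the TF-IDF weighting above, assuming the smoothed form w_{i,j} = tf_{i,j} × log(N / (1 + df_i)); the three toy documents are invented for the example.

import math
from collections import Counter

docs = [
    "sydney is the state capital of nsw",
    "sydney motel near the harbour",
    "cheap hotel and motel deals",
]
tokenised = [d.split() for d in docs]
N = len(tokenised)                      # total number of documents

df = Counter()                          # df_i: number of documents containing term i
for tokens in tokenised:
    for term in set(tokens):
        df[term] += 1

def tfidf(term, tokens):
    # w_{i,j} = tf_{i,j} * log(N / (1 + df_i))
    tf = tokens.count(term)
    return tf * math.log(N / (1 + df[term]))

print(tfidf("motel", tokenised[1]))     # df=2 -> log(3/3)=0, so weight 0.0
print(tfidf("harbour", tokenised[1]))   # df=1 -> log(3/2), a higher weight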
2 COUNT based WORD REPRESENTATION

Sparse Representation
With COUNT based word representation (especially, one-hot vector),
linguistic information was represented with sparse representations (high-
dimensional features)

hotel motel Inn


motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 … 0]

hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 … 0]

Inn = [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 … 1]
A Significant Improvement Required!


1. How to obtain a low-dimensional vector representation
2. How to represent word similarity

Maybe with a low-dimensional vector?

Can we use a list of fixed numbers (properties) to represent a word?
0 LECTURE PLAN

Lecture 2: Word Embeddings and Representation

1. Lab Info
2. Previous Lecture Review
1. Word Meaning and WordNet
2. Count based Word Representation

3. Prediction based Word Representation

1. Word Embedding
2. Word2Vec
3. FastText
4. GloVe

4. Next Week Preview

3 Prediction based Word representation

How to Represent the Word Similarity!


• How to represent word similarity with a dense vector

• Try this with word2vec

Reference: http://turbomaze.github.io/word2vecjson/
3 Prediction based Word representation

Let’s get familiar with using vectors to represent things


Assume that you are taking a personality test (the Big Five Personality Traits test)
1)Openness, 2)Agreeableness, 3)Conscientiousness, 4)Negative emotionality, 5)Extraversion
[Figure: Jane's personality scores plotted on 0–100 axes — Openness 40, Agreeableness 70]
3 Prediction based Word representation

Let’s get familiar with using vectors to represent things



Assume that you are taking a personality test (the Big Five Personality Traits test)
1)Openness, 2)Agreeableness, 3)Conscientiousness, 4)Negative emotionality, 5)Extraversion

Scores rescaled to a 0–1 range (Openness, Agreeableness):

Jane: 0.4, 0.7
Mark: 0.3, 0.2
Eve: 0.4, 0.6

[Figure: Jane, Mark and Eve plotted on the Openness–Agreeableness plane]
3 Prediction based Word representation

Let’s get familiar with using vectors to represent things


Which of two people (Mark or Eve) is more similar to Jane?
Cosine Similarity
A measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them.

[Figure: Jane, Eve and Mark as vectors in the Openness–Agreeableness plane]
3 Prediction based Word representation

Let’s get familiar with using vectors to represent things


Which of two people (Mark or Eve) is more similar to Jane?

(Openness, Agreeableness)

Jane: 0.4, 0.7
Mark: 0.3, 0.2
Eve: 0.4, 0.6

cos(Jane, Mark) ≈ 0.89
cos(Jane, Eve) ≈ 0.99
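
(Added illustration, not from the slides.) The two similarity values can be reproduced with a few lines of Python:

import numpy as np

def cosine(a, b):
    # Cosine of the angle between two vectors.
    a, b = np.asarray(a, float), np.asarray(b, float)
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

jane, mark, eve = [0.4, 0.7], [0.3, 0.2], [0.4, 0.6]
print(cosine(jane, mark))   # ~0.89
print(cosine(jane, eve))    # ~0.998 -> Eve is more similar to Jane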
3 Prediction based Word representation

Let’s get familiar with using vectors to represent things


We need all five major factors to represent the personality:

Jane: [0.4, 0.7, 0.5, 0.2, 0.1]
Mark: [0.3, 0.2, 0.3, 0.7, 0.2]
Eve: [0.4, 0.6, 0.4, 0.3, 0.5]

With these embeddings, we can:

1. Represent things as vectors of fixed numbers!
2. Easily calculate the similarity between vectors
3 Prediction based Word representation

Remember? The Word2Vec Demo!

This is a word embedding for the word “king”


3 Prediction based Word representation

Remember? The Word2Vec Demo!

This is a word embedding for the word “king”


* Trained on Wikipedia data, 50-dimensional GloVe vector

king = [0.50451, 0.68607, -0.59517, -0.022801, 0.60046, …]
(a 50-dimensional vector of real numbers; the remaining values are omitted here)
3 Prediction based Word representation

Remember? The Word2Vec Demo!

This is a word embedding for the word “king”


* Trained on Wikipedia data, 50-dimensional GloVe vector

[Figure: visualisation of the 50-dimensional vector for “king”]
3 Prediction based Word representation

Remember? The Word2Vec Demo!

Compare with Woman, Man, King, and Queen

[Figure: visual comparison of the vectors for “woman”, “man”, “king”, and “queen”]
3 Prediction based Word representation

Remember? The Word2Vec Demo!

Compare with Woman, Man, King, Queen, and Water

[Figure: visual comparison of the vectors for “woman”, “man”, “king”, “queen”, and “water”]
3 Prediction based Word representation

Remember? The Word2Vec Demo!

king – man + woman ≈ queen?

Word Algebra

[Figure: the vector arithmetic king − man + woman yields a vector closest to “queen”]
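
(Added illustration, not from the slides.) The same word algebra can be tried with gensim's pretrained vectors; the model name below, "glove-wiki-gigaword-50", is an assumption and not necessarily the exact vectors used in the demo.

import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")   # 50-dimensional GloVe vectors

# king - man + woman ~= queen ?
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))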
3 Prediction based Word representation

How to make dense vectors for word representation


“You shall know a word by the company it keeps”


— (Firth, J. R. 1957:11)

Prof. Firth is noted for drawing attention to the


context-dependent nature of meaning with his
notion of 'context of situation', and his work on
collocational meaning is widely acknowledged in the
field of distributional semantics.
Prof. John Rupert Firth
3 Prediction based Word representation

Word Representations in Context


When a word w appears in a text, its context is the set of words that appear nearby
• Use the surrounding contexts of w to build up a representation of w

These context words will represent Sydney


3 Prediction based Word representation

How can we train the word representation with a machine?


Neural Network! (Machine Learning)

These context words will represent Sydney


# Brief in Machine Learning!

Machine Learning
How to classify this with your machine?

Object: CAT
# Brief in Machine Learning!

Computer System

def prediction(image as input):


…program…
return result
Data Result
# Brief in Machine Learning!

Can we classify this with the computer system?

Object: ??? Object: ??? Object: ???


# Brief in Machine Learning!

Computer System VS Machine Learning

def prediction(image as input):


…program…
return result
Data Result

Data → Result
Image 1: Dog
Image 2: Cat
Image 3: Dog
Image 4: Cat
Image 5: Dog

Data + Result → training → Pattern

xi: input — words (indices or vectors), sentences, documents, etc.

yi: class — what we try to classify/predict


# Brief in Machine Learning!

Neural Network and Deep Learning

Neuron and Perceptron

[Figure: a biological neuron, and a perceptron with weighted inputs producing an output]
The detailed neural network and deep learning concepts will be covered in Lecture 3.
3 Prediction based Word representation

Neural Network and Deep Learning in Word Representation

“You shall know a word by the company it keeps” (Firth, J. R. 1957:11)


Why don’t we train a word by the company it keeps?
Why don’t we represent a word by the company it keeps?

[Diagram: the company it keeps (input) → perceptron → a word (output)]
3 Prediction based Word representation

Neural Network and Deep Learning in Word Representation


Wikipedia: “Sydney is the state capital of NSW…”

[Diagram: the company it keeps (input) → perceptron with weights → a word (output)]
3 Prediction based Word representation

Neural Network and Deep Learning in Word Representation


Wikipedia: “Sydney is the state capital of NSW…”

[Diagram: the company it keeps → a word; the learned weights become the word representation]
3 Prediction based Word representation

Neural Network and Deep Learning in Word Representation


Wikipedia: “Sydney is the state capital of NSW…”

[Diagram: context words → centre word; the learned weights are the word representation — this is Word2Vec]
3 Prediction based Word representation

Word2Vec
Word2vec can utilize either of two model architectures
to produce a distributed representation of words:

1. Continuous Bag of Words (CBOW)


Predict center word from (bag of) context words

Context words → Centre word

2. Continuous Skip-gram
Predict context (“outside”) words given center word

Centre word → Context words
3 Prediction based Word representation

Word2Vec with Continuous Bag of Words (CBOW)


Predict center word from (bag of) context words
Sentence: “Sydney is the state capital of NSW”

Aim
• Predict the center word

Setup
• Window size
• Assume that the window size is 2

[Figure: a window of size 2 slides over “Sydney is the state capital of NSW”, marking the center word and its context (“outside”) words at each position]
3 Prediction based Word representation

Word2Vec with Continuous Bag of Words (CBOW)


Predict center word from (bag of) context words
Sentence: “Sydney is the state capital of NSW”
Using window slicing, develop the training data

Center word → Context (“outside”) words
(one-hot positions follow the word order: Sydney, is, the, state, capital, of, NSW)

Sydney [1,0,0,0,0,0,0] → [0,1,0,0,0,0,0], [0,0,1,0,0,0,0]
is [0,1,0,0,0,0,0] → [1,0,0,0,0,0,0], [0,0,1,0,0,0,0], [0,0,0,1,0,0,0]
the [0,0,1,0,0,0,0] → [1,0,0,0,0,0,0], [0,1,0,0,0,0,0], [0,0,0,1,0,0,0], [0,0,0,0,1,0,0]
state [0,0,0,1,0,0,0] → [0,1,0,0,0,0,0], [0,0,1,0,0,0,0], [0,0,0,0,1,0,0], [0,0,0,0,0,1,0]
capital [0,0,0,0,1,0,0] → [0,0,1,0,0,0,0], [0,0,0,1,0,0,0], [0,0,0,0,0,1,0], [0,0,0,0,0,0,1]
of [0,0,0,0,0,1,0] → [0,0,0,1,0,0,0], [0,0,0,0,1,0,0], [0,0,0,0,0,0,1]
NSW [0,0,0,0,0,0,1] → [0,0,0,0,1,0,0], [0,0,0,0,0,1,0]
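
(Added illustration, not from the slides.) A sketch of the window-slicing step that produces these (context, center) training pairs; the word-to-index mapping simply follows the word order of the sentence.

sentence = "Sydney is the state capital of NSW".split()
vocab = {w: i for i, w in enumerate(sentence)}   # each word appears once here

def one_hot(word):
    v = [0] * len(vocab)
    v[vocab[word]] = 1
    return v

def cbow_pairs(tokens, window=2):
    # Build (context one-hot vectors, center one-hot vector) pairs.
    pairs = []
    for c in range(len(tokens)):
        context = [tokens[j]
                   for j in range(max(0, c - window), min(len(tokens), c + window + 1))
                   if j != c]
        pairs.append(([one_hot(w) for w in context], one_hot(tokens[c])))
    return pairs

for context, center in cbow_pairs(sentence, window=2):
    print(center, "<-", context)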
3 Prediction based Word representation

CBOW – Neural Network Architecture


Predict center word from (bag of) context words
Sentence: “Sydney is the state capital of NSW”

[Figure: CBOW architecture — the one-hot vectors of the context words “is”, “the”, “capital”, “of” enter the input layer, are combined in the projection layer, and the output layer predicts the center word “state”]
3 Prediction based Word representation

CBOW – Neural Network Architecture


Predict center word from (bag of) context words
Sentence: “Sydney is the state capital of NSW”

[Figure: CBOW architecture — each context one-hot vector is multiplied by W (V×N) to reach the N-dimensional projection layer, and W′ (N×V) maps the projection layer to the V-dimensional output layer]

N-Dimension: depends on the dimension of word representation you would like to set up
3 Prediction based Word representation

CBOW – Neural Network Architecture


Predict center word from (bag of) context words
Sentence: “Sydney is the state capital of NSW”

[Figure: multiplying the one-hot vector x_the by W (V×N) gives the embedded vector v_the]

N = Dimension of the Word Embedding (Representation)
3 Prediction based Word representation

CBOW – Neural Network Architecture


Predict center word from (bag of) context words
Sentence: “Sydney is the state capital of NSW”

[Figure: the embedded context vectors are averaged in the projection layer]

v̂ = (v_is + v_the + v_capital + v_of) / 2m   (m = window size)

N = Dimension of the Word Embedding (Representation)
3 Prediction based Word representation

CBOW – Neural Network Architecture


Predict center word from (bag of) context words
Sentence: “Sydney is the state capital of NSW”

[Figure: the projection-layer vector v̂ is multiplied by W′ (N×V) to produce the score vector z, from which the center word “state” is predicted]

z = W′ × v̂

Softmax: outputs a vector that represents a probability distribution (summing to 1) over a list of potential outcomes
3 Prediction based Word representation

CBOW – Neural Network Architecture


Predict center word from (bag of) context words
Sentence: “Sydney is the state capital of NSW”

[Figure: the predicted distribution ŷ = softmax(z) is compared with the one-hot vector of the true center word]

Cross Entropy: can be used as a loss function when optimising a classification model
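
(Added illustration, not from the slides.) A minimal sketch of the two pieces used at the output layer, softmax and the cross-entropy loss; the score vector below is an assumed example for a 4-word vocabulary.

import numpy as np

def softmax(z):
    # Turn a score vector into a probability distribution that sums to 1.
    e = np.exp(z - np.max(z))          # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(y_true, y_pred):
    # H(y_hat, y) = -sum_j y_j * log(y_hat_j), with y_true a one-hot vector.
    return -np.sum(y_true * np.log(y_pred + 1e-12))

z = np.array([1.2, 0.3, -0.8, 2.0])    # example scores over a 4-word vocabulary
y = np.array([0, 0, 0, 1])             # ground truth: the 4th word is the center word
p = softmax(z)
print(p, cross_entropy(y, p))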
3 Prediction based Word representation

CBOW – Neural Network Architecture


Predict center word from (bag of) context words
Sentence: “Sydney is the state capital of NSW”

[Figure: the cross-entropy loss is propagated back through W′ and W to update both parameter matrices]

BACK PROPAGATION
3 Prediction based Word representation

CBOW – Neural Network Architecture


Predict center word from (bag of) context words.
Summary of CBOW Training (Review your understanding with equations)

1. Initialise each word in a one-hot vector form.


𝒙𝒌 = [0,...,0,1,0,...,0]

2. Use context words (2m, based on window size =m)


as input of the Word2Vec-CBOW model.
(𝒙𝒄−𝒎 , 𝒙𝒄−𝒎+𝟏 , … , 𝒙𝒄−𝟏 , 𝒙𝒄+𝟏 , … , 𝒙𝒄+𝒎−𝟏 , 𝒙𝒄+𝒎 ) ∈ ℝ|𝑽|

3. Has two Parameter Matrices:


1) Parameter Matrix (from Input Layer to Hidden/Projection Layer)
𝐖 ∈ ℝ𝑉x𝑁
2) Parameter Matrix (to Output Layer)
𝐖′ ∈ ℝ𝑁x𝑉
3 Prediction based Word representation

CBOW – Neural Network Architecture


Predict center word from (bag of) context words.
Summary of CBOW Training (Review your understanding with equations)

4. Initial words are represented in one-hot vector form, so multiplying a
one-hot vector by W (V×N) gives a 1×N (embedded word) vector.

e.g. [0 1 0 0] × [[10, 2, 18], [15, 22, 3], [25, 11, 19], [4, 7, 22]] = [15, 22, 3]

(𝒗𝒄−𝒎 = 𝐖𝑥 𝑐−𝑚 , … , 𝒗𝒄+𝒎 = 𝐖𝑥 𝑐+𝑚 ) ∈ ℝ𝒏

5. Average those 2m embedded vectors to calculate


the value of the Hidden Layer.

v̂ = (v_{c−m} + v_{c−m+1} + … + v_{c+m}) / 2m
3 Prediction based Word representation

CBOW – Neural Network Architecture


Predict center word from (bag of) context words.
Summary of CBOW Training (Review your understanding with equations)

6. Calculate the score value for the output layer. A higher score is produced
when words are closer.
𝒛 = 𝐖 ′ × 𝑣ො ∈ ℝ|𝑽|

7. Calculate the probability using softmax


𝑦ො = 𝑠𝑜𝑓𝑡𝑚𝑎𝑥 (𝒛) ∈ ℝ|𝑽|

8. Train the parameter matrices using the objective function.


H(ŷ, y) = − Σ_{j=1}^{|V|} y_j log(ŷ_j)

* Focus on minimising this value.

Since y is a one-hot vector (a single 1, the rest 0s), only one term of the sum is non-zero:

H(ŷ, y) = − y_j log(ŷ_j)
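
(Added illustration, not from the slides.) A minimal NumPy sketch of one CBOW forward pass following steps 1–8; the dimensions, random initialisation and word indices are assumptions for illustration, and a real implementation would then backpropagate the loss to update W and W′.

import numpy as np

V, N, m = 7, 5, 2                        # vocab size, embedding dim, window size (assumed)
rng = np.random.default_rng(0)
W  = rng.normal(scale=0.1, size=(V, N))  # input layer -> projection layer
W2 = rng.normal(scale=0.1, size=(N, V))  # projection layer -> output layer (W')

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

context_ids = [1, 2, 4, 5]               # "is", "the", "capital", "of"
center_id = 3                            # "state"

v_hat = W[context_ids].mean(axis=0)      # steps 4-5: look up and average the 2m context vectors
z = v_hat @ W2                           # step 6: scores over the vocabulary
y_hat = softmax(z)                       # step 7: predicted probabilities
loss = -np.log(y_hat[center_id])         # step 8: cross-entropy against the one-hot target
print(loss)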
3 Prediction based Word representation

CBOW – Neural Network Architecture


Predict center word from (bag of) context words.
Summary of CBOW Training (Review your understanding with equations)

8-1. The optimisation objective function can be presented as minimising this
cross-entropy loss over all training (context, center) pairs.

*This optimisation objective will be covered in more detail in Lecture 3.

3 Prediction based Word representation

Skip Gram
Predict context (“outside”) words (position independent) given center word
Sentence: “Sydney is the state capital of NSW”

[Figure: from the center word w_t in “… Sydney is the state capital of NSW …”, predict P(w_{t−2}|w_t), P(w_{t−1}|w_t), P(w_{t+1}|w_t), P(w_{t+2}|w_t)]


3 Prediction based Word representation

Skip Gram
Predict context (“outside”) words (position independent) given center word
Sentence: “Sydney is the state capital of NSW”

[Figure: skip-gram architecture — the one-hot vector of the center word “state” enters the input layer, and the output layer predicts the context words “is”, “the”, “capital”, “of”]
3 Prediction based Word representation

Skip Gram – Neural Network Architecture


Predict context (“outside”) words (position independent) given center word
Summary of Skip Gram Training (Review your understanding with equations)

1. Initialise the centre word in a one-hot vector form.


𝒙𝒌 = [0,...,0,1,0,...,0]
𝒙 ∈ ℝ|𝑽|

2. Has two Parameter Matrices:


1) Parameter Matrix (from Input Layer to Hidden/Projection Layer)
𝐖 ∈ ℝ𝑉x𝑁
2) Parameter Matrix (to Output Layer)
𝐖′ ∈ ℝ𝑁x𝑉
3 Prediction based Word representation

Skip Gram – Neural Network Architecture


Predict context (“outside”) words (position independent) given center word
Summary of Skip Gram Training (Review your understanding with equations)

3. Initial words are represented in one-hot vector form, so multiplying a
one-hot vector by W (V×N) gives a 1×N (embedded word) vector.

e.g. [0 1 0 0] × [[10, 2, 18], [15, 22, 3], [25, 11, 19], [4, 7, 22]] = [15, 22, 3]

𝒗𝒄 = 𝐖𝒙 ∈ ℝ𝒏 (as there is only one input)

4. Calculate the score value for the output layer by


multiplying the parameter matrix W’
𝒛 = 𝐖′𝒗𝒄
3 Prediction based Word representation

Skip Gram – Neural Network Architecture


Predict context (“outside”) words (position independent) given center word
Summary of Skip Gram Training (Review your understanding with equations)

5. Calculate the probability using softmax


𝑦ො = 𝑠𝑜𝑓𝑡𝑚𝑎𝑥 (𝒛)

6. Calculate 2m probabilities as we need to predict 2m


context words.
𝑦ො𝒄−𝒎 , … , 𝑦ො𝒄−𝟏 , 𝑦ො𝒄+𝟏 , … , 𝑦ො𝒄+𝒎

and compare with the ground truth (one-hot vector)


𝑦 (𝑐−𝑚) , … , 𝑦 (𝑐−1) , 𝑦 (𝑐+1) , … , 𝑦 (𝑐+𝑚)
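
(Added illustration, not from the slides.) A matching sketch for skip-gram with the same assumed dimensions: the center word is embedded once, and the single predicted distribution is compared against each of the 2m context words.

import numpy as np

V, N = 7, 5                              # vocab size and embedding dim (assumed)
rng = np.random.default_rng(0)
W  = rng.normal(scale=0.1, size=(V, N))  # input layer -> projection layer
W2 = rng.normal(scale=0.1, size=(N, V))  # projection layer -> output layer (W')

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

center_id = 3                            # "state"
context_ids = [1, 2, 4, 5]               # "is", "the", "capital", "of"

v_c = W[center_id]                       # step 3: embed the center word
y_hat = softmax(v_c @ W2)                # steps 4-5: scores, then softmax
loss = -sum(np.log(y_hat[o]) for o in context_ids)   # step 6+: one loss term per context word
print(loss)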
3 Prediction based Word representation

Skip Gram – Neural Network Architecture


Predict context (“outside”) words (position independent) given center word
Summary of Skip Gram Training (Review your understanding with equations)
8. As in CBOW, use an objective function to evaluate the model. A key
difference is that we invoke a Naïve Bayes assumption to factorise the
probabilities. It is a strong conditional independence assumption: given the
center word, all output words are completely independent.
3 Prediction based Word representation

Skip Gram – Neural Network Architecture


Predict context (“outside”) words (position independent) given center word
Summary of Skip Gram Training (Review your understanding with equations)


8-1. With this objective function, we can compute the


gradients with respect to the unknown parameters
and at each iteration update them via Stochastic
Gradient Descent

*Stochastic Gradient Descent will be covered in more detail in Lecture 3.
3 Prediction based Word representation

CBOW vs Skip Gram Overview

CBOW: predict the center word from (a bag of) context words
Skip-gram: predict context words given the center word
3 Prediction based Word representation

Key Parameter (1) for Training methods: Window Size


Different tasks are served better by different window sizes.

Smaller window sizes (2–15) lead to embeddings where a high similarity score
between two embeddings indicates that the words are interchangeable.

Larger window sizes (15–50, or even more) lead to embeddings where
similarity is more indicative of the relatedness of the words.
3 Prediction based Word representation

Key Parameter (2) for Training methods: Negative Samples


The number of negative samples is another factor of the training process.
Negative samples are samples of words that are not neighbours of the input word, added to our dataset.

Negative samples: 2
Input word | Output word | Target
eat | mango | 1
eat | exam | 0
eat | tobacco | 0

Negative samples: 5
Input word | Output word | Target
eat | mango | 1
eat | exam | 0
eat | tobacco | 0
eat | pool | 0
eat | supervisor | 0

*1 = appeared (a true neighbour), 0 = did not appear

The original paper prescribes 5-20 as being a good number of negative
samples. It also states that 2-5 seems to be enough when you have a large
enough dataset.
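
(Added illustration, not from the slides.) A sketch of how a table like the one above can be generated; real word2vec implementations draw negatives from a smoothed unigram distribution rather than uniformly, and the vocabulary here is invented.

import random

vocab = ["mango", "exam", "tobacco", "pool", "supervisor", "rice", "lecture"]

def make_samples(input_word, true_context, k, seed=0):
    # One positive (target 1) pair plus k negative (target 0) pairs.
    rng = random.Random(seed)
    negatives = rng.sample([w for w in vocab if w != true_context], k)
    rows = [(input_word, true_context, 1)]
    rows += [(input_word, w, 0) for w in negatives]
    return rows

for row in make_samples("eat", "mango", k=5):
    print(row)   # (input word, output word, target)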
3 Prediction based Word representation

Word2Vec Overview
Word2vec (Mikolov et al. 2013) is a framework for learning word vectors

Idea:

• Have a large corpus of text

• Every word in a fixed vocabulary is represented by a vector


• Go through each position t in the text, which has a center word c and context
(“outside”) words o
• Use the similarity of the word vectors for c and o to calculate the probability of
o given c (or vice versa)
• Keep adjusting the word vectors to maximize this probability
3 Prediction based Word representation

Let’s try some Word2Vec!

Gensim: https://radimrehurek.com/gensim/models/word2vec.html
Resources: https://wit3.fbk.eu/
https://github.com/3Top/word2vec-api#where-to-get-a-pretrained-models
3 Prediction based Word representation

Limitation of Word2Vec
Issue#1: Cannot cover the morphological similarity
• Word2vec represents every word as an independent vector, even though
many words are morphologically similar, like: teach, teacher, teaching

Issue#2: Hard to conduct embedding for rare words


• Word2vec is based on the distributional hypothesis. It works well for
frequent words but does not embed rare words well.
(the same concept as under-fitting in machine learning)


Issue#3: Cannot handle the Out-of-Vocabulary (OOV)


• Word2vec does not work at all if the word is not included in the vocabulary
3 Prediction based Word representation

FastText
• Deals with these Word2Vec limitations
• Another way to transform words into vectors

• FastText is a library for learning word embeddings and text classification
created by Facebook's AI Research lab. The model allows creating an
unsupervised or supervised learning algorithm for obtaining vector
representations of words.

• Extension to Word2Vec
• Instead of feeding individual words into the Neural Network, FastText breaks words
into several n-grams (sub-words)

https://fasttext.cc/
3 Prediction based Word representation


FastText with N-gram Embeddings


• N-grams are simply all combinations of adjacent words or letters of length n
that you can find in your source text. For example, given the word apple, all 2-
grams (or “bigrams”) are ap, pp, pl, and le

• The tri-grams (n=3) for the word apple are app, ppl, and ple (ignoring the
starting and ending boundaries of words). The word embedding vector for
apple will be the sum of the vectors of all these n-grams.

apple → 2-grams: ap, pp, pl, le;  3-grams: app, ppl, ple

• After training the Neural Network (either with skip-gram or CBOW), we will
have word embeddings for all the n-grams given the training dataset.

• Rare words can now be properly represented since it is highly likely that some
of their n-grams also appears in other words.
https://fasttext.cc/
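
(Added illustration, not from the slides.) The sub-word decomposition is easy to sketch; note that the real FastText also wraps each word in "<" and ">" boundary symbols before extracting n-grams, which is ignored here as in the slide.

def char_ngrams(word, n):
    # All character n-grams of `word`, ignoring word boundaries.
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("apple", 2))   # ['ap', 'pp', 'pl', 'le']
print(char_ngrams("apple", 3))   # ['app', 'ppl', 'ple']

# The FastText vector for "apple" is (conceptually) the sum of the vectors
# of its n-grams, which is why rare or unseen words can still be represented.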
3 Prediction based Word representation

Word2Vec VS FastText
Find synonyms with Word2Vec

import pprint
from gensim.models import Word2Vec

# `result` is assumed to be a pre-tokenised corpus: a list of lists of words.
# Note: gensim 4.x renamed the `size` parameter to `vector_size`.
cbow_model = Word2Vec(sentences=result, size=100, window=5, min_count=5, workers=4, sg=0)

# Raises a KeyError if "electrofishing" is not in the vocabulary (the OOV limitation).
a = cbow_model.wv.most_similar("electrofishing")
pprint.pprint(a)

Find synonyms with FastText

import pprint
from gensim.models import FastText

FT_model = FastText(sentences=result, size=100, window=5, min_count=5, workers=4, sg=0)

# FastText can build a vector for "electrofishing" from its sub-word n-grams,
# so this works even when the word itself is rare or unseen.
a = FT_model.wv.most_similar("electrofishing")
pprint.pprint(a)

https://fasttext.cc/
3 Prediction based Word representation

Global Vectors (GloVe)


• Deals with this Word2Vec limitation

“Methods like skip-gram may do better on the analogy task, but they poorly utilize
the statistics of the corpus since they train on separate local context windows
instead of on global co-occurrence counts.”
(Pennington et al., 2014)

• Focus on global co-occurrence counts

https://nlp.stanford.edu/projects/glove/
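
(Added illustration, not from the slides.) A small sketch of what "global co-occurrence counts" means: a word–word co-occurrence matrix accumulated over the whole corpus with a symmetric window; GloVe then fits word vectors to these (weighted, logged) counts rather than to individual local windows. The two sentences are invented.

from collections import defaultdict

corpus = [
    "sydney is the state capital of nsw",
    "canberra is the capital of australia",
]

window = 2
cooc = defaultdict(float)
for sentence in corpus:
    tokens = sentence.split()
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                cooc[(w, tokens[j])] += 1.0   # accumulated globally over the corpus

print(cooc[("capital", "of")])   # how often "capital" and "of" co-occur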
3 Prediction based Word representation

Limitation of Prediction based Word Representation

• “I like ___” — apple / banana / fruit

• The training dataset is reflected in the resulting word representations

• The similarity neighbourhood of the word ‘software’ learned from the Google News
corpus can be different from the one learned from Twitter.

https://nlp.stanford.edu/projects/glove/
4 NEXT WEEK PREVIEW…

Word Embeddings
• Finalisation!

Machine Learning/ Deep Learning for Natural Language Processing


Reference

Reference for this lecture


• Deng, L., & Liu, Y. (Eds.). (2018). Deep Learning in Natural Language Processing. Springer.
• Rao, D., & McMahan, B. (2019). Natural Language Processing with PyTorch: Build Intelligent Language
Applications Using Deep Learning. O'Reilly Media, Inc.
• Manning, C. D., Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language
processing. MIT press.
• Manning, C 2017, Introduction and Word Vectors, Natural Language Processing with Deep Learning,
lecture notes, Stanford University
• Images: http://jalammar.github.io/illustrated-word2vec/
Word2vec
• Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in
vector space. arXiv preprint arXiv:1301.3781.
• Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of
words and phrases and their compositionality. In Advances in neural information processing systems (pp.
3111-3119).

FastText
• Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword
information. Transactions of the Association for Computational Linguistics, 5, 135-146.
• Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., & Joulin, A. (2017). Advances in pre-training
distributed word representations. arXiv preprint arXiv:1712.09405.
