COMP5046: Natural Language Processing
1. Lab Info
2. Previous Lecture Review
   1. Word Meaning and WordNet
   2. Count-based Word Representation
Submissions
How to Submit
Students should submit the .ipynb file (download it via “File” > “Download .ipynb”) to Canvas.
1. Lab Info
2. Count-based Word Representation
   1. Word Meaning
   2. Limitations
“Computer” “Apple”
2 COUNT based WORD REPRESENTATION
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 … 0]
Inn = [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 … 1]
TF-IDF weighting:
\[
w_{i,j} = \mathrm{tf}_{i,j} \times \log\!\left(\frac{N}{1 + \mathrm{df}_i}\right)
\]
where w_{i,j} = weight of term i in document j, tf_{i,j} = number of occurrences of term i in document j, df_i = number of documents containing term i, and N = total number of documents.
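A minimal sketch of this weighting in Python; the toy corpus, documents, and terms below are invented for illustration and are not from the slides.

import math
from collections import Counter

# Toy corpus: each document is a list of tokens (illustrative only).
docs = [
    ["hotel", "room", "cheap", "hotel"],
    ["inn", "room", "quiet"],
    ["hotel", "booking"],
    ["flight", "booking"],
]

N = len(docs)                                            # total number of documents
df = Counter(term for doc in docs for term in set(doc))  # document frequency of each term

def tfidf(term, doc):
    """w_{i,j} = tf_{i,j} * log(N / (1 + df_i))"""
    tf = doc.count(term)
    return tf * math.log(N / (1 + df[term]))

print(tfidf("hotel", docs[0]))  # frequent in doc 0 but also appears elsewhere
print(tfidf("cheap", docs[0]))  # appears in only one document -> higher idf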
Sparse Representation
With count-based word representations (especially one-hot vectors), linguistic information is represented by sparse, high-dimensional feature vectors.
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 … 0]
Inn = [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 … 1]
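To make the limitation concrete, here is a small sketch (the vocabulary size and word indices are made up): any two different one-hot vectors have zero dot product, so "hotel" and "inn" look completely unrelated.

import numpy as np

V = 10000                           # assumed vocabulary size (illustrative)
hotel = np.zeros(V); hotel[7] = 1   # index 7 chosen arbitrarily
inn   = np.zeros(V); inn[9999] = 1  # index 9999 chosen arbitrarily

# The dot product (and hence cosine similarity) of two different one-hot
# vectors is always 0 -- the representation encodes no notion of similarity.
print(hotel @ inn)                  # 0.0
print(np.count_nonzero(hotel))      # only 1 non-zero entry out of 10,000 dimensions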
1. Lab Info
2. Previous Lecture Review
   1. Word Meaning and WordNet
   2. Count-based Word Representation
3. Prediction-based Word Representation
   1. Word Embedding
   2. Word2Vec
   3. FastText
   4. GloVe
4. Next Week Preview
Reference: http://turbomaze.github.io/word2vecjson/
3 Prediction based Word representation
[Figure: Jane's personality-test scores plotted on "Openness" and "Agreeableness" axes (scale 0–100), e.g. Openness ≈ 40, Agreeableness ≈ 70]
3 Prediction based Word representation
Assume that you are taking a personality test (the Big Five Personality Traits test)
1)Openness, 2)Agreeableness, 3)Conscientiousness, 4)Negative emotionality, 5)Extraversion
Cosine Similarity: a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them.
[Figure: Jane, Eve, and Mark plotted on "Openness" and "Agreeableness" axes, with scores normalised to 0–1]
3 Prediction based Word representation
cos(Jane, Mark) ≈ 0.89
cos(Jane, Eve) ≈ 0.99
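A small sketch of the cosine-similarity computation; the 2-D trait vectors below are invented for illustration and only roughly reproduce the numbers on the slide.

import numpy as np

def cosine(a, b):
    """cos(a, b) = (a . b) / (|a| * |b|)"""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical (Openness, Agreeableness) scores normalised to 0-1.
jane = np.array([0.40, 0.70])
eve  = np.array([0.50, 0.80])
mark = np.array([0.80, 0.40])

print(cosine(jane, eve))   # close to 1 -> very similar personalities
print(cosine(jane, mark))  # noticeably smaller -> less similar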
3 Prediction based Word representation
With dense vectors it is easy to compute the similarity between words.
[Figure: 2-D visualisation of word vectors for "king", "queen", "man", "woman" and "water" — semantically related words lie close together]
3 Prediction based Word representation
Word Algebra
[Figure: vector arithmetic with word embeddings — king − man + woman ≈ queen]
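A sketch of this word algebra using Gensim's pretrained vectors; the model name below is just one readily available option from the gensim-data downloader, and any word2vec-format vectors would work the same way.

import gensim.downloader as api

# Load pretrained vectors (large download; name is an assumption, not from the slides).
wv = api.load("word2vec-google-news-300")

# king - man + woman ~= queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))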
How can we build dense word vector representations?
"You shall know a word by the company it keeps." (J. R. Firth)
• Use the many contexts in which a word w appears to build up a representation of w
Machine Learning
A brief look at machine learning: how would you classify this with a machine? (Object: CAT)
[Figure: labelled data and their results (Image 1: Dog, Image 2: Cat, Image 3: Dog, Image 4: Cat, Image 5: Dog, …) are fed into the computer system for training, and the system learns a pattern]
Perceptron
[Figure: an artificial neuron (perceptron) — inputs are multiplied by weights and combined to produce an output]
Detailed neural network and deep learning concepts will be covered in Lecture 3.
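As a concrete sketch of the perceptron idea above; the weights, bias, and inputs below are arbitrary illustrative numbers.

import numpy as np

def perceptron(x, w, b):
    """A single artificial neuron: weighted sum of inputs plus bias,
    passed through a step activation."""
    return 1 if np.dot(w, x) + b > 0 else 0

x = np.array([1.0, 0.0, 1.0])   # example inputs (illustrative)
w = np.array([0.4, -0.2, 0.6])  # example weights (illustrative)
b = -0.5                        # example bias (illustrative)

print(perceptron(x, w, b))      # 0.4 + 0.6 - 0.5 = 0.5 > 0 -> outputs 1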
3 Prediction based Word representation
Word2Vec
Word2vec can utilize either of two model architectures to produce a distributed representation of words:
1. Continuous Bag-of-Words (CBOW): predict the center word given its context ("outside") words
2. Continuous Skip-gram: predict the context ("outside") words given the center word
3 Prediction based Word representation
CBOW: predict the center word from the context words
Sentence: "Sydney is the state capital of NSW"
Each word is represented as a one-hot vector over the 7-word vocabulary, e.g. Sydney = [1,0,0,0,0,0,0], is = [0,1,0,0,0,0,0], the = [0,0,1,0,0,0,0], state = [0,0,0,1,0,0,0], …
[Figure: CBOW input layer — the one-hot vectors of the context words "is", "the", "capital" and "of" are fed in to predict the center word "state"]
3 Prediction based Word representation
The projection (hidden) vector is the average of the context word vectors:
\[
\hat{v} = \frac{v_{is} + v_{the} + v_{capital} + v_{of}}{2m}
\]
where 2m is the window size (the number of context words).
3 Prediction based Word representation
\[
z = W'_{N \times V} \, \hat{v}
\]
Loss Function (Cross Entropy): cross entropy can be used as the loss function when optimizing classification.
3 Prediction based Word representation
BACK PROPAGATION
3 Prediction based Word representation
e.g. multiplying a one-hot vector by a weight matrix simply selects the corresponding row:
\[
\begin{bmatrix} 0 & 1 & 0 & 0 \end{bmatrix} \times
\begin{bmatrix} 10 & 2 & 18 \\ 15 & 22 & 3 \\ 25 & 11 & 19 \\ 4 & 7 & 22 \end{bmatrix}
= \begin{bmatrix} 15 & 22 & 3 \end{bmatrix}
\]
\[
H(\hat{y}, y) = -\sum_{j=1}^{|V|} y_j \log(\hat{y}_j)
\]
* Focus on minimising this value (a small worked sketch follows below)
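A compact numerical sketch of the CBOW forward pass and cross-entropy loss described above; the embedding size and weight values are made up for illustration.

import numpy as np

vocab = ["Sydney", "is", "the", "state", "capital", "of", "NSW"]
V, N = len(vocab), 3                      # vocabulary size, embedding size

rng = np.random.default_rng(0)
W_in  = rng.normal(size=(V, N))           # input embeddings  (V x N)
W_out = rng.normal(size=(N, V))           # output weights    (N x V)

def one_hot(word):
    v = np.zeros(V)
    v[vocab.index(word)] = 1.0
    return v

context = ["is", "the", "capital", "of"]  # 2m = 4 context words
target  = "state"                          # center word to predict

# v_hat: average of the context word vectors (one-hot @ W_in = row lookup)
v_hat = np.mean([one_hot(w) @ W_in for w in context], axis=0)

z = v_hat @ W_out                          # scores over the vocabulary
y_hat = np.exp(z) / np.exp(z).sum()        # softmax probabilities

y = one_hot(target)
loss = -np.sum(y * np.log(y_hat))          # cross entropy H(y_hat, y)
print(loss)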
Skip Gram
Predict context ("outside") words (position independent) given the center word
Sentence: "Sydney is the state capital of NSW"
For each center word w_t the model predicts P(w_{t-2} | w_t), P(w_{t-1} | w_t), P(w_{t+1} | w_t), P(w_{t+2} | w_t).
[Figure: skip-gram output layer — the one-hot vectors of the context words, e.g. "is" and "of", are predicted from the center word]
3 Prediction based Word representation
As before, multiplying the one-hot center-word vector by the weight matrix selects its embedding row, e.g. [0 1 0 0] × W = [15 22 3].
With this objective function we can compute gradients with respect to the unknown parameters and update them by stochastic gradient descent at each iteration.
* Stochastic Gradient Descent will be covered in detail in Lecture 3 (see the sketch of the skip-gram probability below).
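To make the skip-gram objective concrete, here is a minimal sketch (vector values and indices are invented) of how P(o | c) is computed from the center-word and context ("outside") word vectors via a softmax over the vocabulary.

import numpy as np

V, N = 7, 3                                # vocabulary size, embedding size
rng = np.random.default_rng(1)
v = rng.normal(size=(V, N))                # center-word vectors
u = rng.normal(size=(V, N))                # context ("outside") word vectors

def p_outside_given_center(o, c):
    """P(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c)"""
    scores = u @ v[c]                      # dot product with every outside vector
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]

# Negative log-likelihood of one (center, outside) pair; training sums this
# over every position and every word in the window, then minimises it by SGD.
loss = -np.log(p_outside_given_center(o=2, c=3))
print(loss)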
3 Prediction based Word representation
Key parameters of the training method (1): window size (CBOW / Skip-gram)
• Smaller window sizes (2-15) lead to embeddings where a high similarity score between two embeddings indicates that the words are interchangeable.
• Larger window sizes (15-50, or even more) lead to embeddings where a high similarity score is more indicative of the relatedness of the words.
3 Prediction based Word representation
Key parameters of the training method (2): negative samples
• The number of negative samples is another factor of the training process.
• Negative samples are words sampled from our dataset that are not neighbours of the center word.
• The original paper prescribes 5-20 as a good number of negative samples. It also states that 2-5 seems to be enough when you have a large enough dataset. (See the Gensim sketch below.)
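A sketch of how these two hyper-parameters are exposed in Gensim's Word2Vec; the toy corpus is invented, and the parameter names follow Gensim 4.x (vector_size rather than the older size).

from gensim.models import Word2Vec

# Tiny illustrative corpus: a list of tokenised sentences.
sentences = [
    ["sydney", "is", "the", "state", "capital", "of", "nsw"],
    ["canberra", "is", "the", "capital", "of", "australia"],
]

model = Word2Vec(
    sentences=sentences,
    vector_size=100,   # embedding dimensionality N
    window=5,          # context window size (smaller -> interchangeability)
    negative=5,        # number of negative samples per positive example
    min_count=1,       # keep every word in this tiny corpus
    sg=1,              # 1 = skip-gram, 0 = CBOW
    workers=1,
)

print(model.wv.most_similar("capital", topn=3))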
3 Prediction based Word representation
Word2Vec Overview
Word2vec (Mikolov et al. 2013) is a framework for learning word vectors
Idea:
• We have a large corpus of text
• Every word in a fixed vocabulary is represented by a vector
• Go through each position t in the text, which has a center word c and context ("outside") words o
• Use the similarity of the word vectors for c and o to calculate the probability of o given c (or vice versa)
Gensim: https://radimrehurek.com/gensim/models/word2vec.html
Resources: https://wit3.fbk.eu/
https://github.com/3Top/word2vec-api#where-to-get-a-pretrained-models
3 Prediction based Word representation
Limitation of Word2Vec
Issue #1: Cannot cover morphological similarity
• Word2vec represents every word as an independent vector, even though many words are morphologically similar, like: teach, teacher, teaching
Issue #2: Hard to embed rare words
• Word2vec is based on the distributional hypothesis: it works well for frequent words but cannot embed rare words well (the same concept as underfitting in machine learning).
Issue #3: Cannot handle out-of-vocabulary (OOV) words
• Word2vec does not work if a word is not included in the vocabulary.
3 Prediction based Word representation
FastText
• Deals with these Word2Vec limitations
• Another way to transform WORDS into VECTORS
• An extension of Word2Vec
• Instead of feeding individual words into the neural network, FastText breaks words into several character n-grams (sub-words)
https://fasttext.cc/
3 Prediction based Word representation
FastText with n-gram embeddings
• The tri-grams (n=3) for the word "apple" are app, ppl, and ple (ignoring the starting and ending boundaries of the word). The word embedding vector for "apple" will be the sum of all these n-grams.
• After training the neural network (either with skip-gram or CBOW), we will have word embeddings for all the n-grams in the training dataset.
• Rare words can now be properly represented since it is highly likely that some of their n-grams also appear in other words. (See the sketch below.)
https://fasttext.cc/
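A minimal sketch of the sub-word idea; this is illustrative helper code, not FastText's actual implementation (the real model adds '<' and '>' word-boundary markers and uses n-grams of length 3-6 by default).

def char_ngrams(word, n=3):
    """Return the character n-grams of a word (no boundary markers here)."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("apple"))    # ['app', 'ppl', 'ple']

# With boundary markers, closer to what FastText actually stores:
print(char_ngrams("<apple>"))  # ['<ap', 'app', 'ppl', 'ple', 'le>']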
3 Prediction based Word representation
Word2Vec VS FastText
Find synonyms with Word2vec
import pprint
from gensim.models import Word2Vec

# "result" is the tokenised corpus prepared earlier in the lab
# (in Gensim >= 4.0 the "size" argument is called "vector_size")
cbow_model = Word2Vec(sentences=result, size=100, window=5, min_count=5, workers=4, sg=0)
# Word2vec raises a KeyError for a word missing from its vocabulary
a = cbow_model.wv.most_similar("electrofishing")
pprint.pprint(a)

Find synonyms with FastText
a = FT_model.wv.most_similar("electrofishing")
pprint.pprint(a)
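FT_model is not defined on the slide; a minimal sketch of how such a Gensim FastText model could be trained on the same corpus (parameter names follow Gensim 4.x), assuming "result" is the tokenised corpus used above:

from gensim.models import FastText

FT_model = FastText(
    sentences=result,
    vector_size=100,   # embedding dimensionality
    window=5,
    min_count=5,
    workers=4,
    sg=1,              # skip-gram
    min_n=3, max_n=6,  # character n-gram lengths used for sub-words
)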
https://fasttext.cc/
3 Prediction based Word representation
GloVe
“Methods like skip-gram may do better on the analogy task, but they poorly utilize the statistics of the corpus since they train on separate local context windows instead of on global co-occurrence counts.”
https://nlp.stanford.edu/projects/glove/
3 Prediction based Word representation
[Figure: co-occurrence counts for a toy corpus built from the words "I", "like", "apple", "banana", "fruit" — GloVe trains on such global co-occurrence statistics; a small sketch follows below]
https://nlp.stanford.edu/projects/glove/
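A small sketch of building the kind of global co-occurrence counts GloVe trains on; the toy corpus and window size below are invented for illustration.

from collections import defaultdict

# Toy corpus (invented): tokenised sentences.
corpus = [["i", "like", "apple"], ["i", "like", "banana"], ["i", "like", "fruit"]]
window = 1  # symmetric context window size

cooc = defaultdict(int)
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                cooc[(w, sent[j])] += 1

print(cooc[("like", "apple")])  # 1
print(cooc[("i", "like")])      # 3 -- "i" and "like" co-occur in every sentence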
4 NEXT WEEK PREVIEW…
Word Embeddings
• Finalisation!
FastText
• Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135-146.
• Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., & Joulin, A. (2017). Advances in pre-training distributed word representations. arXiv preprint arXiv:1712.09405.