
7.8 WORD EMBEDDING

The above discussions were more concerned with document embedding. When we discussed LSI or LSA, which use singular value decomposition (SVD), the term-document matrix was decomposed into three smaller matrices. The first matrix corresponds to document embedding in topic space, the second matrix corresponds to topic interaction and the transpose of the third matrix corresponds to word embedding in topic space. In this section, we are going to focus on word embedding.

7.8.1 One-Hot Encoding


One simple technique for word embedding is one-hot encoding. The idea is very simple. Let us say we have four words in our vocabulary: Apple, Bus, Orange and Truck. To represent each of the words, we use binary vectors like 1000, 0100, 0010 and 0001. So, if we have n words in the vocabulary, all the vectors have a length n and exactly one of the bits will be on, or 1, and the rest will be off, or 0. The words are sorted. The '1' bit is often called hot and, for each of the words, only one bit is hot, hence the name one-hot encoding. This technique is indeed very simple but has several issues.
1. The length of the vector is very large. A corpus can easily have 50,000 distinct words in the vocabulary, and as a result, the vector will have 50,000 dimensions, or a bit string of length 50,000.
2. The second issue is that the vectors are very sparse; there is only 1 bit worth of information among 50,000 bits.
3. Third, the semantics or relationship between words is lost. Looking at the words, we would expect some sort of similarity between the words 'apple' and 'orange' or the words 'bus' and 'truck'. Yet, if we calculate the Euclidean distance between any pair of these vectors, it will always be √2.
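To make the last point concrete, here is a minimal NumPy sketch using the four-word toy vocabulary above; the helper name one_hot is ours, not a library function.

    import numpy as np

    # A minimal sketch of one-hot encoding for the toy vocabulary used above.
    vocab = sorted(["Apple", "Bus", "Orange", "Truck"])   # words are sorted
    index = {word: i for i, word in enumerate(vocab)}     # word -> position

    def one_hot(word):
        """Return a vector of length |V| with a single 'hot' bit."""
        vec = np.zeros(len(vocab))
        vec[index[word]] = 1.0
        return vec

    print(one_hot("Apple"))                                       # [1. 0. 0. 0.]
    # Every pair of distinct words is exactly sqrt(2) apart, so no
    # similarity between 'Apple' and 'Orange' is captured.
    print(np.linalg.norm(one_hot("Apple") - one_hot("Orange")))   # 1.414...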
The next idea was representing a word in terms of the co-occurrence matrix. Another idea
that was explored was to model this task of word representation as a prediction task.
7.8.2 Language Model

The conventional approach of the language model was inspecting a corpus and coming up with the various probabilities of the words. The objective of the model is to assign a probability to a sentence. The higher the probability, the higher is the validity of the sentence. So, if we take the two sequences of words 'Statistics is essential for data science' and 'Statistics key skill for success', the first sequence will have a much higher probability. The language model just wants to achieve that.
Let us take a sequence of words A, B, C, D. So, we are trying to find out P(A, B, C, D), or the joint probability of A, B, C and D. We can find this out by using the chain rule of probability:

    P(A, B, C, D) = P(A) × P(B|A) × P(C|A,B) × P(D|A,B,C)

Using the Markov assumption, that a word's probability is affected only by the previous word, the above equation can be simplified as

    P(A, B, C, D) = P(A) × P(B|A) × P(C|B) × P(D|C)

Hence, one of the basic tasks of the language model is to predict the next word. If we are simply predicting P(Wt+1), this is the unigram model; when we compute P(Wt+1 | Wt), it is a bigram model; and similarly, P(Wt+1 | Wt, Wt-1) is a trigram model. Essentially, the words that we are using to predict the next word can be thought of as the input. Now, this can be thought of as a prediction problem: given a word as an input, we need to predict the next word. It can be thought of as a multi-class classification problem, where the number of possible classes is the number of words in the vocabulary. Like any representation learning task, given a corpus of text, when we train a neural network to correctly learn the parameters to accurately predict the next word, it will, in turn, also learn some representation of the words.
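As an illustration of the counting approach described above, here is a minimal sketch of a bigram model estimated from a two-sentence toy corpus; the corpus and the helper p_next are assumptions made purely for the example.

    from collections import Counter, defaultdict

    # A minimal sketch of a count-based bigram language model (toy corpus assumed).
    corpus = [
        "statistics is essential for data science",
        "statistics is a key skill for success",
    ]

    unigrams, bigrams = Counter(), defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split()
        for prev, curr in zip(tokens, tokens[1:]):
            unigrams[prev] += 1          # count of the conditioning word
            bigrams[prev][curr] += 1     # count of the (prev, curr) pair

    def p_next(prev, curr):
        """P(curr | prev) estimated by relative frequency."""
        return bigrams[prev][curr] / unigrams[prev]

    print(p_next("statistics", "is"))   # 1.0 in this tiny corpus
    print(p_next("for", "data"))        # 0.5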
Some of the desired characteristics of such a representation are as follows.
1. A much denser representation compared to Bag of Words or one-hot encoding.
2. A better semantic meaning; as an example, the representation should be able to identify that the relationship between France and Paris and between UK and London is the same. A similar example is the relationship between Pig and Piglet, and Duck and Duckling.
The earlier statistics-based models have a few problems. As the task is to learn the parameters of the language model, even if we try to generate a 10-word sequence from, let us say, a 10,000-word dictionary, the number of possible sequences is 10 to the power 40 (each of the word positions can be filled in 10,000 ways, so 10,000 multiplied by itself 10 times), resulting in an equivalent number of parameters to estimate. This problem was called the curse of dimensionality by Yoshua Bengio, who proposed neural network-based models to discover a denser representation of words. The model proposed by him was called the feedforward neural net language model (FFNLM).
Another important point discussed was about unknown or new words, which are quite difficult to handle in the traditional language model. Suppose, in our corpus, we have a sentence as follows: 'The cat was playing with its tail'. Let us further assume that the corpus does not have the word 'dog'. The statistics-based language model cannot understand that 'cat' and 'dog' are similar and that 'The dog was playing with its tail' is also a valid and probable sentence. Some of the original examples of similar sentences from the paper are as follows:


1. The cat is walking in the bedroom.
2. A dog was running in the room.
3. A dog is walking in a room.

Hence, the objectives of the model are two-fold: to estimate the probability of the next word and to learn a better dense representation of the words.

It may be noted that this representation is called a distributed representation. A non-distributed representation is like spending a storage unit for storing each entity. One-hot encoding is an example of such a representation, which is also called a localist representation. Distributed representations try to create different dimensions for the entities, as a result need less storage and can capture similarity better.

In this paper, n - 1 previous words are used as the input vector to predict the probability of the nth word.

FIGURE 7.22 Neural Language Model (word indices W1, W2, W3 are looked up in the shared matrix C to give C(W1), C(W2), C(W3), combined through a tanh hidden layer and passed to a softmax output layer; the matrix C acts as a parameter)

7.8.2.1 Word-Embedding Layer

In the first step, the index of each word is passed. The index gives the lexicographical position of the word in the vocabulary. Then a shared parameter C is used, which is of the dimension |V| x m; |V| gives the size of the vocabulary and m the size of the dense representation. Based on the index, the dense representation of the word is looked up. This is represented as C(Wi), corresponding to the word Wi. These embedded representations are combined and taken through a hidden layer as well as directly to the output layer. The output is again a |V| x 1 vector. The loss function is calculated like a regular neural network, between the desired and the actual output, and the error is backpropagated. The matrix C is given the name of the embedding matrix. One of the challenges of this model was the high computational cost associated with the softmax layer, which the later models try to solve.

Tomas Mikolov et al. proposed two ways of learning word representations using neural models, which are significantly more efficient as compared to the neural language model (NLM).

7.8.3 Continuous Bag of Words

This is a very simple linear model, where the authors used the terms 'context words' and 'target word'. It was introduced by Tomas Mikolov et al. in 2013.

Let us take an example: 'Decision tree models are popular because these are white-box models'. We try to predict the target word, which is a central word, by its surrounding words. The number of surrounding words that will be considered is determined by the window size. If we assume a window size of 1, then some of the following target and context words will be generated.

    Context              Target
    Decision, models     tree
    tree, are            models
    models, popular      are
    are, because         popular

The solution is worked out as follows (a small sketch of these steps is given after the list):

1. Input words (context) are taken as one-hot encoded vectors.
2. These get multiplied by the embedding matrix to get the dense representation of the context words.
3. These embeddings are aggregated.
4. The aggregation of embeddings is put across a softmax layer, to come up with the target word.
5. This is matched with the expected target word and, in case of an error, the loss is backpropagated to update the embedding matrix.
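The following is a minimal NumPy sketch of a single CBoW step following the five points above; the vocabulary size, embedding dimension and random initialization are arbitrary assumptions, and the backward pass is omitted.

    import numpy as np

    # A minimal sketch of a single CBoW forward pass (toy sizes assumed).
    rng = np.random.default_rng(0)
    V, k = 7, 4                      # vocabulary size, embedding dimension
    E = rng.normal(size=(V, k))      # embedding matrix (input side)
    U = rng.normal(size=(k, V))      # output-layer weights

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    context_ids = [0, 2]             # e.g. indices of 'decision' and 'models'
    target_id = 1                    # index of 'tree'

    h = E[context_ids].mean(axis=0)  # steps 1-3: look up and aggregate embeddings
    p = softmax(h @ U)               # step 4: softmax over the vocabulary
    loss = -np.log(p[target_id])     # step 5: cross-entropy against the target
    print(round(float(loss), 4))
    # Gradients of this loss would be backpropagated into U and E.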
The process is depicted in the diagram below (Fig. 7.23), with a window size of two.

FIGURE 7.23 Continuous Bag of Words Model (CBoW Model)

One of the things to notice over here is that in the neural language model, only past words were used to predict the future word, whereas in the case of CBoW, the words from both past and future, or both left and right, are considered. Hence the name Continuous Bag of Words.

7.8.4 Skipgram

The Skipgram model works just the other way round, that is, it tries to predict the context word, given the target word. Let us use the same table as CBoW; however, the input and output now change. The reference sentence is: 'Decision tree models are popular because these are white-box models'. The context words can be picked up from a specified window around the target word. As a result, a context word may not be an adjacent word of the target, hence the name Skipgram.

    Context              Target
    Decision, models     tree
    tree, are            models
    models, popular      are
    are, because         popular

This is quite similar to what is happening in CBoW. The differences are as follows.

1. The way we are using the inputs and outputs.
2. Here, based on one word, we are predicting another word, so no averaging layer is required. As this averaging is not happening, Skipgram gives a better result.

FIGURE 7.24 Skipgram Model (the target word is fed through the embedding matrix and a softmax layer predicts each context word)

Let us understand the embedding in more detail. Let us take another simple sentence: 'We like Natural Language Processing'. The words will be sorted and we should have the following encoding of the words.

    Language     1 0 0 0 0
    Like         0 1 0 0 0
    Natural      0 0 1 0 0
    Processing   0 0 0 1 0
    We           0 0 0 0 1

The embedding matrix will have V rows (as many rows as there are words in the vocabulary), and let the embedding dimension k be 2 (typically, it will be much less than V). If we multiply the one-hot encoded vector, which is of 1 x V dimension, with the embedding matrix, which is of V x k dimension, the embedding corresponding to a word will have 1 x k dimension. For example, taking the one-hot vector of 'Natural',

    [0 0 1 0 0]

and the embedding matrix

        0.2  0.4
        0.1  0.3
    X = 0.1  0.9
        0.2  0.6
        0.2  0.3

The one-hot encoding performs a lookup in the embedding matrix, to extract the word-specific embedding.
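A two-line check of this lookup, using the same one-hot vector and matrix as above (NumPy assumed):

    import numpy as np

    # A small check of the lookup above: one-hot vector times embedding matrix.
    X = np.array([[0.2, 0.4],
                  [0.1, 0.3],
                  [0.1, 0.9],
                  [0.2, 0.6],
                  [0.2, 0.3]])                    # V x k embedding matrix (V=5, k=2)
    one_hot_natural = np.array([0, 0, 1, 0, 0])   # 'Natural' is the third word

    print(one_hot_natural @ X)          # [0.1 0.9], i.e. simply row 3 of X
    # In practice no multiplication is done; the row is indexed directly:
    print(X[2])                         # [0.1 0.9]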

Both Skipgram and CBoW have softmax at the output layer, as there is a multi-class classification and the number of classes is the same as the number of words in the vocabulary. As a result, the output (softmax) layer will have k x V parameters, and the calculation of all these probabilities is computationally very expensive.

Let us summarize the neural language model, CBoW and Skipgram.

1. For all of them, the input word is one-hot encoded.
2. We use nearby words of a given word (Wt) as input and the word Wt as the output (what and how many of the nearby words are to be considered varies from one model to another). This is the prediction task of the neural network.
3. This one-hot encoded vector is used with an embedding matrix E to get the word embedding of the input words.
4. Finally, the activation function in the output layer is softmax, which predicts the target word (Wt), or the one-hot encoding of the output word. This calculation is a time-consuming one. So, let us assume the parameters of the last layer to be θ and the embedding of the word Wi to be Ei; then the probability calculation looks like the following:

       P(Wt | Wi) = exp(θt · Ei) / Σ(j = 1 to |V|) exp(θj · Ei)

   The vocabulary can easily have a few lakhs of words.
5. The loss function is calculated as −Σ(i = 1 to |V|) yi log(ŷi), where y is the one-hot encoded version of the expected output and ŷ is the actual or obtained output.

7.8.4.1 Hierarchical Softmax

This model simply uses the concept of dividing by two. The softmax is treated as a tree, with as many leaf nodes as the number of words in the vocabulary. It is best understood with the example provided in Fig. 7.25. Earlier, for each probability calculation, all the words needed to be considered. In this case, the probabilities along a path are multiplied in a chain. The calculations are shown in Fig. 7.25; the depth of the tree for a vocabulary of size |V| is of the order log2 |V|. Another trick that is often used is having frequent words near the root and less frequent words towards the leaves.

FIGURE 7.25 Hierarchical SoftMax (a binary tree over the vocabulary; a word's probability is the product of the branch probabilities along its path)

7.8.4.2 Negative Sampling

This is another way to avoid the intensive softmax computation. The input-output pairs, or the target-context pairs, that are getting generated are called positive samples. Let us go back to the sentence: 'Decision trees are popular because they are white box models'. Let us say the input is 'Decision' and the target is 'Trees'. This is called a positive example. Now, the word 'Decision' is fixed and some words are picked randomly from the corpus. Most likely, they will be unrelated. These are called negative examples.

    Input        Output    Is Positive
    Decision     Trees     1
    Decision     For       0
    Decision     Dance     0
    Decision     And       0
    Decision     Now       0
    Decision     Way       0

    (A positive example with k negative examples)
As per the paper, k negative examples are generated. The authors have recommended keeping a value of 2-5 for a large corpus and 6-20 for a smaller corpus. This trick actually converts the multiclass (softmax) classification problem into a binary classification problem (logistic).

Earlier, when the neural network was getting trained on an example, the weights corresponding to all the words were being calculated. Now, instead of this gigantic calculation over |V| words (which can easily be a few thousand), we have |V| binary classifiers, and only k + 1 of them need their weights adjusted for one example. This greatly reduces the computational complexity. It is to be noted that these two algorithms, with the fine tunings discussed, are called Word2Vec in a generic way.
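A minimal sketch of how such positive and negative pairs can be generated is given below; the uniform random sampling of negatives is a simplification, as Word2Vec actually draws them from a smoothed unigram distribution.

    import random

    # A minimal sketch of generating (target, context, label) training pairs
    # for negative sampling. Negatives are drawn uniformly here for brevity.
    random.seed(0)
    sentence = "decision trees are popular because they are white box models".split()
    vocab = list(set(sentence)) + ["dance", "now", "way"]
    k = 3                                   # negative examples per positive pair

    def training_pairs(tokens, window=1):
        pairs = []
        for i, target in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if i == j:
                    continue
                pairs.append((target, tokens[j], 1))          # positive pair
                for _ in range(k):                            # k negative pairs
                    pairs.append((target, random.choice(vocab), 0))
        return pairs

    for target, context, label in training_pairs(sentence)[:8]:
        print(target, context, label)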
7.8.5 Global Vector Representation

This is another word embedding model, proposed by Pennington et al. in 2014. The detailed mathematical explanation is quite intense; here is a simplified view of the same. The main motivations were as follows:

1. LSA and LDA types of models tried to use the global statistical properties of the corpus.
2. CBoW and SkipGram models try to utilize the context of the surrounding words, which is local.

There were no models that leverage both these types of information. Let us quickly jump to co-occurrence and the desired property of the embedding. Let us assume that the embedding of the target word is given as Et and that of the context word is given as Ec. Let us also denote the number of times they co-occur in the document as Wtc. The GloVe model assumes that the inner product of these two embeddings should be close to this count. In the original equation, bias terms are added and, instead of the direct co-occurrence count, the log of the same is used. So, for each word pair, the following expression is calculated; bt and bc are the biases corresponding to the target and the context:

    Et · Ec + bt + bc − log(Wtc)

For the loss function, this expression is squared, weighted by a function f, and summed over all such pairs, in the following manner:

    J = Σ(t, c) f(Wtc) (Et · Ec + bt + bc − log(Wtc))²

You may have noticed that a function f is being used in the above equation. Its purpose is to handle cases like Wtc = 0, and very frequent words like 'to', 'and', 'for', etc. It turns out the target and context embeddings we learn are symmetric, and an average of the two has been proposed to be used. These weights are learned using gradient descent.
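The following small NumPy sketch evaluates this weighted least-squares objective on random toy embeddings; the weighting function f(x) = (x/xmax)^0.75 capped at 1 is taken from the original GloVe paper and is an assumption beyond what is stated above.

    import numpy as np

    # A small sketch of the GloVe objective on toy embeddings.
    rng = np.random.default_rng(0)
    V, k = 6, 3
    W = rng.integers(0, 10, size=(V, V)).astype(float)    # co-occurrence counts Wtc
    E_t = rng.normal(scale=0.1, size=(V, k))              # target embeddings
    E_c = rng.normal(scale=0.1, size=(V, k))              # context embeddings
    b_t = np.zeros(V)                                     # target biases
    b_c = np.zeros(V)                                     # context biases

    def f(x, x_max=100.0, alpha=0.75):
        # Down-weights very frequent pairs; zero-count pairs are skipped below.
        return min((x / x_max) ** alpha, 1.0)

    def glove_loss():
        loss = 0.0
        for t in range(V):
            for c in range(V):
                if W[t, c] == 0:          # pairs that never co-occur contribute nothing
                    continue
                diff = E_t[t] @ E_c[c] + b_t[t] + b_c[c] - np.log(W[t, c])
                loss += f(W[t, c]) * diff ** 2
        return loss

    print(round(float(glove_loss()), 4))   # gradient descent would minimize this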
7.8.6 Character n-Gram

This idea was popularized by the open-source library FastText. As the name suggests, here the embeddings are learnt at a sub-word level; the embedding is learnt on continuous characters of the word. One popular example is the following: let us say we want to embed the word 'gearshift'. It is quite likely that this word is not present in the corpus, and a word embedding built on complete words will fail in such a case. This is not only true for the word 'gearshift' but is also true for any uncommon word. This sub-word level representation is called character n-grams. Let us take a word like 'kingdom'. Often, the word is preceded and succeeded by '<' and '>' to indicate the start and stop of the word. Let us take the value of n as 3. The possible set of character 3-grams is as follows:

    {"<ki", "kin", "ing", "ngd", "gdo", "dom", "om>"}

So, if a rare word occurs containing these sub-words, these embeddings can be leveraged. It is also to be noted that the embedding for the complete word 'kin' is going to be quite different from that of the subword 'kin'.
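A minimal sketch of extracting such character n-grams (the helper char_ngrams is ours, not part of FastText):

    # A minimal sketch of extracting character n-grams the FastText way.
    def char_ngrams(word, n=3):
        padded = "<" + word + ">"          # mark the start and end of the word
        return [padded[i:i + n] for i in range(len(padded) - n + 1)]

    print(char_ngrams("kingdom"))
    # ['<ki', 'kin', 'ing', 'ngd', 'gdo', 'dom', 'om>']
    # A rare word such as 'gearshift' can then be represented by combining the
    # embeddings of its own character n-grams.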
7.8.7 Pre-trained Models

The general practice now is to first use a word embedding layer to get the vectorized format of the words and then use them in downstream applications such as sentiment analysis or question answering. There are two issues with the same:

1. The embedding is an extra computation layer.
2. The corpus on which we are training our network to learn the embedding may have a small amount of text, resulting in not very meaningful or stable embeddings.

Also, most of the words do not change much from context to context. Hence, embeddings learnt on a general corpus like Wikipedia or the newsgroup dataset can be used for the text classification tasks. Using such an embedding is called a pre-trained embedding. Empirical studies suggest that by using pre-trained embeddings, the convergence improves, as well as task-specific performance is enhanced.

A couple of very popular pre-trained models are as follows:

1. Google's Word2Vec model trained on the Google News dataset (trained on about 100 billion words).
2. Stanford's GloVe embeddings: there are several options trained on Wikipedia, Common Crawl and Twitter datasets (trained over 6 billion to 840 billion tokens). The embeddings are also available in different sizes, from 25 to 300 dimensions.

Using a pre-trained embedding is also one form of transfer learning. This process is illustrated in Fig. 7.26.

FIGURE 7.26 Transfer Learning Using a Pre-trained Embedding (words/tokens pass through a pre-trained word embedding; the embedded vectors feed downstream NLP applications)
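As a sketch of how such embeddings are consumed, Stanford's GloVe vectors are released as plain text, one word per line followed by its vector components, and can be loaded as follows; the file name glove.6B.100d.txt is only an assumed example.

    import numpy as np

    # A minimal sketch of loading pre-trained GloVe vectors from the plain-text
    # release: each line contains a word followed by its vector components.
    def load_glove(path):
        vectors = {}
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                parts = line.rstrip().split(" ")
                vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
        return vectors

    # embeddings = load_glove("glove.6B.100d.txt")   # assumed file name
    # print(embeddings["king"].shape)                # (100,)
    # These vectors can initialize the embedding layer of a downstream model
    # instead of training that layer from scratch.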

7.9 DOCUMENT EMBEDDING

Once we have learnt the word embeddings, the next step is to learn the document embedding. A couple of very simple techniques are taking the average of the embeddings and finding the centroid. This can be further improved by using a weighted average, with a scheme like tf-idf. There are several methods to learn to embed a larger sequence of words like a phrase, a sentence, a paragraph or a document. Mikolov et al. argued that the paragraph is the more general unit and proposed a very simple scheme: using the windowing technique, unique paragraph ids were created and, similar to the word embedding matrix, a paragraph embedding matrix was learned.

FIGURE 7.27 Paragraph-2-Vector Learning (the CBoW architecture extended with a paragraph id and a paragraph embedding matrix feeding the softmax layer)

Figure 7.27 explains the concept by extending the CBoW architecture for paragraphs. As discussed, a paragraph id is generated and a paragraph embedding matrix is used, similar to the word embedding matrix. The idea is that the context words may generate slightly different target words, based on what paragraph they belong to. In the final layer, this word embedding and the paragraph embedding are included through the softmax layer.
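As a concrete illustration of the simple averaging (and tf-idf weighted averaging) baseline mentioned at the start of this section, here is a small NumPy sketch with made-up 3-dimensional embeddings.

    import numpy as np

    # A sketch of document embedding by (weighted) averaging of word vectors.
    embeddings = {
        "we":         np.array([0.1, 0.3, 0.0]),
        "like":       np.array([0.2, 0.1, 0.4]),
        "natural":    np.array([0.5, 0.2, 0.1]),
        "language":   np.array([0.4, 0.3, 0.2]),
        "processing": np.array([0.3, 0.0, 0.5]),
    }

    def document_embedding(tokens, weights=None):
        """Average (or weighted average, e.g. by tf-idf) of the word vectors."""
        if weights is None:
            weights = {t: 1.0 for t in tokens}
        vecs = np.array([weights[t] * embeddings[t] for t in tokens])
        return vecs.sum(axis=0) / sum(weights[t] for t in tokens)

    doc = ["we", "like", "natural", "language", "processing"]
    print(document_embedding(doc))                                        # plain centroid
    print(document_embedding(doc, {t: 0.5 if t == "we" else 1.0 for t in doc}))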
SUMMARY

1. Neural networks can be thought of as models specialized to learn non-linear and complex representations.
2. In traditional Machine Learning, Principal Component Analysis (PCA) learns a linear combination of the original features which explains the maximum variability of the data and is kind of the state of the art.
3. Autoencoders are feed-forward networks specialized to learn a representation by reconstructing the input.
4. In the vanilla version, we have an input layer, a hidden layer, and an output layer. The hidden layer outputs are the representations learnt. The intuition is that they are sufficient to reconstruct the input.
5. The input to the hidden layer is called the encoding layer, and the hidden layer to the output is called the decoding layer.
6. They can be used to fit a different variety of input data by changing the loss function.
7. There can be multiple hidden layers; however, they need to be symmetric. Such autoencoders are called deep autoencoders or stacked autoencoders.
8. Regularization is important for autoencoders if we want to construct overcomplete autoencoders. Some such techniques are weight tying, sparse autoencoder, denoising autoencoder, and contractive autoencoder.
9. Representation learning is extremely critical for text. Traditional tf-idf based document embedding has many issues, especially the curse of dimensionality.
10. Topic modeling techniques like Matrix Factorization based NMF, LSI or Generative model-based LDA do a good task of finding the unobserved topics. These models were good at document embedding but poor on word analogy tasks.
11. To learn a better word representation, the traditional language model was picked. One of the effective implementations was by Bengio et al., where they used a neural network. The proxy task was to predict the next word. The more important motivation was to learn a more compact and distributed representation. This was suitably modified by Mikolov et al. with two very popular methods, namely Continuous Bag of Words (CBoW) and Skipgram. The computation challenges of these models were solved using schemes like negative sampling and hierarchical softmax.
12. Another set of models was proposed by Pennington et al., which learns the Global representation of the word Vectors (GloVe). Here the task is to reproduce the co-occurrence frequency given two words.
13. Both of these are very popular methods, and pre-trained embeddings can be used in a transfer learning model for downstream NLP applications.
14. Averaging is not a good strategy to derive the document embedding from word embedding. Paragraph to vectors is a good scheme for the embedding of documents.
SUBJECTIVE QUESTIONS

1. Discuss how neural networks can be thought of as a tool to learn representations.
2. What are the basic objectives of principal component analysis (PCA) and how can the number of components to be selected be determined?
3. Explain the architecture of an autoencoder with an explanation of the layers, loss functions and activation functions.
4. Why is an autoencoder thought to be more powerful than PCA?
5. Explain in detail the different regularization schemes of an autoencoder.
6. Discuss stacked autoencoders.
7. What are some of the issues with the Bag of Words representation of documents and how are they solved?
8. Explain how topic modelling techniques leverage matrix factorization techniques.
9. How are LSI- or LSA-based techniques different from LDA?
10. What is one-hot encoding and what are some of the issues associated with the same?
11. How is the neural network-based language model used to learn the embedding of the words?
12. Explain how CBoW and SkipGram models work to learn word embeddings.
13. What are some of the strategies used to reduce the computational complexity of these models?
14. Explain the working of GloVe models.
15. What are the advantages of pre-trained embeddings? Can this be called transfer learning?
16. Explain the issues of using averaging for word embedding. How is Paragraph2Vec used to learn the embedding of a paragraph?
OBJECTIVE QUESTIONS

1. Which is not a regularized autoencoder?
(a) Denoising autoencoder
(b) Sparse autoencoder
(c) Deep autoencoder
(d) Weight-tied autoencoder
2. Which is the possible number of hidden layers in a deep autoencoder?
(a) 2
(b) 1
(c) 6
(d) 5
3. When the input data is binary, which is an appropriate loss function?
(a) Hinge loss
(b) MAE
(c) MSE
(d) Cross entropy
4. Which is not a way to do document embedding?
(a) Average word embedding
(b) One hot encoding
(c) Weighted word embedding
(d) tf-idf based
5. Which of the following methods does not use matrix factorization?
(a) PCA
(b) NMF
(c) LDA
(d) LSA
6. Which is true for one-hot encoding?
(a) It is a dense representation of a word
(b) Good for analogy tasks
(c) Is very simple to understand
(d) The length of the vectors is small
7. Which of the following techniques is not used to improve the bag of words?
(a) tf-idf
(b) Neural modelling
(c) Text normalization
(d) Word embedding
8. Which is not a valid technique of word embedding?
(a) One hot encoding
(b) GloVe
(c) LDA
(d) Word2Vec
9. Which is not true for Word2Vec?
(a) Both CBoW and SkipGram are a part of this
(b) SoftMax is used in the activation layer
(c) The embedding matrix is a sparse matrix
(d) A window is used for finding the target word and context words
10. Which one of the following statements is true?
(a) SkipGram has an averaging layer
(b) CBoW uses left and right context in comparison with NNLM
(c) GloVe uses a matrix factorization technique
(d) The co-occurrence of words can give a shorter vector representation of a particular word.
