NLP WORDEMBEDDDING TECHINUES CBOW BOW.pptx

Understanding of Word
Embeddings: From Count Vectors
to Word2Vec
MODULE-I

• The objective of this task is to detect hate speech in tweets.
For the sake of simplicity, we say a tweet contains hate
speech if it has a racist or sexist sentiment associated with
it. So, the task is to classify racist or sexist tweets from other
tweets.
• Formally, given a training sample of tweets and labels,
where label ‘1’ denotes the tweet is racist/sexist and label ‘0’
denotes the tweet is not racist/sexist, your objective is to
predict the labels on the test dataset.

What Are Word Embeddings?
• Word Embeddings are the texts converted into numbers and
there may be different numerical representations of the
same text.
• sentence=” Word Embeddings are Word converted into
numbers “
• A word in this sentence may be “Embeddings” or “numbers ”
etc.
• A dictionary may be the list of all unique words in
the sentence. So, a dictionary may look like –
[‘Word’,’Embeddings’,’are’,’Converted’,’into’,’numbers’]

• A vector representation of a word may be a one-hot encoded
vector where 1 stands for the position where the word exists
and 0 everywhere else. The vector representation of
“numbers” in this format according to the above dictionary
is [0,0,0,0,0,1] and of converted is[0,0,0,1,0,0].
• This is just a very simple method to represent a word in the
vector form. Let us look at different types of Word
Embeddings or Word Vectors and their advantages and
disadvantages over the rest.

Different Types of Word Embeddings
• The different types of word embeddings can be broadly
classified into two categories-
1.Frequency-based Embedding
2.Prediction-based Embedding
• Let us try to understand each of these methods in detail.

Different Types of Word Embeddings
• 1. Classical Techniques
1.Count Vectorization (Bag-of-Words):
1. Represents text as a vector of word frequencies or counts.
2. High-dimensional and sparse.
2.TF-IDF (Term Frequency-Inverse Document Frequency):
1. Enhances the Bag-of-Words model by weighting words based on their
importance across documents.

• 2. Prediction-Based Models
3.Word2Vec (Mikolov et al., 2013):
3. Produces dense vector representations using a neural network.
4. Approaches:
3.Skip-Gram: Predicts the context given a target word.
4.CBOW (Continuous Bag of Words): Predicts a target word from its surrounding context.

• GloVe (Global Vectors for Word Representation):
• Combines matrix factorization and predictive modeling to produce
embeddings based on word co-occurrence statistics.
• FastText:
• Extends Word2Vec by incorporating subword (character n-grams)
information, making it robust to rare and out-of-vocabulary words.
• ELMo (Embeddings from Language Models):
• Contextualized embeddings generated from a deep bidirectional LSTM
trained on a language model task.
• Captures word meaning in different contexts

• 3. Transformer-Based Models
7.BERT (Bidirectional Encoder Representations from Transformers):
7. Contextualized embeddings generated using a Transformer architecture.
8. Trains bidirectionally to understand word context comprehensively.
8.GPT (Generative Pre-trained Transformer):
7. Generates embeddings using autoregressive (unidirectional) language
modeling.
9.RoBERTa, ALBERT, and Other Variants:
7. Variations of BERT with improved pretraining techniques and architecture
optimizations.
10.Word Embeddings from Transformers (e.g., Sentence-BERT,
DistilBERT):
7. Variants adapted for tasks requiring sentence or contextual embeddings.

• 4. Graph-Based and Other Advanced Techniques
11.Graph Embeddings (e.g., Node2Vec, DeepWalk):
11.Represent nodes in a graph as embeddings, applicable to word networks.
12.Sense2Vec:
11.Embeddings that distinguish between different senses of the same word (e.g.,
"bank" as a financial institution vs. a riverbank).
13.Latent Semantic Analysis (LSA):
11.Uses Singular Value Decomposition (SVD) to reduce dimensionality in a term-
document matrix.
14.Latent Dirichlet Allocation (LDA):
11.A probabilistic model that discovers topics in a corpus and represents words as
distributions over topics.
15.Doc2Vec:
11.Extends Word2Vec to represent entire documents or paragraphs as dense vectors.

• 5. Specialized Embeddings
16.InferSent and Universal Sentence Encoder:
16.Pretrained models specifically designed for generating sentence
embeddings.
17.StarSpace:
16.General-purpose embedding method for words, sentences, and entities.
18.CoVe (Context Vectors):
16.Contextual word embeddings derived from seq2seq models trained for
machine translation.

Frequency-based Embedding
• There are generally three types of vectors that we encounter
under this category.
1.Count Vector
2.TF-IDF Vector
3.Co-Occurrence Vector

Count Vector
• Consider a Corpus C of D documents {d1,d2…..dD} and N unique
tokens extracted out of the corpus C. The N tokens will form our
dictionary and the size of the Count Vector matrix M will be given by
D X N. Each row in the matrix M contains the frequency of tokens in
document D(i).
• Let us understand this using a simple example.
• D1: He is a lazy boy. She is also lazy.
• D2: Neeraj is a lazy person.
• The dictionary created may be a list of unique tokens(words) in the
corpus =[‘He’,’She’,’lazy’,’boy’,’Neeraj’,’person’]
• Here, D=2, N=6

• The count matrix M of size 2 X 6 will be represented as –
He She lazy boy Neeraj person
D1 1 1 2 1 0 0
D2 0 0 1 0 1 1
He She lazy boy Neeraj person
D1 1 1 2 1 0 0
D2 0 0 1 0 1 1

• Now, a column can also be understood as word vector for the corresponding word in
the matrix M. For example, the word vector for ‘lazy’ in the above matrix is [2,1] and
so on.Here, the rows correspond to the documents in the corpus and
the columns correspond to the tokens in the dictionary. The second row in the above
matrix may be read as – D2 contains ‘lazy’: once, ‘Neeraj’: once and ‘person’ once.
• Now there may be quite a few variations while preparing the above matrix M. The
variations will be generally in-
1.The way dictionary is prepared.
Why? Because in real world applications we might have a corpus which contains
millions of documents. And with millions of document, we can extract hundreds of
millions of unique words. So basically, the matrix that will be prepared like above will
be a very sparse one and inefficient for any computation. So an alternative to using
every unique word as a dictionary element would be to pick say top 10,000 words
based on frequency and then prepare a dictionary.
2.The way count is taken for each word.
We may either take the frequency (number of times a word has appeared in the
document) or the presence(has the word appeared in the document?) to be the entry
in the count matrix M. But generally, frequency method is preferred over the latter.

17
Vector Embedding of Words
• A word is represented as a vector.
• Word embeddings depend on a notion of word similarity.
• Similarity is computed using cosine.
• A very useful definition is paradigmatic similarity:
• Similar words occur in similar contexts. They are exchangeable.
POTUS
• Yesterday The President called a press conference.
Trump
• “POTUS: President of the United States.”

Bag of Words
• The Bag of Words (BoW) model is the simplest form of text
representation in numbers. Like the term itself, we can represent a
sentence as a bag of words vector (a string of numbers).
Let’s recall the three types of movie reviews :
• Review 1: This movie is very scary and long
• Review 2: This movie is not scary and is slow
• Review 3: This movie is spooky and good
We will first build a vocabulary from all the unique words in the above
three reviews. The vocabulary consists of these 11 words: ‘This’,
‘movie’, ‘is’, ‘very’, ‘scary’, ‘and’, ‘long’, ‘not’, ‘slow’, ‘spooky’, ‘good’.
• We can now take each of these words and mark their occurrence in
the three movie reviews above with 1s and 0s. This will give us 3
vectors for 3 reviews:

Vector-of-Review1: [1 1 1 1 1 1 1 0 0 0 0]
Vector-of-Review2: [1 1 2 0 0 1 1 0 1 0 0]
Vector-of-Review3: [1 1 1 0 0 0 1 0 0 1 1]
And that’s the core idea behind a Bag of Words (BoW) model.

Drawbacks of using a Bag-of-Words
(BoW) Model
• In the above example, we can have vectors of length 11.
However, we start facing issues of Bag of Words when we
come across new sentences:
1.If the new sentences contain new words, then our
vocabulary size would increase and thereby, the length of
the vectors would increase too.
2.Additionally, the vectors would also contain many 0s,
thereby resulting in a sparse matrix (which is what we would
like to avoid)
3.We are retaining no information on the grammar of the
sentences nor on the ordering of the words in the text.

Example to Understand Bag-of-Words
(BoW) and TF-IDF
• sample of reviews about a particular horror movie:
• Review 1: This movie is very scary and long
• Review 3: This movie is spooky and good
• You can see that there are some contrasting reviews about the movie as well
as the length and pace of the movie. Imagine looking at a thousand reviews
like these. Clearly, there is a lot of interesting insights we can draw from them
and build upon them to gauge how well the movie performed.
• However, as we saw above, we cannot simply give these sentences to a
machine learning model and ask it to tell us whether a review was positive or
negative. We need to perform certain text preprocessing steps.
• Bag-of-Words and TF-IDF are two examples of how to do this

Creating Vectors from Text
• Can you think of some techniques we could use to vectorize a
sentence at the beginning? The basic requirements would be:
1.It should not result in a sparse matrix since sparse matrices result in
high computation cost
2.We should be able to retain most of the linguistic information present
in the sentence
• Word Embedding is one such technique where we can represent the
text using vectors. The more popular forms of word embeddings are:
1.BoW, which stands for Bag of Words
2.TF-IDF, which stands for Term Frequency-Inverse Document Frequency

Limitations of Bag of Words
1.No Word Order: It doesn’t care about the order of words, missing
out on how words work together.
2.Ignores Context: It doesn’t understand the meaning of words
based on the words around them.
3.Always Same Length: It always represents text in the same way,
which can be limiting for different types of text.
4.Lots of Words: It needs to know every word in a language, which
can be a huge list to handle.
5.No Meanings: It doesn’t understand what words mean, only how
often they appear, so it can’t grasp synonyms or different word
forms.

Term Frequency-Inverse Document
Frequency (TF-IDF)
“Term frequency–inverse document frequency, is a numerical statistic that
is intended to reflect how important a word is to a document in a collection
or corpus.”
Term Frequency (TF)
Let’s first understand Term Frequent (TF). It is a measure of how frequently
a term, t, appears in a document, d:
Here, in the numerator, n is the number of times the term “t”
appears in the document “d”. Thus, each document and term
would have its own TF value.

• We will again use the same vocabulary we had built in the Bag-
of-Words model to show how to calculate the TF for Review #2:
• Here,
• Vocabulary: ‘This’, ‘movie’, ‘is’, ‘very’, ‘scary’, ‘and’, ‘long’, ‘not’,
‘slow’, ‘spooky’, ‘good’
• Number of words in Review 2 = 8
• TF for the word ‘this’ = (number of times ‘this’ appears in review
2)/(number of terms in review 2) = 1/8

• TF(‘movie’) = 1/8
• TF(‘is’) = 2/8 = 1/4
• TF(‘very’) = 0/8 = 0
• TF(‘scary’) = 1/8
• TF(‘and’) = 1/8
• TF(‘long’) = 0/8 = 0
• TF(‘not’) = 1/8
• TF(‘slow’) = 1/8
• TF( ‘spooky’) = 0/8 = 0
• TF(‘good’) = 0/8 = 0

We can calculate the term frequencies
for all the terms and all the reviews in
this manner:

• Inverse Document Frequency (IDF)
• IDF is a measure of how important a term is. We need the
IDF value because computing just the TF alone is not
sufficient to understand the importance of words:

• We can calculate the IDF values for the all the words in
Review 2:
• IDF(‘this’) = log(number of documents/number of
documents containing the word ‘this’) = log(3/3) = log(1) = 0
• Similarly,
• IDF(‘movie’, ) = log(3/3) = 0
• IDF(‘is’) = log(3/3) = 0
• IDF(‘not’) = log(3/1) = log(3) = 0.48
• IDF(‘scary’) = log(3/2) = 0.18
• IDF(‘and’) = log(3/3) = 0
• IDF(‘slow’) = log(3/1) = 0.48

We can calculate the IDF values for each word like this.
Thus, the IDF values for the entire vocabulary would be:

• Hence, we see that words like “is”, “this”, “and”, etc., are reduced
and have little importance; while words like “scary”, “long”, “good
are words with more importance and thus have a higher value.
• We can now compute the TF-IDF score for each word in the corpus. W
with a higher score are more important, and those with a lower score
important
We can now calculate the TF-IDF score for every word in Review 2:
TF-IDF(‘this’, Review 2) = TF(‘this’, Review 2) * IDF(‘this’) = 1/8 * 0 = 0

• TF-IDF(‘movie’, Review 2) = 1/8 * 0 = 0
• TF-IDF(‘is’, Review 2) = 1/4 * 0 = 0
• TF-IDF(‘not’, Review 2) = 1/8 * 0.48 = 0.06
• TF-IDF(‘scary’, Review 2) = 1/8 * 0.18 = 0.023
• TF-IDF(‘and’, Review 2) = 1/8 * 0 = 0
• TF-IDF(‘slow’, Review 2) = 1/8 * 0.48 = 0.06

Similarly, we can calculate the TF-IDF scores for
all the words with respect to all the reviews:

We have now obtained the TF-IDF scores for our vocabulary. TF-IDF also gives larger values for
less frequent words and is high when both IDF and TF values are high i.e the word is rare in all
the documents combined but frequent in a single document.

Summary
1.Bag of Words just creates a set of vectors containing the count of word
occurrences in the document (reviews), while the TF-IDF model contains
information on the more important words and the less important ones as
well.
2.Bag of Words vectors are easy to interpret. However, TF-IDF usually
performs better in machine learning models.
While both Bag-of-Words and TF-IDF have been popular in their own
regard, there still remained a void where understanding the context of
words was concerned. Detecting the similarity between the words
‘spooky’ and ‘scary’, or translating our given documents into another
language, requires a lot more information on the documents.
This is where Word Embedding techniques such as Word2Vec,
Continuous Bag of Words (CBOW), Skipgram, etc. come in. You can find a
detailed guide to such techniques here:

36
Vector Embedding of Words
Traditional Method - Bag of Words
Model
• Either uses one hot encoding.
• Each word in the vocabulary is
represented by one bit position in a
HUGE vector.
• For example, if we have a vocabulary of
10000 words, and “Hello” is the 4th word
in the dictionary, it would be represented
by: 0 0 0 1 0 0 . . . . . . . 0 0 0
• Or uses document representation.
• Each word in the vocabulary is
represented by its presence in
documents.
• For example, if we have a corpus of 1M
documents, and “Hello” is in 1th, 3th and
5th documents only, it would be
Word Embeddings
• Stores each word in as a point in space,
where it is represented by a dense
vector of fixed number of dimensions
(generally 300) .
• Unsupervised, built just by reading
huge corpus.
• For example, “Hello” might be
represented as : [0.4, -0.11, 0.55,
0.3 . . . 0.1, 0.02].
• Dimensions are basically projections
along different axes, more of a
mathematical concept.

37
Example
• vector[Queen] vector[King] - vector[Man] + vector[Woman]
• vector[Paris] vector[France] - vector[ Italy] + vector[ Rome]
• This can be interpreted as “France is to Paris as Italy is to Rome”.

Working with vectors
• Finding the most similar words to .
• Compute the similarity from word to all other words.
• This is a single matrix-vector product:
• W is the word embedding matrix of |V| rows and
d columns.
• Result is a |V| sized vector of similarities.
• Take the indices of the k-highest values.
38

Working with vectors
• Similarity to a group of words
• “Find me words most similar to cat, dog and cow”.
• Calculate the pairwise similarities and sum them:
• Now find the indices of the highest values as before.
• Matrix-vector products are wasteful. Better option:
39

Applications of Word Vectors
• Word Similarity
• Machine Translation
• Part-of-Speech and Named Entity Recognition
• Relation Extraction
• Sentiment Analysis
• Co-reference Resolution
• Chaining entity mentions across multiple documents - can we find and
unify the multiple contexts in which mentions occurs?
• Clustering
• Words in the same class naturally occur in similar contexts, and this
feature vector can directly be used with any conventional clustering
algorithms (K-Means, agglomerative, etc). Human doesn’t have to waste
time hand-picking useful word features to cluster on.
• Semantic Analysis of Documents
• Build word distributions for various topics, etc.
40

Vector
Embedding
of Words
• Latent Semantic Analysis/Indexing (1988)
• Term weighting-based model
• Consider occurrences of terms at document
level.
• Word2Vec (2013)
• Prediction-based model.
• Consider occurrences of terms at context level.
• GloVe (2014)
• Count-based model.
• Consider occurrences of terms at context level.
• ELMo (2018)
• Language model-based.
• A different embedding for each word for each
task.
41

43
Embedding: Latent Semantic Analysis
• Latent semantic analysis studies documents in Bag-Of-Words model (1988).
• i.e. given a matrix A encoding some documents: is the count* of word j in document i. Most
entries are 0.
* Often tf-idf or other “squashing” functions of the count are used.
A
N docs
M words

44
• Low rank SVD decomposition:
• : document-to-concept similarities matrix (orthogonal matrix).
• V: word-to-concept similarities matrix (orthogonal matrix).
• : strength of each concept.
• Then given a word w (column of A):
• is the embedding (encoding) of the word w in the latent space.
• is the decoding of the word w from its embedding.
A  U
VT
Σ
N docs
M words K latent dim M words

45
• is the decoding of the word w from its embedding.
• An SVD factorization gives the best possible reconstructions of the a word w from its
embedding.
• Note:
• The problem with this method, is that we may end up with matrices having billions of rows
and columns, which makes SVD computationally expensive and restrictive.
A  U
VT
Σ
N docs
M words K latent dim M words

Word Embeddings in NLP | Word2Vec | GloVe | fastText
• Word embeddings are word vector representations where words
with similar meaning have similar representation. Word vectors are
one of the most efficient ways to represent words.
• Word vectors with Word2Vec
• Using Deep Learning to NLP tasks has proven to be performing very
well. The core concept is to feed the human readable sentences
into neural networks so that the models can extract some sort
of information from them.
• It’s important to understand how to pre-process textual data for the
neural neural networks. Neural networks can take numbers as an
input but not raw text. Hence we need to convert these words into
numerical format.

Word Vectors
• Word vectors are much better ways to represent words
than one hot encoded vectors (As the size of vocabulary
increases, leads to extensive memory usage while
representing text vectors). The index which is assigned to
each word does not hold any semantic meaning. In one hot
encoded vectors, the vectors for “dog” and “cat” are just as
close to each other as “dog” and “computer”, hence neural
network has to try really hard to understand each word
since they are being treated as completely isolated entities.
The usage of word vectors aim to resolve both these issues.

• One important aspect before diving directly to word vectors
is, similar words occur more frequently together than
dissimilar words.
• Just because word occurs within the vicinity of another word
it doesnt always mean they have similar meaning , but when
we consider frequency of words which are found close
together we find that words of similar meaning are found
together.
• “Word vectors consume much less space than one hot encoded
vectors and they also maintain semantic representation of
word”.

• One important aspect before diving directly to word vectors
is, similar words occur more frequently together than
dissimilar words.
• Just because word occurs within the vicinity of another word
it doesnt always mean they have similar meaning , but when
we consider frequency of words which are found close
together we find that words of similar meaning are found
together.

Word2Vec
• In word2vec there are 2 architectures CBOW(Continuous Bag of Words)
and Skip Gram.
• First thing to do is to collect word co-occurrence data. We need set of data
telling us which words are occurring close to certain word. We will use
something called as context window for doing this.
• Consider, “Deep Learning is very hard and fun”. We need to set something
known as window size. Let’s say 2 in this case. What we do is iterate over
all the words in the given data, which in this case is just one sentence,
and then consider a window of word which surrounds it. Here since our
window size is 2 we will consider 2 words behind the word and 2 words
after the word, hence each word will get 4 words associated with it. We will
do this for each and every word in the data and collect the word pairs.

NLP WORDEMBEDDDING TECHINUES CBOW BOW.pptx

• As we are passing the context window through the text
data, we find all pairs of target and context words to form
a dataset in the format of target word and context word. For
the sentence above, it will look like this:
• 1st Window pairs: (Deep, Learning), (Deep, is)
• 2nd Window pairs: (Learning, Deep), (Learning, is),
(Learning, very)
• 3rd Window pairs: (is, Deep), (is, Learning), (is, very), (is,
hard)

• (Deep, Learning), (Deep, is), (Learning, Deep), (Learning, is),
(Learning, very), (is, Deep), (is, Learning), (is, very), (is, hard),
(very, learning), (very, is), (very, hard), (very, and), (hard, is),
(hard, very), (hard, and), (hard, fun), (and, very), (and, hard),
(and, fun), (fun, hard), (fun, and)
And so on. At the end our target word vs context word data set is going to look like this:

• We use neural network for this prediction task. The input to
the neural network is the one hot encoded version of
the context word. Hence the size of the input and output
layer is V(vocabulary count). This neural network has only
one layer in the middle, the size of the hidden layer
determines the size of the word vectors we wish to have
at the end.
This can be considered as our “training data” for word2vec.
In skipgram model, we try to predict each context word given a target word.

What is Continuous Bag of Words (CBOW)?
• Continuous Bag of Words (CBOW) is a neural network model used for natural
language processing tasks, primarily for word embedding. It belongs to the family of
neural network architectures called Word2Vec, which aims to represent words in a
continuous vector space.
• In CBOW, the model predicts the current word based on the context of surrounding
words. CBOW predicts the target word from its context. The architecture typically
consists of an input layer, a hidden layer, and an output layer.
• Input Layer: It represents the context words encoded as one-hot vectors.
• Hidden Layer: This layer processes the input and performs non-linear transformations
to capture the semantic relationships between words.
• Output Layer: It produces a probability distribution over the vocabulary, with each
word assigned a probability of being the target word given its context.

What is Skip-Gram Model?
• The Skip-Gram model is another neural network architecture within the
Word2Vec framework for generating word embeddings. Unlike Continuous
Bag of Words (CBOW), Skip-Gram predicts context words given a target
word. It's designed to learn the representation of a word by predicting the
surrounding words in its context.
• Input Layer: It takes a single word (the target word) encoded as a one-hot
vector.
• Hidden Layer: This layer transforms the input word into a distributed
representation in the hidden layer.
• Output Layer: It predicts the context words (surrounding words) based on
the representation learned in the hidden layer.

Key Differences Between CBOW and Skip-Gram
Aspect CBOW (Continuous Bag of Words) Skip-Gram
Concept Predicts a target word based on context words. Predicts context words given a target word.
Architecture Averages context word vectors to predict the target word. Uses the target word vector to predict multiple context words.
Training Process Minimizes cross-entropy loss to predict the target word.
Maximizes the likelihood of context words around a target word
using techniques like negative sampling or hierarchical softmax.
Training Speed Faster due to averaging of context vectors and fewer updates.
Slower because it predicts multiple context words, requiring more
updates.
Performance with
Infrequent Words
Less effective in representing rare words. More effective, as it captures detailed word-context relationships.
Quality of Word
Embeddings Produces decent word embeddings, but not as rich as Skip-Gram. Produces higher quality embeddings, capturing subtle semantic
nuances.
Hyperparameter
Sensitivity Less sensitive to hyperparameters compared to Skip-Gram.
More sensitive, requiring careful tuning of parameters like learning
rate and context window size.
Computational
Resources
Requires fewer resources due to simpler training process. Requires more computational power and memory.
Use Cases
Suitable for tasks requiring speed over detailed word representations,
like text classification and sentiment analysis.
Ideal for tasks needing high-quality embeddings and detailed
semantic relationships, such as word similarity tasks, named entity
recognition, and machine translation.

Training Process for CBOW
• This training process helps the CBOW model learn meaningful word embeddings that capture the semantic relationships
between words based on their contexts in the training corpus.
• Step 1: Data Preprocessing
• Tokenize the corpus.
• Create word embeddings (e.g., Word2Vec).
• Step 2: CBOW Model Architecture
• Initialize input/output layers.
• Define the hidden layer.
• Step 3: Training the CBOW Model, For each word:
• Encode context words into embeddings.
• Average embeddings for the context vector.
• Pass context vector through the network.
• Calculate loss and update weights.
• Repeat for multiple epochs.
• Step 4: Inference with the CBOW Model
• Encode and average context words.
• Pass the context vector through the network to predict the word

Advantages and Disadvantages of CBOW Model
• Advantages of CBOW Model
• CBOW is faster to train compared to Skip-gram, especially for large datasets.
• It tends to perform well with frequent words and is useful in scenarios where context matters more
than individual word positions.
• Disadvantages of CBOW Model
• CBOW might not perform as well as Skip-gram in capturing rare words or phrases.
• It doesn't preserve word order information, which can be crucial in some applications.
• Use Cases and Applications of CBOW
• Word Embeddings: CBOW is widely utilized to make word embeddings which in turn is employed in
different NLP applications such as sentiment analysis, machine translation and information retrieval.
• Recommendation Systems: CBOW embeddings can be exploited for understanding words relationship
for similarity which is very important in recommendations system to recommend like items or content.
• Text Classification: The result of CBOW can be used as features in the text classification tasks, when the
important factor is a context of words.

Training Process for Skip Gram
• Step 1: Data Preprocessing
• Tokenize the corpus into individual
words.
• Create word embeddings using
techniques like Word2Vec.
• Step 2: Skip-gram Model Architecture
• Initialize input and output layers.
• Define a hidden layer with a specified
number of neurons.
• Step 3: Training the Skip-gram Model
• For each word w in the training corpus:
• Retrieve context words surrounding ?
w within a specified window size.
• For each context word:
• Encode the target word w into its corresponding word
embedding.
• Pass the word embedding as input to the neural network.
• Forward propagate through the network to obtain the
predicted context word.
• Compare the predicted context word with the actual
context word.
• Calculate the loss using a suitable loss function (e.g., cross-
entropy).
• Backpropagate the error to update the weights.
• Repeat the process for multiple epochs until convergence.
• Step 4: Inference with the Skip-gram Model
• Given a target word, retrieve its word embedding.
• Pass the word embedding through the trained neural
network.
• Obtain the predicted context words as the output of the
network.

Advantages and Disadvantages of Skip Gram Model
• Advantages of Skip Gram Model
• Huge advantage of Skip-Gram is that it
is able to perform better with low
frequency words because the idea is
actually to output as many context
words as possible given the target word.
• It preserves word order information,
which can be beneficial in tasks where
word order matters, such as language
translation or sequence generation.
• Disadvantages of Skip Gram Model
• Training the Skip-Gram model can be
computationally expensive compared to
CBOW, especially for large datasets, due
to the need to predict multiple context
words for each target word.
Skip-Gram might not perform as well with very
frequent words compared to CBOW.
Use-Cases and Applications of Skip Gram Model
Utilized in different areas of Natural Language
Processing such as sentiment analysis, machine
interpretation, and information retrieval.
Semantic Similarity: Skip-Gram embeddings are
useful for calculating the semantic relation
between two words and such applications
include recommendation systems and search
engines.
Text Generation: Skip-Gram embeddings can be
effectively used in text generation task to
generate meaningful sequences of words to
complete the given phrase.

The Word Embedding Model
• Word embeddings are a type of word representation that allows words with
similar meaning to have a similar representation. They are a distributed
representation for text that is perhaps one of the key breakthroughs for the
impressive performance of deep learning methods on challenging natural
language processing problems. In this chapter, you will discover the word
embedding approach for representing text data. After completing this chapter,
you will know:
• What the word embedding approach for representing text is and how it differs
from other feature extraction methods.
• That there are 3 main algorithms for learning a word embedding from text data.
• That you can either train a new embedding or use a pre-trained embedding on
your natural language processing task

• 1. What Are Word Embeddings?
• 2. Word Embedding Algorithms
• 3. Using Word Embeddings

What Are Word Embeddings?
• A word embedding is a learned representation for text where words that
have the same meaning have a similar representation. It is this approach
to representing words and documents that may be considered one of the
key breakthroughs of deep learning on challenging natural language
processing problems.
• One of the benefits of using dense and low-dimensional vectors is
computational: the majority of neural network toolkits do not play well
with very high-dimensional, sparse vectors. ... The main benefit of the
dense representations is generalization power: if we believe some
features may provide similar clues, it is worthwhile to provide a
representation that is able to capture these similarities.

• Word embeddings are in fact a class of techniques where individual words
are represented as real-valued vectors in a predefined vector space. Each
word is mapped to one vector and the vector values are learned in a way
that resembles a neural network, and hence the technique is often lumped
into the field of deep learning. Key to the approach is the idea of using a
dense distributed representation for each word. Each word is represented
by a real-valued vector, often tens or hundreds of dimensions. This is
contrasted to the thousands or millions of dimensions required for sparse
word representations, such as a one hot encoding. associate with each
word in the vocabulary a distributed word feature vector ... The feature
vector represents different aspects of the word: each word is associated
with a point in a vector space. The number of features ... is much smaller
than the size of the vocabulary

Word Embedding Algorithms
• Word embedding methods learn a real-valued vector representation
for a predefined fixed sized vocabulary from a corpus of text. The
learning process is either joint with the neural network model on
some task, such as document classification, or is an unsupervised
process, using document statistics. This section reviews three
techniques that can be used to learn a word embedding from text
data.

Embedding Layer
• An embedding layer, for lack of a better name, is a word embedding
that is learned jointly with a neural network model on a specific
natural language processing task, such as language modeling or
document classification. It requires that document text be cleaned
and prepared such that each word is one hot encoded. The size of the
vector space is specified as part of the model, such as 50, 100, or 300
dimensions. The vectors are initialized with small random numbers.
The embedding layer is used on the front end of a neural network and
is fit in a supervised way using the Backpropagation algorithm.

• The one hot encoded words are mapped to the word vectors. If a
Multilayer Perceptron model is used, then the word vectors are
concatenated before being fed as input to the model. If a recurrent
neural network is used, then each word may be taken as one input in
a sequence. This approach of learning an embedding layer requires a
lot of training data and can be slow, but will learn an embedding both
targeted to the specific text data and the NLP task.

Word2Vec
• Word2Vec is a statistical method for efficiently learning a standalone
word embedding from a text corpus. It was developed by Tomas
Mikolov, et al. at Google in 2013 as a response to make the neural-
network-based training of the embedding more efficient and since
then has become the de facto standard for developing pre-trained
word embedding. Additionally, the work involved analysis of the
learned vectors and the exploration of vector math on the
representations of words. For example, that subtracting the man-ness
from King and adding women-ness results in the word Queen,
capturing the analogy king is to queen as man is to woman.

CBOW
• Vocabulary: ["I", "like", "cats", "dogs"]
• Context-Target Pair:
• Context: ["I", "cats"]
• Target: "like”
• Assume the following:
• Vocabulary size (VVV) = 4
• Embedding dimensions (DDD) = 3
• Initial word embeddings (random values for simplicity):
• "I" → [0.1,0.2,0.3]
• "like" → [0.4,0.5,0.6]
• "cats" → [0.7,0.8,0.9]
• "dogs" → [0.2,0.1,0.3]

STEPS
1.Forward Pass:
1. Compute the average of the embeddings for the context words.
2. Pass this average through a dense layer to predict the probability distribution
over the vocabulary.
2.Loss Calculation:
1. Use cross-entropy loss to compute the error between predicted and actual
target word.
3.Backpropagation:
1. Update embeddings and weights using gradient descent.

Step 1: Encode Context Words
• Context Words: ["I", "cats"]
Embeddings:
• "I" → [0.1,0.2,0.3]
• "cats" → [0.7,0.8,0.9]
• Average Embedding:
• Average=[0.1,0.2,0.3]+[0.7,0.8,0.9]/2=[0.4,0.5,0.6]
=[0.4,0.5,0.6]

Compute Scores
• Using a weight matrix W of size D×V:
• W=[0.2 0.1 0.3 0.4
• 0.5 0.6 0.7. 0.8
• 0.1 0.9 0.4 0.3]
• The scores for the vocabulary are computed as:
• Scores=Average Embedding W
⋅
• Performing the dot product:
• Scores=[0.35,0.83,0.65,0.75}

Softmax for Probability Distribution
• Apply the softmax function to the scores to get probabilities:

Loss Calculation
• The true target is "like," corresponding to index 1 in the vocabulary.
The probability for "like" is 0.250.250.25.
• Cross-entropy loss:
• Loss=−log⁡
(P(like))=−log⁡
(0.25)≈1.39

• Here when we give a vector representation of a group of
context words, we will get the most appropriate target
word which will be within the vicinity of those words.
• For example, if we give the sentence: Deep _____ is very
hard, where [“Deep”, “is”, “very”, “hard”] represents the
context words, the neural network should hopefully give
“Learning” as the output target word. This is the core task
the neural network tries to train for in the case of CBOW.
• Word vectors help represent semantics of the words
— What does this mean?

• It we means we could use vector reasoning for words one of the most famous
example is from Mikolov’s paper, where we see that if we use the word vectors
and perform (here, we use V(word) to represent the vector representation of the
word) V(King) -V(Man) + V(Woman), and the resulting vector is closest to V(Queen).
It is easy to see why this is remarkable — our intuitive understanding of these
words is reflected in the learned vector representations of the words.
• This gives us the ability to add more of a punch in our text analysis pipelines-
having an intuitive semantic representation of vectors will come in handy more
than once.
• Word2Vec takes as its input a large corpus of text and produces a vector space,
typically of several hundred dimensions, with each unique words in the corpus
being assigned a corresponding vector in the space. Word vectors are positioned
to vector space such that words that share common contexts in the corpus are
located in close proximity to one other in the space

81
word2Vec: Local contexts
• Instead of entire documents, Word2Vec uses words k positions away
from each center word.
• These words are called context words.
• Example for k=3:
• “It was a bright cold day in April, and the clocks were striking”.
• Center word: red (also called focus word).
• Context words: blue (also called target words).
• Word2Vec considers all words as center words, and all their context
words.

Example
Example: d1 = “king brave man” , d2 = “queen beautiful women”

83
Word2Vec: Data generation (window size = 2)
word Word one hot
encoding
neighbor Neighbor one hot
encoding
king [1,0,0,0,0,0] brave [0,1,0,0,0,0]
king [1,0,0,0,0,0] man [0,0,1,0,0,0]
brave [0,1,0,0,0,0] king [1,0,0,0,0,0]
brave [0,1,0,0,0,0] man [0,0,1,0,0,0]
man [0,0,1,0,0,0] king [1,0,0,0,0,0]
man [0,0,1,0,0,0] brave [0,1,0,0,0,0]
queen [0,0,0,1,0,0] beautiful [0,0,0,0,1,0]
queen [0,0,0,1,0,0] women [0,0,0,0,0,1]
beautiful [0,0,0,0,1,0] queen [0,0,0,1,0,0]
beautiful [0,0,0,0,1,0] women [0,0,0,0,0,1]
woman [0,0,0,0,0,1] queen [0,0,0,1,0,0]
woman [0,0,0,0,0,1] beautiful [0,0,0,0,1,0]

84
Word2Vec: Data generation (window size = 2)
• Example: d1 = “king brave man” , d2 = “queen beautiful women”
word Word one hot
encoding
neighbor Neighbor one hot
encoding
king [1,0,0,0,0,0] brave [0,1,1,0,0,0]
man
brave [0,1,0,0,0,0] king [1,0,1,0,0,0]
man
man [0,0,1,0,0,0] king [1,1,0,0,0,0]
brave
queen [0,0,0,1,0,0] beautiful [0,0,0,0,1,1]
women
beautiful [0,0,0,0,1,0] queen [0,0,0,1,0,1]
women
woman [0,0,0,0,0,1] queen [0,0,0,1,1,0]
beautiful

85
Word2Vec: main context representation
models
Continuous Bag of Words
(CBOW) Skip-Ngram
Sum and
projectio
n
W-2
W-1
w2
w0
w1
Input
Output
Projectio
n
W-2
W-1
w2
w0
w1
Input
Output
 Word2Vec is a predictive model.
 Will focus on Skip-Ngram model

88
1.The dictionary of unique words present in our dataset or text. This dictionary is known as vocabulary and is known
words to the system. Vocabulary is represented by ‘V’.
2.N is the number of neurons present in the hidden layer.
3.The window size is the maximum context location at which the words need to be predicted. The window size is
denoted by c. For example, in the given architecture image the window size is 2, therefore, we will be predicting the
words at context location (t-2), (t-1), (t+1) and (t+2).
4.Context window is the number of words to be predicted which can occur in the range of the given word. The value of
a context window is double the window size that is 2*c and is represented by k. For the given image the value of the
context window is 4.
5.The dimension of an input vector is equal to |V|. Each word is encoded using one hot encoding.
7.The output vector of the hidden layer is H[N].
8.The weight matrix for the hidden layer(W) is of dimension [|V|, N]. || is the modulus function which returns the size
of an array.
8.The weight matrix between the hidden and the output layer(W’) is of dimension [N,|V|].
9.The dot product between W’ and H gives us an output vector U[|v|].

89
1.The words are converted into a vector using one hot encoding. The dimension of these vectors is [1,|v|].

91
How does word2Vec work?
• Represent each word as a d dimensional vector.
• Represent each context as a d dimensional vector.
• Initialize all vectors to random weights.
• Arrange vectors in two matrices, W and C.

92
Word2Vec : Neural Network representation
w1
w2
Hidden layer
Input layer
|Vw| |Vc|
Output (sigmoid)

93
1
0
0
0
0
0
0
1
1
0
0
0
w1
w2
Hidden layer
Input layer
king
brave
man
|Vw| |Vc|
Output (sigmoid)

94
0
1
0
0
0
0
1
0
1
0
0
0
w1
w2
Hidden layer
Input layer Output (sigmoid)
|Vw| |Vc|
brave
man
king

95
0
0
1
0
0
0
1
1
0
0
0
0
w1
w2
Hidden layer
|Vw| |Vc|
man
king
brave

96
0
0
0
1
0
0
0
0
0
0
1
1
w1
w2
Hidden layer
|Vw| |Vc|
queen
beautiful
women

97
0
0
0
0
1
0
0
0
0
1
0
1
w1
w2
Hidden layer
|Vw| |Vc|
beautiful
queen
women

98
0
0
0
0
1
0
0
0
0
1
1
0
w1
w2
Hidden layer
|Vw| |Vc|
women
queen
beautiful

Introduction of Vectors
What is the role of vectors in natural language processing?
How are word embeddings represented as vectors in NLP?
Can you explain the concept of a vector space model in the context of text
representation?
How do vectors facilitate mathematical operations in NLP tasks?
Word Analogy
How does the word analogy task work in NLP?
What is the significance of the equation "king- man + woman = queen" in word
embeddings?
Which algorithm is commonly used for solving word analogy problems?
How can you assess the quality of word vectors based on their performance in analogy
tasks?
Assess Word Vectors using TF-IDF and t-SNE
How does TF-IDF differ from word embeddings in representing words?
What is the purpose of using TF-IDF for assessing word vectors?
How does t-SNE help in visualizing high-dimensional word vectors?
What insights can be gained from a t-SNE plot of word vectors?

What is GloVe?
Global Vectors for Word Representation, or GloVe for short, is an
unsupervised learning algorithm that generates vector representations,
or embeddings, of words. Researchers Richard Socher, Christopher D.
Manning, and Jeffrey Pennington first presented it in 2014. By using the
statistical co-occurrence data of words in a given corpus, GloVe is
intended to capture the semantic relationships between words.
The fundamental concept underlying GloVe is the representation of
words as vectors in a continuous vector space, where the angle and
direction of the vectors correspond to the semantic connections
between the appropriate words. To do this, GloVe builds a co-
occurrence matrix using word pairs and then optimizes the word
vectors to minimize the difference between the pointwise mutual
information of the corresponding words and the dot product of vectors.

Glove Data
The glove has pre-defined dense vectors for around every 6 billion words of English
literature along with many other general-use characters like commas, braces, and
semicolons. The algorithm’s developers frequently make the pre-trained GloVe
embeddings available. It is not necessary to train the model from scratch when using
these pre-trained embeddings, which can be downloaded and used immediately in a
variety of natural language processing (NLP) applications. Users can select a pre-
trained GloVe embedding in a dimension (e.g., 50-d, 100-d, 200-d, or 300-d vectors)
that best fits their needs in terms of computational resources and task specificity.
Here d stands for dimension. 100d means, in this file each word has an equivalent
vector of size 100. Glove files are simple text files in the form of a dictionary. Words
are key and dense vectors are values of key.

GloVe Embeddings Applications
GloVe embeddings are a popular option for representing words in text data and have found applications in various natural language processing (NLP) tasks. The following are some typical uses for GloVe embeddings:
Text Classification:
GloVe embeddings can be utilised as features in machine learning models for sentiment analysis, topic classification, spam detection, and other applications.
Named Entity Recognition (NER):
By capturing the semantic relationships between words and enhancing the model’s capacity to identify entities in text, GloVe embeddings can improve the performance of NER systems.
Machine Translation:
GloVe embeddings can be used to represent words in the source and target languages in machine translation systems, which aim to translate text from one language to another, thereby enhancing the quality of the translation.
Question Answering Systems:
To help models comprehend the context and relationships between words and produce more accurate answers, GloVe embeddings are used in question-answering tasks.

Document Similarity and Clustering:
GloVe embeddings enable applications in information retrieval and document organization by measuring
the semantic similarity between documents or grouping documents according to their content.
Word Analogy Tasks:
In word analogy tasks, GloVe embeddings frequently yield good results. For instance, the generated
vector for “king-man + woman” might resemble the “queen” vector, demonstrating the capacity to
recognize semantic relationships.
Semantic Search:
In semantic search applications, where retrieving documents or passages according to their semantic
relevance to a user’s query is the aim, GloVe embeddings are helpful.

How GloVe works?
The creation of a word co-occurrence matrix is the fundamental component of GloVe. This matrix provides a
quantitative measure of the semantic affinity between words by capturing the frequency with which they
appear together in a given context. Next, by minimising the difference between the dot product of vectors
and the pointwise mutual information of corresponding words, GloVe optimises word vectors. GloVe is able
to produce dense vector representations that capture syntactic and semantic relationships thanks to its
innovative methodology.
Create Vocabulary Dictionary
Vocabulary is the collection of all unique words present in the training dataset. The first dataset is tokenized
into words, then all the frequency of each word is counted. Then words are sorted in decreasing order of
their frequencies. Words having high frequency are placed at the beginning of the dictionary.
Dataset= {The peon is ringing the bell}
Vocabulary= {'The':2, 'peon':1, 'is':1, 'ringing':1}

Algorithm for word embedding
•Preprocess the text data.
•Created the dictionary.
•Traverse the glove file of a specific dimension and compare each word with all words in the
dictionary,
•if a match occurs, copy the equivalent vector from the glove and paste into embedding_matrix at
the corresponding index.

Algorithm for Word Embedding
To generate glove embeddings using the GloVe algorithm, you will need to follow these steps:
•Step - 1:
Collect a large datasetof words and their co-oc currences. This can be a text corpus, such as
documents or web pages.
•Step - 2:
Preprocess the dataset by tokenizing the text into individual words and filtering out rare or
irrelevant words.
•Step - 3:
Construct a co-occurrence matrix that counts the number of times each word appears in the same
context as every other word.
•Step - 4:
Use the co-occurrence matrix to compute the word embeddings using the GloVe algorithm. This
involves training a model to minimize the error between the word vectors' dot product and the co-
occurrence counts' logarithm
•Step - 5:
Save the resulting word embeddings to a file or use them directly in your model.
•.

Training
Here are the general steps to train a GloVe model:
1.Prepare the training corpus:
The first step is to prepare a large corpus of text that we will use to train the GloVe model. This corpus
should be representative of the text that the model will be used on.
2.Tokenize the corpus:
The next step is to tokenize the text in the corpus. Tokenization is the process of breaking the text into
individual words or tokens. This step is necessary because we will train the GloVe model on individual
words.
3.Create the word-word co-occurrence matrix:
Once the text is tokenized, the next step is to create a word-word co-occurrence matrix. This matrix tracks
how frequently words occur alongside one another in the corpus. The matrix is typically sparse, which
means that most entries are zero.
4.Collect co-occurrence statistics:
This step is the most computationally expensive part of the process because the algorithm goes through
the whole corpus once to collect the statistics for the matrix; however, this step is only required once.
5.Train the GloVe model:
Once the co-occurrence statistics are collected, we can start training. The GloVe algorithm learns the word
vectors by solving an optimization problem on the co-occurrence matrix. This process is done iteratively,
adjusting the vectors after each iteration until it reaches a satisfactory solution.
6.Use the trained model:
Once the model is trained, we can use it for various natural languages processing tasks such as text

T-SNE
Many of you already heard about dimensionality reduction algorithms like PCA. One of those algorithms is
called t-SNE (t-distributed Stochastic Neighbor Embedding). It was developed by Laurens van der Maaten
and Geoffrey Hinton in 2008. You might ask “Why I should even care? I know PCA already!”, and that would
be a great question. t-SNE is something called nonlinear dimensionality reduction. What that means is
this algorithm allows us to separate data that cannot be separated by any straight line, let me show you an
example:

t-SNE, or t-Distributed Stochastic Neighbor Embedding, is a method for converting Visualize high-
dimensional data into a more manageable form, typically 2D or 3D, facilitating better comprehension.
Here’s how it operates:
1.Measure Similarity: Initially, it assesses the similarity between each pair of data points within the high-
dimensional space.
2.Map to Lower Dimensions: Subsequently, it endeavors to represent these similarities in a simplified
space while preserving the relationships between the points as faithfully as possible.
3.Adjust Positions: It iteratively adjusts the positions of points in the simplified space to align the
similarities as closely as feasible.
4.Visualize: Finally, it plots these points in the simplified space, often on a graph, facilitating easier
pattern recognition and analysis.
To utilize t SNE, one must prepare the data point, execute the t-SNE algorithm (accessible in libraries like
scikit-learn or TensorFlow), and visualize the outcomes. It is a valuable tool for comprehending intricate
datasets.

T-distributed Stochastic Neighbor Embedding (t-SNE) is a machine learning algorithm for visualization
developed by Laurens van der Maaten and Geoffrey Hinton It is a nonlinear dimensionality reduction
technique well-suited for embedding high-dimensional data for visualization in a low-dimensional
space of two or three dimensions. Specifically, it models each high-dimensional object by a two- or
three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar
objects are modeled by distant points with high probability.
This kernel will be devided into the following parts
1.Data Exploration
2.Data Preprocessing
3.Data Vizualization

NLP WORDEMBEDDDING TECHINUES CBOW BOW.pptx

More Related Content

Similar to NLP WORDEMBEDDDING TECHINUES CBOW BOW.pptx (20)

Recently uploaded (20)

NLP WORDEMBEDDDING TECHINUES CBOW BOW.pptx