
V- SEM/III B.E. CSE Prepared By: R. Reshma/AP/CSE

CCS369- TEXT AND SPEECH ANALYSIS


LECTURE NOTES
UNIT II TEXT CLASSIFICATION 6
Vector Semantics and Embeddings -Word Embeddings - Word2Vec model – Glove model – FastText model – Overview
of Deep Learning models – RNN – Transformers – Overview of Text summarization and Topic Models

INTRODUCTION
Text classification is a machine learning technique that assigns a set of predefined categories to open-ended text. Text
classifiers can be used to organize, structure, and categorize pretty much any kind of text – from documents, medical studies,
and files to content from all over the web. For example, news articles can be organized by topic; support tickets can be organized by
urgency; chat conversations can be organized by language; brand mentions can be organized by sentiment; and so on.

Text classification is one of the fundamental tasks in natural language processing with broad applications such as sentiment
analysis, topic labeling, spam detection, and intent detection. Here’s an example of how it works:

“The user interface is quite straightforward and easy to use.”

A text classifier can take this phrase as input, analyze its content, and then automatically assign relevant tags, such as UI
and Easy To Use.

Fig. Basic Flow of Text Classification

Some Examples of Text Classification:


• Sentiment Analysis.
• Language Detection.
• Fraud, Profanity & Online Abuse Detection.
• Detecting Trends in Customer Feedback.
• Urgency Detection in Customer Support.

Why is Text Classification Important?


It’s estimated that around 80% of all information is unstructured, with text being one of the most common types of
unstructured data. Because of the messy nature of text, analyzing, understanding, organizing, and sorting through text data
is hard and time-consuming, so most companies fail to use it to its full potential. This is where text classification with
machine learning comes in. Using text classifiers, companies can automatically structure all manner of relevant text, from
emails, legal documents, social media, chatbots, surveys, and more in a fast and cost-effective way. This allows companies
to save time analyzing text data, automate business processes, and make data-driven business decisions.

Why use machine learning text classification?


Some of the top reasons:
• Scalability: Manually analyzing and organizing is slow and much less accurate. Machine learning can
automatically analyze millions of surveys, comments, emails, etc., at a fraction of the cost, often in just a few
minutes. Text classification tools are scalable to any business needs, large or small.


• Real-time analysis: There are critical situations that companies need to identify as soon as possible and take
immediate action (e.g., PR crises on social media). Machine learning text classification can follow your brand
mentions constantly and in real-time, so you'll identify critical information and be able to take action right away.
• Consistent criteria: Human annotators make mistakes when classifying text data due to distractions, fatigue, and
boredom, and human subjectivity creates inconsistent criteria. Machine learning, on the other hand, applies the
same lens and criteria to all data and results. Once a text classification model is properly trained it performs with
unsurpassed accuracy.

We can perform text classification in two ways: manual or automatic.

• Manual text classification involves a human annotator, who interprets the content of text and categorizes it
accordingly. This method can deliver good results but it’s time-consuming and expensive.

• Automatic text classification applies machine learning, natural language processing (NLP), and other AI-guided
techniques to automatically classify text in a faster, more cost-effective, and more accurate manner.

There are many approaches to automatic text classification, but they all fall under three types of systems:

• Rule-based systems
• Machine learning-based systems
• Hybrid systems

Rule-based systems
Rule-based approaches classify text into organized groups by using a set of handcrafted linguistic rules. These rules instruct
the system to use semantically relevant elements of a text to identify relevant categories based on its content. Each rule
consists of an antecedent or pattern and a predicted category.

Example: Say that you want to classify news articles into two groups: Sports and Politics. First, you’ll need to define two
lists of words that characterize each group (e.g., words related to sports such as football, basketball, LeBron James, etc.,
and words related to politics, such as Donald Trump, Hillary Clinton, Putin, etc.). Next, when you want to classify a new
incoming text, you’ll need to count the number of sport-related words that appear in the text and do the same for politics-
related words. If the number of sports-related word appearances is greater than the politics-related word count, then the text
is classified as Sports and vice versa. For example, this rule-based system will classify the headline “When is LeBron James'
first game with the Lakers?” as Sports because it counted one sports-related term (LeBron James) and it didn’t count any
politics-related terms.

Rule-based systems are human comprehensible and can be improved over time. But this approach has some disadvantages.
For starters, these systems require deep knowledge of the domain. They are also time-consuming, since generating rules for
a complex system can be quite challenging and usually requires a lot of analysis and testing. Rule-based systems are also
difficult to maintain and don’t scale well given that adding new rules can affect the results of the pre-existing rules.
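
As a minimal sketch of the rule described above (the keyword lists and the tie-breaking choice are illustrative assumptions, not a production rule set):

```python
# Minimal rule-based classifier: count topic keywords and pick the larger count.
# The keyword lists are illustrative assumptions, not an exhaustive rule set.
SPORTS_WORDS = {"football", "basketball", "lebron", "james", "lakers", "game"}
POLITICS_WORDS = {"trump", "clinton", "putin", "election", "senate"}

def classify(text: str) -> str:
    tokens = text.lower().replace("?", " ").replace("'", " ").split()
    sports_hits = sum(token in SPORTS_WORDS for token in tokens)
    politics_hits = sum(token in POLITICS_WORDS for token in tokens)
    return "Sports" if sports_hits >= politics_hits else "Politics"

print(classify("When is LeBron James' first game with the Lakers?"))  # -> Sports
```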

Machine learning-based systems


Instead of relying on manually crafted rules, machine learning text classification learns to make classifications based on
past observations. By using pre-labeled examples as training data, machine learning algorithms can learn the different
associations between pieces of text, and that a particular output (i.e., tags) is expected for a particular input (i.e., text). A
“tag” is the pre-determined classification or category that any given text could fall into.

The first step towards training a machine learning NLP classifier is feature extraction: a method used to transform each text
into a numerical representation in the form of a vector. One of the most frequently used approaches is the bag of words,
where a vector represents the frequency of a word in a predefined dictionary of words.


For example, if we have defined our dictionary to have the following words {This, is, the, not, awesome, bad, basketball},
and we wanted to vectorize the text “This is awesome,” we would have the following vector representation of that text: (1,
1, 0, 0, 1, 0, 0). Then, the machine learning algorithm is fed with training data that consists of pairs of feature sets (vectors
for each text example) and tags (e.g. sports, politics) to produce a classification model:

Fig. Training process in Text Classification
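
The vectorization step itself can be sketched in a few lines of Python, using the toy dictionary from the example above:

```python
# Bag-of-words vectorization against a fixed dictionary (toy example from the text).
dictionary = ["This", "is", "the", "not", "awesome", "bad", "basketball"]

def vectorize(text: str) -> list[int]:
    tokens = text.lower().split()
    # Count how often each dictionary word appears in the text.
    return [tokens.count(word.lower()) for word in dictionary]

print(vectorize("This is awesome"))  # -> [1, 1, 0, 0, 1, 0, 0]
```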

Once it’s trained with enough training samples, the machine learning model can begin to make accurate predictions. The
same feature extractor is used to transform unseen text to feature sets, which can be fed into the classification model to get
predictions on tags (e.g., sports, politics):

Fig. Prediction process in Text Classification


Text classification with machine learning is usually much more accurate than human-crafted rule systems, especially on
complex NLP classification tasks. Also, classifiers with machine learning are easier to maintain and you can always tag
new examples to learn new tasks.

Machine Learning Text Classification Algorithms


Some of the most popular text classification algorithms include the Naive Bayes family of algorithms, support vector
machines (SVM), and deep learning.
• Naive Bayes
The Naive Bayes family of statistical algorithms are some of the most used algorithms in text classification and text analysis
overall. One member of that family is Multinomial Naive Bayes (MNB), which has a huge advantage: you can get
really good results even when your dataset isn't very large (~ a couple of thousand tagged samples) and computational
resources are scarce. Naive Bayes is based on Bayes's Theorem, which helps us compute the conditional probabilities of


the occurrence of two events, based on the probabilities of the occurrence of each individual event. So we’re calculating the
probability of each tag for a given text, and then outputting the tag with the highest probability.

Naive Bayes formula: P(A|B) = P(B|A) × P(A) / P(B)

The probability of A, if B is true, is equal to the probability of B, if A is true, times the probability of A being true, divided
by the probability of B being true. This means that any vector that represents a text will have to contain information about
the probabilities of the appearance of certain words within the texts of a given category so that the algorithm can compute
the likelihood of that text belonging to the category.
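
As a minimal, hedged sketch of this idea with scikit-learn (the tiny training set is invented purely for illustration; real classifiers need far more data):

```python
# Bag-of-words features + Multinomial Naive Bayes in one scikit-learn pipeline.
# The training texts and tags below are invented for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "the team won the basketball game",
    "lebron james scored in the final",
    "the senate passed the new election bill",
    "the president met foreign leaders today",
]
train_tags = ["sports", "sports", "politics", "politics"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_tags)

print(model.predict(["when is the next basketball game"]))  # likely ['sports']
```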

• Support Vector Machines


Support Vector Machines (SVM) is another powerful text classification machine learning algorithm because like Naive
Bayes, SVM doesn’t need much training data to start providing accurate results. SVM does, however, require more
computational resources than Naive Bayes, but the results are even faster and more accurate. In short, SVM draws a line or
“hyperplane” that divides a space into two subspaces. One subspace contains vectors (tags) that belong to a group, and
another subspace contains vectors that do not belong to that group.

Fig. Optimal SVM Hyperplane

The optimal hyperplane is the one with the largest margin between the two groups of vectors. Those
vectors are representations of your training texts, and a group is a tag you have tagged your texts with. As data gets more
complex, it may not be possible to separate the vectors into two groups with a simple straight line, and the dividing boundary has to take a more complex shape.
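
A similarly small sketch with a linear SVM; the toy data is invented and the TF-IDF features are one common (assumed) choice, not the only one:

```python
# Linear SVM text classifier on TF-IDF features (toy, invented data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["great match and a late winning goal",
         "parliament debates the new tax law",
         "the striker scored twice tonight",
         "the minister announced election reforms"]
tags = ["sports", "politics", "sports", "politics"]

svm = make_pipeline(TfidfVectorizer(), LinearSVC())
svm.fit(texts, tags)
print(svm.predict(["who scored the goal"]))  # likely ['sports']
```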

• Deep Learning
Deep learning is a set of algorithms and techniques inspired by how the human brain works, called neural networks. Deep
learning architectures offer huge benefits for text classification because they perform at super high accuracy with lower-
level engineering and computation. The two main deep learning architectures for text classification are Convolutional Neural
Networks (CNN) and Recurrent Neural Networks (RNN). Deep learning is hierarchical machine learning, using multiple
algorithms in a progressive chain of events. It’s similar to how the human brain works when making decisions, using
different techniques simultaneously to process huge amounts of data.

Deep learning algorithms do require much more training data than traditional machine learning algorithms (at least millions
of tagged examples). However, unlike traditional machine learning algorithms such as SVM, they don't have a ceiling on
learning from training data: deep learning classifiers continue to get better the more data you feed them. Embedding
algorithms such as Word2Vec or GloVe are also used to obtain better vector representations for words and to
improve the accuracy of classifiers trained with traditional machine learning algorithms.


• Hybrid Systems
Hybrid systems combine a machine learning-trained base classifier with a rule-based system, used to further improve the
results. These hybrid systems can be easily fine-tuned by adding specific rules for those conflicting tags that haven’t been
correctly modeled by the base classifier.

VECTOR SEMANTICS AND EMBEDDINGS

Vector semantics is the standard way to represent word meaning in NLP, helping us model many of the
aspects of word meaning. The idea of vector semantics is to represent a word as a point in a multidimensional semantic
space that is derived from the distributions of its word neighbors. Vectors for representing words are called
embeddings (although the term is sometimes more strictly applied only to dense vectors like word2vec). Vector semantics
defines and interprets word meaning to explain features such as word similarity. Its central idea is: two words are
similar if they appear in similar contexts.

In its current form, the vector model draws its inspiration from linguistic and philosophical work of the 1950s. Vector
semantics represents a word in a multi-dimensional vector space. The vector model is also called embeddings, because
a word is embedded in a particular vector space. The vector model offers many advantages in NLP. For example, in
sentiment analysis it sets up a decision boundary and predicts whether the sentiment is positive or negative (a binary classification).
Another key practical advantage of vector semantics is that it can learn automatically from text without complex labeling
or supervision. As a result of these advantages, vector semantics has become a de-facto standard for NLP applications such
as sentiment analysis, named entity recognition (NER), topic modeling, and so on.
WORD EMBEDDINGS
It is an approach for representing words and documents. Word Embedding or Word Vector is a numeric vector input that
represents a word in a lower-dimensional space. It allows words with similar meaning to have a similar representation. They
can also approximate meaning. A word vector with 50 values can represent 50 unique features.
Features: Anything that relates words to one another. Eg: Age, Sports, Fitness, Employed etc. Each word vector has values
corresponding to these features.

Goal of Word Embeddings


• To reduce dimensionality
• To use a word to predict the words around it
• Inter word semantics must be captured
How are Word Embeddings used?
• They are used as input to machine learning models.
• Take the words —-> Give their numeric representation —-> Use in training or inference
• To represent or visualize any underlying patterns of usage in the corpus that was used to train them.

Implementations of Word Embeddings:

Word embeddings are a method of extracting features out of text so that we can input those features into a machine learning
model to work with text data. They try to preserve syntactic and semantic information. Methods such as Bag of
Words (BOW), CountVectorizer and TF-IDF rely on the word count in a sentence but do not retain any syntactic or semantic
information. In these algorithms, the size of the vector is the number of elements in the vocabulary, which yields a sparse
matrix in which most of the elements are zero. Large input vectors mean a huge number of weights, which results in high
computation required for training. Word embeddings give a solution to these problems.
Let's take an example to understand how a word vector is generated. Take emoticons that are most frequently used in
certain situations and transform each emoji into a vector, with those situations as our features: happy, sad, excited, sick.

With the feature order [happy, sad, excited, sick], each emoji gets a vector such as:

emoji_1 = [1, 0, 1, 0]
emoji_2 = [0, 1, 0, 1]
emoji_3 = [0, 0, 1, 1]
.....

In a similar way, we can create word vectors for different words as well on the basis of given features. The words with
similar vectors are most likely to have the same meaning or are used to convey the same sentiment. There are two different
approaches for getting Word Embeddings:

1) Word2Vec:
In Word2Vec every word is assigned a vector. We start with either a random vector or one-hot vector.
One-Hot vector: A representation where only one bit in a vector is 1. If there are 500 words in the corpus then the vector
length will be 500. After assigning vectors to each word we take a window size and iterate through the entire corpus. While
we do this there are two neural embedding methods which are used:

1.1) Continuous Bag of Words (CBOW)


In this model, we try to predict the central word from the neighboring words in the window.

CBOW Architecture.

This architecture is very similar to a feed-forward neural network. This model architecture essentially tries to
predict a target word from a list of context words.

The intuition behind this model is quite simple: given a phrase "Have a great day", we will choose our target
word to be “a” and our context words to be [“have”, “great”, “day”]. What this model will do is take the
distributed representations of the context words to try and predict the target word.


The English language contains almost 1.2 million words, making it impossible to include so many words in our
example. So we'll consider a small example in which we have only four words: live, home, they and at. For
simplicity, we will consider that the corpus contains only one sentence: 'They live at home'.

First, we convert each word into a one-hot encoded form. Also, we will not consider all the words in the sentence
at once but only the words that fall in a window. For example, for a window size of three, we only
consider three words of the sentence at a time. The middle word is to be predicted and the surrounding two words are fed
into the neural network as context. The window is then slid along and the process is repeated.

Finally, after training the network repeatedly by sliding the window as shown above, we get weights, which we
then use to obtain the embeddings.

Usually, we take a window size of around 8-10 words and a vector size of 300.
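
A small sketch of how the (context, target) pairs and one-hot vectors could be generated for the toy sentence above, assuming a window of three words:

```python
# Generate (context, target) pairs and one-hot vectors for a CBOW-style setup.
# Toy corpus and window size follow the example in the text.
import numpy as np

sentence = "they live at home".split()
vocab = sorted(set(sentence))                      # ['at', 'home', 'live', 'they']
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word: str) -> np.ndarray:
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0
    return vec

window = 3  # the middle word is the target, its two neighbours are the context
pairs = []
for i in range(len(sentence) - window + 1):
    left, target, right = sentence[i:i + window]
    pairs.append(([left, right], target))

print(pairs)            # [(['they', 'at'], 'live'), (['live', 'home'], 'at')]
print(one_hot("live"))  # [0. 0. 1. 0.]
```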


1.2) Skip Gram


In this model, we try to make the central word closer to the neighboring words. It is the complete opposite of the CBOW
model. It is shown that this method produces more meaningful embeddings.

After applying the above neural embedding methods we get trained vectors of each word after many iterations through the
corpus. These trained vectors preserve syntactical or semantic information and are converted to lower dimensions. The
vectors with similar meaning or semantic information are placed close to each other in space.

The Skip-gram model architecture usually tries to achieve the reverse of what the CBOW model does. It tries to
predict the source context words (surrounding words) given a target word (the centre word)
The working of the skip-gram model is quite similar to the CBOW but there is just a difference in the architecture
of its neural network and the way the weight matrix is generated as shown in the figure below:

After obtaining the weight matrix, the steps to get the word embeddings are the same as in CBOW.


So which of the two algorithms should we use for implementing word2vec? It turns out that for a large corpus
with higher dimensions, skip-gram works better but is slower to train, whereas CBOW works better for a small corpus
and is also faster to train.
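
In practice, word2vec is rarely implemented from scratch; here is a short sketch with the gensim library, where the sg flag switches between CBOW (sg=0) and skip-gram (sg=1). The toy corpus is invented, so the resulting vectors are not meaningful:

```python
# Train tiny Word2Vec models with gensim; sg=0 selects CBOW, sg=1 selects skip-gram.
from gensim.models import Word2Vec

corpus = [["they", "live", "at", "home"],
          ["it", "is", "a", "nice", "evening"],
          ["good", "evening"]]

cbow = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, sg=0)
skipgram = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, sg=1)

print(cbow.wv["evening"].shape)                  # (100,)
print(skipgram.wv.most_similar("evening", topn=2))
```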

2) GloVe:
This is another method for creating word embeddings. In this method, we take the corpus and iterate through it and get the
co-occurrence of each word with other words in the corpus. We get a co-occurrence matrix through this. The words which
occur next to each other get a value of 1, if they are one word apart then 1/2, if two words apart then 1/3 and so on.
Let us take an example to understand how the matrix is created. We have a small corpus:
Corpus:
It is a nice evening.
Good Evening!
Is it a nice evening?
The lower-triangular co-occurrence matrix for this corpus (rows and columns in the order it, is, a, nice, evening, good) is:

           it         is         a          nice    evening   good
it         0
is         1+1        0
a          1/2+1      1+1/2      0
nice       1/3+1/2    1/2+1/3    1+1        0
evening    1/4+1/3    1/3+1/4    1/2+1/2    1+1     0
good       0          0          0          0       1         0

The upper half of the matrix will be a reflection of the lower half. We can consider a window frame as well to calculate the
co-occurrences by shifting the frame till the end of the corpus. This helps gather information about the context in which each
word is used. Initially, the vectors for each word are assigned randomly. Then we take two pairs of vectors and see how close
they are to each other in space. If they occur together more often (i.e., have a higher value in the co-occurrence matrix) but are
far apart in space, then they are brought closer to each other. If they are close to each other in space but are rarely or never
used together, then they are moved further apart in space.
After many iterations of the above process, we’ll get a vector space representation that approximates the information from
the co-occurrence matrix. The performance of GloVe is better than Word2Vec in terms of both semantic and syntactic
capturing.
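
The counting step described above can be sketched as follows; this only builds the distance-weighted co-occurrence matrix for the toy corpus and is not the full GloVe training procedure:

```python
# Distance-weighted co-occurrence counts: words d positions apart contribute 1/d.
from collections import defaultdict

sentences = [["it", "is", "a", "nice", "evening"],
             ["good", "evening"],
             ["is", "it", "a", "nice", "evening"]]

cooc = defaultdict(float)
for sent in sentences:
    for i, w1 in enumerate(sent):
        for j in range(i + 1, len(sent)):
            w2, distance = sent[j], j - i
            cooc[tuple(sorted((w1, w2)))] += 1.0 / distance

print(cooc[("evening", "good")])   # 1.0
print(cooc[("evening", "nice")])   # 1 + 1 = 2.0
print(cooc[("a", "is")])           # 1 + 1/2 = 1.5
```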

Pre-trained Word Embedding Models:


People generally use pre-trained models for word embeddings. Few of them are:
• SpaCy
• fastText
• Flair etc.
Common errors made:
• You need to use the exact same preprocessing pipeline when deploying your model as was used to create the training data for
the word embedding. If you use a different tokenizer or a different method of handling whitespace, punctuation, etc.,
you might end up with incompatible inputs.
• Words in your input that don't have a pre-trained vector. Such words are known as Out-of-Vocabulary (OOV)
words. What you can do is replace those words with "UNK", which means unknown, and then handle them
separately (see the sketch after this list).
• Dimension mismatch: vectors can be of many lengths. If you train a model with vectors of length, say, 400 and then
try to apply vectors of length 1000 at inference time, you will run into errors. So make sure to use the same
dimensions throughout.
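
A tiny sketch of the OOV handling mentioned in the second point above; the two-dimensional embedding table is a made-up toy, with real values coming from a trained model:

```python
# Replace out-of-vocabulary tokens with "UNK" before looking up embeddings.
import numpy as np

embeddings = {
    "the": np.array([0.1, 0.2]),
    "cat": np.array([0.3, 0.1]),
    "UNK": np.zeros(2),            # fallback vector for unknown words
}

def embed(tokens):
    return [embeddings.get(tok, embeddings["UNK"]) for tok in tokens]

print(embed(["the", "cat", "zyzzyva"]))   # the last token falls back to UNK
```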


Benefits of using Word Embeddings:


• They are much faster to train than hand-built models like WordNet (which uses graph embeddings)
• Almost all modern NLP applications start with an embedding layer
• They store an approximation of meaning
Drawbacks of Word Embeddings:
• They can be memory intensive
• They are corpus dependent; any underlying bias will have an effect on your model
• They cannot distinguish between homophones, e.g., brake/break, cell/sell, weather/whether

WORD2VEC MODEL IN DETAIL

Word embedding is a technique where individual words are transformed into a numerical representation (a vector).
Each word is mapped to one vector, and this vector is then learned in a way that resembles
a neural network. The vectors try to capture various characteristics of that word with regard to the overall text.
These characteristics can include the semantic relationship of the word, definitions, context, etc. With these
numerical representations, you can do many things, like identify similarity or dissimilarity between words.
Clearly, these are integral as inputs to various aspects of machine learning. A machine cannot process text in its
raw form, so converting the text into an embedding allows users to feed the embedding to classic machine
learning models. The simplest embedding would be a one-hot encoding of text data where each vector is
mapped to a category.

For example:
have = [1, 0, 0, 0, 0, 0, ... 0]
a = [0, 1, 0, 0, 0, 0, ... 0]
good = [0, 0, 1, 0, 0, 0, ... 0]
day = [0, 0, 0, 1, 0, 0, ... 0] ...

However, there are multiple limitations of simple embeddings such as this, as they do not capture characteristics
of the word, and they can be quite large depending on the size of the corpus.

Word2Vec Architecture
The effectiveness of Word2Vec comes from its ability to group together vectors of similar words. Given a large
enough dataset, Word2Vec can make strong estimates about a word’s meaning based on their occurrences in the
text. These estimates yield word associations with other words in the corpus. For example, words like “King” and
“Queen” would be very similar to one another. When conducting algebraic operations on word embeddings you
can find a close approximation of word similarities. For example, the 2-dimensional embedding vector of "king"
- the 2-dimensional embedding vector of "man" + the 2-dimensional embedding vector of "woman" yielded a
vector which is very close to the embedding vector of "queen". Note, that the values below were chosen
arbitrarily.

King − Man + Woman = Queen

[5, 3] − [2, 1] + [3, 2] = [6, 4]
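
With a pre-trained model, this analogy can be checked directly; here is a sketch using gensim's downloader and the publicly hosted 'word2vec-google-news-300' vectors (a large download, so treat this as illustrative):

```python
# Word analogy with pre-trained vectors loaded through gensim's downloader.
# The model is roughly 1.6 GB, so this is an illustrative sketch rather than a quick demo.
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")   # pre-trained Google News vectors

# king - man + woman ~= queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```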


Graphical representation of a Word2Vec embedding (King and Queen are close to each other in position)

There are two main architectures that yield the success of word2vec.
• Skip-gram
• CBOW architectures. (Refer Previous Section)

Continuous Skip-Gram Model

The skip-gram model is a simple neural network with one hidden layer trained in order to predict the probability
of a given word being present when an input word is present. Intuitively, you can imagine the skip-gram model
being the opposite of the CBOW model. In this architecture, it takes the current word as an input and tries to
accurately predict the words before and after this current word. This model essentially tries to learn and predict
the context words around the specified input word. Based on experiments assessing the accuracy of this model it
was found that the prediction quality improves given a large range of word vectors, however it also increases the
computational complexity. The process can be described visually as seen below.

Training Data Generation Model


Example of generating training data for the skip-gram model. Window size is 3.
As seen above, given some corpus of text, a target word is selected over some rolling window. The training data
consists of pairwise combinations of that target word and all other words in the window. This is the resulting
training data for the neural network. Once the model is trained, we can essentially yield a probability of a word
being a context word for a given target. The following image below represents the architecture of the neural
network for the skip-gram model.


Skip-Gram Model architecture

The vocabulary of a corpus can be represented as a vector of size N, where each element corresponds to a word in the corpus.
During the training process, we have a pair of target and context words; the input array has 0 in all elements
except the one for the target word, which is set to 1. The hidden layer learns the embedding
representation of each word, yielding a d-dimensional embedding space. The output layer is a dense layer with a
softmax activation function. The output layer essentially yields a vector of the same size as the input, where each
element is a probability. This probability indicates how likely the associated vocabulary word is to appear in the
context of the target word.
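
A bare-bones numpy sketch of this forward pass: a one-hot target word selects its row of the hidden (embedding) matrix, and a softmax over the output layer scores every vocabulary word as a possible context word. The weights are random placeholders standing in for trained values:

```python
# Skip-gram style forward pass: one-hot input -> hidden embedding -> softmax output.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 6, 4                       # N words, d-dimensional embeddings

W_in = rng.normal(size=(vocab_size, embed_dim))    # input -> hidden (the embeddings)
W_out = rng.normal(size=(embed_dim, vocab_size))   # hidden -> output

x = np.zeros(vocab_size)
x[2] = 1.0                                         # one-hot vector for the target word

hidden = x @ W_in                                  # the target word's embedding
scores = hidden @ W_out
probs = np.exp(scores) / np.exp(scores).sum()      # softmax over the vocabulary

print(probs.round(3), probs.sum())                 # one probability per word, sums to 1
```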

GLOVE MODEL

GloVe, coined from Global Vectors, is an unsupervised learning algorithm for obtaining vector representations for words.
Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting
representations showcase interesting linear substructures of the word vector space. It is a model for distributed word
representation. The model is an unsupervised learning algorithm for obtaining vector representations for words. This is
achieved by mapping words into a meaningful space where the distance between words is related to semantic similarity.

The statistics of word occurrences in a corpus is the primary source of information available to all unsupervised methods
for learning word representations, and although many such methods now exist, the question still remains as to how meaning
is generated from these statistics, and how the resulting word vectors might represent that meaning. In this section, we shed
some light on this question. We use our insights to construct a new model for word representation which we call GloVe, for
Global Vectors, because the global corpus statistics are captured directly by the model.

GloVe model combines the advantages of the two major model families in the literature:
• global matrix factorization
• local context window methods

This model efficiently leverages statistical information by training only on the nonzero elements in a word-word
cooccurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus.

• Matrix Factorization Methods: methods that reduce a matrix into constituent parts, making it easier to calculate
more complex matrix operations.
• Shallow Window-Based Methods: Another approach is to learn word representations that aid in making
predictions within local context windows.
Refer the previous section for example.
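
In practice, GloVe vectors are usually loaded pre-trained rather than trained from scratch; here is a sketch using gensim's downloader with the 'glove-wiki-gigaword-100' dataset name (a sizeable download, shown for illustration):

```python
# Load 100-dimensional pre-trained GloVe vectors (Wikipedia + Gigaword) via gensim.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")

print(glove["evening"][:5])                   # first 5 dimensions of one word vector
print(glove.most_similar("evening", topn=3))  # nearest neighbours in the vector space
```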


FASTTEXT MODEL

This model allows training word embeddings from a training corpus, with the additional ability to obtain word vectors for
out-of-vocabulary words. FastText is an open-source, free, lightweight library that allows users to learn text representations
and text classifiers. It works on standard, generic hardware, and models can later be reduced in size to fit even on mobile
devices. fastText embeddings exploit subword information to construct word embeddings, and FastText is more stable than
the Word2Vec architecture. In FastText, each word is represented as the average of the vector representations of its character n-
grams along with the word itself. So, the word embedding for the word 'equal' can be given as the sum of the vector
representations of all of its character n-grams and the word itself.

Word embedding techniques like word2vec and GloVe provide distinct vector representations for the words in the
vocabulary. This leads to ignorance of the internal structure of the language. This is a limitation for morphologically rich
language as it ignores the syntactic relation of the words. As many word formations follow the rules in morphologically rich
languages, it is possible to improve vector representations for these languages by using character-level information.
To improve vector representation for morphologically rich language, FastText provides embeddings for character n-grams,
representing words as the average of these embeddings. It is an extension of the word2vec model. Word2Vec model provides
embedding to the words, whereas fastText provides embeddings to the character n-grams. Like the word2vec model,
fastText uses CBOW and Skip-gram to compute the vectors.

FastText can also handle out-of-vocabulary words, i.e., the fast text can find the word embeddings that are not present at
the time of training.

Out-of-vocabulary (OOV) words are words that do not occur while training the data and are not present in the model’s
vocabulary. Word embedding models like word2vec or GloVe cannot provide embeddings for the OOV words because they
provide embeddings for words; hence, if a new word occurs, it cannot provide embedding. Since FastText provides
embeddings for character n-grams, it can provide embeddings for OOV words. If an OOV word occurs, then fastText
provides embedding for that word by embedding its character n-gram.

Understanding the Working of FastText


In FastText, each word is represented as the average of the vector representation of its character n-grams along with the
word itself.
Consider the word “equal” and n = 3, then the word will be represented by character n-grams:

< eq, equ, qua, ual, al > and < equal >

So, the word embedding for the word 'equal' is given as the sum of the vector representations of all of its character
n-grams and the word itself.
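
A small helper that reproduces this character n-gram decomposition, with '<' and '>' marking the word boundaries as fastText does:

```python
# Character n-grams of a word, with boundary markers (n = 3).
def char_ngrams(word: str, n: int = 3) -> list[str]:
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("equal"))                 # ['<eq', 'equ', 'qua', 'ual', 'al>']
print(char_ngrams("equal") + ["<equal>"])   # the n-grams plus the whole word itself
```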

Continuous Bag Of Words (CBOW) for FastText:


In the Continuous Bag Of Words (CBOW) model, we take the context of the target word as input and predict the target word
that occurs within that context.

For example, in the sentence “ I want to learn FastText.” In this sentence, the words “I,” “want,” “to,” and “FastText” are
given as input, and the model predicts “learn” as output.

All the input and output data are in the same dimension and have one-hot encoding. It uses a neural network for training.
The neural network has an input layer, a hidden layer, and an output layer. Figure 1.2 shows the working of CBOW for
FastText.


Skip-gram for FastText


Skip-gram works like CBOW, but the input is the target word, and the model predicts the context of the given word. It
also uses neural networks for training. Figure 1.3 shows the working of Skip-gram.

Highlighting the Difference: Word2Vec vs. FastText

FastText can be viewed as an extension to word2vec. Some of the significant differences between word2vec and fastText
are as follows:
• Word2Vec works at the word level, while fastText works on character n-grams.
• Word2Vec cannot provide embeddings for out-of-vocabulary words, while fastText can provide embeddings for
OOV words.
• FastText can provide better embeddings for morphologically rich languages compared to word2vec.
• FastText uses a hierarchical softmax classifier to train the model; hence it is faster than word2vec.
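
A short sketch of these points using gensim's FastText implementation; the toy corpus is invented, so the vectors themselves are not meaningful, but the out-of-vocabulary lookup still works because it is built from character n-grams:

```python
# Train a tiny fastText model with gensim and query an out-of-vocabulary word.
from gensim.models import FastText

corpus = [["i", "want", "to", "learn", "fasttext"],
          ["fasttext", "provides", "subword", "embeddings"]]

model = FastText(sentences=corpus, vector_size=50, window=3, min_count=1, sg=1)

print("learning" in model.wv.key_to_index)   # False: never seen during training
print(model.wv["learning"].shape)            # (50,) - assembled from character n-grams
```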


OVERVIEW OF DEEP LEARNING MODELS – RNN


Deep learning is a machine learning technique that teaches computers to do what comes naturally to humans: learn by
example. Deep learning is a key technology behind driverless cars, enabling them to recognize a stop sign, or to distinguish
a pedestrian from a lamppost. It is the key to voice control in consumer devices like phones, tablets, TVs, and hands-free
speakers. Deep learning is getting lots of attention lately and for good reason. It’s achieving results that were not possible
before.

In deep learning, a computer model learns to perform classification tasks directly from images, text, or sound. Deep learning
models can achieve state-of-the-art accuracy, sometimes exceeding human-level performance. Models are trained by using
a large set of labeled data and neural network architectures that contain many layers.

While deep learning was first theorized in the 1980s, there are two main reasons it has only recently become useful:
1. Deep learning requires large amounts of labeled data. For example, driverless car development requires millions
of images and thousands of hours of video.
2. Deep learning requires substantial computing power. High-performance GPUs have a parallel architecture that is
efficient for deep learning. When combined with clusters or cloud computing, this enables development teams to
reduce training time for a deep learning network from weeks to hours or less.
Deep learning applications are used in industries from automated driving to medical devices.

• Automated Driving: Automotive researchers are using deep learning to automatically detect objects such as stop
signs and traffic lights. In addition, deep learning is used to detect pedestrians, which helps decrease accidents.
• Aerospace and Defense: Deep learning is used to identify objects from satellites that locate areas of interest, and
identify safe or unsafe zones for troops.
• Medical Research: Cancer researchers are using deep learning to automatically detect cancer cells. Teams at UCLA
built an advanced microscope that yields a high-dimensional data set used to train a deep learning application to
accurately identify cancer cells.
• Industrial Automation: Deep learning is helping to improve worker safety around heavy machinery by
automatically detecting when people or objects are within an unsafe distance of machines.
• Electronics: Deep learning is being used in automated hearing and speech translation. For example, home assistance
devices that respond to your voice and know your preferences are powered by deep learning applications.

RNN
Recurrent Neural Network (RNN) is a type of Neural Network where the output from the previous step is fed as input to
the current step. In traditional neural networks, all the inputs and outputs are independent of each other, but in cases when
it is required to predict the next word of a sentence, the previous words are required and hence there is a need to remember
the previous words. Thus, RNN came into existence, which solved this issue with the help of a Hidden Layer. The main
and most important feature of RNN is its Hidden state, which remembers some information about a sequence. The state is
also referred to as Memory State since it remembers the previous input to the network. It uses the same parameters for each
input as it performs the same task on all the inputs or hidden layers to produce the output. This reduces the complexity of
parameters, unlike other neural networks.

Recurrent neural network


An RNN treats each word of a sentence as a separate input occurring at time 't' and also uses the activation value at 't-1'
as an input, in addition to the input at time 't'. The diagram below shows a detailed structure of an RNN architecture. The
architecture described above is also called a many-to-many architecture with (Tx = Ty), i.e., the number of inputs equals the number
of outputs. Such a structure is quite useful in sequence modelling.
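
A bare-bones numpy sketch of a single recurrent step, showing how the hidden state from time t-1 is combined with the input at time t; the weights are random placeholders for learned parameters:

```python
# One step of a vanilla RNN: h_t = tanh(W_xh @ x_t + W_hh @ h_prev + b).
import numpy as np

rng = np.random.default_rng(1)
input_dim, hidden_dim = 8, 4

W_xh = rng.normal(size=(hidden_dim, input_dim))   # input -> hidden weights
W_hh = rng.normal(size=(hidden_dim, hidden_dim))  # hidden -> hidden (recurrent) weights
b = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

# Run over a toy sequence of 5 word vectors, carrying the hidden state forward.
h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h = rnn_step(x_t, h)
print(h)   # hidden state after reading the whole sequence
```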

Apart from the architecture mentioned above there are three other types of architectures of RNN which are commonly used.
1. Many to One RNN : Many to one architecture refers to an RNN architecture where many inputs (Tx) are used to
give one output (Ty). A suitable example for using such an architecture will be a classification task.

RNNs are a very important variant of neural networks, heavily used in Natural Language Processing.

Conceptually, they differ from a standard neural network in that the standard input to an RNN is a word instead of the entire
sample, as in the case of a standard neural network. This gives the network the flexibility to work with varying lengths
of sentences, something which cannot be achieved in a standard neural network due to its fixed structure. It also provides
the additional advantage of sharing features learned across different positions of text, which cannot be obtained in a
standard neural network.

In the image above H represents the output of the activation function.

2. One to Many RNN: One to Many architecture refers to a situation where an RNN generates a series of output values
based on a single input value. A prime example of such an architecture is a music generation task, where the
input is a genre or the first note.

3. Many to Many Architecture (Tx not equals Ty): This architecture refers to where many inputs are read to produce
many outputs, where the length of inputs is not equal to the length of outputs. A prime example for using such an architecture
is machine translation tasks.


Encoder refers to the part of the network which reads the sentence to be translated, and, Decoder is the part of the
network which translates the sentence into desired language.

Limitations of RNN
Apart from all of its usefulness, the RNN does have certain limitations, the major ones being:
1. The RNN architectures stated above can capture dependencies in only one direction of the language.
Basically, in the case of Natural Language Processing, they assume that the word coming after has no effect on the
meaning of the word coming before. From our experience of languages, we know that this is certainly not true.
2. RNNs are also not very good at capturing long-term dependencies, and the problem of vanishing gradients resurfaces
in RNNs.

RNNs are ideal for solving problems where the sequence is more important than the individual items themselves.
An RNN is essentially a fully connected neural network that contains a refactoring of some of its layers into a loop. That
loop is typically an iteration over the addition or concatenation of two inputs, a matrix multiplication and a non-linear
function.
Among the text usages, the following tasks are among those RNNs perform well at:
• Sequence labelling
• Natural Language Processing (NLP) text classification
• Natural Language Processing (NLP) text generation
Other tasks that RNNs are effective at solving are time series predictions or other sequence predictions that aren’t image or
tabular-based.
There have been several highlighted and controversial reports in the media about advances in text generation, most notably
OpenAI's GPT-2 algorithm. In many cases the generated text is often indistinguishable from text written by humans.

RNNs effectively have an internal memory that allows previous inputs to affect subsequent predictions. It is much
easier to predict the next word in a sentence with more accuracy if you know what the previous words were. Often, with
tasks well suited to RNNs, the order of the items is as important as, or more important than, the items themselves.

Sequence-to-Sequence Models: TRANSFORMERS (Translate one language into another language)


Sequence-to-sequence (seq2seq) models in NLP are used to convert sequences of Type A to sequences of Type B. For
example, translation of English sentences to German sentences is a sequence-to-sequence task.

Recurrent Neural Network (RNN) based sequence-to-sequence models have garnered a lot of traction ever since they
were introduced in 2014. Most of the data in the current world are in the form of sequences – it can be a number sequence,
text sequence, a video frame sequence or an audio sequence.
The performance of these seq2seq models was further enhanced with the addition of the Attention Mechanism in 2015.
How quickly advancements in NLP have been happening in the last 5 years – incredible!
These sequence-to-sequence models are pretty versatile and they are used in a variety of NLP tasks, such as:
• Machine Translation
• Text Summarization
• Speech Recognition
• Question-Answering System, and so on


RNN based Sequence-to-Sequence Model


Let’s take a simple example of a sequence-to-sequence model. Check out the below illustration:

German to English Translation using seq2seq


The above seq2seq model is converting a German phrase to its English counterpart. Let’s break it down:
• Both Encoder and Decoder are RNNs
• At every time step in the Encoder, the RNN takes a word vector (xi) from the input sequence and a hidden state
(Hi) from the previous time step
• The hidden state is updated at each time step
• The hidden state from the last unit is known as the context vector. This contains information about the input
sequence
• This context vector is then passed to the decoder and it is then used to generate the target sequence (English phrase)
• If we use the Attention mechanism, then the weighted sum of the hidden states is passed as the context vector to
the decoder (see the sketch below)
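
A small numpy sketch of that attention step: dot-product scores between the current decoder state and each encoder hidden state are turned into weights with a softmax, and the context vector is their weighted sum. All values are random placeholders for illustration:

```python
# Attention-weighted context vector over a toy set of encoder hidden states.
import numpy as np

rng = np.random.default_rng(2)
encoder_states = rng.normal(size=(6, 8))           # 6 time steps, hidden size 8
decoder_state = rng.normal(size=8)

scores = encoder_states @ decoder_state            # one score per encoder time step
weights = np.exp(scores) / np.exp(scores).sum()    # softmax -> attention weights
context = weights @ encoder_states                 # weighted sum of hidden states

print(weights.round(3))   # how much attention each input position receives
print(context.shape)      # (8,) - same size as a single hidden state
```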
Challenges
Despite being so good at what they do, seq2seq models with attention have certain limitations:
• Dealing with long-range dependencies is still challenging
• The sequential nature of the model architecture prevents parallelization. These challenges are addressed by Google
Brain's Transformer concept
RNNs can remember important things about the input they have received, which allows them to be very precise in predicting
the next outcome. This is the reason why they are preferred for sequential data. Examples of sequence data include
time series, speech, text, financial data, audio, video, weather, and many more. Although RNNs were the state-of-the-art
algorithms for dealing with sequential data, they come with their own drawbacks: because of the complexity of
the algorithm, the network is quite slow to train, and with a huge number of dimensions the training becomes very
long and difficult.

TRANSFORMERS

Attention models/Transformers are the most exciting models being studied in NLP research today, but they can be a bit
challenging to grasp – the pedagogy is all over the place. This is both a bad thing (it can be confusing to hear different
versions) and in some ways a good thing (the field is rapidly evolving, there is a lot of space to improve).

Transformer


Internally, the Transformer has a similar kind of architecture as the previous models above. But the Transformer consists
of six encoders and six decoders.

The encoders are all very similar to each other: all encoders have the same architecture. The decoders share the same property,
i.e., they are also very similar to each other. Each encoder consists of two layers: a self-attention layer and a feed-forward
neural network.

The encoder’s inputs first flow through a self-attention layer. It helps the encoder look at other words in the input sentence
as it encodes a specific word. The decoder has both those layers, but between them is an attention layer that helps the
decoder focus on relevant parts of the input sentence.

Self-Attention
Let’s start to look at the various vectors/tensors and how they flow between these components to turn the input of a
trained model into an output. As is the case in NLP applications in general, we begin by turning each input word into a
vector using an embedding algorithm.


Each word is embedded into a vector of size 512. We will represent those vectors with these simple boxes. The embedding
only happens in the bottom-most encoder. The abstraction that is common to all the encoders is that they receive a list of
vectors each of the size 512.

In the bottom encoder that would be the word embeddings, but in other encoders, it would be the output of the encoder
that’s directly below. After embedding the words in our input sequence, each of them flows through each of the two layers
of the encoder.

Here we begin to see one key property of the Transformer, which is that the word in each position flows through its own
path in the encoder. There are dependencies between these paths in the self-attention layer. The feed-forward layer does
not have those dependencies, however, and thus the various paths can be executed in parallel while flowing through the
feed-forward layer.

Next, we’ll switch up the example to a shorter sentence and we’ll look at what happens in each sub-layer of the encoder.

Self-Attention
Let’s first look at how to calculate self-attention using vectors, then proceed to look at how it’s actually implemented —
using matrices.

Figuring out the relation of words within a sentence and giving the right attention to it.


The first step in calculating self-attention is to create three vectors from each of the encoder’s input vectors (in this case,
the embedding of each word). So for each word, we create a Query vector, a Key vector, and a Value vector. These vectors
are created by multiplying the embedding by three matrices that we trained during the training process.

Notice that these new vectors are smaller in dimension than the embedding vector. Their dimensionality is 64, while the
embedding and encoder input/output vectors have dimensionality of 512. They don’t HAVE to be smaller, this is an
architecture choice to make the computation of multiheaded attention (mostly) constant.

Multiplying x1 by the WQ weight matrix produces q1, the “query” vector associated with that word. We end up creating a
“query”, a “key”, and a “value” projection of each word in the input sentence.

What are the “query”, “key”, and “value” vectors?


They’re abstractions that are useful for calculating and thinking about attention. Once you proceed with reading how
attention is calculated below, you’ll know pretty much all you need to know about the role each of these vectors plays.

The second step in calculating self-attention is to calculate a score. Say we’re calculating the self-attention for the first
word in this example, “Thinking”. We need to score each word of the input sentence against this word. The score determines
how much focus to place on other parts of the input sentence as we encode a word at a certain position.

The score is calculated by taking the dot product of the query vector with the key vector of the respective word we’re
scoring. So if we’re processing the self-attention for the word in position #1, the first score would be the dot product of q1
and k1. The second score would be the dot product of q1 and k2.

The third and fourth steps are to divide the scores by 8 (the square root of the key-vector dimension used in this
example, 64, is 8; dividing by it leads to more stable gradients. Other values are possible, but this is the default),
then pass the result through a softmax operation. Softmax normalizes the scores so they're all positive and add up to 1.


This softmax score determines how much each word will be expressed at this position. Clearly the word at this
position will have the highest softmax score, but sometimes it's useful to attend to another word that is relevant to the
current word.

The fifth step is to multiply each value vector by the softmax score (in preparation to sum them up). The intuition here is
to keep intact the values of the word(s) we want to focus on, and drown-out irrelevant words (by multiplying them by tiny
numbers like 0.001, for example). The sixth step is to sum up the weighted value vectors. This produces the output of the
self-attention layer at this position (for the first word).

That concludes the self-attention calculation. The resulting vector is one we can send along to the feed-forward neural
network. In the actual implementation, however, this calculation is done in matrix form for faster processing.
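
That matrix form can be sketched in a few lines of numpy; here WQ, WK and WV are random placeholders standing in for the trained projection matrices:

```python
# Scaled dot-product self-attention in matrix form (single head, no masking).
import numpy as np

rng = np.random.default_rng(3)
seq_len, d_model, d_k = 4, 512, 64     # 4 input words, 512-d embeddings, 64-d Q/K/V

X = rng.normal(size=(seq_len, d_model))          # embeddings of the input words
WQ, WK, WV = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Q, K, V = X @ WQ, X @ WK, X @ WV                 # query, key and value projections
scores = Q @ K.T / np.sqrt(d_k)                  # dot products, scaled by sqrt(64) = 8
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
Z = weights @ V                                  # weighted sum of the value vectors

print(Z.shape)   # (4, 64): one output vector per input position
```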

Multihead attention
There are a few other details that make them work better. For example, instead of only paying attention to each other in one
dimension, Transformers use the concept of Multihead attention. The idea behind it is that whenever you are translating a
word, you may pay different attention to each word based on the type of question that you are asking. The images below
show what that means. For example, whenever you are translating “kicked” in the sentence “I kicked the ball”, you may
ask “Who kicked”. Depending on the answer, the translation of the word to another language can change. Or ask other
questions, like “Did what?”, etc…


Positional Encoding
Another important step in the Transformer is to add a positional encoding when encoding each word. Encoding the
position of each word matters because word order is relevant to the translation.
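
One common choice, used in the original Transformer paper, is the sinusoidal encoding sketched below; each position receives a fixed pattern of sines and cosines that is added to the word embedding:

```python
# Sinusoidal positional encoding: PE[pos, 2i] = sin(pos / 10000^(2i/d)),
#                                 PE[pos, 2i+1] = cos(pos / 10000^(2i/d)).
import numpy as np

def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]       # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions
    pe[:, 1::2] = np.cos(angles)               # odd dimensions
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)   # (50, 512): added element-wise to the word embeddings
```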

OVERVIEW OF TEXT SUMMARIZATION AND TOPIC MODELS

Text summarization is the process of creating a concise and accurate representation of the main points and information in
a document. Topic modelling can help you generate summaries by extracting the most relevant and salient topics and words
from the document. Text summarization refers to the technique of shortening long pieces of text. The intention is to create
a coherent and fluent summary having only the main points outlined in the document.
Automatic text summarization is a common problem in machine learning and natural language processing (NLP). In
general, text summarization techniques have proved to be critical in quickly and accurately summarizing voluminous texts,
something which could be expensive and time-consuming if done without machines.


There are two main types of how to summarize text in NLP:


• Extraction-based summarization
• Abstraction-based summarization

• Extraction-based summarization
The extractive text summarization technique involves pulling keyphrases from the source document and combining them
to make a summary. The extraction is made according to the defined metric without making any changes to the texts.

Here is an example:
Source text: Joseph and Mary rode on a donkey to attend the annual event in Jerusalem. In the city, Mary gave birth to
a child named Jesus.

Extractive summary: Joseph and Mary attend event Jerusalem. Mary birth Jesus.

As you can see above, the key words have been extracted from the source and joined to create a summary, although sometimes
the summary can be grammatically strange.
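
A crude, unsupervised sketch of extraction (simpler than the supervised keyphrase pipeline described later in this section): score each sentence by the frequency of its words and keep the top-scoring ones:

```python
# Frequency-based extractive summarizer: keep the k highest-scoring sentences.
import re
from collections import Counter

def summarize(text: str, k: int = 1) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> int:
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    top = set(sorted(sentences, key=score, reverse=True)[:k])
    return " ".join(s for s in sentences if s in top)   # keep original order

doc = ("Joseph and Mary rode on a donkey to attend the annual event in Jerusalem. "
       "In the city, Mary gave birth to a child named Jesus.")
print(summarize(doc, k=1))
```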

• Abstraction-based summarization
The abstraction technique entails paraphrasing and shortening parts of the source document. When abstraction is applied
for text summarization in deep learning problems, it can overcome the grammar inconsistencies of the extractive method.
The abstractive text summarization algorithms create new phrases and sentences that relay the most useful information from
the original text — just like humans do.

Therefore, abstraction performs better than extraction. However, the text summarization algorithms required to do
abstraction are more difficult to develop; that’s why the use of extraction is still popular.

Here is an example:
Abstractive summary: Joseph and Mary came to Jerusalem where Jesus was born.

How does a text summarization algorithm work?


Usually, text summarization in NLP is treated as a supervised machine learning problem (where future outcomes are
predicted based on provided data).

Typically, here is how using the extraction-based approach to summarize texts can work:

1. Introduce a method to extract candidate keyphrases from the source document. For example, you can use part-
of-speech tagging, word sequences, or other linguistic patterns to identify the keyphrases.

2. Gather text documents with positively-labeled keyphrases. The keyphrases should be compatible with the stipulated
extraction technique. To increase accuracy, you can also create negatively-labeled keyphrases.

3. Train a binary machine learning classifier to perform the text summarization. Some of the features you can use
include:
• Length of the keyphrase
• Frequency of the keyphrase
• The most recurring word in the keyphrase
• Number of characters in the keyphrase
4. Finally, in the test phase, generate all candidate keyphrases and sentences and classify them (a minimal sketch of this approach follows below).
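
The sketch below illustrates steps 2-4 with scikit-learn; the candidate phrases, labels, and hand-crafted features are hypothetical and chosen only to show the shape of the approach, not a production-quality extractor.

from sklearn.linear_model import LogisticRegression

def keyphrase_features(phrase, document):
    # Illustrative features matching the list above: length, frequency,
    # most recurring word, and number of characters.
    words = phrase.lower().split()
    doc = document.lower()
    return [
        len(words),                          # length of the keyphrase (in words)
        doc.count(phrase.lower()),           # frequency of the keyphrase in the document
        max(doc.count(w) for w in words),    # count of the most recurring word in it
        len(phrase),                         # number of characters
    ]

# Hypothetical training data: candidate phrases labeled 1 (keep) or 0 (discard)
document = "Joseph and Mary rode on a donkey to attend the annual event in Jerusalem."
candidates = ["annual event", "a donkey", "Joseph and Mary", "rode on"]
labels = [1, 0, 1, 0]

X = [keyphrase_features(p, document) for p in candidates]
clf = LogisticRegression().fit(X, labels)

# At test time, score new candidate phrases and keep the positively classified ones
print(clf.predict([keyphrase_features("event in Jerusalem", document)]))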

Topic Modeling
Topic modeling is a technique that can help you discover the main themes and concepts in a large collection of text
documents. It can also help you summarize, classify, or cluster the documents based on their topics. In this section, you will
learn how to use topic modeling for these tasks and which common algorithms and tools you can apply.

Topic modeling is a form of unsupervised learning that aims to find hidden patterns and structures in the text data. It assumes
that each document is composed of a mixture of topics, and each topic is a distribution of words that represent a specific


subject or idea. For example, a document about sports might have topics such as soccer, basketball, and fitness. Topic
modeling can help you identify these topics and their proportions in each document. For text summarization, it can also help
generate summaries by extracting the most relevant and salient topics and words from the document; you can then use these
topics and words to construct a summary that captures the essence and meaning of the document.

Topic modeling is a collection of text-mining techniques that uses statistical and machine learning models to automatically
discover hidden abstract topics in a collection of documents.
Topic modeling is also a collection of unsupervised techniques capable of detecting word and phrase patterns within
documents and automatically clustering word groups and similar expressions, which helps to best represent a set of
documents.

There are many cases where humans or machines generate a huge amount of text over time, and it is neither prudent nor
possible to read through all of it to understand what is important or to form an opinion about the process that generated the data.
In such cases, NLP algorithms, and topic modeling in particular, are useful for extracting a summary of the underlying text and
discovering important contexts from it.

Topic modeling is the method of extracting needed attributes from a bag of words. This is critical because each word in the
corpus is treated as a feature in NLP. As a result, feature reduction allows us to focus on the relevant material rather than
wasting time sifting through all of the data's text.

There are many different topic modeling algorithms and tools available for text analysis projects. Popular methods include
Latent Dirichlet Allocation (LDA), Non-Negative Matrix Factorization (NMF), and Latent Semantic Analysis (LSA).
Common tools used to apply these algorithms include Gensim, a Python library providing implementations of LDA, NMF,
and other topic modeling methods; Scikit-learn, a Python library providing implementations of NMF, LSA, and other
machine learning methods; and MALLET, a Java-based toolkit providing implementations of LDA and other topic
modeling methods. These tools offer various utilities and functionalities for preprocessing, evaluation, visualization, data
manipulation, feature extraction, model selection, and performance metrics.

Working and Methods of Topic Modeling:

To infer topics from unstructured data, topic modeling involves counting words and grouping similar word patterns.
Suppose we are a software firm interested in learning what consumers have to say about specific elements of our product;
rather than spending hours trying to figure out which messages are talking about our topics of interest, we would use a
topic modeling algorithm to examine the comments.

A topic model groups feedback that is comparable, as well as phrases and expressions that appear most frequently, by
recognizing patterns such as word frequency and distance between words. We may rapidly infer what each group of texts
is about using this information.

Five algorithms in particular are widely used for topic modeling. We will go through each of them below, drawing on
material from OpenGenus.

1. Latent Dirichlet Allocation (LDA):

The statistical and graphical concept of Latent Dirichlet Allocation is used to find correlations between many documents in
a corpus. The maximum likelihood estimate over the entire corpus of text is obtained using the Variational Expectation
Maximization (VEM) technique. Traditionally, topics were found by simply selecting the top few words from a bag of
words; such a selection, however, carries little meaning on its own. In this approach, each document can instead be
represented by a probabilistic distribution over topics, and each topic can be defined by a probabilistic distribution over
words. As a result, we get a much clearer picture of how the topics are related.

[Figure: Example of LDA, with documents at the top level, topics in the middle, and words at the bottom]


Consider the following scenario: you have a corpus of 1000 documents. The bag of words is made up of 1000 common
words after preprocessing the corpus. We can determine the subjects that are relevant to each document using LDA.


Extracting information from the corpus therefore becomes straightforward. In the diagram above, the upper level represents
the documents, the middle level the generated topics, and the bottom level the words.

As a result, the rule is that a document is represented as a distribution over topics, and each topic is described as a
distribution over words.
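
As a concrete illustration, here is a minimal LDA example using the Gensim library mentioned earlier; the tiny tokenised corpus and the number of topics are toy choices for demonstration only.

from gensim import corpora
from gensim.models import LdaModel

# Toy corpus: each document is already tokenised and stop words are removed
texts = [
    ["soccer", "goal", "match", "league"],
    ["basketball", "court", "match", "score"],
    ["fitness", "training", "gym", "health"],
]

dictionary = corpora.Dictionary(texts)               # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]      # bag-of-words per document

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10, random_state=0)

print(lda.print_topics())                            # each topic as a distribution over words
print(lda.get_document_topics(corpus[0]))            # topic mixture of the first document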

2. Non Negative Matrix Factorization (NMF):

NMF is a matrix factorization method that ensures that the elements of the factorized matrices are non-negative. Consider
the document-term matrix produced after deleting stopwords from a corpus. This matrix can be factored into two matrices:
a term-topic matrix and a topic-document matrix.

Matrix factorization may be accomplished using a variety of optimization methods. NMF can be performed more quickly
and effectively using Hierarchical Alternating Least Squares, where the factorization proceeds by updating one column at a
time while leaving the other columns unchanged.
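
A minimal sketch of NMF-based topic extraction with scikit-learn is shown below; the documents and the number of topics are toy values for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = [
    "the soccer team won the league match",
    "the basketball team lost the final match",
    "daily training at the gym improves fitness and health",
]

# Build the document-term matrix (stop words removed), then factorize it
vectorizer = TfidfVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)

nmf = NMF(n_components=2, random_state=0)
doc_topic = nmf.fit_transform(dtm)       # topic-document matrix (documents x topics)
topic_term = nmf.components_             # term-topic matrix (topics x terms)

terms = vectorizer.get_feature_names_out()
for k, row in enumerate(topic_term):
    top = [terms[i] for i in row.argsort()[-3:][::-1]]   # three strongest words per topic
    print(f"Topic {k}: {top}")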

3. Latent Semantic Analysis (LSA):

Latent Semantic Analysis is another unsupervised learning approach for extracting relationships between words in a large
number of documents. This assists us in selecting the appropriate documents.

Essentially, LSA serves as a dimensionality reduction tool for a massive corpus of text data; the extraneous dimensions
would otherwise add noise to the process of extracting the proper insights from the data.
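
The following minimal sketch applies LSA via scikit-learn's TruncatedSVD to a toy corpus, purely to show the dimensionality-reduction step.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the soccer team won the league match",
    "the basketball team lost the final match",
    "daily training at the gym improves fitness and health",
]

dtm = TfidfVectorizer(stop_words="english").fit_transform(docs)

# LSA is a truncated SVD of the document-term matrix: keep only the strongest "concepts"
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_concepts = lsa.fit_transform(dtm)    # each document as a 2-dimensional concept vector
print(doc_concepts.shape)                # (3, 2): the dimensionality is greatly reduced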

4. Parallel Latent Dirichlet Allocation:

Partially Labeled Dirichlet Allocation is another name for it. The model assumes that there are a total of n labels, each of
which is associated with a different topic in the corpus.

Then, similarly to LDA, the individual topics are represented as probability distributions over the entire corpus.
Optionally, each document may also be assigned a global topic, resulting in l global topics, where l is the number of
individual documents in the corpus.

The technique also assumes that every topic in the corpus has just one label. Compared with the other approaches, this
procedure is fast and precise because the labels are supplied before the model is created.

5. Pachinko Allocation Model (PAM):

The Pachinko Allocation Model (PAM) is a more advanced version of the Latent Dirichlet Allocation model. The LDA
model identifies topics based on the correlations between words in the corpus. PAM, in addition, models the correlations
between the generated topics themselves. Because it also considers the links between topics, this model is better able to
capture semantic relationships precisely. The model is named after Pachinko, a popular Japanese game. To explore the
associations between topics, the model uses directed acyclic graphs (DAGs).

***************
