CCS369 - Unit 2 Notes
INTRODUCTION
Text classification is a machine learning technique that assigns a set of predefined categories to open-ended text. Text
classifiers can be used to organize, structure, and categorize almost any kind of text – from documents, medical studies,
and files to content across the web. For example, news articles can be organized by topic; support tickets can be organized by
urgency; chat conversations can be organized by language; brand mentions can be organized by sentiment; and so on.
Text classification is one of the fundamental tasks in natural language processing with broad applications such as sentiment
analysis, topic labeling, spam detection, and intent detection. Here’s an example of how it works:
A text classifier can take a short piece of text – say, a product review about an application's interface – as input, analyze its
content, and then automatically assign relevant tags, such as UI and Easy To Use.
• Real-time analysis: There are critical situations that companies need to identify as soon as possible and take
immediate action (e.g., PR crises on social media). Machine learning text classification can follow your brand
mentions constantly and in real-time, so you'll identify critical information and be able to take action right away.
• Consistent criteria: Human annotators make mistakes when classifying text data due to distractions, fatigue, and
boredom, and human subjectivity creates inconsistent criteria. Machine learning, on the other hand, applies the
same lens and criteria to all data and results. Once a text classification model is properly trained, it classifies new data with
consistent accuracy.
• Manual text classification involves a human annotator, who interprets the content of text and categorizes it
accordingly. This method can deliver good results but it’s time-consuming and expensive.
• Automatic text classification applies machine learning, natural language processing (NLP), and other AI-guided
techniques to automatically classify text in a faster, more cost-effective, and more accurate manner.
There are many approaches to automatic text classification, but they all fall under three types of systems:
• Rule-based systems
• Machine learning-based systems
• Hybrid systems
Rule-based systems
Rule-based approaches classify text into organized groups by using a set of handcrafted linguistic rules. These rules instruct
the system to use semantically relevant elements of a text to identify relevant categories based on its content. Each rule
consists of an antecedent or pattern and a predicted category.
Example: Say that you want to classify news articles into two groups: Sports and Politics. First, you’ll need to define two
lists of words that characterize each group (e.g., words related to sports such as football, basketball, LeBron James, etc.,
and words related to politics, such as Donald Trump, Hillary Clinton, Putin, etc.). Next, when you want to classify a new
incoming text, you’ll need to count the number of sport-related words that appear in the text and do the same for politics-
related words. If the number of sports-related word appearances is greater than the politics-related word count, then the text
is classified as Sports and vice versa. For example, this rule-based system will classify the headline “When is LeBron James'
first game with the Lakers?” as Sports because it counted one sports-related term (LeBron James) and it didn’t count any
politics-related terms.
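As a rough illustration, a minimal keyword-counting classifier along these lines might look as follows (the keyword lists, the
tokenization, and the tie-breaking behaviour are illustrative assumptions, not part of any particular system):

    # Minimal sketch of the rule-based classifier described above.
    # The keyword lists are illustrative; a real system would use much larger,
    # hand-crafted lists and more sophisticated matching rules.
    SPORTS_WORDS = {"football", "basketball", "lebron", "james", "game", "lakers"}
    POLITICS_WORDS = {"trump", "clinton", "putin", "election", "senate"}

    def rule_based_classify(text):
        tokens = text.lower().replace("?", " ").replace("'", " ").split()
        sports_count = sum(1 for t in tokens if t in SPORTS_WORDS)
        politics_count = sum(1 for t in tokens if t in POLITICS_WORDS)
        if sports_count > politics_count:
            return "Sports"
        if politics_count > sports_count:
            return "Politics"
        return "Unknown"  # tie: no rule fires decisively

    print(rule_based_classify("When is LeBron James' first game with the Lakers?"))  # -> Sports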
Rule-based systems are human comprehensible and can be improved over time. But this approach has some disadvantages.
For starters, these systems require deep knowledge of the domain. They are also time-consuming, since generating rules for
a complex system can be quite challenging and usually requires a lot of analysis and testing. Rule-based systems are also
difficult to maintain and don’t scale well given that adding new rules can affect the results of the pre-existing rules.
Machine learning-based systems
The first step towards training a machine learning NLP classifier is feature extraction: a method used to transform each text
into a numerical representation in the form of a vector. One of the most frequently used approaches is the bag of words,
where a vector represents the frequency of a word in a predefined dictionary of words.
For example, if we have defined our dictionary to have the following words {This, is, the, not, awesome, bad, basketball},
and we wanted to vectorize the text “This is awesome,” we would have the following vector representation of that text: (1,
1, 0, 0, 1, 0, 0). Then, the machine learning algorithm is fed with training data that consists of pairs of feature sets (vectors
for each text example) and tags (e.g. sports, politics) to produce a classification model.
Once it’s trained with enough training samples, the machine learning model can begin to make accurate predictions. The
same feature extractor is used to transform unseen text to feature sets, which can be fed into the classification model to get
predictions on tags (e.g., sports, politics).
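To make the bag-of-words step concrete, here is a minimal sketch using the small dictionary from the example above (the
function name is illustrative; real systems use much larger vocabularies):

    # Bag-of-words vectorization over the small dictionary from the example above.
    DICTIONARY = ["This", "is", "the", "not", "awesome", "bad", "basketball"]

    def bag_of_words(text):
        tokens = text.split()
        # Count how often each dictionary word appears in the text.
        return [tokens.count(word) for word in DICTIONARY]

    print(bag_of_words("This is awesome"))  # -> [1, 1, 0, 0, 1, 0, 0]

    # The same feature extractor would then be applied to every training text,
    # and the resulting (vector, tag) pairs fed to a classifier of choice
    # (e.g. Naive Bayes or an SVM, as discussed below).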
• Naive Bayes
Naive Bayes is a family of probabilistic algorithms based on Bayes' Theorem, which computes the conditional probability of
the occurrence of two events based on the probabilities of the occurrence of each individual event. So we're calculating the
probability of each tag for a given text, and then outputting the tag with the highest probability.
The probability of A, if B is true, is equal to the probability of B, if A is true, times the probability of A being true, divided
by the probability of B being true. This means that any vector that represents a text will have to contain information about
the probabilities of the appearance of certain words within the texts of a given category so that the algorithm can compute
the likelihood of that text belonging to the category.
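In symbols, this is Bayes' theorem, with A standing for a tag and B for the text being classified:

    P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}

Naive Bayes applies this with the simplifying ("naive") assumption that the words in a text are conditionally independent of
each other given the tag.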
• Support Vector Machines (SVM)
A Support Vector Machine separates the vectors belonging to one tag from those belonging to another by drawing a decision
boundary (a hyperplane) between them. The optimal hyperplane is the one with the largest margin between the tags. Those
vectors are representations of your training texts, and each group corresponds to a tag you have labeled your texts with. As
data gets more complex, it may not be possible to separate the vectors into two groups with a simple linear boundary in the
original space.
• Deep Learning
Deep learning is a set of algorithms and techniques inspired by how the human brain works, called neural networks. Deep
learning architectures offer huge benefits for text classification because they perform at super high accuracy with lower-
level engineering and computation. The two main deep learning architectures for text classification are Convolutional Neural
Networks (CNN) and Recurrent Neural Networks (RNN). Deep learning is hierarchical machine learning, using multiple
algorithms in a progressive chain of events. It’s similar to how the human brain works when making decisions, using
different techniques simultaneously to process huge amounts of data.
Deep learning algorithms do require much more training data than traditional machine learning algorithms (at least millions
of tagged examples). However, unlike traditional machine learning algorithms such as SVM and Naive Bayes, they don't hit
a ceiling as the training data grows: deep learning classifiers continue to get better the more data you feed them. Word
embedding algorithms such as Word2Vec or GloVe are also used to obtain better vector representations of words and to
improve the accuracy of classifiers trained with traditional machine learning algorithms.
• Hybrid Systems
Hybrid systems combine a machine learning-trained base classifier with a rule-based system, used to further improve the
results. These hybrid systems can be easily fine-tuned by adding specific rules for those conflicting tags that haven’t been
correctly modeled by the base classifier.
Vector semantics is the standard way to represent word meaning in NLP, helping us model many of the aspects of word
meaning. The idea of vector semantics is to represent a word as a point in a multidimensional semantic space that is derived
from the distributions of its word neighbors. Vectors for representing words are called embeddings (although the term is
sometimes more strictly applied only to dense vectors like those produced by word2vec). Vector semantics defines and
interprets word meaning in a way that explains features such as word similarity. Its central idea is: two words are similar if
they occur in similar contexts.
In its current form, the vector model draws on linguistic and philosophical work of the 1950s. Vector semantics represents a
word in a multi-dimensional vector space. The vector model is also called an embedding, because each word is embedded in
a particular vector space. The vector model offers many advantages in NLP. For example, in sentiment analysis, it lets a
classifier set up a decision boundary and predict whether the sentiment is positive or negative (a binomial classification).
Another key practical advantage of vector semantics is that it can be learned automatically from text without complex labeling
or supervision. As a result of these advantages, vector semantics has become a de-facto standard for NLP applications such
as Sentiment Analysis, Named Entity Recognition (NER), topic modeling, and so on.
WORD EMBEDDINGS
It is an approach for representing words and documents. Word Embedding or Word Vector is a numeric vector input that
represents a word in a lower-dimensional space. It allows words with similar meaning to have a similar representation. They
can also approximate meaning. A word vector with 50 values can represent 50 unique features.
Features: Anything that relates words to one another. Eg: Age, Sports, Fitness, Employed etc. Each word vector has values
corresponding to these features.
Word Embeddings are a method of extracting features out of text so that we can input those features into a machine learning
model to work with text data. They try to preserve syntactical and semantic information. The methods such as Bag of
Words(BOW), CountVectorizer and TFIDF rely on the word count in a sentence but do not save any syntactical or semantic
information. In these algorithms, the size of the vector equals the number of words in the vocabulary, and because most of
the elements are zero, we get a sparse matrix. Large input vectors also mean a huge number of weights, which results in high
computation cost for training. Word embeddings give a solution to these problems.
Let's take an example to understand how word vectors are generated. Take emoticons that are most frequently used in
certain conditions, transform each emoji into a vector, and let those conditions be our features.
In a similar way, we can create word vectors for different words as well on the basis of given features. The words with
similar vectors are most likely to have the same meaning or are used to convey the same sentiment. There are two different
approaches for getting Word Embeddings:
1) Word2Vec:
In Word2Vec every word is assigned a vector. We start with either a random vector or one-hot vector.
One-Hot vector: A representation where only one bit in a vector is 1. If there are 500 words in the corpus then the vector
length will be 500. After assigning vectors to each word we take a window size and iterate through the entire corpus. While
we do this there are two neural embedding methods which are used:
CBOW Architecture.
This architecture is very similar to a feed-forward neural network. This model architecture essentially tries to
predict a target word from a list of context words.
The intuition behind this model is quite simple: given a phrase "Have a great day", we will choose our target
word to be “a” and our context words to be [“have”, “great”, “day”]. What this model will do is take the
distributed representations of the context words to try and predict the target word.
The English language contains almost 1.2 million words, making it impossible to include so many words in our
example. So I'll consider a small example in which we have only four words, i.e. live, home, they and at. For
simplicity, we will consider that the corpus contains only one sentence, that being, ‘They live at home’.
First, we convert each word into its one-hot encoded form. Also, we will not consider all the words in the sentence
but will only take the words that fall within a window. For example, for a window size equal to three, we only
consider three words in a sentence. The middle word is to be predicted and the surrounding two words are fed
into the neural network as context. The window is then slid and the process is repeated again.
Finally, after training the network repeatedly by sliding the window as described above, we obtain the trained weights,
which we then use as the word embeddings.
Usually, we take a window size of around 8-10 words and have a vector size of 300.
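As a small illustration, the one-hot vectors and (context, target) training pairs for the toy sentence 'They live at home' with a
window size of three could be generated as follows (variable names are illustrative):

    # Toy corpus and vocabulary from the example above.
    corpus = ["they", "live", "at", "home"]
    vocab = sorted(set(corpus))                      # ['at', 'home', 'live', 'they']
    word_to_index = {w: i for i, w in enumerate(vocab)}

    def one_hot(word):
        vec = [0] * len(vocab)
        vec[word_to_index[word]] = 1
        return vec

    # Slide a window of size 3 over the sentence: the middle word is the target,
    # the two surrounding words are the context fed to the network.
    window = 3
    pairs = []
    for i in range(len(corpus) - window + 1):
        chunk = corpus[i:i + window]
        context = [chunk[0], chunk[2]]
        target = chunk[1]
        pairs.append((context, target))

    print(pairs)            # [(['they', 'at'], 'live'), (['live', 'home'], 'at')]
    print(one_hot("home"))  # [0, 1, 0, 0]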
After applying the above neural embedding methods we get trained vectors of each word after many iterations through the
corpus. These trained vectors preserve syntactical or semantic information and are converted to lower dimensions. The
vectors with similar meaning or semantic information are placed close to each other in space.
Skip-gram Architecture.
The Skip-gram model architecture usually tries to achieve the reverse of what the CBOW model does. It tries to
predict the source context words (surrounding words) given a target word (the centre word)
The working of the skip-gram model is quite similar to the CBOW but there is just a difference in the architecture
of its neural network and the way the weight matrix is generated.
After obtaining the weight matrix, the steps to get the word embeddings are the same as in CBOW.
So which of the two algorithms should we use for implementing word2vec? For a large corpus and higher-dimensional
vectors, skip-gram generally works better but is slower to train, whereas CBOW works well for a smaller corpus and is
faster to train.
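In practice, both architectures are available through libraries such as gensim rather than implemented from scratch; the sketch
below assumes gensim 4.x parameter names and uses a toy corpus:

    # pip install gensim
    from gensim.models import Word2Vec

    # A toy tokenized corpus; a real model needs a much larger corpus.
    sentences = [["they", "live", "at", "home"],
                 ["have", "a", "great", "day"]]

    # sg=0 selects the CBOW architecture, sg=1 selects skip-gram.
    cbow_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)
    skipgram_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

    print(cbow_model.wv["home"][:5])                   # first few dimensions of the embedding
    print(skipgram_model.wv.most_similar("home", topn=2))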
2) GloVe:
This is another method for creating word embeddings. In this method, we take the corpus and iterate through it and get the
co-occurrence of each word with other words in the corpus. We get a co-occurrence matrix through this. The words which
occur next to each other get a value of 1, if they are one word apart then 1/2, if two words apart then 1/3 and so on.
Let us take an example to understand how the matrix is created. We have a small corpus:
Corpus:
It is a nice evening.
Good Evening!
Is it a nice evening?
Co-occurrence matrix (lower half shown):

             it         is         a          nice     evening   good
    it       0
    is       1+1        0
    a        1/2+1      1+1/2      0
    nice     1/3+1/2    1/2+1/3    1+1        0
    evening  1/4+1/3    1/3+1/4    1/2+1/2    1+1      0
    good     0          0          0          0        1         0
The upper half of the matrix will be a reflection of the lower half. We can consider a window frame as well to calculate the
co-occurrences by shifting the frame till the end of the corpus. This helps gather information about the context in which the
word is used. Initially, the vectors for each word are assigned randomly. Then we take pairs of vectors and see how close
they are to each other in space. If two words occur together more often (i.e., have a higher value in the co-occurrence matrix)
but are far apart in space, then their vectors are brought closer to each other. If they are close to each other in space but are
rarely used together, then they are moved further apart.
After many iterations of the above process, we'll get a vector-space representation that approximates the information in
the co-occurrence matrix. GloVe generally performs better than Word2Vec at capturing both semantic and syntactic
relationships.
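A minimal sketch of building the distance-weighted co-occurrence counts for the toy corpus above (the tokenization and the
symmetric pair ordering are simplifying assumptions; GloVe itself then learns vectors that reproduce these statistics):

    from collections import defaultdict

    # Toy corpus from the example above, lowercased and stripped of punctuation.
    sentences = [["it", "is", "a", "nice", "evening"],
                 ["good", "evening"],
                 ["is", "it", "a", "nice", "evening"]]

    # cooc[(w1, w2)] accumulates 1/d for every pair of words at distance d.
    cooc = defaultdict(float)
    for sent in sentences:
        for i in range(len(sent)):
            for j in range(i + 1, len(sent)):
                pair = tuple(sorted((sent[i], sent[j])))
                cooc[pair] += 1.0 / (j - i)

    print(cooc[("is", "it")])   # 2.0  (adjacent in both sentence 1 and sentence 3)
    print(cooc[("a", "it")])    # 1.5  (1/2 + 1)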
Word embedding is a technique where individual words are transformed into a numerical representation of the
word (a vector). Each word is mapped to one vector, and this vector is then learned in a way that resembles how a
neural network is trained. The vectors try to capture various characteristics of that word with regard to the overall text.
These characteristics can include the semantic relationship of the word, definitions, context, etc. With these
numerical representations, you can do many things like identify similarity or dissimilarities between words.
Clearly, these are integral as inputs to various aspects of machine learning. A machine cannot process text in its
raw form, thus converting the text into an embedding will allow users to feed the embedding to classic machine
learning models. The simplest embedding would be a one-hot encoding of text data where each vector would be
mapped to a category.
For example:
have = [1, 0, 0, 0, 0, 0, ... 0]
a = [0, 1, 0, 0, 0, 0, ... 0]
good = [0, 0, 1, 0, 0, 0, ... 0]
day = [0, 0, 0, 1, 0, 0, ... 0] ...
However, there are multiple limitations of simple embeddings such as this, as they do not capture characteristics
of the word, and they can be quite large depending on the size of the corpus.
Word2Vec Architecture
The effectiveness of Word2Vec comes from its ability to group together vectors of similar words. Given a large
enough dataset, Word2Vec can make strong estimates about a word’s meaning based on their occurrences in the
text. These estimates yield word associations with other words in the corpus. For example, words like “King” and
"Queen" would be very similar to one another. When conducting algebraic operations on word embeddings you
can find a close approximation of word similarities. For example, the embedding vector of "king" minus the
embedding vector of "man" plus the embedding vector of "woman" yields a vector that is very close to the embedding
vector of "queen" (in the original illustration the vectors were 2-dimensional, with arbitrarily chosen values).
Graphical representation of a Word2Vec embedding (King and Queen are close to each other in the vector space)
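This analogy can be checked with pre-trained vectors, for example through gensim's downloader; the model name below is one
commonly available option, and the exact neighbours and scores returned depend on the vectors used:

    # pip install gensim
    import gensim.downloader as api

    # Load a small set of pre-trained GloVe vectors (a one-time download).
    vectors = api.load("glove-wiki-gigaword-50")

    # king - man + woman ~ queen
    result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
    print(result)   # 'queen' is expected at or near the top of the list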
There are two main architectures that yield the success of word2vec.
• Skip-gram
• CBOW architectures. (Refer Previous Section)
The skip-gram model is a simple neural network with one hidden layer trained in order to predict the probability
of a given word being present when an input word is present. Intuitively, you can imagine the skip-gram model
being the opposite of the CBOW model. In this architecture, it takes the current word as an input and tries to
accurately predict the words before and after this current word. This model essentially tries to learn and predict
the context words around the specified input word. Based on experiments assessing the accuracy of this model it
was found that the prediction quality improves given a large range of word vectors, however it also increases the
computational complexity. The process can be described visually as seen below.
A corpus can be represented as a vector of size N, where each element in N corresponds to a word in the corpus.
During the training process, we have a pair of target and context words, the input array will have 0 in all elements
except for the target word. The target word will be equal to 1. The hidden layer will learn the embedding
representation of each word, yielding a d-dimensional embedding space. The output layer is a dense layer with a
softmax activation function. The output layer will essentially yield a vector of the same size as the input, each
element in the vector will consist of a probability. This probability indicates the similarity between the target
word and the associated word in the corpus.
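A minimal numpy sketch of the forward pass just described, with a toy vocabulary size N and embedding dimension d (the
weights are random here purely for illustration; a real implementation would also include the training loop that updates them):

    import numpy as np

    N, d = 10, 4                     # vocabulary size and embedding dimension (toy values)
    rng = np.random.default_rng(0)

    W_in = rng.normal(size=(N, d))   # input-to-hidden weights: row i is the embedding of word i
    W_out = rng.normal(size=(d, N))  # hidden-to-output weights

    def skipgram_forward(target_index):
        x = np.zeros(N)
        x[target_index] = 1.0        # one-hot input for the target word
        h = x @ W_in                 # hidden layer = the target word's embedding
        scores = h @ W_out
        probs = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary
        return probs                 # probability of each word appearing in the context

    print(skipgram_forward(3).round(3))  # sums to 1; one probability per vocabulary word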
GLOVE MODEL
GloVe, coined from Global Vectors, is an unsupervised learning algorithm for obtaining vector representations for words.
Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting
representations showcase interesting linear substructures of the word vector space. It is a model for distributed word
representation. The model is an unsupervised learning algorithm for obtaining vector representations for words. This is
achieved by mapping words into a meaningful space where the distance between words is related to semantic similarity.
The statistics of word occurrences in a corpus are the primary source of information available to all unsupervised methods
for learning word representations, and although many such methods now exist, the question still remains as to how meaning
is generated from these statistics, and how the resulting word vectors might represent that meaning. In this section, we shed
some light on this question. We use our insights to construct a new model for word representation which we call GloVe, for
Global Vectors, because the global corpus statistics are captured directly by the model.
GloVe model combines the advantages of the two major model families in the literature:
• global matrix factorization
• local context window methods
This model efficiently leverages statistical information by training only on the nonzero elements in a word-word
cooccurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus.
• Matrix Factorization Methods: methods that reduce a matrix into constituent parts that make it easier to calculate
more complex matrix operations .
• Shallow Window-Based Methods: Another approach is to learn word representations that aid in making
predictions within local context windows.
Refer to the previous section for an example.
FASTTEXT MODEL
This model allows training word embeddings from a training corpus with the additional ability to obtain word vectors for
out-of-vocabulary words. FastText is an open-source, free, lightweight library that allows users to learn text representations
and text classifiers. It works on standard, generic hardware. Models can later be reduced in size to even fit on mobile
devices. fastText embeddings exploit subword information to construct word embeddings. FastText is more stable than
the Word2Vec architecture. In FastText, each word is represented as the average of the vector representations of its character n-
grams along with the word itself. So, the word embedding for the word 'equal' can be given as the sum of the vector
representations of all of its character n-grams and the word itself.
Word embedding techniques like word2vec and GloVe provide a distinct vector representation for each word in the
vocabulary, which means the internal structure of words is ignored. This is a limitation for morphologically rich
languages, as it ignores the syntactic relations between word forms. Since many word formations follow regular rules in
morphologically rich languages, it is possible to improve vector representations for these languages by using character-level
information.
To improve vector representation for morphologically rich language, FastText provides embeddings for character n-grams,
representing words as the average of these embeddings. It is an extension of the word2vec model. Word2Vec model provides
embedding to the words, whereas fastText provides embeddings to the character n-grams. Like the word2vec model,
fastText uses CBOW and Skip-gram to compute the vectors.
FastText can also handle out-of-vocabulary words, i.e., fastText can produce embeddings for words that were not present
at the time of training.
Out-of-vocabulary (OOV) words are words that do not occur while training the data and are not present in the model’s
vocabulary. Word embedding models like word2vec or GloVe cannot provide embeddings for the OOV words because they
provide embeddings for words; hence, if a new word occurs, it cannot provide embedding. Since FastText provides
embeddings for character n-grams, it can provide embeddings for OOV words. If an OOV word occurs, then fastText
provides embedding for that word by embedding its character n-gram.
For example, with n = 3, the character n-grams of the word 'equal' (with boundary markers < and >) are:
< eq, equ, qua, ual, al > and < equal >
So, the word embedding for the word 'equal' can be given as the sum of the vector representations of all of its character n-
grams and the word itself.
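A small sketch of extracting these character n-grams (the function name is illustrative, and fastText's internal hashing of
n-grams into buckets is omitted):

    def char_ngrams(word, n=3):
        # fastText marks word boundaries with '<' and '>' before extracting n-grams.
        marked = f"<{word}>"
        ngrams = [marked[i:i + n] for i in range(len(marked) - n + 1)]
        return ngrams + [marked]      # the whole word is kept as an extra "n-gram"

    print(char_ngrams("equal"))
    # ['<eq', 'equ', 'qua', 'ual', 'al>', '<equal>']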
For example, take the sentence "I want to learn FastText." Here the words "I," "want," "to," and "FastText" are
given as input, and the model predicts "learn" as output.
All the input and output data are in the same dimension and have one-hot encoding. The model uses a neural network for
training, with an input layer, a hidden layer, and an output layer.
FastText can be viewed as an extension to word2vec. Some of the significant differences between word2vec and fastText
are as follows:
• Word2Vec works on the word level, while fastText works on the character n-grams.
• Word2Vec cannot provide embeddings for out-of-vocabulary words, while fastText can provide embeddings for
OOV words.
• FastText can provide better embeddings for morphologically rich languages compared to word2vec.
• FastText uses a hierarchical softmax classifier to train the model; hence it is faster than word2vec.
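As a brief sketch with gensim's FastText implementation (gensim 4.x parameter names, toy corpus), an out-of-vocabulary
word still receives an embedding built from its character n-grams:

    # pip install gensim
    from gensim.models import FastText

    sentences = [["i", "want", "to", "learn", "fasttext"],
                 ["fasttext", "handles", "subword", "information"]]

    model = FastText(sentences, vector_size=50, window=3, min_count=1, min_n=3, max_n=5)

    print(model.wv["fasttext"][:5])      # embedding of an in-vocabulary word
    # 'fasttexts' never appears in the corpus, but its character n-grams do,
    # so FastText can still build an embedding for it:
    print(model.wv["fasttexts"][:5])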
DEEP LEARNING
In deep learning, a computer model learns to perform classification tasks directly from images, text, or sound. Deep learning
models can achieve state-of-the-art accuracy, sometimes exceeding human-level performance. Models are trained by using
a large set of labeled data and neural network architectures that contain many layers.
While deep learning was first theorized in the 1980s, there are two main reasons it has only recently become useful:
1. Deep learning requires large amounts of labeled data. For example, driverless car development requires millions
of images and thousands of hours of video.
2. Deep learning requires substantial computing power. High-performance GPUs have a parallel architecture that is
efficient for deep learning. When combined with clusters or cloud computing, this enables development teams to
reduce training time for a deep learning network from weeks to hours or less.
Deep learning applications are used in industries from automated driving to medical devices.
• Automated Driving: Automotive researchers are using deep learning to automatically detect objects such as stop
signs and traffic lights. In addition, deep learning is used to detect pedestrians, which helps decrease accidents.
• Aerospace and Defense: Deep learning is used to identify objects from satellites that locate areas of interest, and
identify safe or unsafe zones for troops.
• Medical Research: Cancer researchers are using deep learning to automatically detect cancer cells. Teams at UCLA
built an advanced microscope that yields a high-dimensional data set used to train a deep learning application to
accurately identify cancer cells.
• Industrial Automation: Deep learning is helping to improve worker safety around heavy machinery by
automatically detecting when people or objects are within an unsafe distance of machines.
• Electronics: Deep learning is being used in automated hearing and speech translation. For example, home assistance
devices that respond to your voice and know your preferences are powered by deep learning applications.
RNN
Recurrent Neural Network (RNN) is a type of Neural Network where the output from the previous step is fed as input to
the current step. In traditional neural networks, all the inputs and outputs are independent of each other, but in cases when
it is required to predict the next word of a sentence, the previous words are required and hence there is a need to remember
the previous words. Thus, RNN came into existence, which solved this issue with the help of a Hidden Layer. The main
and most important feature of RNN is its Hidden state, which remembers some information about a sequence. The state is
also referred to as Memory State since it remembers the previous input to the network. It uses the same parameters for each
input as it performs the same task on all the inputs or hidden layers to produce the output. This reduces the complexity of
parameters, unlike other neural networks.
An RNN treats each word of a sentence as a separate input occurring at time 't' and also uses the activation value at 't-1'
as an input, in addition to the input at time 't'. This basic architecture is called a many-to-many architecture with Tx = Ty,
i.e. the number of inputs equals the number of outputs. Such a structure is quite useful in sequence modelling.
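A minimal numpy sketch of this recurrence, in which the hidden state at time 't' is computed from the input at time 't' and the
hidden state at time 't-1' (the weight shapes and the tanh non-linearity are standard choices shown here only for illustration):

    import numpy as np

    input_dim, hidden_dim = 8, 4            # toy sizes
    rng = np.random.default_rng(0)
    W_xh = rng.normal(size=(hidden_dim, input_dim))    # input-to-hidden weights
    W_hh = rng.normal(size=(hidden_dim, hidden_dim))   # hidden-to-hidden (recurrent) weights
    b_h = np.zeros(hidden_dim)

    def rnn_forward(inputs):
        """inputs: list of input vectors x_1 ... x_T, one per time step."""
        h = np.zeros(hidden_dim)            # initial hidden state
        states = []
        for x in inputs:
            # The same parameters are reused at every time step.
            h = np.tanh(W_xh @ x + W_hh @ h + b_h)
            states.append(h)
        return states

    sequence = [rng.normal(size=input_dim) for _ in range(3)]   # e.g. 3 word vectors
    print(len(rnn_forward(sequence)))       # one hidden state per input word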
Apart from the architecture mentioned above there are three other types of architectures of RNN which are commonly used.
1. Many to One RNN : Many to one architecture refers to an RNN architecture where many inputs (Tx) are used to
give one output (Ty). A suitable example for using such an architecture will be a classification task.
RNNs are a very important variant of neural networks, heavily used in Natural Language Processing.
Conceptually they differ from a standard neural network, as the input to an RNN is a word at a time instead of the entire
sample as in the case of a standard neural network. This gives the network the flexibility to work with varying lengths
of sentences, something which cannot be achieved in a standard neural network due to its fixed structure. It also provides
the additional advantage of sharing features learned across different positions of text, which cannot be obtained in a
standard neural network.
2. One to Many RNN: One to Many architecture refers to a situation where an RNN generates a series of output values
based on a single input value. A prime example of such an architecture is a music generation task, where the
input is a genre or the first note.
3. Many to Many Architecture (Tx not equal to Ty): This architecture refers to the case where many inputs are read to produce
many outputs, and the length of the inputs is not equal to the length of the outputs. A prime example of such an architecture
is a machine translation task.
Encoder refers to the part of the network which reads the sentence to be translated, and, Decoder is the part of the
network which translates the sentence into desired language.
Limitations of RNN
For all of its usefulness, the RNN does have certain limitations, the major ones being:
1. The RNN architectures described above can capture dependencies in only one direction of the language.
Basically, in the case of Natural Language Processing, they assume that the word coming after has no effect on the
meaning of the word coming before. From our experience with languages, we know that this is certainly not true.
2. RNNs are also not very good at capturing long-term dependencies, and the problem of vanishing gradients resurfaces
in RNNs.
RNNs are ideal for solving problems where the sequence is more important than the individual items themselves.
An RNN is essentially a fully connected neural network that contains a refactoring of some of its layers into a loop. That
loop is typically an iteration over the addition or concatenation of two inputs, a matrix multiplication and a non-linear
function.
Among the text usages, the following tasks are among those RNNs perform well at:
• Sequence labelling
• Natural Language Processing (NLP) text classification
• Natural Language Processing (NLP) text generation
Other tasks that RNNs are effective at solving are time series predictions or other sequence predictions that aren’t image or
tabular-based.
There have been several high-profile and controversial reports in the media about advances in text generation, most notably
OpenAI's GPT-2 model. In many cases the generated text is indistinguishable from text written by humans.
RNNs effectively have an internal memory that allows the previous inputs to affect the subsequent predictions. It’s much
easier to predict the next word in a sentence with more accuracy, if you know what the previous words were. Often with
tasks well suited to RNNs, the sequence of the items is as important as, or more important than, the items themselves.
Recurrent Neural Network (RNN) based sequence-to-sequence models have garnered a lot of traction ever since they
were introduced in 2014. Most of the data in the current world are in the form of sequences – it can be a number sequence,
text sequence, a video frame sequence or an audio sequence.
The performance of these seq2seq models was further enhanced with the addition of the Attention Mechanism in 2015.
How quickly advancements in NLP have been happening in the last 5 years – incredible!
These sequence-to-sequence models are pretty versatile and they are used in a variety of NLP tasks, such as:
• Machine Translation
• Text Summarization
• Speech Recognition
• Question-Answering System, and so on
TRANSFORMERS
Attention models/Transformers are the most exciting models being studied in NLP research today, but they can be a bit
challenging to grasp – the pedagogy is all over the place. This is both a bad thing (it can be confusing to hear different
versions) and in some ways a good thing (the field is rapidly evolving, there is a lot of space to improve).
Transformer
Internally, the Transformer has a similar kind of architecture as the previous models above. But the Transformer consists
of six encoders and six decoders.
All the encoders are very similar to each other and share the same architecture. The decoders share the same property, i.e.
they are also very similar to each other. Each encoder consists of two layers: a self-attention layer and a feed-forward neural
network.
The encoder’s inputs first flow through a self-attention layer. It helps the encoder look at other words in the input sentence
as it encodes a specific word. The decoder has both those layers, but between them is an attention layer that helps the
decoder focus on relevant parts of the input sentence.
Self-Attention
Let’s start to look at the various vectors/tensors and how they flow between these components to turn the input of a
trained model into an output. As is the case in NLP applications in general, we begin by turning each input word into a
vector using an embedding algorithm.
Each word is embedded into a vector of size 512. The embedding
only happens in the bottom-most encoder. The abstraction that is common to all the encoders is that they receive a list of
vectors, each of size 512.
In the bottom encoder that would be the word embeddings, but in other encoders, it would be the output of the encoder
that’s directly below. After embedding the words in our input sequence, each of them flows through each of the two layers
of the encoder.
Here we begin to see one key property of the Transformer, which is that the word in each position flows through its own
path in the encoder. There are dependencies between these paths in the self-attention layer. The feed-forward layer does
not have those dependencies, however, and thus the various paths can be executed in parallel while flowing through the
feed-forward layer.
Next, we’ll switch up the example to a shorter sentence and we’ll look at what happens in each sub-layer of the encoder.
Self-Attention
Let’s first look at how to calculate self-attention using vectors, then proceed to look at how it’s actually implemented —
using matrices.
Self-attention means figuring out the relations between the words within a sentence and giving the right amount of attention to each of them.
The first step in calculating self-attention is to create three vectors from each of the encoder’s input vectors (in this case,
the embedding of each word). So for each word, we create a Query vector, a Key vector, and a Value vector. These vectors
are created by multiplying the embedding by three matrices that we trained during the training process.
Notice that these new vectors are smaller in dimension than the embedding vector. Their dimensionality is 64, while the
embedding and encoder input/output vectors have dimensionality of 512. They don’t HAVE to be smaller, this is an
architecture choice to make the computation of multiheaded attention (mostly) constant.
Multiplying x1 by the WQ weight matrix produces q1, the “query” vector associated with that word. We end up creating a
“query”, a “key”, and a “value” projection of each word in the input sentence.
The second step in calculating self-attention is to calculate a score. Say we’re calculating the self-attention for the first
word in this example, “Thinking”. We need to score each word of the input sentence against this word. The score determines
how much focus to place on other parts of the input sentence as we encode a word at a certain position.
The score is calculated by taking the dot product of the query vector with the key vector of the respective word we’re
scoring. So if we’re processing the self-attention for the word in position #1, the first score would be the dot product of q1
and k1. The second score would be the dot product of q1 and k2.
The third and fourth steps are to divide the scores by 8 (the square root of the dimension of the key vectors used in this
example, which is 64; this leads to more stable gradients, and other values are possible, but this is the default), and
then pass the result through a softmax operation. Softmax normalizes the scores so they're all positive and add up to 1.
This softmax score determines how much each word will be expressed at this position. Clearly the word at this
position will have the highest softmax score, but sometimes it's useful to attend to another word that is relevant to the
current word.
The fifth step is to multiply each value vector by the softmax score (in preparation to sum them up). The intuition here is
to keep intact the values of the word(s) we want to focus on, and drown-out irrelevant words (by multiplying them by tiny
numbers like 0.001, for example). The sixth step is to sum up the weighted value vectors. This produces the output of the
self-attention layer at this position (for the first word).
That concludes the self-attention calculation. The resulting vector is one we can send along to the feed-forward neural
network. In the actual implementation, however, this calculation is done in matrix form for faster processing.
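A compact numpy sketch of the six steps above in matrix form (toy dimensions; the weight matrices WQ, WK and WV would
normally be learned during training and are random here purely for illustration):

    import numpy as np

    d_model, d_k, seq_len = 512, 64, 2     # embedding size, query/key size, number of words
    rng = np.random.default_rng(0)

    X = rng.normal(size=(seq_len, d_model))                    # one embedding per input word
    W_Q = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)
    W_K = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)
    W_V = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)

    Q, K, V = X @ W_Q, X @ W_K, X @ W_V              # step 1: query, key, value projections
    scores = Q @ K.T / np.sqrt(d_k)                  # steps 2-3: dot-product scores, scaled by sqrt(64) = 8
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability before the exponential
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)   # step 4: softmax per row
    Z = weights @ V                                  # steps 5-6: weighted sum of the value vectors

    print(weights.sum(axis=-1))   # each row of attention weights sums to 1
    print(Z.shape)                # (2, 64): one output vector per input word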
Multihead attention
There are a few other details that make them work better. For example, instead of only paying attention to each other in one
dimension, Transformers use the concept of Multihead attention. The idea behind it is that whenever you are translating a
word, you may pay different attention to each word based on the type of question that you are asking. The images below
show what that means. For example, whenever you are translating “kicked” in the sentence “I kicked the ball”, you may
ask “Who kicked”. Depending on the answer, the translation of the word to another language can change. Or ask other
questions, like “Did what?”, etc…
Positional Encoding
Another important step in the Transformer is to add a positional encoding when encoding each word. Encoding the
position of each word is important, since word order is relevant to the translation.
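These notes do not spell out which positional encoding is used; one common choice, taken from the original Transformer
paper, is the sinusoidal encoding sketched below, which is simply added element-wise to each word's embedding:

    import numpy as np

    def sinusoidal_positional_encoding(seq_len, d_model):
        """Sinusoidal positional encodings as in 'Attention Is All You Need'."""
        positions = np.arange(seq_len)[:, None]                # (seq_len, 1)
        dims = np.arange(d_model)[None, :]                     # (1, d_model)
        angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
        angles = positions * angle_rates
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles[:, 0::2])                  # even dimensions
        pe[:, 1::2] = np.cos(angles[:, 1::2])                  # odd dimensions
        return pe

    # Added element-wise to the word embeddings before the first encoder layer.
    print(sinusoidal_positional_encoding(seq_len=4, d_model=512).shape)   # (4, 512)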
TEXT SUMMARIZATION
Text summarization is the process of creating a concise and accurate representation of the main points and information in
a document. Topic modelling can help you generate summaries by extracting the most relevant and salient topics and words
from the document. Text summarization refers to the technique of shortening long pieces of text. The intention is to create
a coherent and fluent summary having only the main points outlined in the document.
Automatic text summarization is a common problem in machine learning and natural language processing (NLP). In
general, text summarization technique has proved to be critical in quickly and accurately summarizing voluminous texts,
something which could be expensive and time consuming if done without machines.
• Extraction-based summarization
The extractive text summarization technique involves pulling keyphrases from the source document and combining them
to make a summary. The extraction is made according to the defined metric without making any changes to the texts.
Here is an example:
Source text: Joseph and Mary rode on a donkey to attend the annual event in Jerusalem. In the city, Mary gave birth to
a child named Jesus.
Extractive summary: Joseph and Mary attend event Jerusalem. Mary birth Jesus.
As you can see above, certain words have been extracted from the source text and joined to create a summary, although
sometimes the summary can be grammatically strange.
• Abstraction-based summarization
The abstraction technique entails paraphrasing and shortening parts of the source document. When abstraction is applied
for text summarization in deep learning problems, it can overcome the grammar inconsistencies of the extractive method.
The abstractive text summarization algorithms create new phrases and sentences that relay the most useful information from
the original text — just like humans do.
Therefore, abstraction performs better than extraction. However, the text summarization algorithms required to do
abstraction are more difficult to develop; that’s why the use of extraction is still popular.
Here is an example:
Abstractive summary: Joseph and Mary came to Jerusalem where Jesus was born.
Typically, here is how using the extraction-based approach to summarize texts can work:
1. Introduce a method to extract the merited keyphrases from the source document. For example, you can use part-
of-speech tagging, word sequences, or other linguistic patterns to identify the keyphrases.
2. Gather text documents with positively-labeled keyphrases. The keyphrases should be compatible with the stipulated
extraction technique. To increase accuracy, you can also create negatively-labeled keyphrases.
3. Train a binary machine learning classifier to make the text summarization. Some of the features you can use
include:
• Length of the keyphrase
• Frequency of the keyphrase
• The most recurring word in the keyphrase
• Number of characters in the keyphrase
4. Finally, in the test phase, create all the keyphrase words and sentences and carry out classification for them.
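As a much-simplified illustration of the extractive idea (a frequency-based heuristic, not the supervised keyphrase classifier
described in the steps above), sentences can be scored by the frequency of the words they contain and the highest-scoring ones kept:

    import re
    from collections import Counter

    def extractive_summary(text, num_sentences=1):
        # Split into sentences and words (very naive tokenization, for illustration only).
        sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]
        words = re.findall(r"[a-z']+", text.lower())
        freq = Counter(words)
        # Score each sentence by the total frequency of the words it contains.
        scored = sorted(sentences,
                        key=lambda s: sum(freq[w] for w in re.findall(r"[a-z']+", s.lower())),
                        reverse=True)
        return ". ".join(scored[:num_sentences]) + "."

    source = ("Joseph and Mary rode on a donkey to attend the annual event in Jerusalem. "
              "In the city, Mary gave birth to a child named Jesus.")
    print(extractive_summary(source))   # keeps the sentence with the highest total word frequency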
Topic Modeling
Topic modeling is a technique that can help you discover the main themes and concepts in a large collection of text
documents. It can also help you summarize, classify, or cluster the documents based on their topics. In this section, you will
learn how to use topic modeling for these tasks and which common algorithms and tools you can apply.
Topic modeling is a form of unsupervised learning that aims to find hidden patterns and structures in the text data. It assumes
that each document is composed of a mixture of topics, and each topic is a distribution of words that represent a specific
subject or idea. For example, a document about sports might have topics such as soccer, basketball, and fitness. Topic
modeling can help you identify these topics and their proportions in each document. Topic modeling can help you generate
summaries by extracting the most relevant and salient topics and words from the document for text summarization. You can
then use these topics and words to construct a summary that captures the essence and meaning of the document.
Topic modeling is a collection of text-mining techniques that uses statistical and machine learning models to automatically
discover hidden abstract topics in a collection of documents.
Topic modeling is also an amalgamation of unsupervised techniques that can detect word and phrase
patterns within documents and automatically cluster word groups and similar expressions, helping to best represent a
set of documents.
There are many cases where humans or machines generate a huge amount of text over time, and it is neither prudent nor possible
to go through the entire text to gain an understanding of what is important or to form an opinion of the entire process
that generated the data.
In such cases, NLP algorithms and in particular topic modeling are useful to extract a summary of the underlying text and
discover important contexts from the text.
Topic modeling is the method of extracting needed attributes from a bag of words. This is critical because each word in the
corpus is treated as a feature in NLP. As a result, feature reduction allows us to focus on the relevant material rather than
wasting time sifting through all of the data's text.
There are many different topic modeling algorithms and tools available for text analysis projects. Popular methods include
Latent Dirichlet Allocation (LDA), Non-Negative Matrix Factorization (NMF), and Latent Semantic Analysis (LSA).
Common tools used to apply these algorithms include Gensim, a Python library providing implementations of LDA, NMF,
and other topic modeling methods; Scikit-learn, a Python library providing implementations of NMF, LSA, and other
machine learning methods; and MALLET, a Java-based toolkit providing implementations of LDA and other topic
modeling methods. These tools offer various utilities and functionalities for preprocessing, evaluation, visualization, data
manipulation, feature extraction, model selection, and performance metrics.
To infer topics from unstructured data, topic modeling involves counting words and grouping similar word patterns.
Suppose we are a software firm interested in learning what consumers have to say about specific elements of our product;
we could use a topic modeling algorithm to examine our comments instead of spending hours trying to figure out
which messages are talking about our topics of interest.
A topic model groups feedback that is comparable, as well as phrases and expressions that appear most frequently, by
recognizing patterns such as word frequency and distance between words. We may rapidly infer what each group of texts
is about using this information.
Five algorithms in particular are commonly used for topic modeling. We are going to learn about these methods, taking help
from OpenGenus.
Latent Dirichlet Allocation (LDA)
The statistical and graphical concept of Latent Dirichlet Allocation is used to find correlations between many documents in
a corpus. The maximum likelihood estimate over the entire corpus of text is obtained using the Variational Expectation
Maximization (VEM) technique. Traditionally, topics were obtained by simply selecting the top few words from a bag of
words; such a selection, however, carries little semantic meaning. Under this approach, each document can be represented by
a probabilistic distribution over topics, and each topic can be defined by a probabilistic distribution over words. As a result,
we get a much clearer picture of how the topics are related.
The extraction of information from a corpus is therefore made straightforward. In the model's three-level hierarchy, the
upper level represents the documents, the middle level represents the generated topics, and the bottom level represents the
words. As a result, the rule indicates that a text is represented as a distribution over topics, and topics are described as a
distribution over words.
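A minimal sketch of fitting an LDA model with gensim (the toy documents and the choice of two topics are illustrative
assumptions):

    # pip install gensim
    from gensim import corpora
    from gensim.models import LdaModel

    documents = [["football", "game", "team", "score"],
                 ["election", "vote", "senate", "policy"],
                 ["basketball", "team", "player", "game"]]

    dictionary = corpora.Dictionary(documents)
    bow_corpus = [dictionary.doc2bow(doc) for doc in documents]

    # Fit an LDA model with two topics on the bag-of-words corpus.
    lda = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=2, passes=10, random_state=0)

    for topic_id, words in lda.print_topics(num_words=4):
        print(topic_id, words)            # each topic is a distribution over words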
Non-Negative Matrix Factorization (NMF)
NMF is a matrix factorization method that ensures the elements of the factorized matrices are non-negative. Consider the
document-term matrix produced after deleting stopwords from a corpus. This matrix can be factored into two matrices: a
term-topic matrix and a topic-document matrix.
Matrix factorization may be accomplished using a variety of optimization methods. NMF can be performed more quickly
and effectively using Hierarchical Alternating Least Squares, where the factorization proceeds by updating one
column at a time while leaving the other columns unchanged.
Latent Semantic Analysis (LSA)
Latent Semantic Analysis is another unsupervised learning approach for extracting relationships between words in a large
number of documents. This assists us in selecting the appropriate documents. It essentially serves as a dimensionality
reduction tool for a massive corpus of text data, since extraneous data adds noise to the process of extracting the proper
insights from the data.
Partially Labeled Dirichlet Allocation (PLDA)
Partially Labeled Dirichlet Allocation assumes that there are a total of n labels, each of
which is associated with a different topic in the corpus.
Then, similar to LDA, the individual topics are represented as probability distributions over the entire corpus.
Optionally, each document might be allocated a global topic, resulting in l global topics, where l is the number of
individual documents in the corpus.
The technique also assumes that every topic in the corpus has just one label. In comparison to the other approaches, this
procedure is very fast and precise because the labels are supplied before the model is created.
Pachinko Allocation Model (PAM)
The Pachinko Allocation Model (PAM) is a more advanced version of the Latent Dirichlet Allocation model. The LDA
model identifies topics based on thematic correlations between words in the corpus, bringing out the correlations between
words. PAM, in contrast, also models the correlations between the generated topics. Because it additionally
considers the links between topics, this model is better at determining semantic relationships precisely. The model is named
after Pachinko, a popular Japanese game. To explore the associations between topics, the model uses
Directed Acyclic Graphs (DAGs).
***************