Text and Speech Analysis Notes CCS369-UNIT 1
Reshma/AP/CSE
INTRODUCTION
Artificial intelligence (AI) integration has revolutionized various industries, and now it is transforming the realm of human
behavior research. This integration marks a significant milestone in the data collection and analysis endeavors, enabling
users to unlock deeper insights from spoken language and empower researchers and analysts with enhanced capabilities for
understanding and interpreting human communication. Human interactions are a critical part of many organizations. Many
organizations analyze speech or text via natural language processing (NLP) and link them to insights and automation such
as text categorization, text classification, information extraction, etc.
In business intelligence, speech and text analytics enable us to gain insights into customer-agent conversations through sentiment analysis and topic trends. These insights highlight areas of improvement, recognition, and concern, helping organizations better understand and serve customers and employees. Speech and text analytics is a set of features that uses natural language processing (NLP) to provide automated analysis of the content of 100% of interactions: it includes transcribing voice interactions, analysing customer sentiment, spotting topics, and creating meaning from otherwise unstructured data.
FOUNDATIONS OF NATURAL LANGUAGE PROCESSING
Natural Language Processing (NLP) is the process of analyzing and producing meaningful phrases and sentences in natural language. NLP comprises Natural Language Understanding (NLU) and Natural Language Generation (NLG). NLU maps natural language input into useful representations and supports tasks such as information extraction, information retrieval, and sentiment analysis, while NLG generates natural language output from data. NLP can be thought of as an intersection of Linguistics, Computer Science, and Artificial Intelligence that helps computers understand, interpret, and manipulate human language.
Building an NLP system typically involves two broad stages:
1. Data Preprocessing
2. Algorithm Development
In Natural Language Processing, machine learning training algorithms study millions of examples of text — words,
sentences, and paragraphs — written by humans. By studying the samples, the training algorithms gain an understanding of
the “context” of human speech, writing, and other modes of communication. This training helps NLP software to
differentiate between the meanings of various texts. The five phases of NLP involve lexical (structure) analysis, parsing,
semantic analysis, discourse integration, and pragmatic analysis. Some well-known application areas of NLP are Optical
Character Recognition (OCR), Speech Recognition, Machine Translation, and Chatbots.
The first phase of NLP is word structure analysis, which is referred to as lexical or morphological analysis. A lexicon is
defined as a collection of words and phrases in a given language, with the analysis of this collection being the process of
splitting the lexicon into components, based on what the user sets as parameters – paragraphs, phrases, words, or characters.
Similarly, morphological analysis is the process of identifying the morphemes of a word. A morpheme is a basic unit of language construction: a small element of a word that carries meaning. Morphemes can be either free (e.g. walk) or bound (e.g. -ing, -ed); the difference between the two is that the latter cannot stand on its own to produce a word with meaning and must be attached to a free morpheme to carry meaning.
In search engine optimization (SEO), lexical or morphological analysis helps guide web searching. For instance, when doing
on-page analysis, you can perform lexical and morphological analysis to understand how often the target keywords are used
in their core form (as free morphemes, or when in composition with bound morphemes). This type of analysis can ensure
that you have an accurate understanding of the different variations of the morphemes that are used. Morphological analysis can also be applied in transcription and translation projects, and can therefore be very useful in content repurposing, international SEO, and linguistic analysis.
Syntax Analysis is the second phase of natural language processing. Syntax analysis or parsing is the process of checking
grammar, word arrangement, and overall – the identification of relationships between words and whether those make sense.
The process involves examining all words and phrases in a sentence and the structures between them.
As part of the process, a visualisation of the syntactic relationships is built, referred to as a syntax tree (similar to a knowledge graph). This process checks that the structure, order, and grammar of sentences make sense, given the words and phrases that make up those sentences. Syntax analysis also involves tagging words and phrases with POS tags. There are two common approaches to constructing the syntax tree – top-down and bottom-up; both check for valid sentence formation and reject the input otherwise.
Semantic analysis is the third stage in NLP, when an analysis is performed to understand the meaning in a statement. This
type of analysis is focused on uncovering the definitions of words, phrases, and sentences and identifying whether the way
words are organized in a sentence makes sense semantically.
This task is performed by mapping the syntactic structure, and checking for logic in the presented relationships between
entities, words, phrases, and sentences in the text. There are a couple of important functions of semantic analysis, which
allow for natural language understanding:
• To ensure that the data types are used in a way that’s consistent with their definition.
• To ensure that the flow of the text is consistent.
• Identification of synonyms, antonyms, homonyms, and other lexical items.
• Overall word sense disambiguation.
• Relationship extraction from the different entities identified from the text.
There are several things you can utilise semantic analysis for in SEO. Here are some examples:
• Topic modeling and classification – sort your page content into topics (predefined or modelled by an algorithm).
You can then use this for ML-enabled internal linking, where you link pages together on your website using the
identified topics. Topic modeling can also be used for classifying first-party collected data such as customer service
tickets, or feedback users left on your articles or videos in free form (i.e. comments).
• Entity analysis, sentiment analysis, and intent classification – You can use this type of analysis to perform
sentiment analysis and identify intent expressed in the content analysed. Entity identification and sentiment analysis are separate tasks; both can be done on things like keywords, titles, meta descriptions, and page content, but they work best when analysing data like comments, feedback forms, or customer service or social media interactions. Intent classification can be done on user queries (in keyword research or traffic analysis), but can also be done when analysing customer service interactions.
Phase IV: Discourse integration
Discourse integration is the fourth phase in NLP, and simply means contextualisation. Discourse integration is the analysis
and identification of the larger context for any smaller part of natural language structure (e.g. a phrase, word or sentence).
During this phase, it’s important to ensure that each phrase, word, and entity mentioned are mentioned within the appropriate
context. This analysis involves considering not only sentence structure and semantics, but also sentence combination and
meaning of the text as a whole. In other words, when analyzing the structure of a text, sentences are broken up and analyzed individually, but they are also considered in the context of the sentences that precede and follow them and the impact they have on the overall structure of the text. Some common tasks in this phase include information extraction, conversation analysis, text summarisation, and discourse analysis.
Here are some complexities of natural language understanding introduced during this phase:
• Understanding of the expressed motivations within the text, and its underlying meaning.
• Understanding of the relationships between entities and topics mentioned, thematic understanding, and interactions
analysis.
• Understanding the social and historical context of entities mentioned.
Discourse integration and analysis can be used in SEO to ensure that appropriate tense is used, that the relationships
expressed in the text make logical sense, and that there is overall coherency in the text analysed. This can be especially
useful for programmatic SEO initiatives or text generation at scale. The analysis can also be used as part of international
SEO localization, translation, or transcription tasks on big corpuses of data.
There are some research efforts to incorporate discourse analysis into systems that detect hate speech (or in the SEO space
for things like content and comment moderation), with this technology being aimed at uncovering intention behind text by
aligning the expression with meaning derived from other texts. This means that, theoretically, discourse analysis can also be used for modeling user intent (e.g. search intent or purchase intent) and detecting such notions in texts.
Phase V: Pragmatic analysis
Pragmatic analysis is the fifth and final phase of natural language processing. As the final stage, pragmatic analysis
extrapolates and incorporates the learnings from all other, preceding phases of NLP. Pragmatic analysis involves the process
of abstracting or extracting meaning from the use of language, and translating a text, using the gathered knowledge from all
other NLP steps performed beforehand.
Here are some complexities that are introduced during this phase:
• Information extraction, enabling advanced text-understanding functions such as question answering.
• Meaning extraction, which allows for programs to break down definitions or documentation into a more accessible
language.
• Understanding of the meaning of the words, and context, in which they are used, which enables conversational
functions between machine and human (e.g. chatbots).
Pragmatic analysis has multiple applications in SEO. One of the most straightforward ones is programmatic SEO and
automated content generation. This type of analysis can also be used for generating FAQ sections on your product, using
textual analysis of product documentation, or even capitalizing on the ‘People Also Ask’ featured snippets by adding an
automatically-generated FAQ section for each page you produce on your site.
LANGUAGE SYNTAX AND STRUCTURE
For any language, syntax and structure usually go hand in hand: a set of specific rules, conventions, and principles governs the way words are combined into phrases, phrases are combined into clauses, and clauses are combined into sentences. We will be talking specifically about English syntax and structure in this section. In English, words usually combine to form other constituent units. These constituents include words, phrases, clauses, and sentences. Consider the sentence “The brown fox is quick and he is jumping over the lazy dog”: it is made up of a bunch of words, and just looking at the words by themselves does not tell us much.
Part of speech (POS) tags identify the lexical category of each word, and each POS tag, such as the noun (N), can be further subdivided into categories like singular nouns (NN), singular proper nouns (NNP), and plural nouns (NNS).
The process of classifying words and labeling them with POS tags is called parts of speech tagging, or POS tagging. POS tags are used to annotate words and depict their parts of speech, which is really helpful for specific analyses, such as narrowing down on nouns to see which ones are the most prominent, word sense disambiguation, and grammar analysis.
Let us consider both nltk and spacy which usually use the Penn Treebank notation for POS tagging. NLTK and spaCy are
two of the most popular Natural Language Processing (NLP) tools available in Python. You can build chatbots, automatic
summarizers, and entity extraction engines with either of these libraries. While both can theoretically accomplish any NLP
task, each one excels in certain scenarios. The Penn Treebank, or PTB for short, is a dataset maintained by the University
of Pennsylvania.
nltk_pos_tagged = nltk.pos_tag(sentence.split())
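The line above assumes that nltk has been imported and that a sentence variable holds the text to tag. Below is a minimal, self-contained sketch (assuming nltk and spaCy are installed, along with spaCy's en_core_web_sm model, which is an assumption of this example) that tags the example sentence from this unit with both libraries and prints Penn Treebank tags.

import nltk
import spacy

# nltk.download('averaged_perceptron_tagger')  # download the tagger data once if it is missing

sentence = "The brown fox is quick and he is jumping over the lazy dog"

# nltk: split on whitespace and tag with Penn Treebank POS tags
nltk_pos_tagged = nltk.pos_tag(sentence.split())
print(nltk_pos_tagged)   # e.g. [('The', 'DT'), ('brown', 'JJ'), ('fox', 'NN'), ('is', 'VBZ'), ...]

# spaCy: load the small English model and read token.tag_ (also Penn Treebank tags)
nlp = spacy.load("en_core_web_sm")
doc = nlp(sentence)
print([(token.text, token.tag_) for token in doc])

Because both libraries use the Penn Treebank tag set, tags such as DT, JJ, NN, and VBZ should line up between the two outputs.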
Words combine into constituent phrases, each with a head word belonging to a particular lexical category. Besides the noun phrase (NP), which has a noun or pronoun as its head word, common phrase types include the following:
• Verb phrase (VP): These phrases are lexical units that have a verb acting as the head word. Usually, there are two forms of verb phrases: one consists of just the verb components, while the other also contains other entities such as nouns, adjectives, or adverbs as parts of the object.
• Adjective phrase (ADJP): These are phrases with an adjective as the head word. Their main role is to describe or
qualify nouns and pronouns in a sentence, and they will be either placed before or after the noun or pronoun.
• Adverb phrase (ADVP): These phrases act like adverbs since the adverb acts as the head word in the phrase.
Adverb phrases are used as modifiers for nouns, verbs, or adverbs themselves by providing further details that
describe or qualify them.
• Prepositional phrase (PP): These phrases usually contain a preposition as the head word and other lexical
components like nouns, pronouns, and so on. These act like an adjective or adverb describing other words or phrases.
Shallow parsing, also known as light parsing or chunking, is a popular natural language processing technique for analyzing the structure of a sentence: the sentence is broken down into its smallest constituents (tokens such as words), which are then grouped together into higher-level phrases. The output includes the POS tags as well as the phrases (chunks) of a sentence.
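As a rough illustration of shallow parsing, the sketch below chunks the example sentence into noun-phrase groups using nltk's RegexpParser; the one-rule chunk grammar is an illustrative assumption, not a complete grammar.

import nltk

# nltk.download('averaged_perceptron_tagger')  # tagger data, if not already present

sentence = "The brown fox is quick and he is jumping over the lazy dog"
tagged = nltk.pos_tag(sentence.split())

# One illustrative chunk rule: a noun phrase (NP) is an optional determiner,
# any number of adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN.*>}"
chunker = nltk.RegexpParser(grammar)

# The result is a shallow tree: NP chunks plus the remaining POS-tagged tokens
chunk_tree = chunker.parse(tagged)
print(chunk_tree)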
Constituency Parsing
Constituent-based grammars are used to analyze and determine the constituents of a sentence. These grammars can be used
to model or represent the internal structure of sentences in terms of a hierarchically ordered structure of their constituents.
Each and every word usually belongs to a specific lexical category and forms the head word of different phrases.
These phrases are formed based on rules called phrase structure rules.
Phrase structure rules form the core of constituency grammars, because they talk about syntax and rules that govern the
hierarchy and ordering of the various constituents in the sentences. These rules cater to two things primarily.
• They determine what words are used to construct the phrases or constituents.
• They determine how we need to order these constituents together.
The generic representation of a phrase structure rule is S → AB , which depicts that the structure S consists of
constituents A and B , and the ordering is A followed by B . While there are several rules (refer to Chapter 1, Page 19: Text
Analytics with Python, if you want to dive deeper), the most important rule describes how to divide a sentence or a clause.
The phrase structure rule denotes a binary division for a sentence or a clause as S → NP VP where S is the sentence or
clause, and it is divided into the subject, denoted by the noun phrase (NP) and the predicate, denoted by the verb phrase
(VP).
A constituency parser can be built based on such grammars/rules, which are usually collectively available as a context-free grammar (CFG) or phrase structure grammar. The parser processes input sentences according to these rules and helps build a parse tree.
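A minimal sketch of a constituency parser built from phrase structure rules, using nltk's CFG and chart parser. The toy grammar below is an assumption written only for the clause "the brown fox is quick"; it is not a grammar of English.

import nltk

# A toy set of phrase structure rules, including the key rule S -> NP VP
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det Adj N
VP -> V Adj
Det -> 'the'
Adj -> 'brown' | 'quick'
N -> 'fox'
V -> 'is'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse(['the', 'brown', 'fox', 'is', 'quick']):
    print(tree)   # (S (NP (Det the) (Adj brown) (N fox)) (VP (V is) (Adj quick)))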
Dependency Parsing
In dependency parsing, we try to use dependency-based grammars to analyze and infer both structure and semantic
dependencies and relationships between tokens in a sentence. The basic principle behind a dependency grammar is that in
any sentence in the language, all words except one, have some relationship or dependency on other words in the sentence.
The word that has no dependency is called the root of the sentence. The verb is taken as the root of the sentence in most
cases. All the other words are directly or indirectly linked to the root verb using links, which are the dependencies.
Considering the sentence “The brown fox is quick and he is jumping over the lazy dog”, if we wanted to draw the dependency syntax tree for it, we would obtain a structure with the root verb at the top and every other word attached to it, directly or indirectly, through labelled dependency links.
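A small sketch of dependency parsing with spaCy (assuming the en_core_web_sm model is installed, which is an assumption of this example): each token is printed with its dependency label and its head, and the root token, which points to itself, is identified.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The brown fox is quick and he is jumping over the lazy dog")

# Every token is linked to a head token with a dependency label;
# the root of the sentence is the token that is its own head
for token in doc:
    print(f"{token.text:<8} --{token.dep_}--> {token.head.text}")

print("root:", [token.text for token in doc if token.head == token])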
Stemming is a lot faster than lemmatization; hence, whether to stem or to lemmatize depends on whether the application needs quick pre-processing or more accurate base forms.
TOKENIZATION
Tokenization is a common task in Natural Language Processing (NLP). It’s a fundamental step in both traditional NLP
methods like Count Vectorizer and Advanced Deep Learning-based architectures like Transformers. As tokens are the
building blocks of Natural Language, the most common way of processing the raw text happens at the token level.
Tokenization is a way of separating a piece of text into smaller units called tokens. Here, tokens can be either words,
characters, or subwords. Hence, tokenization can be broadly classified into 3 types – word, character, and subword (n-gram
characters) tokenization.
The most common way of forming tokens is based on space. For example, take the sentence “Never give up”. Assuming space as a delimiter, tokenization results in 3 tokens: Never, give, up. As each token is a word, this becomes an example of word tokenization.
Similarly, tokens can be either characters or subwords. For example, consider the word “smarter”: character tokenization yields s-m-a-r-t-e-r, while subword tokenization might yield smart-er.
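The sketch below illustrates the three tokenization levels on the examples above; the subword split smart + er is hand-picked for illustration, since real subword tokenizers learn their pieces from a corpus.

# Word tokenization: split the sentence on whitespace
sentence = "Never give up"
word_tokens = sentence.split()          # ['Never', 'give', 'up']

# Character tokenization: every character of a word becomes a token
char_tokens = list("smarter")           # ['s', 'm', 'a', 'r', 't', 'e', 'r']

# Subword tokenization: split a word into frequently occurring pieces.
# The split point below is hand-picked; real subword tokenizers
# (such as BPE, described later in this unit) learn the pieces from a corpus.
subword_tokens = ["smart", "er"]

print(word_tokens, char_tokens, subword_tokens)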
Here, tokenization is performed on the corpus to obtain tokens, and these tokens are then used to prepare a vocabulary.
Vocabulary refers to the set of unique tokens in the corpus. Remember that vocabulary can be constructed by considering
each unique token in the corpus or by considering the top K Frequently Occurring Words.
Creating Vocabulary is the ultimate goal of Tokenization.
One of the simplest hacks to boost the performance of the NLP model is to create a vocabulary out of top K frequently
occurring words.
Now, let’s understand the usage of the vocabulary in Traditional and Advanced Deep Learning-based NLP methods.
• Traditional NLP approaches such as Count Vectorizer and TF-IDF use the vocabulary as features: each word in the vocabulary is treated as a unique feature.
• In advanced Deep Learning-based NLP architectures, the vocabulary is used to create the tokenized input sentences; the tokens of these sentences are then passed as inputs to the model.
As discussed earlier, tokenization can be performed on word, character, or subword level. It’s a common question – which
Tokenization should we use while solving an NLP task? Let’s address this question here.
Word Tokenization
Word tokenization is the most commonly used tokenization algorithm. It splits a piece of text into individual words based on a certain delimiter; depending upon the delimiter, different word-level tokens are formed. Pretrained word embeddings such as Word2Vec and GloVe operate on word-level tokens.
A popular subword tokenization algorithm is Byte Pair Encoding (BPE), which builds the vocabulary iteratively:
1. Split the words in the corpus into characters after appending </w>
2. Initialize the vocabulary with unique characters in the corpus
3. Compute the frequency of a pair of characters or character sequences in corpus
4. Merge the most frequent pair in corpus
5. Save the best pair to the vocabulary
6. Repeat steps 3 to 5 for a certain number of iterations
1a) Append the end-of-word symbol (say </w>) to every word in the corpus.
Each iteration then repeats steps 3–5: compute the frequency of every adjacent character pair (or character sequence) in the corpus, merge the most frequent pair, and save the merged pair to the vocabulary. (The step-by-step frequency tables for Iteration 1 and Iteration 2 are not reproduced here.)
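A minimal sketch of the BPE merge loop described in steps 1–6 above, using a small toy corpus whose word frequencies are illustrative assumptions: it repeatedly counts adjacent symbol pairs, merges the most frequent pair, and records the merge.

from collections import Counter

# Toy corpus: word -> frequency (illustrative assumption)
corpus = {"low</w>": 5, "lower</w>": 2, "newest</w>": 6, "widest</w>": 3}

# Steps 1-2: represent every word as a sequence of characters plus </w>
vocab = {tuple(word.replace("</w>", "")) + ("</w>",): freq for word, freq in corpus.items()}

def most_frequent_pair(vocab):
    """Step 3: count the frequency of every adjacent symbol pair."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(pair, vocab):
    """Steps 4-5: merge the best pair everywhere it occurs."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Step 6: repeat for a certain number of iterations (5 here, an arbitrary choice)
for _ in range(5):
    best = most_frequent_pair(vocab)
    vocab = merge_pair(best, vocab)
    print("merged:", best)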
STEMMING
Stemming is the process of reducing the morphological variants of a word to a common root/base word (the stem). Stemming programs are commonly referred to as stemming algorithms or stemmers. A stemming algorithm reduces the words “chocolates”, “chocolatey”, and “choco” to the root word “chocolate”, and “retrieval”, “retrieved”, “retrieves” to the stem “retrieve”. Stemming is
an important part of the pipelining process in Natural language processing. The input to the stemmer is tokenized words.
How do we get these tokenized words? Well, tokenization involves breaking down the document into different words.
Stemming is a natural language processing technique that is used to reduce words to their base form, also known as the root
form. The process of stemming is used to normalize text and make it easier to process. It is an important step in text pre-
processing, and it is commonly used in information retrieval and text mining applications. There are several different
algorithms for stemming as follows:
• Porter stemmer
• Snowball stemmer
• Lancaster stemmer.
The Porter stemmer is the most widely used algorithm, and it is based on a set of heuristics that are used to remove common
suffixes from words. The Snowball stemmer is a more advanced algorithm that is based on the Porter stemmer, but it also
supports several other languages in addition to English. The Lancaster stemmer is a more aggressive stemmer and it is less
accurate than the Porter stemmer and Snowball stemmer.
Stemming can be useful for several natural language processing tasks such as text classification, information retrieval, and
text summarization. However, stemming can also have some negative effects such as reducing the readability of the text,
and it may not always produce the correct root form of a word. It is important to note that stemming is different from
Lemmatization. Lemmatization is the process of reducing a word to its base form, but unlike stemming, it takes into account
the context of the word, and it produces a valid word, unlike stemming which can produce a non-word as the root form.
Some more examples of words that stem to the root word "like" include:
->"likes"
->"liked"
->"likely"
->"liking"
Errors in Stemming:
There are mainly two errors in stemming –
• over-stemming
• under-stemming
Over-stemming occurs when two words with different meanings or different true stems are reduced to the same root; it can be regarded as a false positive. Over-stemming is a problem that can occur when using stemming algorithms in natural
language processing. It refers to the situation where a stemmer produces a root form that is not a valid word or is not the
correct root form of a word. This can happen when the stemmer is too aggressive in removing suffixes or when it does not
consider the context of the word.
Over-stemming can lead to a loss of meaning and make the text less readable. For example, the word “arguing” may be
stemmed to “argu,” which is not a valid word and does not convey the same meaning as the original word. Similarly, the
word “running” may be stemmed to “run,” which is the base form of the word but it does not convey the meaning of the
original word.
To avoid over-stemming, it is important to use a stemmer that is appropriate for the task and language. It is also important
to test the stemmer on a sample of text to ensure that it is producing valid root forms. In some cases, using a lemmatizer
instead of a stemmer may be a better solution as it takes into account the context of the word, making it less prone to errors.
Another approach to this problem is to use techniques like semantic role labeling, sentiment analysis, context-based
information, etc. that help to understand the context of the text and make the stemming process more precise.
Under-stemming occurs when two words that should be reduced to the same root end up with different stems; it can be interpreted as a false negative. Under-stemming is a problem that can occur when using stemming algorithms in
natural language processing. It refers to the situation where a stemmer does not produce the correct root form of a word or
does not reduce a word to its base form. This can happen when the stemmer is not aggressive enough in removing suffixes
or when it is not designed for the specific task or language.
Under-stemming can lead to a loss of information and make it more difficult to analyze text. For example, “arguing” may be reduced to “argu” while “argument” is left unchanged, so two closely related words are not mapped to the same stem. Similarly, “running” may be reduced to “run” while “runner” is left as “runner”, even though both relate to the same base concept.
To avoid under-stemming, it is important to use a stemmer that is appropriate for the task and language. It is also important
to test the stemmer on a sample of text to ensure that it is producing the correct root forms. In some cases, using a lemmatizer
instead of a stemmer may be a better solution as it takes into account the context of the word, making it less prone to errors.
Another approach to this problem is to use techniques like semantic role labeling, sentiment analysis, context-based
information, etc. that help to understand the context of the text and make the stemming process more precise.
Applications of stemming:
Stemming is used in information retrieval systems such as search engines and to determine domain vocabularies in domain analysis. By indexing stemmed forms, search results can still be retrieved as documents evolve, and documents can be mapped to common subjects. Sentiment analysis, which examines reviews and comments made by different users, is frequently used for product analysis, such as for online retail stores; stemming is applied as a text-preparation step before the text is interpreted.
Document clustering (also known as text clustering) is a method of group analysis applied to textual materials. Important uses of it include topic extraction, automatic document organization, and quick information retrieval.
Fun Fact: Google search adopted word stemming in 2003. Previously a search for “fish” would not have returned “fishing” or “fishes”.
Some Stemming algorithms are:
Porter’s Stemmer algorithm
It is one of the most popular stemming methods proposed in 1980. It is based on the idea that the suffixes in the English
language are made up of a combination of smaller and simpler suffixes. This stemmer is known for its speed and simplicity.
The main applications of the Porter stemmer include data mining and information retrieval; however, it is limited to English words. Also, a group of related words is mapped onto the same stem, and the output stem is not necessarily a meaningful word. The algorithm is fairly lengthy and is one of the oldest stemmers.
Example: EED -> EE means “if the word has at least one vowel and consonant plus EED ending, change the ending
to EE” as ‘agreed’ becomes ‘agree’.
Advantage: It produces the best output compared to other stemmers and has a lower error rate.
Limitation: Morphological variants produced are not always real words.
Lovins Stemmer
Proposed by Lovins in 1968, it removes the longest suffix from a word; the stem is then recoded to convert it into a valid word.
Example: sitting -> sitt -> sit
Advantage: It is fast and handles irregular plurals like 'teeth' and 'tooth' etc.
Limitation: It is time consuming and frequently fails to form valid words from the stem.
Dawson Stemmer
It is an extension of Lovins stemmer in which suffixes are stored in the reversed order indexed by their length and last letter.
Advantage: It is fast in execution and covers more suffixes.
Limitation: It is very complex to implement.
Krovetz Stemmer
It was proposed in 1993 by Robert Krovetz. Following are the steps:
1) Convert the plural form of a word to its singular form.
2) Convert the past tense of a word to its present tense and remove the suffix ‘ing’.
Example: ‘children’ -> ‘child’
Advantage: It is light in nature and can be used as pre-stemmer for other stemmers.
Limitation: It is inefficient in case of large documents.
Xerox Stemmer
It uses a lexical database to map inflected and irregular word forms to valid base words, as the examples below show.
Example:
‘children’ -> ‘child’
‘understood’ -> ‘understand’
‘whom’ -> ‘who’
‘best’ -> ‘good’
N-Gram Stemmer
An n-gram is a set of n consecutive characters extracted from a word in which similar words will have a high proportion of
n-grams in common.
Example: ‘INTRODUCTIONS’ for n=2 becomes : *I, IN, NT, TR, RO, OD, DU, UC, CT, TI, IO, ON, NS, S*
Advantage: It is based on simple string comparisons and is largely language independent.
Limitation: It requires space to create and index the n-grams and it is not time efficient.
Snowball Stemmer:
When compared to the Porter stemmer, the Snowball stemmer can also map non-English words; since it supports other languages, it can be called a multi-lingual stemmer. The Snowball stemmers are also imported from the nltk package. This stemmer is based on a string-processing language called ‘Snowball’ and is one of the most widely used stemmers. The Snowball stemmer is more aggressive than the Porter stemmer and is also referred to as the Porter2 stemmer. Because of the improvements added over the Porter stemmer, the Snowball stemmer has greater computational speed.
Lancaster Stemmer:
The Lancaster stemmer is more aggressive and dynamic compared to the other two stemmers. It is faster, but the algorithm can be confusing when dealing with small words, and it is not as efficient as the Snowball stemmer. The Lancaster stemmer saves its rules externally and basically uses an iterative algorithm. It is straightforward, although it often produces results with excessive stemming; over-stemming renders stems non-linguistic or meaningless.
LEMMATIZATION
Lemmatization is a text pre-processing technique used in natural language processing (NLP) models to reduce a word to its root meaning so that similar words can be identified. For example, a lemmatization algorithm would reduce the word better to its root word, or lemma, good. In stemming, a part of the word is simply chopped off at the tail end to arrive at the stem of the word.
There are different algorithms used to find out how many characters have to be chopped off, but the algorithms don’t actually
know the meaning of the word in the language it belongs to. In lemmatization, the algorithms do have this knowledge. In
fact, you can even say that these algorithms refer to a dictionary to understand the meaning of the word before reducing it
to its root word, or lemma. So, a lemmatization algorithm would know that the word better is derived from the word good, and hence the lemma is good. A stemming algorithm would not be able to do the same: through over-stemming or under-stemming, better could be reduced to bet or bett, or simply retained as better, but there is no way stemming can reduce better to its root word good. This is the difference between stemming and lemmatization.
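The sketch below contrasts nltk's WordNet-based lemmatizer with the Porter stemmer on the better/good example; it assumes the WordNet data has been downloaded, and the pos argument tells the lemmatizer which part of speech to assume.

import nltk
from nltk.stem import WordNetLemmatizer, PorterStemmer

nltk.download('wordnet')   # WordNet dictionary used by the lemmatizer (cached after the first run)

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

# The pos hint ('a' = adjective, 'v' = verb) tells the lemmatizer how to look the word up
print(lemmatizer.lemmatize("better", pos="a"))   # good
print(lemmatizer.lemmatize("running", pos="v"))  # run

# The stemmer only chops suffixes, so it cannot relate "better" to "good"
print(stemmer.stem("better"))                    # better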
Lemmatization gives more context to chatbot conversations as it recognizes words based on their exact and contextual
meaning. On the other hand, lemmatization is a time-consuming and slow process. The obvious advantage of lemmatization
is that it is more accurate than stemming. So, if you’re dealing with an NLP application such as a chat bot or a virtual
assistant, where understanding the meaning of the dialogue is crucial, lemmatization would be useful. But this accuracy
comes at a cost. Because lemmatization involves deriving the meaning of a word from something like a dictionary, it’s very
time-consuming. So, most lemmatization algorithms are slower compared to their stemming counterparts. There is also a
computation overhead for lemmatization, however, in most machine-learning problems, computational resources are
rarely a cause of concern.
REMOVING STOP-WORDS
The words that are generally filtered out before processing natural language are called stop words. These are actually the most common words in any language (like articles, prepositions, pronouns, conjunctions, etc.) and do not add much information to the text. Examples of a few stop words in English are “the”, “a”, “an”, “so”, and “what”. Stop words are available in abundance in any human language. By removing these words, we remove the low-level information from our text in order to give more focus to the important information. In other words, we can say that the removal of such words usually does not have any negative consequences for the model we train for our task.
Removal of stop words definitely reduces the dataset size and thus reduces the training time due to the fewer number of
tokens involved in the training.
We do not always remove the stop words. The removal of stop words is highly dependent on the task we are performing
and the goal we want to achieve. For example, if we are training a model that can perform the sentiment analysis task, we
might not remove the stop words.
Movie review: “The movie was not good at all.”
Text after removal of stop words: “movie good”
We can clearly see that the review for the movie was negative. However, after the removal of stop words, the review became
positive, which is not the reality. Thus, the removal of stop words can be problematic here.
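A small sketch of stop word removal with nltk's English stop word list, applied to the movie-review example above; note how the negation disappears, which is exactly the pitfall just described.

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')   # cached after the first run
stop_words = set(stopwords.words('english'))

review = "The movie was not good at all."
tokens = [w.strip(".,!?").lower() for w in review.split()]

# Keep only tokens that are not in the stop word list
filtered = [w for w in tokens if w and w not in stop_words]
print(filtered)   # ['movie', 'good'] -- the negation "not" has been removed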
Tasks like text classification do not generally need stop words as the other words present in the dataset are more important
and give the general idea of the text. So, we generally remove stop words in such tasks.
In a nutshell, NLP has a lot of tasks that cannot be accomplished properly after the removal of stop words. So, think before
performing this step. The catch here is that no rule is universal and no stop words list is universal. A list not conveying any
important information to one task can convey a lot of information to the other task.
Word of caution: Before removing stop words, research a bit about your task and the problem you are trying to solve, and
then make your decision.
Several basic NLP techniques are commonly used to extract features from text, including:
• Parsing
• PoS Tagging
• Name Entity Recognition (NER)
• Bag of Words (BoW)
Statistical Methods
These are more advanced feature extraction methods that use concepts from statistics and probability to extract features from text data.
Advanced Methods
These methods can also be called vectorized methods as they aim to map a word, sentence, document to a fixed-length
vector of real numbers. The goal of this method is to extract semantics from a piece of text, both lexical and distributional.
Lexical semantics is just the meaning reflected by the words whereas distributional semantics refers to finding meaning
based on various distributions in a corpus.
• Word2Vec
• GloVe: Global Vector for word representation
Also, at a more granular level, machine learning models work with numerical data rather than textual data. To be more specific, using the bag-of-words (BoW) technique, we convert a text into its equivalent vector of numbers.
Let us see an example of how the bag-of-words technique converts text into vectors.
Example (1) without preprocessing:
Sentence 1: “Welcome to Great Learning, Now start learning”
Sentence 2: “Learning is a good practice”
Sentence 1 tokens: Welcome, to, Great, Learning, “,”, Now, start, learning
Sentence 2 tokens: Learning, is, a, good, practice
Step 1: Go through all the words in the above text and make a list of all of the words in the model vocabulary.
• Welcome
• To
• Great
• Learning
• ,
• Now
• start
• learning
• is
• a
• good
• practice
Note that the words ‘Learning’ and ‘learning’ are not the same here because of the difference in their cases, and hence both appear in the list. Also note that the comma ‘,’ is included in the list. Because we know the vocabulary has 12 words, we can use a fixed-length document representation of 12, with one position in the vector to score each word.
The scoring method we use here is to count the presence of each word, marking 0 for absence. This simple counting method is the one most generally used.
The scoring of sentence 1 would look as follows:
Word Frequency
Welcome 1
to 1
Great 1
Learning 1
, 1
Now 1
start 1
learning 1
is 0
a 0
good 0
practice 0
Writing the above frequencies in vector form:
Sentence 1 ➝ [ 1,1,1,1,1,1,1,1,0,0,0,0 ]
Now for sentence 2, the scoring would look like:
Word Frequency
Welcome 0
to 0
Great 0
Learning 1
, 0
Now 0
start 0
learning 0
is 1
a 1
good 1
practice 1
Similarly, writing the above frequencies in the vector form
Sentence 2 ➝ [ 0,0,0,0,0,0,0,1,1,1,1,1 ]
Word:       Welcome to Great Learning , Now start learning is a good practice
Sentence 1: 1 1 1 1 1 1 1 1 0 0 0 0
Sentence 2: 0 0 0 0 0 0 0 1 1 1 1 1
But is this the best way to perform bag of words? The above example was not the most effective use of the technique: the words Learning and learning, although having the same meaning, are counted separately, and the comma “,”, which does not convey any information, is also included in the vocabulary.
Let us make some changes and see how we can use bag of words in a more effective way.
Step 1: Convert the above sentences to lower case, as the case of a word does not hold any information.
Step 2: Remove special characters and stop words from the text. Stop words are words that do not contain much information about the text, like ‘is’, ‘a’, ‘the’ and many more.
After these two steps, the sentences become “welcome great learning now start learning” and “learning good practice”. Although these sentences do not make much sense, the maximum information is contained in these words only.
Step 3: Go through all the words in the above text and make a list of all of the words in our model vocabulary.
• welcome
• great
• learning
• now
• start
• good
• practice
Now as the vocabulary has only 7 words, we can use a fixed-length document-representation of 7, with one position in the
vector to score each word.
The scoring method we use here is the same as in the previous example. For sentence 1, the count of words is as follows:
Word Frequency
welcome 1
great 1
learning 2
now 1
start 1
good 0
practice 0
Writing the above frequencies in the vector
Sentence 1 ➝ [ 1,1,2,1,1,0,0 ]
Similarly, for sentence 2, the count of words is as follows:
Word Frequency
welcome 0
great 0
learning 1
now 0
start 0
good 1
practice 1
Similarly, writing the above frequencies in the vector form
Sentence 2 ➝ [ 0,0,1,0,0,1,1 ]
Word:       welcome great learning now start good practice
Sentence 1: 1 1 2 1 1 0 0
Sentence 2: 0 0 1 0 0 1 1
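A minimal pure-Python sketch that reproduces the preprocessed bag-of-words example above; the tiny stop word list is an illustrative assumption. It builds the 7-word vocabulary and the two count vectors.

sentences = ["Welcome to Great Learning, Now start learning",
             "Learning is a good practice"]

stop_words = {"to", "is", "a", "the"}          # tiny illustrative stop word list

def preprocess(text):
    # lower-case, drop punctuation, remove stop words
    tokens = [w.strip(",.") for w in text.lower().split()]
    return [w for w in tokens if w and w not in stop_words]

tokenized = [preprocess(s) for s in sentences]

# Vocabulary: the set of unique tokens, in order of first appearance
vocab = []
for tokens in tokenized:
    for t in tokens:
        if t not in vocab:
            vocab.append(t)

# Score each sentence by counting how often each vocabulary word occurs
vectors = [[tokens.count(word) for word in vocab] for tokens in tokenized]

print(vocab)     # ['welcome', 'great', 'learning', 'now', 'start', 'good', 'practice']
print(vectors)   # [[1, 1, 2, 1, 1, 0, 0], [0, 0, 1, 0, 0, 1, 1]]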
The approach used in example two is the one that is generally used in the Bag-of-Words technique, the reason being that
the datasets used in Machine learning are tremendously large and can contain vocabulary of a few thousand or even millions
of words. Hence, preprocessing the text before using bag-of-words is a better way to go. There are various preprocessing steps, such as the ones applied above, that can increase the performance of bag-of-words. In the examples above we used all the words of the vocabulary to form a vector, which is neither a practical nor the best way to implement the BoW model. In practice, only a few words from the vocabulary, preferably the most frequently occurring ones, are used to form the vector.
Limitations of Bag-of-Words
The bag-of-words model is very simple to understand and implement and offers a lot of flexibility for customization on your specific text data. It has been used with great success on prediction problems such as language modeling and document classification. Nevertheless, it has some limitations:
• Vocabulary: The vocabulary requires careful design, most specifically in order to manage the size, which impacts
the sparsity of the document representations.
• Sparsity: Sparse representations are harder to model both for computational reasons (space and time complexity)
and also for information reasons, where the challenge is for the models to harness so little information in such a
large representational space.
• Meaning: Discarding word order ignores the context, and in turn the meaning of words in the document (semantics). Context and meaning can offer a lot to the model; if modeled, they could tell the difference between the same words differently arranged (“this is interesting” vs “is this interesting”), synonyms (“old bike” vs “used bike”), and much more.
A related representation is the bag-of-n-grams model, in which tokens are sequences of n consecutive words rather than single words. For example, let’s use the following phrase and divide it into bi-grams (n=2).
“James is the best person ever.”
becomes
• <start>James
• James is
• is the
• the best
• best person
• person ever.
• ever.<end>
In a typical bag-of-n-grams model, these bigrams would be a sample from the large number of bigrams observed in a corpus, and “James is the best person ever.” would be encoded in a representation showing which of the corpus’s bigrams were observed in the sentence. A bag-of-n-grams model has the simplicity of the bag-of-words model but preserves more word-locality information.
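A short sketch of generating the bigrams shown above, including the <start> and <end> boundary markers.

def ngrams(text, n=2):
    # Add boundary markers, then slide a window of length n over the tokens
    tokens = ["<start>"] + text.split() + ["<end>"]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("James is the best person ever."))
# ['<start> James', 'James is', 'is the', 'the best',
#  'best person', 'person ever.', 'ever. <end>']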
TF-IDF MODEL
TF-IDF stands for Term Frequency–Inverse Document Frequency. It can be defined as a calculation of how relevant a word in a series or corpus is to a text. The importance of a word increases proportionally to the number of times it appears in the text, but this is offset by the frequency of the word across the corpus (data set).
Terminologies:
• Term Frequency: In a document d, the term frequency is the number of instances of a given term t. A term becomes more relevant to a document the more often it appears in it, which is intuitive. Since the ordering of terms is not significant, we can describe the text as a vector in a bag-of-terms model: for each specific term in the document there is an entry whose value is the term frequency.
The weight of a term that occurs in a document is simply proportional to its term frequency.
• Document Frequency: This measures the importance of a term across the whole corpus, in a way very similar to TF. The only difference is that TF is the frequency counter of a term t in a single document d, while DF is the count of occurrences of the term t across the document set N. In other words, DF is the number of documents in which the term is present.
• Inverse Document Frequency: Mainly, it tests how informative a term is. The key aim of a search is to locate the relevant records that fit the query. Since tf considers all terms equally significant, term frequencies alone cannot be used to measure the weight of a term in a document. First, find the document frequency of a term t by counting the number of documents containing the term:
df(t) = number of documents in the collection that contain the term t
Term frequency is the number of instances of a term in a single document only, whereas document frequency is the number of separate documents in which the term appears, so it depends on the entire corpus. Now let us look at the definition of inverse document frequency: the IDF of a term is the total number of documents in the corpus divided by the document frequency of the term. A more common word should be considered less significant, but this raw ratio grows too quickly, so we take the logarithm of it (base 10 in the numerical example below). So the idf of the term t becomes:
idf(t) = log( N / df(t) )
• Computation: TF-IDF is one of the best metrics to determine how significant a term is to a text in a series or a corpus. TF-IDF is a weighting scheme that assigns a weight to each word in a document based on its term frequency (TF) and inverse document frequency (IDF). The words with higher weight scores are deemed to be more significant.
Numerical Example
Imagine the term 𝑡 appears 20 times in a document that contains a total of 100 words. Term Frequency (TF) of 𝑡 can be
calculated as follow:
tf(t, d) = 20 / 100 = 0.2
Assume a collection of related documents contains 10,000 documents. If 100 documents out of 10,000 documents contain
the term 𝑡, Inverse Document Frequency (IDF) of 𝑡 can be calculated as follows
idf(t) = log(10000 / 100) = log(100) = 2
Using these two quantities, we can calculate the TF-IDF score of the term t for the document:
tf-idf(t, d) = tf(t, d) × idf(t) = 0.2 × 2 = 0.4
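A tiny sketch of computing TF-IDF scores, using a base-10 logarithm so that the numerical example above gives 0.4; the toy documents and whitespace tokenization are illustrative assumptions.

import math

docs = [
    "the brown fox is quick".split(),
    "the dog is lazy".split(),
    "the fox jumps over the dog".split(),
]

N = len(docs)

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term):
    df = sum(1 for doc in docs if term in doc)
    return math.log10(N / df)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

# "the" appears in every document, so its idf (and tf-idf) is 0;
# "quick" appears in only one document, so it gets a higher weight
print(tf_idf("the", docs[0]), tf_idf("quick", docs[0]))

# The numerical example above: tf = 20/100, idf = log10(10000/100)
print((20 / 100) * math.log10(10000 / 100))   # 0.4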
**********