Text and Speech Analysis Notes CCS369-UNIT 1
Reshma/AP/CSE
INTRODUCTION
Artificial intelligence (AI) integration has revolutionized various industries, and now it is transforming the realm of human
behavior research. This integration marks a significant milestone in the data collection and analysis endeavors, enabling
users to unlock deeper insights from spoken language and empower researchers and analysts with enhanced capabilities for
understanding and interpreting human communication. Human interactions are a critical part of many organizations. Many
organizations analyze speech or text via natural language processing (NLP) and link them to insights and automation such
as text categorization, text classification, information extraction, etc.
In business intelligence, speech and text analytics enable us to gain insights into customer-agent conversations through sentiment analysis and topic trends. These insights highlight areas of improvement, recognition, and concern, helping organizations better understand and serve customers and employees. Speech and text analytics is a set of features that uses natural language processing (NLP) to provide automated analysis of the content of 100% of interactions: it includes transcribing voice interactions, analysing customer sentiment, spotting topics, and creating meaning from otherwise unstructured data.
FOUNDATIONS OF NATURAL LANGUAGE PROCESSING
Natural Language Processing (NLP) is the process of analyzing and producing meaningful phrases and sentences in natural language. NLP comprises Natural Language Understanding (NLU) and Natural Language Generation (NLG). NLU maps natural language input into useful representations and supports tasks such as information extraction, information retrieval, and sentiment analysis, while NLG generates natural language output from data. NLP can be thought of as an intersection of Linguistics, Computer Science, and Artificial Intelligence that helps computers understand, interpret, and manipulate human language.
Building an NLP system typically involves two broad stages:
1. Data Preprocessing
2. Algorithm Development
In Natural Language Processing, machine learning training algorithms study millions of examples of text — words,
sentences, and paragraphs — written by humans. By studying the samples, the training algorithms gain an understanding of
the “context” of human speech, writing, and other modes of communication. This training helps NLP software to
differentiate between the meanings of various texts. The five phases of NLP involve lexical (structure) analysis, parsing,
semantic analysis, discourse integration, and pragmatic analysis. Some well-known application areas of NLP are Optical
Character Recognition (OCR), Speech Recognition, Machine Translation, and Chatbots.
The first phase of NLP is word structure analysis, which is referred to as lexical or morphological analysis. A lexicon is
defined as a collection of words and phrases in a given language, with the analysis of this collection being the process of
splitting the lexicon into components, based on what the user sets as parameters – paragraphs, phrases, words, or characters.
Similarly, morphological analysis is the process of identifying the morphemes of a word. A morpheme is a basic unit of language construction: a small element of a word that carries meaning. Morphemes can be either free (e.g. walk) or bound (e.g. -ing, -ed); the difference between the two is that the latter cannot stand on its own to produce a word with meaning and must be attached to a free morpheme to carry meaning.
In search engine optimization (SEO), lexical or morphological analysis helps guide web searching. For instance, when doing
on-page analysis, you can perform lexical and morphological analysis to understand how often the target keywords are used
in their core form (as free morphemes, or when in composition with bound morphemes). This type of analysis can ensure
that you have an accurate understanding of the different variations of the morphemes that are used. Morphological analysis can also be applied in transcription and translation projects, and can therefore be very useful in content repurposing, international SEO, and linguistic analysis.
Syntax Analysis is the second phase of natural language processing. Syntax analysis or parsing is the process of checking
grammar, word arrangement, and overall – the identification of relationships between words and whether those make sense.
The process involves examining all words and phrases in a sentence and the structures between them.
As part of the process, a visualisation of the syntactic relationships is built, referred to as a syntax tree (similar to a knowledge graph). This process checks that the structure, order, and grammar of sentences make sense, given the words and phrases that make up those sentences. Syntax analysis also involves tagging words and phrases with POS tags. There are two common approaches to constructing the syntax tree – top-down and bottom-up; both check for valid sentence formation and reject the input otherwise.
Semantic analysis is the third stage in NLP, when an analysis is performed to understand the meaning in a statement. This
type of analysis is focused on uncovering the definitions of words, phrases, and sentences and identifying whether the way
words are organized in a sentence makes sense semantically.
This task is performed by mapping the syntactic structure, and checking for logic in the presented relationships between
entities, words, phrases, and sentences in the text. There are a couple of important functions of semantic analysis, which
allow for natural language understanding:
• To ensure that the data types are used in a way that’s consistent with their definition.
• To ensure that the flow of the text is consistent.
• Identification of synonyms, antonyms, homonyms, and other lexical items.
• Overall word sense disambiguation.
• Relationship extraction from the different entities identified from the text.
There are several things you can utilise semantic analysis for in SEO. Here are some examples:
• Topic modeling and classification – sort your page content into topics (predefined or modelled by an algorithm).
You can then use this for ML-enabled internal linking, where you link pages together on your website using the
identified topics. Topic modeling can also be used for classifying first-party collected data such as customer service
tickets, or feedback users left on your articles or videos in free form (i.e. comments).
• Entity analysis, sentiment analysis, and intent classification – You can use this type of analysis to perform
sentiment analysis and identify intent expressed in the content analysed. Entity identification and sentiment analysis are separate tasks; both can be done on things like keywords, titles, meta descriptions, and page content, but they work best when analysing data like comments, feedback forms, or customer service or social media interactions. Intent classification can be done on user queries (in keyword research or traffic analysis), but can also be done when analysing customer service interactions.
Phase IV: Discourse integration
Discourse integration is the fourth phase in NLP, and simply means contextualisation. Discourse integration is the analysis
and identification of the larger context for any smaller part of natural language structure (e.g. a phrase, word or sentence).
During this phase, it’s important to ensure that each phrase, word, and entity mentioned are mentioned within the appropriate
context. This analysis involves considering not only sentence structure and semantics, but also sentence combination and
meaning of the text as a whole. In other words, when analyzing the structure of a text, sentences are broken up and analyzed individually, but they are also considered in the context of the sentences that precede and follow them and the impact they have on the overall structure of the text. Some common tasks in this phase include information extraction, conversation analysis, text summarisation, and discourse analysis.
Here are some complexities of natural language understanding introduced during this phase:
• Understanding of the expressed motivations within the text, and its underlying meaning.
• Understanding of the relationships between entities and topics mentioned, thematic understanding, and interactions
analysis.
• Understanding the social and historical context of entities mentioned.
Discourse integration and analysis can be used in SEO to ensure that appropriate tense is used, that the relationships
expressed in the text make logical sense, and that there is overall coherency in the text analysed. This can be especially
useful for programmatic SEO initiatives or text generation at scale. The analysis can also be used as part of international
SEO localization, translation, or transcription tasks on big corpuses of data.
There are some research efforts to incorporate discourse analysis into systems that detect hate speech (or in the SEO space
for things like content and comment moderation), with this technology being aimed at uncovering intention behind text by
aligning the expression with meaning derived from other texts. This means that, theoretically, discourse analysis can also be used for modeling user intent (e.g. search intent or purchase intent) and detecting such notions in texts.
Phase V: Pragmatic analysis
Pragmatic analysis is the fifth and final phase of natural language processing. As the final stage, pragmatic analysis
extrapolates and incorporates the learnings from all other, preceding phases of NLP. Pragmatic analysis involves the process
of abstracting or extracting meaning from the use of language, and translating a text, using the gathered knowledge from all
other NLP steps performed beforehand.
Here are some complexities that are introduced during this phase:
• Information extraction, enabling advanced text-understanding functions such as question answering.
• Meaning extraction, which allows for programs to break down definitions or documentation into a more accessible
language.
• Understanding of the meaning of the words, and context, in which they are used, which enables conversational
functions between machine and human (e.g. chatbots).
Pragmatic analysis has multiple applications in SEO. One of the most straightforward ones is programmatic SEO and
automated content generation. This type of analysis can also be used for generating FAQ sections on your product, using
textual analysis of product documentation, or even capitalizing on the ‘People Also Ask’ featured snippets by adding an
automatically-generated FAQ section for each page you produce on your site.
LANGUAGE SYNTAX AND STRUCTURE
For any language, syntax and structure usually go hand in hand: a set of specific rules, conventions, and principles governs the way words are combined into phrases, phrases are combined into clauses, and clauses are combined into sentences. We will be talking specifically about English syntax and structure in this section. In English, words usually combine to form other constituent units. These constituents include words, phrases, clauses, and sentences. Consider the sentence “The brown fox is quick and he is jumping over the lazy dog”: it is made up of a bunch of words, and just looking at the words by themselves does not tell us much.
Part of speech (POS) tags identify the lexical category of each word, and each POS tag, such as the noun (N), can be further subdivided into categories like singular nouns (NN), singular proper nouns (NNP), and plural nouns (NNS).
The process of classifying words and labeling them with POS tags is called parts of speech tagging, or POS tagging. POS tags are used to annotate words and depict their parts of speech, which is really helpful for specific analyses, such as narrowing down on nouns to see which ones are the most prominent, word sense disambiguation, and grammar analysis.
Let us consider both nltk and spacy which usually use the Penn Treebank notation for POS tagging. NLTK and spaCy are
two of the most popular Natural Language Processing (NLP) tools available in Python. You can build chatbots, automatic
summarizers, and entity extraction engines with either of these libraries. While both can theoretically accomplish any NLP
task, each one excels in certain scenarios. The Penn Treebank, or PTB for short, is a dataset maintained by the University
of Pennsylvania.
nltk_pos_tagged = nltk.pos_tag(sentence.split())
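The line above assumes that nltk has been imported and that a sentence variable holds the text to tag. Below is a minimal, self-contained sketch (assuming nltk and spaCy are installed, along with spaCy's en_core_web_sm model, which is an assumption of this example) that tags the example sentence from this unit with both libraries and prints Penn Treebank tags.

import nltk
import spacy

# nltk.download('averaged_perceptron_tagger')  # download the tagger data once if it is missing

sentence = "The brown fox is quick and he is jumping over the lazy dog"

# nltk: split on whitespace and tag with Penn Treebank POS tags
nltk_pos_tagged = nltk.pos_tag(sentence.split())
print(nltk_pos_tagged)   # e.g. [('The', 'DT'), ('brown', 'JJ'), ('fox', 'NN'), ('is', 'VBZ'), ...]

# spaCy: load the small English model and read token.tag_ (also Penn Treebank tags)
nlp = spacy.load("en_core_web_sm")
doc = nlp(sentence)
print([(token.text, token.tag_) for token in doc])

Because both libraries use the Penn Treebank tag set, tags such as DT, JJ, NN, and VBZ should line up between the two outputs.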
Words combine into constituent phrases, each with a head word belonging to a particular lexical category. Besides the noun phrase (NP), which has a noun or pronoun as its head word, common phrase types include the following:
• Verb phrase (VP): These phrases are lexical units that have a verb acting as the head word. Usually, there are two forms of verb phrases: one consists of just the verb components, while the other also contains other entities such as nouns, adjectives, or adverbs as parts of the object.
• Adjective phrase (ADJP): These are phrases with an adjective as the head word. Their main role is to describe or
qualify nouns and pronouns in a sentence, and they will be either placed before or after the noun or pronoun.
• Adverb phrase (ADVP): These phrases act like adverbs since the adverb acts as the head word in the phrase.
Adverb phrases are used as modifiers for nouns, verbs, or adverbs themselves by providing further details that
describe or qualify them.
• Prepositional phrase (PP): These phrases usually contain a preposition as the head word and other lexical
components like nouns, pronouns, and so on. These act like an adjective or adverb describing other words or phrases.
Shallow parsing, also known as light parsing or chunking, is a popular natural language processing technique for analyzing the structure of a sentence: the sentence is broken down into its smallest constituents (tokens such as words), which are then grouped together into higher-level phrases. The output includes the POS tags as well as the phrases (chunks) of a sentence.
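As a rough illustration of shallow parsing, the sketch below chunks the example sentence into noun-phrase groups using nltk's RegexpParser; the one-rule chunk grammar is an illustrative assumption, not a complete grammar.

import nltk

# nltk.download('averaged_perceptron_tagger')  # tagger data, if not already present

sentence = "The brown fox is quick and he is jumping over the lazy dog"
tagged = nltk.pos_tag(sentence.split())

# One illustrative chunk rule: a noun phrase (NP) is an optional determiner,
# any number of adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN.*>}"
chunker = nltk.RegexpParser(grammar)

# The result is a shallow tree: NP chunks plus the remaining POS-tagged tokens
chunk_tree = chunker.parse(tagged)
print(chunk_tree)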
Constituency Parsing
Constituent-based grammars are used to analyze and determine the constituents of a sentence. These grammars can be used
to model or represent the internal structure of sentences in terms of a hierarchically ordered structure of their constituents.
Each and every word usually belongs to a specific lexical category and forms the head word of different phrases.
These phrases are formed based on rules called phrase structure rules.
Phrase structure rules form the core of constituency grammars, because they talk about syntax and rules that govern the
hierarchy and ordering of the various constituents in the sentences. These rules cater to two things primarily.
• They determine what words are used to construct the phrases or constituents.
• They determine how we need to order these constituents together.
The generic representation of a phrase structure rule is S → AB , which depicts that the structure S consists of
constituents A and B , and the ordering is A followed by B . While there are several rules (refer to Chapter 1, Page 19: Text
Analytics with Python, if you want to dive deeper), the most important rule describes how to divide a sentence or a clause.
The phrase structure rule denotes a binary division for a sentence or a clause as S → NP VP where S is the sentence or
clause, and it is divided into the subject, denoted by the noun phrase (NP) and the predicate, denoted by the verb phrase
(VP).
A constituency parser can be built based on such grammars/rules, which are usually collectively available as a context-free grammar (CFG) or phrase structure grammar. The parser processes input sentences according to these rules and helps build a parse tree.
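A minimal sketch of a constituency parser built from phrase structure rules, using nltk's CFG and chart parser. The toy grammar below is an assumption written only for the clause "the brown fox is quick"; it is not a grammar of English.

import nltk

# A toy set of phrase structure rules, including the key rule S -> NP VP
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det Adj N
VP -> V Adj
Det -> 'the'
Adj -> 'brown' | 'quick'
N -> 'fox'
V -> 'is'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse(['the', 'brown', 'fox', 'is', 'quick']):
    print(tree)   # (S (NP (Det the) (Adj brown) (N fox)) (VP (V is) (Adj quick)))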
Dependency Parsing
In dependency parsing, we try to use dependency-based grammars to analyze and infer both structure and semantic
dependencies and relationships between tokens in a sentence. The basic principle behind a dependency grammar is that in
any sentence in the language, all words except one, have some relationship or dependency on other words in the sentence.
The word that has no dependency is called the root of the sentence. The verb is taken as the root of the sentence in most
cases. All the other words are directly or indirectly linked to the root verb using links, which are the dependencies.
Considering the sentence “The brown fox is quick and he is jumping over the lazy dog”, if we wanted to draw the dependency syntax tree for it, we would obtain a structure with the root verb at the top and every other word attached to it, directly or indirectly, through labelled dependency links.
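A small sketch of dependency parsing with spaCy (assuming the en_core_web_sm model is installed, which is an assumption of this example): each token is printed with its dependency label and its head, and the root token, which points to itself, is identified.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The brown fox is quick and he is jumping over the lazy dog")

# Every token is linked to a head token with a dependency label;
# the root of the sentence is the token that is its own head
for token in doc:
    print(f"{token.text:<8} --{token.dep_}--> {token.head.text}")

print("root:", [token.text for token in doc if token.head == token])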
Stemming is a lot faster than lemmatization; hence, whether to stem or to lemmatize depends on whether the application needs quick pre-processing or more accurate base forms.
TOKENIZATION
Tokenization is a common task in Natural Language Processing (NLP). It’s a fundamental step in both traditional NLP
methods like Count Vectorizer and Advanced Deep Learning-based architectures like Transformers. As tokens are the
building blocks of Natural Language, the most common way of processing the raw text happens at the token level.
Tokenization is a way of separating a piece of text into smaller units called tokens. Here, tokens can be either words,
characters, or subwords. Hence, tokenization can be broadly classified into 3 types – word, character, and subword (n-gram
characters) tokenization.
The most common way of forming tokens is based on space. For example, take the sentence “Never give up”. Assuming space as a delimiter, tokenization results in 3 tokens: Never, give, up. As each token is a word, this becomes an example of word tokenization.
Similarly, tokens can be either characters or subwords. For example, consider the word “smarter”: character tokenization yields s-m-a-r-t-e-r, while subword tokenization might yield smart-er.
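The sketch below illustrates the three tokenization levels on the examples above; the subword split smart + er is hand-picked for illustration, since real subword tokenizers learn their pieces from a corpus.

# Word tokenization: split the sentence on whitespace
sentence = "Never give up"
word_tokens = sentence.split()          # ['Never', 'give', 'up']

# Character tokenization: every character of a word becomes a token
char_tokens = list("smarter")           # ['s', 'm', 'a', 'r', 't', 'e', 'r']

# Subword tokenization: split a word into frequently occurring pieces.
# The split point below is hand-picked; real subword tokenizers
# (such as BPE, described later in this unit) learn the pieces from a corpus.
subword_tokens = ["smart", "er"]

print(word_tokens, char_tokens, subword_tokens)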
Here, tokenization is performed on the corpus to obtain tokens, and these tokens are then used to prepare a vocabulary.
Vocabulary refers to the set of unique tokens in the corpus. Remember that vocabulary can be constructed by considering
each unique token in the corpus or by considering the top K Frequently Occurring Words.
Creating Vocabulary is the ultimate goal of Tokenization.
One of the simplest hacks to boost the performance of the NLP model is to create a vocabulary out of top K frequently
occurring words.
Now, let’s understand the usage of the vocabulary in Traditional and Advanced Deep Learning-based NLP methods.
• Traditional NLP approaches such as Count Vectorizer and TF-IDF use the vocabulary as features: each word in the vocabulary is treated as a unique feature.
• In advanced Deep Learning-based NLP architectures, the vocabulary is used to create the tokenized input sentences; the tokens of these sentences are then passed as inputs to the model.
As discussed earlier, tokenization can be performed on word, character, or subword level. It’s a common question – which
Tokenization should we use while solving an NLP task? Let’s address this question here.
Word Tokenization
Word tokenization is the most commonly used tokenization algorithm. It splits a piece of text into individual words based on a certain delimiter; depending upon the delimiter, different word-level tokens are formed. Pretrained word embeddings such as Word2Vec and GloVe operate on word-level tokens.
A popular subword tokenization algorithm is Byte Pair Encoding (BPE), which builds the vocabulary iteratively:
1. Split the words in the corpus into characters after appending </w>
2. Initialize the vocabulary with unique characters in the corpus
3. Compute the frequency of a pair of characters or character sequences in corpus
4. Merge the most frequent pair in corpus
5. Save the best pair to the vocabulary
6. Repeat steps 3 to 5 for a certain number of iterations
1a) Append the end-of-word symbol (say </w>) to every word in the corpus.
Each iteration then repeats steps 3–5: compute the frequency of every adjacent character pair (or character sequence) in the corpus, merge the most frequent pair, and save the merged pair to the vocabulary. (The step-by-step frequency tables for Iteration 1 and Iteration 2 are not reproduced here.)
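A minimal sketch of the BPE merge loop described in steps 1–6 above, using a small toy corpus whose word frequencies are illustrative assumptions: it repeatedly counts adjacent symbol pairs, merges the most frequent pair, and records the merge.

from collections import Counter

# Toy corpus: word -> frequency (illustrative assumption)
corpus = {"low</w>": 5, "lower</w>": 2, "newest</w>": 6, "widest</w>": 3}

# Steps 1-2: represent every word as a sequence of characters plus </w>
vocab = {tuple(word.replace("</w>", "")) + ("</w>",): freq for word, freq in corpus.items()}

def most_frequent_pair(vocab):
    """Step 3: count the frequency of every adjacent symbol pair."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(pair, vocab):
    """Steps 4-5: merge the best pair everywhere it occurs."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Step 6: repeat for a certain number of iterations (5 here, an arbitrary choice)
for _ in range(5):
    best = most_frequent_pair(vocab)
    vocab = merge_pair(best, vocab)
    print("merged:", best)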
STEMMING
Stemming is the process of reducing the morphological variants of a word to a common root/base word (the stem). Stemming programs are commonly referred to as stemming algorithms or stemmers. A stemming algorithm reduces the words “chocolates”, “chocolatey”, and “choco” to the root word “chocolate”, and “retrieval”, “retrieved”, “retrieves” to the stem “retrieve”. Stemming is
an important part of the pipelining process in Natural language processing. The input to the stemmer is tokenized words.
How do we get these tokenized words? Well, tokenization involves breaking down the document into different words.
Stemming is a natural language processing technique that is used to reduce words to their base form, also known as the root
form. The process of stemming is used to normalize text and make it easier to process. It is an important step in text pre-
processing, and it is commonly used in information retrieval and text mining applications. There are several different
algorithms for stemming as follows:
• Porter stemmer
• Snowball stemmer
• Lancaster stemmer.
The Porter stemmer is the most widely used algorithm, and it is based on a set of heuristics that are used to remove common
suffixes from words. The Snowball stemmer is a more advanced algorithm that is based on the Porter stemmer, but it also
supports several other languages in addition to English. The Lancaster stemmer is a more aggressive stemmer and it is less
accurate than the Porter stemmer and Snowball stemmer.
Stemming can be useful for several natural language processing tasks such as text classification, information retrieval, and
text summarization. However, stemming can also have some negative effects such as reducing the readability of the text,
and it may not always produce the correct root form of a word. It is important to note that stemming is different from
Lemmatization. Lemmatization is the process of reducing a word to its base form, but unlike stemming, it takes into account
the context of the word, and it produces a valid word, unlike stemming which can produce a non-word as the root form.
Some more examples of words that stem to the root word "like" include:
->"likes"
->"liked"
->"likely"
->"liking"
Errors in Stemming:
There are mainly two errors in stemming –
• over-stemming
• under-stemming
Over-stemming occurs when two words with different meanings or different true stems are reduced to the same root; it can be regarded as a false positive. Over-stemming is a problem that can occur when using stemming algorithms in natural
language processing. It refers to the situation where a stemmer produces a root form that is not a valid word or is not the
correct root form of a word. This can happen when the stemmer is too aggressive in removing suffixes or when it does not
consider the context of the word.
Over-stemming can lead to a loss of meaning and make the text less readable. For example, the word “arguing” may be
stemmed to “argu,” which is not a valid word and does not convey the same meaning as the original word. Similarly, the
word “running” may be stemmed to “run,” which is the base form of the word but it does not convey the meaning of the
original word.
To avoid over-stemming, it is important to use a stemmer that is appropriate for the task and language. It is also important
to test the stemmer on a sample of text to ensure that it is producing valid root forms. In some cases, using a lemmatizer
instead of a stemmer may be a better solution as it takes into account the context of the word, making it less prone to errors.
Another approach to this problem is to use techniques like semantic role labeling, sentiment analysis, context-based
information, etc. that help to understand the context of the text and make the stemming process more precise.
Under-stemming occurs when two words that should be reduced to the same root end up with different stems; it can be interpreted as a false negative. Under-stemming is a problem that can occur when using stemming algorithms in
natural language processing. It refers to the situation where a stemmer does not produce the correct root form of a word or
does not reduce a word to its base form. This can happen when the stemmer is not aggressive enough in removing suffixes
or when it is not designed for the specific task or language.
Under-stemming can lead to a loss of information and make it more difficult to analyze text. For example, “arguing” may be reduced to “argu” while “argument” is left unchanged, so two closely related words are not mapped to the same stem. Similarly, “running” may be reduced to “run” while “runner” is left as “runner”, even though both relate to the same base concept.
To avoid under-stemming, it is important to use a stemmer that is appropriate for the task and language. It is also important
to test the stemmer on a sample of text to ensure that it is producing the correct root forms. In some cases, using a lemmatizer
instead of a stemmer may be a better solution as it takes into account the context of the word, making it less prone to errors.
Another approach to this problem is to use techniques like semantic role labeling, sentiment analysis, context-based
information, etc. that help to understand the context of the text and make the stemming process more precise.
Applications of stemming:
Stemming is used in information retrieval systems such as search engines and to determine domain vocabularies in domain analysis. By indexing stemmed forms, search results can still be retrieved as documents evolve, and documents can be mapped to common subjects. Sentiment analysis, which examines reviews and comments made by different users, is frequently used for product analysis, such as for online retail stores; stemming is applied as a text-preparation step before the text is interpreted.
Document clustering (also known as text clustering) is a method of group analysis applied to textual materials. Important uses of it include topic extraction, automatic document organization, and quick information retrieval.
Fun Fact: Google search adopted word stemming in 2003. Previously a search for “fish” would not have returned “fishing” or “fishes”.
Some Stemming algorithms are:
Porter’s Stemmer algorithm
It is one of the most popular stemming methods proposed in 1980. It is based on the idea that the suffixes in the English
language are made up of a combination of smaller and simpler suffixes. This stemmer is known for its speed and simplicity.
The main applications of the Porter stemmer include data mining and information retrieval; however, it is limited to English words. Also, a group of related words is mapped onto the same stem, and the output stem is not necessarily a meaningful word. The algorithm is fairly lengthy and is one of the oldest stemmers.
Example: EED -> EE means “if the word has at least one vowel and consonant plus EED ending, change the ending
to EE” as ‘agreed’ becomes ‘agree’.
Advantage: It produces the best output compared to other stemmers and has a lower error rate.
Limitation: Morphological variants produced are not always real words.
Lovins Stemmer
Proposed by Lovins in 1968, it removes the longest suffix from a word; the stem is then recoded to convert it into a valid word.
Example: sitting -> sitt -> sit
Advantage: It is fast and handles irregular plurals like 'teeth' and 'tooth' etc.
Limitation: It is time consuming and frequently fails to form valid words from the stem.
Dawson Stemmer
It is an extension of Lovins stemmer in which suffixes are stored in the reversed order indexed by their length and last letter.
Advantage: It is fast in execution and covers more suffixes.
Limitation: It is very complex to implement.
Krovetz Stemmer
It was proposed in 1993 by Robert Krovetz. Following are the steps:
1) Convert the plural form of a word to its singular form.
2) Convert the past tense of a word to its present tense and remove the suffix ‘ing’.
Example: ‘children’ -> ‘child’
Advantage: It is light in nature and can be used as pre-stemmer for other stemmers.
Limitation: It is inefficient in case of large documents.
Xerox Stemmer
It uses a lexical database to map inflected and irregular word forms to valid base words, as the examples below show.
Example:
‘children’ -> ‘child’
‘understood’ -> ‘understand’
‘whom’ -> ‘who’
‘best’ -> ‘good’
N-Gram Stemmer
An n-gram is a set of n consecutive characters extracted from a word in which similar words will have a high proportion of
n-grams in common.
Example: ‘INTRODUCTIONS’ for n=2 becomes : *I, IN, NT, TR, RO, OD, DU, UC, CT, TI, IO, ON, NS, S*
Advantage: It is based on simple string comparisons and is largely language independent.
Limitation: It requires space to create and index the n-grams and it is not time efficient.
Snowball Stemmer:
When compared to the Porter stemmer, the Snowball stemmer can also map non-English words; since it supports other languages, it can be called a multi-lingual stemmer. The Snowball stemmers are also imported from the nltk package. This stemmer is based on a string-processing language called ‘Snowball’ and is one of the most widely used stemmers. The Snowball stemmer is more aggressive than the Porter stemmer and is also referred to as the Porter2 stemmer. Because of the improvements added over the Porter stemmer, the Snowball stemmer has greater computational speed.
Lancaster Stemmer:
The Lancaster stemmer is more aggressive and dynamic compared to the other two stemmers. It is faster, but the algorithm can be confusing when dealing with small words, and it is not as efficient as the Snowball stemmer. The Lancaster stemmer saves its rules externally and basically uses an iterative algorithm. It is straightforward, although it often produces results with excessive stemming; over-stemming renders stems non-linguistic or meaningless.
LEMMATIZATION
Lemmatization is a text pre-processing technique used in natural language processing (NLP) models to reduce a word to its root meaning so that similar words can be identified. For example, a lemmatization algorithm would reduce the word better to its root word, or lemma, good. In stemming, a part of the word is simply chopped off at the tail end to arrive at the stem of the word.
There are different algorithms used to find out how many characters have to be chopped off, but the algorithms don’t actually
know the meaning of the word in the language it belongs to. In lemmatization, the algorithms do have this knowledge. In
fact, you can even say that these algorithms refer to a dictionary to understand the meaning of the word before reducing it
to its root word, or lemma. So, a lemmatization algorithm would know that the word better is derived from the word good, and hence the lemma is good. A stemming algorithm would not be able to do the same: through over-stemming or under-stemming, better could be reduced to bet or bett, or simply retained as better, but there is no way stemming can reduce better to its root word good. This is the difference between stemming and lemmatization.
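The sketch below contrasts nltk's WordNet-based lemmatizer with the Porter stemmer on the better/good example; it assumes the WordNet data has been downloaded, and the pos argument tells the lemmatizer which part of speech to assume.

import nltk
from nltk.stem import WordNetLemmatizer, PorterStemmer

nltk.download('wordnet')   # WordNet dictionary used by the lemmatizer (cached after the first run)

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

# The pos hint ('a' = adjective, 'v' = verb) tells the lemmatizer how to look the word up
print(lemmatizer.lemmatize("better", pos="a"))   # good
print(lemmatizer.lemmatize("running", pos="v"))  # run

# The stemmer only chops suffixes, so it cannot relate "better" to "good"
print(stemmer.stem("better"))                    # better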
Lemmatization gives more context to chatbot conversations as it recognizes words based on their exact and contextual
meaning. On the other hand, lemmatization is a time-consuming and slow process. The obvious advantage of lemmatization
is that it is more accurate than stemming. So, if you’re dealing with an NLP application such as a chat bot or a virtual
assistant, where understanding the meaning of the dialogue is crucial, lemmatization would be useful. But this accuracy
comes at a cost. Because lemmatization involves deriving the meaning of a word from something like a dictionary, it’s very
time-consuming. So, most lemmatization algorithms are slower compared to their stemming counterparts. There is also a
computation overhead for lemmatization, however, in most machine-learning problems, computational resources are
rarely a cause of concern.
REMOVING STOP-WORDS
The words that are generally filtered out before processing natural language are called stop words. These are actually the most common words in any language (like articles, prepositions, pronouns, conjunctions, etc.) and do not add much information to the text. Examples of a few stop words in English are “the”, “a”, “an”, “so”, and “what”. Stop words are available in abundance in any human language. By removing these words, we remove the low-level information from our text in order to give more focus to the important information. In other words, we can say that the removal of such words usually does not have any negative consequences for the model we train for our task.
Removal of stop words definitely reduces the dataset size and thus reduces the training time due to the fewer number of
tokens involved in the training.
We do not always remove the stop words. The removal of stop words is highly dependent on the task we are performing
and the goal we want to achieve. For example, if we are training a model that can perform the sentiment analysis task, we
might not remove the stop words.
Movie review: “The movie was not good at all.”
Text after removal of stop words: “movie good”
We can clearly see that the review for the movie was negative. However, after the removal of stop words, the review became
positive, which is not the reality. Thus, the removal of stop words can be problematic here.
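A small sketch of stop word removal with nltk's English stop word list, applied to the movie-review example above; note how the negation disappears, which is exactly the pitfall just described.

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')   # cached after the first run
stop_words = set(stopwords.words('english'))

review = "The movie was not good at all."
tokens = [w.strip(".,!?").lower() for w in review.split()]

# Keep only tokens that are not in the stop word list
filtered = [w for w in tokens if w and w not in stop_words]
print(filtered)   # ['movie', 'good'] -- the negation "not" has been removed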
Tasks like text classification do not generally need stop words as the other words present in the dataset are more important
and give the general idea of the text. So, we generally remove stop words in such tasks.
In a nutshell, NLP has a lot of tasks that cannot be accomplished properly after the removal of stop words. So, think before
performing this step. The catch here is that no rule is universal and no stop words list is universal. A list not conveying any
important information to one task can convey a lot of information to the other task.
Word of caution: Before removing stop words, research a bit about your task and the problem you are trying to solve, and
then make your decision.
Several basic NLP techniques are commonly used to extract features from text, including:
• Parsing
• PoS Tagging
• Name Entity Recognition (NER)
• Bag of Words (BoW)
Statistical Methods
These are more advanced feature extraction methods that use concepts from statistics and probability to extract features from text data.
Advanced Methods
These methods can also be called vectorized methods as they aim to map a word, sentence, document to a fixed-length
vector of real numbers. The goal of this method is to extract semantics from a piece of text, both lexical and distributional.
Lexical semantics is just the meaning reflected by the words whereas distributional semantics refers to finding meaning
based on various distributions in a corpus.
• Word2Vec
• GloVe: Global Vector for word representation
Also, at a more granular level, machine learning models work with numerical data rather than textual data. To be more specific, using the bag-of-words (BoW) technique, we convert a text into its equivalent vector of numbers.
Let us see an example of how the bag-of-words technique converts text into vectors.
Example (1) without preprocessing:
Sentence 1: “Welcome to Great Learning, Now start learning”
Sentence 2: “Learning is a good practice”
Sentence 1 tokens: Welcome, to, Great, Learning, “,”, Now, start, learning
Sentence 2 tokens: Learning, is, a, good, practice
Step 1: Go through all the words in the above text and make a list of all of the words in the model vocabulary.
• Welcome
• To
• Great
• Learning
• ,
• Now
• start
• learning
• is
• a
• good
• practice
Note that the words ‘Learning’ and ‘learning’ are not the same here because of the difference in their cases, and hence both appear in the list. Also note that the comma ‘,’ is included in the list. Because we know the vocabulary has 12 words, we can use a fixed-length document representation of 12, with one position in the vector to score each word.
The scoring method we use here is to count the presence of each word, marking 0 for absence. This simple counting method is the one most generally used.
The scoring of sentence 1 would look as follows:
Word Frequency
Welcome 1
to 1
Great 1
Learning 1
, 1
Now 1
start 1
learning 1
is 0
a 0
good 0
practice 0
Writing the above frequencies in vector form:
Sentence 1 ➝ [ 1,1,1,1,1,1,1,1,0,0,0,0 ]
Now for sentence 2, the scoring would look like:
Word Frequency
Welcome 0
to 0
Great 0
Learning 1
, 0
Now 0
start 0
learning 0
is 1
a 1
good 1
practice 1
Similarly, writing the above frequencies in the vector form
Sentence 2 ➝ [ 0,0,0,0,0,0,0,1,1,1,1,1 ]
Word:       Welcome to Great Learning , Now start learning is a good practice
Sentence 1: 1 1 1 1 1 1 1 1 0 0 0 0
Sentence 2: 0 0 0 0 0 0 0 1 1 1 1 1
But is this the best way to perform bag of words? The above example was not the most effective use of the technique: the words Learning and learning, although having the same meaning, are counted separately, and the comma “,”, which does not convey any information, is also included in the vocabulary.
Let us make some changes and see how we can use bag of words in a more effective way.
Step 1: Convert the above sentences to lower case, as the case of a word does not hold any information.
Step 2: Remove special characters and stop words from the text. Stop words are words that do not contain much information about the text, like ‘is’, ‘a’, ‘the’ and many more.
After these two steps, the sentences become “welcome great learning now start learning” and “learning good practice”. Although these sentences do not make much sense, the maximum information is contained in these words only.
Step 3: Go through all the words in the above text and make a list of all of the words in our model vocabulary.
• welcome
• great
• learning
• now
• start
• good
• practice
Now as the vocabulary has only 7 words, we can use a fixed-length document-representation of 7, with one position in the
vector to score each word.
The scoring method we use here is the same as in the previous example. For sentence 1, the count of words is as follows:
Word Frequency
welcome 1
great 1
learning 2
now 1
start 1
good 0
practice 0
Writing the above frequencies in the vector
Sentence 1 ➝ [ 1,1,2,1,1,0,0 ]
Similarly, for sentence 2, the count of words is as follows:
Word Frequency
welcome 0
great 0
learning 1
now 0
start 0
good 1
practice 1
Similarly, writing the above frequencies in the vector form
Sentence 2 ➝ [ 0,0,1,0,0,1,1 ]
Word:       welcome great learning now start good practice
Sentence 1: 1 1 2 1 1 0 0
Sentence 2: 0 0 1 0 0 1 1
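A minimal pure-Python sketch that reproduces the preprocessed bag-of-words example above; the tiny stop word list is an illustrative assumption. It builds the 7-word vocabulary and the two count vectors.

sentences = ["Welcome to Great Learning, Now start learning",
             "Learning is a good practice"]

stop_words = {"to", "is", "a", "the"}          # tiny illustrative stop word list

def preprocess(text):
    # lower-case, drop punctuation, remove stop words
    tokens = [w.strip(",.") for w in text.lower().split()]
    return [w for w in tokens if w and w not in stop_words]

tokenized = [preprocess(s) for s in sentences]

# Vocabulary: the set of unique tokens, in order of first appearance
vocab = []
for tokens in tokenized:
    for t in tokens:
        if t not in vocab:
            vocab.append(t)

# Score each sentence by counting how often each vocabulary word occurs
vectors = [[tokens.count(word) for word in vocab] for tokens in tokenized]

print(vocab)     # ['welcome', 'great', 'learning', 'now', 'start', 'good', 'practice']
print(vectors)   # [[1, 1, 2, 1, 1, 0, 0], [0, 0, 1, 0, 0, 1, 1]]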
The approach used in example two is the one that is generally used in the Bag-of-Words technique, the reason being that
the datasets used in Machine learning are tremendously large and can contain vocabulary of a few thousand or even millions
of words. Hence, preprocessing the text before using bag-of-words is a better way to go. There are various preprocessing steps, such as the ones applied above, that can increase the performance of bag-of-words. In the examples above we used all the words of the vocabulary to form a vector, which is neither a practical nor the best way to implement the BoW model. In practice, only a few words from the vocabulary, preferably the most frequently occurring ones, are used to form the vector.
Limitations of Bag-of-Words
The bag-of-words model is very simple to understand and implement and offers a lot of flexibility for customization on your specific text data. It has been used with great success on prediction problems such as language modeling and document classification. Nevertheless, it has some limitations:
• Vocabulary: The vocabulary requires careful design, most specifically in order to manage the size, which impacts
the sparsity of the document representations.
• Sparsity: Sparse representations are harder to model both for computational reasons (space and time complexity)
and also for information reasons, where the challenge is for the models to harness so little information in such a
large representational space.
• Meaning: Discarding word order ignores the context, and in turn the meaning of words in the document (semantics). Context and meaning can offer a lot to the model; if modeled, they could tell the difference between the same words differently arranged (“this is interesting” vs “is this interesting”), synonyms (“old bike” vs “used bike”), and much more.
A related representation is the bag-of-n-grams model, in which tokens are sequences of n consecutive words rather than single words. For example, let’s use the following phrase and divide it into bi-grams (n=2).
“James is the best person ever.”
becomes
• <start>James
• James is
• is the
• the best
• best person
• person ever.
• ever.<end>
In a typical bag-of-n-grams model, these bigrams would be a sample from the large number of bigrams observed in a corpus, and “James is the best person ever.” would be encoded in a representation showing which of the corpus’s bigrams were observed in the sentence. A bag-of-n-grams model has the simplicity of the bag-of-words model but preserves more word-locality information.
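A short sketch of generating the bigrams shown above, including the <start> and <end> boundary markers.

def ngrams(text, n=2):
    # Add boundary markers, then slide a window of length n over the tokens
    tokens = ["<start>"] + text.split() + ["<end>"]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("James is the best person ever."))
# ['<start> James', 'James is', 'is the', 'the best',
#  'best person', 'person ever.', 'ever. <end>']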
TF-IDF MODEL
TF-IDF stands for Term Frequency–Inverse Document Frequency. It can be defined as a calculation of how relevant a word in a series or corpus is to a text. The importance of a word increases proportionally to the number of times it appears in the text, but this is offset by the frequency of the word across the corpus (data set).
Terminologies:
• Term Frequency: In a document d, the term frequency is the number of instances of a given term t. A term becomes more relevant to a document the more often it appears in it, which is intuitive. Since the ordering of terms is not significant, we can describe the text as a vector in a bag-of-terms model: for each specific term in the document there is an entry whose value is the term frequency.
The weight of a term that occurs in a document is simply proportional to its term frequency.
• Document Frequency: This measures the importance of a term across the whole corpus, in a way very similar to TF. The only difference is that TF is the frequency counter of a term t in a single document d, while DF is the count of occurrences of the term t across the document set N. In other words, DF is the number of documents in which the term is present.
• Inverse Document Frequency: Mainly, it tests how informative a term is. The key aim of a search is to locate the relevant records that fit the query. Since tf considers all terms equally significant, term frequencies alone cannot be used to measure the weight of a term in a document. First, find the document frequency of a term t by counting the number of documents containing the term:
df(t) = number of documents in the collection that contain the term t
Term frequency is the number of instances of a term in a single document only, whereas document frequency is the number of separate documents in which the term appears, so it depends on the entire corpus. Now let us look at the definition of inverse document frequency: the IDF of a term is the total number of documents in the corpus divided by the document frequency of the term. A more common word should be considered less significant, but this raw ratio grows too quickly, so we take the logarithm of it (base 10 in the numerical example below). So the idf of the term t becomes:
idf(t) = log( N / df(t) )
• Computation: TF-IDF is one of the best metrics to determine how significant a term is to a text in a series or a corpus. TF-IDF is a weighting scheme that assigns a weight to each word in a document based on its term frequency (TF) and inverse document frequency (IDF). The words with higher weight scores are deemed to be more significant.
Numerical Example
Imagine the term 𝑡 appears 20 times in a document that contains a total of 100 words. Term Frequency (TF) of 𝑡 can be
calculated as follow:
tf(t, d) = 20 / 100 = 0.2
Assume a collection of related documents contains 10,000 documents. If 100 documents out of 10,000 documents contain
the term 𝑡, Inverse Document Frequency (IDF) of 𝑡 can be calculated as follows
idf(t) = log(10000 / 100) = log(100) = 2
Using these two quantities, we can calculate the TF-IDF score of the term t for the document:
tf-idf(t, d) = tf(t, d) × idf(t) = 0.2 × 2 = 0.4
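A tiny sketch of computing TF-IDF scores, using a base-10 logarithm so that the numerical example above gives 0.4; the toy documents and whitespace tokenization are illustrative assumptions.

import math

docs = [
    "the brown fox is quick".split(),
    "the dog is lazy".split(),
    "the fox jumps over the dog".split(),
]

N = len(docs)

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term):
    df = sum(1 for doc in docs if term in doc)
    return math.log10(N / df)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

# "the" appears in every document, so its idf (and tf-idf) is 0;
# "quick" appears in only one document, so it gets a higher weight
print(tf_idf("the", docs[0]), tf_idf("quick", docs[0]))

# The numerical example above: tf = 20/100, idf = log10(10000/100)
print((20 / 100) * math.log10(10000 / 100))   # 0.4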
**********