
UNIT-2

Natural Language Toolkit (NLTK)


NLTK, or Natural Language Toolkit, is a Python package that you can use for NLP. A lot of
the data that you could be analyzing is unstructured data and contains human-readable text.
Before you can analyze that data programmatically, you first need to preprocess it.
Installing NLTK
We can install NLTK on various OS as follows –
On Windows
In order to install NLTK on Windows OS, follow the steps below –
• First, open the Windows command prompt and navigate to the location of
the pip folder.
• Next, enter the following command to install NLTK –
pip3 install nltk
Now, open the Python shell from the Windows Start Menu and type the following command in order to verify NLTK’s installation –
import nltk
If you get no error, you have successfully installed NLTK on your Windows OS with Python 3.
NLTK is a toolkit built for working with NLP in Python. It provides us with various text processing libraries along with a large number of sample datasets (corpora). A variety of tasks can be performed using NLTK, such as tokenizing, parse tree visualization, etc.
Advantages of NLTK:
One of the most popular and widely used tools for text mining is NLTK, or Natural Language
Toolkit. NLTK is a Python library that provides a rich set of modules and resources for NLP,
such as tokenizers, parsers, stemmers, taggers, corpora, and models.

Size of NLTK:

It consists of about 30 compressed files requiring about 100 MB of disk space. The full collection of data (i.e., everything in the downloader) is nearly ten times this size (at the time of writing) and continues to expand. Once the data is downloaded to your machine, you can load some of it using the Python interpreter.
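For example (a minimal sketch; the Brown corpus is used here purely as an illustration), the downloader and a corpus can be used from the interpreter as follows –

import nltk

# Download one data package (the Brown corpus is used here as an example)
nltk.download('brown')

# Once downloaded, the corpus can be loaded and inspected
from nltk.corpus import brown
print(brown.words()[:10])    # first ten words of the corpus
print(len(brown.sents()))    # number of sentences in the corpus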

Tokenization in NLP
Tokenization is the start of the NLP process, converting sentences into understandable bits of
data that a program can work with. Without a strong foundation built through tokenization,
the NLP process can quickly devolve into a messy telephone game.
NLTK tokenization is used to split a large amount of textual data into parts so that the character of the text can be analysed. Tokenization with NLTK is useful when preparing data for training machine learning models and for NLP text cleaning.
In Python, tokenization basically refers to splitting up a larger body of text into smaller lines or words, or even creating tokens for a non-English language. The various tokenization functions are built into the nltk module itself and can be used in programs as shown below.

Sentence (Line) Tokenization:

import nltk

sentence_data = "The First sentence is about Python. The Second: about Django. You can learn Python, Django and Data Analysis here."
nltk_tokens = nltk.sent_tokenize(sentence_data)
print(nltk_tokens)

O/P:
['The First sentence is about Python.', 'The Second: about Django.', 'You can learn Python, Django and Data Analysis here.']

Word Tokenization

import nltk

word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"
nltk_tokens = nltk.word_tokenize(word_data)
print(nltk_tokens)

O/P: ['It', 'originated', 'from', 'the', 'idea', 'that', 'there', 'are', 'readers', 'who', 'prefer', 'learning', 'new', 'skills', 'from', 'the', 'comforts', 'of', 'their', 'drawing', 'rooms']
STEMMING IN NLP
Stemming is a technique used to reduce an inflected word down to its word stem. For
example, the words “programming,” “programmer,” and “programs” can all be reduced down
to the common word stem “program.” In other words, “program” can be used as a synonym
for the three inflected words above.
Stemming is a technique used to extract the base form of the words by removing affixes from
them. It is just like cutting down the branches of a tree to its stems.
For example, the stem of the words eating, eats, eaten is eat.

Stemming Algorithms
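NLTK ships with several stemming algorithms, such as the Porter, Lancaster and Snowball stemmers. Below is a minimal sketch using the Porter stemmer on the example words above; the exact stems produced depend on the algorithm chosen.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Reduce each inflected word to its stem
for word in ["programming", "programmer", "programs", "eating", "eats"]:
    print(word, "->", stemmer.stem(word))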

Application of Stemming:
Stemming is used in information retrieval systems such as search engines, in text mining, SEO, Web search results, indexing, tagging systems, and word analysis. It is also used to determine domain vocabularies in domain analysis.
Disadvantage:
Overstemming and understemming are two problems that can arise in stemming.
Overstemming occurs when a stemmer reduces a word to its base form too aggressively, resulting in a stem that is not a valid word. For example, the word “fishing” might be overstemmed to “fishin,” which is not correct. Understemming is the opposite problem: the stemmer is too conservative, so related words are not reduced to the same stem.

Lemmatization
Lemmatization is the process of grouping together different inflected forms of the same word.
It's used in computational linguistics, natural language processing (NLP) and chatbots.
Lemmatization is a text pre-processing technique used in natural language processing (NLP) models to break a word down to its root meaning in order to identify similarities. For example, a lemmatization algorithm would reduce the word better to its root word, or lemma, good.
Example:
Lemmatization considers the context and converts the word to its meaningful base form, which is called the lemma. For instance, stemming the word 'Caring' would return 'Car', whereas lemmatizing the word 'Caring' would return 'Care'.
Lemmatization is another technique used to reduce inflected words to their root word. It
describes the algorithmic process of identifying an inflected word's “lemma” (dictionary
form) based on its intended meaning.
Examples of lemmatization:
# import these modules
from nltk.stem import WordNetLemmatizer

# requires the WordNet data: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

print("rocks :", lemmatizer.lemmatize("rocks"))

print("corpora :", lemmatizer.lemmatize("corpora"))

# "a" denotes adjective in the "pos" argument
print("better :", lemmatizer.lemmatize("better", pos="a"))

Output :

rocks : rock
corpora : corpus
better : good
Advantages of Lemmatization with NLTK:
1. Improves text analysis accuracy: Lemmatization helps in improving the
accuracy of text analysis by reducing words to their base or dictionary form.
This makes it easier to identify and analyze words that have similar meanings.
2. Reduces data size: Since lemmatization reduces words to their base form, it
helps in reducing the data size of the text, which makes it easier to handle large
datasets.
3. Better search results: Lemmatization helps in retrieving better search results
since it reduces different forms of a word to a common base form, making it
easier to match different forms of a word in the text.
4. Useful for feature extraction: Lemmatization can be useful in feature extraction
tasks, where the aim is to extract meaningful features from text for machine
learning tasks.

Disadvantages of Lemmatization with NLTK:


1. Requires prior knowledge: Lemmatization requires prior knowledge of the
language and the rules governing the formation of words. If the rules for a
specific language are not available, then the accuracy of the lemmatizer may be
affected.
2. Time-consuming: Lemmatization can be time-consuming since it involves
parsing the text and performing a lookup in a dictionary or a database of word
forms.
3. Not suitable for real-time applications: Since lemmatization is time-consuming,
it may not be suitable for real-time applications that require quick response
times.
4. May lead to ambiguity: Lemmatization may lead to ambiguity, as a single word
may have multiple meanings depending on the context in which it is used. In
such cases, the lemmatizer may not be able to determine the correct meaning of
the word.

POS tagging
In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context.
Part-of-speech tagging (POS tagging) is a process in which each word in a text is assigned its
appropriate morphosyntactic category (for example noun-singular, verb-past, adjective,
pronoun-personal, and the like).
Example:
In a rule-based POS tagging system, words are assigned POS tags based on their
characteristics and the context in which they appear. For example, a rule-based POS tagger
might assign the tag “noun” to any word that ends in “-tion” or “-ment,” as these suffixes are
often used to form nouns.
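As a simple illustration (a minimal sketch; the sentence is made up and the pre-trained tagger data must be downloaded first), NLTK's built-in tagger can be used as follows –

import nltk

# Requires: nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
text = "The striped bats are hanging on their feet"
tokens = nltk.word_tokenize(text)
print(nltk.pos_tag(tokens))   # prints a list of (word, tag) pairs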

Hidden Markov Model (HMM) POS Tagging


Before digging deep into HMM POS tagging, we must understand the concept of Hidden
Markov Model (HMM).
Hidden Markov Model
An HMM may be defined as a doubly-embedded stochastic model, in which the underlying stochastic process is hidden. This hidden stochastic process can only be observed through another set of stochastic processes that produces the sequence of observations.
Example
For example, suppose a sequence of hidden coin-tossing experiments is done and we see only the observation sequence consisting of heads and tails. The actual details of the process - how many coins are used, the order in which they are selected - are hidden from us. By observing this sequence of heads and tails, we can build several HMMs to explain the sequence. The following is one form of Hidden Markov Model for this problem −

We assume that there are two states in the HMM, and each state corresponds to the selection of a different biased coin. The following matrix gives the state transition probabilities −

A = [ a11  a12
      a21  a22 ]
Here,
• aij = probability of transition from state i to state j.
• a11 + a12 = 1 and a21 + a22 =1
• P1 = probability of heads of the first coin i.e. the bias of the first coin.
• P2 = probability of heads of the second coin i.e. the bias of the second coin.
We can also create an HMM model assuming that there are 3 coins or more.
This way, we can characterize HMM by the following elements −
• N, the number of states in the model (in the above example N =2, only two
states).
• M, the number of distinct observations that can appear with each state (in the above example M = 2, i.e., H or T).
• A, the state transition probability distribution − the matrix A in the above
example.
• P, the probability distribution of the observable symbols in each state (in our
example P1 and P2).
• I, the initial state distribution.
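As a rough sketch of how an HMM tagger can be trained in practice (assuming the Penn Treebank sample has been downloaded with nltk.download('treebank'); the split size is arbitrary), NLTK provides a supervised HMM trainer –

from nltk.corpus import treebank
from nltk.tag import hmm

# Train an HMM tagger on tagged sentences from the Treebank sample
train_sents = treebank.tagged_sents()[:3000]
trainer = hmm.HiddenMarkovModelTrainer()
tagger = trainer.train_supervised(train_sents)

# Tag a new (already tokenized) sentence
print(tagger.tag("the book has many chapters".split()))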
Chunking in NLP
Chunking is defined as the process of natural language processing used to identify parts of
speech and short phrases present in a given sentence.

Chunking, one of the important processes in natural language processing, is used to identify
parts of speech (POS) and short phrases. In other simple words, with chunking, we can get
the structure of the sentence. It is also called partial parsing.
Chunk patterns and chinks
Chunk patterns are the patterns of part-of-speech (POS) tags that define what kinds of words make up a chunk. We can define chunk patterns with the help of modified regular expressions.
Moreover, we can also define patterns for what kind of words should not be in a chunk and
these unchunked words are known as chinks.
Implementation example
In the example below, along with the result of parsing the sentence “the book has many
chapters”, there is a grammar for noun phrases that combines both a chunk and a chink
pattern −
import nltk

sentence = [
    ("the", "DT"),
    ("book", "NN"),
    ("has", "VBZ"),
    ("many", "JJ"),
    ("chapters", "NNS")
]

# An NP chunk pattern plus a chink pattern that removes verbs from the chunk
chunker = nltk.RegexpParser(
    r'''
    NP: {<DT><NN.*><.*>*<NN.*>}
        }<VB.*>{
    '''
)

output = chunker.parse(sentence)
print(output)
output.draw()

OUTPUT
The draw() call opens a window showing the chunk structure as a tree; printed, the result is roughly:
(S (NP the/DT book/NN) has/VBZ (NP many/JJ chapters/NNS))
WordNet

WordNet is a massive lexicon of English words. Nouns, verbs, adjectives, and adverbs are arranged into 'synsets', which are collections of cognitive synonyms that each express a distinct concept. Conceptual-semantic and linguistic links such as hyponymy and antonymy are used to connect synsets.

WordNet has been used for a number of purposes in information systems, including word-
sense disambiguation, information retrieval, automatic text classification, automatic text
summarization, machine translation and even automatic crossword puzzle generation.

WordNet is a network of words linked by lexical and semantic relations. Nouns, verbs,
adjectives and adverbs are grouped into sets of cognitive synonyms, called synsets, each
expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and
lexical relations.
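For illustration (a minimal sketch, assuming the WordNet data has been downloaded with nltk.download('wordnet')), NLTK exposes WordNet through its corpus module –

from nltk.corpus import wordnet as wn

# All synsets that contain the word "car"
print(wn.synsets("car"))

# Inspect one synset: its synonyms (lemmas), definition and hypernyms
car = wn.synset("car.n.01")
print(car.lemma_names())
print(car.definition())
print(car.hypernyms())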

Application of WordNet in Query Expansion


Query Expansion is the term used when a search engine adds search terms to a user's weighted search. The goal is to improve precision and/or recall. Example: user query: “car”; expanded query: “car cars automobile automobiles auto”, etc.
Four main phases:
1. decomposition
2. optimization
3. code generation
4. execution
A query language is a computer programming language used to retrieve information from a database. The uses of databases are manifold. They provide a means of retrieving records or parts of records and performing various calculations before displaying the results.
Query by example is a query language used in relational databases that allows users to search for information in tables and fields by providing a simple user interface where the user is able to input an example of the data that he or she wants to access.
Query expansion is used as a term for adding related words to a query in order to increase the number of returned documents and improve recall accordingly. Typically, all the keywords in each query should first be extracted, and then for each keyword the synonyms and acronyms are automatically selected.
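A minimal sketch of WordNet-based query expansion is shown below; the helper function expand_query is hypothetical and simply collects the lemma names of all synsets of a keyword –

from nltk.corpus import wordnet as wn

def expand_query(term):
    """Hypothetical helper: return the term plus the lemma names of all its synsets."""
    expanded = {term.lower()}
    for synset in wn.synsets(term):
        for lemma in synset.lemma_names():
            expanded.add(lemma.replace("_", " ").lower())
    return expanded

# "car" expands to something like {"car", "auto", "automobile", ...}
print(expand_query("car"))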

What is Semantic or Wordnet Similarity?

Measuring sentence similarity is the task of determining how similar the meanings of two sentences or words are. Example: for the two questions “How old are you?” and “What is your age?” the answer is the same, “I am 20 years old.” The system gives the same answer for the two questions because the two questions are semantically related, i.e. they are similar in meaning. Our task is to identify the similarity between the sentences.

Right at the start, I'm going to ask you a question. Which pair of words is the most similar among the following?

• deer , elk
• deer , giraffe
• deer , horse
• deer, mouse

Most of us would say deer and elk, but how can a machine say this? It can, by using semantic measures and semantic resources. There are a lot of high-performance techniques to do this, but here is the basic process.

Two entities can be termed as similar under one of the following conditions:

1. Both belong to the same class


2. Both belong to classes that have a common parent class
3. One entity belongs to a class that is a parent of the class to which the other entity
belongs
Two relationships can be termed as similar under one of the following conditions:

1. Both belong to the same class


2. Both belong to classes with a common parent class
3. One relation belongs to a class that is a parent of the class to which the other relation
belongs

When should we go with Semantic Similarity?

1. If your task is to group words which are similar in meaning, then you should go with Semantic Similarity.
2. It is a basic building block of Natural Language Understanding tasks. Textual Entailment: consider a paragraph P and a sentence S; if you want to find whether the sentence S derives its meaning from paragraph P or not, then you can go with Semantic Similarity. Paraphrasing: paraphrasing is a task where you rephrase or rewrite a sentence into another sentence that has the same meaning.

Resources used for Semantic Similarity

1. There are a lot of resources used for semantic similarity; one of them is WordNet. WordNet is a semantic dictionary of English words that are interlinked by semantic relations.
2. It also includes rich linguistic information such as parts of speech, word senses (i.e. the different meanings of a word), hypernyms, hyponyms, etc.
3. WordNet is machine readable and freely available, hence it is the resource used most often for Natural Language Processing tasks.

Measures of Semantic Similarity

Here are a few different algorithms used to measure semantic similarity between words; they are important to know in the context of new age enterprise search platforms like 3RDi Search.

1]Path Length Score

Path Length is a score that represents the number of edges that connect two words in
the shortest path. In a thesaurus hierarchy graph, the shorter the path between two
words/senses, the more related they are.

2] Leacock-Chodorow Score

This score is also based on the number of edges between two words/senses, like the path length score, but with log smoothing applied; due to the log smoothing it is continuous in nature.
3] Wu & Palmer Score
This is a score that considers the positions of concepts c1 and c2 in the taxonomy in relation to the Least
Common Subsumer (c1, c2). In path-based measurements, it considers that the similarity between two
concepts is a function of path length and depth.
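A minimal sketch of these graph-based scores using NLTK's WordNet interface is shown below; the synset names (e.g. deer.n.01) are assumptions chosen to match the earlier deer example –

from nltk.corpus import wordnet as wn

deer = wn.synset("deer.n.01")
elk = wn.synset("elk.n.01")
horse = wn.synset("horse.n.01")
mouse = wn.synset("mouse.n.01")

# Path length score: shorter path => value closer to 1
print(deer.path_similarity(elk), deer.path_similarity(mouse))

# Leacock-Chodorow score: path length with log smoothing (same POS only)
print(deer.lch_similarity(elk))

# Wu & Palmer score: based on the depth of the Least Common Subsumer
print(deer.wup_similarity(horse))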

4] Resnik Similarity Score


Based on the Information Content (IC) of the Least Common Subsumer, this score indicates how similar two word senses are.

Information content is derived from the frequency counts of concepts in a corpus of text. Each time a concept is observed, the frequency associated with that concept is updated, as are the counts of its ancestor concepts in the WordNet hierarchy (for nouns and verbs).

5] Lin Similarity Score


This is a score that takes into account both the amount of information required to state the similarities between the two concepts and the amount of information required to fully describe these terms.

6] Cosine Similarity
Measuring semantic similarity does not depend on this type alone but combines it with other types; here words or sentences are represented as non-zero feature vectors and the distance between the vectors is measured. The most important algorithms of this type are Manhattan Distance, Euclidean Distance, Cosine Similarity, Jaccard Index, and Sorensen-Dice Index.

For cosine similarity, cosine(x, y) = ( Σ xi · yi ) / ( √(Σ xi²) · √(Σ yi²) ), with the sums running from i = 1 to n,

where n is the size of the feature vector.
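A minimal sketch of the cosine similarity computation on two made-up term-frequency vectors (the vectors are illustrative, not taken from any corpus) –

import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

print(cosine_similarity([1, 2, 0, 1], [2, 1, 1, 0]))  # 4 / 6 ≈ 0.67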


Conclusion:
Semantic similarity is a significant concept today, as it has multiple applications in Natural Language Processing (NLP) and forms one of the building blocks of new age enterprise search platforms.
