NLP UNIT-2
Size of NLTK:
The NLTK book collection of data consists of about 30 compressed files requiring about 100 MB of disk space. The full collection of data (i.e., everything in the downloader) is nearly ten times this size (at the time of writing) and continues to expand. Once the data is downloaded to your machine, you can load some of it using the Python interpreter.
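As a minimal sketch (the resource names 'punkt' and 'gutenberg' are just examples from the standard downloader and may differ slightly between NLTK versions), individual packages can be downloaded and then loaded like this:
import nltk

# Download individual resources instead of the full collection
nltk.download('punkt')      # tokenizer models used by sent_tokenize/word_tokenize
nltk.download('gutenberg')  # a small sample corpus

# Load some of the downloaded data
from nltk.corpus import gutenberg
print(gutenberg.fileids()[:3])  # list a few of the available text files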
Tokenization in NLP
Tokenization is the start of the NLP process, converting sentences into understandable bits of
data that a program can work with. Without a strong foundation built through tokenization,
the NLP process can quickly devolve into a messy telephone game.
NLTK tokenization is used to split a large amount of textual data into parts so that the character of the text can be analysed. Tokenization with NLTK can be used when training machine learning models and for text cleaning in natural language processing.
In Python, tokenization basically refers to splitting up a larger body of text into smaller lines or words, or even creating words for a non-English language. The various tokenization functions are built into the nltk module itself and can be used in programs as shown below.
Sentence Tokenization:
import nltk
# requires the 'punkt' sentence tokenizer models (nltk.download('punkt'))
sentence_data = "The First sentence is about Python. The Second: about Django. You can learn Python, Django and Data Analysis here."
nltk_tokens = nltk.sent_tokenize(sentence_data)
print(nltk_tokens)
O/P:
['The First sentence is about Python.', 'The Second: about Django.', 'You can learn Python, Django and Data Analysis here.']
Word Tokenization:
import nltk
word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"
nltk_tokens = nltk.word_tokenize(word_data)
print(nltk_tokens)
O/P: ['It', 'originated', 'from', 'the', 'idea', 'that', 'there', 'are', 'readers',
'who', 'prefer', 'learning', 'new', 'skills', 'from', 'the',
'comforts', 'of', 'their', 'drawing', 'rooms']
STEMMING IN NLP
Stemming is a technique used to reduce an inflected word down to its word stem. For example, the words “programming,” “programmer,” and “programs” can all be reduced down to the common word stem “program.” In other words, “program” can stand in for the three inflected words above.
Stemming extracts the base form of words by removing affixes from them. It is just like cutting down the branches of a tree to its stem.
For example, the stem of the words eating, eats, eaten is eat.
Stemming Algorithms
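One widely used stemming algorithm is the Porter stemmer, which ships with NLTK. A minimal sketch (the word list below is only an illustration; exact stems can differ between stemming algorithms):
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Inflected forms from the examples above
words = ["programming", "programmer", "programs", "eating", "eats"]
for word in words:
    # e.g. "programming" and "programs" both reduce to "program"
    print(word, "->", stemmer.stem(word))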
Application of Stemming:
Stemming is used in information retrieval systems such as search engines, text mining, SEO, web search results, indexing, tagging systems, and word analysis. It is also used to determine domain vocabularies in domain analysis.
Disadvantage:
Overstemming and understemming are two problems that can arise in stemming.
Overstemming occurs when a stemmer reduces a word to its base form too aggressively, resulting in a stem that is not a valid word. For example, the word “fishing” might be overstemmed to “fishin,” which is not correct. Understemming is the opposite problem: the stemmer is too conservative, so related forms of the same word fail to be reduced to a common stem.
Lemmatization
Lemmatization is the process of grouping together different inflected forms of the same word.
It's used in computational linguistics, natural language processing (NLP) and chatbots.
Lemmatization is a text pre-processing technique used in natural language processing (NLP)
models to break a word down to its root meaning and identify similarities. For example, a lemmatization algorithm would reduce the word better to its root word, or lemma, good.
Example:
Lemmatization considers the context and converts the word to its meaningful base form, which is called the lemma. For instance, stemming the word 'Caring' would return 'Car', whereas lemmatizing the word 'Caring' would return 'Care'.
Lemmatization is another technique used to reduce inflected words to their root word. It
describes the algorithmic process of identifying an inflected word's “lemma” (dictionary
form) based on its intended meaning.
Examples of lemmatization:
# import these modules
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))
print("better :", lemmatizer.lemmatize("better", pos="a"))  # pos="a" treats the word as an adjective
Output :
rocks : rock
corpora : corpus
better : good
Advantages of Lemmatization with NLTK:
1. Improves text analysis accuracy: Lemmatization helps in improving the
accuracy of text analysis by reducing words to their base or dictionary form.
This makes it easier to identify and analyze words that have similar meanings.
2. Reduces data size: Since lemmatization reduces words to their base form, it
helps in reducing the data size of the text, which makes it easier to handle large
datasets.
3. Better search results: Lemmatization helps in retrieving better search results
since it reduces different forms of a word to a common base form, making it
easier to match different forms of a word in the text.
4. Useful for feature extraction: Lemmatization can be useful in feature extraction
tasks, where the aim is to extract meaningful features from text for machine
learning tasks.
POS tagging
In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context.
Part-of-speech tagging (POS tagging) is a process in which each word in a text is assigned its
appropriate morphosyntactic category (for example noun-singular, verb-past, adjective,
pronoun-personal, and the like).
Example:
In a rule-based POS tagging system, words are assigned POS tags based on their
characteristics and the context in which they appear. For example, a rule-based POS tagger
might assign the tag “noun” to any word that ends in “-tion” or “-ment,” as these suffixes are
often used to form nouns.
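As a small sketch of tagging in practice with NLTK (the sentence is only an illustration, and the tagger resource name may vary slightly between NLTK versions):
import nltk

# The pre-trained tagger and tokenizer models ship as separate downloads
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))
# Each token is paired with a tag, e.g. ('fox', 'NN'), ('jumps', 'VBZ')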
Stochastic POS taggers are often built on a Hidden Markov Model (HMM). To illustrate the idea, consider a coin-tossing experiment in which a hidden choice between two biased coins produces the observed sequence of heads (H) and tails (T). We assume that there are two states in the HMM, each corresponding to the selection of a different biased coin. The following matrix gives the state transition probabilities −
A = [ a11  a12 ]
    [ a21  a22 ]
Here,
• aij = the probability of transition from state i to state j.
• a11 + a12 = 1 and a21 + a22 =1
• P1 = probability of heads of the first coin i.e. the bias of the first coin.
• P2 = probability of heads of the second coin i.e. the bias of the second coin.
We can also create an HMM model assuming that there are three or more coins.
This way, we can characterize an HMM by the following elements (a small sketch in code follows the list) −
• N, the number of states in the model (in the above example N =2, only two
states).
• M, the number of distinct observations that can appear with each state (in the above example M = 2, i.e., H or T).
• A, the state transition probability distribution − the matrix A in the above
example.
• P, the probability distribution of the observable symbols in each state (in our
example P1 and P2).
• I, the initial state distribution.
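A minimal sketch of the two-coin HMM described above, written with NumPy; all numeric values are assumptions chosen only for illustration:
import numpy as np

# N = 2 hidden states (coin 1, coin 2); M = 2 observable symbols (H, T)
states = ["coin1", "coin2"]
observations = ["H", "T"]

# A: state transition probabilities (each row sums to 1) -- illustrative values
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])

# P: observation probabilities per state, i.e. the coin biases P1 and P2
P = np.array([[0.9, 0.1],   # coin 1: P1 = 0.9 chance of heads
              [0.2, 0.8]])  # coin 2: P2 = 0.2 chance of heads

# I: initial state distribution
I = np.array([0.5, 0.5])

# Generate a short observation sequence from the model
rng = np.random.default_rng(0)
state = rng.choice(2, p=I)
sequence = []
for _ in range(10):
    sequence.append(observations[rng.choice(2, p=P[state])])
    state = rng.choice(2, p=A[state])
print("".join(sequence))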
Chunking in NLP
Chunking is a natural language processing technique used to identify parts of speech (POS) and the short phrases present in a given sentence. In other words, with chunking we can get at the structure of the sentence. It is also called partial parsing.
Chunk patterns and chinks
Chunk patterns are the patterns of part-of-speech (POS) tags that define what kind of words make up a chunk. We can define chunk patterns with the help of modified regular expressions.
Moreover, we can also define patterns for what kind of words should not be in a chunk; these unchunked words are known as chinks.
Implementation example
The example below parses the sentence “the book has many chapters” using a grammar for noun phrases that combines both a chunk pattern and a chink pattern −
import nltk

# The sentence is supplied already POS-tagged
sentence = [
    ("the", "DT"),
    ("book", "NN"),
    ("has", "VBZ"),
    ("many", "JJ"),
    ("chapters", "NNS")
]

# NP grammar: chunk determiner+noun sequences, then chink out any verbs
chunker = nltk.RegexpParser(
    r'''
    NP: {<DT><NN.*><.*>*<NN.*>}   # chunk pattern
        }<VB.*>{                  # chink pattern
    '''
)

output = chunker.parse(sentence)
print(output)
output.draw()
OUTPUT
(S (NP the/DT book/NN) has/VBZ (NP many/JJ chapters/NNS))
Here “the book” and “many chapters” are chunked as noun phrases (NP), while the verb “has” is chinked out and left outside the chunks.
WordNet
WordNet is a massive lexicon of English words. Nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms called synsets, each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations such as hyponymy and antonymy.
WordNet has been used for a number of purposes in information systems, including word-sense disambiguation, information retrieval, automatic text classification, automatic text summarization, machine translation, and even automatic crossword puzzle generation.
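A small sketch of browsing WordNet through NLTK (the word 'car' is just an example; the wordnet corpus must be downloaded first):
import nltk
nltk.download('wordnet')

from nltk.corpus import wordnet as wn

# All synsets (senses) that contain the word "car"
for synset in wn.synsets('car'):
    print(synset.name(), "-", synset.definition())

# Relations of the first sense
first = wn.synsets('car')[0]
print("Synonyms :", first.lemma_names())
print("Hypernyms:", first.hypernyms())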
Semantic Similarity
Measuring sentence similarity is the task of determining how similar the meanings of two sentences (or two words) are. Example: for the two questions “How old are you?” and “What is your age?” the answer is the same, “I am 20 years old.” The system gives the same answer for both questions because they are semantically related, i.e., they are similar in meaning. Our task is to identify this similarity between sentences.
Right at the start, I'm going to ask you a question. Which pair of words is the most similar of the following?
• deer, elk
• deer, giraffe
• deer, horse
• deer, mouse
Most of us would say deer and elk, but how can a machine decide this? It does so by using semantic measures and semantic resources. There are a lot of high-performance techniques for this, but here is the basic process.
Semantic similarity is useful in cases such as the following:
1. If your task is to group together words which are similar in meaning, then you should go with Semantic Similarity.
2. It is a basic building block of Natural Language Understanding tasks. Textual Entailment: consider a paragraph P and a sentence S; if you want to find whether the sentence S derives its meaning from paragraph P, you can go with Semantic Similarity. Paraphrasing: paraphrasing is a task where you rephrase or rewrite a sentence into another sentence that has the same meaning.
1. There are many resources used for semantic similarity; one of them is WordNet, a semantic dictionary of English words interlinked by semantic relations.
2. It also includes rich linguistic information such as parts of speech, word senses (i.e., the different meanings of a word), hypernyms, hyponyms, etc.
3. WordNet is machine readable and freely available, hence it is widely used for natural language processing tasks.
Here are a few different algorithms used to measure semantic similarity between words; they are important to know in the context of new-age enterprise search platforms like 3RDi Search.
1] Path Length
Path Length is a score based on the number of edges on the shortest path connecting two words/senses. In a thesaurus hierarchy graph, the shorter the path between two words/senses, the more related they are.
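A brief sketch of the path-length measure in NLTK, applied to the deer question above (the synset names such as 'deer.n.01' simply pick the first noun sense of each word and are chosen here only for illustration):
from nltk.corpus import wordnet as wn

deer = wn.synset('deer.n.01')
for name in ['elk.n.01', 'giraffe.n.01', 'horse.n.01', 'mouse.n.01']:
    other = wn.synset(name)
    # path_similarity = 1 / (shortest path length + 1): higher means more related
    print(name, deer.path_similarity(other))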
2] Leacock-Chodorow Score
This score is similar to path length but includes log smoothing: it is based on the number of edges between two words/senses and, because of the log smoothing, it is continuous in nature.
3] Wu & Palmer Score
This is a score that considers the positions of the concepts c1 and c2 in the taxonomy in relation to their Least Common Subsumer, LCS(c1, c2). Among path-based measurements, it treats the similarity between two concepts as a function of both path length and depth.
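Both of these scores are also available on WordNet synsets in NLTK; a brief sketch continuing the deer/elk example (assuming the wordnet corpus is installed):
from nltk.corpus import wordnet as wn

deer = wn.synset('deer.n.01')
elk = wn.synset('elk.n.01')

# Leacock-Chodorow: path length with log smoothing (both synsets must share a POS)
print("LCH :", deer.lch_similarity(elk))

# Wu & Palmer: based on the depths of the two synsets and of their least common subsumer
print("WUP :", deer.wup_similarity(elk))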
Information content-based scores use the frequency counts of concepts in a corpus of text as information content. Each time a concept is observed, the frequency associated with that concept is updated in WordNet, as are the counts of its ancestor concepts in the WordNet hierarchy (for nouns and verbs).
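These information-content measures are exposed in NLTK once an information-content file is loaded; a brief sketch (the Brown-corpus IC file used here is one of the files shipped in the wordnet_ic data package):
import nltk
nltk.download('wordnet_ic')

from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

# Information content precomputed from the Brown corpus
brown_ic = wordnet_ic.ic('ic-brown.dat')

deer = wn.synset('deer.n.01')
elk = wn.synset('elk.n.01')
# Resnik similarity: information content of the least common subsumer
print("Resnik:", deer.res_similarity(elk, brown_ic))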
6] Cosine Similarity
Measuring semantic similarity does not rely on this type on its own; it is combined with the other types and measures the distance between non-zero vectors of features. The most important algorithms of this type are Manhattan Distance, Euclidean Distance, Cosine Similarity, Jaccard Index, and Sorensen-Dice Index, as illustrated in the sketch below.
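As a minimal sketch of the vector-based idea, here is cosine similarity over simple bag-of-words count vectors, applied to the two "age" questions from earlier (the toy vectorization is an assumption for illustration; it also shows why purely lexical measures are usually combined with semantic ones):
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    # Bag-of-words count vectors
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    vocab = set(a) | set(b)
    dot = sum(a[w] * b[w] for w in vocab)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

print(cosine_similarity("How old are you", "What is your age"))  # no shared words -> 0.0
print(cosine_similarity("How old are you", "How old is she"))    # partial overlap -> 0.5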