
21CSE356T – NATURAL LANGUAGE PROCESSING

Instructor:
Ms. S. Rama,
Assistant Professor
Department of Information Technology,
SRM Institute of Science and Technology,
[email protected]
Unit I
▪ Introduction to Natural Language Processing

▪ Applications of NLP

▪ Levels of NLP

▪ Regular Expressions

▪ Morphological analysis

▪ Tokenization, Stemming, Lemmatization

▪ Feature Extraction:

▪ Term Frequency (TF), Inverse Document Frequency (IDF), Modeling using TF/IDF, Parts of Speech Tagging

▪ Named Entity Recognition

▪ N-grams

▪ Smoothing



Welcome to my class: shown in Hindi (“meree kaksha mein aapaka svaagat hai”), German (“Willkommen in meiner Klasse”), Japanese (“Watashi no kurasu e yōkoso”), and Tamil (“Eṉathu vakuppiṟku varuka”).


WHAT?



WHEN?



WHERE?



WHY?
▪ Analysing tons of data
▪ Identifying various languages and dialects
▪ Applying quantitative analysis
▪ Handling ambiguities



Applications of NLP
▪Sentiment analysis
• Sentiment analysis, also referred to as opinion mining, is an approach to natural language
processing (NLP) that identifies the emotional tone behind a body of text.
• This is a popular way for organizations to determine and categorize opinions about a product,
service or idea.
• Sentiment analysis systems help organizations gather insights into real-time customer sentiment,
customer experience and brand reputation.
• Generally, these tools use text analytics to analyze online sources such as emails, blog posts, online
reviews, news articles, survey responses, case studies, web chats, tweets, forums and comments.
• Sentiment analysis uses machine learning models to perform text analysis of human language. The
metrics used are designed to detect whether the overall sentiment of a piece of text is positive,
negative or neutral.
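As an illustration (not part of the original slide), a minimal sentiment-analysis sketch can be written with NLTK's VADER analyzer; it assumes the vader_lexicon resource is downloadable and the review text is just a sample:

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')        # one-time download of the VADER lexicon
sia = SentimentIntensityAnalyzer()

review = "The product is great, but delivery was painfully slow."
scores = sia.polarity_scores(review)  # dict with 'neg', 'neu', 'pos', 'compound' scores
print(scores)
# a common convention: compound >= 0.05 -> positive, <= -0.05 -> negative, otherwise neutral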



Applications of NLP
▪Text Extraction
• There are a number of natural language processing techniques that can be used to extract
information from text or unstructured data.
• These techniques can be used to extract information such as entity names, locations, quantities, and
more.
• With the help of natural language processing, computers can make sense of the vast amount of
unstructured text data that is generated every day, and humans can reap the benefits of having this
information readily available.
• Industries such as healthcare, finance, and e-commerce are already using natural language
processing techniques to extract information and improve business processes.
• As machine learning technology continues to develop, we will see more and more information
extraction use cases covered.



Applications of NLP
▪Text Classification

• Unstructured text is everywhere, such as emails, chat conversations, websites, and social media.
Nevertheless, it’s hard to extract value from this data unless it’s organized in a certain way.
• Text classification, also known as text tagging or text categorization, is the process of categorizing
text into organized groups. By using Natural Language Processing (NLP), text classifiers
can automatically analyze text and then assign a set of pre-defined tags or categories based on its
content.
• Text classification is becoming an increasingly important part of businesses, as it allows them to easily
get insights from data and automate business processes.



Applications of NLP
▪Speech Recognition
• Speech recognition is an interdisciplinary subfield of computer science and computational
linguistics that develops methodologies and technologies that enable the recognition and
translation of spoken language into text by computers.
• It is also known as automatic speech recognition (ASR), computer speech recognition or speech to
text (STT).
• It incorporates knowledge and research in the computer science, linguistics
and computer engineering fields. The reverse process is speech synthesis.



Applications of NLP
▪Chatbot

• Chatbots are computer programs that conduct automatic conversations with people. They
are mainly used in customer service for information acquisition. As the name implies,
these are bots designed with the purpose of chatting and are also simply referred to as
“bots.”
• You’ll come across chatbots on business websites or messengers that give pre-scripted
replies to your questions. As the entire process is automated, bots can provide quick
assistance 24/7 without human intervention.



Applications of NLP
▪Machine Translation

• Machine translation, sometimes referred to by the abbreviation MT, is a sub-field of
computational linguistics that investigates the use of software to translate text or speech from one
language to another.
• On a basic level, MT performs mechanical substitution of words in one language for words in
another, but that alone rarely produces a good translation because recognition of whole phrases and
their closest counterparts in the target language is needed.
• Not all words in one language have equivalent words in another language, and many words have
more than one meaning.
• Solving this problem with corpus statistical and neural techniques is a rapidly growing field that is
leading to better translations, handling differences in linguistic typology, translation of idioms, and
the isolation of anomalies.
• Corpus: A collection of written texts, especially the entire works of a particular author.



Applications of NLP
▪Email Filter
• One of the most fundamental and essential applications of NLP online is email filtering. It began
with spam filters, which identified specific words or phrases that indicate a spam message. But,
like early NLP adaptations, filtering has been improved.
• Gmail's email categorization is one of the more common, newer implementations of NLP. Based
on the contents of emails, the algorithm determines whether they belong in one of three categories
(main, social, or promotional).
• This keeps your inbox manageable for all Gmail users, surfacing the critical, relevant emails you want
to see and reply to quickly.



Applications of NLP
▪Search Autocorrect and Autocomplete
• When you type 2-3 letters into Google to search for anything, it displays a list of probable
search keywords. Alternatively, if you search for anything with mistakes, it corrects them
for you while still returning relevant results. Isn't it incredible?
• Everyone uses Google search autocorrect autocomplete on a regular basis but seldom
gives it any thought. It's a fantastic illustration of how natural language processing is
touching millions of people across the world, including you and me.
• Both search autocomplete and autocorrect make it much easier to locate accurate results.



Applications of NLP
▪ Automatic Summarization
• Summarizing the meaning of documents and information
• Extracting the key emotional information from the text to understand reactions (e.g., on social media)



Components of NLP
There are the following two components of NLP:
1. Natural Language Understanding (NLU)
▪Natural Language Understanding (NLU) helps the machine to understand and analyse human
language by extracting the metadata from content such as concepts, entities, keywords,
emotion, relations, and semantic roles.

▪NLU is mainly used in business applications to understand the customer's problem in both
spoken and written language.

▪NLU involves the following tasks -

o It is used to map the given input into useful representation.

o It is used to analyze different aspects of the language.

2. Natural Language Generation (NLG)

▪ Natural Language Generation (NLG) acts as a translator that converts the computerized data
into natural language representation. It mainly involves Text planning, Sentence planning, and
Text Realization.



Difference between NLU and NLG

NLU: NLU is the process of reading and interpreting language. It produces non-linguistic outputs from natural language inputs.

NLG: NLG is the process of writing or generating language. It produces natural language outputs from non-linguistic inputs.


Terminologies
Phonology − It is study of organizing sound systematically.

Morphology − It is a study of construction of words from primitive meaningful units.

Morpheme − It is primitive unit of meaning in a language.

Syntax − It refers to arranging words to make a sentence. It also involves determining the structural role of
words in the sentence and in phrases.

Semantics − It is concerned with the meaning of words and how to combine words into meaningful phrases
and sentences.

Pragmatics − It deals with using and understanding sentences in different situations and how the
interpretation of the sentence is affected.

Discourse − It deals with how the immediately preceding sentence can affect the interpretation of the next
sentence.

World Knowledge − It includes the general knowledge about the world.


Phases / Levels of NLP



Lexical Analysis
▪ The first phase is lexical analysis/morphological processing. In this phase, sentences and paragraphs are broken into tokens.
▪ These tokens are the smallest units of text. The analyzer scans the entire source text and divides it into meaningful lexemes.
▪ For example, the sentence “He goes to college.” is divided into [‘He’, ‘goes’, ‘to’, ‘college’, ‘.’]. There are five tokens in the sentence. A paragraph may also be divided into sentences.
▪ Lexeme: A lexeme is a basic unit of meaning. In linguistics, the abstract unit of morphological analysis that
corresponds to a set of forms taken by a single word is called lexeme.



Syntactic Analysis/Parsing
▪ The second phase is syntactic analysis. In this phase, the sentence is checked for whether it is well-formed or not.

▪ The word arrangement is studied and the syntactic relationships between the words are found. The sentence is checked for word arrangement and grammar.

▪ For example, the sentence “Delhi goes to him” is rejected by the syntactic parser.



Semantic Analyzer
▪ The third phase is Semantic Analysis. In this phase, the sentence is checked for
the literal meaning of each word and their arrangement together.
▪ Syntax is the grammatical structure of the text, whereas semantics is the
meaning being conveyed.
▪ A sentence that is syntactically correct, however, is not always semantically
correct. For example, “cows flow supremely” is grammatically valid (subject—
verb—adverb) but it doesn't make any sense.
▪ The sentence “I ate hot ice cream” will get rejected by the semantic analyzer
because it doesn’t make sense.
▪ “Colorless green idea” would be rejected by the semantic analysis, since “colorless green” does not make any sense.



Discourse Integration
▪ The fourth phase is discourse integration. In this phase,
the impact of the sentences before a particular sentence
and the effect of the current sentence on the upcoming
sentences is determined.
▪ • For example, the word “that” in the sentence “He
wanted that” depends upon the prior discourse context.



Pragmatic Analysis
▪ The last phase of natural language processing is Pragmatic
analysis. Sometimes the discourse integration phase and
pragmatic analysis phase are combined.

▪ The actual effect of the text is discovered by applying the set of rules that characterize cooperative dialogues.
▪ E.g., “close the window?” should be interpreted as a request instead of an order.



Ambiguity
▪ She is looking for a match
▪ The fisherman went to the bank
▪ I saw the man with the binoculars
▪ Visiting relatives can be boring
▪ The boy told his father the theft. He is very upset



NLP Implementation
▪ Below are popular methods used to implement NLP:
▪ Machine learning:
▪ NLP procedures can be learned automatically through machine learning, which focuses on the most common cases; rules written by hand, in contrast, are often prone to human error.
▪ Statistical inference:
▪ NLP can make use of statistical inference algorithms. These help produce models that are robust, e.g., to input containing words or structures that were not seen before.



Basic NLP pipeline
(1) Language detection
(2) Text cleanup ( normalization . . . )
(3) Sentence segmentation
(4) Tokenization
(5) Morphological processing: stemming
(6) POS tagging
(7) Morphological processing: lemmatization
(8) Syntactic processing: parsing . . .
Higher-level tasks (semantics, information extraction, . . . )



REGULAR EXPRESSIONS



Regular Expressions
Regular expression (RE):
a language for specifying text search strings
● They are particularly useful for searching in texts, when we have a
pattern to search for and a corpus of texts to search through
○ In an information retrieval (IR) system such as a Web search
engine, the texts might be entire documents or Web pages
○ In a word-processor, the texts might be individual words, or lines
of a document
● grep command in Linux
○ grep ‘nlp’ /path/file



Regular Expressions
The simplest kind of regular expression is a sequence of simple
characters
● For example, to search for language, we type /language/
● The search string can consist of a single character (like /!/) or a
sequence of characters (like /urgl/)

Regular expressions are case-sensitive; lower case /s/ is distinct from uppercase /S/



Examples
▪ To match a sequence of literal characters, simply write those characters in the pattern.

▪ To match a single character from a set of possibilities, use square brackets, e.g. [0123456789]
matches any digit.

▪ To match zero or more occurrences of the preceding expression, use the star (*) symbol.

▪ To match one or more occurrences of the preceding expression, use the plus (+) symbol.

▪ It is important to note that regex can be complex and difficult to read, so it is recommended to
use tools like regex testers to debug and optimize your patterns.



Basic Regular Expression patterns
▪ Square brackets [] specify a disjunction of characters, e.g. [abc] matches a, b, or c
▪ A caret ^ as the first character inside [] specifies negation, e.g. [^0-9] matches any non-digit
▪ [] plus the dash - specifies a range, e.g. [a-z]
▪ The period . matches any single character
▪ ? marks the preceding expression as optional (zero or one occurrence)


▪ Disjunction: the vertical bar (|) matches any one of the elements it separates, e.g. cat|dog
▪ Aliases for common sets of characters: \d (digit), \D (non-digit), \w (alphanumeric or underscore), \W, \s (whitespace), \S
▪ Grouping characters: ( )
▪ Comment: (?# comment)
▪ RE operators for counting: * (zero or more), + (one or more), ? (zero or one), {n} (exactly n), {n,m} (from n to m occurrences)


Backslash characters and anchors
▪ Some characters need to be backslashed
▪ Anchors: ^ $ \b
▪ The caret ^ matches the start of a line. The pattern /^The/ matches the word The only at the start of a line.
▪ The dollar sign $ matches the end of a line.
▪ \b matches a word boundary. Thus, /\bthe\b/ matches the word “the” but not the word other.


Examples

RE              Matches
cat|dog         cat or dog
gupp(y|ies)     guppy or guppies
colou?r         color or colour
Beg.n           Begin, Begun
a.*b            anything that starts with a and ends with b
(a|b)?c         ac, bc, c
(ba){2,3}       baba, bababa
How are you\?   How are you?
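As a quick check (not part of the original slide), these patterns can be tried with Python's re module; the test strings are illustrative:

import re

# fullmatch() succeeds only if the whole string matches the pattern
print(re.fullmatch(r'cat|dog', 'dog') is not None)          # True
print(re.fullmatch(r'gupp(y|ies)', 'guppies') is not None)  # True
print(re.fullmatch(r'colou?r', 'color') is not None)        # True
print(re.fullmatch(r'(ba){2,3}', 'bababa') is not None)     # True
print(re.fullmatch(r'a.*b', 'a quick grab') is not None)    # True: starts with a, ends with b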


RegEx Module (re module) in Python

The re module offers a set of methods that allow us to search a string for a match; these are:
• match: Returns a Match object if there is a match at the beginning of the string, else returns None
• findall: Returns a list containing all matches
• search: Returns a Match object if there is a match anywhere in the string
• split: Returns a list where the string has been split at each match
• sub: Replaces one or many matches with a string


match()

Syntax: re.match(pattern, string)

Example:
import re
# match 'python' at the beginning of the given sentence
result = re.match('python', 'python programming and python')
print(result)
# 'programming' does not occur at the beginning, so match() returns None
result = re.match('programming', 'python programming and python')
print('\nResult :', result)

Output:
<re.Match object; span=(0, 6), match='python'>
Result : None


search()

Syntax: re.search(pattern, string)

It is similar to match(), but it does not restrict us to finding matches at the beginning of the string only.

Example:
import re
result = re.search('python', 'python programming and python')
print(result)
result = re.search('programming', 'python programming and python')
print(result)


findall()

Syntax: re.findall(pattern, string)

It helps to get a list of all matching patterns.

Example:
import re
text = 'python 23 program 363 script 37'
ptrn = r'\d+'   # one or more digits
result = re.findall(ptrn, text)
print(result)

Output:
['23', '363', '37']
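The split and sub methods listed earlier are not demonstrated in the slides; a minimal sketch of their behaviour (expected output shown in comments):

import re

text = 'python 23 program 363 script 37'

# split the string at every run of digits
print(re.split(r'\d+', text))          # ['python ', ' program ', ' script ', '']

# replace every run of digits with a placeholder
print(re.sub(r'\d+', '<NUM>', text))   # 'python <NUM> program <NUM> script <NUM>'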


Morphology
Morphology
► What is Morphology Morph = form or shape, ology = study of
► Morphology is the study of the way words are built up from smaller
meaning-bearing (that is, non-decomposable) units, morphemes. A
morpheme is often defined as the minimal meaning-bearing unit in a
language
► for example the word fox consists of a single morpheme (the morpheme fox) while the word
cats consists of two: the morpheme cat and the morpheme –s



Why?
► Many language processing applications need to extract the information encoded in the words

► Parsers which analyze sentence structure need to know/check agreement between

► Subjects and verbs

► Adjectives and nouns


▪ Productivity: going, drinking, running, playing
▪ Storing every form leads to inefficiency
▪ Addition of new words
▪ Verb: To fax. Forms: fax, faxes, faxed, faxing

► Information retrieval systems benefit from knowing what the stem of a word is

► Machine translation systems need to analyze words to their components and generate words with
specific features in the target language
Basics of Morphology
► The stem: it corresponds to what remains of the word once the inflectional affixes are removed. It does not therefore necessarily constitute an atomic entity and can be further decomposed into derivational affixes and a radical.

► The lemma: it carries the main meaning of the word.

► The root: an abstract entity, bearing the sense common to all the words formed from this root.

► The words base, radical and root refer to very similar notions.
MORPHOLOGY - Terms
Morphemes are divided into stems and affixes. For example, reconsideration = re- (prefix) + consider (stem) + -ation (suffix).
Affixes may be prefixes (e.g., buckle → unbuckle), suffixes (e.g., eat → eats), infixes, or circumfixes.


Types/Classes of Morphology



Types/Classes of Morphology
► Inflectional Morphology

► After adding the affix, the word remains in the same word class.

► Inflectional morphology “does not create new words but adapts existing words so that they operate effectively in sentences. It is not a process of lexical innovation but of grammatical function.”

► Derivation

► After adding the affix, the word changes to another word class.


Inflectional Morphology
► Phenomena of declension and conjugation (change of number, gender, tense, person, mood and case, i.e., grammatical change)

► It does not change the word class

► horse → horses
► eat → eating
► like → likes
► teacher → teacher’s
► walk → walks, walking, walked
► big → bigger, biggest


Derivational Morphology
► Formation of new words by adding affixes to the root

► Derivational morphology produces a new word with usually a different word class

► E.g., make a verb from a noun.

► The new word is said to be derived from the old word

► happy (Adj) ⇒ happi + ness (Noun)

► nation / national / nationalise / nationalist / nationalism

► believe → believer

► cook → cooker

► garden → gardener
Suffix    Base verb/Adj    Noun
-tion     computerize      computerization
-ee       employ           employee
-er       kill             killer
-ness     lazy (Adj)       laziness


Morphology in NLP

► Stemming: it consists in segmenting the word into prefix + stem + suffix

► Lemmatizing: it brings back the (inflectional) variants of the same word to their canonical form, which is the lemma

► Rooting: it aims to search for the roots of words.


Morphology
Compounding, e.g.: eggplant, lemongrass, snowman

Cliticization
A clitic's status lies between a word and an affix.

Full form    Clitic
am           ’m
have         ’ve
would        ’d
will         ’ll


MORPHOLOGICAL ANALYSIS (Morphological Parsing)
In order to build a morphological parser, we’ll need at least the following:
1. lexicon: the list of stems and affixes, together with basic information about them (whether a stem is
a Noun stem or a Verb stem, etc.).
2. morphotactics: the model of morpheme ordering that explains which classes of morphemes can
follow other classes of morphemes inside a word. For example, the fact that the English plural
morpheme follows the noun rather than preceding it is a morphotactic fact.
3. orthographic rules: these spelling rules are used to model the changes that occur in a word, usually
when two morphemes combine (e.g., the y → ie spelling rule that changes city + -s to cities rather than citys).



Finite State Transducer
- maps an input string to an output string while moving from one state to another (e.g., between the surface form of a word and its lexical form)


Tokenization, Stemming,
Lemmatization



TOKENIZATION
▪ Break a complex sentence into words
▪ Understand the importance of each word with respect to the sentence
▪ Produce structural description of an input sentence



Types of Tokenization

1. Word Tokenization:
This is the most common form of tokenization, where text is split into individual words.
Example:
Original Text: "Tokenization is crucial for NLP."
Word Tokens: ["Tokenization", "is", "crucial", "for", "NLP", "."]

2. Subword Tokenization:
This method breaks text into smaller units than words, often used to handle out-of-vocabulary words and to reduce the vocabulary size. Examples include Byte Pair Encoding (BPE) and WordPiece.
Example (BPE):
Original Text: "unhappiness"
Subword Tokens: ["un", "hap", "pi", "ness"]

3. Character Tokenization:
Here, text is tokenized at the character level, useful for languages with a large set of characters or for specific tasks like spelling correction.
Example:
Original Text: "Tokenization"
Character Tokens: ["T", "o", "k", "e", "n", "i", "z", "a", "t", "i", "o", "n"]
Stemming
Stemming is a technique used in Natural Language Processing (NLP) that involves reducing
words to their base or root form, called stems



Types
▪ The Porter stemmer (Porter stemming algorithm) uses a suffix-stripping approach to generate stems.

▪ The Snowball stemmer is also called the Porter2 stemmer. It is an improved version of the Porter stemmer.

▪ The Lancaster stemmer is an aggressive approach because it over-stems a lot of terms. It reduces the word to the shortest stem possible.
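A small sketch (not from the original slides) comparing the three stemmers as implemented in NLTK; the word list is arbitrary:

from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")   # the Snowball stemmer needs a language
lancaster = LancasterStemmer()

for word in ["generously", "happiness", "running"]:
    print(word, porter.stem(word), snowball.stem(word), lancaster.stem(word))
# the Lancaster stemmer typically produces the shortest (most aggressive) stems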


LEMMATIZATION
▪ Lemmatization is another technique used to reduce inflected words to their root word. It
describes the algorithmic process of identifying an inflected word’s “lemma” (dictionary
form) based on its intended meaning.



Stemming vs Lemmatization

Speed: Stemming is faster because it chops words without knowing the context of the word in the given sentence. Lemmatization is slower than stemming, but it takes the context of the word into account before proceeding.

Approach: Stemming is a rule-based approach. Lemmatization is a dictionary-based approach.

Accuracy: Stemming is less accurate. Lemmatization is more accurate than stemming.

Output: When converting a word into its root form, stemming may create a word that does not exist. Lemmatization always gives a meaningful dictionary word as the root form.

When to use: Stemming is preferred when the meaning of the word is not important for analysis (example: spam detection). Lemmatization is recommended when the meaning of the word is important for analysis (example: question answering).

For example: “Studies” => “Studi” (stemming), “Studies” => “Study” (lemmatization)


Example

import nltk
nltk.download('all')

SENTENCE TOKENIZATION

from nltk.tokenize import sent_tokenize


text = "Good morning! Welcome to NLP practice session. It will be of great fun!"
sent_tok=sent_tokenize(text)
print(sent_tok)

['Good morning!', 'Welcome to NLP practice session.', 'It will be of great fun!']

WORD TOKENIZATION

from nltk.tokenize import word_tokenize


text = "Good morning! Welcome to NLP practice session. It will be of great fun!"
word_tok = word_tokenize(text)
print(word_tok)
['Good', 'morning', '!', 'Welcome', 'to', 'NLP', 'practice', 'session', '.', 'It', 'will', 'be', 'of', 'great', 'fun', '!']
CLEAN DATA

from nltk.corpus import stopwords


text = "Good morning! Welcome to NLP practice session. It will be of great
fun!"
word_tok = word_tokenize(text)
print(word_tok)
text_no_sw = [word for word in word_tok if not word in stopwords.words()]
print(text_no_sw)

['Good', 'morning', '!', 'Welcome', 'to', 'NLP', 'practice', 'session', '.', 'It', 'will',
'be', 'of', 'great', 'fun', '!']
['Good', 'morning', '!', 'Welcome', 'NLP', 'practice', 'session', '.', 'It', 'great',
'fun', '!']
STEMMING
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
text = "study studies studied cry cries crying cried"
tokenize = word_tokenize(text)
for w in tokenize:
    print("Stemming for {} is {}".format(w, porter.stem(w)))

Stemming for study is studi


Stemming for studies is studi
Stemming for studied is studi
Stemming for cry is cri
Stemming for cries is cri
Stemming for crying is cri
Stemming for cried is cri
LEMMATIZATION
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
lem = WordNetLemmatizer()
text = "study studies studied cry cries crying
cried"
tokenize = word_tokenize(text)
for w in tokenize:
print("Lemma for {} is {}".format(w,
lem.lemmatize(w)))

Lemma for study is study


Lemma for studies is study
Lemma for studied is studied
Lemma for cry is cry
Lemma for cries is cry
Lemma for crying is cry
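By default the WordNetLemmatizer treats every token as a noun, which is why "studied" above is returned unchanged. Passing a part-of-speech hint changes the result (a small illustrative sketch):

from nltk.stem import WordNetLemmatizer

lem = WordNetLemmatizer()
print(lem.lemmatize("studied"))           # studied (treated as a noun)
print(lem.lemmatize("studied", pos="v"))  # study (treated as a verb)
print(lem.lemmatize("better", pos="a"))   # good (treated as an adjective)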
TERM FREQUENCY AND
INVERSE DOCUMENT FREQUENCY



FEATURE EXTRACTION:
TERM FREQUENCY AND INVERSE DOCUMENT FREQUENCY
Fundamental concepts used in text analysis, especially for converting text data into numerical features for machine learning algorithms.

Term Frequency: TF of a term or word is the number of times the term appears in a document compared to the total number of words in the document.

TF = (number of times the term appears in the document) / (total number of terms in the document)



FEATURE EXTRACTION:
TERM FREQUENCY AND INVERSE DOCUMENT FREQUENCY

Inverse Document Frequency: IDF of a term reflects the proportion of documents in the corpus that contain the term. Words unique to a small percentage of documents (e.g., technical jargon terms) receive higher importance values than words common across all documents (e.g., a, the, and).

IDF = log( (number of documents in the corpus) / (number of documents in the corpus that contain the term) )

To avoid division by zero,

IDF = log( (number of documents in the corpus) / (1 + number of documents in the corpus that contain the term) )


TF-IDF
▪ TF-IDF
▪ TF-IDF combines both TF and IDF to assign a weight to
each term. This weight indicates how important a term is
in a specific document relative to the corpus.

TF-IDF=TF×IDF
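A minimal sketch (not from the original slides) of these formulas in Python, using the smoothed variant IDF = log(N / (1 + df)) given above; the documents are the ones used in the worked example that follows:

import math

docs = [
    "the sky is blue",
    "the sun is bright today",
    "the sun in the sky is bright",
    "we can see the shining sun the bright sun",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tf(term, doc_tokens):
    # term frequency within one document
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term):
    # smoothed inverse document frequency over the corpus
    df = sum(1 for doc in tokenized if term in doc)
    return math.log(N / (1 + df))

def tf_idf(term, doc_tokens):
    return tf(term, doc_tokens) * idf(term)

print(round(tf_idf("blue", tokenized[0]), 3))  # ≈ 0.173, as in the worked example below
print(round(tf_idf("the", tokenized[0]), 3))   # negative, since "the" occurs in every document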



Let’s break down the numerical calculation of TF-IDF for the given documents:
Documents:
1. “The sky is blue.”
2. “The sun is bright today.”
3. “The sun in the sky is bright.”
4. “We can see the shining sun, the bright sun.”



Consider a corpus of 3 documents:
Document 1: "The sky is blue."
Document 2: "The sun is bright."
Document 3: "The sun in the blue sky is bright."

Step 1: Calculate TF
For the term "blue" in Document 1 (4 terms): TF("blue") = 1/4 = 0.25
For the term "blue" in Document 3 (8 terms): TF("blue") = 1/8 = 0.125

Step 2: Calculate IDF
The term "blue" appears in 2 documents (Document 1 and Document 3). Total documents = 3:
IDF("blue") = log(3 / (1 + 2)) = log(1) = 0

Step 3: Compute TF-IDF
For the term "blue" in Document 1:
TF-IDF("blue") = TF("blue") × IDF("blue") = 0.25 × 0 = 0

For terms like "sun," which appear less frequently across the corpus, the IDF would be higher, emphasizing their importance.
Step 1: Calculate Term Frequency (TF)

Document 1: "The sky is blue" (4 terms)
Term     Count   TF
the      1       1/4
sky      1       1/4
is       1       1/4
blue     1       1/4

Document 2: "The sun is bright today" (5 terms)
Term     Count   TF
the      1       1/5
sun      1       1/5
is       1       1/5
bright   1       1/5
today    1       1/5

Document 3: "The sun in the sky is bright" (7 terms)
Term     Count   TF
the      2       2/7
sun      1       1/7
in       1       1/7
sky      1       1/7
is       1       1/7
bright   1       1/7

Document 4: "We can see the shining sun, the bright sun" (9 terms)
Term     Count   TF
we       1       1/9
can      1       1/9
see      1       1/9
the      2       2/9
shining  1       1/9
sun      2       2/9
bright   1       1/9
Step 2: Calculate Inverse Document Frequency (IDF)
(4 documents in the corpus; df = number of documents containing the term)

Term      df    IDF = log(4 / (df + 1))
the       4     log(4/5) = log(0.8) ≈ −0.223
sky       2     log(4/3) = log(1.333) ≈ 0.287
is        3     log(4/4) = log(1) = 0
blue      1     log(4/2) = log(2) ≈ 0.693
sun       3     log(4/4) = log(1) = 0
bright    3     log(4/4) = log(1) = 0
today     1     log(4/2) = log(2) ≈ 0.693
in        1     log(4/2) = log(2) ≈ 0.693
we        1     log(4/2) = log(2) ≈ 0.693
can       1     log(4/2) = log(2) ≈ 0.693
see       1     log(4/2) = log(2) ≈ 0.693
shining   1     log(4/2) = log(2) ≈ 0.693
Step 3: Calculate TF-IDF
Now, let's calculate the TF-IDF values for each term in each document.

Document 1: "The sky is blue."
Term    TF      IDF      TF-IDF
the     0.25    -0.223   0.25 × -0.223 ≈ -0.056
sky     0.25    0.287    0.25 × 0.287 ≈ 0.072
is      0.25    0        0.25 × 0 = 0
blue    0.25    0.693    0.25 × 0.693 ≈ 0.173

Document 2: "The sun is bright today."
Term    TF      IDF      TF-IDF
the     0.2     -0.223   0.2 × -0.223 ≈ -0.045
sun     0.2     0        0.2 × 0 = 0
is      0.2     0        0.2 × 0 = 0
bright  0.2     0        0.2 × 0 = 0
today   0.2     0.693    0.2 × 0.693 ≈ 0.139
Document 3: "The sun in the sky is bright."
Term    TF      IDF      TF-IDF
the     0.285   -0.223   0.285 × -0.223 ≈ -0.064
sun     0.142   0        0.142 × 0 = 0
in      0.142   0.693    0.142 × 0.693 ≈ 0.098
sky     0.142   0.287    0.142 × 0.287 ≈ 0.041
is      0.142   0        0.142 × 0 = 0
bright  0.142   0        0.142 × 0 = 0

Document 4: "We can see the shining sun, the bright sun."
Term    TF      IDF      TF-IDF
we      0.111   0.693    0.111 × 0.693 ≈ 0.077
can     0.111   0.693    0.111 × 0.693 ≈ 0.077
see     0.111   0.693    0.111 × 0.693 ≈ 0.077
the     0.222   -0.223   0.222 × -0.223 ≈ -0.049
shining 0.111   0.693    0.111 × 0.693 ≈ 0.077
sun     0.222   0        0.222 × 0 = 0
bright  0.111   0        0.111 × 0 = 0
Parts Of Speech (PoS) Tagging



Parts of Speech (PoS) tagging
▪ Parts of Speech (PoS) tagging is the process of labeling each word
in a sentence with its corresponding part of speech, such as noun,
verb, adjective, etc.
▪ PoS tagging is a fundamental task in Natural Language Processing
(NLP) and serves as a building block for numerous downstream
applications, including parsing, machine translation, and sentiment
analysis.
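A quick illustration (not part of the original slides) using NLTK's built-in tagger; it assumes the punkt and averaged_perceptron_tagger resources have been downloaded:

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

sentence = "The quick brown fox jumps over the lazy dog"
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ'), ...]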



Part of speech   Role                                                          Examples
Noun             Is a person, place, or thing                                  mountain, bagel, Poland
Pronoun          Replaces a noun                                               you, she, we
Adjective        Gives information about what a noun is like                   efficient, windy, colorful
Verb             Is an action or a state of being                              learn, is, go
Adverb           Gives information about a verb, an adjective,                 efficiently, always, very
                 or another adverb
Preposition      Gives information about how a noun or pronoun                 from, about, at
                 is connected to another word
Conjunction      Connects two other words or phrases                           so, because, and
Interjection     Is an exclamation                                             yay, ow, wow


Common tags

Approaches to POS Tagging
▪ Rule-Based Tagging
▪ Statistical Tagging
▪ Neural Network-Based Tagging
Rule-Based Tagging
Rule-based taggers rely on a predefined set of linguistic rules to assign parts of speech to
words.
These rules are typically crafted by linguists and may include:
1.Morphological Analysis: Examining word suffixes or prefixes to infer POS tags.
•Example: Words ending in -ly are often adverbs, e.g., "quickly."
•Words ending in -ed might be past-tense verbs, e.g., "walked."
2.Syntactic Rules: Leveraging the context of a word in a sentence.
•Example: If the word follows a determiner (e.g., "the," "a"), it is likely a noun.
3.Dictionary Lookup: Using a lexicon that maps words to possible POS tags.
•Example: "Run" may be tagged as both a noun and a verb based on its entry in the
dictionary.
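A toy rule-based tagger along these lines can be sketched with NLTK's RegexpTagger (illustrative only; the rules are deliberately simplified):

from nltk.tag import RegexpTagger

patterns = [
    (r'.*ly$', 'RB'),        # words ending in -ly are tagged as adverbs
    (r'.*ed$', 'VBD'),       # words ending in -ed as past-tense verbs
    (r'.*ing$', 'VBG'),      # words ending in -ing as gerunds
    (r'^(the|a|an)$', 'DT'), # determiners
    (r'.*', 'NN'),           # default: tag everything else as a noun
]
tagger = RegexpTagger(patterns)
print(tagger.tag("the dog walked quickly".split()))
# [('the', 'DT'), ('dog', 'NN'), ('walked', 'VBD'), ('quickly', 'RB')]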
Statistical Tagging
▪ Statistical tagging operates by analyzing a large annotated corpus
to learn patterns and probabilities. These probabilities are used to
determine the most likely sequence of POS tags for a given
sentence.



Statistical Tagging
• Sequence Modeling:
• Statistical taggers aim to find the sequence of tags T = (t1, t2, …, tn) that maximizes the joint probability P(W, T), where W = (w1, w2, …, wn) is the sequence of words.

argmax_t P(t | w) = argmax_t P(t, w) / P(w)
                  = argmax_t P(t, w)
                  = argmax_t P(t) P(w | t)

P(t, w) is the joint distribution of the labels we want to predict (t) and the observed data (w). We decompose P(t, w) into P(t) and P(w | t) since these distributions are easier to estimate. Models based on joint distributions of labels and observed data are called generative models.
Statistical Tagging - Hidden Markov Models (HMMs)
HMMs are the most commonly used generative models for POS tagging (and other tasks, e.g. in speech recognition). HMMs make specific independence assumptions about P(t) and P(w | t).
▪ Assumptions:

1. The probability of a tag depends only on the previous tag (first-order Markov
assumption).
2. The probability of a word depends only on its tag (emission probability).
Goal: Maximize P(T∣W) using Bayes' Rule:
▪ P(T∣W)∝P(W∣T)P(T)
▪ P(T) : Transition probabilities, learned from the corpus.
• P(W∣T) : Emission probabilities, derived from word-tag frequencies.

•Basics :
•Emission Probability: The probability of a word being associated with a particular tag. P(word∣tag)
Example: The probability of "run" being a verb vs. a noun.
•Transition Probability: The probability of one tag following another. P(tagi∣tagi−1) Example: The likelihood of
a verb being followed by a noun.
HMM
▪ Can be represented as a set of states (the tags) with transition probabilities between states and emission probabilities from states to words.


Statistical Tagging - Maximum
Entropy Markov Models
▪ MEMMs use a logistic regression (“Maximum Entropy”) classifier for each P(ti | wi, ti−1):

P(ti | wi, ti−1) = (1/Z) · exp( Σi λi fi(wi, ti, ti−1) )

Where:
• fi(w, t): features of the word w and tag t
• λi: weights learned during training
• Z: a normalization constant


Statistical Tagging - Conditional Random Fields (CRFs)
▪ Conditional Random Fields have the same mathematical definition as MEMMs,
but:
▪ — CRFs are trained globally to maximize the probability of the overall
sequence,
▪ — MEMMs are trained locally to maximize the probability of each individual
label
▪ — Decoding: Viterbi



Neural Network-Based Tagging
▪ Neural network-based methods have revolutionized POS tagging by leveraging deep learning
techniques.
Feedforward Neural Networks
▪ Early neural network models use fixed-size context windows around a word to predict its tag.
While these models were innovative, their inability to handle variable-length contexts limited
their success.
Recurrent Neural Networks (RNNs)
▪ RNNs process input sequences word by word, maintaining a hidden state that captures
information about prior words. Variants like Long Short-Term Memory (LSTM) networks
address the vanishing gradient problem, enabling the model to learn dependencies over long
contexts.
Transformer-Based Models
▪ Transformers (e.g., BERT, GPT) represent the state-of-the-art in POS tagging. They use self-
attention mechanisms to capture relationships between all words in a sentence, regardless
of distance.
• BERT (Bidirectional Encoder Representations from Transformers): Pre-trained on large
corpora, BERT can model context bidirectionally, making it highly effective for POS tagging.
• Fine-Tuning: BERT can be fine-tuned on smaller, domain-specific datasets to improve tagging
performance



Summary Table of Approaches

Approach               Strengths                                         Weaknesses
Rule-Based Tagging     Simple, interpretable, language-specific tuning   Poor scalability, struggles with ambiguity
HMMs                   Probabilistic, efficient for structured data      Assumes independence, limited context
MaxEnt Models          Flexible, handles sparse data                     Requires feature engineering
CRFs                   Considers full sentence context                   Computationally expensive
RNNs (LSTMs)           Captures sequential dependencies                  High training cost
Transformers (BERT)    State-of-the-art, handles long dependencies       Resource-intensive, requires pre-training
Hybrid Methods         Combines strengths of multiple approaches         Complex to implement


PoS Tagging for English

Penn Treebank: Size: 4.5 million words; Tagset: Penn Treebank (36 tags); Domain: News articles (WSJ); Features: POS tags, syntactic trees; Applications: POS tagging, parsing, dependency conversion

Brown Corpus: Size: 1 million words; Tagset: Custom (80+ tags); Domain: Multigenre; Features: POS tags; Applications: Early POS tagging models, genre analysis

Universal Dependencies (UD): Size: Varies by treebank; Tagset: Universal POS (17 tags); Domain: Multidomain; Features: POS tags, dependency relations, morphological features; Applications: Cross-lingual NLP tasks, parsing

TIMIT Corpus: Size: ~5 hours of speech; Tagset: Custom; Domain: Spoken English; Features: POS tags on transcriptions; Applications: Speech recognition and POS tagging

…… and many more, for various other languages too……


Named Entity Recognition
(NER)



Named Entity Recognition (NER)
▪ Named Entity Recognition (NER) is a sub-task of Natural Language Processing (NLP)
that focuses on identifying and classifying named entities in text into predefined
categories
▪ Eg., Person Names (e.g., "Albert Einstein"), Organizations (e.g., "United Nations")

At its core, NER processes textual data to identify and categorize key information. For example, in
the sentence
"Apple is looking at buying U.K. startup for $1 billion."
An NER system should recognize "Apple" as an Organization (ORG), "U.K." as a Geopolitical
entity (GPE), and "$1 billion" as a Monetary value (MONEY).
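A minimal sketch with spaCy (not part of the original slides; it assumes the en_core_web_sm model has been installed, and the exact labels can vary with the model version):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# Apple ORG
# U.K. GPE
# $1 billion MONEY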



Labels, descriptions and examples:

Person (PER): Names of people or fictional characters. Examples: "Albert Einstein," "Marie Curie," "Sherlock Holmes."
Organization (ORG): Names of companies, institutions, agencies, or other groups of people. Examples: "Google," "United Nations," "Harvard University."
Location (LOC): Names of geographical places such as cities, countries, mountains, rivers. Examples: "Mount Everest," "Nile River," "Paris."
Geo-Political Entity (GPE): Geographical regions that are also political entities. Examples: "United States," "Germany," "Tokyo."
Date: Expressions of calendar dates or periods. Examples: "January 1, 2022," "the 19th century," "2010-2015."
Time: Specific times within a day or durations. Examples: "5 PM," "midnight," "two hours."


Money: Monetary values, often accompanied by currency symbols. Examples: "$100," "€50 million," "1,000 yen."
Percent: Percentage expressions. Examples: "50%," "3.14%," "half."
Facility (FAC): Buildings or infrastructure. Examples: "Eiffel Tower," "JFK Airport," "Golden Gate Bridge."
Product: Objects, vehicles, software, or any tangible items. Examples: "iPhone," "Boeing 747," "Windows 10."
Event: Named occurrences such as wars, sports events, disasters. Examples: "World War II," "Olympics," "Hurricane Katrina."
Work of Art: Titles of books, songs, paintings, movies. Examples: "Mona Lisa," "To Kill a Mockingbird," "Star Wars."
Language: Names of languages. Examples: "English," "Mandarin," "Spanish."
Law: Legal documents, treaties, acts. Examples: "The Affordable Care Act," "Treaty of Versailles."
NORP (Nationality, Religious, or Political Group): Nationalities, religious groups, or political affiliations. Examples: "American," "Christians," "Democrat."
Approaches / Methods

Rule-Based Approaches: Rely on hand-crafted rules and patterns, such as regular expressions or gazetteers (predefined lists of entities). Advantages: simple and interpretable. Disadvantages: require extensive domain knowledge; not scalable for large datasets or multiple languages.

Statistical Approaches: Treat NER as a sequence labeling task using algorithms like Hidden Markov Models (HMMs), Conditional Random Fields (CRFs), or Support Vector Machines (SVMs). Advantages: data-driven and less reliant on manual rules; work well for structured text. Disadvantages: require annotated training data; limited generalization to unseen domains or languages.

Deep Learning Approaches: Leverage neural networks like RNNs (LSTMs, GRUs), CNNs, and transformers (e.g., BERT, RoBERTa) to automatically learn features from raw text. Advantages: high accuracy with large datasets; handle complex and ambiguous contexts better than other methods. Disadvantages: computationally expensive; require large annotated datasets for fine-tuning.
Rules, ML/DL training
Example representation for rule/learning
[PERSON] earns [MONEY]
[PERSON] joined [ORGANIZATION]
[PERSON] joined [ORGANIZATION] as [JOBTITLE]
Date using Regular Expression



Challenges in NER
1. Ambiguity:
Words can belong to multiple categories (e.g., “Apple” can be a company or a fruit).
2. Out-of-Vocabulary Entities:
Recognizing entities not seen in the training data.
3. Domain Adaptation:
Difficulty in transferring models across domains (e.g., news vs. medical text).
4. Multilingual NER:
Handling languages with complex morphology or lack of resources.
5. Nested Entities:
Entities embedded within other entities (e.g., “University of California, Berkeley”).



N-Grams
N-Grams
• Models that assign probabilities to sequences of words are called language models (LMs).
• The simplest language model that assigns probabilities to sentences and sequences of words is the n-gram language model. An n-gram is a sequence of N words.
Unigrams (1-grams): These are single words, like individual building blocks of a sentence.
Example: In the sentence "I have a cat," the unigrams are "I," "have," "a," and "cat."
Bigrams (2-grams): These are pairs of words that appear next to each other in a sentence.
Example: In the sentence "I love ice cream," the bigrams are "I love," "love ice," and "ice cream."
Trigrams (3-grams): These are groups of three words in a row.
Example: In the sentence "I want to learn," the trigrams are "I want to" and "want to learn."
4-grams (Quadgrams or 4-tuples): These consist of four consecutive words.
Example: In the sentence "The quick brown fox jumps," a 4-gram could be "The quick brown fox" or "quick brown fox jumps."
5-grams (5-tuples): These are sequences of five consecutive words.
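These n-grams can be generated directly with NLTK (a small sketch, assuming the punkt tokenizer data is available):

from nltk.tokenize import word_tokenize
from nltk.util import ngrams

tokens = word_tokenize("I want to learn")
print(list(ngrams(tokens, 2)))  # bigrams: [('I', 'want'), ('want', 'to'), ('to', 'learn')]
print(list(ngrams(tokens, 3)))  # trigrams: [('I', 'want', 'to'), ('want', 'to', 'learn')]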
Probabilistic Language Models
Probabilistic language models can be used to assign a probability to a sentence in
many NLP tasks. – In many NLP applications, we can use the probability as a way to
choose a better sentence or word over a less-appropriate one.
• Machine Translation:
– P(high winds tonight) > P(large winds tonight)
Spell Correction:
– Thek office is about ten minutes from here
– P(The office is) > P(Then office is)
• Speech Recognition:
– P(I saw a van) >> P(eyes awe of an)
• Summarization, question-answering, …
Our goal is to compute the probability of a sentence or sequence of words W = (w1, w2, …, wn): P(W) = P(w1, w2, …, wn)
What is the probability of an upcoming word? – P(w5 | w1, w2, w3, w4)
A model that computes either of these – P(W) or P(wn | w1, w2, …, wn-1) – is called a language model.

Chain Rule of Probability
How can we compute probabilities of entire word sequences like w1, w2, …, wn?
– The probability of the word sequence w1, w2, …, wn is P(w1, w2, …, wn).
We can use the chain rule of probability to decompose this probability:
P(w1..n) = P(w1) P(w2 | w1) P(w3 | w1..2) … P(wn | w1..n-1)

Example:
P(the man from jupiter) = P(the) P(man | the) P(from | the man) P(jupiter | the man from)


The chain rule shows the link between computing the joint probability of a sequence
and computing the conditional probability of a word given previous words.
• Definition of Conditional Probabilities:
P(B|A) = P(A,B) / P(A) ➔ P(A,B) = P(A) P(B|A)
• Conditional Probabilities with More Variables: P(A,B,C,D) = P(A) P(B|A)
P(C|A,B) P(D|A,B,C)
Chain Rule: P(w1 … wn) = P(w1) P(w2 | w1) P(w3 | w1 w2) … P(wn | w1 … wn-1)



Estimating N-Gram Probabilities
▪ Estimating n-gram probabilities is called maximum likelihood estimation (MLE). We get the MLE estimate for the parameters of an n-gram model by getting counts from a corpus and normalizing the counts so that they lie between 0 and 1.

2-gram Probability Formula:

The probability of a 2-gram, denoted as P(w2 | w1), is the probability of word w2 occurring given that the previous word was w1. The formula is:

P(w2 | w1) = Count(w1, w2) / Count(w1)

Where:
• P(w2 | w1) is the probability of word w2 given word w1, Count(w1, w2) is the count of how often the word pair (w1, w2) appears in the training corpus, and Count(w1) is the count of how often the word w1 appears in the training corpus.

Explanation:
• The formula calculates the probability of seeing w2 right after w1, based on how often the pair (w1, w2) occurs divided by the frequency of w1 alone. In this case, you are assuming that the probability of each word depends only on the previous word (the Markov assumption).

Example:
If you have the sentence "I like pizza" and you want to estimate the probability of "pizza" given "I", you would count:
Count(I, pizza): the number of times "I" is followed by "pizza".
Count(I): the number of times "I" appears.
A mini-corpus: We augment each sentence with a special symbol <s> at the beginning of the sentence, to give us the bigram context of the first word, and a special end-symbol </s>.
<s> I am Sam </s>
<s> Sam I am </s>
<s> I fly </s>

• Unique words: I, am, Sam, fly

▪ Bigrams: <s> and </s> are also tokens. There are 6 (4+2) tokens and 6*6 = 36 possible bigrams.


Example
Problem on N-gram model Consider following Training data:
<s> I am Sam </s>
<s> Sam I am </s>
<s> Sam I like </s>
<s> Sam I do like </s>
<s> do I like Sam </s>
Assume that we use a bigram language model based on the above training data.
What is the most probable next word predicted by the model for the following
word sequences?
(1) <s> Sam ...
(2) <s> Sam I do ...
(3) <s> Sam I am Sam ...
(4) <s> do I like ...
Bigram Counts

Let's solve this step by step by analyzing the bigram counts from the training data.

Training Data
1. <s> I am Sam </s>
2. <s> Sam I am </s>
3. <s> Sam I like </s>
4. <s> Sam I do like </s>
5. <s> do I like Sam </s>

Bigram      Count
<s> I       1
<s> Sam     3
<s> do      1
I am        2
I like      2
I do        1
am Sam      1
am </s>     1
Sam I       3
Sam </s>    2
do like     1
do I        1
like </s>   2
like Sam    1
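The exercise above can be checked with a short script (a sketch, not from the slides) that estimates bigram probabilities by MLE from the five training sentences and picks the most probable next word for each prefix:

from collections import Counter, defaultdict

sentences = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> Sam I like </s>",
    "<s> Sam I do like </s>",
    "<s> do I like Sam </s>",
]

# Count(w1, w2) for every adjacent word pair in the training data
bigram_counts = defaultdict(Counter)
for s in sentences:
    tokens = s.split()
    for w1, w2 in zip(tokens, tokens[1:]):
        bigram_counts[w1][w2] += 1

def most_probable_next(history):
    # P(w | last word of history) = Count(last, w) / Count(last); return the argmax
    last = history.split()[-1]
    counts = bigram_counts[last]
    word, count = counts.most_common(1)[0]
    return word, count / sum(counts.values())

for prefix in ["<s> Sam", "<s> Sam I do", "<s> Sam I am Sam", "<s> do I like"]:
    print(prefix, "->", most_probable_next(prefix))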


Training corpus:
<s> I am from Vellore </s>
<s> I am a teacher </s>
<s> students are good and are from various cities</s>
<s> students from Chennai do engineering</s>
▪ Test data:

▪ <s> students are from Chennai </s>


Estimate using bigrams



SMOOTHING



SMOOTHING
Smoothing: what do we do with words that are in our vocabulary (they are not unknown words) but appear in a test set in an unseen context? The model would assign them zero probability, and smoothing prevents that.

Smoothing takes a bit of the probability mass from more frequent events and gives it to unseen events; it is sometimes also called “discounting”.

Many different smoothing techniques:


▪ • Laplace (add-one) or additive
▪• Add-k

▪• Stupid backoff
▪• Kneser-Ney



Laplace Smoothing
• Not the highest-performing technique for language modeling, but a useful baseline
• Practical method for other text classification tasks
• Add one to all n-gram counts before they are normalized into probabilities
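With add-one (Laplace) smoothing the bigram estimate becomes P(wi | wi-1) = (Count(wi-1, wi) + 1) / (Count(wi-1) + V), where V is the vocabulary size. A minimal sketch (reusing a bigram_counts structure like the one built in the earlier bigram example; k = 1 gives Laplace smoothing, a smaller k gives add-k smoothing):

def smoothed_bigram_prob(w1, w2, bigram_counts, vocab_size, k=1.0):
    # add-k smoothed estimate of P(w2 | w1)
    count_bigram = bigram_counts[w1][w2]
    count_w1 = sum(bigram_counts[w1].values())
    return (count_bigram + k) / (count_w1 + k * vocab_size)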



Add-K Smoothing
Moves a bit less of the probability mass from seen to unseen events
• Rather than adding one to each count, add a fractional count
• 0.5
• 0.05
• 0.01
The value k can be optimized on a validation set



Kneser-Ney Smoothing
One of the most commonly used and best-performing n-gram
smoothing methods
• Incorporates absolute discounting

Backoff
• If the n-gram we need has zero counts, approximate it by backing off to the (n-1)-gram
• Continue backing off until we reach a size that has non-zero counts
• Just like with smoothing, some probability mass from higher-order n-grams needs to be redistributed to
lower-order n-grams
Finally, NLP Pipeline
