NATURAL LANGUAGE PROCESSING
Instructor:
Ms. S. Rama,
Assistant Professor
Department of Information Technology,
SRM Institute of Science and Technology,
[email protected]
Unit I
▪ Introduction to Natural Language Processing
▪ Applications of NLP
▪ Levels of NLP
▪ Regular Expressions
▪ Morphological analysis
▪ Tokenization, Stemming, Lemmatization
▪ Feature Extraction:
▪ N-grams
▪ Smoothing
Watashi no kurasu e yōkoso (Japanese) · Eṉathu vakuppiṟku varuka (Tamil): "Welcome to my class."
• Unstructured text is everywhere, such as emails, chat conversations, websites, and social media.
Nevertheless, it’s hard to extract value from this data unless it’s organized in a certain way.
• Text classification, also known as text tagging or text categorization, is the process of categorizing
text into organized groups. Using Natural Language Processing (NLP), text classifiers
can automatically analyze text and then assign a set of pre-defined tags or categories based on its
content.
• Text classification is becoming an increasingly important part of businesses, as it allows them to
easily extract insights from data and automate business processes.
• Chatbots are computer programs that conduct automatic conversations with people. They
are mainly used in customer service for information acquisition. As the name implies,
these are bots designed with the purpose of chatting and are also simply referred to as
“bots.”
• You’ll come across chatbots on business websites or messengers that give pre-scripted
replies to your questions. As the entire process is automated, bots can provide quick
assistance 24/7 without human intervention.
▪ NLU is mainly used in business applications to understand the customer's problem in both
spoken and written language.
▪ Natural Language Generation (NLG) acts as a translator that converts the computerized data
into natural language representation. It mainly involves Text planning, Sentence planning, and
Text Realization.
NLU vs. NLG
▪ NLU is the process of reading and interpreting language; it produces non-linguistic outputs from natural language inputs.
▪ NLG is the process of writing or generating language; it produces natural language outputs from non-linguistic inputs.
Syntax − It refers to arranging words to make a sentence. It also involves determining the structural role of
words in the sentence and in phrases.
Semantics − It is concerned with the meaning of words and how to combine words into meaningful phrases
and sentences.
Pragmatics − It deals with using and understanding sentences in different situations and how the
interpretation of the sentence is affected.
Discourse − It deals with how the immediately preceding sentence can affect the interpretation of the next
sentence.
▪ To match a single character from a set of possibilities, use square brackets, e.g. [0123456789]
matches any digit.
▪ To match zero or more occurrences of the preceding expression, use the star (*) symbol.
▪ To match one or more occurrences of the preceding expression, use the plus (+) symbol.
▪ It is important to note that regex can be complex and difficult to read, so it is recommended to
use tools like regex testers to debug and optimize your patterns.
▪ Grouping characters: ( )
▪ Comment: (?# comment)
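A quick sketch of these constructs with Python's re module (the patterns and test strings below are illustrative, not from the slides):

import re

print(re.findall('[0123456789]', 'room 42'))    # character class: ['4', '2']
print(re.findall('ab*', 'a ab abb'))            # zero or more 'b': ['a', 'ab', 'abb']
print(re.findall('ab+', 'a ab abb'))            # one or more 'b': ['ab', 'abb']
print(re.findall('(?#a comment)ab', 'ab'))      # the comment is ignored: ['ab']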
The re module offers a set of methods that allow us to search a string for a match:
• match — returns a Match object if the pattern matches at the beginning of the string, else returns None
• findall — returns a list containing all matches
• search — returns a Match object if there is a match anywhere in the string
• split — returns a list where the string has been split at each match
• sub — replaces one or many matches with a string
Example:
Syntax: re.match(pattern, string)

import re

# 'python' occurs at the start of the string, so match succeeds
result = re.match('python', 'python programming and python')
print(result)
print('Matching string :', result.group())

# 'programming' is not at the start, so match returns None
result = re.match('programming', 'python programming and python')
print('\nResult :', result)

Output:
<re.Match object; span=(0, 6), match='python'>
Matching string : python

Result : None
▪ re.findall(pattern, string) helps to get a list of all matching patterns.

import re

text = 'python 23 program 363 script 37'
ptrn = r'\d+'                 # one or more digits
result = re.findall(ptrn, text)
print(result)

Output:
['23', '363', '37']
► Information retrieval systems benefit from knowing what the stem of a word is
► Machine translation systems need to analyze words into their components and generate words with
specific features in the target language
Basics of Morphology
► The stem: the part of a word that is common to all its inflected variants
► The root: an abstract entity, bearing the common meaning shared by all the words
formed from this root
► The words base, radical, and root refer to very similar notions
MORPHOLOGY - Terms
Morphemes are the minimal meaning-bearing units of a word, e.g. reconsideration = re- + consider + -ation.
Morphemes fall into two classes: stems, the core of the word (e.g. buckle in unbuckle, eat in eats), and affixes, which are further divided into prefixes (un- in unbuckle), suffixes (-s in eats), infixes, and circumfixes.
► Inflection
► Horse → Horses
► Eat → Eating
► Like → Likes
► Teacher → Teacher's
► Derivational morphology produces a new word, usually with a different word class:
► nation / national / nationalise / nationalist / nationalism
► Believe → believer
► Cook → cooker
► Garden → gardener
(Table: noun-forming suffixes, listing each suffix with a base verb/adjective and the derived noun.)
Cliticization
► A clitic is a unit whose status lies between a word and an affix (e.g. the 've in I've).
import nltk
nltk.download('all')
SENTENCE TOKENIZATION
['Good morning!', 'Welcome to NLP practice session.', 'It will be of great fun!']
WORD TOKENIZATION
['Good', 'morning', '!', 'Welcome', 'to', 'NLP', 'practice', 'session', '.', 'It', 'will',
'be', 'of', 'great', 'fun', '!']
STOPWORD REMOVAL
['Good', 'morning', '!', 'Welcome', 'NLP', 'practice', 'session', '.', 'It', 'great',
'fun', '!']
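A minimal sketch that reproduces the three outputs above (the input sentence is inferred from the tokens shown; note the stopword filter below is case-sensitive, which is why 'It' survives):

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

text = "Good morning! Welcome to NLP practice session. It will be of great fun!"
print(sent_tokenize(text))                        # sentence tokenization
words = word_tokenize(text)                       # word tokenization
print(words)
stop_words = set(stopwords.words('english'))
print([w for w in words if w not in stop_words])  # stopword removal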
STEMMING
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()
text = "study studies studied cry cries crying cried"
tokens = word_tokenize(text)
for w in tokens:
    print("Stemming for {} is {}".format(w, porter.stem(w)))
Inverse Document Frequency: IDF of a term reflects the proportion of documents in the
corpus that contain the term. Words unique to a small percentage of documents (e.g.,
technical jargon terms) receive higher importance values than words common across all
documents (e.g., a, the, and).
IDF = log( number of documents in the corpus / number of documents in the corpus that contain the term )
To avoid division by zero:
IDF = log( number of documents in the corpus / (1 + number of documents in the corpus that contain the term) )
TF-IDF=TF×IDF
Step 1: Calculate Term Frequency (TF)
Document 3: "The sun in the sky is bright" (7 terms)
Term     Count  TF
the      2      2/7
sun      1      1/7
in       1      1/7
sky      1      1/7
is       1      1/7
bright   1      1/7

Document 4: "We can see the shining sun, the bright sun" (9 terms)
Term     Count  TF
we       1      1/9
can      1      1/9
see      1      1/9
the      2      2/9
shining  1      1/9
sun      2      2/9
bright   1      1/9
Step 2: Calculate Inverse Document Frequency (IDF)
With 4 documents in the corpus and df = number of documents containing the term:
Term     df   IDF
the      4    log(4/(4+1)) = log(0.8) ≈ −0.223
sky      2    log(4/(2+1)) = log(1.333) ≈ 0.287
is       3    log(4/(3+1)) = log(1) = 0
blue     1    log(4/(1+1)) = log(2) ≈ 0.693
sun      3    log(4/(3+1)) = log(1) = 0
bright   3    log(4/(3+1)) = log(1) = 0
today    1    log(4/(1+1)) = log(2) ≈ 0.693
in       1    log(4/(1+1)) = log(2) ≈ 0.693
we       1    log(4/(1+1)) = log(2) ≈ 0.693
can      1    log(4/(1+1)) = log(2) ≈ 0.693
see      1    log(4/(1+1)) = log(2) ≈ 0.693
shining  1    log(4/(1+1)) = log(2) ≈ 0.693
Step 3: Calculate TF-IDF
Now, let’s calculate the TF-IDF values for each term in each document.
Document 1: “The sky is blue.”
Term    TF      IDF      TF-IDF
the     0.25    −0.223   0.25 × −0.223 ≈ −0.056
sky     0.25    0.287    0.25 × 0.287 ≈ 0.072
is      0.25    0        0.25 × 0 = 0
blue    0.25    0.693    0.25 × 0.693 ≈ 0.173
Document 2: "The sun is bright today." (5 terms, TF = 1/5 each)
Term     TF     IDF      TF-IDF
the      0.2    −0.223   0.2 × −0.223 ≈ −0.045
sun      0.2    0        0
is       0.2    0        0
bright   0.2    0        0
today    0.2    0.693    0.2 × 0.693 ≈ 0.139
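A minimal plain-Python sketch of the same hand calculation (the tokenization and the log(N/(1+df)) smoothing mirror the steps above; library implementations such as scikit-learn's TfidfVectorizer use a different smoothing, so their numbers differ):

import math

docs = ["The sky is blue",
        "The sun is bright today",
        "The sun in the sky is bright",
        "We can see the shining sun, the bright sun"]
tokenized = [d.lower().replace(",", "").split() for d in docs]

def tf(term, doc):
    return doc.count(term) / len(doc)           # relative frequency in one document

def idf(term):
    df = sum(term in doc for doc in tokenized)  # documents containing the term
    return math.log(len(docs) / (1 + df))       # smoothed IDF, as in Step 2

for term in ["the", "sky", "blue"]:
    print(term, round(tf(term, tokenized[0]) * idf(term), 3))
# prints: the -0.056, sky 0.072, blue 0.173 -- matching Document 1 above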
1. The probability of a tag depends only on the previous tag (first-order Markov
assumption).
2. The probability of a word depends only on its tag (emission probability).
Goal: Maximize P(T∣W) using Bayes' rule:
▪ P(T∣W) ∝ P(W∣T) P(T)
▪ P(T): transition probabilities, learned from the corpus.
▪ P(W∣T): emission probabilities, derived from word-tag frequencies.
Basics:
▪ Emission probability: the probability of a word being associated with a particular tag, P(word ∣ tag). Example: the probability of "run" being a verb vs. a noun.
▪ Transition probability: the probability of one tag following another, P(tag_i ∣ tag_{i−1}). Example: the likelihood of a verb being followed by a noun.
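A toy sketch showing how a candidate tag sequence is scored under these two assumptions (all probabilities below are invented for illustration; a real tagger estimates them from a corpus and uses the Viterbi algorithm to search over tag sequences):

transition = {('<s>', 'DET'): 0.6, ('DET', 'NOUN'): 0.7, ('NOUN', 'VERB'): 0.4}
emission   = {('the', 'DET'): 0.5, ('dog', 'NOUN'): 0.01, ('runs', 'VERB'): 0.02}

words = ['the', 'dog', 'runs']
tags  = ['DET', 'NOUN', 'VERB']

score, prev = 1.0, '<s>'
for w, t in zip(words, tags):
    # P(T) factor (transition) times P(W|T) factor (emission)
    score *= transition[(prev, t)] * emission[(w, t)]
    prev = t
print(score)   # proportional to P(T|W) for this candidate sequence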
HMM
▪ Can be represented as
P(t ∣ w) ∝ exp( Σᵢ λᵢ fᵢ(w, t) )
Where:
• fᵢ(w, t): features of the word w and tag t.
• λᵢ: weights learned during training.
POS-tagged corpora at a glance:
▪ Penn Treebank: 4.5 million words; Penn Treebank tagset (36 tags); news articles (WSJ); annotated with POS tags and syntactic trees; used for POS tagging, parsing, and dependency conversion.
▪ Brown Corpus: 1 million words; custom tagset (80+ tags); multigenre; annotated with POS tags; used for early POS tagging models and genre analysis.
▪ Universal Dependencies (UD): size varies by treebank; Universal POS tagset (17 tags); multidomain; annotated with POS tags, dependency relations, and morphological features; used for cross-lingual NLP tasks and parsing.
▪ TIMIT Corpus: ~5 hours of speech; custom tagset; spoken English; POS tags on transcriptions; used for speech recognition and POS tagging.
At its core, NER processes textual data to identify and categorize key information. For example, in
the sentence
"Apple is looking at buying U.K. startup for $1 billion."
An NER system should recognize "Apple" as an Organization (ORG), "U.K." as a Geopolitical
entity (GPE), and "$1 billion" as a Monetary value (MONEY).
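A minimal sketch of this example with spaCy (assuming the pretrained en_core_web_sm model is installed):

import spacy

nlp = spacy.load("en_core_web_sm")   # load a pretrained English pipeline
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")
for ent in doc.ents:                 # iterate over recognized entities
    print(ent.text, ent.label_)
# expected: Apple ORG / U.K. GPE / $1 billion MONEY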
Example:
P(the man from jupiter) = P(the) P(man|the) P(from|the man) P(jupiter|the man from)
Bigram counts from the corpus:
Bigram      Count
I do        1
am Sam      2
like Sam    1
like </s>   1
do like     2
Sam </s>    1
Smoothing: taking a bit of the probability mass from more frequent events and giving it to unseen
events, sometimes also called "discounting" (a minimal example follows the list below). Common methods include:
▪ Stupid backoff
▪ Kneser-Ney
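A minimal sketch of add-one (Laplace) smoothing, one simple form of discounting (the corpus here is illustrative):

from collections import Counter

corpus = "I am Sam Sam I am I do not like green eggs and ham".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)                      # vocabulary size

def p_laplace(w1, w2):
    # every bigram count is incremented by 1, so unseen bigrams get mass
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

print(p_laplace("I", "am"))            # seen bigram
print(p_laplace("Sam", "green"))       # unseen bigram, still non-zero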
Backoff
• If the n-gram we need has zero counts, approximate it by backing off to the (n-1)-gram
• Continue backing off until we reach a size that has non-zero counts
• Just like with smoothing, some probability mass from higher-order n-grams needs to be redistributed to
lower-order n-grams
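A toy sketch of stupid backoff (it produces scores rather than true probabilities; the 0.4 back-off factor follows Brants et al., 2007):

def stupid_backoff(ngram, counts, total_words, alpha=0.4):
    # ngram: tuple of words; counts: dict mapping word tuples to corpus counts
    if len(ngram) == 1:
        return counts.get(ngram, 0) / total_words   # unigram relative frequency
    if counts.get(ngram, 0) > 0:
        return counts[ngram] / counts[ngram[:-1]]   # n-gram seen: use its relative frequency
    # otherwise back off to the (n-1)-gram ending at the same word
    return alpha * stupid_backoff(ngram[1:], counts, total_words, alpha)

For example, scoring ('the', 'man', 'from', 'jupiter') falls back to ('man', 'from', 'jupiter'), then ('from', 'jupiter'), and so on, until a non-zero count is found.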
Finally, the NLP Pipeline