NLP Toolkits and Preprocessing Techniques
• NLP Toolkits
▪ Python libraries for natural language processing
• Text Preprocessing Techniques
▪ Converting text to a meaningful format for analysis
▪ Preprocessing and cleaning text
NLP Toolkits
• NLTK (Natural Language Toolkit)
▪ The most popular NLP library
• TextBlob
▪ Wraps around NLTK and makes it easier to use
• spaCy
▪ Built on Cython, so it’s fast and powerful
• gensim
▪ Great for topic modeling and document similarity
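For instance, TextBlob exposes NLTK-backed tokenization through a very small API. A minimal sketch (assumes TextBlob and its required corpora are installed; the sample string is illustrative):
from textblob import TextBlob
blob = TextBlob("Hi Mr. Smith! Should I pick up some black-eyed peas?")
print(blob.words)      # NLTK-backed word tokenization
print(blob.sentences)  # sentence tokenization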
Code: How to Install NLTK
Command Line
pip install nltk
Jupyter Notebook
import nltk
nltk.download()
# downloads all data & models
# this will take a while
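If you do not want to download everything, you can fetch just the packages used in these slides; a sketch of the targeted downloads:
import nltk
nltk.download('punkt')                       # tokenizer models
nltk.download('stopwords')                   # stop word lists
nltk.download('averaged_perceptron_tagger')  # part of speech tagger
nltk.download('maxent_ne_chunker')           # named entity chunker
nltk.download('words')                       # word list used by the chunker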
Sample Text Data
Hi Mr. Smith! I’m going to buy some vegetables (tomatoes and cucumbers) from
the store. Should I pick up some black-eyed peas as well?
Text data is messy.
To analyze this data, we need to preprocess and normalize the text.
Preprocessing Techniques
1. Turn text into a meaningful format for analysis
• Tokenization
2. Clean the data
• Remove: capital letters, punctuation, numbers, stop words
• Stemming
• Parts of speech tagging
• Correct misspellings
• Chunking (named entity recognition, compound term extraction)
Tokenization
Tokenization = splitting raw text into small, indivisible units for
processing
These units can be:
• Words
• Sentences
• N-grams
• Other characters defined by regular expressions
Code: Tokenization (Words)
Input:
from nltk.tokenize import word_tokenize
my_text = ("Hi Mr. Smith! I’m going to buy some vegetables (tomatoes and cucumbers) "
           "from the store. Should I pick up some black-eyed peas as well?")
print(word_tokenize(my_text))  # the print() function requires Python 3
Output:
['Hi', 'Mr.', 'Smith', '!', 'I', '’', 'm', 'going', 'to', 'buy', 'some',
'vegetables', '(', 'tomatoes', 'and', 'cucumbers', ')', 'from', 'the',
'store', '.', 'Should', 'I', 'pick', 'up', 'some', 'black-eyed', 'peas', 'as',
'well', '?']
Tokenization: Sentences
Hi Mr. Smith! I’m going to buy some vegetables (tomatoes and cucumbers) from
the store. Should I pick up some black-eyed peas as well?
Tokens can be sentences. How would you split this into sentences? What rules
would you put in place?
It’s a difficult task. This is where tokenizers in Python can help.
Code: Tokenization (Sentences)
Input:
from nltk.tokenize import sent_tokenize
my_text = ("Hi Mr. Smith! I’m going to buy some vegetables (tomatoes and cucumbers) "
           "from the store. Should I pick up some black-eyed peas as well?")
print(sent_tokenize(my_text))  # requires Python 3
Output:
['Hi Mr. Smith!',
'I’m going to buy some vegetables (tomatoes and cucumbers) from the store.',
'Should I pick up some black-eyed peas as well?']
Code: Tokenization (N-Grams)
Input:
from nltk.util import ngrams
my_words = word_tokenize(my_text)     # the list of all word tokens
twograms = list(ngrams(my_words, 2))  # two-word combos, but any n can be chosen
print(twograms)
Output:
[('Hi', 'Mr.'), ('Mr.', 'Smith'), ('Smith', '!'), ('!', 'I'), ('I', '’'), ('’',
'm'), ('m', 'going'), ('going', 'to'), ('to', 'buy'), ('buy', 'some'), ('some',
'vegetables'), ('vegetables', '('), ('(', 'tomatoes'), ('tomatoes', 'and'), ('and',
'cucumbers'), ('cucumbers', ')'), (')', 'from'), ('from', 'the'), ('the', 'store'),
('store', '.'), ('.', 'Should'), ('Should', 'I'), ('I', 'pick'), ('pick', 'up'),
('up', 'some'), ('some', 'black-eyed'), ('black-eyed', 'peas'), ('peas', 'as'),
('as', 'well'), ('well', '?')]
Tokenization: Regular Expressions
Let’s say you want to tokenize by some other type of grouping or pattern.
Regular expressions (regex) allow you to do so.
Some examples of regular expressions:
• Find white space: \s+
• Find words starting with capital letters: [A-Z]['\w]+
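A quick sketch of these two patterns in action with Python’s re module (the sample string is illustrative):
import re
text = "Hi Mr. Smith! Should I pick up some black-eyed peas?"
print(re.findall(r'\s+', text))          # every run of whitespace characters
print(re.findall(r"[A-Z]['\w]+", text))  # ['Hi', 'Mr', 'Smith', 'Should']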
Code: Tokenization (Regular Expressions)
Input:
from nltk.tokenize import RegexpTokenizer
# RegexpTokenizer to match only capitalized words
cap_tokenizer = RegexpTokenizer(r"[A-Z]['\w]+")
print(cap_tokenizer.tokenize(my_text))
Output:
['Hi', 'Mr', 'Smith', 'Should']
Tokenization Summary
Hi Mr. Smith! I’m going to buy some vegetables (tomatoes and cucumbers) from
the store. Should I pick up some black-eyed peas as well?
With tokenization, we were able to break this messy text data down into small
units for analysis:
• By sentence, word, and n-grams
• By characters and patterns using regular expressions
Preprocessing Checkpoint
What have we done so far?
• Tokenized text by sentence, word, n-grams and using regex
This is only one step. There is a lot more preprocessing that we can do.
Preprocessing Techniques
1. Turn text into a meaningful format for analysis
• Tokenization
2. Clean the data
• Remove: capital letters, punctuation, numbers, stop words
• Stemming
• Correct misspellings
• Parts of speech tagging
• Chunking (named entity recognition, compound term extraction)
Preprocessing: Remove Characters
Hi Mr. Smith! I’m going to buy some vegetables (tomatoes and cucumbers) from
the store. Should I pick up 2lbs of black-eyed peas as well?
How can we normalize this text?
• Remove punctuation
• Remove capital letters and make all letters lowercase
• Remove numbers
Code: Remove Punctuation
Input:
import re      # regular expression library
import string
# Replace punctuation with a white space
clean_text = re.sub('[%s]' % re.escape(string.punctuation), ' ', my_text)
clean_text
Output:
'Hi Mr Smith I m going to buy some vegetables tomatoes and cucumbers from
the store Should I pick up 2lbs of black eyed peas as well '
If you replace with '' instead of ' ', the words fuse together:
'Hi Mr Smith Im going to buy some vegetables tomatoes and cucumbers from the
store Should I pick up 2lbs of blackeyed peas as well'
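The second output above comes from passing an empty string as the replacement; a one-line sketch of that variant:
# Remove punctuation entirely (no white space inserted)
clean_text = re.sub('[%s]' % re.escape(string.punctuation), '', my_text)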
Code: Make All Text Lowercase
Input:
clean_text = clean_text.lower()
clean_text
Output:
'hi mr smith i m going to buy some vegetables tomatoes and cucumbers from
the store should i pick up 2lbs of black eyed peas as well '
Code: Remove Numbers
Input:
# Removes all words containing digits
clean_text = re.sub(r'\w*\d\w*', ' ', clean_text)
clean_text
Output:
'hi mr smith i m going to buy some vegetables tomatoes and cucumbers from
the store should i pick up of black eyed peas as well '
Tips and Tricks: Lambdas and Maps
Lambda functions let you define small, one-line cleaning functions, and map applies a function to every element of a collection, which makes it easy to chain preprocessing steps over many documents.
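A minimal sketch of the pattern (the document list and the cleaning functions here are illustrative):
import re
import string
docs = ["Hi Mr. Smith!", "Should I pick up some black-eyed peas?"]
# Small, one-line cleaning functions defined as lambdas
remove_punct = lambda x: re.sub('[%s]' % re.escape(string.punctuation), ' ', x)
make_lower = lambda x: x.lower()
# map() applies a function to every document in the list
clean_docs = list(map(make_lower, map(remove_punct, docs)))
print(clean_docs)  # cleaned, lowercased copies of each document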
Preprocessing: Stop Words
Hi Mr. Smith! I’m going to buy some vegetables (tomatoes and cucumbers) from
the store. Should I pick up some black-eyed peas as well?
What is the most frequent term in the text above? Is that information meaningful?
Stop words are words that have very little semantic value.
There are language and context-specific stop word lists online that you can use.
Code: Stop Words
Input:
from nltk.corpus import stopwords
set(stopwords.words('english'))
Output:
{'but', 'isn', 'under', 'weren', 'those', 'when', 'why', 'few', 'for', 'it', 'of', 'down', 'ma',
'over', 'd', 'during', 'shouldn', 'did', 'above', 'below', 'myself', 'further', 'very', 'same',
'too', 'does', 'through', 'from', 'didn', 'whom', 'and', 'am', 'such', 'out', 'or', 'me', 'has',
'will', 'shan', 'on', 'then', 'here', 't', 'with', 'some', 'what', 'don', 'were', 'an',
'themselves', 'yourselves', 'off', 'being', 'more', 'they', 'ourselves', 'into', 'my', 'them',
'ain', 'a', 'wouldn', 'itself', 'i', 'hasn', 'her', 'their', 'mustn', 'our', 'herself', 'where',
'hers', 'once', 'any', 'theirs', 'before', 'most', 'other', 'not', 'himself', 'his', 'if', 'he',
'each', 'are', 'how', 'couldn', 'ours', 'doing', 'hadn', 'needn', 'again', 'these', 'wasn', 'nor',
'do', 'just', 'so', 'we', 'there', 'have', 'by', 'o', 'than', 're', 'while', 'your', 'at', 'him',
'own', 'can', 'you', 'll', 'between', 'been', 'that', 'is', 'she', 'yours', 'this', 'was', 'be',
'had', 'doesn', 'no', 'because', 'won', 'both', 'to', 'against', 'aren', 'y', 'after', 'all', 'up',
've', 'should', 'as', 'in', 'the', 'having', 'until', 'who', 'haven', 'only', 'm', 'yourself',
'about', 's', 'which', 'now', 'mightn', 'its'}
Code: Remove Stop Words
Input:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
my_text = ["Hi Mr. Smith! I’m going to buy some vegetables (tomatoes and cucumbers) "
           "from the store. Should I pick up some black-eyed peas as well?"]
# Incorporate stop words when creating the count vectorizer
cv = CountVectorizer(stop_words='english')
X = cv.fit_transform(my_text)
pd.DataFrame(X.toarray(), columns=cv.get_feature_names())  # get_feature_names_out() on newer scikit-learn
Output:
A one-row DataFrame of word counts whose columns exclude the English stop words (compare with the wider table produced when stop words are included).
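CountVectorizer removes stop words during vectorization; if you just want to filter tokens, a sketch using NLTK’s stop word list directly:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
stop_words = set(stopwords.words('english'))
words = word_tokenize("Should I pick up some black-eyed peas as well?")
filtered = [w for w in words if w.lower() not in stop_words]
print(filtered)  # stop words like 'Should', 'I', 'up', 'some', 'as' are dropped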
Preprocessing: Stemming
Stemming & Lemmatization = Cut word down to base form
• Stemming: Uses rough heuristics to reduce words to base
• Lemmatization: Uses vocabulary and morphological analysis
• Makes the meaning of run, runs, running, ran all the same
• Cuts down on complexity by reducing the number of unique words
Multiple stemmers available in NLTK
• PorterStemmer, LancasterStemmer, SnowballStemmer
• WordNetLemmatizer
Code: Stemming
Input:
from nltk.stem.lancaster import LancasterStemmer
stemmer = LancasterStemmer()
# Try some stems
print('drive: {}'.format(stemmer.stem('drive')))
print('drives: {}'.format(stemmer.stem('drives')))
print('driver: {}'.format(stemmer.stem('driver')))
print('drivers: {}'.format(stemmer.stem('drivers')))
print('driven: {}'.format(stemmer.stem('driven')))
Output (the Lancaster stemmer cuts all of these down to a common stem):
drive: driv
drives: driv
driver: driv
drivers: driv
driven: driv
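The slides list WordNetLemmatizer alongside the stemmers; a quick sketch of lemmatization for contrast (requires the 'wordnet' corpus via nltk.download('wordnet')):
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('drives', pos='v'))  # 'drive'
print(lemmatizer.lemmatize('driven', pos='v'))  # 'drive'
print(lemmatizer.lemmatize('ran', pos='v'))     # 'run'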
Preprocessing: Parts of Speech Tagging
Parts of Speech
• Nouns, verbs, adjectives, etc.
• Parts of speech tagging labels each word as a part of speech
Code: Parts of Speech Tagging
Input:
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
my_text = "James Smith lives in the United States."
tokens = pos_tag(word_tokenize(my_text))
print(tokens)
Output:
[('James', 'NNP'), ('Smith', 'NNP'), ('lives', 'VBZ'), ('in', 'IN'), ('the', 'DT'),
('United', 'NNP'), ('States', 'NNPS'), ('.', '.')]
Code: Parts of Speech Tagging
Input:
import nltk
nltk.help.upenn_tagset()
Output (excerpt):
DT: determiner
    all an another any both del each either every half la many much nary neither no some such that the them these this those
IN: preposition or conjunction, subordinating
    astride among uppon whether out inside pro despite on by throughout below within for towards near behind atop around if like until below next into if beside ...
NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA Shannon A.K.C. Meltex Liverpool ...
NNPS: noun, proper, plural
    Americans Americas Amharas Amityvilles Amusements Anarcho-Syndicalists Andalusians Andes Andruses Angels Animals Anthony Antilles Antiques Apache Apaches Apocrypha ...
VBZ: verb, present tense, 3rd person singular
    bases reconstructs marks mixes displeases seals carps weaves snatches slumps stretches authorizes smolders pictures emerges stockpiles seduces fizzes uses bolsters slaps speaks pleads ...
Preprocessing: Named Entity Recognition
Named Entity Recognition (NER) aka Entity Extraction
• Identifies and tags named entities in text (people, places, organizations,
phone numbers, emails, etc.)
• Can be tremendously valuable for further NLP tasks
• For example: “United States” --> “United_States”
Code: Named Entity Recognition
Input:
from nltk.chunk import ne_chunk
my_text = "James Smith lives in the United States."
tokens = pos_tag(word_tokenize(my_text))  # labels each word as a part of speech
entities = ne_chunk(tokens)  # groups the tagged words into named entity chunks
entities.draw()
Output:
A separate window opens displaying the chunk tree, with named entities (e.g., "James Smith" as a person and "United States" as a GPE) grouped into labeled subtrees.
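If you are working in a terminal or a notebook without a GUI, you can print the chunk tree instead of drawing it; a sketch:
print(entities)  # prints the tree as text, with entity subtrees such as (GPE United/NNP States/NNPS)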
Preprocessing: Compound Term Extraction
Extracting and tagging compound words or phrases in text
• This can be very valuable for special cases
• For example: "black eyed peas" --> "black_eyed_peas"
• This totally changes the conceptual meaning!
• Named entity recognition groups together words and identifies entities, but
doesn’t capture them all, so you can identify your own compound words
Code: Compound Term Extraction
Input:
from nltk.tokenize import MWETokenizer  # multi-word expression
my_text = "You all are the greatest students of all time."
mwe_tokenizer = MWETokenizer([('You', 'all'), ('of', 'all', 'time')])
mwe_tokens = mwe_tokenizer.tokenize(word_tokenize(my_text))
mwe_tokens
Output:
['You_all', 'are', 'the', 'greatest', 'students', 'of_all_time', '.']
Preprocessing Checkpoint
What have we done so far?
• Introduced Python’s Natural Language Toolkit
• Converted text into token form
• Further cleaned the data by removing characters, using stop words,
stemming, parts of speech tagging, named entity recognition and compound
words
Preprocessing Review
Given the text below, what are some preprocessing techniques you could apply?
We’re rushing our patient to the nearest hospital in Bend, Oregon. He has a
traumatic brain injury and requires medical attention within the next 10 minutes!
• Tokenization: sentence, word, n-gram, regex
• Remove: punctuation, capital letters, numbers, stop words
• Chunking: named entity recognition, compound term extraction
• More: stemming, parts of speech tagging, misspellings, different languages
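A sketch applying a few of these techniques to the review text (which steps make sense depends on the analysis; here, for instance, "10 minutes" may be meaningful, so removing numbers is a judgment call):
import re
import string
from nltk.tokenize import sent_tokenize, word_tokenize
review = ("We're rushing our patient to the nearest hospital in Bend, Oregon. "
          "He has a traumatic brain injury and requires medical attention "
          "within the next 10 minutes!")
sentences = sent_tokenize(review)  # tokenize by sentence
clean = review.lower()  # remove capital letters
clean = re.sub('[%s]' % re.escape(string.punctuation), ' ', clean)  # remove punctuation
words = word_tokenize(clean)  # tokenize by word
# Chunking candidates: "Bend, Oregon" (named entity), "traumatic brain injury" (compound term)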
Pandas for Data Analysis Review
• Pandas is an open-source Python library for data manipulation and
analysis.
• It provides easy-to-use data structures and data analysis tools that can be
used in a wide range of fields.
• We will only discuss some of the frequently used, NLP-relevant Pandas
functions.
Pandas DataFrame
A DataFrame is a two-dimensional array with heterogeneous data.
It is basically a table of data, much like a table in Excel or SQL.
Creating a Pandas DataFrame
DataFrames can be created manually or from a file.
Manually:
import pandas as pd
new_dataframe = pd.DataFrame(
    {"column_name": ["jack", "jill", "john"],
     "column_age": [13, 14, 12],
     "column_weight": [130.4, 123.6, 150.2]}
)
From a CSV file:
import pandas as pd
file_dataframe = pd.read_csv('file_data.csv')
Basic Pandas Functionality
import pandas as pd
data = pd.read_csv('data.csv')
Selecting top and bottom rows:
data.head()  # returns the first n rows (default 5)
data.tail()  # returns the last n rows (default 5)
Selecting columns:
data['column_name'] or data.column_name
Selecting by indexer:
data.iloc[0]      # first row of the data frame
data.iloc[-1]     # last row of the data frame
data.iloc[:, 0]   # first column of the data frame
data.iloc[:, -1]  # last column of the data frame
data.iloc[0, 1]   # first row, second column of the data frame
data.iloc[0:4, 3:5]  # first 4 rows and the 4th and 5th columns of the data frame
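Putting Pandas and the preprocessing techniques together, a common pattern (a sketch; the column names are illustrative) is to keep one document per row and apply cleaning functions to the text column:
import pandas as pd
df = pd.DataFrame({'text': ["Hi Mr. Smith!", "Should I pick up some black-eyed peas?"]})
df['clean_text'] = df['text'].apply(lambda x: x.lower())  # apply a cleaning step row by row
print(df.head())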
Preprocessing Summary
• Text data is messy
▪ Preprocessing must be done before doing analysis
▪ Python has some great libraries for NLP, such as NLTK, TextBlob and spaCy
• There are many preprocessing techniques
▪ Tokenization and organizing the data for analysis is necessary
▪ Otherwise, pick and choose the techniques that make the most sense for your
data and your analysis
Editor's Notes
  • #2: Welcome to Week 2! Today we’ll show you some of the most popular NLP libraries in Python, and also go through a series of preprocessing techniques. Text data is typically quite messy, so a lot of preprocessing has to be done before you can do any analysis.
  • #3: NLP Toolkits - There are a lot out there, but a few stand out. Text Preprocessing Techniques - A lot of preprocessing has to be done before doing the fun analysis. We’ll go through the common steps and key terms.
  • #4: NLTK - Pretty much everyone starts here. TextBlob - Can use NLTK features by writing very simple code. Highly recommend. spaCy - This is up and coming. It's marketed as an "industrial-strength" Python NLP library that's geared toward performance. gensim - We will be going over this later in the course.
  • #5: After doing nltk.download(), choose to download all in the GUI. This will give you all the tokenizers, chunkers, algorithms and corpora. The students should run this code now, so everything will be ready in time for when they do the exercises. It takes about 15 minutes.
  • #6: Brainstorm with the group. What are some ways you can think of to clean up this data for analysis?
  • #7: Tokenization is the process of dividing our data into the smaller units that we will analyze. Chunking is a general term. It means to extract meaningful units, or chunks, of text from raw text.
  • #8: For most purposes, we will be splitting our text up into words, but there are many options out there for splitting text up. A word you might not have seen before is ‘n-gram’, which means a sequence of n items from a text. These ‘items’ might be syllables, letters, words, etc. but usually refer to words. Researchers usually vary the size of their n-gram depending on their application. Example: “Hi my name is Rick” -> [(Hi, my), (my, name), (name, is), (is, Rick)] would be an example of splitting up a piece of text into 2-grams (called bigrams). We will see how to do this automatically using nltk later in the slides.
  • #9: As you can see, NLTK’s built-in word tokenizer is able to separate out the sentence into tokens for us to analyze.
  • #10: You might say that capitalization or punctuation are good places to start, but as you can see, proper nouns (Smith) and titles (Mr.) are two examples of common words that break those rules for tokenization. This is where built-in tokenizers can help.
  • #11: As you can see, developers have been hard at work to make sure that tokenizers are able to accurately pick out sentences from a piece of text.
  • #12: Why is this useful? Let’s say you didn’t just want to find the most common words, but the most common two-word phrases (like black-eyed peas). N-grams can help with that.
  • #13: Good website for interpreting regular expressions: https://regex101.com
    \s+ matches one or more whitespace characters: \s is equal to [\r\n\t\f\v ], and the + quantifier matches between one and unlimited times, as many times as possible, giving back as needed (greedy).
    [A-Z]['\w]+ matches a single character in the range between A (index 65) and Z (index 90) (case sensitive), followed by one or more characters from the set ['\w], where ' matches the character ' literally and \w matches any word character (equal to [a-zA-Z0-9_]).
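    One way to apply this pattern, a minimal sketch using NLTK’s RegexpTokenizer (output indicative):
    from nltk.tokenize import RegexpTokenizer
    tokenizer = RegexpTokenizer(r"[A-Z]['\w]+")  # tokens are capitalized words
    print(tokenizer.tokenize("Hi Mr. Smith! Should I pick up some peas?"))
    # ['Hi', 'Mr', 'Smith', 'Should']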
  • #14: As you can see, the possibilities are endless for tokenization, and the regex you learned in last week’s notes will be very useful if you want full control in how you create your tokens.
  • #15: Another example using capitalized words to split the text. This example is a little different though: only the capitalized words were returned. Can you guess why that is? If you guessed that it had something to do with the ‘gaps’ parameter, you’d be correct; specifying ‘gaps=True’ tells Python that you want to split the text along these patterns to make tokens; otherwise it uses the pattern to find the tokens themselves.
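    A minimal sketch of the difference the gaps parameter makes:
    from nltk.tokenize import RegexpTokenizer
    # gaps=True: the pattern marks the separators, so we get the text between whitespace
    splitter = RegexpTokenizer(r"\s+", gaps=True)
    print(splitter.tokenize("Hi Mr. Smith!"))  # ['Hi', 'Mr.', 'Smith!']
    # gaps=False (the default): the pattern marks the tokens themselves
    matcher = RegexpTokenizer(r"[A-Z]\w+")
    print(matcher.tokenize("Hi Mr. Smith!"))   # ['Hi', 'Mr', 'Smith']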
  • #16: none
  • #17: none
  • #18: Stanford’s CoreNLP has different language options
  • #19: Why do we want to remove these things? They impede our analysis of the text. To analyze texts, we want a uniform format that we can read in a consistent way. Therefore, we try to remove some of the ‘artifacts’ that are language-specific and don’t contribute to the meaning of the text. The major ways to normalize texts are in the removal of punctuation, capital letters, and numbers.
  • #20: Why are we replacing the string with a whitespace instead of nothing? Which is better? Note that the ‘black-eyed peas’ portion is different: replacing with nothing is good here, since you want ‘black’ and ‘eyed’ to stay grouped together. Note that the word ‘I’m’ is different: the whitespace is good here, so later ‘I’ can be grouped with other words like ‘I’, ‘me’, etc. when we move into stemming. Later in the presentation, we talk about Compound Term Extraction, which is a way you can keep ‘black-eyed peas’ together.
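    A minimal sketch of the trade-off (the punctuation pattern [^\w\s] here mirrors the idea rather than the exact slide code):
    import re
    text = "I'm buying black-eyed peas."
    print(re.sub(r"[^\w\s]", " ", text))  # "I m buying black eyed peas " (splits "I'm" apart cleanly)
    print(re.sub(r"[^\w\s]", "", text))   # "Im buying blackeyed peas" (keeps compound terms glued together)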
  • #21: Thankfully, strings in Python are really easy to convert to lowercase: simply use the .lower() method.
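    For example:
    print("Hi Mr. Smith!".lower())  # 'hi mr. smith!'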
  • #22: Here our regex knowledge comes in handy, because we are able to quickly scan for any digits and words containing digits using the expression ‘\w*\d\w*’. Here is the breakdown of what those characters mean:
    \d = digits
    \w = any word character
    * = 0+ of that character
    So this removes digits and any words containing digits. Notice that the word ‘2lb’ has been removed.
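    A quick sketch of that expression in action (the sample text is hypothetical):
    import re
    text = "Pick up 2lb of peas in 10 minutes"
    print(re.sub(r"\w*\d\w*", "", text))
    # 'Pick up  of peas in  minutes'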
  • #23: Lambdas are what are known as “anonymous functions”. They’re meant to be short functions, and most of the time they’re single use. The format of a lambda is described above. It is “lambda input : output”.
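    A minimal example of that format:
    square_me = lambda x: x ** 2  # input: x, output: x squared
    print(square_me(4))  # 16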
  • #24: Lambdas are cool, but their real power comes when you combine them with maps. A map is a Python function that takes a function and an iterable as input, iterates over the iterable, and applies the function to every object within it. In the example above, you can see that we applied the ‘square_me’ lambda to every number in the list ‘my_numbers’. The output from a map is a map object, so make sure to cast it to list() type if you want to use the data as a list afterward.
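    A sketch reconstructing the example described above (the names follow the note):
    square_me = lambda x: x ** 2
    my_numbers = [1, 2, 3, 4]
    squared = map(square_me, my_numbers)  # a map object, not yet a list
    print(list(squared))                  # [1, 4, 9, 16]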
  • #25: Now, let’s look at an example of these tools applied to NLP. As you can see, lambdas and maps make it very easy to remove the numbers in a bunch of texts just by creating one lambda.
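    A minimal sketch of that idea (the sample texts are hypothetical):
    import re
    texts = ["Buy 2lb of peas", "Be there in 10 minutes"]
    remove_numbers = lambda t: re.sub(r"\w*\d\w*", "", t)
    print(list(map(remove_numbers, texts)))
    # ['Buy  of peas', 'Be there in  minutes']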
  • #26: Stop words don’t contribute to the meaning of the text and so for our purposes will just confuse the algorithm. For example, the word ‘some’ doesn’t tell us anything meaningful about what’s going on in this text. On the next slide, there are some of the commonly removed stop words.
  • #27: The nltk corpus already comes with a list of stopwords that are commonly removed for text analysis but you can also code your own stopwords manually!
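    A sketch of both approaches (assumes the stopwords corpus has been downloaded):
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    stop_words = set(stopwords.words('english'))  # NLTK's built-in list
    stop_words.add('mr.')                         # plus a manually coded stop word
    tokens = word_tokenize("I am going to buy some vegetables from the store")
    print([w for w in tokens if w.lower() not in stop_words])
    # ['going', 'buy', 'vegetables', 'store']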
  • #28: This is just an early example of CountVectorizer. We will discuss how a CountVectorizer is used further in week three. In simple terms, CountVectorizer converts a collection of text documents to a matrix of token counts. In the example on the slide, the English stop words will be removed from the resulting tokens.
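    A minimal sketch (assumes scikit-learn; get_feature_names_out requires a recent version):
    from sklearn.feature_extraction.text import CountVectorizer
    vectorizer = CountVectorizer(stop_words='english')  # drop English stop words while counting
    counts = vectorizer.fit_transform(["I am going to buy some vegetables from the store"])
    print(vectorizer.get_feature_names_out())  # e.g. ['buy' 'going' 'store' 'vegetables']
    print(counts.toarray())                    # the matrix of token counts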
  • #29: Lemmatization is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech. However, because stemming does not look at this additional context, stemmers run faster and typically are easier to implement. Depending on your application, the reduced accuracy may not matter.
  • #30: As you can see, all these different variations of the same word (‘driving’/‘drive’) are reduced to their base form ‘driv’, which is shared by all variations of the word. An example where lemmatization may provide higher accuracy:
    The stemmed form of leafs is: leaf
    The stemmed form of leaves is: leav
    The lemmatized form of leafs is: leaf
    The lemmatized form of leaves is: leaf
    Source: https://towardsdatascience.com/pre-processing-in-natural-language-machine-learning-898a84b8bd47
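    A sketch of the leafs/leaves contrast (assumes the WordNet corpus has been downloaded):
    from nltk.stem import PorterStemmer, WordNetLemmatizer
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()
    for word in ['leafs', 'leaves']:
        print(word, '->', stemmer.stem(word), '(stemmed),', lemmatizer.lemmatize(word), '(lemmatized)')
    # leafs -> leaf (stemmed), leaf (lemmatized)
    # leaves -> leav (stemmed), leaf (lemmatized)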
  • #31: Some words change meaning depending on their context in a sentence. For example, the word ‘run’ might be a verb (I like to run in the mornings) or a noun (How did your run go this morning?) depending on how it is used. Thus part of speech tagging may help gain greater insight into the meaning of a text.
  • #32: Thankfully, nltk has a built-in tagger, so all we need to do is call pos_tag from the nltk library. Some POS tags:
    NN – noun
    NNP – proper noun
    JJ – adjective
    IN – preposition
    VBZ – verb, 3rd person singular present (e.g., takes)
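    A sketch of the tagger on the ‘run’ example from the earlier note (output indicative):
    from nltk import word_tokenize, pos_tag
    print(pos_tag(word_tokenize("I like to run in the mornings")))
    # [('I', 'PRP'), ('like', 'VBP'), ('to', 'TO'), ('run', 'VB'), ('in', 'IN'), ('the', 'DT'), ('mornings', 'NNS')]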
  • #33: We can find the list of all possible POS tags used by the Natural Language Toolkit (nltk) with nltk.help.upenn_tagset(), or nltk.help.upenn_tagset('RB') for information on a specific tag.
  • #34: Named entity recognition is a case where our preprocessing steps would actually hurt us. If we simply removed punctuation and made everything lowercase, we could accidentally convert ‘U.S.’ to ‘us’, which could really change the meaning of a document. Being able to extract these ‘entities’ (proper nouns) is a valuable tool that has a lot of applications. For example, if a news site wants to show all the news that pertains to Chicago or the U.S.A., it needs to be able to preserve these words.
  • #35: Again, nltk makes our lives easy by providing a built-in ‘ne_chunk’ function that is able to detect the proper nouns (NNP) in the sentence. Notice that it is even able to tell the type of proper noun, and labels ‘James Smith’ as ‘PERSON’ and ‘United States’ as ‘GPE’, which stands for geopolitical entity.
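    A sketch of that pipeline (the exact chunking can vary by NLTK version):
    from nltk import word_tokenize, pos_tag, ne_chunk
    print(ne_chunk(pos_tag(word_tokenize("James Smith lives in the United States."))))
    # roughly: (S (PERSON James/NNP Smith/NNP) lives/VBZ in/IN the/DT (GPE United/NNP States/NNPS) ./.)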
  • #36: Compound term extraction again allows us to better preserve the meaning of our text. In the example above, having a text with the word ‘black’, ‘eyed’ and ‘peas’ is different than a text that is about ‘black eyed peas’, which is one distinct concept.
  • #37: This is a way to do it manually in NLTK. Note that this MWE tokenizer is case sensitive. To make it work better, you’d need to make everything lowercase to begin with.
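    A sketch of that fix: lowercase first, then define the multi-word expressions in lowercase:
    from nltk.tokenize import MWETokenizer, word_tokenize
    mwe_tokenizer = MWETokenizer([('you', 'all'), ('of', 'all', 'time')])
    text = "You all are the greatest students of all time."
    print(mwe_tokenizer.tokenize(word_tokenize(text.lower())))
    # ['you_all', 'are', 'the', 'greatest', 'students', 'of_all_time', '.']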
  • #38: none
  • #39: Brainstorm as a group. There were a lot of terms that were covered.
  • #40: Brainstorm as a group. There were a lot of terms that were covered.