NLTK in 20 minutes
A sprint thru Python's Natural Language ToolKit
Jacob Perkins

Co-founder/CTO @ Weotta (we're hiring :)
"Python Text Processing with NLTK 2.0 Cookbook"
NLTK Contributor
Blog: http://streamhacker.com
NLTK Demos & APIs: http://text-processing.com
@japerk
Why Text Processing?
sentiment analysis
spam filtering
plagiarism detection / document similarity
document categorization / topic detection
phrase extraction, summarization
smarter search
simple keyword frequency analysis
Some NLTK Features

sentence & word tokenization
part-of-speech tagging
chunking & named entity recognition
text classification
many included corpora
Sentence Tokenization
>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize("Hello SF Python. This is NLTK.")
['Hello SF Python.', 'This is NLTK.']

>>> sent_tokenize("Hello, Mr. Anderson. We missed you!")
['Hello, Mr. Anderson.', 'We missed you!']
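To see why a trained sentence tokenizer helps, compare it with naive splitting on punctuation. The sketch below uses only the standard library; the helper name `naive_sent_split` is made up for illustration and is not part of NLTK.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' followed by whitespace (no abbreviation handling)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

# Same sentence as in the sent_tokenize() example above.
pieces = naive_sent_split("Hello, Mr. Anderson. We missed you!")
print(pieces)
```

The naive splitter also breaks after "Mr.", so it yields three pieces where sent_tokenize() returns two sentences.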
Word Tokenization
>>> from nltk.tokenize import word_tokenize
>>> word_tokenize('This is NLTK.')
['This', 'is', 'NLTK', '.']
What's a Word?
>>> word_tokenize("What's up?")
['What', "'s", 'up', '?']
>>> from nltk.tokenize import wordpunct_tokenize
>>> wordpunct_tokenize("What's up?")
['What', "'", 's', 'up', '?']




Learn More: http://text-processing.com/demo/tokenize/
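The wordpunct_tokenize() output above is what you get from splitting text into runs of word characters and runs of punctuation. A rough standard-library equivalent is sketched here; `simple_wordpunct` is a hypothetical name, and the pattern is an assumption about how such a tokenizer can be written rather than a quote of NLTK's source.

```python
import re

# One token per run of word characters, or per run of non-space punctuation.
WORDPUNCT = re.compile(r"\w+|[^\w\s]+")

def simple_wordpunct(text):
    """Return word and punctuation tokens in order of appearance."""
    return WORDPUNCT.findall(text)

print(simple_wordpunct("What's up?"))
```

Because the apostrophe is punctuation, the contraction is split into three tokens. Whether that is acceptable depends on what counts as a word in your application.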
Part-of-Speech Tagging
>>> words = word_tokenize("And now for something completely different")
>>> from nltk.tag import pos_tag
>>> pos_tag(words)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'),
('something', 'NN'), ('completely', 'RB'), ('different',
'JJ')]

Tags List: http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
Why Part-of-Speech Tag?

word definition lookup (WordNet, WordNik)
fine-grained text analytics
part-of-speech specific keyword analysis
chunking & named entity recognition (NER)
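As one illustration of part-of-speech specific keyword analysis, the sketch below filters the tagged output from the previous slide down to nouns and adjectives. The helper `words_with_tags` and the choice of tag prefixes are assumptions made for this example; the tagged list is copied by hand so no tagger model is needed.

```python
# Tagged output copied from the Part-of-Speech Tagging slide.
tagged = [('And', 'CC'), ('now', 'RB'), ('for', 'IN'),
          ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]

def words_with_tags(tagged_words, prefixes=('NN', 'JJ')):
    """Keep words whose Penn Treebank tag starts with one of the prefixes."""
    return [word for word, tag in tagged_words if tag.startswith(prefixes)]

keywords = words_with_tags(tagged)
print(keywords)
```

Matching on a prefix rather than an exact tag also catches variants such as NNS or NNP for nouns and JJR or JJS for adjectives.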
Chunking & NER
>>> from nltk.chunk import ne_chunk
>>> ne_chunk(pos_tag(word_tokenize('My name is Jacob Perkins.')))
Tree('S', [('My', 'PRP$'), ('name', 'NN'), ('is', 'VBZ'),
Tree('PERSON', [('Jacob', 'NNP'), ('Perkins', 'NNP')]),
('.', '.')])
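Once you have a chunk tree, you usually want the entities out of it. The sketch below rebuilds the tree shown above by hand and collects the labelled chunks. It assumes NLTK 3.x, where a tree's label is read with `label()` (older 2.x releases exposed it as the `node` attribute); `named_entities` is a hypothetical helper name.

```python
from nltk.tree import Tree

# Chunk tree copied from the slide above, built by hand so no models are needed.
tree = Tree('S', [('My', 'PRP$'), ('name', 'NN'), ('is', 'VBZ'),
                  Tree('PERSON', [('Jacob', 'NNP'), ('Perkins', 'NNP')]),
                  ('.', '.')])

def named_entities(chunk_tree):
    """Return (label, text) pairs for every labelled chunk directly under the root."""
    entities = []
    for child in chunk_tree:
        # Plain (word, tag) tuples are unchunked tokens; subtrees are entities.
        if isinstance(child, Tree):
            text = ' '.join(word for word, tag in child.leaves())
            entities.append((child.label(), text))
    return entities

print(named_entities(tree))
```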
NER not perfect
>>> ne_chunk(pos_tag(word_tokenize('San Francisco is foggy.')))
Tree('S', [Tree('GPE', [('San', 'NNP')]), Tree('PERSON',
[('Francisco', 'NNP')]), ('is', 'VBZ'), ('foggy', 'NN'),
('.', '.')])
Text Classification
def bag_of_words(words):
    return dict([(word, True) for word in words])

>>> feats = bag_of_words(word_tokenize("great movie"))
>>> import nltk.data
>>> classifier = nltk.data.load('classifiers/movie_reviews_NaiveBayes.pickle')
>>> classifier.classify(feats)
'pos'
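The pickled classifier above was trained separately, so it will not exist on a fresh install. To show the same feature format end to end, here is a minimal sketch that trains a Naive Bayes classifier on a tiny made-up data set. The four training examples and the name `toy_classifier` are invented for illustration and say nothing about real accuracy.

```python
from nltk.classify import NaiveBayesClassifier

def bag_of_words(words):
    return dict([(word, True) for word in words])

# Tiny made-up training set of (features, label) pairs, for illustration only.
train_feats = [
    (bag_of_words(['great', 'movie']), 'pos'),
    (bag_of_words(['wonderful', 'film']), 'pos'),
    (bag_of_words(['terrible', 'movie']), 'neg'),
    (bag_of_words(['awful', 'film']), 'neg'),
]

toy_classifier = NaiveBayesClassifier.train(train_feats)

# Classify a new word combination that never appeared together in training.
label = toy_classifier.classify(bag_of_words(['great', 'film']))
print(label)
```

The training input is just a list of (feature dict, label) pairs, which is the same shape the bag_of_words() helper produces for classification.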
Classification Algos in NLTK


 Naive Bayes
 Maximum Entropy / Logistic Regression
 Decision Tree
 SVM (coming soon)
NLTK-Trainer

https://github.com/japerk/nltk-trainer
command line scripts
train custom models
analyze corpora
analyze models against corpora
Train a Sentiment Classifier
$ ./train_classifier.py movie_reviews --instances paras
loading movie_reviews
2 labels: ['neg', 'pos']
2000 training feats, 2000 testing feats
training NaiveBayes classifier
accuracy: 0.967000
neg precision: 1.000000
neg recall: 0.934000
neg f-measure: 0.965874
pos precision: 0.938086
pos recall: 1.000000
pos f-measure: 0.968054
dumping NaiveBayesClassifier to ~/nltk_data/classifiers/movie_reviews_NaiveBayes.pickle
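The f-measure lines in this output are consistent with the harmonic mean of the precision and recall printed next to them. A small sketch that recomputes them from the reported figures (`f_measure` here is a plain function written for this example, not a call into nltk-trainer):

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall values copied from the training output above.
neg_f = f_measure(1.000000, 0.934000)
pos_f = f_measure(0.938086, 1.000000)
print(round(neg_f, 6), round(pos_f, 6))
```

Perfect precision with lower recall on "neg" means the classifier never mislabels a positive review as negative, but it misses some negative reviews.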
Notable Included Corpora

movie_reviews: pos & neg categorized IMDb reviews
treebank: tagged and parsed WSJ text
treebank_chunk: tagged and chunked WSJ text
brown: tagged & categorized English text
60 other corpora in many languages
Other NLTK Features

clustering
metrics
parsing
stemming
WordNet
... and a lot more
Other Python NLP Libraries


pattern: http://www.clips.ua.ac.be/pages/pattern
scikits.learn: http://scikit-learn.sourceforge.net/stable/
fuzzywuzzy: https://github.com/seatgeek/fuzzywuzzy
Learn More
http://www.nltk.org/
http://streamhacker.com
http://text-processing.com
nltk-users mailing list
NLTK Tutorial @ PyCon


What would you want to learn in 3 hours?
What kinds of NLP problems do you face at work?
What do you want to do with text?

Editor's Notes

  • #4: text processing is very useful in a number of areas, and there's tons of unstructured text flooding the internet nowadays, and NLP/ML is one of the best ways to deal with it
  • #5: this is what I'll cover today, but there's a lot more I won't be covering
  • #6: loads a trained sentence tokenizer, then calls its tokenize() method. has sentence tokenizers for 16 languages. Smarter than just splitting on punctuation.
  • #7: loads a word tokenizer trained on treebank, then calls the tokenize() method
  • #8: non-ascii characters are also a problem for word_tokenize(). wordpunct_tokenize() can often be better, but you need to first decide what a word is for your specific case. do contractions matter? can you replace them with two words? Demo shows the results from 4 different tokenizers
  • #9: loads a pos tagger trained on treebank - first call will take a few seconds to load the pickle file off disk, every subsequent call will use in-memory tagger. can find tables of pos tag definitions online.
  • #10: pos tags might not be useful by themselves, but they are useful metadata for other NLP tasks like dictionary lookup, pos specific keyword analysis, and they are essential for chunking & NER
  • #11: every Tree has a draw() method that uses TKinter
  • #13: bag-of-words is the simplest model, but ignores frequency. good for small text, but frequency can be very important for larger documents. other algorithms, like SVM, create sparse arrays of 1 or 0 depending on word presence, but require knowing the full vocabulary beforehand. this classifier is one I trained with nltk-trainer, and can be used for sentiment analysis because its categories are "pos" and "neg".
  • #15: can train taggers, chunkers, and text classifiers, and is great for analyzing corpora and how a model performs against a labeled corpus. I use nltk-trainer to train all my models nowadays.
  • #16: this trains a very basic sentiment analysis classifier on the movie_reviews corpus, which has reviews categorized into pos or neg
  • #17: treebank is a very standard corpus for testing taggers and chunkers
  • #18: NLP isn't black magic, but you can treat it as a black box until the defaults aren't good enough. Then you need to dig in and learn how it works so you can make it do what you want. At that point, the best thing you can do is find/make good data, then use existing algos to learn from it.
  • #20: the original NLTK book is very good and available for free online, but takes a "textbook" approach. I tried to be a lot more practical in my cookbook. the nltk-users mailing list is pretty active, and you can also try Stack Overflow
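The note on bag-of-words above contrasts three feature styles: presence only, word frequency, and fixed-length 1/0 arrays that need the vocabulary in advance. A standard-library sketch of the difference follows; `word_counts`, `presence_vector`, and the four-word vocabulary are made up for illustration.

```python
from collections import Counter

def bag_of_words(words):
    return dict([(word, True) for word in words])

def word_counts(words):
    """Frequency-aware alternative: map each word to how often it occurs."""
    return dict(Counter(words))

def presence_vector(words, vocabulary):
    """Fixed-length list of 1/0 flags; needs the full vocabulary up front."""
    present = set(words)
    return [1 if term in present else 0 for term in vocabulary]

words = ['great', 'great', 'movie']
vocabulary = ['awful', 'great', 'movie', 'terrible']

print(bag_of_words(words))
print(word_counts(words))
print(presence_vector(words, vocabulary))
```

The dict-based forms can absorb unseen words without any setup, while the vector form silently drops anything outside the vocabulary it was given.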