Week 8
The Natural Language Toolkit
(NLTK)
Except where otherwise noted, this work is licensed under:
http://creativecommons.org/licenses/by-nc-sa/3.0
2
List methods
• Getting information about a list
– list.index(item)
– list.count(item)
• These modify the list in-place, unlike str operations
– list.append(item)
– list.insert(index, item)
– list.remove(item)
– list.extend(list2)
• same as list += list2
– list.sort()
– list.reverse()
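A short sketch of the methods above, written in Python 3 syntax. The token list and variable names are made up for illustration. Note that each modifying call returns None and changes the list itself, while a str method returns a new string.

```python
# Illustrative token list; the values are made up for this sketch.
tokens = ["the", "cat", "saw", "the", "dog"]

# Getting information about a list (the list is not changed).
first_the = tokens.index("the")   # position of the first "the"
the_count = tokens.count("the")   # how many times "the" occurs

# Modifying the list in place; each of these calls returns None.
tokens.append("again")            # add one item at the end
tokens.insert(0, "yesterday")     # add one item at index 0
tokens.remove("the")              # delete the first "the" only
tokens.extend(["and", "ran"])     # same effect as tokens += ["and", "ran"]
tokens.sort()                     # alphabetical order
tokens.reverse()                  # now reverse alphabetical order

# By contrast, str methods return a new string and leave the original alone.
word = "Booyah"
lowered = word.lower()

print(first_the, the_count)
print(tokens)
print(word, lowered)
```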
3
List exercise
• Write a script to print the most frequent token in a text file.
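One possible approach to the exercise, sketched in Python 3 with list.count from the previous slide. The function names and the sample sentence are hypothetical, and ties between equally frequent tokens are broken arbitrarily.

```python
def most_frequent_token(tokens):
    """Return the token that occurs most often (None for an empty list)."""
    best_token = None
    best_count = 0
    for token in set(tokens):          # look at each distinct token once
        count = tokens.count(token)    # list.count from the List methods slide
        if count > best_count:
            best_token, best_count = token, count
    return best_token


def most_frequent_in_file(path):
    """Read a text file, split it on whitespace, and return the top token."""
    with open(path) as handle:
        return most_frequent_token(handle.read().split())


print(most_frequent_token("the cat and the dog and the bird".split()))
```

Calling count once per distinct token is simple but slow on large files; a dictionary of counts would do the same job in one pass.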
4
And now for something completely different
5
Programming tasks?
• So far, we've studied programming syntax and techniques
• What about tasks for programming?
– Homework
– Mathematics, statistics (Sage)
– Biology (Biopython)
– Animation (Blender)
– Website development (Django)
– Game development (PyGame)
– Natural language processing (NLTK)
6
Natural Language Processing (NLP)
• How can we make a computer understand language?
– Can a human write/talk to the computer?
• Or can the computer guess/predict the input?
– Can the computer talk back?
– Based on language rules, patterns, or statistics
• For now, statistical approaches are the more accurate and more popular choice
7
Some areas of NLP
• shallow processing – the surface level
– tokenization
– part-of-speech tagging
– forms of words
• deep processing – the underlying structures of language
– word order (syntax)
– meaning
– translation
• natural language generation
8
The NLTK
• A collection of:
– Python functions and objects for accomplishing NLP tasks
– sample texts (corpora)
• Available at: http://nltk.sourceforge.net
– Requires Python 2.4 or higher
– Click 'Download' and follow instructions for your OS
9
Tokenization
• Say we want to know the words in Marty's vocabulary
– "You know what I hate? Anybody who drives an S.U.V. I'd really
like to find Mr. It-Costs-Me-100-Dollars-To-Gas-Up and kick him
square in the teeth. Booyah. Be like, I'm Marty Stepp, the best
ever. Booyah!"
• How do we split his speech into tokens?
10
Tokenization (cont.)
• How do we split his speech into tokens?
>>> martysSpeech.split()
['You', 'know', 'what', 'I', 'hate?', 'Anybody',
'who', 'drives', 'an', 'S.U.V.', "I'd", 'really',
'like', 'to', 'find', 'Mr.',
'It-Costs-Me-100-Dollars-To-Gas-Up', 'and', 'kick', 'him',
'square', 'in', 'the', 'teeth.', 'Booyah.', 'Be',
'like,', "I'm", 'Marty', 'Stepp,', 'the', 'best',
'ever.', 'Booyah!']
• Now, how often does he use the word "booyah"?
>>> martysSpeech.split().count("booyah")
0
>>> # What the!
11
Tokenization (cont.)
• We could lowercase the speech
• We could write our own method to split on ".", split on ",",
split on "-", etc.
• The NLTK already has several tokenizer options
• Try:
• nltk.tokenize.WordPunctTokenizer
– tokenizes on all punctuation
• nltk.tokenize.PunktWordTokenizer
– trained algorithm to statistically split on words
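The first two ideas on this slide can be sketched with the standard library alone. This is not NLTK code: simple_tokenize is a hypothetical helper that lowercases the speech and then splits it into runs of word characters and runs of punctuation, in the spirit of a tokenizer that splits on all punctuation. The syntax is Python 3.

```python
import re

# Marty's speech from the Tokenization slide.
martys_speech = (
    "You know what I hate? Anybody who drives an S.U.V. I'd really "
    "like to find Mr. It-Costs-Me-100-Dollars-To-Gas-Up and kick him "
    "square in the teeth. Booyah. Be like, I'm Marty Stepp, the best "
    "ever. Booyah!"
)


def simple_tokenize(text):
    """Lowercase the text, then split it into runs of word characters
    and runs of punctuation, dropping whitespace."""
    return re.findall(r"\w+|[^\w\s]+", text.lower())


tokens = simple_tokenize(martys_speech)
print(tokens[:6])
print(tokens.count("booyah"))
```

Now "booyah" is counted twice, but splitting on every punctuation mark also breaks "S.U.V." and "I'd" into pieces, which is one reason to look at a trained tokenizer.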
12
Part-of-speech (POS) tagging
• If you know a token's POS you know:
– is it the subject?
– is it the verb?
– is it introducing a grammatical structure?
– is it a proper name?
13
Part-of-speech (POS) tagging
• Exercise: most frequent proper noun in the Penn Treebank?
– Try:
• nltk.corpus.treebank
• Python's dir() to list attributes of an object
– Example:
>>> dir("hello world!")
[..., 'capitalize', 'center', 'count',
'decode', 'encode', 'endswith', 'expandtabs',
'find', 'index', 'isalnum', 'isalpha',
'isdigit', 'islower', 'isspace', 'istitle',
'isupper', 'join', 'ljust', 'lower', ...]
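A small sketch of using dir() to explore an object, in Python 3 syntax. The exact list of names depends on the Python version, so no output is shown here; the variable names are illustrative.

```python
text = "hello world!"

# dir() returns a list of attribute names as strings.
names = dir(text)

# Skip the underscore-prefixed names to see the everyday methods.
public_names = [name for name in names if not name.startswith("_")]

print(public_names[:5])
print("lower" in public_names, "tokenize" in public_names)
```

The same trick works on a corpus object such as nltk.corpus.treebank when you want to see which methods it offers.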
14
Tuples
• tagged_words() gives us a list of tuples
– tuple: like a list, but you can't change it once it is created
– in this case, the tuples are (word, tag) pairs
>>> # Get the (word, tag) pair at list index 0
...
>>> pair = nltk.corpus.treebank.tagged_words()[0]
>>> pair
('Pierre', 'NNP')
>>> word = pair[0]
>>> tag = pair[1]
>>> print word, tag
Pierre NNP
>>> word, tag = pair # or unpack in 1 line!
>>> print word, tag
Pierre NNP
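The sketch below shows both points about tuples and then counts proper nouns, which is the core of the Treebank exercise. It runs on a few hand-written (word, tag) pairs rather than on the corpus itself, so the data and the result are illustrative only. The syntax is Python 3.

```python
# Hand-written (word, tag) pairs in the Penn Treebank tag style.
# This is illustrative data, not a sample drawn from the corpus.
tagged = [
    ("Pierre", "NNP"), ("Vinken", "NNP"), ("will", "MD"),
    ("join", "VB"), ("the", "DT"), ("board", "NN"),
    ("Mr.", "NNP"), ("Vinken", "NNP"), ("is", "VBZ"),
    ("chairman", "NN"),
]

# Tuples cannot be changed after they are created.
pair = tagged[0]
try:
    pair[0] = "Peter"
    changed = True
except TypeError:
    changed = False

# Unpack each pair in the for statement and count the proper nouns.
counts = {}
for word, tag in tagged:
    if tag == "NNP":
        counts[word] = counts.get(word, 0) + 1

top_proper_noun = max(counts, key=counts.get)
print(changed, top_proper_noun, counts[top_proper_noun])
```

Swapping the hand-written list for nltk.corpus.treebank.tagged_words() should give the answer for the real corpus, assuming the corpus data is installed.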
15
POS tagging (cont.)
• How do we tag plain sentences?
– An NLTK tagger needs a list of tagged sentences to train on
• We'll use nltk.corpus.treebank.tagged_sents()
– Then it is ready to tag any input! (but how well?)
– Try these tagger objects:
• nltk.UnigramTagger(tagged_sentences)
• nltk.TrigramTagger(tagged_sentences)
– Call the tagger's tag(tokens) method
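>>> tagged_sentences = nltk.corpus.treebank.tagged_sents()
>>> tagger = nltk.UnigramTagger(tagged_sentences)
>>> # tokens holds Marty's speech after tokenization
>>> result = tagger.tag(tokens)
>>> result
[('You', 'PRP'), ('know', 'VB'), ('what', 'WP'),
('I', 'PRP'), ('hate', None), ('?', '.'), ...]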
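To see why 'hate' comes back tagged None, it helps to sketch the idea behind a unigram tagger: remember the most frequent tag for each word in the training data, and give up on words that were never seen. The code below is a standard-library illustration of that idea in Python 3, not NLTK's implementation; train_unigram, tag, and the tiny training set are all hypothetical.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    tag_counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tag_counts[word][tag] += 1
    return {word: counts.most_common(1)[0][0]
            for word, counts in tag_counts.items()}


def tag(model, tokens):
    """Tag each token; words never seen in training get None."""
    return [(token, model.get(token)) for token in tokens]


# Tiny hand-made training set (illustrative, not from the Treebank).
training = [
    [("You", "PRP"), ("know", "VB"), ("what", "WP"), ("I", "PRP"),
     ("know", "VB"), ("?", ".")],
    [("I", "PRP"), ("know", "VBP"), ("you", "PRP"), (".", ".")],
]

model = train_unigram(training)
result = tag(model, ["You", "know", "what", "I", "hate", "?"])
print(result)
```

The word "hate" never appears in the training sentences, so the lookup has nothing to return. A larger training corpus or a fallback tagger would reduce the number of None tags.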
16
POS tagging (cont.)
• Exercise: Mad Libs
– I have a passage I want filled with the right parts of speech
– Let's use random picks from our own data!
– This code will print it out:
print properNoun1, "has always been a", adjective1, \
      singularNoun, "unlike the", adjective2, \
      properNoun2, "who I", pastVerb, "as he was", \
      ingVerb, "yesterday."
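One way to make the random picks is to group words by tag and draw from each group. The sketch below uses Python 3 and a few hand-made (word, tag) pairs in place of real tagged data; words_with_tag is a hypothetical helper, and the seed value is arbitrary.

```python
import random

# Hand-made (word, tag) pairs; illustrative only.
tagged = [
    ("Pierre", "NNP"), ("Elsevier", "NNP"), ("old", "JJ"),
    ("nonexecutive", "JJ"), ("director", "NN"), ("board", "NN"),
    ("joined", "VBD"), ("named", "VBD"), ("talking", "VBG"),
    ("publishing", "VBG"),
]


def words_with_tag(pairs, wanted):
    """Collect every word whose tag equals `wanted`."""
    return [word for word, tag in pairs if tag == wanted]


rng = random.Random(8)  # fixed seed so repeated runs give the same picks

# sample() draws without replacement, so the two picks differ.
proper_noun1, proper_noun2 = rng.sample(words_with_tag(tagged, "NNP"), 2)
adjective1, adjective2 = rng.sample(words_with_tag(tagged, "JJ"), 2)
singular_noun = rng.choice(words_with_tag(tagged, "NN"))
past_verb = rng.choice(words_with_tag(tagged, "VBD"))
ing_verb = rng.choice(words_with_tag(tagged, "VBG"))

sentence = " ".join([
    proper_noun1, "has always been a", adjective1, singular_noun,
    "unlike the", adjective2, proper_noun2, "who I", past_verb,
    "as he was", ing_verb, "yesterday.",
])
print(sentence)
```

The tags used here follow the Penn Treebank convention: NNP for proper nouns, JJ for adjectives, NN for singular nouns, VBD for past-tense verbs, and VBG for -ing verbs.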
17
Eliza (NLG)
• Eliza simulates a Rogerian psychotherapist
• With while loops and tokenization, you can make a chat bot!
– Try:
• nltk.chat.eliza.eliza_chat()
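A minimal sketch of the pattern-and-response idea behind an Eliza-style chat bot, in Python 3. The rules below are made up for this example and are not the ones that ship with NLTK; respond and chat are hypothetical names. The while loop lives in chat(), which is defined but not called so that the example finishes on its own.

```python
import re

# (pattern, response template) pairs; these rules are made up for this
# sketch and are not the rules that ship with NLTK's Eliza.
RULES = [
    (re.compile(r"i need (.*)", re.IGNORECASE), "Why do you need {0}?"),
    (re.compile(r"i am (.*)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r".*\bmother\b.*", re.IGNORECASE),
     "Tell me more about your family."),
]
DEFAULT_REPLY = "Please go on."


def respond(statement):
    """Return the reply for the first rule that matches the statement."""
    cleaned = statement.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.match(cleaned)
        if match:
            return template.format(*match.groups())
    return DEFAULT_REPLY


def chat():
    """Keep answering until the user types 'quit'."""
    statement = input("> ")
    while statement.lower() != "quit":
        print(respond(statement))
        statement = input("> ")


print(respond("I need a vacation."))
print(respond("I am tired"))
print(respond("Nice weather today"))
```

Call chat() in an interactive session to try the loop yourself.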
18
Parsing
• Syntax is as important for a compiler as it is for natural
language
• Realizing the hidden structure of a sentence is useful for:
– translation
– meaning analysis
– relationship analysis
– a cool demo!
• Try:
– nltk.draw.rdparser.demo()
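The demo name suggests a recursive descent parser. The sketch below shows that technique in plain Python 3 on a toy grammar: each grammar symbol is expanded top-down, and the parser backtracks to the next expansion when one fails. The grammar, the lexicon, and the function names are invented for this example and have nothing to do with NLTK's own code.

```python
# Toy grammar and lexicon, invented for this sketch.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"], ["Name"]],
    "VP": [["V", "NP"], ["V"]],
}
LEXICON = {
    "Det": {"the", "a"},
    "N": {"dog", "cat"},
    "Name": {"marty"},
    "V": {"saw", "slept"},
}


def parse(symbol, tokens, start):
    """Yield (tree, next_position) for every way `symbol` can match
    the tokens beginning at index `start`."""
    if symbol in LEXICON:
        # Word category: consume one token if it belongs to the category.
        if start < len(tokens) and tokens[start] in LEXICON[symbol]:
            yield (symbol, tokens[start]), start + 1
        return
    for production in GRAMMAR[symbol]:
        # Try each expansion of the symbol in turn (backtracking).
        for children, end in parse_sequence(production, tokens, start):
            yield (symbol,) + tuple(children), end


def parse_sequence(symbols, tokens, start):
    """Yield (list_of_trees, next_position) for a sequence of symbols."""
    if not symbols:
        yield [], start
        return
    for first_tree, middle in parse(symbols[0], tokens, start):
        for rest_trees, end in parse_sequence(symbols[1:], tokens, middle):
            yield [first_tree] + rest_trees, end


def parse_sentence(text):
    """Return the first tree that covers the whole sentence, or None."""
    tokens = text.lower().split()
    for tree, end in parse("S", tokens, 0):
        if end == len(tokens):
            return tree
    return None


print(parse_sentence("the dog saw marty"))
print(parse_sentence("dog the saw"))
```

The first call prints a nested tuple that groups "the dog" as a noun phrase and "saw marty" as a verb phrase; the second prints None because no rule fits. This simple approach would loop forever on a left-recursive rule such as NP -> NP PP, so the toy grammar avoids one.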
19
Conclusion
• NLTK: NLP made easy with Python
– Functions and objects for:
• tokenization, tagging, generation, parsing, ...
• and much more!
– Even armed with these tools, NLP has a lot of difficult problems!
• Also saw:
– List methods
– dir()
– Tuples