Natural Language Toolkit (NLTK)

The document discusses the Natural Language Toolkit (NLTK), a popular Python package for natural language processing. It provides functions and objects for common NLP tasks like tokenization, part-of-speech tagging, parsing, and more. These allow programmers to preprocess and analyze human language in an automated way. The document also introduces scikit-learn, a widely used machine learning library in Python, and discusses how to access these tools on a shared computing cluster.

The Natural Language Toolkit

(NLTK)
Natural Language Processing (NLP)
• How can we make a computer understand language?
– Can a human write/talk to the computer?
• Or can the computer guess/predict the input?
– Can the computer talk back?
– Based on language rules, patterns, or statistics
• Currently, statistical approaches are the most accurate and popular

2
Some areas of NLP
• shallow processing – the surface level
– tokenization
– part-of-speech tagging
– forms of words
• deep processing – the underlying structures of language
– word order (syntax)
– meaning
– translation
• natural language generation

3
The NLTK
• A collection of:
– Python functions and objects for accomplishing NLP tasks
– sample texts (corpora)
• Available at: http://nltk.sourceforge.net
– Requires Python 2.4 or higher
– Click 'Download' and follow instructions for your OS

4
Tokenization
• Say we want to know the words in Marty's vocabulary
– "You know what I hate? Anybody who drives an S.U.V. I'd really
like to find Mr. It-Costs-Me-100-Dollars-To-Gas-Up and kick him
square in the teeth. Booyah. Be like, I'm Marty Stepp, the best
ever. Booyah!"
• How do we split his speech into tokens?

5
Tokenization (cont.)
• How do we split his speech into tokens?

>>> martysSpeech.split()
['You', 'know', 'what', 'I', 'hate?', 'Anybody',
'who', 'drives', 'an', 'S.U.V.', "I'd", 'really',
'like', 'to', 'find', 'Mr.',
'It-Costs-Me-100-Dollars-To-Gas-Up', 'and',
'kick', 'him', 'square', 'in', 'the', 'teeth.',
'Booyah.', 'Be', 'like,', "I'm", 'Marty',
'Stepp,', 'the', 'best', 'ever.', 'Booyah!']

• Now, how often does he use the word "booyah"?

>>> martysSpeech.split().count("booyah")
0
>>> # What the!

6
Tokenization (cont.)
• We could lowercase the speech
• We could write our own method to split on ".", ",", "-", etc.
• The NLTK already has several tokenizer options
• Try:
• nltk.tokenize.WordPunctTokenizer
– tokenizes on all punctuation
• nltk.tokenize.PunktWordTokenizer
– trained algorithm to statistically split on words
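
The do-it-yourself approach above can be sketched with Python's standard library alone; the regular expression and variable name below are illustrative choices, not part of NLTK:

```python
import re

# Marty's speech from the previous slide
martys_speech = ("You know what I hate? Anybody who drives an S.U.V. "
                 "I'd really like to find Mr. It-Costs-Me-100-Dollars-To-Gas-Up "
                 "and kick him square in the teeth. Booyah. Be like, I'm Marty "
                 "Stepp, the best ever. Booyah!")

# Lowercase first, then keep runs of letters/digits, allowing internal
# apostrophes and hyphens ("i'd", "it-costs-me-...") but dropping ., ! and ?
tokens = re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", martys_speech.lower())

print(tokens.count("booyah"))  # -> 2 (both "Booyah." and "Booyah!" now match)
```

This fixes the count from the previous slide, but hand-rolled rules like this break down quickly, which is why the NLTK tokenizers above are usually the better choice.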

7
Part-of-speech (POS) tagging
• If you know a token's POS, you can answer:
– is it the subject?
– is it the verb?
– is it introducing a grammatical structure?
– is it a proper name?

8
Part-of-speech (POS) tagging
• Exercise: most frequent proper noun in the Penn Treebank?
– Try:
• nltk.corpus.treebank
• Python's dir() to list attributes of an object
– Example:

>>> dir("hello world!")
[..., 'capitalize', 'center', 'count',
'decode', 'encode', 'endswith', 'expandtabs',
'find', 'index', 'isalnum', 'isalpha',
'isdigit', 'islower', 'isspace', 'istitle',
'isupper', 'join', 'ljust', 'lower', ...]
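
One way to attack the exercise, once the corpus is available, is to pull out the NNP-tagged words and count them with collections.Counter. The toy tagged_words list below is a made-up stand-in for nltk.corpus.treebank.tagged_words(), so the pattern runs without the corpus download:

```python
from collections import Counter

# Toy stand-in for nltk.corpus.treebank.tagged_words()
tagged_words = [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','),
                ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'),
                ('Pierre', 'NNP'), ('will', 'MD'), ('join', 'VB')]

# Keep only proper nouns (Penn Treebank tag 'NNP') and count them
proper_nouns = Counter(word for word, tag in tagged_words if tag == 'NNP')
print(proper_nouns.most_common(1))  # -> [('Pierre', 2)]
```

Swapping the toy list for the real treebank call answers the exercise.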

9
Tuples
• tagged_words() gives us a list of tuples
– tuple: the same thing as a list, but you can't change it
– in this case, each tuple is a (word, tag) pair

>>> # Get the (word, tag) pair at list index 0
>>> pair = nltk.corpus.treebank.tagged_words()[0]
>>> pair
('Pierre', 'NNP')
>>> word = pair[0]
>>> tag = pair[1]
>>> print word, tag
Pierre NNP
>>> word, tag = pair # or unpack in 1 line!
>>> print word, tag
Pierre NNP

10
POS tagging (cont.)
• How do we tag plain sentences?
– An NLTK tagger needs a list of tagged sentences to train on
• We'll use nltk.corpus.treebank.tagged_sents()
– Then it is ready to tag any input! (but how well?)

– Try these tagger objects:
• nltk.UnigramTagger(tagged_sentences)
• nltk.TrigramTagger(tagged_sentences)
– Call the tagger's tag(tokens) method

>>> tagger = nltk.UnigramTagger(tagged_sentences)
>>> result = tagger.tag(tokens)
>>> result
[('You', 'PRP'), ('know', 'VB'), ('what', 'WP'),
('I', 'PRP'), ('hate', None), ('?', '.'), ...]
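
Why does "hate" come back as None? A unigram tagger only knows words it saw in training. A minimal sketch of the idea, with made-up toy training data and helper names (NLTK's real UnigramTagger is more capable):

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sentences):
    """Map each word to its most frequent tag in the training data."""
    tag_counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tag_counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in tag_counts.items()}

def tag_tokens(model, tokens):
    # Unseen words get None, just like NLTK's UnigramTagger
    return [(tok, model.get(tok)) for tok in tokens]

train = [[('You', 'PRP'), ('know', 'VB'), ('what', 'WP')],
         [('I', 'PRP'), ('know', 'VB')]]
model = train_unigram(train)
print(tag_tokens(model, ['You', 'know', 'hate']))
# -> [('You', 'PRP'), ('know', 'VB'), ('hate', None)]
```

Training on the full treebank shrinks, but never eliminates, the set of unknown words.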

11
Parsing
• Syntax is as important for a compiler as it is for natural
language
• Realizing the hidden structure of a sentence is useful for:
– translation
– meaning analysis
– relationship analysis
– a cool demo!
• Try:
– nltk.draw.rdparser.demo()

12
Conclusion
• NLTK: NLP made easy with Python
– Functions and objects for:
• tokenization, tagging, generation, parsing, ...
• and much more!
– Even armed with these tools, NLP has a lot of difficult problems!
• Also saw:
– List methods
– dir()
– Tuples

13
Python scikit-learn
• Popular machine learning toolkit in Python: http://scikit-learn.org/stable/
• Requirements
– Anaconda, available from https://www.continuum.io/downloads
– Includes numpy, scipy, and scikit-learn (the former two are
required by scikit-learn)

14
SciKit
Many popular Python toolboxes/libraries:
– NumPy
– SciPy
– Pandas
– SciKit-Learn

Visualization libraries:
– matplotlib
– Seaborn

and many more … (all of these libraries are installed on the SCC)

15
Python Libraries for Data Science
SciPy:
▪ collection of algorithms for linear algebra, differential
equations, numerical integration, optimization, statistics and
more

▪ part of SciPy Stack

▪ built on NumPy

Link: https://www.scipy.org/scipylib/
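
A small taste of SciPy's numerical-integration corner (assumes SciPy is installed; the integral of sin(x) over [0, π] is exactly 2):

```python
import math
from scipy import integrate

# Numerically integrate sin(x) from 0 to pi;
# quad returns (value, absolute error estimate)
area, abs_error = integrate.quad(math.sin, 0, math.pi)
print(round(area, 6))  # -> 2.0
```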
16
Python Libraries for Data Science
SciKit-Learn:
▪ provides machine learning algorithms: classification,
regression, clustering, model validation etc.

▪ built on NumPy, SciPy and matplotlib

Link: http://scikit-learn.org/
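
A minimal classification sketch in the style scikit-learn encourages (fit, then predict); the two-cluster data set is invented for illustration, and scikit-learn is assumed to be installed, e.g. via Anaconda:

```python
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated toy clusters: class 0 near the origin, class 1 near (5, 5)
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = [0, 0, 0, 1, 1, 1]

# Fit a 3-nearest-neighbors classifier and predict two unseen points
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, y)
print(clf.predict([[0.5, 0.5], [5.5, 5.5]]))  # -> [0 1]
```

Regression, clustering, and model-validation tools all follow this same fit/predict (or fit/transform) pattern.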
17
Python Libraries for Data Science
matplotlib:
▪ Python 2D plotting library which produces publication-quality
figures in a variety of hardcopy formats

▪ a set of functionalities similar to those of MATLAB

▪ line plots, scatter plots, bar charts, histograms, pie charts, etc.

▪ relatively low-level; some effort needed to create advanced
visualizations
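
A minimal example of the kind of line plot described above; the Agg backend and the output file name are arbitrary choices so that this runs headless (e.g. over an SSH session to a cluster):

```python
import matplotlib
matplotlib.use("Agg")  # render to files only; no display required
import matplotlib.pyplot as plt

xs = list(range(10))
plt.plot(xs, [x * x for x in xs], label="x^2")  # a simple line plot
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.savefig("squares.png")  # one of matplotlib's hardcopy formats
```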
18
Python Libraries for Data Science
Seaborn:
▪ based on matplotlib

▪ provides a high-level interface for drawing attractive statistical
graphics

▪ similar (in style) to the popular ggplot2 library in R

Link: https://seaborn.pydata.org/
19
Login to the Shared Computing Cluster
• Use your SCC login information if you have an SCC account

• If you are using a tutorial account, see the info on Blackboard

Note: Your password will not be displayed while you enter it.

20
Selecting Python Version on the SCC
# view available python versions on the SCC

[scc1 ~] module avail python

# load python 3 version

[scc1 ~] module load python/3.6.2

21
Start Jupyter notebook
# On the Shared Computing Cluster
[scc1 ~] jupyter notebook

22
Loading Python Libraries

In [ ]: # Import Python Libraries
        import numpy as np
        import scipy as sp
        import pandas as pd
        import matplotlib as mpl
        import seaborn as sns

Press Shift+Enter to execute the Jupyter cell

23
