Natural Language Toolkit (NLTK)

The document discusses the Natural Language Toolkit (NLTK), a popular Python package for natural language processing. It provides functions and objects for common NLP tasks like tokenization, part-of-speech tagging, parsing, and more. These allow programmers to preprocess and analyze human language in an automated way. The document also introduces scikit-learn, a widely used machine learning library in Python, and discusses how to access these tools on a shared computing cluster.

The Natural Language Toolkit

(NLTK)
Natural Language Processing (NLP)
• How can we make a computer understand language?
– Can a human write/talk to the computer?
• Or can the computer guess/predict the input?
– Can the computer talk back?
– Based on language rules, patterns, or statistics
• Currently, statistical approaches are the most accurate and popular

2
Some areas of NLP
• shallow processing – the surface level
– tokenization
– part-of-speech tagging
– forms of words
• deep processing – the underlying structures of language
– word order (syntax)
– meaning
– translation
• natural language generation

3
The NLTK
• A collection of:
– Python functions and objects for accomplishing NLP tasks
– sample texts (corpora)
• Available at: http://nltk.sourceforge.net
– Requires Python 2.4 or higher
– Click 'Download' and follow instructions for your OS

4
Tokenization
• Say we want to know the words in Marty's vocabulary
– "You know what I hate? Anybody who drives an S.U.V. I'd really
like to find Mr. It-Costs-Me-100-Dollars-To-Gas-Up and kick him
square in the teeth. Booyah. Be like, I'm Marty Stepp, the best
ever. Booyah!"
• How do we split his speech into tokens?

5
Tokenization (cont.)
• How do we split his speech into tokens?

>>> martysSpeech.split()
['You', 'know', 'what', 'I', 'hate?', 'Anybody',
'who', 'drives', 'an', 'S.U.V.', "I'd", 'really',
'like', 'to', 'find', 'Mr.',
'It-Costs-Me-100-Dollars-To-Gas-Up', 'and',
'kick', 'him', 'square', 'in', 'the', 'teeth.',
'Booyah.', 'Be', 'like,', "I'm", 'Marty',
'Stepp,', 'the', 'best', 'ever.', 'Booyah!']

• Now, how often does he use the word "booyah"?

>>> martysSpeech.split().count("booyah")
0
>>> # What the!

6
Tokenization (cont.)
• We could lowercase the speech
• We could write our own method to split on ".", ",", "-", etc.
• The NLTK already has several tokenizer options
• Try:
• nltk.tokenize.WordPunctTokenizer
– tokenizes on all punctuation
• nltk.tokenize.PunktWordTokenizer
– trained algorithm to statistically split on words
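
The do-it-yourself approach above can be sketched with Python's standard library alone; the regular expression and variable name below are illustrative choices, not part of NLTK:

```python
import re

# Marty's speech from the previous slide
martys_speech = ("You know what I hate? Anybody who drives an S.U.V. "
                 "I'd really like to find Mr. It-Costs-Me-100-Dollars-To-Gas-Up "
                 "and kick him square in the teeth. Booyah. Be like, I'm Marty "
                 "Stepp, the best ever. Booyah!")

# Lowercase first, then keep runs of letters/digits, allowing internal
# apostrophes and hyphens ("i'd", "it-costs-me-...") but dropping ., ! and ?
tokens = re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", martys_speech.lower())

print(tokens.count("booyah"))  # -> 2 (both "Booyah." and "Booyah!" now match)
```

This fixes the count from the previous slide, but hand-rolled rules like this break down quickly, which is why the NLTK tokenizers above are usually the better choice.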

7
Part-of-speech (POS) tagging
• If you know a token's POS, you can answer:
– is it the subject?
– is it the verb?
– is it introducing a grammatical structure?
– is it a proper name?

8
Part-of-speech (POS) tagging
• Exercise: most frequent proper noun in the Penn Treebank?
– Try:
• nltk.corpus.treebank
• Python's dir() to list attributes of an object
– Example:

>>> dir("hello world!")
[..., 'capitalize', 'center', 'count',
'decode', 'encode', 'endswith', 'expandtabs',
'find', 'index', 'isalnum', 'isalpha',
'isdigit', 'islower', 'isspace', 'istitle',
'isupper', 'join', 'ljust', 'lower', ...]
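
One way to attack the exercise, once the corpus is available, is to pull out the NNP-tagged words and count them with collections.Counter. The toy tagged_words list below is a made-up stand-in for nltk.corpus.treebank.tagged_words(), so the pattern runs without the corpus download:

```python
from collections import Counter

# Toy stand-in for nltk.corpus.treebank.tagged_words()
tagged_words = [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','),
                ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'),
                ('Pierre', 'NNP'), ('will', 'MD'), ('join', 'VB')]

# Keep only proper nouns (Penn Treebank tag 'NNP') and count them
proper_nouns = Counter(word for word, tag in tagged_words if tag == 'NNP')
print(proper_nouns.most_common(1))  # -> [('Pierre', 2)]
```

Swapping the toy list for the real treebank call answers the exercise.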

9
Tuples
• tagged_words() gives us a list of tuples
– tuple: the same thing as a list, but you can't change it
– in this case, each tuple is a (word, tag) pair

>>> # Get the (word, tag) pair at list index 0
>>> pair = nltk.corpus.treebank.tagged_words()[0]
>>> pair
('Pierre', 'NNP')
>>> word = pair[0]
>>> tag = pair[1]
>>> print word, tag
Pierre NNP
>>> word, tag = pair # or unpack in 1 line!
>>> print word, tag
Pierre NNP

10
POS tagging (cont.)
• How do we tag plain sentences?
– An NLTK tagger needs a list of tagged sentences to train on
• We'll use nltk.corpus.treebank.tagged_sents()
– Then it is ready to tag any input! (but how well?)

– Try these tagger objects:
• nltk.UnigramTagger(tagged_sentences)
• nltk.TrigramTagger(tagged_sentences)
– Call the tagger's tag(tokens) method

>>> tagger = nltk.UnigramTagger(tagged_sentences)
>>> result = tagger.tag(tokens)
>>> result
[('You', 'PRP'), ('know', 'VB'), ('what', 'WP'),
('I', 'PRP'), ('hate', None), ('?', '.'), ...]
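
Why does "hate" come back as None? A unigram tagger only knows words it saw in training. A minimal sketch of the idea, with made-up toy training data and helper names (NLTK's real UnigramTagger is more capable):

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sentences):
    """Map each word to its most frequent tag in the training data."""
    tag_counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tag_counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in tag_counts.items()}

def tag_tokens(model, tokens):
    # Unseen words get None, just like NLTK's UnigramTagger
    return [(tok, model.get(tok)) for tok in tokens]

train = [[('You', 'PRP'), ('know', 'VB'), ('what', 'WP')],
         [('I', 'PRP'), ('know', 'VB')]]
model = train_unigram(train)
print(tag_tokens(model, ['You', 'know', 'hate']))
# -> [('You', 'PRP'), ('know', 'VB'), ('hate', None)]
```

Training on the full treebank shrinks, but never eliminates, the set of unknown words.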

11
Parsing
• Syntax is as important for a compiler as it is for natural
language
• Realizing the hidden structure of a sentence is useful for:
– translation
– meaning analysis
– relationship analysis
– a cool demo!
• Try:
– nltk.draw.rdparser.demo()

12
Conclusion
• NLTK: NLP made easy with Python
– Functions and objects for:
• tokenization, tagging, generation, parsing, ...
• and much more!
– Even armed with these tools, NLP has a lot of difficult problems!
• Also saw:
– List methods
– dir()
– Tuples

13
Python scikit-learn
• Popular machine learning toolkit in Python: http://scikit-learn.org/stable/
• Requirements
– Anaconda, available from https://www.continuum.io/downloads
– Includes numpy, scipy, and scikit-learn (the former two are
required by scikit-learn)

14
SciKit
Many popular Python toolboxes/libraries:
– NumPy
– SciPy
– Pandas
– SciKit-Learn

Visualization libraries:
– matplotlib
– Seaborn

and many more … (all of these libraries are installed on the SCC)

15
Python Libraries for Data Science
SciPy:
▪ collection of algorithms for linear algebra, differential
equations, numerical integration, optimization, statistics and
more

▪ part of SciPy Stack

▪ built on NumPy

Link: https://www.scipy.org/scipylib/
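
A small taste of SciPy's numerical-integration corner (assumes SciPy is installed; the integral of sin(x) over [0, π] is exactly 2):

```python
import math
from scipy import integrate

# Numerically integrate sin(x) from 0 to pi;
# quad returns (value, absolute error estimate)
area, abs_error = integrate.quad(math.sin, 0, math.pi)
print(round(area, 6))  # -> 2.0
```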
16
Python Libraries for Data Science
SciKit-Learn:
▪ provides machine learning algorithms: classification,
regression, clustering, model validation etc.

▪ built on NumPy, SciPy and matplotlib

Link: http://scikit-learn.org/
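
A minimal classification sketch in the style scikit-learn encourages (fit, then predict); the two-cluster data set is invented for illustration, and scikit-learn is assumed to be installed, e.g. via Anaconda:

```python
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated toy clusters: class 0 near the origin, class 1 near (5, 5)
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = [0, 0, 0, 1, 1, 1]

# Fit a 3-nearest-neighbors classifier and predict two unseen points
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, y)
print(clf.predict([[0.5, 0.5], [5.5, 5.5]]))  # -> [0 1]
```

Regression, clustering, and model-validation tools all follow this same fit/predict (or fit/transform) pattern.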
17
Python Libraries for Data Science
matplotlib:
▪ Python 2D plotting library which produces publication-quality
figures in a variety of hardcopy formats

▪ a set of functionalities similar to those of MATLAB

▪ line plots, scatter plots, bar charts, histograms, pie charts, etc.

▪ relatively low-level; some effort needed to create advanced
visualizations
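
A minimal example of the kind of line plot described above; the Agg backend and the output file name are arbitrary choices so that this runs headless (e.g. over an SSH session to a cluster):

```python
import matplotlib
matplotlib.use("Agg")  # render to files only; no display required
import matplotlib.pyplot as plt

xs = list(range(10))
plt.plot(xs, [x * x for x in xs], label="x^2")  # a simple line plot
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.savefig("squares.png")  # one of matplotlib's hardcopy formats
```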
18
Python Libraries for Data Science
Seaborn:
▪ based on matplotlib

▪ provides a high-level interface for drawing attractive statistical
graphics

▪ similar (in style) to the popular ggplot2 library in R

Link: https://seaborn.pydata.org/
19
Login to the Shared Computing Cluster
• Use your SCC login information if you have an SCC account

• If you are using a tutorial account, see the info on Blackboard

Note: Your password will not be displayed while you enter it.

20
Selecting Python Version on the SCC
# view available python versions on the SCC

[scc1 ~] module avail python

# load python 3 version

[scc1 ~] module load python/3.6.2

21
Start Jupyter notebook
# On the Shared Computing Cluster
[scc1 ~] jupyter notebook

22
Loading Python Libraries

In [ ]: # Import Python Libraries
        import numpy as np
        import scipy as sp
        import pandas as pd
        import matplotlib as mpl
        import seaborn as sns

Press Shift+Enter to execute the Jupyter cell

23
