Natural Language Processing

Contents
● Searching text
● Counting vocabulary

What do you think we will be able to do if we combine basic programming skills with a significant amount of text?
Getting Started with the Python Programming Language

You are able to type directly into the interactive interpreter, which is the software that will be responsible for executing your Python programs. The >>> prompt indicates that the Python interpreter is now waiting for input. When copying examples from this book, don't type the ">>>" yourself. Let's begin by using Python as a calculator:

>>> 1 + 5 * 2 - 3
8
>>>

The prompt will appear once again when the interpreter has completed computing the answer and showing it to the user. This indicates that the Python interpreter is awaiting further instruction.

Note

Your Turn: Enter a few more expressions of your own. You can use asterisk (*) for multiplication and slash (/) for division, and parentheses for bracketing expressions.

The accompanying examples explain how you can work interactively with the Python interpreter by playing with different expressions in the language to see what they accomplish. Now, let's examine how the interpreter deals with a meaningless phrase by giving it a shot:

>>> 1 +
  File "<stdin>", line 1
    1 +
      ^
SyntaxError: invalid syntax
>>>

This produced a syntax error. In Python, it does not make sense to close an instruction with a plus sign. The Python interpreter points out the line number on which the error occurred; in this case, it is line 1 of stdin, which is short for "standard input."

Since we are now able to access the Python interpreter, we are prepared to begin dealing with language data. To get the version of NLTK that is necessary for your platform, just follow the instructions provided on the NLTK website. After you have finished installing, restart the Python interpreter in the same manner as before, and then install the data that is necessary for the book by typing the following two commands at the Python prompt, and then selecting the book collection in the manner that is outlined in the following:
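The two commands themselves are not reproduced in this excerpt; for the NLTK toolkit used throughout this text they are:

>>> import nltk
>>> nltk.download()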
>>> from nltk.book import *
>>>
>>> text1.concordance("monstrous")
Displaying 11 of 11 matches:
... that has survived the flood ; most monstrous and most mountainous ! That Himmal ...
... they might scout at Moby Dick as a monstrous fable , or still worse and more de ...
... th of Radney .'" CHAPTER 55 Of the monstrous Pictures of Whales . I shall ere l ...
... ing Scenes . In connexion with the monstrous ...

Any time we want to find out about these texts, we can simply refer to them by name at the Python prompt. You might also like to see how words have been used differently over time. We've also included text5, the NPS Chat Corpus: search this for unconventional words like im, ur, lol. (Note that this corpus is uncensored!) We hope that once you've spent some time reading these works, you'll have a fresh perspective on the variety of ways language may be used.

Note

>>> text1.similar("monstrous")
>>> text2.similar("monstrous")
>>>
Observe that we get different results for different texts. Austen uses this word quite differently from Melville;
for her, monstrous has positive connotations, and sometimes functions as an intensifier like the word very.
The function common_contexts() allows us to examine just the contexts that are shared by two or more words, such as monstrous and very. We must enclose these words in square brackets as well as parentheses and separate them with a comma, as shown in the sketch below:
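The call itself is not reproduced in this excerpt; a minimal sketch using text2 (output omitted here; it lists shared contexts such as a_pretty and am_glad) would be:

>>> text2.common_contexts(["monstrous", "very"])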
Note
Your Turn: Pick another pair of words and compare their usage in two different texts, using the similar() and
common_contexts() functions.
It is one thing to automatically detect that a particular word occurs in a text and to display other words that appear in the same context. We can, however, also identify the location of a word in the text: how many words from the beginning it appears. This positional information can be displayed using a dispersion plot, in which each stripe represents an individual occurrence of a word and each row represents the entire text. We see some remarkable trends in the way words have been used during the last 220 years (in an artificial text constructed by joining the texts of the Inaugural Address Corpus end-to-end). You may create this plot in the manner shown below. You might try out some more terms (such as "liberty" or "constitution") as well as some alternative passages. Are you able to guess how a word will be distributed before you ever see it? As was the case previously, ensure that the quotation marks, commas, brackets, and parentheses are entered correctly.
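The plotting call itself is not reproduced above; a sketch using text4 (the Inaugural Address Corpus) with an illustrative word list is:

>>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])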
Note
Important: You need to have Python’s NumPy and Matplotlib packages installed in order to produce the
graphical plots used in this book.
Note
You can also plot the frequency of word usage through time using https://books.google.com/ngrams
Let's generate some random text now, just for fun, and see how it looks in the different styles that we've been discussing. In order to do this, we type the name of the text followed by the term generate. (We need to include the parentheses, but nothing goes between them.)
>>>text3.generate()
In the beginning of his brother is a hairy man, whose top may reach
unto heaven; and ye shall sow the land of Egypt there was no bread in
all that he was taken out of the month, upon the earth. So, shall thy
wages be? And they made their father; and Isaac was old, and kissed
him: and Laban with his cattle in the midst of the hands of Esau thy
first born, and Phichol the chief butler unto his son Isaac , she
>>>
The most striking difference between the two generated passages is the vocabulary used in each of the two underlying texts. In the following discussion, we will look at a number
of practical applications for counting the number of words in a piece of writing using a computer. You will
proceed as you have in the past and immediately begin experimenting with the Python interpreter, despite the
fact that you may not have yet studied Python in a systematic manner. Modifying the examples and working
through the activities at the conclusion of the chapter will help you evaluate your level of comprehension.
Let's begin by counting the number of words and punctuation marks that occur in a piece of writing from beginning to end so that we may get a sense of its overall length. The len function gives the length of anything; here we apply it to text3, the book of Genesis:
>>> len(text3)
44764
>>>

Therefore, the book of Genesis has 44,764 "tokens," which are words and punctuation symbols. A token is the technical term for a sequence of characters (such as "hairy," "his," or ":") that we wish to treat as a single unit. When we count the number of tokens in a text, we are counting instances of these sequences; for example, the phrase "to be or not to be" contains two instances of the word "to," two instances of the word "be," and one instance each of the words "or" and "not," yet only four unique words. How many unique words can be found throughout the whole of the book of Genesis? In order to solve this problem using Python, we will need to restate the issue in a slightly different way. A text's vocabulary is just the set of tokens that it employs, because in a set, any tokens that are used more than once are collapsed into a single entry. With Python's set command, we are able to extract the vocabulary items that text3 contains. When you do this, a lot of screens filled with text will go by very quickly. Now put the following into practice:

>>> sorted(set(text3))
['!', "'", '(', ')', ',', ',)', '.', '.)', ':', ';', ';)', '?', '?)', ...]
>>> len(set(text3))
2789
>>>

We were able to generate a sorted list of vocabulary items by surrounding the Python statement set(text3) with the sorted() function. The list begins with a variety of punctuation marks and continues with words that begin with the letter A. Every word with a capital letter comes before a word with a lowercase letter. Indirectly, we get the size of the vocabulary by asking for the number of items in the set; once again, we may use len to retrieve this amount. This book includes just 2,789 unique words, or "word types," despite the fact that it contains 44,764 tokens. A word type is the form or spelling of the word independent of its particular occurrences in a text; in other words, it refers to the word itself as an individual item of vocabulary. Since our tally of 2,789 items will include punctuation symbols as well, we will refer to these unique items in general as types rather than word types.
>>> lexical_diversity(text3)
0.06230453042623537
>>> lexical_diversity(text5)
0.13477005109975562
>>> percentage(4, 5)
80.0
>>> percentage(text4.count('a'), len(text4))
1.4643016433938312
>>>

In the definition of lexical_diversity(), we specify a parameter named text; this parameter is a placeholder for the actual text whose lexical diversity we want to compute, and it is reused in the body of the function whenever the function is called.
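The definitions of the two functions called above do not appear in this excerpt; minimal versions consistent with the outputs shown (and with the NLTK book's conventions) are:

>>> def lexical_diversity(text):
...     return len(set(text)) / len(text)    # ratio of unique tokens to total tokens
...
>>> def percentage(count, total):
...     return 100 * count / total           # express count as a percentage of total
...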
Summary
● To sum up, we use or call a function like lexical_diversity() by typing its name, followed by an open
parenthesis, the name of the text, and then a close parenthesis.
● You’ll see these parentheses a lot. Their job is to separate the name of a task, like lexical_diversity(), from
the data on which the task is to be done, like text3. When we call a function, the data value we put in the
parentheses is called an argument to the function.
● You have already seen functions like len(), set(), and sorted() in this chapter. By convention, we always put
an empty pair of parentheses after a function name, like len(), to show that we are talking about a function
and not some other kind of Python expression.
● Functions are an important part of programming, and we only talk about them at the beginning to show
newcomers how powerful and flexible programming can be.
● Don’t worry if it’s hard to understand right now. We’ll learn how to use functions when tabulating data later.
We’ll use a function to do the same work over and over for each row of the table, even though each row has
different data.
● Brown corpus
● Reuters corpus

Which text corpora and lexical resources are most helpful, and how can we utilize Python to access them?
A big body of text is referred to as a text corpus, as was just explained. A great number of corpora are constructed with the intention of including an appropriate mix of material from one or more categories. In step one, we looked at a few small collections of text, such as the speeches that are collectively referred to as the US Presidential Inaugural Addresses. This specific corpus, in fact, consists of dozens of separate texts, one for each address; but, for the sake of convenience, we joined all of these texts together and treated them as a single text. We also used a variety of pre-defined texts, any of which could be accessed by entering "from nltk.book import *". This part, on the other hand, looks at a number of different text corpora, since we want to be able to deal with a range of different texts. We are going to go through the process of selecting certain texts and working with those texts.

Gutenberg Corpus

A tiny subset of the texts that may be found in the Project Gutenberg electronic text collection, which can be found at http://www.gutenberg.org/ and houses around 25,000 free electronic books, is included in the NLTK toolkit. We begin by instructing the Python interpreter to load the package, and then we make a request to view the file IDs included inside this corpus using the nltk.corpus.gutenberg.fileids() function:

>>> import nltk
>>> nltk.corpus.gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt',
'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt',
'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt',
'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt',
'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt',
'shakespeare-macbeth.txt', 'whitman-leaves.txt']

Let's take the first of these works, "Emma" by Jane Austen, give it the abbreviated name "emma", and then count the number of words it contains:

>>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')
...
>>> emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
>>> emma.concordance("surprize")

In the process of defining "emma", we used the words() method of the "gutenberg" object included inside the "corpus" package. Python, however, offers a different variant of the import statement, which may be summarized as follows:

>>> from nltk.corpus import gutenberg
>>> gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', ...]
>>> emma = gutenberg.words('austen-emma.txt')

Let's write a short program to display other information about each text, by looping over all the file IDs in the corpus:

5 25 26 austen-emma.txt
5 26 17 austen-persuasion.txt
5 28 22 austen-sense.txt
4 34 79 bible-kjv.txt
5 19 5 blake-poems.txt
4 19 14 bryant-stories.txt
4 18 12 burgess-busterbrown.txt
4 20 13 carroll-alice.txt
5 20 12 chesterton-ball.txt
5 23 11 chesterton-brown.txt
5 18 11 chesterton-thursday.txt
4 21 25 edgeworth-parents.txt
5 26 15 melville-moby_dick.txt
5 52 11 milton-paradise.txt
4 12 9 shakespeare-caesar.txt
4 12 8 shakespeare-hamlet.txt
5 36 12 whitman-leaves.txt
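The looping program that produced the columns above is not reproduced in this excerpt; a sketch in the spirit of the NLTK book, whose three columns correspond to average word length, average sentence length, and the average number of times each vocabulary item appears, is:

>>> for fileid in gutenberg.fileids():
...     num_chars = len(gutenberg.raw(fileid))       # total characters
...     num_words = len(gutenberg.words(fileid))     # total tokens
...     num_sents = len(gutenberg.sents(fileid))     # total sentences
...     num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
...     print(round(num_chars/num_words), round(num_words/num_sents),
...           round(num_words/num_vocab), fileid)
...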
In addition, there is a corpus of instant messaging conversation sessions that was first compiled by the Naval Postgraduate School for the purpose of studying the automated identification of Internet predators. The corpus includes approximately 10,000 posts, all of which have been anonymized by having their usernames replaced with generic names of the form "UserNNN," and by having any other identifying information removed by hand. The corpus is arranged into 15 files, where each file comprises several hundred posts gathered on a certain day from an age-specific chatroom (teens, 20s, 30s, 40s, plus a generic adults chatroom). The date, chatroom, and total number of posts are all included in the name of each file.

In 1961, researchers at Brown University established the Brown Corpus, which is credited as being the first computerized English corpus to contain one million words. Text from five hundred different sources has been compiled into this corpus, and those sources have been arranged according to genre, such as editorial, news, and so on. [1.1] provides an example of each genre (for a comprehensive list, please refer to http://icame.uib.no/brown/bcm-los.html).

We have the option of accessing the corpus as a list of words or as a list of sentences (where each sentence is itself just a list of words), and we can optionally restrict attention to particular files or categories:

>>> brown.sents(categories=['news', 'editorial', 'reviews'])
[['The', 'Fulton', 'County'...], ['The', 'jury', 'further'...], ...]

Now it's your turn: choose a new section of the Brown Corpus, and modify the preceding example so that it counts a selection of wh-words, such as "what," "when," "where," "who," and "why." A sketch of one way to do this appears below.
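The word-counting example referred to above is not included in this excerpt; a hedged sketch of the wh-word version (variable names are illustrative) would be:

>>> from nltk.corpus import brown
>>> news_text = brown.words(categories='news')
>>> fdist = nltk.FreqDist(w.lower() for w in news_text)
>>> wh_words = ['what', 'when', 'where', 'who', 'why']
>>> for w in wh_words:
...     print(w + ':', fdist[w], end=' ')
...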
>>> from nltk.corpus import reuters
>>> reuters.fileids()
['test/14826', 'test/14828', 'test/14829', 'test/14832', ...]
>>> reuters.categories()

In a similar manner, we are able to request the words or sentences contained in particular files or categories. The titles, which are always stored in uppercase because that is the accepted format for storing them, make up the first few words of each of these texts:

>>> reuters.words('training/9865')[:14]

In [1], we examined the Inaugural Address Corpus, but we handled the whole collection as if it were a single text. The "word offset" was used as one of the axes in the graph that was shown in fig-inaugural. This is the numerical index of the word in the corpus, and it is calculated by starting with the very first word of the very first address. The corpus, however, is actually a collection of 55 separate texts, one of which corresponds to each presidential address. One aspect of this collection that is particularly intriguing is its temporal dimension:

>>> from nltk.corpus import inaugural
>>> cfd = nltk.ConditionalFreqDist(
...     (target, fileid[:4])
...     for fileid in inaugural.fileids()
...     for w in inaugural.words(fileid)
...     for target in ['america', 'citizen']
...     if w.lower().startswith(target))
>>> cfd.plot()
Plot of a Conditional Frequency Distribution: all words in the Inaugural Address Corpus that begin with "America" or "citizen" are counted; separate counts are kept for each address; these are plotted so that trends in usage over time can be observed; counts are not normalized for document length.
Annotations of linguistic content are found in a great number of text corpora. These annotations might
represent POS tags, named entities, grammatical structures, semantic roles, and so on. The NLTK gives easy
access to a number of these corpora and includes data packages that can be freely downloaded for use in
education and research that comprise corpora as well as samples of corpora. The following are examples of
some of the corpora:
Summary
● A text corpus is a huge collection of texts that has been organized in a certain way. The Natural Language
Toolkit (NLTK) includes a variety of corpora, such as the Brown Corpus.
● Some text corpora are organized into categories, such as those based on genre or subject matter; in other
cases, the categories in a corpus overlap with one another.
● A collection of frequency distributions, each one for a distinct circumstance, is what we mean when we talk
about a conditional frequency distribution. They are useful for calculating the number of times a certain
word or phrase appears inside a specific context or kind of writing.
● Python code that is more than a few lines long has to be written in a text editor, then saved to a file with the
.py suffix, and finally retrieved using an import statement.
● A function that is connected with an object is called a method; it is invoked by writing the name of the object, a period, and then the name of the function, as in x.funct(y) — for example, word.isalpha().
● In the Python interactive interpreter, you may read the help entry for objects of this kind by typing help(v)
and then looking up the variable you want information on.
● WordNet is an English dictionary that takes a semantic approach. It is comprised of synonym sets, often
known as synsets, and is structured in the form of a network.
● Certain functions cannot be used without first using Python’s import statement because they are not made
accessible by default.
● Search engine results
● Local files

You most likely have your own text sources in mind, and you will need to discover how to access them.

Note

Important: From this chapter onwards, our program samples will assume you begin your interactive session or your program with the following import statements:
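The import statements themselves are not reproduced in this excerpt; based on the NLTK book's convention and the functions used below (word_tokenize and request.urlopen), they would be along these lines:

>>> import nltk, re, pprint
>>> from nltk import word_tokenize
>>> from urllib import request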
>>> url = "http://www.gutenberg.org/files/2554/2554-0.txt"
>>> response = request.urlopen(url)
>>> raw = response.read().decode('utf8')
>>> type(raw)
<class 'str'>
>>> tokens = word_tokenize(raw)
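The llog object used in the next block is never constructed in this excerpt; in the NLTK book it comes from the third-party feedparser library, roughly as follows (the feed URL is the one used there):

>>> import feedparser
>>> llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")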
>>> len(llog.entries)
15
>>> post = llog.entries[2]
>>> post.title
"He's My BF"
>>> content = post.content[0].value
>>> content[:70]
'<p>Today I was chatting with three of our visiting graduate students f'
>>> raw = BeautifulSoup(content, 'html.parser').get_text()
>>> word_tokenize(raw)
['Today', 'I', 'was', 'chatting', 'with', 'three', 'of', 'our', ...]

Reading local files is straightforward:

>>> f = open('document.txt')
>>> raw = f.read()

Note

Now it's your turn: using a text editor, create a new file on your computer and name it document.txt. Type in a few lines of text and save the file as plain text. If you are working with IDLE, select the New Window command from the File menu, type the necessary text into this window, and then save the file as document.txt within the directory that IDLE suggests in the pop-up dialogue box. Next, open the file in the Python interpreter by typing f = open('document.txt'), and then examine its contents by typing print(f.read()).

When you attempted to do this, a number of different things may have gone wrong. In the event that the interpreter was unable to locate your file, you would have received an error similar to the following:

>>> f = open('document.txt')
Traceback (most recent call last):
  File "<pyshell#7>", line 1, in -toplevel-
    f = open('document.txt')
IOError: [Errno 2] No such file or directory: 'document.txt'

Use the Open command found in the File menu of IDLE; this will show a list of all the files located in the directory where IDLE is currently operating. Using this command will allow you to verify that the file you are attempting to open is indeed located in the correct directory. An additional option is to do the search from inside Python, using the directory that is currently active:

>>> import os
>>> os.listdir('.')

There are a few different approaches you might take in order to read the file once you are able to access it. The read() function generates a string that contains the whole of the file's contents:

>>> f.read()
'Time flies like an arrow.\nFruit flies like a banana.\n'

It is important to keep in mind that the characters '\n' represent a newline; this is the same as beginning a new line by hitting the Enter key on a keyboard. We also have the option of reading a file line by line by using a for loop:

>>> f = open('document.txt', 'rU')
>>> for line in f:
...     print(line.strip())
...
Time flies like an arrow.
Fruit flies like a banana.

In this case, we make use of the strip() function to get rid of the newline character that appears at the end of each line of input.
>>>path = nltk.data.find(‘corpora/gutenberg/melville-moby_dick.txt’)
Text Extraction from Binary Formats Such As PDF, MS Word, and Other Formats
Text formats that are understandable by humans include ASCII text and HTML text. Binary file formats, such
as PDF and MSWord, are used for text documents, and these formats need specific software to be accessed.
Access to these formats may be gained via the use of third-party libraries such as pypdf and pywin32. It is a
particularly difficult task to extract text from documents that have several columns. It is easier to do a one-
time conversion of a few documents if you first open the document in an appropriate program, then save it as
text to your local disk, and then access it in the manner that is outlined in the following paragraphs. You may
search for the document by entering its URL into Google’s search box if it is already available on the internet.
A link to an HTML version of the document, which you are able to save as text, is often included in the results
of the search.
When a user is engaging with our software, there are occasions when we would want to record the text that the user types in. Calling the Python function input() will prompt the user to enter a line of input; after saving it to a variable, we can manipulate it just as we would any other string.
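A small sketch of this (the sample sentence is illustrative; word_tokenize is assumed to have been imported as above):

>>> s = input("Enter some text: ")
Enter some text: On an exceptionally hot evening early in July
>>> print("You typed", len(word_tokenize(s)), "words.")
You typed 8 words.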
Summary
● In this unit, a “text” is understood to be nothing more than a collection of words. A “raw text” is a potentially
lengthy string that contains words and whitespace formatting, and it is the traditional way that we store
and perceive a text. This string may also be called “plain text.”
● Python allows you to specify strings by enclosing them in single or double quotes, such as 'Monty Python' or "Monty Python".
● To retrieve the characters included in a string, indices are used, with counting beginning with zero: 'Monty Python'[0] gives the value 'M'. The length of a string is found using len().
● Accessing a substring requires employing a technique known as slice notation. For example, “Monty
Python”[1:5] returns the result onty. If the start index is not specified, the substring will start at the beginning
of the string. On the other hand, if the end index is not specified, the slice will continue until it reaches the
end of the text.
● A string can be split into a list of strings: 'Monty Python'.split() produces ['Monty', 'Python']. A list of strings can be joined into a single string using a separator such as '/': '/'.join(['Monty', 'Python']) produces the string 'Monty/Python'.
● We can read text from the file input.txt using text = open('input.txt').read(). We can read text from a URL u using text = request.urlopen(u).read().decode('utf8'). We can loop over the lines of a text file f using for line in open(f).
● Text may be written to a file by first opening the file for writing with output_file = open('output.txt', 'w'), and then writing to it with print("Monty Python", file=output_file).
● Before we can do any kind of linguistic processing on texts that were obtained on the internet, we need to
remove any undesired data that they could have included, such as headers, footers, or markup.
● The process of breaking up a text into its component parts, also known as tokens, such as words and
punctuation is known as tokenization. As it groups punctuation in with the words it recognizes, tokenization
that is based on whitespace is not suitable for many applications. NLTK has a pre-built tokenizer that you
may use called nltk.word_tokenize().
● The process of lemmatization translates the many forms of a word (such as appeared and appears) to
the canonical or citation form of the term, which is also referred to as the lexeme or lemma (e.g., appear).
● About sequences
● Sequence types

It is possible that you are still struggling with Python and do not yet feel as if you have complete control. Within the scope of this chapter, we will discuss the following issues:
>>> foo = ['Monty', 'Python']
>>> bar = foo
>>> foo[1] = 'Bodkin'
>>> bar
['Monty', 'Bodkin']
The Figure Showing the Assignment of Lists and the Computer Memory: Because the two list objects foo
and bar reference the same address in the memory of the computer, any changes you make to foo will
automatically be reflected in bar, and vice versa.
The contents of the variable are not copied; only its "object reference" is copied when the line bar = foo is executed. To understand what is going on here, we need to know how lists are stored in the computer's memory. We can see from the figure that the list foo is a
reference to an object that is kept at the position 3133 (which is itself a series of pointers to other locations
holding strings). When we do an assignment like bar = foo, all that is transferred is the object reference,
which is 3133. This behavior is applicable to other facets of the language as well, such as the passing of
parameters.
Let's conduct some further tests by first establishing a variable called empty that will store the empty list, and then making use of that variable three times on the line that follows:

>>> empty = []
>>> nested = [empty, empty, empty]
>>> nested
[[], [], []]
>>> nested[1].append('Python')
>>> nested
[['Python'], ['Python'], ['Python']]
Take note that by altering one of the items in our nested list of lists, we were able to modify the other items as well. This is the case due to the fact that each of the three elements is, in reality, just a reference to the same list that is stored in memory. It is essential to have a thorough understanding of the distinction between modifying an object through an object reference and overwriting an object reference. In a further experiment, appending 'Python' produced a nest containing three references to a single list object; the next step was to replace one of those references with a reference to a new object, ['Monty']:

>>> nested
[['Python'], ['Monty'], ['Python']]

During this final step, one of the three object references inside the nested list was overwritten; the other two still point at the original list. Next we will show that copies made with the multiplication operator are not only identical according to the operator ==, but are also the very same object:

>>> size = 5
>>> python = ['Python']
>>> snake_nest = [python] * size
>>> snake_nest[0] == snake_nest[1] == snake_nest[2] == snake_nest[3] == snake_nest[4]
True
Let's add another python to this nest now, shall we?
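The statement that adds the new python is not shown in this excerpt; in the NLTK book it is placed at a random position, roughly as follows (the random position explains why the intruder may appear in a different slot each run):

>>> import random
>>> position = random.choice(range(size))
>>> snake_nest[position] = ['Python']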
>>> snake_nest
[['Python'], ['Python'], ['Python'], ['Python'], ['Python']]
>>> snake_nest[0] == snake_nest[1] == snake_nest[2] == snake_nest[3] == snake_nest[4]
True
>>> snake_nest[0] is snake_nest[1] is snake_nest[2] is snake_nest[3] is snake_nest[4]
False

You can perform a number of pairwise tests to determine which position is occupied by the intruder; however, the id() function makes detection much simpler:

>>> [id(snake) for snake in snake_nest]
[4557855488, 4557854763, 4557855488, 4557855488, 4557855488]

This demonstrates that the second item on the list has a distinct identifier. You should expect to see different numbers when you execute this code snippet yourself, and the intruder may also be in a different position.

Caution!

In the condition part of an if statement, a nonempty string or list is evaluated as true, while an empty string or list evaluates as false:

>>> mixed = ['cat', '', ['dog'], []]
>>> for element in mixed:
...     if element:
...         print(element)
...
cat
['dog']

That is to say, in the condition we do not require the statement if len(element) > 0:.

How is the use of if...elif distinct from the use of multiple if statements placed one after the other? Now, take into consideration the following scenario:

>>> animals = ['cat', 'dog']
>>> if 'cat' in animals:
...     print(1)
... elif 'dog' in animals:
...     print(2)
...
1
>>>any(len(w) > 4 for w in sent)
(29, 5, 2)
Take note that in this snippet of code, we have calculated many numbers on a single line, separating each one
with commas. Python enables us to eliminate the parentheses surrounding tuples if there is no ambiguity,
which is why these comma-separated expressions are essentially simply tuples. When we print a tuple, the
parenthesis will consistently be shown on the page. When we use tuples in this manner, we are effectively
grouping elements together in a more general sense.
As was just shown for you, there are several helpful methods in which we can iterate through the elements
that make up a sequence.
for item in set(s).difference(t)    iterate over elements of s that are not in t
Other objects, such as a FreqDist, can be converted into a sequence (using list() or sorted()) and support iteration:

>>> raw = 'Red lorry, yellow lorry, red lorry, yellow lorry.'
>>> text = word_tokenize(raw)
>>> fdist = nltk.FreqDist(text)
>>> sorted(fdist)
[',', '.', 'Red', 'lorry', 'red', 'yellow']
>>> for key in fdist:
...     print(key + ':', fdist[key], end='; ')
...
lorry: 4; red: 1; .: 1; ,: 3; Red: 1; yellow: 2

In the next example, we use tuples to rearrange the items in our list. (Since the comma has greater precedence than assignment, the parentheses aren't necessary here.)

>>> words = ['I', 'turned', 'off', 'the', 'spectroroute']
>>> words[2], words[3], words[4] = words[3], words[4], words[2]
>>> words
['I', 'turned', 'the', 'spectroroute', 'off']

This is equivalent to the following traditional way of doing the same rearrangement, which needs a temporary variable tmp:

>>> tmp = words[2]
>>> words[2] = words[3]
>>> words[3] = words[4]
>>> words[4] = tmp

Python's sequence functions, such as sorted() and reversed(), can reorder the items in a sequence, as we have seen. There are also functions that can change the structure of a sequence, which can be useful for language processing because of the flexibility they provide. The function zip() takes the elements of two or more sequences and "zips" them together into a single sequence of tuples. If you give enumerate(s) a sequence, it will return pairs consisting of an index and the item located at that index.

>>> words = ['I', 'turned', 'off', 'the', 'spectroroute']
>>> tags = ['noun', 'verb', 'prep', 'det', 'noun']
>>> zip(words, tags)
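In Python 3, zip() and enumerate() return lazy objects; to inspect the pairs they produce, you can wrap the calls in list(). A small illustrative sketch:

>>> list(zip(words, tags))
[('I', 'noun'), ('turned', 'verb'), ('off', 'prep'), ('the', 'det'), ('spectroroute', 'noun')]
>>> list(enumerate(words))
[(0, 'I'), (1, 'turned'), (2, 'off'), (3, 'the'), (4, 'spectroroute')]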
>>>lexicon.sort()
>>>del lexicon[0]
● Both the assignment and the passing of parameters in Python make use of object references. For instance,
if a is a list and we assign b = a, then every action performed on a will also alter b, and vice versa.
● The is operation determines whether two objects are the same internal object, while the == operation determines whether two objects are equal. This difference is analogous to the one between the type and the token.
● There are many distinct types of sequence objects, including strings, lists, and tuples. These sequence
objects enable standard operations like indexing, slicing, len(), sorted(), and membership checking using
in.
● Programming in a declarative style often results in code that is more concise and easier to comprehend; manually incremented loop variables are typically unnecessary; when a sequence must be enumerated, use enumerate().
● Functions are a fundamental programming abstraction; the notions of passing parameters, variable scope,
and docstrings are important to grasp while working with functions.
● Names that are defined inside a function are not accessible outside of that function unless those names
are explicitly declared to be global. A function acts as a namespace.
● Modules make it possible to localize content that is logically connected inside a single file. Names that are
declared inside a module, such as variables and functions, are not accessible to other modules until those
names are imported into those other modules. This is because a module acts as a namespace.
● In the field of natural language processing (NLP), dynamic programming is a method for the construction
of algorithms that maintains the results of earlier computations to prevent needless re-computation.
● Mapping words to properties using Python dictionaries

Word categories like these are useful for many tasks that have to do with processing language. As we'll see, they come from a simple look at how words are placed in a text. This chapter will try to answer the following questions:
Using a Tagger

A part-of-speech tagger, or POS-tagger, takes a string of words and adds a part-of-speech tag to each one (don't forget to import nltk):

>>> text = word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
('completely', 'RB'), ('different', 'JJ')]

Here, we can see that and is a coordinating conjunction (CC), that now and completely are adverbs (RB), that something is a noun (NN), and that different is an adjective (JJ).
Note:

NLTK has documentation for each tag, which you can look up using either the tag name, like nltk.help.upenn_tagset('RB'), or a regular expression, like nltk.help.upenn_tagset('NN.*'). Some corpora have README files with information about tagsets. To find these files, type nltk.corpus.???.readme(), where ??? is the name of the corpus.

Let's take a look at another example, this time with homonyms. Notice that the words refuse and permit can both appear as present tense verbs (VBP) and as nouns (NN). For example, refUSE is a verb meaning "deny," while REFuse is a noun meaning "trash" (i.e. they are not homophones). So, to pronounce the text correctly, we need to know which word is being used. (This is why POS-tagging is usually done by text-to-speech systems.)

Note:

Your Turn: Many words, such as ski and race, can be used as nouns or verbs without changing how they sound. Can you come up with more? Hint: Try putting the word to in front of a common noun to see if it can also be a verb, or try putting the in front of an action to see if it can also be a noun. Now, make up a sentence that uses this word in both ways and run it through the POS-tagger.

Noun categories and part-of-speech tags like NN seem to have their uses, but many readers won't understand the details. You might wonder why it's important to add this extra piece of information. Many of these categories come from a quick look at how words are placed in a text. Think about how the words woman (a noun), bought (a verb), over (a preposition), and the (a determiner) are distributed. The text.similar() method takes a word w, finds all contexts w1 w w2, and then finds all words w' that appear in the same context, i.e. w1 w' w2.

>>> text.similar('woman')
man day time year car moment world family house boy child country job
state girl place war way case question
>>> text.similar('bought')
made done put said found had seen given left heard been brought got set
was called felt in that told

Tagged Corpora

A tagged token is represented as a tuple consisting of the token and its tag; we can create one from the standard string representation of a tagged token using the function str2tuple():

>>> tagged_token = nltk.tag.str2tuple('fly/NN')
>>> tagged_token
('fly', 'NN')
>>> tagged_token[0]
'fly'
>>> tagged_token[1]
'NN'

We can construct a list of tagged tokens directly from a string of tagged text:

>>> sent = '''
... The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at
... investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd
... no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.
... '''
>>> [nltk.tag.str2tuple(t) for t in sent.split()]

Part-of-speech tags are saved in different ways by other corpora. The interface of NLTK's corpus readers is the same, so you don't have to worry about the different file formats. In contrast to the file fragment shown above, this is how the data looks in the Brown Corpus reader. Since the Brown Corpus was published, it has become standard for part-of-speech tags to be written in all capital letters:

>>> nltk.corpus.brown.tagged_words()
[('The', 'AT'), ('Fulton', 'NP-TL'), ...]
>>> nltk.corpus.brown.tagged_words(tagset='universal')
[('The', 'DET'), ('Fulton', 'NOUN'), ...]

Whenever tagged text is in a corpus, the tagged_words() method will be in the NLTK corpus interface. Here are some further examples:

>>> nltk.corpus.conll2000.tagged_words()
[('Confidence', 'NN'), ('in', 'IN'), ('the', 'DT'), ...]
>>> nltk.corpus.treebank.tagged_words()
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ...]

Not all corpora use the same set of tags. For more information, see the tagset help and readme() methods mentioned above. At first, we want to stay away from these tagsets' complexity, so we use a built-in mapping to the "Universal Tagset":

>>> nltk.corpus.brown.tagged_words(tagset='universal')
[('The', 'DET'), ('Fulton', 'NOUN'), ...]
>>> nltk.corpus.treebank.tagged_words(tagset='universal')
[('Pierre', 'NOUN'), ('Vinken', 'NOUN'), (',', '.'), ...]

NLTK comes with tagged corpora for a number of other languages, such as Chinese, Hindi, Portuguese, Spanish, Dutch, and Catalan. These usually contain non-ASCII text, and when Python prints a larger structure like a list, it shows this text in hexadecimal.
A Universal Part-of-Speech Tagset

There are many ways to tag words in tagged corpora. We will look at a simplified tagset to help us get started. Let's see which of these tags is used the most in the news category of the Brown corpus:

>>> tag_fd.most_common()
[('NOUN', 30640), ('VERB', 14399), ('ADP', 12355), ('.', 11928), ('DET', 11389),
('ADJ', 6706), ('ADV', 3349), ('CONJ', 2717), ('PRON', 2535), ('PRT', 2264), ...]
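The construction of tag_fd is not shown in this excerpt; a sketch consistent with the output above (following the NLTK book) would be:

>>> from nltk.corpus import brown
>>> brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
>>> tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
>>> tag_fd.most_common()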
Lists for Indexing Vs. Dictionaries

Figure: List Look-up: with the help of an integer index (0, 1, 2, 3), we can get to the items in a Python list (e.g. 'Call', 'me', 'Ishmael', '.').
Compare this to frequency distributions (3), where we supply a word and get back a number, such as fdist['monstrous'], which tells us how many times that word has been used in a text. Anyone who has used a dictionary knows how to use words to look something up. The figure shows some more examples.
Dictionary Look-up: To get to an entry in a dictionary, we use a key like a person’s name, a web domain, or an
English word. Dictionary is also called a map, hashmap, hash, and associative array.
In a phonebook, we use a name to look up an entry and get back a phone number. When we type a domain
name into a web browser, the computer looks up the name to find an IP address. A word frequency table
lets us look up a word and find out how often it was used in a group of texts. In all of these cases, we are
converting names to numbers instead of the other way around, as we would with a list. In general, we’d like
to be able to connect any two kinds of information. It gives a list of different language objects and what they
map to.
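As a concrete sketch of the word-to-property mapping described above (the entries follow the colorless/ideas example used in this chapter's summary):

>>> pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}
>>> pos['ideas']
'N'
>>> pos['colorless']
'ADJ'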
Summary
● There are different kinds of words, like nouns, verbs, adjectives, and adverbs, that can be put into groups.
Parts of speech or lexical categories are other names for these groups. Short labels, or tags, like NN, VB,
etc., are given to different parts of speech.
● Part-of-speech tagging, POS tagging, or just tagging is the process of automatically assigning parts of
speech to words in text.
● Automatic tagging is an important step in the NLP pipeline. It can be used to predict how new words will
behave, analyze how words are used in corpora, and make text-to-speech systems work better.
● A variety of tagging methods are possible, e.g. default tagger, regular expression tagger, unigram tagger
and n-gram taggers. These can be put together using a method called “backoff.”
● Backoff is a way to combine models: when a more specialized model, like a bigram tagger, can’t assign a
tag in a given context, we switch to a more general model (such as a unigram tagger).
● Part-of-speech tagging is an early and important example of a sequence classification task in NLP. At any
point in the sequence, a classification decision is made using the words and tags in the local context.
● A dictionary is used to map between arbitrary kinds of data, like a string and a number: freq['cat'] = 12. We create dictionaries using brace notation: pos = {}, pos = {'furiously': 'adv', 'ideas': 'n', 'colorless': 'adj'}.
● N-gram taggers can be made for large numbers of n, but once n is greater than 3, we usually run into the
sparse data problem. Even with a lot of training data, we only see a tiny fraction of the possible contexts.
● Transformation-based tagging involves learning a set of repair rules like "change tag s to tag t in context c." Each rule fixes some mistakes, and it may also introduce a (smaller) number of new errors.
● Document classification
● Sequence classification

But how did we know where to start looking?
Supervised Classification
Classification is the process of putting something into the right group. In simple classification tasks, each
input is looked at on its own, and the set of labels is set up ahead of time. Some examples of tasks that need
to be categorized are:
● Choosing the topic of a news story from a fixed list of topics like “sports,” “technology,” and “politics”
● Figuring out whether a given use of the word "bank" refers to a riverbank, a financial institution, the act of tilting to the side, or the act of depositing something in a financial institution
There are a lot of interesting ways to do the basic classification task. In multi-class classification, for example,
each instance can be given more than one label; in open-class classification, the set of labels is not known
ahead of time; and in sequence classification, a list of inputs is classified all at once.
A classifier is called “supervised” if it was built using training corpora in which each input was labelled with
the correct label. Figure 6.1 shows how supervised classification works as a whole.
>>> gender_features('Shrek')
{'last_letter': 'k'}

A feature extractor that fits gender features too well: its output feature sets contain a large number of very specific features, which leads to overfitting on the relatively small Names Corpus.

There are usually limits to how many features you should use with a given learning algorithm. If you give the algorithm too many features, it is more likely to rely on quirks of your training data that don't generalize to new examples. This is called "overfitting," and it can be a particular problem when working with small training sets.

The corpus data is divided into two sets: the development set and the test set. Most of the time, the development set is itself split into a training set and a dev-test set.

>>> featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
>>> devtest_names = labeled_names[500:1500]
>>> test_names = labeled_names[:500]

The training set is used to teach the model what to do, while the dev-test set is used to figure out what went wrong. The test set helps us figure out how well the system works in the end. For reasons we'll talk about below, it's important to use a separate dev-test set for error analysis instead of just the test set. This shows how the corpus data is broken up into different subsets.

After dividing the corpus into appropriate datasets, we train a model with the training set and then run it on the dev-test set:

>>> print(nltk.classify.accuracy(classifier, devtest_set))
0.75

Using the dev-test set, we can make a list of the mistakes the classifier makes when guessing the gender of a name:

>>> errors = []
>>> for (name, tag) in devtest_names:
...     guess = classifier.classify(gender_features(name))
...     if guess != tag:
...         errors.append((tag, guess, name))
...
>>> for (tag, guess, name) in sorted(errors):
...     print('correct={:<8} guess={:<8s} name={:<30}'.format(tag, guess, name))
...
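Several pieces referenced above (the gender_features() definition, the construction of labeled_names, and the classifier training) do not appear in this excerpt; a hedged sketch consistent with the NLTK book's names example is:

>>> from nltk.corpus import names
>>> import random
>>> def gender_features(word):
...     return {'last_letter': word[-1]}    # one very simple feature
...
>>> labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
...                  [(name, 'female') for name in names.words('female.txt')])
>>> random.shuffle(labeled_names)
>>> train_names = labeled_names[1500:]
>>> train_set = [(gender_features(n), gender) for (n, gender) in train_names]
>>> devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)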
Next, we define a feature extractor for documents so that the classifier will know which parts of the data to focus on (1.4). For figuring out what a document is about, we can define a feature for each word that tells us whether the document contains that word or not. So that the classifier doesn't have to deal with too many features, we start by making a list of the 2000 most frequent words in the whole corpus [1]. Then, we can define a feature extractor [2] that simply checks whether each of these words is in a given document.

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words)[:2000]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

{'contains(waste)': False, 'contains(lot)': False, ...}

Note

In [3], we compute the set of all the words in a document, instead of just checking whether a word is in the document, because it is much faster to check whether a word is in a set than in a list.

Now that we've defined our feature extractor, we can use it to train a classifier to label new movie reviews. We measure the classifier's accuracy on the test set [1] to see how well it works. Again, we can use show_most_informative_features() to find out which features the classifier thought were the most helpful [2].

featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

>>> print(nltk.classify.accuracy(classifier, test_set))
0.81
>>> classifier.show_most_informative_features(5)
Most Informative Features
   contains(outstanding) = True    pos : neg = 11.1 : 1.0
        contains(seagal) = True    neg : pos =  7.7 : 1.0
   contains(wonderfully) = True    pos : neg =  6.8 : 1.0
        contains(wasted) = True    neg : pos =  5.8 : 1.0

In this corpus, a review that mentions "Seagal" is almost 8 times more likely to be negative than positive, while a review that mentions "Damon" is about 6 times more likely to be positive.
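The documents list used above is never constructed in this excerpt; a sketch following the NLTK book's movie-reviews example would be:

>>> from nltk.corpus import movie_reviews
>>> import random
>>> documents = [(list(movie_reviews.words(fileid)), category)
...              for category in movie_reviews.categories()
...              for fileid in movie_reviews.fileids(category)]
>>> random.shuffle(documents)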
Sequence Classification

We can use joint classifier models, which pick the most appropriate labels for a whole collection of related inputs at once. One strategy, called consecutive classification or greedy sequence classification, is to find the most likely class label for the first input and then use that answer to help find the best label for the next input. The process can then be repeated until all of the inputs are labelled. This is how the bigram tagger from 5 worked: it started by putting a part-of-speech tag on the first word in the sentence, and then tagged each following word based on the word itself and the tag it had predicted for the preceding word.

This approach is demonstrated below. First, we need to add a history argument to our feature extractor function, which gives us a list of the tags we've guessed for the sentence so far [1]. Every tag in history corresponds to a word in the sentence. But keep in mind that history will only contain tags for words we've already tagged, that is, words to the left of the target word. So, while it is possible to look at some features of words to the right of the target word, we cannot look at their tags yet, because we haven't generated them yet.

After we've decided on a feature extractor, we can build our sequence classifier [2]. During training, we use the annotated tags to provide the feature extractor with the correct history; when tagging new sentences, the history is generated from the tagger's own output.

def pos_features(sentence, i, history):
    ...
    if i == 0:
        features["prev-tag"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
        features["prev-tag"] = history[i-1]
    return features

class ConsecutivePosTagger(nltk.TaggerI):

    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = pos_features(untagged_sent, i, history)
                train_set.append((featureset, tag))
                history.append(tag)
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = pos_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return zip(sentence, history)

>>> print(tagger.evaluate(test_sents))
0.79796012981
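The construction of tagger and test_sents is not included in this excerpt; a sketch following the NLTK book (using the Brown news category) would be:

>>> from nltk.corpus import brown
>>> tagged_sents = brown.tagged_sents(categories='news')
>>> size = int(len(tagged_sents) * 0.1)
>>> train_sents, test_sents = tagged_sents[size:], tagged_sents[:size]
>>> tagger = ConsecutivePosTagger(train_sents)
>>> print(tagger.evaluate(test_sents))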
Summary
● Modelling the language data in corpora can help us understand language patterns and can be used to
make predictions about new language data.
● Supervised classifiers use labelled training corpora to build models that can predict the label of an input
based on certain features of that input.
● Supervised classifiers can do a wide range of NLP tasks, such as document classification, part-of-speech
tagging, sentence segmentation, dialogue act type identification, determining entailment relations, and
many more.
● When training a supervised classifier, you should divide your corpus into three datasets: a training set for
building the classifier model, a dev-test set for helping choose and tune the model’s features, and a test set
for judging how well the final model works.
● When testing a supervised classifier, it’s important to use new data that wasn’t part of the training or
development-test set. If not, the results of your evaluation might be too optimistic.
● Decision trees are tree-shaped flowcharts that are made automatically and are used to label input values
based on their properties. Even though they are easy to understand, they are not very good at figuring out
the right label when the values of the features interact.
● In naive Bayes classifiers, each feature contributes independently to the decision about which label to use. This makes the model simple, but it cannot capture interactions between feature values, which becomes a problem when two or more features are strongly correlated: their evidence is effectively double-counted.
● Maximum Entropy classifiers use a basic model that is similar to the model used by naive Bayes. However,
they use iterative optimization to find the set of feature weights that maximizes the probability of the
training set.
● Most of the models that are automatically built from a corpus are descriptive. They tell us which features
are important for a certain pattern or construction, but they don’t tell us how those features and patterns
are related.
● How to develop and evaluate chunkers
● Recursion in linguistic structure

When text is hard to understand, it can be hard to get at the information in that text. NLP is still a long way from being able to build general-purpose meaning representations from unrestricted text. We can make a lot of progress if we instead focus on a small set of questions, or "entity relations," like "where are different facilities located?" or "who works for what company?" This chapter will try to answer the following questions:

Which corpora are best for this kind of work, and how can we use them to train and test our models?

Along the way, we'll use methods from the last two chapters to solve problems with chunking and recognizing named entities.
There are many different ways to get information. Structured data is an important type: in structured data, entities and relationships are set up in a regular and predictable way. For example, we might be interested in the relation between companies and locations, as in this excerpt from a news story:

"... for a BBDO South unit of BBDO Worldwide, which is owned by Omnicom. Ken Haldin, a spokesman for Georgia-Pacific in Atlanta, said that BBDO South ..."

If this location data was saved in Python as a list of tuples (entity, relation, entity), then the question "Which organizations work in Atlanta?" could be answered as follows:

>>> locs = [...
...         ('BBDO South', 'IN', 'Atlanta'),
...         ('Georgia-Pacific', 'IN', 'Atlanta')]
>>> query = [e1 for (e1, rel, e2) in locs if e2 == 'Atlanta']

Figure 7.1: A Simple Pipeline Architecture for a System that Pulls Information
Figure 7.2: Both Token Level and Chunk Level Segmentation and Labelling
In this section, we’ll talk about chunking in more detail. We’ll start by looking at what chunks are and how
they’re shown. We will look at the regular expression and n-gram approaches to chunking and use the
CoNLL-2000 chunking corpus to build and test chunkers. Then, in steps 5 and 6, we’ll go back to the tasks of
recognizing named entities and pulling out their relationships.
First, we'll talk about the task of noun phrase chunking, or NP-chunking, in which we look for chunks that match each noun phrase. One example is text from the Wall Street Journal with its NP chunks marked using square brackets. A sketch of a simple regular-expression NP chunker is shown below.
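As a hedged illustration of the regular-expression approach mentioned above (this particular grammar and tagged sentence follow the NLTK book's NP-chunking example):

>>> grammar = "NP: {<DT>?<JJ>*<NN>}"   # optional determiner, any adjectives, then a noun
>>> cp = nltk.RegexpParser(grammar)
>>> sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
...             ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
>>> print(cp.parse(sentence))
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))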
Now you know what chunking is and how it works, but we haven't told you how to judge chunkers. As usual, this needs a corpus with good annotations. We start by looking at how IOB format can be changed into an NLTK tree, and then we look at how this can be done on a larger scale using a chunked corpus. We'll look at how to rate a chunker's accuracy in relation to a corpus, and then look at some more data-driven ways to find NP chunks. Our main goal will be to cover more ground with a chunker.

IOB Format and the CoNLL 2000 Corpus: How to Read Them

We can load text from the Wall Street Journal that has been tagged and chunked using the IOB notation:

... of IN B-PP
... vice NN B-NP
... chairman NN I-NP
... of IN B-PP
... Carlyle NNP B-NP
... Group NNP I-NP
... , , O
... a DT B-NP
... merchant NN I-NP
... banking NN I-NP
... concern NN I-NP
... . . O
... '''

Figure 7.4

We can get to more chunked text with the help of the NLTK corpus module. The CoNLL 2000 corpus has 270k words of text from the Wall Street Journal. This text is split into "train" and "test" sections and annotated with tags for parts of speech and chunks in the IOB format. Using nltk.corpus.conll2000, we can get to the data. Here's how the 100th sentence of the "train" section of the corpus looks:

>>> from nltk.corpus import conll2000
>>> print(conll2000.chunked_sents('train.txt', chunk_types=['NP'])[99])
(S
  Over/IN
  (NP a/DT cup/NN)
  of/IN
  (NP coffee/NN)
  ,/,
  (NP Mr./NNP Stone/NNP)
  told/VBD
  (NP his/PRP$ story/NN)
  ./.)

Training Chunkers Based on Classifiers
We can use a classifier-based tagger to chunk the sentence, which is one way to use information about what the words mean. Like the n-gram chunker we talked about in the last section, this classifier-based chunker will work by giving each word in a sentence an IOB tag and then turning those tags into chunks. We will use the same method to build a classifier-based tagger as we did to build a part-of-speech tagger in 1.

The classifier-based NP chunker's basic code is shown below. There are two classes in it. The first class is very similar to the ConsecutivePosTagger class from 1.5. The only two differences are that it uses a MaxentClassifier instead of a NaiveBayesClassifier and calls a different feature extractor. The second class is a wrapper around the tagger class that turns it into a chunker. During training, this second class turns the chunk trees in the training corpus into tag sequences. In the parse() method, it turns the tag sequence given by the tagger back into a chunk tree.

class ConsecutiveNPChunkTagger(nltk.TaggerI):

    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = npchunk_features(untagged_sent, i, history)
                train_set.append((featureset, tag))
                history.append(tag)
        self.classifier = nltk.MaxentClassifier.train(
            train_set, algorithm='megam', trace=0)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = npchunk_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return zip(sentence, history)

class ConsecutiveNPChunker(nltk.ChunkParserI):

    def __init__(self, train_sents):
        tagged_sents = [[((w, t), c) for (w, t, c) in
                         nltk.chunk.tree2conlltags(sent)]
                        for sent in train_sents]
        self.tagger = ConsecutiveNPChunkTagger(tagged_sents)

    def parse(self, sentence):
        tagged_sents = self.tagger.tag(sentence)
        conlltags = [(w, t, c) for ((w, t), c) in tagged_sents]
        return nltk.chunk.conlltags2tree(conlltags)
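The feature extractor npchunk_features() is never defined in this excerpt; the simplest version consistent with the calls above (the NLTK book starts with just the current word's POS tag as a feature) is:

>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     return {"pos": pos}    # use only the current POS tag as a feature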
Information extraction systems look through large amounts of unrestricted text for certain types of entities
and relationships and use them to fill well-organized databases. Then, questions can be asked of these
databases to find answers.
A typical information extraction system starts by segmenting, tokenizing, and tagging the text with its parts of
speech. The data that comes out of this is then searched for certain kinds of entities. Lastly, the information
extraction system looks at entities that are mentioned close to each other in the text and tries to figure out if
they have certain relationships.
Entity recognition is often done with chunkers, which segment sequences of one or more tokens and label each chunk with the type of entity it refers to. Common entity types include people, places, dates, times, monetary amounts, and GPEs (geo-political entities).
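As a concrete illustration (this snippet is a sketch added here, not part of the original text), NLTK's off-the-shelf named entity chunker can be applied to a tagged sentence as follows; it assumes the relevant tokenizer, tagger, and chunker models have been downloaded with nltk.download():

import nltk

sent = "WASHINGTON -- In the wake of a string of abuses, the U.S. Congress acted."
tokens = nltk.word_tokenize(sent)
tagged = nltk.pos_tag(tokens)
tree = nltk.ne_chunk(tagged)   # subtrees are labeled with entity types such as GPE or ORGANIZATION
print(tree)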
Chunkers can be made with rule-based systems, like NLTK’s RegexpParser class, or with machine learning
techniques, like the ConsecutiveNPChunker shown in this chapter. When looking for chunks, part-of-speech
tags are often a very important feature.
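For example, a minimal rule-based NP chunker can be written with a single tag pattern over part-of-speech tags (the pattern below is illustrative, not one taken from the original text):

import nltk

# One chunk rule: an optional determiner, any number of adjectives, then one or more nouns.
grammar = r"NP: {<DT>?<JJ>*<NN.*>+}"
cp = nltk.RegexpParser(grammar)

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
print(cp.parse(sentence))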
Even though chunkers are designed to make relatively flat data structures where no two chunks can overlap,
they can be chained together to make structures with multiple levels.
Relation extraction can be done with either rule-based systems or machine-learning systems. Rule-based
systems look for specific patterns in the text that link entities and the words in between, while machine-
learning systems try to learn these patterns automatically from a training corpus.
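A rule-based relation extractor of the kind described here can be sketched with nltk.sem.extract_rels; the regular expression, entity types, and document name below are illustrative choices, and the IEER corpus must be available locally:

import re
import nltk

# Look for ORGANIZATION-in-LOCATION pairs, where the text between the two
# entities matches the pattern (the word "in", but not as part of "-ing" forms).
IN = re.compile(r'.*\bin\b(?!\b.+ing)')
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
        print(nltk.sem.rtuple(rel))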
● Context free grammar
● Parsing with context free grammar

But these methods only scratch the surface of the many rules that sentences have to follow. We need a way to handle the ambiguity that comes with natural language. We also need to be able to deal with the fact that there are an infinite number of possible sentences, and that we can only write finite programs to analyze their structures and figure out what they mean.
Some Grammatical Dilemmas

Language Information and Endless Possibilities

In previous chapters, we talked about how to process and analyze text corpora and how hard it is for NLP to deal with the huge amount of electronic language data that keeps growing every day. Let's look at this information more closely and pretend for a moment that we have a huge corpus of everything said or written in English over, say, the last 50 years. Could we call this group of words "the language of modern English"? There are several reasons why we might say "No."

Remember that in step 3, we asked you to look for the pattern "the of" on the web? Even though it's easy to find examples of this word sequence on the web, like "New man at the of IMG" (http://www.telegraph.co.uk/sport/2387900/New-man-at-the-of-IMG.html), English speakers will say that most of these examples are mistakes and aren't really English at all.

So, we can say that "modern English" is not the same as the very long string of words in our made-up corpus. People who speak English can judge these sequences, and some of them will be wrong because they don't follow the rules of grammar.

Also, it's easy to write a new sentence that most English speakers will agree is good English. One interesting thing about sentences is that they can be put inside of bigger sentences. Take a look at the following:

a. Usain Bolt set a new record for the 100m.

b. The Jamaica Observer said that Usain Bolt set a new record for the 100m.

c. Andre said The Jamaica Observer said that Usain Bolt set a new record in the 100m.

d. Andre might have said that the Jamaica Observer said Usain Bolt broke the 100m record.

If we replaced whole sentences with the letter S, we would see patterns like "Andre said S" and "I think S." These are examples of how to use a sentence to make a longer sentence. We can also use templates like S but S and S when S. With a little creativity, we can use these templates to make some really long sentences. In a Winnie the Pooh story by A.A. Milne, Piglet is completely surrounded by water, which is a great example:

You can guess how happy Piglet was when he finally saw the ship. In later years, he liked to think that he had been in Very Great Danger during the Terrible Flood, but the only real danger he had been in was the last half-hour of his imprisonment, when Owl, who had just flown in, sat on a branch of his tree to comfort him and told him a very long story about an aunt who laid a seagull's egg by mistake. The story went on and on, like this sentence, until Piglet, who was listening out of his window, finally saw the good ship Brain of Pooh (Captain C. Robin and 1st Mate P. Bear) coming over the sea to save him, and he was very happy.

This long sentence has a simple structure of the form S but S when S. We can see from this that language provides constructions which seem to let us extend sentences indefinitely.

Ubiquitous Ambiguity

Animal Crackers, a 1930 movie with Groucho Marx, is a well-known example of ambiguity:

While hunting in Africa, I shot an elephant in my pajamas. How he got into my pajamas, I don't know.

(S
  (NP I)
  (VP
    (VP (V shot) (NP (Det an) (N elephant)))
    (PP (P in) (NP (Det my) (N pajamas)))))

(S
  (NP I)
  (VP
    (V shot)
    (NP (Det an) (N elephant) (PP (P in) (NP (Det my) (N pajamas))))))
Figure 8.1
Notice that none of the words are hard to understand; for example, the word "shot" doesn't refer to using a gun in the first analysis and using a camera in the second. The ambiguity lies in how the words are put together.

Make a note: Think about the following sentences and see if you can come up with two very different ways to understand them: Fighting animals could be dangerous. Visiting relatives can be tiring. Is it because some of the words aren't clear? If not, why is there a lack of clarity?

This chapter talks about grammars and parsing, which are formal and computer-based ways to study and model the things we've been talking about in language. As we'll see, patterns of well-formedness and ill-formedness in a string of words can be understood by looking at the structure and dependencies of the phrase. With grammars and parsers, we can make formal models of these structures. As before, one of the main reasons is to understand natural language. How much more of a text's meaning can we understand if we can reliably recognize the structures of language it uses? Can a program "understand" what it has read in a text well enough to answer simple questions like "what happened" or "who did what to whom?" Like before, we will make simple programs to process annotated corpora and do useful things with them.

What's the Use of Syntax?

In this unit, we showed how to use the frequency information in bigrams to make text that seems fine for short strings of words but quickly turns into gibberish. Here are two more examples we made by finding the bigrams in a children's story:

a. He laughed with me as I watched the bucket slide down his back.

b. The worst part was that it was hard to find out who heard the light.

You know that these sequences are "word salad," but you might not be able to put your finger on exactly what is wrong with them. One benefit of learning grammar is that it gives you a framework for thinking about these ideas and a vocabulary for putting them into words. Let's look more closely at the sequence the worst part and clumsy looking. This looks like a coordinate structure, where two phrases are joined by a coordinating conjunction like and, but, or or. Here's a simple, informal explanation of how coordination works in a sentence:

Coordinate Structure:

If v1 and v2 are both phrases that belong to category X, then "v1 and v2" is also a phrase that belongs to category X.

Here are just a few examples. In the first, two NPs have been put together to make an NP, and in the second, two APs have been put together to make an AP.

a. For me, the ending of the book was (NP the worst part and the best part).

b. When they are on land they are (AP slow and clumsy looking).

We can't put an NP and an AP together, which is why the worst part and clumsy looking is ungrammatical. Before we can make these ideas official, we need to understand the idea of constituent structure.
The idea behind constituent structure is that words combine with other words to make units. The fact that a
group of words can be replaced by a shorter group without making the sentence not make sense shows that
they are a unit. To understand this better, think about the next sentence:
The little bear saw the nice, fat trout in the brook.
The fact that we can substitute He for The little bear shows that the latter sequence is a unit. By contrast, we cannot replace little bear saw in the same way.
In this chapter, we replace longer sequences with shorter ones in a way that doesn’t change the grammar.
Each group of words that makes up a unit can actually be replaced by a single word, leaving only two elements.
the little bear   saw   the fine fat trout   in the brook
He                saw   it                   there
He                ran   there
He                ran

NP  V   NP  PP     He saw it there
NP  VP  PP         He ran there
NP  VP             He ran
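The parser discussion that follows assumes a small context-free grammar covering categories like these. A sketch in NLTK (the exact productions are an assumption here; they mirror the grammar1 that the recursive descent and shift-reduce examples below refer to) might be:

import nltk

# A toy grammar of the kind assumed by the parser examples in this chapter.
grammar1 = nltk.CFG.fromstring("""
  S -> NP VP
  VP -> V NP | V NP PP
  PP -> P NP
  V -> "saw" | "ate" | "walked"
  NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
  Det -> "a" | "an" | "the" | "my"
  N -> "man" | "dog" | "cat" | "telescope" | "park"
  P -> "in" | "on" | "by" | "with"
""")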
The parser starts with a tree containing the node S; at each stage, it looks at the grammar to find a production
that can be used to grow the tree; when a lexical production is found, its word is compared to the input; once
a complete parse has been found, the parser goes back to look for more parses.
During this process, the parser often has to choose between several possible productions. For example, when it moves from step 3 to step 4, it looks for productions with N on the left-hand side. N -> 'man' is the first of these. When this doesn't match, it backtracks and tries the other N productions in order until it gets to N -> 'dog', which matches the next word in the input sentence. Step 5 shows that it finds a full parse a long time after that. This tree covers the whole sentence and doesn't leave any loose ends. After finding a parse, we can tell
the parser to look for more parses. Again, it will go back and look at other production options to see if any of
them lead to a parse.
>>>rd_parser = nltk.RecursiveDescentParser(grammar1)
>>>sent = 'Mary saw a dog'.split()
>>>for tree in rd_parser.parse(sent):
...     print(tree)
(S (NP Mary) (VP (V saw) (NP (Det a) (N dog))))
There are three main problems with recursive descent parsing. First, left-recursive productions like NP -> NP PP send it into an endless loop. Second, the parser spends a lot of time considering words and structures that don't belong in the sentence it is trying to analyze. Third, the backtracking process may throw away parsed parts that will have to be rebuilt later. For example, if you backtrack over VP -> V NP, the subtree you made for the NP will be thrown away. If the parser then tries VP -> V NP PP, the NP subtree must be built again from scratch.

Recursive descent parsing is a kind of top-down parsing. Before looking at the input, top-down parsers use a grammar to predict what it will be. But since the input sentence is always available to the parser, it would make more sense to start from the input sentence. This is called "bottom-up parsing," and we'll look at an example in the next section.

Shift-Reduce Parsing

Shift-reduce parsers are a simple type of bottom-up parser. Like all bottom-up parsers, a shift-reduce parser tries to find sequences of words and phrases that match the right-hand side of a grammar production and replace them with the left-hand side, until the whole sentence is reduced to an S.

The shift-reduce parser keeps adding the next word from the input stream to a stack (4.1). This is called the shift operation. If the top n items on the stack match the n items on the right-hand side of a production, they are all taken off the stack and the item on the left-hand side of the production is pushed onto the stack. This replacement of the top n items with a single item is the reduce operation. The reduce operation can only be applied to the top of the stack; items lower in the stack must be reduced before later items are pushed on. The parser is done when all of the input has been consumed and there is only one item left on the stack: an S node at the top of a parse tree. During these steps, the shift-reduce parser builds a parse tree: each time it pops n items off the stack, it combines them into a partial parse tree and pushes that back onto the stack. With the graphical demonstration nltk.app.srparser(), we can see how the shift-reduce parsing algorithm works; the demonstration steps through the parser's operation in six stages.
ShiftReduceParser() is a simple way for NLTK to implement a shift-reduce parser. This parser doesn’t do any
backtracking, so even if there is a parse for a text, it might not find it. Also, it will only find one parse, even if
there are more than one. We can give an optional trace parameter that tells the parser how much information
to give about the steps it takes to read a text:
>>>sr_parser = nltk.ShiftReduceParser(grammar1)
>>>sent = 'Mary saw a dog'.split()
>>>for tree in sr_parser.parse(sent):
...     print(tree)
(S (NP Mary) (VP (V saw) (NP (Det a) (N dog))))
Make a note: You can run the above parser in tracing mode to see the sequence of shift and reduce operations, using sr_parser = nltk.ShiftReduceParser(grammar1, trace=2).
A shift-reduce parser can get stuck and fail to find a parse, even if the sentence it is given is correct from a
grammar point of view. When this happens, there are no more inputs, and the stack has things that can’t be
turned into a S. The problem is that there are choices that were made earlier that the parser can’t take back
(although users of the graphical demonstration can undo their choices). The parser has to make two kinds
of decisions: (a) which reduction to do when there is more than one option, and (b) whether to shift or reduce
when both options are available.
A shift-reduce parser can be made to work with policies that help solve these kinds of problems. For example,
it may handle shift-reduce conflicts by only shifting when no reductions are possible, and it may handle
reduce-reduce conflicts by favoring the reduction operation that removes the most items from the stack. A
“lookahead LR parser,” which is a more general version of a shift-reduce parser, is often used in programming
language compilers.
Shift-reduce parsers are better than recursive descent parsers because they only put together structure that
matches the words in the input. Also, they only build each substructure once. For example, NP(Det(the),
N(man)) is only built and put on the stack once, no matter if it will be used by the VP -> V NP PP reduction or
the NP -> NP PP reduction.
Summary
● Sentences have their own structure, which can be shown with a tree. Recursion, heads, complements, and
modifiers are all important parts of constituent structure.
● A grammar is a formal way to describe whether or not a given phrase can have a certain structure of parts
or dependencies.
● Given a set of syntactic categories, a context-free grammar uses a set of productions to show how a phrase of category A can be broken down into a sequence of smaller parts α1 ... αn.
● A dependency grammar uses "productions" to specify the dependents of a given lexical head.
● When a sentence can be understood in more than one way, this is called syntactic ambiguity (e.g.
prepositional phrase attachment ambiguity).
● A parser is a method for finding one or more trees that match a sentence that is correct in terms of
grammar.
● The recursive descent parser is a simple top-down parser that tries to match the input sentence by
recursively expanding the start symbol (usually S) with the help of grammar productions. This parser can’t
deal with left-recursive productions, such as NP -> NP PP. It is inefficient because it expands categories
without checking to see if they work with the input string and because it expands the same non-terminals
over and over again and then throws away the results.
● The shift-reduce parser is a simple bottom-up parser. It puts input on a stack and tries to match the items
at the top of the stack with the grammar productions on the right. Even if there is a valid parse for the input,
this parser is not guaranteed to find it. It also builds substructure without checking if it is consistent with
the grammar as a whole.
● Context free grammar
● Parsing with context free grammar

Instead of treating grammatical categories as atomic labels, we break them up into structures like dictionaries, where each feature can have a number of different values.
Processing Feature Structures

In this section, we'll show you how to build and change feature structures in NLTK. We will also talk about unification, which is the basic operation that lets us combine the information in two different feature structures.

In NLTK, the FeatStruct() constructor is used to declare a feature structure. The values of atomic features can be either strings or numbers.

>>>fs1 = nltk.FeatStruct(TENSE='past', NUM='sg')
>>>print(fs1)
[ NUM   = 'sg'   ]
[ TENSE = 'past' ]

A feature structure can be indexed like a dictionary, and new features can be added by assignment. Here fs1 is a structure with person, number, and gender features, and we add a CASE feature to it:

>>>fs1 = nltk.FeatStruct(PER=3, NUM='pl', GND='fem')
>>>print(fs1['GND'])
fem
>>>fs1['CASE'] = 'acc'

We can also define a feature structure whose AGR feature is itself a feature structure:

>>>fs2 = nltk.FeatStruct(POS='N', AGR=fs1)
>>>print(fs2)
[       [ CASE = 'acc' ] ]
[ AGR = [ GND  = 'fem' ] ]
[       [ NUM  = 'pl'  ] ]
[       [ PER  = 3     ] ]
[                        ]
[ POS = 'N'              ]
>>>print(fs2['AGR'])
[ CASE = 'acc' ]
[ GND  = 'fem' ]
[ NUM  = 'pl'  ]
[ PER  = 3     ]
>>>print(fs2['AGR']['PER'])
3

Seeing feature structures as graphs, more specifically directed acyclic graphs, is often helpful; the graph carries the same information as the AVM above. The names of the features are written as labels on the directed arcs, and the values of the features are written as labels on the nodes that the arcs point to.
Figure 9.2
Now, let’s say Lee is married to a woman named Kim, and Kim lives at the same address as Lee.
Figure 9.3
But instead of putting the address information twice in the feature structure, we can “share” the same sub-
graph across multiple arcs:
Figure 9.4
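In NLTK's bracketed notation this kind of sharing is written with a numbered tag; a small sketch (the names and values follow the Lee/Kim example above and are illustrative) looks like this:

import nltk

# Structure sharing (re-entrancy): the tag (1) labels the shared ADDRESS value,
# and ADDRESS->(1) points back at that same sub-structure instead of copying it.
fs = nltk.FeatStruct("""[NAME='Lee',
                         ADDRESS=(1)[NUMBER=74, STREET='rue Pascal'],
                         SPOUSE=[NAME='Kim', ADDRESS->(1)]]""")
print(fs)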
V[SUBCAT=clause, TENSE=pres, NUM=sg] -> ...

... asked for. But walk will only occur in VPs that ...
Figure 9.5
This means that the verb can go with three arguments. The NP on the far left of the list is the subject NP. Everything else — in this case an NP followed by a PP — is a subcategorized-for complement. When a verb like "put" is combined with the right complements, the SUBCAT requirements are discharged, and only a subject NP is still needed. This category, which corresponds to what people usually think of as VP, could be shown as follows:

V[SUBCAT=<NP>]

Lastly, a sentence is a kind of verbal category that needs no more arguments at all, so its SUBCAT value is the empty list. The tree in the next figure shows the analysis of "Kim put the book on the table."
Figure 9.6
In the last section, we talked about how we could say more general things about verb properties by separating
subcategorization information from the main category label. The following is another trait of this kind: Heads
of phrases in category VP are expressions from category V. In the same way, NPs have Ns as their heads, APs have As as their heads, and PPs have Ps as their heads. Not all phrases have heads. For example, it's common
to say that coordinate phrases (like “the book and the bell”) don’t have heads. However, we’d like our grammar
formalism to show the parent-child relationship where it makes sense. V and VP are just atomic symbols
right now, and we need to find a way to connect them using their features (as we did earlier to relate IV and
TV).
Figure 9.7
N is the head of the structure (a), and N’ and N” are projections of N. N” is the maximum projection, and N is
sometimes called the zero projection. One of the main ideas behind X-bar syntax is that all of its parts have
the same structure. Using X as a variable over N, V, A, and P, we say that directly subcategorized complements
of a lexical head X are always placed as siblings of the head, while adjuncts are placed as siblings of the
intermediate category, X’. So, the arrangement of the two P” add-ons in the next figure is different from that of
the P” complement in (a). The productions in next figure show how feature structures can be used to encode
bar levels. The nested structure in the last figure is made by expanding N[BAR=1] twice with the recursive
rule.
Summary
● The conventional categories of a context-free grammar are atomic symbols. Feature structures are used to manage subtle differences efficiently, avoiding the need for a vast number of atomic categories.
● Using variables in place of feature values allows us to impose constraints in grammar productions, enabling
interdependent realization of different feature specifications.
● Typically, we assign fixed values to features at the lexical level and ensure that the values of features in
phrases match those of their children.
● Feature values can be either simple or complex. Boolean values, denoted as [+/- f], are one type of atomic
value.
● When two entities share the same value (either atomic or complex), we refer to the structure as re-entrant.
● In feature structures, a path is a tuple of features that corresponds to a sequence of arcs from the graph's root.
● Two paths are considered the same if they lead to the same value.
● Feature structures are partially ordered by subsumption: FS0 subsumes FS1 when all the information in FS0 is also present in FS1.
● The unification of two structures FS0 and FS1, if it succeeds, is the feature structure FS2 that contains all the information from FS0 and FS1.
● If unification adds information to a path π in FS, it also adds information to every equivalent path π’.
● Feature structures allow us to provide concise analyses of various linguistic phenomena, such as verb
subcategorization, inversion constructions, unbounded dependency constructions, and case government.
Querying a Database
Imagine that we have a program that lets us type in a question in natural language and gives us the right answer back:

a. Which country is Athens in?

b. Greece.
How hard would it be to write a program like this? And can we just use the same methods we’ve seen in this
book so far, or do we need to learn something new? In this section, we’ll show that it’s pretty easy to solve
the task in a limited domain. But we will also see that if we want to solve the problem in a more general way,
we need to open up a whole new set of ideas and methods that have to do with how meaning is represented.
So, let’s start by assuming we have structured data about cities and countries. To be more specific, we’ll use
a database table whose first few rows are shown.
Note

The information shown in the table comes from the Chat-80 system (Warren & Pereira, 1982). The population figures are given in thousands; keep in mind that the data used in these examples dates back to the 1980s.

city_table (extract): city, population (in thousands), country

berlin       3481   east_germany
birmingham   1112   united_kingdom

SQL, which stands for "Structured Query Language," is a language used to get information from and manage relational databases. If you want to learn more about SQL, you can use the website http://www.w3schools.com/sql/.

The feature-based grammar "sql0.fcfg" shows how to assemble a meaning representation for a sentence in parallel with parsing it. Here are a few of its productions:

>>>nltk.data.show_cfg('grammars/book_grammars/sql0.fcfg')
% start S
...
PP[SEM=(?p + ?np)] -> P[SEM=?p] NP[SEM=?np]
AP[SEM=?pp] -> A[SEM=?a] PP[SEM=?pp]
NP[SEM='Country="greece"'] -> 'Greece'
NP[SEM='Country="china"'] -> 'China'
Together, "sql0.fcfg" and the NLTK Earley parser let us translate English into SQL: the SQL translation for the whole sentence is assembled from the translations of its components, and the component meanings come from the grammar. We join the pieces and print the resulting query:

>>>q = ' '.join(answer)
>>>print(q)
SELECT City FROM city_table WHERE Country="china"

Note

Your Turn: Run the parser with maximum tracing turned on (cp = load_parser('grammars/book_grammars/sql0.fcfg', trace=3)) and watch how the values of "sem" change as complete edges are added to the chart.

Finally, we run the query against the "city.db" database and get some results:

canton chungking dairen harbin kowloon mukden peking shanghai sian tientsin

Since each row r is a one-element tuple, we print out its member.

We defined a task where the computer returns useful data in response to a natural language query, and we implemented it by translating a small subset of English into SQL. Python can execute SQL queries against a database, so our NLTK code "understands" queries like "What cities are in China." Calling this natural language understanding is like saying that translating the Dutch sentence "Margrietje houdt van Brunoke" into English shows that you understand Dutch: if you know the meanings of the words and how they're combined, you could say the sentence means "Margrietje loves Brunoke." An observer, Olga, may conclude that you understand. But Olga must know English; your Dutch-to-English translation won't persuade her if she doesn't. We'll revisit this point soon.

We also "hard-wired" a lot of database details into the grammar. We need the table and field names (e.g., "city_table"), so if our database had the same rows of data but different table and field names, the SQL queries wouldn't run. Equally, we could have stored our data in a different format, such as XML, and then translated our English queries into an XML query language instead of SQL. These considerations suggest that we should translate English into something more abstract and general than SQL.

To illustrate, consider another English question and its translation:

a. Which Chinese cities have above 1,000,000 people?

b. SELECT City FROM city_table WHERE Country="china" AND Population > 1000

Your Turn: Extend "sql0.fcfg" to translate (a) into (b), then check the query's results. Before tackling conjunction, you may find it easiest to use the grammar to answer questions like "What cities have populations over 1,000,000?" Compare your answer to "grammars/book_grammars/sql1.fcfg" in the NLTK data distribution.

In the situation we have been describing, there are two entities, Margrietje and her doll Brunoke, and the two entities stand in a love relationship: "Margrietje" refers to Margrietje, "Brunoke" to Brunoke, and "houdt van" to the love relationship. If this is how things are in situation s, then the sentence is true in s. With this we have introduced two semantic concepts: declarative sentences can be true or false, and definite noun phrases and proper nouns refer to things in the world.
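For reference, the end-to-end pipeline sketched in this section can be written out as follows (a sketch following the steps described above; it assumes the book grammars and the Chat-80 city database have been installed via nltk.download()):

import nltk
from nltk import load_parser
from nltk.sem import chat80

# Parse an English question with the feature grammar, read off the SEM values,
# join them into an SQL string, and run it against the city database.
cp = load_parser('grammars/book_grammars/sql0.fcfg')
query = 'What cities are located in China'
trees = list(cp.parse(query.split()))
answer = trees[0].label()['SEM']
answer = [s for s in answer if s]          # drop empty semantic fragments
q = ' '.join(answer)
print(q)                                    # SELECT City FROM city_table WHERE Country="china"

rows = chat80.sql_query('corpora/city_database/city.db', q)
for r in rows:
    print(r[0], end=" ")                    # canton chungking dairen ...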
We started by translating (1a) into an SQL query, but the meaning of the question is independent of any query language. We interpreted it using ... (the Principle of Compositionality; see Gleitman & Liberman, 1995). Such representations also allow us to reason effectively; we may inquire whether a group of sentences is consistent, for instance.
Now, we want to integrate semantic representation construction with parsing. (29) illustrates the type of
analyses we want to build.
Figure 10.2
The root sem value represents the whole sentence, whereas smaller sem values indicate sentence parts. Since sem values are special, they are enclosed in angle brackets.

How do we design grammatical rules that get this result? Our technique will be similar to that used for the grammar "sql0.fcfg" at the beginning of this chapter. We will assign semantic representations to lexical nodes and then assemble each phrase's semantic representation from those of its child nodes. In this case, though, function application will replace string concatenation. Here is the rule for S:

S[SEM=<?vp(?np)>] -> NP[SEM=?np] VP[SEM=?vp]

The NP and VP children supply the sem values ?np and ?vp, and the rule constructs the S parent's sem value by applying ?vp as a function expression to ?np; ?vp must therefore denote a function whose domain includes ?np. (When sem is a variable, the angle brackets are omitted.) This is compositionality in semantics at work.

To finish the grammar, we use the rules below:

VP[SEM=?v] -> IV[SEM=?v]
NP[SEM=<cyril>] -> 'Cyril'
IV[SEM=<\x.bark(x)>] -> 'barks'

The VP rule says that the semantics of the parent is the same as the semantics of its head child. The meanings of "Cyril" and "barks" are given by non-logical constants introduced in the two lexical rules. In the entry for "barks," there is an extra notation that we will explain in a moment.

Before going into more detail about compositional semantic rules, we need to add the λ calculus to our toolbox. This gives us a very useful way to combine first-order logic expressions as we put together a representation of the meaning of an English sentence.
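Putting these rules together, a runnable sketch looks like the following (the grammar string is assembled here from the fragments shown above, so treat it as an approximation rather than the chapter's own grammar file):

import nltk
from nltk.grammar import FeatureGrammar
from nltk.parse import FeatureChartParser

gram = FeatureGrammar.fromstring(r"""
  % start S
  S[SEM=<?vp(?np)>] -> NP[SEM=?np] VP[SEM=?vp]
  VP[SEM=?v] -> IV[SEM=?v]
  NP[SEM=<cyril>] -> 'Cyril'
  IV[SEM=<\x.bark(x)>] -> 'barks'
""")

parser = FeatureChartParser(gram)
for tree in parser.parse('Cyril barks'.split()):
    print(tree.label()['SEM'])   # should print the reduced expression bark(cyril)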
Summary
● First-order logic is a suitable language for representing natural language meaning in a computer setting.
It is flexible enough to represent many useful parts of natural language meaning, and efficient theorem
provers exist for reasoning with first-order logic. (In natural language semantics, there are also a number
of things that are thought to require more powerful logical mechanisms.)
● Besides translating natural language sentences into first-order logic, we can also examine models of first-
order formulas to determine when these sentences are true.
● To build meaning representations compositionally, we supplement first-order logic with the λ calculus.
● In the λ-calculus, β-reduction is the same thing as applying a function to an argument. In terms of syntax,
it means replacing a variable bound by λ in the function expression with the argument expression in the
function application.
● A key part of building a model is providing a valuation that assigns interpretations to the non-logical constants. These can be interpreted as either n-ary predicates or individual constants.
● An expression that has one or more free variables is called an “open expression.” Only when their free
variables receive values from a variable assignment does an open expression obtain a meaning.
● Quantifiers are interpreted by constructing, for a formula φ[x] open in variable x, the set of individuals
that make φ[x] true when an assignment g assigns them as the value of x. The quantifier then places
constraints on that set.
● A closed expression is one that doesn't have any free variables; that is, all of its variables are bound. A closed sentence is true or false regardless of the variable assignment used.
● If the only difference between two formulas is the name of the variable bound by the binding operator
(either λ or a quantifier), then they are equivalent. α-conversion is what happens when you change the
name of a variable that is bound in a formula.
● When there are two quantifiers Q1 and Q2 nested within each other in a formula, the outermost quantifier
Q1 is said to have wide scope (or scope over Q2). Often, it’s not clear what the meaning of the quantifiers
in an English sentence is.
● English sentences can be associated with a semantic representation by treating sem as a feature in a
feature-based grammar. The sem value of a complex expression typically involves functional application
of the sem values of the component expressions.
The TIMIT corpus of read speech was the first database of annotated speech that was widely shared, and
it is very well put together. TIMIT was made by a group of companies, including Texas Instruments and MIT,
which is where its name comes from. It was made so that data could be used to learn about acoustics and
phonetics and to help develop and test automatic speech recognition systems.
Like the Brown Corpus, which has a good mix of different types of texts and sources, TIMIT has a good mix of
different dialects, speakers, and materials. For each of the eight dialect regions, 50 male and female speakers
of a range of ages and levels of education each read ten carefully chosen sentences. Two sentences, read by
all the speakers, were meant to show how different their dialects were:
a. She had your dark suit in greasy wash water all year.

b. Don't ask me to carry an oily rag like that.
The rest of the sentences were chosen because they had a lot of phonemes (sounds) and a wide range of
diphones (phone bigrams). Also, the design strikes a balance between having multiple speakers say the
same sentence so that differences between speakers can be seen and having a wide range of sentences
in the corpus so that diphones are covered as much as possible. Each speaker reads five sentences that
are also read by six other speakers (for comparability). The last three sentences each speaker read were
different from the others (for coverage).
The phones() method can be used to get to each item’s phonetic transcription. We can use the usual method
to get to the word tokens that match. Both access methods have an optional argument called “offset” that
gives the start and end points in the audio file where the corresponding span begins and ends.
>>>phonetic = nltk.corpus.timit.phones(‘dr1-fvmh0/sa1’)
>>>phonetic
[‘h#’, ‘sh’, ‘iy’, ‘hv’, ‘ae’, ‘dcl’, ‘y’, ‘ix’, ‘dcl’, ‘d’, ‘aa’, ‘kcl’,
‘s’, ‘ux’, ‘tcl’, ‘en’, ‘gcl’, ‘g’, ‘r’, ‘iy’, ‘s’, ‘iy’, ‘w’, ‘aa’,
‘sh’, ‘epi’, ‘w’, ‘aa’, ‘dx’, ‘ax’, ‘q’, ‘ao’, ‘l’, ‘y’, ‘ih’, ‘ax’, ‘h#’]
>>>nltk.corpus.timit.word_times(‘dr1-fvmh0/sa1’)
In addition to this text data, TIMIT has a lexicon with the correct way to say each word, which can be compared
to a particular sentence:
>>>timitdict = nltk.corpus.timit.transcription_dict()
>>>timitdict['greasy'] + timitdict['wash'] + timitdict['water']
>>>phonetic[17:30]
['g', 'r', 'iy', 's', 'iy', 'w', 'aa', 'sh', 'epi', 'w', 'aa', 'dx', 'ax']

This gives us an idea of what a speech processing system would have to do to produce or recognize speech in this dialect. Lastly, TIMIT includes demographic information about the speakers, which makes it possible to study voice, social, and gender traits in great detail.

Lastly, keep in mind that even though TIMIT is a speech corpus, its transcriptions and other data are just text that can be processed by programs just like any other text corpus, so many of the computational methods explained in this book are useful here too. Also, keep in mind that all of the types of data in the TIMIT corpus fall into the two main groups of lexicon and text, which we will talk about in more detail below. Even the demographic information about the speakers is just another example of the lexicon data type.

This last fact isn't as surprising when you think about how text retrieval and databases, which are two areas of computer science that deal with data management, are based on text and record structures.

The original linguistic event, which is captured as an audio recording, is very different from the annotations of that event, which make it possible to search the corpus for things that weren't thought of when it was recorded. Text corpora are similar, in that the original text usually comes from somewhere else and is treated as a fixed artifact. Even a simple change like ...

Life Cycle of a Corpus

Corpora don't come into the world fully formed. Instead, they are made through careful planning and input from many people over a long period of time. Raw data needs to be gathered, cleaned, written down, and stored in a way that makes sense. There are different ways to annotate, and some of them require expert knowledge of the morphology or syntax of the language. At this stage, success depends on creating a good workflow with the right tools and format converters. Quality control procedures can be put in place to identify annotations that don't match up and to ensure that annotators agree on as much as possible. As the task is significant and complicated, putting together a large corpus can take years and tens or hundreds of person-years of work. In this section, we'll quickly go over the different stages of a corpus's life cycle.

Three Ways to Make a Corpus

In one kind of corpus, the design changes over time as the creator explores different aspects. This is how traditional "field linguistics" works. Information from elicitation sessions is analyzed as it is gathered, and questions that arise while analyzing today's elicitation sessions are often used to plan tomorrow's elicitation sessions. The resulting corpus is then used for research in the following years and may be kept as an archive indefinitely. The popular program Shoebox, which has been around for over 20 years and has now been re-released as Toolbox, is a good example of how computerization assists this kind of work (see [4]). Other software tools, like simple word processors and spreadsheets, are often used to gather the data. In the next section, we'll look at how to use these sources to acquire data.

Another corpus creation scenario is typical of experimental research. In this kind of research, a body of carefully designed material is collected from various human subjects and then analyzed to test a hypothesis or develop a technology. It has become common for these databases to be shared and reused within a lab or company, and they are often also made public. The "common task" method of managing research, which has become the norm in government-funded language technology research programs over the past 20 years, is based on corpora like these. We've encountered many of these kinds of corpora in earlier chapters. In this chapter, we'll learn how to write Python programs to perform the curation work required before these kinds of corpora can be published.

Lastly, there are efforts to put together a "reference corpus" for a particular language, like the American National Corpus (ANC) and the British National Corpus (BNC). Here, the goal is to make a comprehensive record of all the different ways a language can be used, written, and spelled. Aside from the sheer size of the task, there is also much reliance on automatic annotation tools and post-editing to correct any mistakes. However, we can write programs to identify and correct mistakes and check for balance in the corpus.

Quality Control

Creating a high-quality corpus involves having good tools for both automatic and manual data preparation. However, making a high-quality corpus also depends on everyday things like documentation, training, and workflow. Annotation guidelines explain the task and how it should be done. They might be updated frequently to handle tricky situations and add new rules to improve annotation consistency. Annotators need to be taught how to perform their jobs, including how to handle situations not covered in the rules. A workflow must be established, possibly with the help of software, to track which files have been initialized, annotated, validated, manually checked, etc. There could be multiple layers of annotation, each done by a different expert. When there is doubt or disagreement, it may be necessary to make a decision. When quality is of utmost importance, the entire corpus can be annotated twice, and any differences can be resolved by an expert.

Best practice is to report the level of agreement between annotators for a corpus (e.g., by annotating 10% of the corpus twice). This score gives a good idea of how well any automatic system trained on this corpus should perform.

Caution!

The interpretation of an inter-annotator agreement score should be done carefully, as the difficulty of annotation tasks varies significantly. For example, 90% agreement would be a terrible score for tagging parts of speech but a great score for labeling semantic roles.

The Kappa coefficient K measures how much two people agree about a category, taking into account how much they would agree by chance. For example, let's say an item needs to be annotated, and there are four equally likely ways to code it. Then, if two people were to code at random, they should agree 25% of the time. So, K = 0 will be given to a level of agreement of 25%, and the scale goes up as the level of agreement gets better. For a 50% agreement, K = 0.333, since 50 is a third of the way between 25 and 100. There are many other ways to measure agreement. See help(nltk.metrics.agreement) for more information.
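The arithmetic behind the kappa score described above can be written out directly; this small helper is an illustration added here rather than an NLTK function:

def kappa(observed_agreement, chance_agreement):
    """Cohen's kappa: how far observed agreement exceeds chance agreement,
    as a fraction of the maximum possible improvement over chance."""
    return (observed_agreement - chance_agreement) / (1 - chance_agreement)

# Four equally likely codes, so chance agreement is 0.25.
print(kappa(0.25, 0.25))   # 0.0       -- no better than chance
print(kappa(0.50, 0.25))   # 0.3333... -- a third of the way from chance to perfect
print(kappa(1.00, 0.25))   # 1.0       -- perfect agreement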
The Web has a lot of information that can be used to study language. We've already talked about how to access single files, RSS feeds, and search engine results. But there are times when we want to get a large amount of text from the web.

The easiest way to do this is to obtain a published collection of web text. A list of resources is maintained by the ACL Special Interest Group on Web as Corpus (SIGWAC) at http://www.sigwac.org.uk/. Using a well-defined web corpus has the advantage that it is well-documented, stable, and allows you to conduct experiments repeatedly.

If the content you want is available only on a certain website, there are many tools, like GNU Wget (http://www.gnu.org/software/wget/), that can help you obtain it. For the most freedom and control, you can use a web crawler like Heritrix (http://crawler.archive.org/). Crawlers give you fine-grained control over where to look, which links to follow, and how to organize the results (Croft, Metzler, & Strohman, 2009). For example, if we want to create a bilingual text collection with pairs of documents that are the same in each language, the crawler needs to figure out how the site is set up to determine how the documents relate to each other. It also needs to organize the downloaded pages in a way that makes it easy to find the connections. While it might be tempting to write your own web crawler, there are many potential challenges, such as detecting MIME types, converting relative URLs to absolute URLs, and avoiding getting stuck in cyclic link structures.

Finding Information in Word Processing Files

Word processing software is often used to prepare texts and dictionaries by hand in projects that don't have much computing infrastructure. These kinds of projects often provide templates for entering data, but the word processing software doesn't check to ensure the data is set up correctly. For instance, each text might have to have a title and a date. Similarly, each lexical entry may have some required fields. As the data gets bigger and more complicated, it may take more time to ensure it stays consistent.

How can we extract the data from these files so we can work with it in other programs? Additionally, how can we check the content of these files to help authors create well-structured data, maximizing the quality of the data in the context of the original authoring process?

Consider a dictionary where each entry has a part-of-speech field drawn from a set of 20 possibilities, displayed after the pronunciation field and rendered in 11-point bold. No regular word processor has search or macro functions that can check whether all parts of speech have been entered and shown correctly, so this task requires a lot of careful checking by hand. If the word processor allows the document to be saved in a non-proprietary format, such as text, HTML, or XML, we can sometimes write programs to perform this checking automatically.

Consider this part of a dictionary entry: "sleep (v.i.), a condition of body and mind ...". When the document is saved as HTML by the word processor, the underlying markup for this entry looks like the following excerpt:

[<span class=SpellE>sli:p</span>]
<span style='mso-spacerun:yes'></span>
<b><span style='font-size:11.0pt'>v.i.</span></b>
<span style='mso-spacerun:yes'></span>
<i>a condition of body and mind ...<o:p></o:p></i>
</p>

The entry is represented as an HTML paragraph, using the <p> element, and the part of speech appears inside a <span style='font-size:11.0pt'> element. The set of legal parts of speech, legal_pos, is set up by the next program. Then it extracts all of the 11-point content from the dict.htm file and puts it in the set used_pos. Notice that the search pattern contains a parenthesized sub-expression; re.findall will only return material that matches this sub-expression. Lastly, the program computes the set of illegal parts of speech as used_pos - legal_pos.

Once we know that the data is in the right format, we can write other programs to convert the data into a different format. The program in 3.1 below uses the BeautifulSoup library to strip out the HTML markup, pull out the words and their pronunciations, and write the results in "comma-separated value" (CSV) format.

import re
from bs4 import BeautifulSoup

def lexical_data(html_file, encoding="utf-8"):
    SEP = '_ENTRY'
    html = open(html_file, encoding=encoding).read()
    html = re.sub(r'<p', SEP + '<p', html)
    text = BeautifulSoup(html, 'html.parser').get_text()
    text = ' '.join(text.split())
    for entry in text.split(SEP):
        if entry.count(' ') > 2:
            yield entry.split(' ', 3)

>>>import csv
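The part-of-speech validation program described above is not reproduced in this excerpt; a sketch along the lines the text describes (the file name dict.htm, its encoding, and the set of legal tags are assumptions) might be:

import re

legal_pos = set(['n', 'v.t.', 'v.i.', 'adj', 'det'])
# The parenthesized group captures the tag that appears inside 11-point spans.
pattern = re.compile(r"'font-size:11.0pt'>([a-z.]+)<")
document = open("dict.htm", encoding="windows-1252").read()
used_pos = set(re.findall(pattern, document))
illegal_pos = used_pos.difference(legal_pos)
print(list(illegal_pos))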
One approach to address this problem has been to work on developing a generic format that can accommodate
a wide range of annotation types (see [8] for examples). The challenge for NLP is to create programs that
can handle the diversity of these formats. For instance, if the programming task involves tree data and the
file format allows any directed graph, then the input data must be validated to ensure tree properties such
as rootedness, connectedness, and acyclicity. If the input files contain additional layers of annotation, the
program should know how to ignore them when loading the data without invalidating or deleting them when
saving the tree data back to the file.
Another solution has been to write one-off scripts to convert the formats of corpora. Many NLP researchers
have numerous such scripts in their files. The idea behind NLTK’s corpus readers is that the work of parsing
a corpus format should only need to be done once (per programming language).
Instead of trying to establish a common format, we believe it would be more beneficial to create a common
interface (cf. nltk.corpus). Consider the case of treebanks, which are crucial for NLP work. A phrase structure
tree can be saved in a file in various ways, such as nested parentheses, nested XML elements, a dependency
notation with a (child-id, parent-id) pair on each line, or an XML version of the dependency notation. However,
the way the ideas are represented is almost the same in each case. It is much easier to develop a common
interface that allows application programmers to write code using methods like children(), leaves(), depth(),
and so on to access tree data. Note that this approach aligns with standard computer science practices, such
as abstract data types, object-oriented design, and the three-layer architecture. The last one, from the world
of relational databases, enables end-user applications to use a common model (the “relational model”) and
a common language (SQL) to abstract away the specifics of file storage and accommodate new filesystem
technologies without impacting end-user applications. Similarly, a common corpus interface ensures that
application programs do not need to be aware of specific data formats.
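For instance, NLTK's treebank reader already hides the on-disk format behind Tree objects; the method names in this sketch (leaves(), height(), subtrees()) are NLTK's actual names for the kinds of operations the text refers to as children(), depth(), and so on:

from nltk.corpus import treebank

# Application code works with Tree objects, independently of how the
# treebank happens to be stored in files.
t = treebank.parsed_sents('wsj_0001.mrg')[0]
print(t.leaves())       # the words of the sentence
print(t.height())       # the depth of the tree
for subtree in t.subtrees():
    if subtree.label() == 'NP':
        print(' '.join(subtree.leaves()))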
When creating a new corpus for distribution in this context, it is best to use an already popular format whenever
possible. When this is not possible, the corpus could come with software, like an nltk.corpus module, that
supports existing interface methods.
Summary
● Annotated texts and lexicons are essential types of data found in most corpora. Texts have a structure
based on time, while lexicons have a structure based on records.
● In a corpus's lifecycle, data collection, annotation, quality control, and publication are all crucial steps. The lifecycle continues after publication, because the corpus is modified and enriched in the course of research.
● Corpus development requires finding a balance between obtaining a good sample of how language is used
and gathering enough information from any one source or genre to be useful. Due to limited resources, it
is usually not possible to multiply the dimensions of variation.
● XML is a suitable format for storing and exchanging linguistic data, but it doesn’t offer easy solutions to
common data modeling problems.
● Language documentation projects often use the Toolbox format. We can write programs to help manage
Toolbox files and convert them to XML.
● The Open Language Archives Community (OLAC) is a place where language resources can be documented
and found.
a. Overhead the day drives level and grey, hiding the sun by a flight of grey spears. (From As I Lay Dying, by William Faulkner, 1935)
... can be made from the information that was talked about in the last section, similar to the rule-based phonological processing shown by the Sound Pattern of English (Chomsky and Halle, 1968). However, this model was unable ...

... the empiricist view, which says that our primary source of knowledge is the experience of our senses, and that human reason plays a secondary role in reflecting on that experience. Galileo's discovery ...
Only a Toolkit: As stated in the introduction, NLTK is not a system but a toolkit. Many problems will be solved using a combination of NLTK, Python, other Python libraries, and interfaces to external NLP tools and formats.
Summary
● Linguists are often asked about the number of languages they can communicate in. In response, they must
clarify that the primary focus of linguistics is the study of abstract structures shared by languages. This
profound and elusive study goes beyond merely learning as many languages as possible.
● Similarly, computer scientists are frequently questioned about the number of programming languages
they are fluent in. They, too, must explain that the primary focus of computer science is the study of data
structures and algorithms that can be implemented in any programming language. This study is profound
and intricate, surpassing the pursuit of fluency in multiple programming languages.
● Throughout this book, various topics related to Natural Language Processing (NLP) have been discussed,
with most examples presented in English and Python. However, it would be a mistake for readers to
conclude that NLP is solely about developing Python programs for editing English text or manipulating text
in any natural language using any programming language.
● The choice of Python and English was primarily for convenience, and the focus on programming was merely
a means to an end. That end was to understand data structures and algorithms for handling linguistically
annotated text collections, building new language technologies to serve the information society’s needs,
and ultimately gaining a deeper understanding of the vast richness of human language.