Natural Language Processing

Contents
● Searching text
● Counting vocabulary

What do you think we will be able to do if we combine basic programming skills with a significant amount of text?
Getting Started with the Python Programming Language

You are able to type directly into the interactive interpreter, which is the software that will be responsible for executing your Python programs. The >>> prompt indicates that the Python interpreter is now waiting for input. When copying examples from this book, don't type the ">>>" yourself. Let's begin by using Python as a calculator:

>>> 1 + 5 * 2 - 3
8
>>>

The prompt will appear once again when the interpreter has completed computing the answer and showing it to the user. This indicates that the Python interpreter is awaiting further instruction.

Note

Your Turn: Enter a few more expressions of your own. You can use asterisk (*) for multiplication and slash (/) for division, and parentheses for bracketing expressions.

The accompanying examples explain how you can work interactively with the Python interpreter by playing with different expressions in the language to see what they accomplish. Now, let's examine how the interpreter deals with a meaningless phrase by giving it a shot:

>>> 1 +
  File "<stdin>", line 1
    1 +
      ^
SyntaxError: invalid syntax
>>>

This produced a syntax error. In Python, it does not make sense to close an instruction with a plus sign. The Python interpreter points out the line number on which the error occurred; in this case, it is line 1 of stdin, which is short for "standard input."

Since we are now able to access the Python interpreter, we are prepared to begin dealing with language data. To get the version of NLTK that is necessary for your platform, just follow the instructions provided on the NLTK website. After you have finished installing, restart the Python interpreter in the same manner as before, and then install the data that is necessary for the book by typing the following two commands at the Python prompt, and then selecting the book collection in the manner that is outlined in the following:
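The two commands themselves are not reproduced in this excerpt; for the NLTK toolkit used throughout this text they are:

>>> import nltk
>>> nltk.download()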
>>> from nltk.book import *
>>>
>>> text1.concordance("monstrous")
Displaying 11 of 11 matches:
... that has survived the flood ; most monstrous and most mountainous ! That Himmal ...
... they might scout at Moby Dick as a monstrous fable , or still worse and more de ...
... th of Radney .'" CHAPTER 55 Of the monstrous Pictures of Whales . I shall ere l ...
... ing Scenes . In connexion with the monstrous ...

Any time we want to find out about these texts, we can simply refer to them by name at the Python prompt. You might also like to see how words have been used differently over time. We've also included text5, the NPS Chat Corpus: search this for unconventional words like im, ur, lol. (Note that this corpus is uncensored!) We hope that once you've spent some time reading these works, you'll have a fresh perspective on the variety of ways language may be used.

Note

>>> text1.similar("monstrous")
>>> text2.similar("monstrous")
>>>
Observe that we get different results for different texts. Austen uses this word quite differently from Melville;
for her, monstrous has positive connotations, and sometimes functions as an intensifier like the word very.
The function common_contexts() allows us to examine just the contexts that are shared by two or more words, such as monstrous and very. We must enclose these words in square brackets as well as parentheses and separate them with a comma, as shown in the sketch below:
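The call itself is not reproduced in this excerpt; a minimal sketch using text2 (output omitted here; it lists shared contexts such as a_pretty and am_glad) would be:

>>> text2.common_contexts(["monstrous", "very"])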
Note
Your Turn: Pick another pair of words and compare their usage in two different texts, using the similar() and
common_contexts() functions.
It is one thing to automatically detect that a particular word occurs in a text and to display other words that appear in the same context. We can, however, also identify the location of a word in the text: how many words from the beginning it appears. This positional information can be displayed using a dispersion plot, in which each stripe represents an individual occurrence of a word and each row represents the entire text. We see some remarkable trends in the way words have been used during the last 220 years (in an artificial text constructed by joining the texts of the Inaugural Address Corpus end-to-end). You may create this plot in the manner shown below. You might try out some more terms (such as "liberty" or "constitution") as well as some alternative passages. Are you able to guess how a word will be distributed before you ever see it? As was the case previously, ensure that the quotation marks, commas, brackets, and parentheses are entered correctly.
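The plotting call itself is not reproduced above; a sketch using text4 (the Inaugural Address Corpus) with an illustrative word list is:

>>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])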
Note
Important: You need to have Python’s NumPy and Matplotlib packages installed in order to produce the
graphical plots used in this book.
Note
You can also plot the frequency of word usage through time using https://books.google.com/ngrams
Let's generate some random text now, just for fun, and see how it looks in the different styles that we've been discussing. In order to do this, we type the name of the text followed by the term generate. (We need to include the parentheses, but nothing goes between them.)
>>>text3.generate()
In the beginning of his brother is a hairy man, whose top may reach
unto heaven; and ye shall sow the land of Egypt there was no bread in
all that he was taken out of the month, upon the earth. So, shall thy
wages be? And they made their father; and Isaac was old, and kissed
him: and Laban with his cattle in the midst of the hands of Esau thy
first born, and Phichol the chief butler unto his son Isaac , she
>>>
The most striking difference between the two generated passages is the vocabulary used in each of the two underlying texts. In the following discussion, we will look at a number
of practical applications for counting the number of words in a piece of writing using a computer. You will
proceed as you have in the past and immediately begin experimenting with the Python interpreter, despite the
fact that you may not have yet studied Python in a systematic manner. Modifying the examples and working
through the activities at the conclusion of the chapter will help you evaluate your level of comprehension.
Let's begin by counting the number of words and punctuation marks that occur in a piece of writing from beginning to end so that we may get a sense of its overall length. The len function gives the length of anything; here we apply it to text3, the book of Genesis:
>>> len(text3)
44764
>>>

Therefore, the book of Genesis has 44,764 "tokens," which are words and punctuation symbols. A token is the technical term for a sequence of characters (such as "hairy," "his," or ":") that we wish to treat as a single unit. When we count the number of tokens in a text, we are counting instances of these sequences; for example, the phrase "to be or not to be" contains two instances of the word "to," two instances of the word "be," and one instance each of the words "or" and "not," yet only four unique words. How many unique words can be found throughout the whole of the book of Genesis? In order to solve this problem using Python, we will need to restate the issue in a slightly different way. A text's vocabulary is just the set of tokens that it employs, because in a set, any tokens that are used more than once are collapsed into a single entry. With Python's set command, we are able to extract the vocabulary items that text3 contains. When you do this, a lot of screens filled with text will go by very quickly. Now put the following into practice:

>>> sorted(set(text3))
['!', "'", '(', ')', ',', ',)', '.', '.)', ':', ';', ';)', '?', '?)', ...]
>>> len(set(text3))
2789
>>>

We were able to generate a sorted list of vocabulary items by surrounding the Python statement set(text3) with the sorted() function. The list begins with a variety of punctuation marks and continues with words that begin with the letter A. Every word with a capital letter comes before a word with a lowercase letter. Indirectly, we get the size of the vocabulary by asking for the number of items in the set; once again, we may use len to retrieve this amount. This book includes just 2,789 unique words, or "word types," despite the fact that it contains 44,764 tokens. A word type is the form or spelling of the word independent of its particular occurrences in a text; in other words, it refers to the word itself as an individual item of vocabulary. Since our tally of 2,789 items will include punctuation symbols as well, we will refer to these unique items in general as types rather than word types.
>>> lexical_diversity(text3)
0.06230453042623537
>>> lexical_diversity(text5)
0.13477005109975562
>>> percentage(4, 5)
80.0
>>> percentage(text4.count('a'), len(text4))
1.4643016433938312
>>>

In the definition of lexical_diversity(), we specify a parameter named text; this parameter is a placeholder for the actual text whose lexical diversity we want to compute, and it is reused in the body of the function whenever the function is called.
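The definitions of the two functions called above do not appear in this excerpt; minimal versions consistent with the outputs shown (and with the NLTK book's conventions) are:

>>> def lexical_diversity(text):
...     return len(set(text)) / len(text)    # ratio of unique tokens to total tokens
...
>>> def percentage(count, total):
...     return 100 * count / total           # express count as a percentage of total
...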
Summary
● To sum up, we use or call a function like lexical_diversity() by typing its name, followed by an open
parenthesis, the name of the text, and then a close parenthesis.
● You’ll see these parentheses a lot. Their job is to separate the name of a task, like lexical_diversity(), from
the data on which the task is to be done, like text3. When we call a function, the data value we put in the
parentheses is called an argument to the function.
● You have already seen functions like len(), set(), and sorted() in this chapter. By convention, we always put
an empty pair of parentheses after a function name, like len(), to show that we are talking about a function
and not some other kind of Python expression.
● Functions are an important part of programming, and we only talk about them at the beginning to show
newcomers how powerful and flexible programming can be.
● Don’t worry if it’s hard to understand right now. We’ll learn how to use functions when tabulating data later.
We’ll use a function to do the same work over and over for each row of the table, even though each row has
different data.
● Brown corpus
● Reuters corpus

Which text corpora and lexical resources are most helpful, and how can we utilize Python to access them?
A big body of text is referred to as a text corpus, as was just explained. A great number of corpora are constructed with the intention of including an appropriate mix of material from one or more categories. In step one, we looked at a few small collections of text, such as the speeches that are collectively referred to as the US Presidential Inaugural Addresses. This specific corpus, in fact, consists of dozens of separate texts, one for each address; but, for the sake of convenience, we joined all of these texts together and treated them as a single text. We also used a variety of pre-defined texts, any of which could be accessed by entering "from nltk.book import *". This part, on the other hand, looks at a number of different text corpora, since we want to be able to deal with a range of different texts. We are going to go through the process of selecting certain texts and working with those texts.

Gutenberg Corpus

A tiny subset of the texts that may be found in the Project Gutenberg electronic text collection, which can be found at http://www.gutenberg.org/ and houses around 25,000 free electronic books, is included in the NLTK toolkit. We begin by instructing the Python interpreter to load the package, and then we make a request to view the file IDs included inside this corpus using the nltk.corpus.gutenberg.fileids() function:

>>> import nltk
>>> nltk.corpus.gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt',
'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt',
'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt',
'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt',
'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt',
'shakespeare-macbeth.txt', 'whitman-leaves.txt']

Let's take the first of these works, "Emma" by Jane Austen, give it the abbreviated name "emma", and then count the number of words it contains:

>>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')
...
>>> emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
>>> emma.concordance("surprize")

In the process of defining "emma", we used the words() method of the "gutenberg" object included inside the "corpus" package. Python, however, offers a different variant of the import statement, which may be summarized as follows:

>>> from nltk.corpus import gutenberg
>>> gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', ...]
>>> emma = gutenberg.words('austen-emma.txt')

Let's write a short program to display other information about each text, by looping over all the file IDs in the corpus:

5 25 26 austen-emma.txt
5 26 17 austen-persuasion.txt
5 28 22 austen-sense.txt
4 34 79 bible-kjv.txt
5 19 5 blake-poems.txt
4 19 14 bryant-stories.txt
4 18 12 burgess-busterbrown.txt
4 20 13 carroll-alice.txt
5 20 12 chesterton-ball.txt
5 23 11 chesterton-brown.txt
5 18 11 chesterton-thursday.txt
4 21 25 edgeworth-parents.txt
5 26 15 melville-moby_dick.txt
5 52 11 milton-paradise.txt
4 12 9 shakespeare-caesar.txt
4 12 8 shakespeare-hamlet.txt
5 36 12 whitman-leaves.txt
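The looping program that produced the columns above is not reproduced in this excerpt; a sketch in the spirit of the NLTK book, whose three columns correspond to average word length, average sentence length, and the average number of times each vocabulary item appears, is:

>>> for fileid in gutenberg.fileids():
...     num_chars = len(gutenberg.raw(fileid))       # total characters
...     num_words = len(gutenberg.words(fileid))     # total tokens
...     num_sents = len(gutenberg.sents(fileid))     # total sentences
...     num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
...     print(round(num_chars/num_words), round(num_words/num_sents),
...           round(num_words/num_vocab), fileid)
...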
In addition, there is a corpus of instant messaging conversation sessions that was first compiled by the Naval Postgraduate School for the purpose of studying the automated identification of Internet predators. The corpus includes approximately 10,000 posts, all of which have been anonymized by having their usernames replaced with generic names of the form "UserNNN," and by having any other identifying information removed by hand. The corpus is arranged into 15 files, where each file comprises several hundred posts gathered on a certain day from an age-specific chatroom (teens, 20s, 30s, 40s, plus a generic adults chatroom). The date, chatroom, and total number of posts are all included in the name of each file.

In 1961, researchers at Brown University established the Brown Corpus, which is credited as being the first computerized English corpus to contain one million words. Text from five hundred different sources has been compiled into this corpus, and those sources have been arranged according to genre, such as editorial, news, and so on. [1.1] provides an example of each genre (for a comprehensive list, please refer to http://icame.uib.no/brown/bcm-los.html).

We have the option of accessing the corpus as a list of words or as a list of sentences (where each sentence is itself just a list of words), and we can optionally restrict attention to particular files or categories:

>>> brown.sents(categories=['news', 'editorial', 'reviews'])
[['The', 'Fulton', 'County'...], ['The', 'jury', 'further'...], ...]

Now it's your turn: choose a new section of the Brown Corpus, and modify the preceding example so that it counts a selection of wh-words, such as "what," "when," "where," "who," and "why." A sketch of one way to do this appears below.
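The word-counting example referred to above is not included in this excerpt; a hedged sketch of the wh-word version (variable names are illustrative) would be:

>>> from nltk.corpus import brown
>>> news_text = brown.words(categories='news')
>>> fdist = nltk.FreqDist(w.lower() for w in news_text)
>>> wh_words = ['what', 'when', 'where', 'who', 'why']
>>> for w in wh_words:
...     print(w + ':', fdist[w], end=' ')
...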
>>> from nltk.corpus import reuters
>>> reuters.fileids()
['test/14826', 'test/14828', 'test/14829', 'test/14832', ...]
>>> reuters.categories()

In a similar manner, we are able to request the words or sentences contained in particular files or categories. The titles, which are always stored in uppercase because that is the accepted format for storing them, make up the first few words of each of these texts:

>>> reuters.words('training/9865')[:14]

In [1], we examined the Inaugural Address Corpus, but we handled the whole collection as if it were a single text. The "word offset" was used as one of the axes in the graph that was shown in fig-inaugural. This is the numerical index of the word in the corpus, and it is calculated by starting with the very first word of the very first address. The corpus, however, is actually a collection of 55 separate texts, one of which corresponds to each presidential address. One aspect of this collection that is particularly intriguing is its temporal dimension:

>>> from nltk.corpus import inaugural
>>> cfd = nltk.ConditionalFreqDist(
...     (target, fileid[:4])
...     for fileid in inaugural.fileids()
...     for w in inaugural.words(fileid)
...     for target in ['america', 'citizen']
...     if w.lower().startswith(target))
>>> cfd.plot()
Plot of a Conditional Frequency Distribution: all words in the Inaugural Address Corpus that begin with "America" or "citizen" are counted; separate counts are kept for each address; these are plotted so that trends in usage over time can be observed; counts are not normalized for document length.
Annotations of linguistic content are found in a great number of text corpora. These annotations might
represent POS tags, named entities, grammatical structures, semantic roles, and so on. The NLTK gives easy
access to a number of these corpora and includes data packages that can be freely downloaded for use in
education and research that comprise corpora as well as samples of corpora. The following are examples of
some of the corpora:
Summary
● A text corpus is a huge collection of texts that has been organized in a certain way. The Natural Language
Toolkit (NLTK) includes a variety of corpora, such as the Brown Corpus.
● Some text corpora are organized into categories, such as those based on genre or subject matter; in other
cases, the categories in a corpus overlap with one another.
● A collection of frequency distributions, each one for a distinct circumstance, is what we mean when we talk
about a conditional frequency distribution. They are useful for calculating the number of times a certain
word or phrase appears inside a specific context or kind of writing.
● Python code that is more than a few lines long has to be written in a text editor, then saved to a file with the
.py suffix, and finally retrieved using an import statement.
● A function that is connected with an object is called a method; it is invoked by writing the name of the object, a period, and then the name of the function, as in x.funct(y) — for example, word.isalpha().
● In the Python interactive interpreter, you may read the help entry for objects of this kind by typing help(v)
and then looking up the variable you want information on.
● WordNet is an English dictionary that takes a semantic approach. It is comprised of synonym sets, often
known as synsets, and is structured in the form of a network.
● Certain functions cannot be used without first using Python’s import statement because they are not made
accessible by default.
● Search engine results
● Local files

You most likely have your own text sources in mind, and you will need to discover how to access them.

Note

Important: From this chapter onwards, our program samples will assume you begin your interactive session or your program with the following import statements:
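The import statements themselves are not reproduced in this excerpt; based on the NLTK book's convention and the functions used below (word_tokenize and request.urlopen), they would be along these lines:

>>> import nltk, re, pprint
>>> from nltk import word_tokenize
>>> from urllib import request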
>>> url = "http://www.gutenberg.org/files/2554/2554-0.txt"
>>> response = request.urlopen(url)
>>> raw = response.read().decode('utf8')
>>> type(raw)
<class 'str'>
>>> tokens = word_tokenize(raw)
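The llog object used in the next block is never constructed in this excerpt; in the NLTK book it comes from the third-party feedparser library, roughly as follows (the feed URL is the one used there):

>>> import feedparser
>>> llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")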
>>> len(llog.entries)
15
>>> post = llog.entries[2]
>>> post.title
"He's My BF"
>>> content = post.content[0].value
>>> content[:70]
'<p>Today I was chatting with three of our visiting graduate students f'
>>> raw = BeautifulSoup(content, 'html.parser').get_text()
>>> word_tokenize(raw)
['Today', 'I', 'was', 'chatting', 'with', 'three', 'of', 'our', ...]

Reading local files is straightforward:

>>> f = open('document.txt')
>>> raw = f.read()

Note

Now it's your turn: using a text editor, create a new file on your computer and name it document.txt. Type in a few lines of text and save the file as plain text. If you are working with IDLE, select the New Window command from the File menu, type the necessary text into this window, and then save the file as document.txt within the directory that IDLE suggests in the pop-up dialogue box. Next, open the file in the Python interpreter by typing f = open('document.txt'), and then examine its contents by typing print(f.read()).

When you attempted to do this, a number of different things may have gone wrong. In the event that the interpreter was unable to locate your file, you would have received an error similar to the following:

>>> f = open('document.txt')
Traceback (most recent call last):
  File "<pyshell#7>", line 1, in -toplevel-
    f = open('document.txt')
IOError: [Errno 2] No such file or directory: 'document.txt'

Use the Open command found in the File menu of IDLE; this will show a list of all the files located in the directory where IDLE is currently operating. Using this command will allow you to verify that the file you are attempting to open is indeed located in the correct directory. An additional option is to do the search from inside Python, using the directory that is currently active:

>>> import os
>>> os.listdir('.')

There are a few different approaches you might take in order to read the file once you are able to access it. The read() function generates a string that contains the whole of the file's contents:

>>> f.read()
'Time flies like an arrow.\nFruit flies like a banana.\n'

It is important to keep in mind that the characters '\n' represent a newline; this is the same as beginning a new line by hitting the Enter key on a keyboard. We also have the option of reading a file line by line by using a for loop:

>>> f = open('document.txt', 'rU')
>>> for line in f:
...     print(line.strip())
...
Time flies like an arrow.
Fruit flies like a banana.

In this case, we make use of the strip() function to get rid of the newline character that appears at the end of each line of input.
>>>path = nltk.data.find(‘corpora/gutenberg/melville-moby_dick.txt’)
Text Extraction from Binary Formats Such As PDF, MS Word, and Other Formats
Text formats that are understandable by humans include ASCII text and HTML text. Binary file formats, such
as PDF and MSWord, are used for text documents, and these formats need specific software to be accessed.
Access to these formats may be gained via the use of third-party libraries such as pypdf and pywin32. It is a
particularly difficult task to extract text from documents that have several columns. It is easier to do a one-
time conversion of a few documents if you first open the document in an appropriate program, then save it as
text to your local disk, and then access it in the manner that is outlined in the following paragraphs. You may
search for the document by entering its URL into Google’s search box if it is already available on the internet.
A link to an HTML version of the document, which you are able to save as text, is often included in the results
of the search.
When a user is engaging with our software, there are occasions when we would want to record the text that the user types in. Calling the Python function input() will prompt the user to enter a line of input; after saving it to a variable, we can manipulate it just as we would any other string.
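A small sketch of this (the sample sentence is illustrative; word_tokenize is assumed to have been imported as above):

>>> s = input("Enter some text: ")
Enter some text: On an exceptionally hot evening early in July
>>> print("You typed", len(word_tokenize(s)), "words.")
You typed 8 words.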
Summary
● In this unit, a “text” is understood to be nothing more than a collection of words. A “raw text” is a potentially
lengthy string that contains words and whitespace formatting, and it is the traditional way that we store
and perceive a text. This string may also be called “plain text.”
● Python allows you to specify strings by enclosing them in single or double quotes, such as 'Monty Python' or "Monty Python".
● To retrieve the characters included in a string, indices are used, with counting beginning with zero: 'Monty Python'[0] gives the value 'M'. The length of a string is found using len().
● Accessing a substring requires employing a technique known as slice notation. For example, “Monty
Python”[1:5] returns the result onty. If the start index is not specified, the substring will start at the beginning
of the string. On the other hand, if the end index is not specified, the slice will continue until it reaches the
end of the text.
● A string can be split into a list of strings: 'Monty Python'.split() produces ['Monty', 'Python']. A list of strings can be joined into a single string using a separator such as '/': '/'.join(['Monty', 'Python']) produces the string 'Monty/Python'.
● We can read text from the file input.txt using text = open('input.txt').read(). We can read text from a URL u using text = request.urlopen(u).read().decode('utf8'). We can loop over the lines of a text file f using for line in open(f).
● Text may be written to a file by first opening the file for writing with output_file = open('output.txt', 'w'), and then writing to it with print("Monty Python", file=output_file).
● Before we can do any kind of linguistic processing on texts that were obtained on the internet, we need to
remove any undesired data that they could have included, such as headers, footers, or markup.
● The process of breaking up a text into its component parts, also known as tokens, such as words and
punctuation is known as tokenization. As it groups punctuation in with the words it recognizes, tokenization
that is based on whitespace is not suitable for many applications. NLTK has a pre-built tokenizer that you
may use called nltk.word_tokenize().
● The process of lemmatization translates the many forms of a word (such as appeared and appears) to
the canonical or citation form of the term, which is also referred to as the lexeme or lemma (e.g., appear).
● About sequences
● Sequence types

It is possible that you are still struggling with Python and do not yet feel as if you have complete control. Within the scope of this chapter, we will discuss the following issues:
>>> foo = ['Monty', 'Python']
>>> bar = foo
>>> foo[1] = 'Bodkin'
>>> bar
['Monty', 'Bodkin']
The Figure Showing the Assignment of Lists and the Computer Memory: Because the two list objects foo
and bar reference the same address in the memory of the computer, any changes you make to foo will
automatically be reflected in bar, and vice versa.
The contents of the variable are not copied; only its "object reference" is copied when the line bar = foo is executed. To understand what is going on here, we need to know how lists are stored in the computer's memory. We can see from the figure that the list foo is a
reference to an object that is kept at the position 3133 (which is itself a series of pointers to other locations
holding strings). When we do an assignment like bar = foo, all that is transferred is the object reference,
which is 3133. This behavior is applicable to other facets of the language as well, such as the passing of
parameters.
Let's conduct some further tests by first establishing a variable called empty that will store the empty list, and then making use of that variable three times on the line that follows:

>>> empty = []
>>> nested = [empty, empty, empty]
>>> nested
[[], [], []]
>>> nested[1].append('Python')
>>> nested
[['Python'], ['Python'], ['Python']]
Take note that by altering one of the items in our nested list of lists, we were able to modify the other items as well. This is the case due to the fact that each of the three elements is, in reality, just a reference to the same list that is stored in memory. It is essential to have a thorough understanding of the distinction between modifying an object through an object reference and overwriting an object reference. In a further experiment, appending 'Python' produced a nest containing three references to a single list object; the next step was to replace one of those references with a reference to a new object, ['Monty']:

>>> nested
[['Python'], ['Monty'], ['Python']]

During this final step, one of the three object references inside the nested list was overwritten; the other two still point at the original list. Next we will show that copies made with the multiplication operator are not only identical according to the operator ==, but are also the very same object:

>>> size = 5
>>> python = ['Python']
>>> snake_nest = [python] * size
>>> snake_nest[0] == snake_nest[1] == snake_nest[2] == snake_nest[3] == snake_nest[4]
True
Let's add another python to this nest now, shall we?
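The statement that adds the new python is not shown in this excerpt; in the NLTK book it is placed at a random position, roughly as follows (the random position explains why the intruder may appear in a different slot each run):

>>> import random
>>> position = random.choice(range(size))
>>> snake_nest[position] = ['Python']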
>>> snake_nest
[['Python'], ['Python'], ['Python'], ['Python'], ['Python']]
>>> snake_nest[0] == snake_nest[1] == snake_nest[2] == snake_nest[3] == snake_nest[4]
True
>>> snake_nest[0] is snake_nest[1] is snake_nest[2] is snake_nest[3] is snake_nest[4]
False

You can perform a number of pairwise tests to determine which position is occupied by the intruder; however, the id() function makes detection much simpler:

>>> [id(snake) for snake in snake_nest]
[4557855488, 4557854763, 4557855488, 4557855488, 4557855488]

This demonstrates that the second item on the list has a distinct identifier. You should expect to see different numbers when you execute this code snippet yourself, and the intruder may also be in a different position.

Caution!

In the condition part of an if statement, a nonempty string or list is evaluated as true, while an empty string or list evaluates as false:

>>> mixed = ['cat', '', ['dog'], []]
>>> for element in mixed:
...     if element:
...         print(element)
...
cat
['dog']

That is to say, in the condition we do not require the statement if len(element) > 0:.

How is the use of if...elif distinct from the use of multiple if statements placed one after the other? Now, take into consideration the following scenario:

>>> animals = ['cat', 'dog']
>>> if 'cat' in animals:
...     print(1)
... elif 'dog' in animals:
...     print(2)
...
1
>>>any(len(w) > 4 for w in sent)
(29, 5, 2)
Take note that in this snippet of code, we have calculated many numbers on a single line, separating each one
with commas. Python enables us to eliminate the parentheses surrounding tuples if there is no ambiguity,
which is why these comma-separated expressions are essentially simply tuples. When we print a tuple, the
parenthesis will consistently be shown on the page. When we use tuples in this manner, we are effectively
grouping elements together in a more general sense.
As was just shown for you, there are several helpful methods in which we can iterate through the elements
that make up a sequence.
for item in set(s).difference(t)    iterate over elements of s that are not in t
Other objects, such as a FreqDist, can be converted into a sequence (using list() or sorted()) and support iteration:

>>> raw = 'Red lorry, yellow lorry, red lorry, yellow lorry.'
>>> text = word_tokenize(raw)
>>> fdist = nltk.FreqDist(text)
>>> sorted(fdist)
[',', '.', 'Red', 'lorry', 'red', 'yellow']
>>> for key in fdist:
...     print(key + ':', fdist[key], end='; ')
...
lorry: 4; red: 1; .: 1; ,: 3; Red: 1; yellow: 2

In the next example, we use tuples to rearrange the items in our list. (Since the comma has greater precedence than assignment, the parentheses aren't necessary here.)

>>> words = ['I', 'turned', 'off', 'the', 'spectroroute']
>>> words[2], words[3], words[4] = words[3], words[4], words[2]
>>> words
['I', 'turned', 'the', 'spectroroute', 'off']

This is equivalent to the following traditional way of doing the same rearrangement, which needs a temporary variable tmp:

>>> tmp = words[2]
>>> words[2] = words[3]
>>> words[3] = words[4]
>>> words[4] = tmp

Python's sequence functions, such as sorted() and reversed(), can reorder the items in a sequence, as we have seen. There are also functions that can change the structure of a sequence, which can be useful for language processing because of the flexibility they provide. The function zip() takes the elements of two or more sequences and "zips" them together into a single sequence of tuples. If you give enumerate(s) a sequence, it will return pairs consisting of an index and the item located at that index.

>>> words = ['I', 'turned', 'off', 'the', 'spectroroute']
>>> tags = ['noun', 'verb', 'prep', 'det', 'noun']
>>> zip(words, tags)
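In Python 3, zip() and enumerate() return lazy objects; to inspect the pairs they produce, you can wrap the calls in list(). A small illustrative sketch:

>>> list(zip(words, tags))
[('I', 'noun'), ('turned', 'verb'), ('off', 'prep'), ('the', 'det'), ('spectroroute', 'noun')]
>>> list(enumerate(words))
[(0, 'I'), (1, 'turned'), (2, 'off'), (3, 'the'), (4, 'spectroroute')]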
>>>lexicon.sort()
>>>del lexicon[0]
● Both the assignment and the passing of parameters in Python make use of object references. For instance,
if a is a list and we assign b = a, then every action performed on a will also alter b, and vice versa.
● The is operation determines whether two objects are the same internal object, while the == operation determines whether two objects are equal. This difference is analogous to the one between the type and the token.
● There are many distinct types of sequence objects, including strings, lists, and tuples. These sequence
objects enable standard operations like indexing, slicing, len(), sorted(), and membership checking using
in.
● Programming in a declarative style often results in code that is more concise and easier to comprehend; manually incremented loop variables are typically unnecessary; when a sequence must be enumerated, use enumerate().
● Functions are a fundamental programming abstraction; the notions of passing parameters, variable scope,
and docstrings are important to grasp while working with functions.
● Names that are defined inside a function are not accessible outside of that function unless those names
are explicitly declared to be global. A function acts as a namespace.
● Modules make it possible to localize content that is logically connected inside a single file. Names that are
declared inside a module, such as variables and functions, are not accessible to other modules until those
names are imported into those other modules. This is because a module acts as a namespace.
● In the field of natural language processing (NLP), dynamic programming is a method for the construction
of algorithms that maintains the results of earlier computations to prevent needless re-computation.
● Mapping words to properties using Python dictionaries

Word categories like these are useful for many tasks that have to do with processing language. As we'll see, they come from a simple look at how words are placed in a text. This chapter will try to answer the following questions:
Using a Tagger

A part-of-speech tagger, or POS-tagger, takes a string of words and adds a part-of-speech tag to each one (don't forget to import nltk):

>>> text = word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
('completely', 'RB'), ('different', 'JJ')]

Here, we can see that and is a coordinating conjunction (CC), that now and completely are adverbs (RB), that something is a noun (NN), and that different is an adjective (JJ).
Note:

NLTK has documentation for each tag, which you can look up using either the tag name, like nltk.help.upenn_tagset('RB'), or a regular expression, like nltk.help.upenn_tagset('NN.*'). Some corpora have README files with information about tagsets. To find these files, type nltk.corpus.???.readme(), where ??? is the name of the corpus.

Let's take a look at another example, this time with homonyms. Notice that the words refuse and permit can both appear as present tense verbs (VBP) and as nouns (NN). For example, refUSE is a verb meaning "deny," while REFuse is a noun meaning "trash" (i.e. they are not homophones). So, to pronounce the text correctly, we need to know which word is being used. (This is why POS-tagging is usually done by text-to-speech systems.)

Note:

Your Turn: Many words, such as ski and race, can be used as nouns or verbs without changing how they sound. Can you come up with more? Hint: Try putting the word to in front of a common noun to see if it can also be a verb, or try putting the in front of an action to see if it can also be a noun. Now, make up a sentence that uses this word in both ways and run it through the POS-tagger.

Noun categories and part-of-speech tags like NN seem to have their uses, but many readers won't understand the details. You might wonder why it's important to add this extra piece of information. Many of these categories come from a quick look at how words are placed in a text. Think about how the words woman (a noun), bought (a verb), over (a preposition), and the (a determiner) are distributed. The text.similar() method takes a word w, finds all contexts w1 w w2, and then finds all words w' that appear in the same context, i.e. w1 w' w2.

>>> text.similar('woman')
man day time year car moment world family house boy child country job
state girl place war way case question
>>> text.similar('bought')
made done put said found had seen given left heard been brought got set
was called felt in that told

Tagged Corpora

A tagged token is represented as a tuple consisting of the token and its tag; we can create one from the standard string representation of a tagged token using the function str2tuple():

>>> tagged_token = nltk.tag.str2tuple('fly/NN')
>>> tagged_token
('fly', 'NN')
>>> tagged_token[0]
'fly'
>>> tagged_token[1]
'NN'

We can construct a list of tagged tokens directly from a string of tagged text:

>>> sent = '''
... The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at
... investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd
... no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.
... '''
>>> [nltk.tag.str2tuple(t) for t in sent.split()]

Part-of-speech tags are saved in different ways by other corpora. The interface of NLTK's corpus readers is the same, so you don't have to worry about the different file formats. In contrast to the file fragment shown above, this is how the data looks in the Brown Corpus reader. Since the Brown Corpus was published, it has become standard for part-of-speech tags to be written in all capital letters:

>>> nltk.corpus.brown.tagged_words()
[('The', 'AT'), ('Fulton', 'NP-TL'), ...]
>>> nltk.corpus.brown.tagged_words(tagset='universal')
[('The', 'DET'), ('Fulton', 'NOUN'), ...]

Whenever tagged text is in a corpus, the tagged_words() method will be in the NLTK corpus interface. Here are some further examples:

>>> nltk.corpus.conll2000.tagged_words()
[('Confidence', 'NN'), ('in', 'IN'), ('the', 'DT'), ...]
>>> nltk.corpus.treebank.tagged_words()
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ...]

Not all corpora use the same set of tags. For more information, see the tagset help and readme() methods mentioned above. At first, we want to stay away from these tagsets' complexity, so we use a built-in mapping to the "Universal Tagset":

>>> nltk.corpus.brown.tagged_words(tagset='universal')
[('The', 'DET'), ('Fulton', 'NOUN'), ...]
>>> nltk.corpus.treebank.tagged_words(tagset='universal')
[('Pierre', 'NOUN'), ('Vinken', 'NOUN'), (',', '.'), ...]

NLTK comes with tagged corpora for a number of other languages, such as Chinese, Hindi, Portuguese, Spanish, Dutch, and Catalan. These usually contain non-ASCII text, and when Python prints a larger structure like a list, it shows this text in hexadecimal.
A Universal Part-of-Speech Tagset

There are many ways to tag words in tagged corpora. We will look at a simplified tagset to help us get started. Let's see which of these tags is used the most in the news category of the Brown corpus:

>>> tag_fd.most_common()
[('NOUN', 30640), ('VERB', 14399), ('ADP', 12355), ('.', 11928), ('DET', 11389),
('ADJ', 6706), ('ADV', 3349), ('CONJ', 2717), ('PRON', 2535), ('PRT', 2264), ...]
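The construction of tag_fd is not shown in this excerpt; a sketch consistent with the output above (following the NLTK book) would be:

>>> from nltk.corpus import brown
>>> brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
>>> tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
>>> tag_fd.most_common()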
Lists for Indexing Vs. Dictionaries

Figure: List Look-up: with the help of an integer index (0, 1, 2, 3), we can get to the items in a Python list (e.g. 'Call', 'me', 'Ishmael', '.').
Compare this to frequency distributions (3), where we supply a word and get back a number, such as fdist['monstrous'], which tells us how many times that word has been used in a text. Anyone who has used a dictionary knows how to use words to look something up. The figure shows some more examples.
Dictionary Look-up: To get to an entry in a dictionary, we use a key like a person’s name, a web domain, or an
English word. Dictionary is also called a map, hashmap, hash, and associative array.
In a phonebook, we use a name to look up an entry and get back a phone number. When we type a domain
name into a web browser, the computer looks up the name to find an IP address. A word frequency table
lets us look up a word and find out how often it was used in a group of texts. In all of these cases, we are
converting names to numbers instead of the other way around, as we would with a list. In general, we’d like
to be able to connect any two kinds of information. It gives a list of different language objects and what they
map to.
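As a concrete sketch of the word-to-property mapping described above (the entries follow the colorless/ideas example used in this chapter's summary):

>>> pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}
>>> pos['ideas']
'N'
>>> pos['colorless']
'ADJ'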
Summary
● There are different kinds of words, like nouns, verbs, adjectives, and adverbs, that can be put into groups.
Parts of speech or lexical categories are other names for these groups. Short labels, or tags, like NN, VB,
etc., are given to different parts of speech.
● Part-of-speech tagging, POS tagging, or just tagging is the process of automatically assigning parts of
speech to words in text.
● Automatic tagging is an important step in the NLP pipeline. It can be used to predict how new words will
behave, analyze how words are used in corpora, and make text-to-speech systems work better.
● A variety of tagging methods are possible, e.g. default tagger, regular expression tagger, unigram tagger
and n-gram taggers. These can be put together using a method called “backoff.”
● Backoff is a way to combine models: when a more specialized model, like a bigram tagger, can’t assign a
tag in a given context, we switch to a more general model (such as a unigram tagger).
● Part-of-speech tagging is an early and important example of a sequence classification task in NLP. At any
point in the sequence, a classification decision is made using the words and tags in the local context.
● A dictionary is used to map between arbitrary kinds of data, like a string and a number: freq['cat'] = 12. We create dictionaries using brace notation: pos = {}, pos = {'furiously': 'adv', 'ideas': 'n', 'colorless': 'adj'}.
● N-gram taggers can be made for large numbers of n, but once n is greater than 3, we usually run into the
sparse data problem. Even with a lot of training data, we only see a tiny fraction of the possible contexts.
● Transformation-based tagging involves learning a set of repair rules like "change tag s to tag t in context c." Each rule fixes some mistakes, and it may also introduce a (smaller) number of new errors.
● Document classification
● Sequence classification

But how did we know where to start looking?
Supervised Classification
Classification is the process of putting something into the right group. In simple classification tasks, each
input is looked at on its own, and the set of labels is set up ahead of time. Some examples of tasks that need
to be categorized are:
● Choosing the topic of a news story from a fixed list of topics like “sports,” “technology,” and “politics”
● Figuring out whether a given use of the word "bank" refers to a riverbank, a financial institution, the act of tilting to the side, or the act of depositing something in a financial institution
There are a lot of interesting ways to do the basic classification task. In multi-class classification, for example,
each instance can be given more than one label; in open-class classification, the set of labels is not known
ahead of time; and in sequence classification, a list of inputs is classified all at once.
A classifier is called “supervised” if it was built using training corpora in which each input was labelled with
the correct label. Figure 6.1 shows how supervised classification works as a whole.
>>> gender_features('Shrek')
{'last_letter': 'k'}

A feature extractor that fits gender features too well: its output feature sets contain a large number of very specific features, which leads to overfitting on the relatively small Names Corpus.

There are usually limits to how many features you should use with a given learning algorithm. If you give the algorithm too many features, it is more likely to rely on quirks of your training data that don't generalize to new examples. This is called "overfitting," and it can be a particular problem when working with small training sets.

The corpus data is divided into two sets: the development set and the test set. Most of the time, the development set is itself split into a training set and a dev-test set.

>>> featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
>>> devtest_names = labeled_names[500:1500]
>>> test_names = labeled_names[:500]

The training set is used to teach the model what to do, while the dev-test set is used to figure out what went wrong. The test set helps us figure out how well the system works in the end. For reasons we'll talk about below, it's important to use a separate dev-test set for error analysis instead of just the test set. This shows how the corpus data is broken up into different subsets.

After dividing the corpus into appropriate datasets, we train a model with the training set and then run it on the dev-test set:

>>> print(nltk.classify.accuracy(classifier, devtest_set))
0.75

Using the dev-test set, we can make a list of the mistakes the classifier makes when guessing the gender of a name:

>>> errors = []
>>> for (name, tag) in devtest_names:
...     guess = classifier.classify(gender_features(name))
...     if guess != tag:
...         errors.append((tag, guess, name))
...
>>> for (tag, guess, name) in sorted(errors):
...     print('correct={:<8} guess={:<8s} name={:<30}'.format(tag, guess, name))
...
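Several pieces referenced above (the gender_features() definition, the construction of labeled_names, and the classifier training) do not appear in this excerpt; a hedged sketch consistent with the NLTK book's names example is:

>>> from nltk.corpus import names
>>> import random
>>> def gender_features(word):
...     return {'last_letter': word[-1]}    # one very simple feature
...
>>> labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
...                  [(name, 'female') for name in names.words('female.txt')])
>>> random.shuffle(labeled_names)
>>> train_names = labeled_names[1500:]
>>> train_set = [(gender_features(n), gender) for (n, gender) in train_names]
>>> devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)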
Next, we define a feature extractor for documents so that the classifier will know which parts of the data to focus on (1.4). For figuring out what a document is about, we can define a feature for each word that tells us whether the document contains that word or not. So that the classifier doesn't have to deal with too many features, we start by making a list of the 2000 most frequent words in the whole corpus [1]. Then, we can define a feature extractor [2] that simply checks whether each of these words is in a given document.

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words)[:2000]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

{'contains(waste)': False, 'contains(lot)': False, ...}

Note

In [3], we compute the set of all the words in a document, instead of just checking whether a word is in the document, because it is much faster to check whether a word is in a set than in a list.

Now that we've defined our feature extractor, we can use it to train a classifier to label new movie reviews. We measure the classifier's accuracy on the test set [1] to see how well it works. Again, we can use show_most_informative_features() to find out which features the classifier thought were the most helpful [2].

featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

>>> print(nltk.classify.accuracy(classifier, test_set))
0.81
>>> classifier.show_most_informative_features(5)
Most Informative Features
   contains(outstanding) = True    pos : neg = 11.1 : 1.0
        contains(seagal) = True    neg : pos =  7.7 : 1.0
   contains(wonderfully) = True    pos : neg =  6.8 : 1.0
        contains(wasted) = True    neg : pos =  5.8 : 1.0

In this corpus, a review that mentions "Seagal" is almost 8 times more likely to be negative than positive, while a review that mentions "Damon" is about 6 times more likely to be positive.
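The documents list used above is never constructed in this excerpt; a sketch following the NLTK book's movie-reviews example would be:

>>> from nltk.corpus import movie_reviews
>>> import random
>>> documents = [(list(movie_reviews.words(fileid)), category)
...              for category in movie_reviews.categories()
...              for fileid in movie_reviews.fileids(category)]
>>> random.shuffle(documents)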
Sequence Classification

We can use joint classifier models, which pick the most appropriate labels for a whole collection of related inputs at once. One strategy, called consecutive classification or greedy sequence classification, is to find the most likely class label for the first input and then use that answer to help find the best label for the next input. The process can then be repeated until all of the inputs are labelled. This is how the bigram tagger from 5 worked: it started by putting a part-of-speech tag on the first word in the sentence, and then tagged each following word based on the word itself and the tag it had predicted for the preceding word.

This approach is demonstrated below. First, we need to add a history argument to our feature extractor function, which gives us a list of the tags we've guessed for the sentence so far [1]. Every tag in history corresponds to a word in the sentence. But keep in mind that history will only contain tags for words we've already tagged, that is, words to the left of the target word. So, while it is possible to look at some features of words to the right of the target word, we cannot look at their tags yet, because we haven't generated them yet.

After we've decided on a feature extractor, we can build our sequence classifier [2]. During training, we use the annotated tags to provide the feature extractor with the correct history; when tagging new sentences, the history is generated from the tagger's own output.

def pos_features(sentence, i, history):
    ...
    if i == 0:
        features["prev-tag"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
        features["prev-tag"] = history[i-1]
    return features

class ConsecutivePosTagger(nltk.TaggerI):

    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = pos_features(untagged_sent, i, history)
                train_set.append((featureset, tag))
                history.append(tag)
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = pos_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return zip(sentence, history)

>>> print(tagger.evaluate(test_sents))
0.79796012981
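The construction of tagger and test_sents is not included in this excerpt; a sketch following the NLTK book (using the Brown news category) would be:

>>> from nltk.corpus import brown
>>> tagged_sents = brown.tagged_sents(categories='news')
>>> size = int(len(tagged_sents) * 0.1)
>>> train_sents, test_sents = tagged_sents[size:], tagged_sents[:size]
>>> tagger = ConsecutivePosTagger(train_sents)
>>> print(tagger.evaluate(test_sents))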
Summary
● Modelling the language data in corpora can help us understand language patterns and can be used to
make predictions about new language data.
● Supervised classifiers use labelled training corpora to build models that can predict the label of an input
based on certain features of that input.
● Supervised classifiers can do a wide range of NLP tasks, such as document classification, part-of-speech
tagging, sentence segmentation, dialogue act type identification, determining entailment relations, and
many more.
● When training a supervised classifier, you should divide your corpus into three datasets: a training set for
building the classifier model, a dev-test set for helping choose and tune the model’s features, and a test set
for judging how well the final model works.
● When testing a supervised classifier, it’s important to use new data that wasn’t part of the training or
development-test set. If not, the results of your evaluation might be too optimistic.
● Decision trees are tree-shaped flowcharts that are made automatically and are used to label input values
based on their properties. Even though they are easy to understand, they are not very good at figuring out
the right label when the values of the features interact.
● In naive Bayes classifiers, each feature contributes independently to the decision about which label to use. This makes the model simple, but it cannot capture interactions between feature values, which becomes a problem when two or more features are strongly correlated: their evidence is effectively double-counted.
● Maximum Entropy classifiers use a basic model that is similar to the model used by naive Bayes. However,
they use iterative optimization to find the set of feature weights that maximizes the probability of the
training set.
● Most of the models that are automatically built from a corpus are descriptive. They tell us which features
are important for a certain pattern or construction, but they don’t tell us how those features and patterns
are related.
● How to develop and evaluate chunkers
● Recursion in linguistic structure

When text is hard to understand, it can be hard to get at the information in that text. NLP is still a long way from being able to build general-purpose meaning representations from unrestricted text. We can make a lot of progress if we instead focus on a small set of questions, or "entity relations," like "where are different facilities located?" or "who works for what company?" This chapter will try to answer the following questions:

Which corpora are best for this kind of work, and how can we use them to train and test our models?

Along the way, we'll use methods from the last two chapters to solve problems with chunking and recognizing named entities.
There are many different ways to get information. Structured data is an important type: in structured data, entities and relationships are set up in a regular and predictable way. For example, we might be interested in the relation between companies and locations, as in this excerpt from a news story:

"... for a BBDO South unit of BBDO Worldwide, which is owned by Omnicom. Ken Haldin, a spokesman for Georgia-Pacific in Atlanta, said that BBDO South ..."

If this location data was saved in Python as a list of tuples (entity, relation, entity), then the question "Which organizations work in Atlanta?" could be answered as follows:

>>> locs = [...
...         ('BBDO South', 'IN', 'Atlanta'),
...         ('Georgia-Pacific', 'IN', 'Atlanta')]
>>> query = [e1 for (e1, rel, e2) in locs if e2 == 'Atlanta']

Figure 7.1: A Simple Pipeline Architecture for a System that Pulls Information
Figure 7.2: Both Token Level and Chunk Level Segmentation and Labelling
In this section, we’ll talk about chunking in more detail. We’ll start by looking at what chunks are and how
they’re shown. We will look at the regular expression and n-gram approaches to chunking and use the
CoNLL-2000 chunking corpus to build and test chunkers. Then, in steps 5 and 6, we’ll go back to the tasks of
recognizing named entities and pulling out their relationships.
First, we'll talk about the task of noun phrase chunking, or NP-chunking, in which we look for chunks that match each noun phrase. One example is text from the Wall Street Journal with its NP chunks marked using square brackets. A sketch of a simple regular-expression NP chunker is shown below.
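As a hedged illustration of the regular-expression approach mentioned above (this particular grammar and tagged sentence follow the NLTK book's NP-chunking example):

>>> grammar = "NP: {<DT>?<JJ>*<NN>}"   # optional determiner, any adjectives, then a noun
>>> cp = nltk.RegexpParser(grammar)
>>> sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
...             ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
>>> print(cp.parse(sentence))
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))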
Now you know what chunking is and how it works, but we haven't told you how to judge chunkers. As usual, this needs a corpus with good annotations. We start by looking at how IOB format can be changed into an NLTK tree, and then we look at how this can be done on a larger scale using a chunked corpus. We'll look at how to rate a chunker's accuracy in relation to a corpus, and then look at some more data-driven ways to find NP chunks. Our main goal will be to cover more ground with a chunker.

IOB Format and the CoNLL 2000 Corpus: How to Read Them

We can load text from the Wall Street Journal that has been tagged and chunked using the IOB notation:

... of IN B-PP
... vice NN B-NP
... chairman NN I-NP
... of IN B-PP
... Carlyle NNP B-NP
... Group NNP I-NP
... , , O
... a DT B-NP
... merchant NN I-NP
... banking NN I-NP
... concern NN I-NP
... . . O
... '''

Figure 7.4

We can get to more chunked text with the help of the NLTK corpus module. The CoNLL 2000 corpus has 270k words of text from the Wall Street Journal. This text is split into "train" and "test" sections and annotated with tags for parts of speech and chunks in the IOB format. Using nltk.corpus.conll2000, we can get to the data. Here's how the 100th sentence of the "train" section of the corpus looks:

>>> from nltk.corpus import conll2000
>>> print(conll2000.chunked_sents('train.txt', chunk_types=['NP'])[99])
(S
  Over/IN
  (NP a/DT cup/NN)
  of/IN
  (NP coffee/NN)
  ,/,
  (NP Mr./NNP Stone/NNP)
  told/VBD
  (NP his/PRP$ story/NN)
  ./.)

Training Chunkers Based on Classifiers
We can use a classifier-based tagger to chunk the sentence, which is one way to use information about what the words mean. Like the n-gram chunker we talked about in the last section, this classifier-based chunker will work by giving each word in a sentence an IOB tag and then turning those tags into chunks. We will use the same method to build a classifier-based tagger as we did to build a part-of-speech tagger in 1.

The classifier-based NP chunker's basic code is shown below. There are two classes in it. The first class is very similar to the ConsecutivePosTagger class from 1.5. The only two differences are that it uses a MaxentClassifier instead of a NaiveBayesClassifier and calls a different feature extractor. The second class is a wrapper around the tagger class that turns it into a chunker. During training, this second class turns the chunk trees in the training corpus into tag sequences. In the parse() method, it turns the tag sequence given by the tagger back into a chunk tree.

class ConsecutiveNPChunkTagger(nltk.TaggerI):

    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = npchunk_features(untagged_sent, i, history)
                train_set.append((featureset, tag))
                history.append(tag)
        self.classifier = nltk.MaxentClassifier.train(
            train_set, algorithm='megam', trace=0)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = npchunk_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return zip(sentence, history)

class ConsecutiveNPChunker(nltk.ChunkParserI):

    def __init__(self, train_sents):
        tagged_sents = [[((w, t), c) for (w, t, c) in
                         nltk.chunk.tree2conlltags(sent)]
                        for sent in train_sents]
        self.tagger = ConsecutiveNPChunkTagger(tagged_sents)

    def parse(self, sentence):
        tagged_sents = self.tagger.tag(sentence)
        conlltags = [(w, t, c) for ((w, t), c) in tagged_sents]
        return nltk.chunk.conlltags2tree(conlltags)
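The feature extractor npchunk_features() is never defined in this excerpt; the simplest version consistent with the calls above (the NLTK book starts with just the current word's POS tag as a feature) is:

>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     return {"pos": pos}    # use only the current POS tag as a feature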
Information extraction systems look through large amounts of unrestricted text for certain types of entities
and relationships and use them to fill well-organized databases. Then, questions can be asked of these
databases to find answers.
A typical information extraction system starts by segmenting, tokenizing, and tagging the text with its parts of
speech. The data that comes out of this is then searched for certain kinds of entities. Lastly, the information
extraction system looks at entities that are mentioned close to each other in the text and tries to figure out if
they have certain relationships.
Entity recognition is often done with chunkers, which segment sequences of one or more tokens and label each chunk with the type of entity it refers to. Common entity types include people, places, dates, times, monetary amounts, and GPEs (geo-political entities).
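As a concrete illustration (this snippet is a sketch added here, not part of the original text), NLTK's off-the-shelf named entity chunker can be applied to a tagged sentence as follows; it assumes the relevant tokenizer, tagger, and chunker models have been downloaded with nltk.download():

import nltk

sent = "WASHINGTON -- In the wake of a string of abuses, the U.S. Congress acted."
tokens = nltk.word_tokenize(sent)
tagged = nltk.pos_tag(tokens)
tree = nltk.ne_chunk(tagged)   # subtrees are labeled with entity types such as GPE or ORGANIZATION
print(tree)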
Chunkers can be made with rule-based systems, like NLTK’s RegexpParser class, or with machine learning
techniques, like the ConsecutiveNPChunker shown in this chapter. When looking for chunks, part-of-speech
tags are often a very important feature.
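For example, a minimal rule-based NP chunker can be written with a single tag pattern over part-of-speech tags (the pattern below is illustrative, not one taken from the original text):

import nltk

# One chunk rule: an optional determiner, any number of adjectives, then one or more nouns.
grammar = r"NP: {<DT>?<JJ>*<NN.*>+}"
cp = nltk.RegexpParser(grammar)

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
print(cp.parse(sentence))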
Even though chunkers are designed to make relatively flat data structures where no two chunks can overlap,
they can be chained together to make structures with multiple levels.
Relation extraction can be done with either rule-based systems or machine-learning systems. Rule-based
systems look for specific patterns in the text that link entities and the words in between, while machine-
learning systems try to learn these patterns automatically from a training corpus.
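A rule-based relation extractor of the kind described here can be sketched with nltk.sem.extract_rels; the regular expression, entity types, and document name below are illustrative choices, and the IEER corpus must be available locally:

import re
import nltk

# Look for ORGANIZATION-in-LOCATION pairs, where the text between the two
# entities matches the pattern (the word "in", but not as part of "-ing" forms).
IN = re.compile(r'.*\bin\b(?!\b.+ing)')
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
        print(nltk.sem.rtuple(rel))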
● Context free grammar
● Parsing with context free grammar

But these methods only scratch the surface of the many rules that sentences have to follow. We need a way to handle the ambiguity that comes with natural language. We also need to be able to deal with the fact that there are an infinite number of possible sentences, and that we can only write finite programs to analyze their structures and figure out what they mean.
Some Grammatical Dilemmas

Language Information and Endless Possibilities

In previous chapters, we talked about how to process and analyze text corpora and how hard it is for NLP to deal with the huge amount of electronic language data that keeps growing every day. Let's look at this information more closely and pretend for a moment that we have a huge corpus of everything said or written in English over, say, the last 50 years. Could we call this group of words "the language of modern English"? There are several reasons why we might say "No."

Remember that in step 3, we asked you to look for the pattern "the of" on the web? Even though it's easy to find examples of this word sequence on the web, like "New man at the of IMG" (http://www.telegraph.co.uk/sport/2387900/New-man-at-the-of-IMG.html), English speakers will say that most of these examples are mistakes and aren't really English at all.

So, we can say that "modern English" is not the same as the very long string of words in our made-up corpus. People who speak English can judge these sequences, and some of them will be wrong because they don't follow the rules of grammar.

Also, it's easy to write a new sentence that most English speakers will agree is good English. One interesting thing about sentences is that they can be put inside of bigger sentences. Take a look at the following:

a. Usain Bolt set a new record for the 100m.

b. The Jamaica Observer said that Usain Bolt set a new record for the 100m.

c. Andre said The Jamaica Observer said that Usain Bolt set a new record in the 100m.

d. Andre might have said that the Jamaica Observer said Usain Bolt broke the 100m record.

If we replaced whole sentences with the letter S, we would see patterns like "Andre said S" and "I think S." These are examples of how to use a sentence to make a longer sentence. We can also use templates like S but S and S when S. With a little creativity, we can use these templates to make some really long sentences. In a Winnie the Pooh story by A.A. Milne, Piglet is completely surrounded by water, which is a great example:

You can guess how happy Piglet was when he finally saw the ship. In later years, he liked to think that he had been in Very Great Danger during the Terrible Flood, but the only real danger he had been in was the last half-hour of his imprisonment, when Owl, who had just flown in, sat on a branch of his tree to comfort him and told him a very long story about an aunt who laid a seagull's egg by mistake. The story went on and on, like this sentence, until Piglet, who was listening out of his window, finally saw the good ship Brain of Pooh (Captain C. Robin and 1st Mate P. Bear) coming over the sea to save him, and he was very happy.

This long sentence has a simple structure of the form S but S when S. We can see from this that language provides constructions which seem to let us extend sentences indefinitely.

Ubiquitous Ambiguity

Animal Crackers, a 1930 movie with Groucho Marx, is a well-known example of ambiguity:

While hunting in Africa, I shot an elephant in my pajamas. How he got into my pajamas, I don't know.

(S
  (NP I)
  (VP
    (VP (V shot) (NP (Det an) (N elephant)))
    (PP (P in) (NP (Det my) (N pajamas)))))

(S
  (NP I)
  (VP
    (V shot)
    (NP (Det an) (N elephant) (PP (P in) (NP (Det my) (N pajamas))))))
Figure 8.1
Notice that none of the words are hard to understand; for example, the word "shot" doesn't refer to using a gun in the first analysis and using a camera in the second. The ambiguity lies in how the words are put together.

Make a note: Think about the following sentences and see if you can come up with two very different ways to understand them: Fighting animals could be dangerous. Visiting relatives can be tiring. Is it because some of the words aren't clear? If not, why is there a lack of clarity?

This chapter talks about grammars and parsing, which are formal and computer-based ways to study and model the things we've been talking about in language. As we'll see, patterns of well-formedness and ill-formedness in a string of words can be understood by looking at the structure and dependencies of the phrase. With grammars and parsers, we can make formal models of these structures. As before, one of the main reasons is to understand natural language. How much more of a text's meaning can we understand if we can reliably recognize the structures of language it uses? Can a program "understand" what it has read in a text well enough to answer simple questions like "what happened" or "who did what to whom?" Like before, we will make simple programs to process annotated corpora and do useful things with them.

What's the Use of Syntax?

In this unit, we showed how to use the frequency information in bigrams to make text that seems fine for short strings of words but quickly turns into gibberish. Here are two more examples we made by finding the bigrams in a children's story:

a. He laughed with me as I watched the bucket slide down his back.

b. The worst part was that it was hard to find out who heard the light.

You know that these sequences are "word salad," but you might not be able to put your finger on exactly what is wrong with them. One benefit of learning grammar is that it gives you a framework for thinking about these ideas and a vocabulary for putting them into words. Let's look more closely at the sequence the worst part and clumsy looking. This looks like a coordinate structure, where two phrases are joined by a coordinating conjunction like and, but, or or. Here's a simple, informal explanation of how coordination works in a sentence:

Coordinate Structure:

If v1 and v2 are both phrases that belong to category X, then "v1 and v2" is also a phrase that belongs to category X.

Here are just a few examples. In the first, two NPs have been put together to make an NP, and in the second, two APs have been put together to make an AP.

a. For me, the ending of the book was (NP the worst part and the best part).

b. When they are on land they are (AP slow and clumsy looking).

We can't put an NP and an AP together, which is why the worst part and clumsy looking is ungrammatical. Before we can make these ideas official, we need to understand the idea of constituent structure.
The idea behind constituent structure is that words combine with other words to make units. The fact that a
group of words can be replaced by a shorter group without making the sentence not make sense shows that
they are a unit. To understand this better, think about the next sentence:
The little bear saw the nice, fat trout in the brook.
The fact that we can substitute He for The little bear shows that the latter sequence is a unit. By contrast, we cannot replace little bear saw in the same way.
In this chapter, we replace longer sequences with shorter ones in a way that doesn’t change the grammar.
Each group of words that makes up a unit can actually be replaced by a single word, leaving only two elements.
the little bear   saw   the fine fat trout   in the brook
He                saw   it                   there
He                ran   there
He                ran

NP  V   NP  PP     He saw it there
NP  VP  PP         He ran there
NP  VP             He ran
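The parser discussion that follows assumes a small context-free grammar covering categories like these. A sketch in NLTK (the exact productions are an assumption here; they mirror the grammar1 that the recursive descent and shift-reduce examples below refer to) might be:

import nltk

# A toy grammar of the kind assumed by the parser examples in this chapter.
grammar1 = nltk.CFG.fromstring("""
  S -> NP VP
  VP -> V NP | V NP PP
  PP -> P NP
  V -> "saw" | "ate" | "walked"
  NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
  Det -> "a" | "an" | "the" | "my"
  N -> "man" | "dog" | "cat" | "telescope" | "park"
  P -> "in" | "on" | "by" | "with"
""")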
The parser starts with a tree containing the node S; at each stage, it looks at the grammar to find a production
that can be used to grow the tree; when a lexical production is found, its word is compared to the input; once
a complete parse has been found, the parser goes back to look for more parses.
During this process, the parser often has to choose between several possible productions. For example, when it moves from step 3 to step 4, it looks for productions with N on the left-hand side. N -> 'man' is the first of these. When this doesn't match, it backtracks and tries the other N productions in order until it gets to N -> 'dog', which matches the next word in the input sentence. Step 5 shows that it finds a full parse a long time after that. This tree covers the whole sentence and doesn't leave any loose ends. After finding a parse, we can tell
the parser to look for more parses. Again, it will go back and look at other production options to see if any of
them lead to a parse.
>>>rd_parser = nltk.RecursiveDescentParser(grammar1)
>>>sent = 'Mary saw a dog'.split()
>>>for tree in rd_parser.parse(sent):
...     print(tree)
(S (NP Mary) (VP (V saw) (NP (Det a) (N dog))))
There are three main problems with recursive descent parsing. First, left-recursive productions like NP -> NP PP send it into an endless loop. Second, the parser spends a lot of time considering words and structures that don't belong in the sentence it is trying to analyze. Third, the backtracking process may throw away parsed parts that will have to be rebuilt later. For example, if you backtrack over VP -> V NP, the subtree you made for the NP will be thrown away. If the parser then tries VP -> V NP PP, the NP subtree must be built again from scratch.

Recursive descent parsing is a kind of top-down parsing. Before looking at the input, top-down parsers use a grammar to predict what it will be. But since the input sentence is always available to the parser, it would make more sense to start from the input sentence. This is called "bottom-up parsing," and we'll look at an example in the next section.

Shift-Reduce Parsing

Shift-reduce parsers are a simple type of bottom-up parser. Like all bottom-up parsers, a shift-reduce parser tries to find sequences of words and phrases that match the right-hand side of a grammar production and replace them with the left-hand side, until the whole sentence is reduced to an S.

The shift-reduce parser keeps adding the next word from the input stream to a stack (4.1). This is called the shift operation. If the top n items on the stack match the n items on the right-hand side of a production, they are all taken off the stack and the item on the left-hand side of the production is pushed onto the stack. This replacement of the top n items with a single item is the reduce operation. The reduce operation can only be applied to the top of the stack; items lower in the stack must be reduced before later items are pushed on. The parser is done when all of the input has been consumed and there is only one item left on the stack: an S node at the top of a parse tree. During these steps, the shift-reduce parser builds a parse tree: each time it pops n items off the stack, it combines them into a partial parse tree and pushes that back onto the stack. With the graphical demonstration nltk.app.srparser(), we can see how the shift-reduce parsing algorithm works; the demonstration steps through the parser's operation in six stages.
ShiftReduceParser() is a simple way for NLTK to implement a shift-reduce parser. This parser doesn’t do any
backtracking, so even if there is a parse for a text, it might not find it. Also, it will only find one parse, even if
there are more than one. We can give an optional trace parameter that tells the parser how much information
to give about the steps it takes to read a text:
>>>sr_parser = nltk.ShiftReduceParser(grammar1)
>>>sent = 'Mary saw a dog'.split()
>>>for tree in sr_parser.parse(sent):
...     print(tree)
(S (NP Mary) (VP (V saw) (NP (Det a) (N dog))))
Make a note: You can run the above parser in tracing mode to see the sequence of shift and reduce operations, using sr_parser = nltk.ShiftReduceParser(grammar1, trace=2).
A shift-reduce parser can get stuck and fail to find a parse, even if the sentence it is given is correct from a
grammar point of view. When this happens, there are no more inputs, and the stack has things that can’t be
turned into a S. The problem is that there are choices that were made earlier that the parser can’t take back
(although users of the graphical demonstration can undo their choices). The parser has to make two kinds
of decisions: (a) which reduction to do when there is more than one option, and (b) whether to shift or reduce
when both options are available.
A shift-reduce parser can be made to work with policies that help solve these kinds of problems. For example,
it may handle shift-reduce conflicts by only shifting when no reductions are possible, and it may handle
reduce-reduce conflicts by favoring the reduction operation that removes the most items from the stack. A
“lookahead LR parser,” which is a more general version of a shift-reduce parser, is often used in programming
language compilers.
Shift-reduce parsers are better than recursive descent parsers because they only put together structure that
matches the words in the input. Also, they only build each substructure once. For example, NP(Det(the),
N(man)) is only built and put on the stack once, no matter if it will be used by the VP -> V NP PP reduction or
the NP -> NP PP reduction.
Summary
● Sentences have their own structure, which can be shown with a tree. Recursion, heads, complements, and
modifiers are all important parts of constituent structure.
● A grammar is a formal way to describe whether or not a given phrase can have a certain structure of parts
or dependencies.
● Given a set of syntactic categories, a context-free grammar uses a set of productions to show how a phrase of category A can be broken down into a sequence of smaller parts α1 ... αn.
● A dependency grammar uses "productions" to specify the dependents of a given lexical head.
● When a sentence can be understood in more than one way, this is called syntactic ambiguity (e.g.
prepositional phrase attachment ambiguity).
● A parser is a method for finding one or more trees that match a sentence that is correct in terms of
grammar.
● The recursive descent parser is a simple top-down parser that tries to match the input sentence by
recursively expanding the start symbol (usually S) with the help of grammar productions. This parser can’t
deal with left-recursive productions, such as NP -> NP PP. It is inefficient because it expands categories
without checking to see if they work with the input string and because it expands the same non-terminals
over and over again and then throws away the results.
● The shift-reduce parser is a simple bottom-up parser. It puts input on a stack and tries to match the items
at the top of the stack with the grammar productions on the right. Even if there is a valid parse for the input,
this parser is not guaranteed to find it. It also builds substructure without checking if it is consistent with
the grammar as a whole.
● Context free grammar
● Parsing with context free grammar

Instead of treating grammatical categories as atomic labels, we break them up into structures like dictionaries, where each feature can have a number of different values.
Processing Feature Structures

In this section, we'll show you how to build and change feature structures in NLTK. We will also talk about unification, which is the basic operation that lets us combine the information in two different feature structures.

In NLTK, the FeatStruct() constructor is used to declare a feature structure. The values of atomic features can be either strings or numbers.

>>>fs1 = nltk.FeatStruct(TENSE='past', NUM='sg')
>>>print(fs1)
[ NUM   = 'sg'   ]
[ TENSE = 'past' ]

A feature structure can be indexed like a dictionary, and new features can be added by assignment. Here fs1 is a structure with person, number, and gender features, and we add a CASE feature to it:

>>>fs1 = nltk.FeatStruct(PER=3, NUM='pl', GND='fem')
>>>print(fs1['GND'])
fem
>>>fs1['CASE'] = 'acc'

We can also define a feature structure whose AGR feature is itself a feature structure:

>>>fs2 = nltk.FeatStruct(POS='N', AGR=fs1)
>>>print(fs2)
[       [ CASE = 'acc' ] ]
[ AGR = [ GND  = 'fem' ] ]
[       [ NUM  = 'pl'  ] ]
[       [ PER  = 3     ] ]
[                        ]
[ POS = 'N'              ]
>>>print(fs2['AGR'])
[ CASE = 'acc' ]
[ GND  = 'fem' ]
[ NUM  = 'pl'  ]
[ PER  = 3     ]
>>>print(fs2['AGR']['PER'])
3

Seeing feature structures as graphs, more specifically directed acyclic graphs, is often helpful; the graph carries the same information as the AVM above. The names of the features are written as labels on the directed arcs, and the values of the features are written as labels on the nodes that the arcs point to.
Figure 9.2
Now, let’s say Lee is married to a woman named Kim, and Kim lives at the same address as Lee.
Figure 9.3
But instead of putting the address information twice in the feature structure, we can “share” the same sub-
graph across multiple arcs:
Figure 9.4
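In NLTK's bracketed notation this kind of sharing is written with a numbered tag; a small sketch (the names and values follow the Lee/Kim example above and are illustrative) looks like this:

import nltk

# Structure sharing (re-entrancy): the tag (1) labels the shared ADDRESS value,
# and ADDRESS->(1) points back at that same sub-structure instead of copying it.
fs = nltk.FeatStruct("""[NAME='Lee',
                         ADDRESS=(1)[NUMBER=74, STREET='rue Pascal'],
                         SPOUSE=[NAME='Kim', ADDRESS->(1)]]""")
print(fs)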
V[SUBCAT=clause, TENSE=pres, NUM=sg] -> ...

... asked for. But walk will only occur in VPs that ...
Figure 9.5
This means that the verb can go with three arguments. The NP on the far left of the list is the subject NP. Everything else — in this case an NP followed by a PP — is a subcategorized-for complement. When a verb like "put" is combined with the right complements, the SUBCAT requirements are discharged, and only a subject NP is still needed. This category, which corresponds to what people usually think of as VP, could be shown as follows:

V[SUBCAT=<NP>]

Lastly, a sentence is a kind of verbal category that needs no more arguments at all, so its SUBCAT value is the empty list. The tree in the next figure shows the analysis of "Kim put the book on the table."
Figure 9.6
In the last section, we talked about how we could say more general things about verb properties by separating
subcategorization information from the main category label. The following is another trait of this kind: Heads
of phrases in category VP are expressions from category V. In the same way, NPs have Ns as their heads, APs have As as their heads, and PPs have Ps as their heads. Not all phrases have heads. For example, it's common
to say that coordinate phrases (like “the book and the bell”) don’t have heads. However, we’d like our grammar
formalism to show the parent-child relationship where it makes sense. V and VP are just atomic symbols
right now, and we need to find a way to connect them using their features (as we did earlier to relate IV and
TV).
Figure 9.7
N is the head of the structure (a), and N’ and N” are projections of N. N” is the maximum projection, and N is
sometimes called the zero projection. One of the main ideas behind X-bar syntax is that all of its parts have
the same structure. Using X as a variable over N, V, A, and P, we say that directly subcategorized complements
of a lexical head X are always placed as siblings of the head, while adjuncts are placed as siblings of the
intermediate category, X’. So, the arrangement of the two P” add-ons in the next figure is different from that of
the P” complement in (a). The productions in next figure show how feature structures can be used to encode
bar levels. The nested structure in the last figure is made by expanding N[BAR=1] twice with the recursive
rule.
Summary
● The conventional categories of a context-free grammar are atomic symbols. Feature structures are used to manage subtle differences efficiently, avoiding the need for a vast number of atomic categories.
● Using variables in place of feature values allows us to impose constraints in grammar productions, enabling
interdependent realization of different feature specifications.
● Typically, we assign fixed values to features at the lexical level and ensure that the values of features in
phrases match those of their children.
● Feature values can be either simple or complex. Boolean values, denoted as [+/- f], are one type of atomic
value.
● When two entities share the same value (either atomic or complex), we refer to the structure as re-entrant.
● In feature structures, a path is a tuple of features that corresponds to a sequence of arcs from the graph's root.
● Two paths are considered the same if they lead to the same value.
● Feature structures are partially ordered by subsumption: FS0 subsumes FS1 when all the information in FS0 is also present in FS1.
● The unification of two structures FS0 and FS1, if it succeeds, is the feature structure FS2 that contains all the information from FS0 and FS1.
● If unification adds information to a path π in FS, it also adds information to every equivalent path π’.
● Feature structures allow us to provide concise analyses of various linguistic phenomena, such as verb
subcategorization, inversion constructions, unbounded dependency constructions, and case government.
Querying a Database
Imagine that we have a program that lets us type in a question in natural language and gives us the right answer back:

a. Which country is Athens in?

b. Greece.
How hard would it be to write a program like this? And can we just use the same methods we’ve seen in this
book so far, or do we need to learn something new? In this section, we’ll show that it’s pretty easy to solve
the task in a limited domain. But we will also see that if we want to solve the problem in a more general way,
we need to open up a whole new set of ideas and methods that have to do with how meaning is represented.
So, let’s start by assuming we have structured data about cities and countries. To be more specific, we’ll use
a database table whose first few rows are shown.
Note

The information shown in the table comes from the Chat-80 system (Warren & Pereira, 1982). The population figures are given in thousands; keep in mind that the data used in these examples dates back to the 1980s.

city_table (extract): city, population (in thousands), country

berlin       3481   east_germany
birmingham   1112   united_kingdom

SQL, which stands for "Structured Query Language," is a language used to get information from and manage relational databases. If you want to learn more about SQL, you can use the website http://www.w3schools.com/sql/.

The feature-based grammar "sql0.fcfg" shows how to assemble a meaning representation for a sentence in parallel with parsing it. Here are a few of its productions:

>>>nltk.data.show_cfg('grammars/book_grammars/sql0.fcfg')
% start S
...
PP[SEM=(?p + ?np)] -> P[SEM=?p] NP[SEM=?np]
AP[SEM=?pp] -> A[SEM=?a] PP[SEM=?pp]
NP[SEM='Country="greece"'] -> 'Greece'
NP[SEM='Country="china"'] -> 'China'
Together, "sql0.fcfg" and the NLTK Earley parser let us translate English into SQL: the SQL translation for the whole sentence is assembled from the translations of its components, and the component meanings come from the grammar. We join the pieces and print the resulting query:

>>>q = ' '.join(answer)
>>>print(q)
SELECT City FROM city_table WHERE Country="china"

Note

Your Turn: Run the parser with maximum tracing turned on (cp = load_parser('grammars/book_grammars/sql0.fcfg', trace=3)) and watch how the values of "sem" change as complete edges are added to the chart.

Finally, we run the query against the "city.db" database and get some results:

canton chungking dairen harbin kowloon mukden peking shanghai sian tientsin

Since each row r is a one-element tuple, we print out its member.

We defined a task where the computer returns useful data in response to a natural language query, and we implemented it by translating a small subset of English into SQL. Python can execute SQL queries against a database, so our NLTK code "understands" queries like "What cities are in China." Calling this natural language understanding is like saying that translating the Dutch sentence "Margrietje houdt van Brunoke" into English shows that you understand Dutch: if you know the meanings of the words and how they're combined, you could say the sentence means "Margrietje loves Brunoke." An observer, Olga, may conclude that you understand. But Olga must know English; your Dutch-to-English translation won't persuade her if she doesn't. We'll revisit this point soon.

We also "hard-wired" a lot of database details into the grammar. We need the table and field names (e.g., "city_table"), so if our database had the same rows of data but different table and field names, the SQL queries wouldn't run. Equally, we could have stored our data in a different format, such as XML, and then translated our English queries into an XML query language instead of SQL. These considerations suggest that we should translate English into something more abstract and general than SQL.

To illustrate, consider another English question and its translation:

a. Which Chinese cities have above 1,000,000 people?

b. SELECT City FROM city_table WHERE Country="china" AND Population > 1000

Your Turn: Extend "sql0.fcfg" to translate (a) into (b), then check the query's results. Before tackling conjunction, you may find it easiest to use the grammar to answer questions like "What cities have populations over 1,000,000?" Compare your answer to "grammars/book_grammars/sql1.fcfg" in the NLTK data distribution.

In the situation we have been describing, there are two entities, Margrietje and her doll Brunoke, and the two entities stand in a love relationship: "Margrietje" refers to Margrietje, "Brunoke" to Brunoke, and "houdt van" to the love relationship. If this is how things are in situation s, then the sentence is true in s. With this we have introduced two semantic concepts: declarative sentences can be true or false, and definite noun phrases and proper nouns refer to things in the world.
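For reference, the end-to-end pipeline sketched in this section can be written out as follows (a sketch following the steps described above; it assumes the book grammars and the Chat-80 city database have been installed via nltk.download()):

import nltk
from nltk import load_parser
from nltk.sem import chat80

# Parse an English question with the feature grammar, read off the SEM values,
# join them into an SQL string, and run it against the city database.
cp = load_parser('grammars/book_grammars/sql0.fcfg')
query = 'What cities are located in China'
trees = list(cp.parse(query.split()))
answer = trees[0].label()['SEM']
answer = [s for s in answer if s]          # drop empty semantic fragments
q = ' '.join(answer)
print(q)                                    # SELECT City FROM city_table WHERE Country="china"

rows = chat80.sql_query('corpora/city_database/city.db', q)
for r in rows:
    print(r[0], end=" ")                    # canton chungking dairen ...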
We started by translating (1a) into an SQL query, but the meaning of the question is independent of any query language. We interpreted it using ... (the Principle of Compositionality; see Gleitman & Liberman, 1995). Such representations also allow us to reason effectively; we may inquire whether a group of sentences is consistent, for instance.
Now, we want to integrate semantic representation construction with parsing. (29) illustrates the type of
analyses we want to build.
Figure 10.2
The root sem value represents the whole sentence, whereas smaller sem values indicate sentence parts. Since sem values are special, they are enclosed in angle brackets.

How do we design grammatical rules that get this result? Our technique will be similar to that used for the grammar "sql0.fcfg" at the beginning of this chapter. We will assign semantic representations to lexical nodes and then assemble each phrase's semantic representation from those of its child nodes. In this case, though, function application will replace string concatenation. Here is the rule for S:

S[SEM=<?vp(?np)>] -> NP[SEM=?np] VP[SEM=?vp]

The NP and VP children supply the sem values ?np and ?vp, and the rule constructs the S parent's sem value by applying ?vp as a function expression to ?np; ?vp must therefore denote a function whose domain includes ?np. (When sem is a variable, the angle brackets are omitted.) This is compositionality in semantics at work.

To finish the grammar, we use the rules below:

VP[SEM=?v] -> IV[SEM=?v]
NP[SEM=<cyril>] -> 'Cyril'
IV[SEM=<\x.bark(x)>] -> 'barks'

The VP rule says that the semantics of the parent is the same as the semantics of its head child. The meanings of "Cyril" and "barks" are given by non-logical constants introduced in the two lexical rules. In the entry for "barks," there is an extra notation that we will explain in a moment.

Before going into more detail about compositional semantic rules, we need to add the λ calculus to our toolbox. This gives us a very useful way to combine first-order logic expressions as we put together a representation of the meaning of an English sentence.
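Putting these rules together, a runnable sketch looks like the following (the grammar string is assembled here from the fragments shown above, so treat it as an approximation rather than the chapter's own grammar file):

import nltk
from nltk.grammar import FeatureGrammar
from nltk.parse import FeatureChartParser

gram = FeatureGrammar.fromstring(r"""
  % start S
  S[SEM=<?vp(?np)>] -> NP[SEM=?np] VP[SEM=?vp]
  VP[SEM=?v] -> IV[SEM=?v]
  NP[SEM=<cyril>] -> 'Cyril'
  IV[SEM=<\x.bark(x)>] -> 'barks'
""")

parser = FeatureChartParser(gram)
for tree in parser.parse('Cyril barks'.split()):
    print(tree.label()['SEM'])   # should print the reduced expression bark(cyril)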
Summary
● First-order logic is a suitable language for representing natural language meaning in a computer setting.
It is flexible enough to represent many useful parts of natural language meaning, and efficient theorem
provers exist for reasoning with first-order logic. (In natural language semantics, there are also a number
of things that are thought to require more powerful logical mechanisms.)
● Besides translating natural language sentences into first-order logic, we can also examine models of first-
order formulas to determine when these sentences are true.
● To build meaning representations compositionally, we supplement first-order logic with the λ calculus.
● In the λ-calculus, β-reduction is the same thing as applying a function to an argument. In terms of syntax,
it means replacing a variable bound by λ in the function expression with the argument expression in the
function application.
● A key part of building a model is providing a valuation that assigns interpretations to the non-logical constants. These can be interpreted as either n-ary predicates or individual constants.
● An expression that has one or more free variables is called an “open expression.” Only when their free
variables receive values from a variable assignment does an open expression obtain a meaning.
● Quantifiers are interpreted by constructing, for a formula φ[x] open in variable x, the set of individuals
that make φ[x] true when an assignment g assigns them as the value of x. The quantifier then places
constraints on that set.
● A closed expression is one that doesn't have any free variables; that is, all of its variables are bound. A closed sentence is true or false regardless of the variable assignment used.
● If the only difference between two formulas is the name of the variable bound by the binding operator
(either λ or a quantifier), then they are equivalent. α-conversion is what happens when you change the
name of a variable that is bound in a formula.
● When there are two quantifiers Q1 and Q2 nested within each other in a formula, the outermost quantifier
Q1 is said to have wide scope (or scope over Q2). Often, it’s not clear what the meaning of the quantifiers
in an English sentence is.
● English sentences can be associated with a semantic representation by treating sem as a feature in a
feature-based grammar. The sem value of a complex expression typically involves functional application
of the sem values of the component expressions.
The TIMIT corpus of read speech was the first database of annotated speech that was widely shared, and
it is very well put together. TIMIT was made by a group of companies, including Texas Instruments and MIT,
which is where its name comes from. It was made so that data could be used to learn about acoustics and
phonetics and to help develop and test automatic speech recognition systems.
Like the Brown Corpus, which has a good mix of different types of texts and sources, TIMIT has a good mix of
different dialects, speakers, and materials. For each of the eight dialect regions, 50 male and female speakers
of a range of ages and levels of education each read ten carefully chosen sentences. Two sentences, read by
all the speakers, were meant to show how different their dialects were:
a. She had your dark suit in greasy wash water all year.

b. Don't ask me to carry an oily rag like that.
The rest of the sentences were chosen because they had a lot of phonemes (sounds) and a wide range of
diphones (phone bigrams). Also, the design strikes a balance between having multiple speakers say the
same sentence so that differences between speakers can be seen and having a wide range of sentences
in the corpus so that diphones are covered as much as possible. Each speaker reads five sentences that
are also read by six other speakers (for comparability). The last three sentences each speaker read were
different from the others (for coverage).
The phones() method can be used to get to each item’s phonetic transcription. We can use the usual method
to get to the word tokens that match. Both access methods have an optional argument called “offset” that
gives the start and end points in the audio file where the corresponding span begins and ends.
>>>phonetic = nltk.corpus.timit.phones(‘dr1-fvmh0/sa1’)
>>>phonetic
[‘h#’, ‘sh’, ‘iy’, ‘hv’, ‘ae’, ‘dcl’, ‘y’, ‘ix’, ‘dcl’, ‘d’, ‘aa’, ‘kcl’,
‘s’, ‘ux’, ‘tcl’, ‘en’, ‘gcl’, ‘g’, ‘r’, ‘iy’, ‘s’, ‘iy’, ‘w’, ‘aa’,
‘sh’, ‘epi’, ‘w’, ‘aa’, ‘dx’, ‘ax’, ‘q’, ‘ao’, ‘l’, ‘y’, ‘ih’, ‘ax’, ‘h#’]
>>>nltk.corpus.timit.word_times(‘dr1-fvmh0/sa1’)
In addition to this text data, TIMIT has a lexicon with the correct way to say each word, which can be compared
to a particular sentence:
>>>timitdict = nltk.corpus.timit.transcription_dict()
>>>timitdict['greasy'] + timitdict['wash'] + timitdict['water']
>>>phonetic[17:30]
['g', 'r', 'iy', 's', 'iy', 'w', 'aa', 'sh', 'epi', 'w', 'aa', 'dx', 'ax']

This gives us an idea of what a speech processing system would have to do to produce or recognize speech in this dialect. Lastly, TIMIT includes demographic information about the speakers, which makes it possible to study voice, social, and gender traits in great detail.

Lastly, keep in mind that even though TIMIT is a speech corpus, its transcriptions and other data are just text that can be processed by programs just like any other text corpus, so many of the computational methods explained in this book are useful here too. Also, keep in mind that all of the types of data in the TIMIT corpus fall into the two main groups of lexicon and text, which we will talk about in more detail below. Even the demographic information about the speakers is just another example of the lexicon data type.

This last fact isn't as surprising when you think about how text retrieval and databases, which are two areas of computer science that deal with data management, are based on text and record structures.

The original linguistic event, which is captured as an audio recording, is very different from the annotations of that event, which make it possible to search the corpus for things that weren't thought of when it was recorded. Text corpora are similar, in that the original text usually comes from somewhere else and is treated as a fixed artifact. Even a simple change like ...

Life Cycle of a Corpus

Corpora don't come into the world fully formed. Instead, they are made through careful planning and input from many people over a long period of time. Raw data needs to be gathered, cleaned, written down, and stored in a way that makes sense. There are different ways to annotate, and some of them require expert knowledge of the morphology or syntax of the language. At this stage, success depends on creating a good workflow with the right tools and format converters. Quality control procedures can be put in place to identify annotations that don't match up and to ensure that annotators agree on as much as possible. As the task is significant and complicated, putting together a large corpus can take years and tens or hundreds of person-years of work. In this section, we'll quickly go over the different stages of a corpus's life cycle.

Three Ways to Make a Corpus

In one kind of corpus, the design changes over time as the creator explores different aspects. This is how traditional "field linguistics" works. Information from elicitation sessions is analyzed as it is gathered, and questions that arise while analyzing today's elicitation sessions are often used to plan tomorrow's elicitation sessions. The resulting corpus is then used for research in the following years and may be kept as an archive indefinitely. The popular program Shoebox, which has been around for over 20 years and has now been re-released as Toolbox, is a good example of how computerization assists this kind of work (see [4]). Other software tools, like simple word processors and spreadsheets, are often used to gather the data. In the next section, we'll look at how to use these sources to acquire data.

Another corpus creation scenario is typical of experimental research. In this kind of research, a body of carefully designed material is collected from various human subjects and then analyzed to test a hypothesis or develop a technology. It has become common for these databases to be shared and reused within a lab or company, and they are often also made public. The "common task" method of managing research, which has become the norm in government-funded language technology research programs over the past 20 years, is based on corpora like these. We've encountered many of these kinds of corpora in earlier chapters. In this chapter, we'll learn how to write Python programs to perform the curation work required before these kinds of corpora can be published.

Lastly, there are efforts to put together a "reference corpus" for a particular language, like the American National Corpus (ANC) and the British National Corpus (BNC). Here, the goal is to make a comprehensive record of all the different ways a language can be used, written, and spelled. Aside from the sheer size of the task, there is also much reliance on automatic annotation tools and post-editing to correct any mistakes. However, we can write programs to identify and correct mistakes and check for balance in the corpus.

Quality Control

Creating a high-quality corpus involves having good tools for both automatic and manual data preparation. However, making a high-quality corpus also depends on everyday things like documentation, training, and workflow. Annotation guidelines explain the task and how it should be done. They might be updated frequently to handle tricky situations and add new rules to improve annotation consistency. Annotators need to be taught how to perform their jobs, including how to handle situations not covered in the rules. A workflow must be established, possibly with the help of software, to track which files have been initialized, annotated, validated, manually checked, etc. There could be multiple layers of annotation, each done by a different expert. When there is doubt or disagreement, it may be necessary to make a decision. When quality is of utmost importance, the entire corpus can be annotated twice, and any differences can be resolved by an expert.

Best practice is to report the level of agreement between annotators for a corpus (e.g., by annotating 10% of the corpus twice). This score gives a good idea of how well any automatic system trained on this corpus should perform.

Caution!

The interpretation of an inter-annotator agreement score should be done carefully, as the difficulty of annotation tasks varies significantly. For example, 90% agreement would be a terrible score for tagging parts of speech but a great score for labeling semantic roles.

The Kappa coefficient K measures how much two people agree about a category, taking into account how much they would agree by chance. For example, let's say an item needs to be annotated, and there are four equally likely ways to code it. Then, if two people were to code at random, they should agree 25% of the time. So, K = 0 will be given to a level of agreement of 25%, and the scale goes up as the level of agreement gets better. For a 50% agreement, K = 0.333, since 50 is a third of the way between 25 and 100. There are many other ways to measure agreement. See help(nltk.metrics.agreement) for more information.
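The arithmetic behind the kappa score described above can be written out directly; this small helper is an illustration added here rather than an NLTK function:

def kappa(observed_agreement, chance_agreement):
    """Cohen's kappa: how far observed agreement exceeds chance agreement,
    as a fraction of the maximum possible improvement over chance."""
    return (observed_agreement - chance_agreement) / (1 - chance_agreement)

# Four equally likely codes, so chance agreement is 0.25.
print(kappa(0.25, 0.25))   # 0.0       -- no better than chance
print(kappa(0.50, 0.25))   # 0.3333... -- a third of the way from chance to perfect
print(kappa(1.00, 0.25))   # 1.0       -- perfect agreement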
The Web has a lot of information that can be used to study language. We've already talked about how to access single files, RSS feeds, and search engine results. But there are times when we want to get a large amount of text from the web.

The easiest way to do this is to obtain a published collection of web text. A list of resources is maintained by the ACL Special Interest Group on Web as Corpus (SIGWAC) at http://www.sigwac.org.uk/. Using a well-defined web corpus has the advantage that it is well-documented, stable, and allows you to conduct experiments repeatedly.

If the content you want is available only on a certain website, there are many tools, like GNU Wget (http://www.gnu.org/software/wget/), that can help you obtain it. For the most freedom and control, you can use a web crawler like Heritrix (http://crawler.archive.org/). Crawlers give you fine-grained control over where to look, which links to follow, and how to organize the results (Croft, Metzler, & Strohman, 2009). For example, if we want to create a bilingual text collection with pairs of documents that are the same in each language, the crawler needs to figure out how the site is set up to determine how the documents relate to each other. It also needs to organize the downloaded pages in a way that makes it easy to find the connections. While it might be tempting to write your own web crawler, there are many potential challenges, such as detecting MIME types, converting relative URLs to absolute URLs, and avoiding getting stuck in cyclic link structures.

Finding Information in Word Processing Files

Word processing software is often used to prepare texts and dictionaries by hand in projects that don't have much computing infrastructure. These kinds of projects often provide templates for entering data, but the word processing software doesn't check to ensure the data is set up correctly. For instance, each text might have to have a title and a date. Similarly, each lexical entry may have some required fields. As the data gets bigger and more complicated, it may take more time to ensure it stays consistent.

How can we extract the data from these files so we can work with it in other programs? Additionally, how can we check the content of these files to help authors create well-structured data, maximizing the quality of the data in the context of the original authoring process?

Consider a dictionary where each entry has a part-of-speech field drawn from a set of 20 possibilities, displayed after the pronunciation field and rendered in 11-point bold. No regular word processor has search or macro functions that can check whether all parts of speech have been entered and shown correctly, so this task requires a lot of careful checking by hand. If the word processor allows the document to be saved in a non-proprietary format, such as text, HTML, or XML, we can sometimes write programs to perform this checking automatically.

Consider this part of a dictionary entry: "sleep (v.i.), a condition of body and mind ...". When the document is saved as HTML by the word processor, the underlying markup for this entry looks like the following excerpt:

[<span class=SpellE>sli:p</span>]
<span style='mso-spacerun:yes'></span>
<b><span style='font-size:11.0pt'>v.i.</span></b>
<span style='mso-spacerun:yes'></span>
<i>a condition of body and mind ...<o:p></o:p></i>
</p>

The entry is represented as an HTML paragraph, using the <p> element, and the part of speech appears inside a <span style='font-size:11.0pt'> element. The set of legal parts of speech, legal_pos, is set up by the next program. Then it extracts all of the 11-point content from the dict.htm file and puts it in the set used_pos. Notice that the search pattern contains a parenthesized sub-expression; re.findall will only return material that matches this sub-expression. Lastly, the program computes the set of illegal parts of speech as used_pos - legal_pos.

Once we know that the data is in the right format, we can write other programs to convert the data into a different format. The program in 3.1 below uses the BeautifulSoup library to strip out the HTML markup, pull out the words and their pronunciations, and write the results in "comma-separated value" (CSV) format.

import re
from bs4 import BeautifulSoup

def lexical_data(html_file, encoding="utf-8"):
    SEP = '_ENTRY'
    html = open(html_file, encoding=encoding).read()
    html = re.sub(r'<p', SEP + '<p', html)
    text = BeautifulSoup(html, 'html.parser').get_text()
    text = ' '.join(text.split())
    for entry in text.split(SEP):
        if entry.count(' ') > 2:
            yield entry.split(' ', 3)

>>>import csv
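The part-of-speech validation program described above is not reproduced in this excerpt; a sketch along the lines the text describes (the file name dict.htm, its encoding, and the set of legal tags are assumptions) might be:

import re

legal_pos = set(['n', 'v.t.', 'v.i.', 'adj', 'det'])
# The parenthesized group captures the tag that appears inside 11-point spans.
pattern = re.compile(r"'font-size:11.0pt'>([a-z.]+)<")
document = open("dict.htm", encoding="windows-1252").read()
used_pos = set(re.findall(pattern, document))
illegal_pos = used_pos.difference(legal_pos)
print(list(illegal_pos))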
One approach to address this problem has been to work on developing a generic format that can accommodate
a wide range of annotation types (see [8] for examples). The challenge for NLP is to create programs that
can handle the diversity of these formats. For instance, if the programming task involves tree data and the
file format allows any directed graph, then the input data must be validated to ensure tree properties such
as rootedness, connectedness, and acyclicity. If the input files contain additional layers of annotation, the
program should know how to ignore them when loading the data without invalidating or deleting them when
saving the tree data back to the file.
Another solution has been to write one-off scripts to convert the formats of corpora. Many NLP researchers
have numerous such scripts in their files. The idea behind NLTK’s corpus readers is that the work of parsing
a corpus format should only need to be done once (per programming language).
Instead of trying to establish a common format, we believe it would be more beneficial to create a common
interface (cf. nltk.corpus). Consider the case of treebanks, which are crucial for NLP work. A phrase structure
tree can be saved in a file in various ways, such as nested parentheses, nested XML elements, a dependency
notation with a (child-id, parent-id) pair on each line, or an XML version of the dependency notation. However,
the way the ideas are represented is almost the same in each case. It is much easier to develop a common
interface that allows application programmers to write code using methods like children(), leaves(), depth(),
and so on to access tree data. Note that this approach aligns with standard computer science practices, such
as abstract data types, object-oriented design, and the three-layer architecture. The last one, from the world
of relational databases, enables end-user applications to use a common model (the “relational model”) and
a common language (SQL) to abstract away the specifics of file storage and accommodate new filesystem
technologies without impacting end-user applications. Similarly, a common corpus interface ensures that
application programs do not need to be aware of specific data formats.
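For instance, NLTK's treebank reader already hides the on-disk format behind Tree objects; the method names in this sketch (leaves(), height(), subtrees()) are NLTK's actual names for the kinds of operations the text refers to as children(), depth(), and so on:

from nltk.corpus import treebank

# Application code works with Tree objects, independently of how the
# treebank happens to be stored in files.
t = treebank.parsed_sents('wsj_0001.mrg')[0]
print(t.leaves())       # the words of the sentence
print(t.height())       # the depth of the tree
for subtree in t.subtrees():
    if subtree.label() == 'NP':
        print(' '.join(subtree.leaves()))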
When creating a new corpus for distribution in this context, it is best to use an already popular format whenever
possible. When this is not possible, the corpus could come with software, like an nltk.corpus module, that
supports existing interface methods.
Summary
● Annotated texts and lexicons are essential types of data found in most corpora. Texts have a structure
based on time, while lexicons have a structure based on records.
● In a corpus's lifecycle, data collection, annotation, quality control, and publication are all crucial steps. The lifecycle continues after publication, because the corpus is modified and enriched in the course of research.
● Corpus development requires finding a balance between obtaining a good sample of how language is used
and gathering enough information from any one source or genre to be useful. Due to limited resources, it
is usually not possible to multiply the dimensions of variation.
● XML is a suitable format for storing and exchanging linguistic data, but it doesn’t offer easy solutions to
common data modeling problems.
● Language documentation projects often use the Toolbox format. We can write programs to help manage
Toolbox files and convert them to XML.
● The Open Language Archives Community (OLAC) is a place where language resources can be documented
and found.
a. Overhead the day drives level and grey, hiding the sun by a flight of grey spears. (From As I Lay Dying, by William Faulkner, 1935)
... can be made from the information that was talked about in the last section, similar to the rule-based phonological processing shown by the Sound Pattern of English (Chomsky and Halle, 1968). However, this model was unable ...

... the empiricist view, which says that our primary source of knowledge is the experience of our senses, and that human reason plays a secondary role in reflecting on that experience. Galileo's discovery ...
Only a Toolkit: As stated in the introduction, NLTK is not a system but a toolkit. Many problems will be solved using a combination of NLTK, Python, other Python libraries, and interfaces to external NLP tools and formats.
Summary
● Linguists are often asked about the number of languages they can communicate in. In response, they must
clarify that the primary focus of linguistics is the study of abstract structures shared by languages. This
profound and elusive study goes beyond merely learning as many languages as possible.
● Similarly, computer scientists are frequently questioned about the number of programming languages
they are fluent in. They, too, must explain that the primary focus of computer science is the study of data
structures and algorithms that can be implemented in any programming language. This study is profound
and intricate, surpassing the pursuit of fluency in multiple programming languages.
● Throughout this book, various topics related to Natural Language Processing (NLP) have been discussed,
with most examples presented in English and Python. However, it would be a mistake for readers to
conclude that NLP is solely about developing Python programs for editing English text or manipulating text
in any natural language using any programming language.
● The choice of Python and English was primarily for convenience, and the focus on programming was merely
a means to an end. That end was to understand data structures and algorithms for handling linguistically
annotated text collections, building new language technologies to serve the information society’s needs,
and ultimately gaining a deeper understanding of the vast richness of human language.