Speech and Language Processing
An Introduction to Natural Language Processing,
Computational Linguistics, and Speech Recognition
Daniel Jurafsky
Stanford University
James H. Martin
University of Colorado at Boulder
Copyright © 2017
Contents
1 Introduction
7 Logistic Regression
7.1 Features in Multinomial Logistic Regression
7.2 Classification in Multinomial Logistic Regression
Bibliography
Author Index
Subject Index
CHAPTER
1 Introduction
Placeholder
CHAPTER
2 Regular Expressions, Text Normalization, Edit Distance
The dialogue above is from ELIZA, an early natural language processing system
that could carry on a limited conversation with a user by imitating the responses
of a Rogerian psychotherapist (Weizenbaum, 1966). ELIZA is a surprisingly simple
program that uses pattern matching to recognize phrases like “You are X” and
translate them into suitable outputs like “What makes you think I am X?”. This simple
technique succeeds in this domain because ELIZA doesn’t actually need to know
anything to mimic a Rogerian psychotherapist. As Weizenbaum notes, this is one
of the few dialogue genres where listeners can act as if they know nothing of the
world. ELIZA’s mimicry of human conversation was remarkably successful: many
people who interacted with ELIZA came to believe that it really understood them
and their problems, many continued to believe in ELIZA’s abilities even after the
program’s operation was explained to them (Weizenbaum, 1976), and even today
such chatbots are a fun diversion.
Of course modern conversational agents are much more than a diversion; they
can answer questions, book flights, or find restaurants, functions for which they rely
on a much more sophisticated understanding of the user’s intent, as we will see in
Chapter 29. Nonetheless, the simple pattern-based methods that powered ELIZA
and other chatbots play a crucial role in natural language processing.
We’ll begin with the most important tool for describing text patterns: the regular
expression. Regular expressions can be used to specify strings we might want to
extract from a document, from transforming “You are X” in ELIZA above, to defining
strings like $199 or $24.99 for extracting tables of prices from a document.
We’ll then turn to a set of tasks collectively called text normalization, in which
regular expressions play an important part. Normalizing text means converting it
to a more convenient, standard form. For example, most of what we are going to
do with language relies on first separating out or tokenizing words from running
text, the task of tokenization. English words are often separated from each other
by whitespace, but whitespace is not always sufficient. New York and rock ’n’ roll
are sometimes treated as large words despite the fact that they contain spaces, while
sometimes we’ll need to separate I’m into the two words I and am. For processing
tweets or texts we’ll need to tokenize emoticons like :) or hashtags like #nlproc.
Some languages, like Chinese, don’t have spaces between words, so word tokenization
becomes more difficult.
2.1 Regular Expressions
Regular expressions are case sensitive; lower case /s/ is distinct from upper
case /S/ (/s/ matches a lower case s but not an upper case S). This means that
the pattern /woodchucks/ will not match the string Woodchucks. We can solve this
problem with the use of the square braces [ and ]. The string of characters inside the
braces specifies a disjunction of characters to match. For example, Fig. 2.2 shows
that the pattern /[wW]/ matches patterns containing either w or W.
The regular expression /[1234567890]/ specifies any single digit. While such
classes of characters as digits or letters are important building blocks in expressions,
they can get awkward (e.g., it’s inconvenient to specify
/[ABCDEFGHIJKLMNOPQRSTUVWXYZ]/
to mean “any capital letter”). In cases where there is a well-defined sequence
associated with a set of characters, the brackets can be used with the dash (-) to specify
any one character in a range. The pattern /[2-5]/ specifies any one of the characters
2, 3, 4, or 5. The pattern /[b-g]/ specifies one of the characters b, c, d, e, f, or
g. Some other examples are shown in Fig. 2.3.
The square braces can also be used to specify what a single character cannot be,
by use of the caret ^. If the caret ^ is the first symbol after the open square brace [,
the resulting pattern is negated. For example, the pattern /[^a]/ matches any single
character (including special characters) except a. This is only true when the caret
is the first symbol after the open square brace. If it occurs anywhere else, it usually
stands for a caret; Fig. 2.4 shows some examples.
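The bracket notation can be tried out directly in Python's re module. The following is a minimal sketch; the sample string and the variable names are made up for illustration and are not from the text.

```python
import re

text = "Woodchucks chased 37 woodchucks"

# Disjunction: [wW] accepts either case of the first letter.
both_cases = re.findall(r"[wW]oodchucks", text)

# Ranges: [0-9] is shorthand for [1234567890], [A-Z] for any capital letter.
digits = re.findall(r"[0-9]", text)
capitals = re.findall(r"[A-Z]", text)

# Negation: a caret right after the open brace inverts the class,
# so this finds every character that is not a lower case letter.
not_lower = re.findall(r"[^a-z]", text)

print(both_cases, digits, capitals, not_lower)
```

Note that the negated class also picks up the spaces, since a space is simply another character that is not in a-z.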
How can we talk about optional elements, like an optional s in woodchuck and
woodchucks? We can’t use the square brackets, because while they allow us to say
“s or S”, they don’t allow us to say “s or nothing”. For this we use the question mark
/?/, which means “the preceding character or nothing”, as shown in Fig. 2.5.
We can think of the question mark as meaning “zero or one instances of the
previous character”. That is, it’s a way of specifying how many of something that
The wildcard is often used together with the Kleene star to mean “any string of
characters”. For example, suppose we want to find any line in which a particular
word, for example, aardvark, appears twice. We can specify this with the regular
expression /aardvark.*aardvark/.
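As a small illustration of the wildcard with the Kleene star, the sketch below filters a list of lines with the aardvark pattern. The two sample lines are invented for the example.

```python
import re

lines = [
    "the aardvark saw another aardvark",
    "one aardvark only",
]

# .* between the two words allows any string of characters (including none).
twice = [line for line in lines if re.search(r"aardvark.*aardvark", line)]

print(twice)
```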
Anchors are special characters that anchor regular expressions to particular places
in a string. The most common anchors are the caret ^ and the dollar sign $. The caret
^ matches the start of a line. The pattern /^The/ matches the word The only at the
start of a line. Thus, the caret ^ has three uses: to match the start of a line, to
indicate a negation inside of square brackets, and just to mean a caret. (What are the
contexts that allow grep or Python to know which function a given caret is supposed
to have?) The dollar sign $ matches the end of a line. So the pattern / $/ (a space
followed by the dollar sign) is a useful pattern for matching a space at the end of a
line, and /^The dog\.$/ matches a line that contains only the phrase The dog. (We have
to use the backslash here since we want the . to mean “period” and not the wildcard.)
There are also two other anchors: \b matches a word boundary, and \B matches
a non-boundary. Thus, /\bthe\b/ matches the word the but not the word other.
More technically, a “word” for the purposes of a regular expression is defined as any
sequence of digits, underscores, or letters; this is based on the definition of “words”
in programming languages. For example, /\b99\b/ will match the string 99 in
There are 99 bottles of beer on the wall (because 99 follows a space) but not 99 in
There are 299 bottles of beer on the wall (since 99 follows a number). But it will
match 99 in $99 (since 99 follows a dollar sign ($), which is not a digit, underscore,
or letter).
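The anchor behavior described above can be checked with a few calls to Python's re.search. This is a minimal sketch; the helper name found is hypothetical and the test strings are taken from or modeled on the examples in the text.

```python
import re

def found(pattern, text):
    """Return True if the pattern matches somewhere in text."""
    return re.search(pattern, text) is not None

# Line anchors.
print(found(r"^The", "The dog barked"))        # The at the start of the line
print(found(r"^The", "See The dog"))           # The elsewhere: no match
print(found(r"^The dog\.$", "The dog."))       # the whole line is "The dog."
print(found(r" $", "trailing space "))         # a space at the end of the line

# Word boundaries.
print(found(r"\bthe\b", "other"))                                   # no match
print(found(r"\b99\b", "There are 99 bottles of beer on the wall"))
print(found(r"\b99\b", "There are 299 bottles of beer on the wall"))  # no match
print(found(r"\b99\b", "$99"))
```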
/the*/ matches theeeee but not thethe. Because sequences have a higher prece-
dence than disjunction, /the|any/ matches the or any but not theny.
Patterns can be ambiguous in another way. Consider the expression /[a-z]*/
when matching against the text once upon a time. Since /[a-z]*/ matches zero or
more letters, this expression could match nothing, or just the first letter o, on, onc,
or once. In these cases regular expressions always match the largest string they can;
we say that patterns are greedy, expanding to cover as much of a string as they can.
There are, however, ways to enforce non-greedy matching, using another meaning
of the ? qualifier. The operator *? is a Kleene star that matches as little text as
possible. The operator +? is a Kleene plus that matches as little text as possible.
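The difference between greedy and non-greedy matching is easy to see in Python, whose re module supports the *? and +? operators. A minimal sketch on the example text (the variable names are illustrative):

```python
import re

text = "once upon a time"

# re.match anchors the attempt at the start of the string.
greedy = re.match(r"[a-z]*", text).group()      # takes as many letters as it can
lazy_star = re.match(r"[a-z]*?", text).group()  # zero letters is enough
lazy_plus = re.match(r"[a-z]+?", text).group()  # one letter is the minimum

print(repr(greedy), repr(lazy_star), repr(lazy_plus))
```

The greedy pattern returns once, the non-greedy star returns the empty string, and the non-greedy plus returns the single letter o.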
RE      Match
*       zero or more occurrences of the previous char or expression
+       one or more occurrences of the previous char or expression
?       exactly zero or one occurrence of the previous char or expression
{n}     n occurrences of the previous char or expression
{n,m}   from n to m occurrences of the previous char or expression
{n,}    at least n occurrences of the previous char or expression
Figure 2.8 Regular expression operators for counting.
Finally, certain special characters are referred to by special notation based on the
backslash (\) (see Fig. 2.9). The most common of these are the newline character
\n and the tab character \t. To refer to characters that are special themselves (like
., *, [, and \), precede them with a backslash (i.e., /\./, /\*/, /\[/, and /\\/).
Here the \1 will be replaced by whatever string matched the first item in paren-
theses. So this will match The bigger they were, the bigger they will be but not The
bigger they were, the faster they will be.
This use of parentheses to store a pattern in memory is called a capture group.
Every time a capture group is used (i.e., parentheses surround a pattern), the
resulting match is stored in a numbered register. If you match two different sets of
parentheses, \2 means whatever matched the second capture group. Thus
/the (.*)er they (.*), the \1er we \2/
will match The faster they ran, the faster we ran but not The faster they ran, the
faster we ate. Similarly, the third capture group is stored in \3, the fourth is \4, and
so on.
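The two-register pattern above can be tested in Python, which uses the same \1 and \2 notation for backreferences. In this sketch the re.IGNORECASE flag is an added assumption, needed only because the example sentences begin with a capital The while the pattern is written in lower case.

```python
import re

pattern = r"the (.*)er they (.*), the \1er we \2"

same = re.search(pattern, "The faster they ran, the faster we ran", re.IGNORECASE)
different = re.search(pattern, "The faster they ran, the faster we ate", re.IGNORECASE)

# The registers hold whatever the two parenthesized groups matched.
print(same.group(1), same.group(2))
print(different)
```

The second search fails because \2 must repeat the text stored in the second register (ran), and the sentence continues with ate instead.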
Parentheses thus have a double function in regular expressions; they are used to
group terms for specifying the order in which operators should apply, and they are
used to capture something in a register. Occasionally we might want to use parentheses
for grouping, but don’t want to capture the resulting pattern in a register. In that
case we use a non-capturing group, which is specified by putting the commands
?: after the open paren, in the form (?: pattern ).
/(?:some|a few) (people|cats) like some \1/
will match some cats like some cats but not some cats like some a few, because \1
refers to the first capturing group, (people|cats), rather than to the non-capturing
(?:some|a few).
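A quick check in Python shows which group \1 refers to when the first parenthesized expression is non-capturing. This is a sketch with illustrative variable names; the test sentences are chosen to exercise the pattern above.

```python
import re

pattern = r"(?:some|a few) (people|cats) like some \1"

repeated = re.search(pattern, "some cats like some cats")
mixed = re.search(pattern, "some cats like some people")
wrong_group = re.search(pattern, "some cats like some a few")

# Register 1 holds the (people|cats) match; (?:some|a few) stores nothing.
print(repeated.group(1))
print(mixed, wrong_group)
```

Only the first sentence matches: \1 has to repeat cats, so neither people nor a few is accepted in the final position.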
Substitutions and capture groups are very useful in implementing simple chat-
bots like ELIZA (Weizenbaum, 1966). Recall that ELIZA simulates a Rogerian
psychologist by carrying on conversations like the following:
User1: Men are all alike.
ELIZA1: IN WHAT WAY
User2: They’re always bugging us about something or other.
ELIZA2: CAN YOU THINK OF A SPECIFIC EXAMPLE
User3: Well, my boyfriend made me come here.
ELIZA3: YOUR BOYFRIEND MADE YOU COME HERE
User4: He says I’m depressed much of the time.
ELIZA4: I AM SORRY TO HEAR YOU ARE DEPRESSED
ELIZA works by having a series or cascade of regular expression substitutions
each of which matches and changes some part of the input lines. The first substitu-
tions change all instances of my to YOUR, and I’m to YOU ARE, and so on. The next
set of substitutions matches and replaces other patterns in the input. Here are some
examples:
s/.* I’M (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
s/.* I AM (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
s/.* all .*/IN WHAT WAY/
s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
Since multiple substitutions can apply to a given input, substitutions are assigned
a rank and applied in order. Creating patterns is the topic of Exercise 2.3, and we
return to the details of the ELIZA architecture in Chapter 29.
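The ranked cascade can be sketched in a few lines of Python. This is only an illustration of the idea, not ELIZA's actual rule set: the input is upper-cased first, the patterns use \b word boundaries instead of surrounding spaces, the first matching rule wins, and the fallback reply PLEASE GO ON is an assumption added so that every input gets some response.

```python
import re

# Rules in rank order; each pairs a pattern with a response template.
RULES = [
    (r".*\bI'M (DEPRESSED|SAD)\b.*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".*\bI AM (DEPRESSED|SAD)\b.*", r"WHY DO YOU THINK YOU ARE \1"),
    (r".*\bALL\b.*", "IN WHAT WAY"),
    (r".*\bALWAYS\b.*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def respond(utterance, default="PLEASE GO ON"):
    """Return the response of the highest-ranked rule that matches."""
    text = utterance.upper()
    for pattern, template in RULES:
        match = re.fullmatch(pattern, text)
        if match:
            # expand() fills \1 with the captured word, as s/.../.../ would.
            return match.expand(template)
    return default

print(respond("Men are all alike."))
print(respond("They're always bugging us about something or other."))
print(respond("He says I'm depressed much of the time."))
```

The sketch expects a plain ASCII apostrophe in I'm; handling typographic apostrophes would need an extra normalization step.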
These lookahead assertions make use of the (? syntax that we saw in the previous
section for non-capturing groups. The operator (?= pattern) is true if pattern
occurs, but is zero-width, i.e. the match pointer doesn’t advance. The operator
(?! pattern) only returns true if a pattern does not match, but again is zero-width
and doesn’t advance the cursor. Negative lookahead is commonly used when we
are parsing some complex pattern but want to rule out a special case. For example
suppose we want to match, at the beginning of a line, any single word that doesn’t
start with “Volcano”. We can use negative lookahead to do this:
/^(?!Volcano)[A-Za-z]+/
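In Python the negative lookahead is written with the start-of-line anchor outside the parentheses, as in the sketch below. The helper name first_word and the sample lines are invented for the example.

```python
import re

# ^ anchors at the line start; (?!Volcano) checks ahead without consuming text.
pattern = re.compile(r"^(?!Volcano)[A-Za-z]+")

def first_word(line):
    """Return the first word of line unless it starts with Volcano."""
    match = pattern.match(line)
    return match.group() if match else None

print(first_word("Mountain ranges are long"))
print(first_word("Volcanoes erupt"))
print(first_word("Volcanic ash"))
```

Volcanic is still accepted, because the lookahead only rules out words whose first seven letters are exactly Volcano.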
How about inflected forms like cats versus cat? These two words have the same
lemma cat but are different wordforms. A lemma is a set of lexical forms having
the same stem, the same major part-of-speech, and the same word sense. The wordform
is the full inflected or derived form of the word. For morphologically complex
languages like Arabic, we often need to deal with lemmatization. For many tasks in
English, however, wordforms are sufficient.
How many words are there in English? To answer this question we need to
distinguish two ways of talking about words. Types are the number of distinct words
in a corpus; if the set of words in the vocabulary is V, the number of types is the
vocabulary size |V|. Tokens are the total number N of running words. If we ignore
punctuation, the following Brown sentence has 16 tokens and 14 types:
They picnicked by the pool, then lay back on the grass and looked at the stars.
When we speak about the number of words in the language, we are generally
referring to word types.
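The token and type counts for the Brown sentence can be verified with a short Python sketch. Treating every maximal run of letters as a word is a simplifying assumption that happens to work for this sentence.

```python
import re

sentence = ("They picnicked by the pool, then lay back on the grass "
            "and looked at the stars.")

# Ignore punctuation by keeping only runs of letters.
tokens = re.findall(r"[A-Za-z]+", sentence)
types = set(tokens)

print(len(tokens), len(types))
```

The two counts differ because the occurs three times: it contributes three tokens but only one type.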
Fig. 2.10 shows the rough numbers of types and tokens computed from some
popular English corpora. The larger the corpora we look at, the more word types
we find, and in fact this relationship between the number of types |V| and number
of tokens N is called Herdan’s Law (Herdan, 1960) or Heaps’ Law (Heaps, 1978)
after its discoverers (in linguistics and information retrieval respectively). It is shown
in Eq. 2.1, where k and β are positive constants, and 0 < β < 1.
|V| = kN^{\beta}    (2.1)
The value of β depends on the corpus size and the genre, but at least for the
large corpora in Fig. 2.10, β ranges from .67 to .75. Roughly then we can say that
the vocabulary size for a text goes up significantly faster than the square root of its
length in words.
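To get a feel for Eq. 2.1, the sketch below evaluates it for two corpus sizes. The constants are assumptions chosen for illustration only: k = 40 is arbitrary, and β = 0.7 is simply a value inside the .67 to .75 range quoted above.

```python
def heaps_vocabulary_size(num_tokens, k=40.0, beta=0.7):
    """Predicted number of types |V| for a corpus of num_tokens tokens."""
    return k * num_tokens ** beta

# Quadrupling the corpus size multiplies |V| by 4 ** beta, whatever k is.
growth = heaps_vocabulary_size(4_000_000) / heaps_vocabulary_size(1_000_000)

print(round(growth, 2))
```

A square-root law would give a growth factor of exactly 2 when the corpus is quadrupled; with β = 0.7 the factor is larger, which is what the text means by growing faster than the square root.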
Another measure of the number of words in the language is the number of lem-
mas instead of wordform types. Dictionaries can help in giving lemma counts; dic-
tionary entries or boldface forms are a very rough upper bound on the number of
lemmas (since some lemmas have multiple boldface forms). The 1989 edition of the
Oxford English Dictionary had 615,000 entries.
14725 a
97 aaron
1 abaissiez
10 abandon
2 abandoned
2 abase
1 abash
14 abate
3 abated
3 abatement
...
Now we can sort again to find the frequent words. The -n option to sort means
to sort numerically rather than alphabetically, and the -r option means to sort in
reverse order (highest-to-lowest):
tr -sc 'A-Za-z' '\n' < sh.txt | tr A-Z a-z | sort | uniq -c | sort -n -r
The results show that the most frequent words in Shakespeare, as in any other
corpus, are the short function words like articles, pronouns, prepositions:
27378 the
26084 and
22538 i
19771 to
17481 of
14725 a
13826 you
12489 my
11318 that
11112 in
...
Unix tools of this sort can be very handy in building quick word count statistics
for any corpus.
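The same word count statistics can be computed in Python. The sketch below mirrors the steps of the pipeline (split on non-letters, lower-case, count, sort by frequency) using collections.Counter; the function name word_counts and the sample string are illustrative, and reading a file such as sh.txt is left to the caller.

```python
import re
from collections import Counter

def word_counts(text):
    """Count lower-cased words, where a word is a maximal run of letters."""
    words = re.findall(r"[A-Za-z]+", text.lower())
    return Counter(words)

counts = word_counts("The cat and the hat. THE END")

# most_common plays the role of sort -n -r on the uniq -c output.
for word, count in counts.most_common(3):
    print(count, word)
```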
on this; many continental European languages like Spanish, French, and German, by
contrast, use a comma to mark the decimal point, and spaces (or sometimes periods)
where English puts commas, for example, 555 500,50.
A tokenizer can also be used to expand clitic contractions that are marked by
apostrophes, for example, converting what’re to the two tokens what are, and
we’re to we are. A clitic is a part of a word that can’t stand on its own, and can only
occur when it is attached to another word. Some such contractions occur in other
alphabetic languages, including articles and pronouns in French (j’ai, l’homme).
Depending on the application, tokenization algorithms may also tokenize mul-
tiword expressions like New York or rock ’n’ roll as a single token, which re-
quires a multiword expression dictionary of some sort. Tokenization is thus inti-
mately tied up with named entity detection, the task of detecting names, dates, and
organizations (Chapter 20).
One commonly used tokenization standard is known as the Penn Treebank
tokenization standard, used for the parsed corpora (treebanks) released by the
Linguistic Data Consortium (LDC), the source of many useful datasets. This standard
separates out clitics (doesn’t becomes does plus n’t), keeps hyphenated words
together, and separates out all punctuation:
Input: “The San Francisco-based restaurant,” they said, “doesn’t charge $10”.
Output: “ The San Francisco-based restaurant , ” they said , “ does n’t charge $ 10 ” .
Tokens can also be normalized, in which a single normalized form is chosen for
words with multiple forms like USA and US or uh-huh and uhhuh. This standard-
ization may be valuable, despite the spelling information that is lost in the normal-
ization process. For information retrieval, we might want a query for US to match a
document that has USA; for information extraction we might want to extract coherent
information that is consistent across differently-spelled instances.
Case folding is another kind of normalization. For tasks like speech recognition
and information retrieval, everything is mapped to lower case. For sentiment anal-
ysis and other text classification tasks, information extraction, and machine transla-
tion, by contrast, case is quite helpful and case folding is generally not done (losing
the difference, for example, between US the country and us the pronoun can out-
weigh the advantage in generality that case folding provides).
In practice, since tokenization needs to be run before any other language pro-
cessing, it is important for it to be very fast. The standard method for tokeniza-
tion/normalization is therefore to use deterministic algorithms based on regular ex-
pressions compiled into very efficient finite state automata. Carefully designed de-
terministic algorithms can deal with the ambiguities that arise, such as the fact that
the apostrophe needs to be tokenized differently when used as a genitive marker (as
in the book’s cover), a quotative as in ‘The other class’, she said, or in clitics like
they’re. We’ll discuss this use of automata in Chapter 3.
markably well for segmenting Chinese, and often used as a baseline comparison for
more advanced methods, is a version of greedy search called maximum matching
or sometimes MaxMatch. The algorithm requires a dictionary (wordlist) of the
language.
The maximum matching algorithm starts by pointing at the beginning of a string.
It chooses the longest word in the dictionary that matches the input at the current
position. The pointer is then advanced to the end of that word in the string. If
no word matches, the pointer is instead advanced one character (creating a one-
character word). The algorithm is then iteratively applied again starting from the
new pointer position. Fig. 2.11 shows a version of the algorithm.
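function MaxMatch(sentence, dictionary D) returns word sequence W
  if sentence is empty
    return empty list
  for i ← length(sentence) downto 1
    firstword = first i chars of sentence
    remainder = rest of sentence
    if InDictionary(firstword, D)
      return list(firstword, MaxMatch(remainder, D))
  # no word was found, so make a one-character word
  firstword = first char of sentence
  remainder = rest of sentence
  return list(firstword, MaxMatch(remainder, D))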
MaxMatch works very well on Chinese; the following example shows an appli-
cation to a simple Chinese sentence using a simple Chinese lexicon available from
the Linguistic Data Consortium:
Input: 他特别喜欢北京烤鸭 “He especially likes Peking duck”
Output: 他 特别 喜欢 北京烤鸭
He especially likes Peking duck
MaxMatch doesn’t work as well on English. To make the intuition clear, we’ll
create an example by removing the spaces from the beginning of Turing’s famous
quote “We can only see a short distance ahead”, producing “wecanonlyseeashortdis-
tanceahead”. The MaxMatch results are shown below.
Input: wecanonlyseeashortdistanceahead
Output: we canon l y see ash ort distance ahead
On English the algorithm incorrectly chose canon instead of stopping at can,
which left the algorithm confused and having to create single-character words l and
y and use the very rare word ort.
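The English example can be reproduced with a short Python sketch of maximum matching. It is written iteratively rather than recursively, and the small word list is hypothetical, chosen only so that the run mirrors the example; a real segmenter would load a full dictionary.

```python
def max_match(sentence, dictionary):
    """Greedy left-to-right segmentation by longest dictionary match."""
    words = []
    start = 0
    while start < len(sentence):
        # Try the longest candidate first, then shorter and shorter ones.
        for end in range(len(sentence), start, -1):
            if sentence[start:end] in dictionary:
                break
        else:
            # No dictionary word starts here: emit a one-character word.
            end = start + 1
        words.append(sentence[start:end])
        start = end
    return words

# Hypothetical word list for illustration.
WORDS = {"we", "can", "canon", "only", "see", "a", "ash", "short", "ort",
         "distance", "ahead"}

segmented = max_match("wecanonlyseeashortdistanceahead", WORDS)
print(" ".join(segmented))
```

Because canon is longer than can, the greedy choice is made before the algorithm can know that only follows, which is exactly the failure described above.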
The algorithm works better in Chinese than English, because Chinese has much
shorter words than English. We can quantify how well a segmenter works using a
metric called word error rate. We compare our output segmentation with a perfect
hand-segmented (‘gold’) sentence, seeing how many words differ. The word error
rate is then the normalized minimum edit distance in words between our output and
the gold: the number of word insertions, deletions, and substitutions divided by the
length of the gold sentence in words; we’ll see in Section 2.4 how to compute edit
distance. Even in Chinese, however, MaxMatch has problems, for example dealing
with unknown words (words not in the dictionary) or genres that differ a lot from
the assumptions made by the dictionary builder.
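As a rough sketch of the metric, the function below computes word error rate for whitespace-separated segmentations, counting each word insertion, deletion, and substitution as one error. The function name is illustrative, and it assumes a non-empty gold sentence.

```python
def word_error_rate(hypothesis, gold):
    """Word-level edit distance divided by the length of the gold sentence."""
    hyp, ref = hypothesis.split(), gold.split()
    # prev[j] is the distance between the hypothesis words read so far
    # and the first j gold words.
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(prev[j] + 1,               # drop a hypothesis word
                            curr[j - 1] + 1,           # add a gold word
                            prev[j - 1] + (h != r)))   # substitute or match
        prev = curr
    return prev[-1] / len(ref)

wer = word_error_rate("we canon l y see ash ort distance ahead",
                      "we can only see a short distance ahead")
print(wer)
```

For the MaxMatch output on the Turing quote, five of the nine output words have to be changed to recover the eight gold words, giving a word error rate of 5/8.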
The most accurate Chinese segmentation algorithms generally use statistical se-
quence models trained via supervised machine learning on hand-segmented training
sets; we’ll introduce sequence models in Chapter 10.
This was not the map we found in Billy Bones’s chest, but
an accurate copy, complete in all things-names and heights
and soundings-with the single exception of the red crosses
and the written notes.
The algorithm is based on a series of rewrite rules run as a cascade, in
which the output of each pass is fed as input to the next pass; here is a sampling of
the rules:
ATIONAL → ATE (e.g., relational → relate)
ING → ε if stem contains vowel (e.g., motoring → motor)
SSES → SS (e.g., grasses → grass)
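The cascade idea can be illustrated with just the three sample rules. The sketch below is not the Porter stemmer, which has many more rules and conditions; the function name and the order in which the three rules are applied are assumptions made for the example.

```python
import re

VOWEL = re.compile(r"[aeiou]")

def apply_sample_rules(word):
    """Run the three sample rewrite rules in sequence on a lower-case word."""
    if word.endswith("sses"):
        word = word[:-2]                  # SSES -> SS
    if word.endswith("ing") and VOWEL.search(word[:-3]):
        word = word[:-3]                  # ING -> empty, if the stem has a vowel
    if word.endswith("ational"):
        word = word[:-7] + "ate"          # ATIONAL -> ATE
    return word

for w in ["relational", "motoring", "grasses", "sing"]:
    print(w, "->", apply_sample_rules(w))
```

The word sing is left alone because the stem that would remain, s, contains no vowel.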
Detailed rule lists for the Porter stemmer, as well as code (in Java, Python, etc.)
can be found on Martin Porter’s homepage; see also the original paper (Porter, 1980).
Simple stemmers can be useful in cases where we need to collapse across differ-
ent variants of the same lemma. Nonetheless, they do tend to commit errors of both
over- and under-generalizing, as shown in the table below (Krovetz, 1993):
Again, the fact that these two strings are very similar (differing by only one word)
seems like useful evidence for deciding that they might be coreferent.
Edit distance gives us a way to quantify both of these intuitions about string
similarity. More formally, the minimum edit distance between two strings is defined
as the minimum number of editing operations (operations like insertion, deletion,
substitution) needed to transform one string into another.
The gap between intention and execution, for example, is 5 (delete an i, substitute
e for n, substitute x for t, insert c, substitute u for n). It’s much easier to see
this by looking at the most important visualization for string distances, an alignment
between the two strings, shown in Fig. 2.12. Given two sequences, an alignment is
a correspondence between substrings of the two sequences. Thus, we say I aligns
with the empty string, N with E, and so on. Beneath the aligned strings is another
representation; a series of symbols expressing an operation list for converting the
top string into the bottom string: d for deletion, s for substitution, i for insertion.
I N T E * N T I O N
| | | | | | | | | |
* E X E C U T I O N
d s s   i s
Figure 2.12 Representing the minimum edit distance between two strings as an alignment.
The final row gives the operation list for converting the top string into the bottom string: d for
deletion, s for substitution, i for insertion.
We can also assign a particular cost or weight to each of these operations. The
Levenshtein distance between two sequences is the simplest weighting factor in
which each of the three operations has a cost of 1 (Levenshtein, 1966)—we assume
that the substitution of a letter for itself, for example, t for t, has zero cost. The Lev-
enshtein distance between intention and execution is 5. Levenshtein also proposed
an alternative version of his metric in which each insertion or deletion has a cost of
1 and substitutions are not allowed. (This is equivalent to allowing substitution, but
giving each substitution a cost of 2 since any substitution can be represented by one
insertion and one deletion). Using this version, the Levenshtein distance between
intention and execution is 8.
                    i n t e n t i o n
       del                 ins                subst
  n t e n t i o n   i n t e c n t i o n   i n x e n t i o n
Figure 2.13 Finding the edit distance viewed as a search problem
The space of all possible edits is enormous, so we can’t search naively. However,
lots of distinct edit paths will end up in the same state (string), so rather than
recomputing all those paths, we could just remember the shortest path to a state each time
we saw it. We can do this by using dynamic programming. Dynamic programming
is the name for a class of algorithms, first introduced by Bellman (1957), that apply
a table-driven method to solve problems by combining solutions to sub-problems.
Some of the most commonly used algorithms in natural language processing make
use of dynamic programming, such as the Viterbi and forward algorithms
(Chapter 9) and the CKY algorithm for parsing (Chapter 12).
The intuition of a dynamic programming problem is that a large problem can
be solved by properly combining the solutions to various sub-problems. Consider
the shortest path of transformed words that represents the minimum edit distance
between the strings intention and execution shown in Fig. 2.14.
i n t e n t i o n
delete i
n t e n t i o n
substitute n by e
e t e n t i o n
substitute t by x
e x e n t i o n
insert u
e x e n u t i o n
substitute n by c
e x e c u t i o n
Figure 2.14 Path from intention to execution.
Imagine some string (perhaps it is exention) that is in this optimal path (whatever
it is). The intuition of dynamic programming is that if exention is in the optimal
operation list, then the optimal sequence must also include the optimal path from
intention to exention. Why? If there were a shorter path from intention to exention,
then we could use it instead, resulting in a shorter overall path, and the optimal
sequence wouldn’t be optimal, thus leading to a contradiction.
The minimum edit distance algorithm was named by Wagner and Fischer (1974)
but independently discovered by many people (summarized later, in the Historical
Notes section of Chapter 9).
Let’s first define the minimum edit distance between two strings. Given two
strings, the source string X of length n, and target string Y of length m, we’ll define
D(i, j) as the edit distance between X[1..i] and Y[1..j], i.e., the first i characters of X
and the first j characters of Y. The edit distance between X and Y is thus D(n, m).
We’ll use dynamic programming to compute D(n, m) bottom up, combining so-
lutions to subproblems. In the base case, with a source substring of length i but an
empty target string, going from i characters to 0 requires i deletes. With a target
substring of length j but an empty source going from 0 characters to j characters
requires j inserts. Having computed D(i, j) for small i, j we then compute larger
D(i, j) based on previously computed smaller values. The value of D(i, j) is com-
puted by taking the minimum of the three possible paths through the matrix which
arrive there:
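\[
D[i,j] = \min
\begin{cases}
D[i-1,j] + \text{del-cost}(source[i]) \\
D[i,j-1] + \text{ins-cost}(target[j]) \\
D[i-1,j-1] + \text{sub-cost}(source[i], target[j])
\end{cases}
\]
If we assume the version of Levenshtein distance in which insertions and deletions
each have a cost of 1 and substitutions have a cost of 2 (except that substitution of
identical letters has zero cost), the computation for D(i, j) becomes:
\[
D[i,j] = \min
\begin{cases}
D[i-1,j] + 1 \\
D[i,j-1] + 1 \\
D[i-1,j-1] +
  \begin{cases}
  2 & \text{if } source[i] \neq target[j] \\
  0 & \text{if } source[i] = target[j]
  \end{cases}
\end{cases}
\tag{2.2}
\]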
The algorithm is summarized in Fig. 2.15; Fig. 2.16 shows the results of applying
the algorithm to the distance between intention and execution with the version of
Levenshtein in Eq. 2.2.
function MIN-EDIT-DISTANCE(source, target) returns min-distance
  n ← LENGTH(source)
  m ← LENGTH(target)
  Create a distance matrix D[n+1,m+1]
  # Initialization: the zeroth row and column is the distance from the empty string
  D[0,0] = 0
  for each row i from 1 to n do
    D[i,0] ← D[i−1,0] + del-cost(source[i])
  for each column j from 1 to m do
    D[0,j] ← D[0,j−1] + ins-cost(target[j])
  # Recurrence relation:
  for each row i from 1 to n do
    for each column j from 1 to m do
      D[i,j] ← MIN( D[i−1,j] + del-cost(source[i]),
                    D[i−1,j−1] + sub-cost(source[i], target[j]),
                    D[i,j−1] + ins-cost(target[j]))
  # Termination
  return D[n,m]
Figure 2.15 The minimum edit distance algorithm, an example of the class of dynamic
programming algorithms. The various costs can either be fixed (e.g., ∀x, ins-cost(x) = 1)
or can be specific to the letter (to model the fact that some letters are more likely to be in-
serted than others). We assume that there is no cost for substituting a letter for itself (i.e.,
sub-cost(x, x) = 0).
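The pseudocode of Fig. 2.15 translates almost line for line into Python. The sketch below makes one simplifying assumption: each cost is a fixed number passed as a parameter rather than a function of the letter. The function and parameter names are illustrative.

```python
def min_edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=2):
    """Minimum edit distance by dynamic programming with fixed costs."""
    n, m = len(source), len(target)
    # D[i][j] is the distance between source[:i] and target[:j].
    D = [[0] * (m + 1) for _ in range(n + 1)]

    # Initialization: distances from and to the empty string.
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + del_cost
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins_cost

    # Recurrence relation.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            D[i][j] = min(
                D[i - 1][j] + del_cost,
                D[i - 1][j - 1] + (0 if same else sub_cost),
                D[i][j - 1] + ins_cost,
            )
    return D[n][m]

print(min_edit_distance("intention", "execution"))               # substitutions cost 2
print(min_edit_distance("intention", "execution", sub_cost=1))   # all operations cost 1
```

With substitutions costing 2 the result is 8, and with every operation costing 1 it is 5, the two Levenshtein values given earlier for intention and execution.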
Knowing the minimum edit distance is useful for algorithms like finding poten-
tial spelling error corrections. But the edit distance algorithm is important in another
way; with a small change, it can also provide the minimum cost alignment between
two strings. Aligning two strings is useful throughout speech and language process-
ing. In speech recognition, minimum edit distance alignment is used to compute
the word error rate (Chapter 31). Alignment plays a role in machine translation, in
which sentences in a parallel corpus (a corpus with a text in two languages) need to
be matched to each other.
To extend the edit distance algorithm to produce an alignment, we can start by
visualizing an alignment as a path through the edit distance matrix. Figure 2.17
shows this path with the boldfaced cell. Each boldfaced cell represents an alignment
of a pair of letters in the two strings. If two boldfaced cells occur in the same row,
Src\Tar   #   e   x   e   c   u   t   i   o   n
#         0   1   2   3   4   5   6   7   8   9
i         1   2   3   4   5   6   7   6   7   8
n         2   3   4   5   6   7   8   7   8   7
t         3   4   5   6   7   8   7   8   9   8
e         4   3   4   5   6   7   8   9  10   9
n         5   4   5   6   7   8   9  10  11  10
t         6   5   6   7   8   9   8   9  10  11
i         7   6   7   8   9  10   9   8   9  10
o         8   7   8   9  10  11  10   9   8   9
n         9   8   9  10  11  12  11  10   9   8
Figure 2.16 Computation of minimum edit distance between intention and execution with
the algorithm of Fig. 2.15, using Levenshtein distance with cost of 1 for insertions or dele-
tions, 2 for substitutions.
there will be an insertion in going from the source to the target; two boldfaced cells
in the same column indicate a deletion.
Figure 2.17 also shows the intuition of how to compute this alignment path. The
computation proceeds in two steps. In the first step, we augment the minimum edit
distance algorithm to store backpointers in each cell. The backpointer from a cell
points to the previous cell (or cells) that we came from in entering the current cell.
We’ve shown a schematic of these backpointers in Fig. 2.17, after a similar diagram
in Gusfield (1997). Some cells have multiple backpointers because the minimum
extension could have come from multiple previous cells. In the second step, we
perform a backtrace. In a backtrace, we start from the last cell (at the final row and
column), and follow the pointers back through the dynamic programming matrix.
Each complete path between the final cell and the initial cell is a minimum distance
alignment. Exercise 2.7 asks you to modify the minimum edit distance algorithm to
store the pointers and compute the backtrace to output an alignment.
# e x e c u t i o n
# 0 1 2 3 4 5 6 7 8 9
i 1 -←↑ 2 -←↑ 3 -←↑ 4 -←↑ 5 -←↑ 6 -←↑ 7 -6 ←7 ←8
n 2 -←↑ 3 -←↑ 4 -←↑ 5 -←↑ 6 -←↑ 7 -←↑ 8 ↑7 -←↑ 8 -7
t 3 -←↑ 4 -←↑ 5 -←↑ 6 -←↑ 7 -←↑ 8 -7 ←↑ 8 -←↑ 9 ↑8
e 4 -3 ←4 -← 5 ←6 ←7 ←↑ 8 -←↑ 9 -←↑ 10 ↑9
n 5 ↑4 -←↑ 5 -←↑ 6 -←↑ 7 -←↑ 8 -←↑ 9 -←↑ 10 -←↑ 11 -↑ 10
t 6 ↑5 -←↑ 6 -←↑ 7 -←↑ 8 -←↑ 9 -8 ←9 ← 10 ←↑ 11
i 7 ↑6 -←↑ 7 -←↑ 8 -←↑ 9 -←↑ 10 ↑9 -8 ←9 ← 10
o 8 ↑7 -←↑ 8 -←↑ 9 -←↑ 10 -←↑ 11 ↑ 10 ↑9 -8 ←9
n 9 ↑8 -←↑ 9 -←↑ 10 -←↑ 11 -←↑ 12 ↑ 11 ↑ 10 ↑9 -8
Figure 2.17 When entering a value in each cell, we mark which of the three neighboring
cells we came from with up to three arrows. After the table is full we compute an alignment
(minimum edit path) by using a backtrace, starting at the 8 in the lower-right corner and
following the arrows back. The sequence of bold cells represents one possible minimum cost
alignment between the two strings.
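The two steps described above (storing backpointers, then following them back) can be sketched as below. This is a minimal illustration under stated assumptions rather than a reference solution: the function name align, the move labels, and the "*" gap symbol are our own, and the sketch keeps only one backpointer per cell (preferring the diagonal on ties), so it recovers a single minimum cost alignment instead of all of them.

```python
def align(source, target, gap="*"):
    """Return (distance, alignment) using Levenshtein costs 1/1/2.

    The alignment is a list of (source_letter, target_letter) pairs;
    the gap symbol marks an insertion or a deletion.
    """
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]

    # Step 1: fill the matrix, recording which neighbor each cell came from.
    for i in range(1, n + 1):
        D[i][0], back[i][0] = i, "up"        # deletion
    for j in range(1, m + 1):
        D[0][j], back[0][j] = j, "left"      # insertion
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 2
            candidates = [
                (D[i - 1][j - 1] + sub, "diag"),  # substitution or match
                (D[i - 1][j] + 1, "up"),          # deletion
                (D[i][j - 1] + 1, "left"),        # insertion
            ]
            # min returns the first minimal candidate, so ties go to "diag".
            D[i][j], back[i][j] = min(candidates, key=lambda c: c[0])

    # Step 2: backtrace from the final cell to the initial cell.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "diag":
            pairs.append((source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif move == "up":
            pairs.append((source[i - 1], gap))
            i -= 1
        else:
            pairs.append((gap, target[j - 1]))
            j -= 1
    pairs.reverse()
    return D[n][m], pairs
```

To enumerate every minimum cost alignment, each cell would need to store all of its tied backpointers, as in the schematic of Fig. 2.17.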
While we worked our example with simple Levenshtein distance, the algorithm
in Fig. 2.15 allows arbitrary weights on the operations. For spelling correction, for
example, substitutions are more likely to happen between letters that are next to
each other on the keyboard. We’ll discuss how these weights can be estimated in
Ch. 5. The Viterbi algorithm is an extension of minimum edit distance that uses
probabilistic definitions of the operations. Instead of computing the “minimum
edit distance” between two strings, Viterbi computes the “maximum probability
alignment” of one string with another. We’ll discuss this more in Chapter 9.
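To make the idea of operation-specific weights concrete, here is a small sketch in which substitutions between keyboard neighbors are cheaper than other substitutions. Everything specific in it is an assumption for illustration: the ADJACENT_PAIRS set is a tiny hand-picked sample rather than a full keyboard layout, the cost values 1 and 2 are arbitrary, and the names keyboard_sub_cost and weighted_distance are hypothetical. Real weights would be estimated from data.

```python
# Hypothetical sample of neighboring keys; not a complete keyboard model.
ADJACENT_PAIRS = {frozenset(p) for p in ("as", "sd", "qw", "we", "er", "io", "op", "nm")}


def keyboard_sub_cost(a, b):
    """Illustrative substitution cost: 0 for a match, 1 for neighboring
    keys in ADJACENT_PAIRS, 2 otherwise (assumed values)."""
    if a == b:
        return 0
    return 1 if frozenset((a, b)) in ADJACENT_PAIRS else 2


def weighted_distance(source, target, sub_cost, indel_cost=1):
    """Minimum edit distance with a pluggable substitution cost.

    Only the previous row of the matrix is kept, which is enough when
    we need the distance but not the alignment.
    """
    prev = [j * indel_cost for j in range(len(target) + 1)]
    for i, s in enumerate(source, start=1):
        curr = [i * indel_cost]
        for j, t in enumerate(target, start=1):
            curr.append(min(
                prev[j] + indel_cost,         # deletion
                prev[j - 1] + sub_cost(s, t),  # substitution or match
                curr[j - 1] + indel_cost,     # insertion
            ))
        prev = curr
    return prev[-1]
```

Under these assumed weights, typing cst for cat (a and s are listed as neighbors) costs less than typing cpt for cat, so a spelling corrector using them would rank cat as a closer candidate for cst than for cpt.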
2.5 Summary
This chapter introduced a fundamental tool in language processing, the regular ex-
pression, and showed how to perform basic text normalization tasks including
word segmentation and normalization, sentence segmentation, and stemming.
We also introduced the important minimum edit distance algorithm for comparing
strings. Here’s a summary of the main points we covered about these ideas:
• The regular expression language is a powerful tool for pattern-matching.
• Basic operations in regular expressions include concatenation of symbols,
disjunction of symbols ([], |, and .), counters (*, +, and {n,m}), anchors
(^, $) and precedence operators ((,)).
• Word tokenization and normalization are generally done by cascades of
simple regular expression substitutions or finite automata.
• The Porter algorithm is a simple and efficient way to do stemming, stripping
off affixes. It does not have high accuracy but may be useful for some tasks.
• The minimum edit distance between two strings is the minimum number of
operations it takes to edit one into the other. Minimum edit distance can be
computed by dynamic programming, which also results in an alignment of
the two strings.
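The regular expression operators listed above can be tried out with Python's standard re module, as in the short sketch below. The function name regex_operator_examples and the sample strings are our own. Note one Python-specific detail: re.findall returns the contents of a capturing group when a pattern has one, so the grouping example uses the non-capturing form (?:...) to get the whole match back.

```python
import re


def regex_operator_examples():
    """Return findall results illustrating each operator family."""
    return {
        # Disjunction: brackets, pipe, and the period wildcard.
        "brackets": re.findall(r"[Cc]at", "the cat saw the Cat"),
        "pipe": re.findall(r"cat|dog", "cat dog bird"),
        "period": re.findall(r"c.t", "cat cot cut"),
        # Counters: one or more, zero or more, and a bounded range.
        "plus": re.findall(r"ba+", "b ba baa"),
        "star": re.findall(r"ba*", "b ba baa"),
        "range": re.findall(r"a{2,3}", "a aa aaa"),
        # Anchors: start and end of the string.
        "caret": re.findall(r"^The", "The end of The story"),
        "dollar": re.findall(r"story$", "story after story"),
        # Precedence: parentheses make + apply to "ab", not just "b".
        "grouping": re.findall(r"(?:ab)+", "ab abab abb"),
    }


if __name__ == "__main__":
    for name, matches in regex_operator_examples().items():
        print(name, matches)
```

Without the parentheses, the pattern ab+ would instead match an a followed by one or more b characters, which is why grouping is needed to repeat a whole sequence.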
For more on Herdan’s law and Heaps’ Law, see Herdan (1960, p. 28), Heaps
(1978), Egghe (2007) and Baayen (2001); Yasseri et al. (2012) discuss the relation-
ship with other measures of linguistic complexity. For more on edit distance, see the
excellent Gusfield (1997). Our example measuring the edit distance from ‘intention’
to ‘execution’ was adapted from Kruskal (1983). There are various publicly avail-
able packages to compute edit distance, including Unix diff and the NIST sclite
program (NIST, 2005).
In his autobiography Bellman (1984) explains how he originally came up with
the term dynamic programming:
“...The 1950s were not good years for mathematical research. [the]
Secretary of Defense ...had a pathological fear and hatred of the word,
research... I decided therefore to use the word, “programming”. I
wanted to get across the idea that this was dynamic, this was multi-
stage... I thought, let’s ... take a word that has an absolutely precise
meaning, namely dynamic... it’s impossible to use the word, dynamic,
in a pejorative sense. Try thinking of some combination that will pos-
sibly give it a pejorative meaning. It’s impossible. Thus, I thought
dynamic programming was a good name. It was something not even a
Congressman could object to.”
Exercises
2.1 Write regular expressions for the following languages.
1. the set of all alphabetic strings;
2. the set of all lower case alphabetic strings ending in a b;
3. the set of all strings from the alphabet a, b such that each a is immedi-
ately preceded by and immediately followed by a b;
2.2 Write regular expressions for the following languages. By “word”, we mean
an alphabetic string separated from other words by whitespace, any relevant
punctuation, line breaks, and so forth.
1. the set of all strings with two consecutive repeated words (e.g., “Hum-
bert Humbert” and “the the” but not “the bug” or “the big bug”);
2. all strings that start at the beginning of the line with an integer and that
end at the end of the line with a word;
3. all strings that have both the word grotto and the word raven in them
(but not, e.g., words like grottos that merely contain the word grotto);
4. write a pattern that places the first word of an English sentence in a
register. Deal with punctuation.
2.3 Implement an ELIZA-like program, using substitutions such as those described
on page 18. You may choose a different domain than a Rogerian psychologist,
if you wish, although keep in mind that you would need a domain in which
your program can legitimately engage in a lot of simple repetition.
2.4 Compute the edit distance (using insertion cost 1, deletion cost 1, substitution
cost 1) of “leda” to “deal”. Show your work (using the edit distance grid).
2.5 Figure out whether drive is closer to brief or to divers and what the edit dis-
tance is to each. You may use any version of distance that you like.
2.6 Now implement a minimum edit distance algorithm and use your hand-computed
results to check your code.
2.7 Augment the minimum edit distance algorithm to output an alignment; you
will need to store pointers and add a stage to compute the backtrace.
2.8 Implement the MaxMatch algorithm.
2.9 To test how well your MaxMatch algorithm works, create a test set by remov-
ing spaces from a set of sentences. Implement the Word Error Rate metric (the
number of word insertions + deletions + substitutions, divided by the length
in words of the correct string) and compute the WER for your test set.
CHAPTER
Being able to predict the future is not always a good thing. Cassandra of Troy had
the gift of foreseeing but was cursed by Apollo that her predictions would never be
believed. Her warnings of the destruction of Troy were ignored and to simplify, let’s
just say that things just didn’t go well for her later.
In this chapter we take up the somewhat less fraught topic of predicting words.
What word, for example, is likely to follow
Please turn your homework ...
Hopefully, most of you concluded that a very likely word is in, or possibly over,
but probably not refrigerator or the. In the following sections we will formalize
this intuition by introducing models that assign a probability to each possible next
word. The same models will also serve to assign a probability to an entire sentence.
Such a model, for example, could predict that the following sequence has a much
higher probability of appearing in a text:
all of a sudden I notice three guys standing on the sidewalk
Why would you want to predict upcoming words, or assign probabilities to sen-
tences? Probabilities are essential in any task in which we have to identify words
in noisy, ambiguous input, like speech recognition or handwriting recognition. In
the movie Take the Money and Run, Woody Allen tries to rob a bank with a sloppily
written hold-up note that the teller incorrectly reads as “I have a gub”. As Rus-
sell and Norvig (2002) point out, a language processing system could avoid making
this mistake by using the knowledge that the sequence “I have a gun” is far more
probable than the non-word “I have a gub” or even “I have a gull”.
In spelling correction, we need to find and correct spelling errors like Their
are two midterms in this class, in which There was mistyped as Their. A sentence
starting with the phrase There are will be much more probable than one starting with
Their are, allowing a spellchecker to both detect and correct these errors.
Assigning probabilities to sequences of words is also essential in machine trans-
lation. Suppose we are translating a Chinese source sentence:
他 向 记者 介绍了 主要 内容
He to reporters introduced main content
“To bestow your attention in company, upon trifling singularities in
the dress, person, or manners of others, is spending your time to
little purpose. From such a practice you can derive neither pleasure
nor profit; but must unavoidably subject yourselves to the
imputation of incivility and malice.”
Thursday, P. M.
AMUSEMENTS.
“Amusement is impatiently desired, and eagerly sought by young
ladies in general. Forgetful that the noblest entertainment arises
from a placid and well cultivated mind, too many fly from
themselves, from thought and reflection, to fashionable dissipation,
or what they call pleasure, as a mean of beguiling the hours which
solitude and retirement render insupportably tedious.
“An extravagant fondness for company and public resorts is
incompatible with those domestic duties, the faithful discharge of
which ought to be the prevailing object of the sex. In the indulgence
of this disposition, the mind is enervated, and the manners
corrupted, till all relish for those enjoyments, which being simple and
natural, are best calculated to promote health, innocence, and social
delight, is totally lost.
“It is by no means amiss for youth to seek relaxation from severer
cares and labors, in a participation of diversions, suited to their age,
sex, and station in life. But there is great danger of their lively
imaginations’ hurrying them into excess, and detaching their
affections from the ennobling acquisitions of moral improvement,
and refined delicacy. Guard, then against those amusements which
have the least tendency to sully the purity of your minds.
“Loose and immoral books; company, whose manners are licentious,
however gay and fashionable; conversation which is even tinctured
with profaneness or obscenity; plays in which the representation is
immodest, and offensive to the ear of chastity; indeed, pastimes of
every description, from which no advantage can be derived, should
not be countenanced; much less applauded. Why should those
things afford apparent satisfaction in a crowd which would call forth
the blush of indignation in more private circles? This question is
worthy the serious attention of those ladies, who at the theatre, can
hardly restrain their approbation of expressions and actions, which
at their houses, would be intolerably rude and indecent, in their
most familiar friends!
“Cards are so much the taste of the present day, that to caution my
pupils against the too frequent use of them may be thought old
fashioned in the extreme. I believe it, however, to be a fascinating
game, which occupies the time, without yielding any kind of pleasure
or profit. As the satirist humorously observes,
“The love of gaming is the worst of ills;
With ceaseless storms the blacken’d soul it fills;
Inveighs at Heaven, neglects the ties of blood;
Destroys the power and will of doing good;
Kills health, pawns honor, plunges in disgrace;
And, what is still more dreadful—spoils your face.”
Friday, A. M.
FILIAL AND FRATERNAL AFFECTION.
“The filial and fraternal are the first duties of a single state. The
obligations you are under to your parents cannot be discharged, but
by a uniform and cheerful obedience; an unreserved and ready
compliance with their wishes, added to the most diligent attention to
their ease and happiness. The virtuous and affectionate behaviour of
children is the best compensation, in their power, for that unwearied
care and solicitude which parents, only, know. Upon daughters,
whose situation and employments lead them more frequently into
scenes of domestic tenderness; who are often called to smooth the
pillow of sick and aged parents, and to administer with a skilful and
delicate hand the cordial, restorative to decaying nature, an
endearing sensibility, and a dutiful acquiescence in the dispositions,
and even peculiarities of those from whom they have derived
existence, are indispensably incumbent.
“Such a conduct will yield a satisfaction of mind more than
equivalent to any little sacrifices of inclination or humour which may
be required at your hands.
“Pope, among all his admired poetry, has not six lines more
beautifully expressive than the following:
“Me, let the pious office long engage,
To rock the cradle of declining age;
With lenient arts extend a mother’s breath,
Make languor smile, and smooth the bed of death;
Explore the thought, explain the asking eye,
And keep awhile one parent from the sky!”
Friday, P. M.
FRIENDSHIP.
“Friendship is a term much insisted on by young people; but, like
many others more frequently used than understood. A friend, with
girls in general, is an intimate acquaintance, whose taste and
pleasures are similar to their own; who will encourage, or at least
connive at their foibles and faults, and communicate with them
every secret; in particular those of love and gallantry, in which those
of the other sex are concerned. By such friends their errors and
stratagems are flattered and concealed, while the prudent advice of
real friendship is neglected, till they find too late, how fictitious a
character, and how vain a dependence they have chosen.
“Augusta and Serena were educated at the same school, resided in
the same neighborhood, and were equally volatile in their tempers,
and dissipated in their manners. Hence every plan of amusement
was concerted and enjoyed together. At the play, the ball, the card-
table and every other party of pleasure, they were companions.
“Their parents saw that this intimacy strengthened the follies of
each; and strove to disengage their affections, that they might turn
their attention to more rational entertainments, and more judicious
advisers. But they gloried in their friendship, and thought it a
substitute for every other virtue. They were the dupes of adulation,
and the votaries of coquetry.
“The attentions of a libertine, instead of putting them on their guard
against encroachments, induced them to triumph in their fancied
conquests, and to boast of resolution sufficient to shield them from
delusion.
“Love, however, which with such dispositions, is the pretty play-thing
of imagination, assailed the tender heart of Serena. A gay youth,
with more wit than sense, more show than substance, more art than
honesty, took advantage of her weakness to ingratiate himself into
her favour, and persuade her they could not live without each other.
Augusta was the confident of Serena. She fanned the flame, and
encouraged her resolution of promoting her own felicity, though at
the expense of every other duty. Her parents suspected her amour,
remonstrated against the man, and forbade her forming any
connexion with him, on pain of their displeasure. She apparently
acquiesced; but flew to Augusta for counsel and relief. Augusta
soothed her anxiety, and promised to assist her in the
accomplishment of all her wishes. She accordingly contrived means
for a clandestine intercourse, both personal and epistolary.
“Aristus was a foreigner, and avowed his purpose of returning to his
native country, urging her to accompany him. Serena had a fortune,
independent of her parents, left her by a deceased relation. This,
with her hand, she consented to give to her lover, and to quit a
country, in which she acknowledged but one friend. Augusta praised
her fortitude, and favored her design. She accordingly eloped, and
embarked. Her parents were almost distracted by her imprudent and
undutiful conduct, and their resentment fell on Augusta, who had
acted contrary to all the dictates of integrity and friendship, in
contributing to her ruin; for ruin it proved. Her ungrateful paramour,
having rioted on the property which she bestowed, abandoned her
to want and despair. She wrote to her parents, but received no
answer. She represented her case to Augusta, and implored relief
from her friendship; but Augusta alleged that she had already
incurred the displeasure of her family on her account and chose not
again to subject herself to censure by the same means.
“Serena at length returned to her native shore, and applied in
person to Augusta, who coolly told her that she wished no
intercourse with a vagabond, and then retired. Her parents refused
to receive her into their house; but from motives of compassion and
charity, granted her a small annuity, barely sufficient to keep her and
her infant from want.
“Too late she discovered her mistaken notions of friendship; and
learned by sad experience, that virtue must be its foundation, or
sincerity and constancy can never be its reward.
“Sincerity and constancy are essential ingredients in virtuous
friendship. It invariably seeks the permanent good of its object; and
in so doing, will advise, caution and reprove, with all the frankness
of undissembled affection. In the interchange of genuine friendship,
flattery is utterly excluded. Yet, even in the most intimate
connexions of this kind, a proper degree of respect, attention and
politeness must be observed. You are not so far to presume on the
partiality of friendship, as to hazard giving offence, and wounding
the feelings of persons, merely because you think their regard for
you will plead your excuse, and procure your pardon. Equally
cautious should you be, of taking umbrage at circumstances which
are undesignedly offensive.
“Hear the excellent advice of the wise son of Sirach, upon this
subject:
“Admonish thy friend; it may be he hath not done it; and if he have
done it, that he do it no more. Admonish thy friend; it may be he
hath not said it; and if he have, that he speak it not again. Admonish
thy friend; for many times it is a slander; and believe not every tale.
There is one that slippeth in his speech, but not from his heart; and
who is he that offendeth not with his tongue?”
“Be not hasty in forming friendships; but deliberately examine the
principles, disposition, temper and manners, of the person you wish
to sustain this important character. Be well assured that they are
agreeable to your own, and such as merit your entire esteem and
confidence, before you denominate her your friend. You may have
many general acquaintances, with whom you are pleased and
entertained; but in the chain of friendship there is a still closer link.
“Reserve will wound it, and distrust destroy,
Deliberate on all things with thy friend:
But since friends grow not thick on every bough
Nor ev’ry friend unrotten at the core,
First on thy friend, deliberate with thyself:
Pause, ponder, first: not eager in the choice,
Nor jealous of the chosen: fixing, fix:
Judge before friendship: then confide till death.”
“But if you would have friends, you must show yourselves friendly;
that is, you must be careful to act the part you wish from another. If
your friend have faults, mildly and tenderly represent them to her;
but conceal them as much as possible from the observation of the
world. Endeavor to convince her of her errors, to rectify her
mistakes, and to confirm and increase every virtuous sentiment.
“Should she so far deviate, as to endanger her reputation and
happiness; and should your admonitions fail to reclaim her, become
not, like Augusta, an abettor of her crimes. It is not the part of
friendship to hide transactions which will end in the ruin of your
friend. Rather acquaint those who ought to have the rule over her of
her intended missteps, and you will have discharged your duty; you
will merit, and very probably may afterwards receive her thanks.
“Narcissa and Florinda were united in the bonds of true and
generous friendship. Narcissa was called to spend a few months with
a relation in the metropolis, where she became acquainted with, and
attached to a man who was much her inferior; but whose specious
manners and appearance deceived her youthful heart, though her
reason and judgment informed her, that her parents would
disapprove the connexion. She returned home, the consciousness of
her fault, the frankness which she owed to her friend, and her
partiality to her lover, wrought powerfully upon her mind, and
rendered her melancholy. Florinda soon explored the cause, and
warmly remonstrated against her imprudence in holding a moment’s
intercourse with a man, whom she knew, would be displeasing to
her parents. She searched out his character, and found it far
inadequate to Narcissa’s merit. This she represented to her in its
true colours, and conjured her not to sacrifice her reputation, her
duty and her happiness, by encouraging his addresses; but to no
purpose were her expostulations. Narcissa avowed the design of
permitting him to solicit the consent of her parents, and the
determination of marrying him without it, if they refused.
“Florinda was alarmed at this resolution; and, with painful anxiety,
saw the danger of her friend. She told her plainly, that the regard
she had for her demanded a counteraction of her design; and that if
she found no other way of preventing its execution, she should
discharge her duty by informing her parents of her proceedings. This
Narcissa resented, and immediately withdrew her confidence and
familiarity; but the faithful Florinda neglected not the watchful
solicitude of friendship; and when she perceived that Narcissa’s
family were resolutely opposed to her projected match and that
Narcissa was preparing to put her rash purpose into execution, she
made known the plan which she had concerted and by that mean
prevented her destruction. Narcissa thought herself greatly injured,
and declared that she would never forgive so flagrant a breach of
fidelity. Florinda endeavoured to convince her of her good intentions,
and the real kindness of her motives; but she refused to hear the
voice of wisdom, till a separation from her lover, and a full proof of
his unworthiness opened her eyes to a sight of her own folly and
indiscretion, and to a lively sense of Florinda’s friendship, in saving
her from ruin without her consent. Her heart overflowed with
gratitude to her generous preserver. She acknowledged herself
indebted to Florinda’s benevolence, for deliverance from the baneful
impetuosity of her own passions. She sought and obtained
forgiveness; and ever after lived in the strictest amity with her
faithful benefactress.”
Saturday, A. M.
LOVE.
“The highest state of friendship which this life admits, is in the
conjugal relation. On this refined affection, love, which is but a more
interesting and tender kind of friendship, ought to be founded. The
same virtues, the same dispositions and qualities which are
necessary in a friend, are still more requisite in a companion for life.
And when these enlivening principles are united, they form the basis
of durable happiness. But let not the mask of friendship, or of love,
deceive you. You are now entering upon a new stage of action
where you will probably admire, and be admired. You may attract
the notice of many, who will select you as objects of adulation, to
discover their taste and gallantry; and perhaps of some whose
affections you have really and seriously engaged. The first class your
penetration will enable you to detect; and your good sense and
virtue will lead you to treat them with the neglect they deserve. It is
disreputable for a young lady to receive and encourage the officious
attentions of those mere pleasure-hunters, who rove from fair to fair,
with no other design than the exercise of their art, addresses, and
intrigue. Nothing can render their company pleasing, but a vanity of
being caressed, and a false pride in being thought an object of
general admiration, with a fondness for flattery which bespeaks a
vitiated mind. But when you are addressed by a person of real merit,
who is worthy your esteem and may justly demand your respect, let
him be treated with honor, frankness and sincerity. It is the part of a
prude, to affect a shyness, reserve, and indifference, foreign to the
heart. Innocence and virtue will rise superior to such little arts, and
indulge no wish which needs disguise.
“Still more unworthy are the insidious and deluding wiles of the
coquette. How disgusting must this character appear to persons of
sentiment and integrity! how unbecoming the delicacy and dignity of
an uncorrupted female!
“As you are young and inexperienced, your affections may possibly
be involuntarily engaged, where prudence and duty forbid a
connexion. Beware, then how you admit the passion of love. In
young minds, it is of all others the most uncontrollable. When fancy
takes the reins, it compels its blinded votary to sacrifice reason,
discretion and conscience to its impetuous dictates. But a passion of
this origin tends not to substantial and durable happiness. To secure
this, it must be quite of another kind, enkindled by esteem, founded
on merit, strengthened by congenial dispositions and corresponding
virtues, and terminating in the most pure and refined affection.
“Never suffer your eyes to be charmed by the mere exterior; nor
delude yourselves with the notion of unconquerable love. The eye, in
this respect, is often deceptious, and fills the imagination with
charms which have no reality. Nip, in the bud, every particular liking,
much more all ideas of love, till called forth by unequivocal tokens as
well as professions of sincere regard. Even then, harbor them not
without a thorough knowledge of the temper, disposition and
circumstances of your lover, the advice of your friends; and, above
all the approbation of your parents. Maturely weigh every
consideration for and against, and deliberately determine with
yourselves, what will be most conducive to your welfare and felicity
in life. Let a rational and discreet plan of thinking and acting,
regulate your deportment, and render you deserving of the affection
you wish to insure. This you will find far more conducive to your
interest, than the indulgence of that romantic passion, which a blind
and misguided fancy paints in such alluring colors to the thoughtless
and inexperienced.
“Recollect the favourite air you so often sing:
“Ye fair, who would be blessed in love,
Take your pride a little lower:
Let the swain that you approve,
Rather like you than adore.”
Saturday, P. M.
RELIGION.
“Having given you my sentiments on a variety of subjects which
demand your particular attention, I come now to the closing and
most important theme; and that is religion. The virtuous education
you have received, and the good principles which have been instilled
into your minds from infancy, will render the enforcement of
Christian precepts and duties a pleasing lesson.
“Religion is to be considered as an essential and durable object; not
as the embellishment of a day; but an acquisition which shall endure
and increase through the endless ages of eternity.
“Lay the foundation of it in youth, and it will not forsake you in
advanced age; but furnish you with an adequate substitute for the
transient pleasures which will then desert you, and prove a source of
rational and refined delight: a refuge from the disappointments and
corroding cares of life, and from the depressions of adverse events.
“Remember now your creator, in the days of your youth, while the
evil days come not, nor the years draw nigh, when you shall say we
have no pleasure in them.” If you wish for permanent happiness,
cultivate the divine favour as your highest enjoyment in life, and
your safest retreat when death shall approach you.
“That even the young are not exempt from the arrest of this
universal conqueror, the tombstone of Amelia will tell you. Youth,
beauty, health and fortune, strewed the path of life with flowers, and
left her no wish ungratified. Love, with its gentlest and purest flame,
animated her heart, and was equally returned by Julius. Their
passion was approved by their parents and friends; the day was
fixed, and preparations were making for the celebration of their
nuptials. At this period Amelia was attacked by a violent cold, which
seating on her lungs, baffled the skill of the most eminent
physicians, and terminated in a confirmed hectic. She perceived her
disorder to be incurable, and with inexpressible regret and concern
anticipated her approaching dissolution. She had enjoyed life too
highly to think much of death; yet die she must! “Oh,” said she,
“that I had prepared, while in health and at ease, for this awful
event! Then should I not be subjected to the keenest distress of
mind, in addition to the most painful infirmities of body! Then should
I be able to look forward with hope, and to find relief in the
consoling expectation of being united beyond the grave, with those
dear and beloved connexions, which I must soon leave behind! Let
my companions and acquaintance learn from me the important
lesson of improving their time to the best of purposes; of acting at
once as becomes mortal and immortal creatures!”
“Hear, my dear pupils, the solemn admonition, and be ye also ready!
“Too many, especially of the young and gay, seem more anxious to
live in pleasure, than to answer the end of their being, by the
cultivation of that piety and virtue which will render them good
members of society, useful to their friends and associates, and
partakers of that heart-felt satisfaction which results from a
conscience void of offence both towards God and man.
“This, however, is an egregious mistake; for in many situations, piety
and virtue are our only source of consolation; and in all, they are
peculiarly friendly to our happiness.
“Do you exult in beauty, and the pride of external charms? Turn your
eyes for a moment, on the miserable Flirtilla.[1] Like her, your
features and complexion may be impaired by disease; and where
then will you find a refuge from mortification and discontent, if
destitute of those ennobling endowments which can raise you
superior to the transient graces of a fair form, if unadorned by that
substantial beauty of mind which can not only ensure respect from
those around you, but inspire you with resignation to the divine will,
and a patient acquiescence in the painful allotments of a holy
Providence. Does wealth await your command, and grandeur with its
fascinating appendages beguile your fleeting moments? Recollect,
that riches often make themselves wings and fly away. A single
instance of mismanagement; a consuming fire, with various other