Speech and Language Processing
An Introduction to Natural Language Processing,
Computational Linguistics, and Speech Recognition

Third Edition draft

Daniel Jurafsky
Stanford University

James H. Martin
University of Colorado at Boulder

Copyright © 2017

Draft of August 28, 2017. Comments and typos welcome!


Summary of Contents
1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2 Regular Expressions, Text Normalization, Edit Distance . . . . . . . . . 10
3 Finite State Transducers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4 Language Modeling with N-grams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5 Spelling Correction and the Noisy Channel . . . . . . . . . . . . . . . . . . . . . 61
6 Naive Bayes and Sentiment Classification . . . . . . . . . . . . . . . . . . . . . . . 74
7 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
8 Neural Networks and Neural Language Models . . . . . . . . . . . . . . . . . 102
9 Hidden Markov Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
10 Part-of-Speech Tagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
11 Formal Grammars of English . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
12 Syntactic Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
13 Statistical Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
14 Dependency Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
15 Vector Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
16 Semantics with Dense Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
17 Computing with Word Senses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
18 Lexicons for Sentiment and Affect Extraction . . . . . . . . . . . . . . . . . . . 326
19 The Representation of Sentence Meaning . . . . . . . . . . . . . . . . . . . . . . . 346
20 Computational Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
21 Information Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
22 Semantic Role Labeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
23 Coreference Resolution and Entity Linking . . . . . . . . . . . . . . . . . . . . . 396
24 Discourse Coherence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
25 Machine Translation and Seq2Seq Models . . . . . . . . . . . . . . . . . . . . . . 398
26 Summarization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
27 Question Answering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
28 Dialog Systems and Chatbots. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
29 Advanced Dialog Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
30 Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
31 Speech Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 461
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485
Subject Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 493

Contents
1 Introduction 9

2 Regular Expressions, Text Normalization, Edit Distance 10


2.1 Regular Expressions . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Words and Corpora . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 Text Normalization . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Minimum Edit Distance . . . . . . . . . . . . . . . . . . . . . . . 26
2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 31
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3 Finite State Transducers 34

4 Language Modeling with N-grams 35


4.1 N-Grams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2 Evaluating Language Models . . . . . . . . . . . . . . . . . . . . 41
4.3 Generalization and Zeros . . . . . . . . . . . . . . . . . . . . . . 43
4.4 Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.5 Kneser-Ney Smoothing . . . . . . . . . . . . . . . . . . . . . . . 51
4.6 The Web and Stupid Backoff . . . . . . . . . . . . . . . . . . . . 53
4.7 Advanced: Perplexity’s Relation to Entropy . . . . . . . . . . . . 54
4.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 58
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

5 Spelling Correction and the Noisy Channel 61


5.1 The Noisy Channel Model . . . . . . . . . . . . . . . . . . . . . . 62
5.2 Real-word spelling errors . . . . . . . . . . . . . . . . . . . . . . 67
5.3 Noisy Channel Model: The State of the Art . . . . . . . . . . . . . 69
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 72
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

6 Naive Bayes and Sentiment Classification 74


6.1 Naive Bayes Classifiers . . . . . . . . . . . . . . . . . . . . . . . 76
6.2 Training the Naive Bayes Classifier . . . . . . . . . . . . . . . . . 78
6.3 Worked example . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.4 Optimizing for Sentiment Analysis . . . . . . . . . . . . . . . . . 81
6.5 Naive Bayes as a Language Model . . . . . . . . . . . . . . . . . 82
6.6 Evaluation: Precision, Recall, F-measure . . . . . . . . . . . . . . 83
6.7 More than two classes . . . . . . . . . . . . . . . . . . . . . . . . 85
6.8 Test sets and Cross-validation . . . . . . . . . . . . . . . . . . . . 86
6.9 Statistical Significance Testing . . . . . . . . . . . . . . . . . . . 87
6.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 89
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

7 Logistic Regression 92
7.1 Features in Multinomial Logistic Regression . . . . . . . . . . . . 93
7.2 Classification in Multinomial Logistic Regression . . . . . . . . . 95


7.3 Learning Logistic Regression . . . . . . . . . . . . . . . . . . . . 96


7.4 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
7.5 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . 99
7.6 Choosing a classifier and features . . . . . . . . . . . . . . . . . . 100
7.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 101
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

8 Neural Networks and Neural Language Models 102


8.1 Units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
8.2 The XOR problem . . . . . . . . . . . . . . . . . . . . . . . . . . 105
8.3 Feed-Forward Neural Networks . . . . . . . . . . . . . . . . . . . 108
8.4 Training Neural Nets . . . . . . . . . . . . . . . . . . . . . . . . 111
8.5 Neural Language Models . . . . . . . . . . . . . . . . . . . . . . 115
8.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 120

9 Hidden Markov Models 122


9.1 Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
9.2 The Hidden Markov Model . . . . . . . . . . . . . . . . . . . . . 124
9.3 Likelihood Computation: The Forward Algorithm . . . . . . . . . 127
9.4 Decoding: The Viterbi Algorithm . . . . . . . . . . . . . . . . . . 131
9.5 HMM Training: The Forward-Backward Algorithm . . . . . . . . 134
9.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 140
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141

10 Part-of-Speech Tagging 142


10.1 (Mostly) English Word Classes . . . . . . . . . . . . . . . . . . . 143
10.2 The Penn Treebank Part-of-Speech Tagset . . . . . . . . . . . . . 145
10.3 Part-of-Speech Tagging . . . . . . . . . . . . . . . . . . . . . . . 147
10.4 HMM Part-of-Speech Tagging . . . . . . . . . . . . . . . . . . . 149
10.5 Maximum Entropy Markov Models . . . . . . . . . . . . . . . . . 157
10.6 Bidirectionality . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
10.7 Part-of-Speech Tagging for Other Languages . . . . . . . . . . . . 162
10.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 164
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166

11 Formal Grammars of English 168


11.1 Constituency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
11.2 Context-Free Grammars . . . . . . . . . . . . . . . . . . . . . . . 169
11.3 Some Grammar Rules for English . . . . . . . . . . . . . . . . . . 174
11.4 Treebanks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
11.5 Grammar Equivalence and Normal Form . . . . . . . . . . . . . . 187
11.6 Lexicalized Grammars . . . . . . . . . . . . . . . . . . . . . . . . 188
11.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 194
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195

12 Syntactic Parsing 197


12.1 Ambiguity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197

12.2 CKY Parsing: A Dynamic Programming Approach . . . . . . . . 199


12.3 Partial Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
12.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 210
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211

13 Statistical Parsing 212


13.1 Probabilistic Context-Free Grammars . . . . . . . . . . . . . . . . 213
13.2 Probabilistic CKY Parsing of PCFGs . . . . . . . . . . . . . . . . 217
13.3 Ways to Learn PCFG Rule Probabilities . . . . . . . . . . . . . . 218
13.4 Problems with PCFGs . . . . . . . . . . . . . . . . . . . . . . . . 220
13.5 Improving PCFGs by Splitting Non-Terminals . . . . . . . . . . . 223
13.6 Probabilistic Lexicalized CFGs . . . . . . . . . . . . . . . . . . . 225
13.7 Probabilistic CCG Parsing . . . . . . . . . . . . . . . . . . . . . . 230
13.8 Evaluating Parsers . . . . . . . . . . . . . . . . . . . . . . . . . . 238
13.9 Human Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
13.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 242
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244

14 Dependency Parsing 245


14.1 Dependency Relations . . . . . . . . . . . . . . . . . . . . . . . . 246
14.2 Dependency Formalisms . . . . . . . . . . . . . . . . . . . . . . . 248
14.3 Dependency Treebanks . . . . . . . . . . . . . . . . . . . . . . . 249
14.4 Transition-Based Dependency Parsing . . . . . . . . . . . . . . . 250
14.5 Graph-Based Dependency Parsing . . . . . . . . . . . . . . . . . 261
14.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
14.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 268
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269

15 Vector Semantics 270


15.1 Words and Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . 271
15.2 Weighing terms: Pointwise Mutual Information (PMI) . . . . . . . 275
15.3 Measuring similarity: the cosine . . . . . . . . . . . . . . . . . . 279
15.4 Using syntax to define a word’s context . . . . . . . . . . . . . . . 282
15.5 Evaluating Vector Models . . . . . . . . . . . . . . . . . . . . . . 283
15.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 284
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285

16 Semantics with Dense Vectors 286


16.1 Dense Vectors via SVD . . . . . . . . . . . . . . . . . . . . . . . 286
16.2 Embeddings from prediction: Skip-gram and CBOW . . . . . . . 290
16.3 Properties of embeddings . . . . . . . . . . . . . . . . . . . . . . 295
16.4 Brown Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . 295
16.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 298
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299

17 Computing with Word Senses 300


17.1 Word Senses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300

17.2 Relations Between Senses . . . . . . . . . . . . . . . . . . . . . . 303


17.3 WordNet: A Database of Lexical Relations . . . . . . . . . . . . . 305
17.4 Word Sense Disambiguation: Overview . . . . . . . . . . . . . . . 306
17.5 Supervised Word Sense Disambiguation . . . . . . . . . . . . . . 308
17.6 WSD: Dictionary and Thesaurus Methods . . . . . . . . . . . . . 311
17.7 Semi-Supervised WSD: Bootstrapping . . . . . . . . . . . . . . . 314
17.8 Unsupervised Word Sense Induction . . . . . . . . . . . . . . . . 316
17.9 Word Similarity: Thesaurus Methods . . . . . . . . . . . . . . . . 317
17.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 323
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325

18 Lexicons for Sentiment and Affect Extraction 326


18.1 Available Sentiment Lexicons . . . . . . . . . . . . . . . . . . . . 327
18.2 Semi-supervised induction of sentiment lexicons . . . . . . . . . . 328
18.3 Supervised learning of word sentiment . . . . . . . . . . . . . . . 333
18.4 Using Lexicons for Sentiment Recognition . . . . . . . . . . . . . 337
18.5 Emotion and other classes . . . . . . . . . . . . . . . . . . . . . . 338
18.6 Other tasks: Personality . . . . . . . . . . . . . . . . . . . . . . . 341
18.7 Affect Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . 342
18.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 344

19 The Representation of Sentence Meaning 346

20 Computational Semantics 347

21 Information Extraction 348


21.1 Named Entity Recognition . . . . . . . . . . . . . . . . . . . . . 349
21.2 Relation Extraction . . . . . . . . . . . . . . . . . . . . . . . . . 354
21.3 Extracting Times . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
21.4 Extracting Events and their Times . . . . . . . . . . . . . . . . . . 368
21.5 Template Filling . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
21.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 374
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375

22 Semantic Role Labeling 377


22.1 Semantic Roles . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
22.2 Diathesis Alternations . . . . . . . . . . . . . . . . . . . . . . . . 379
22.3 Semantic Roles: Problems with Thematic Roles . . . . . . . . . . 380
22.4 The Proposition Bank . . . . . . . . . . . . . . . . . . . . . . . . 381
22.5 FrameNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382
22.6 Semantic Role Labeling . . . . . . . . . . . . . . . . . . . . . . . 384
22.7 Selectional Restrictions . . . . . . . . . . . . . . . . . . . . . . . 387
22.8 Primitive Decomposition of Predicates . . . . . . . . . . . . . . . 392
22.9 AMR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393
22.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 394
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395

23 Coreference Resolution and Entity Linking 396



24 Discourse Coherence 397

25 Machine Translation and Seq2Seq Models 398


26 Summarization 399
27 Question Answering 400
27.1 IR-based Factoid Question Answering . . . . . . . . . . . . . . . 401
27.2 Knowledge-based Question Answering . . . . . . . . . . . . . . . 408
27.3 Using multiple information sources: IBM’s Watson . . . . . . . . 412
27.4 Evaluation of Factoid Answers . . . . . . . . . . . . . . . . . . . 415
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 416
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
28 Dialog Systems and Chatbots 418
28.1 Chatbots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
28.2 Frame Based Dialog Agents . . . . . . . . . . . . . . . . . . . . . 427
28.3 VoiceXML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434
28.4 Evaluating Dialogue Systems . . . . . . . . . . . . . . . . . . . . 437
28.5 Dialogue System Design . . . . . . . . . . . . . . . . . . . . . . . 438
28.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 439
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
29 Advanced Dialog Systems 441
29.1 Dialog Acts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 442
29.2 Dialog State: Interpreting Dialogue Acts . . . . . . . . . . . . . . 447
29.3 Dialogue Policy . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
29.4 A simple policy based on local context . . . . . . . . . . . . . . . 451
29.5 Natural language generation in the dialog-state model . . . . . . . 451
29.6 Advanced: Markov Decision Processes . . . . . . . . . . . . . . 454
29.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 457
30 Speech Recognition 459
31 Speech Synthesis 460

Bibliography 461
Author Index 485
Subject Index 493
CHAPTER 1
Introduction

Placeholder

CHAPTER 2
Regular Expressions, Text Normalization, Edit Distance
User: You are like my father in some ways.
ELIZA: WHAT RESEMBLANCE DO YOU SEE
User: You are not very aggressive but I think you don’t want me to notice that.
ELIZA: WHAT MAKES YOU THINK I AM NOT VERY AGGRESSIVE
User: You don’t argue with me.
ELIZA: WHY DO YOU THINK I DON’T ARGUE WITH YOU
User: You are afraid of me.
ELIZA: DOES IT PLEASE YOU TO BELIEVE I AM AFRAID OF YOU
Weizenbaum (1966)

The dialogue above is from ELIZA, an early natural language processing
system that could carry on a limited conversation with a user by imitating the
responses of a Rogerian psychotherapist (Weizenbaum, 1966). ELIZA is a
surprisingly simple program that uses pattern matching to recognize phrases like
“You are X” and translate them into suitable outputs like “What makes you think
I am X?”. This simple technique succeeds in this domain because ELIZA doesn’t
actually need to know anything to mimic a Rogerian psychotherapist. As
Weizenbaum notes, this is one of the few dialogue genres where listeners can act
as if they know nothing of the world. ELIZA’s mimicry of human conversation
was remarkably successful: many people who interacted with ELIZA came to
believe that it really understood them and their problems, many continued to
believe in ELIZA’s abilities even after the program’s operation was explained to
them (Weizenbaum, 1976), and even today such chatbots are a fun diversion.
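ELIZA’s “You are X” trick can be sketched in a few lines of Python. This is a toy
illustration with two invented rules, not Weizenbaum’s original program:

```python
import re

# Two ELIZA-style rules: a pattern to recognize, and a response template
# where \1 is filled with the captured group. Both rules are illustrative
# inventions, not Weizenbaum's originals.
rules = [
    (re.compile(r"you are (.*)", re.IGNORECASE),
     "WHAT MAKES YOU THINK I AM \\1"),
    (re.compile(r"you (.*) me", re.IGNORECASE),
     "WHY DO YOU THINK I \\1 YOU"),
]

def eliza_respond(utterance):
    """Return the response of the first rule that matches, else a default."""
    for pattern, response in rules:
        match = pattern.search(utterance)
        if match:
            # expand() substitutes \1 with the matched text; strip
            # trailing punctuation carried over from the user's input
            return match.expand(response).rstrip(".!")
    return "PLEASE GO ON"

print(eliza_respond("You are not very aggressive."))
```

A real ELIZA also reflects pronouns (*my* → *your*, *me* → *you*) and ranks rules by
keyword; this sketch shows only the core pattern-and-template idea.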
Of course modern conversational agents are much more than a diversion; they
can answer questions, book flights, or find restaurants, functions for which they rely
on a much more sophisticated understanding of the user’s intent, as we will see in
Chapter 29. Nonetheless, the simple pattern-based methods that powered ELIZA
and other chatbots play a crucial role in natural language processing.
We’ll begin with the most important tool for describing text patterns: the regular
expression. Regular expressions can be used to specify strings we might want to
extract from a document, from transforming “You are X” in Eliza above, to defining
strings like $199 or $24.99 for extracting tables of prices from a document.
We’ll then turn to a set of tasks collectively called text normalization, in
which regular expressions play an important part. Normalizing text means
converting it to a more convenient, standard form. For example, most of what we
are going to do with language relies on first separating out or tokenizing words
from running text, the task of tokenization. English words are often separated
from each other by whitespace, but whitespace is not always sufficient. New York
and rock ’n’ roll are sometimes treated as large words despite the fact that they
contain spaces, while sometimes we’ll need to separate I’m into the two words I
and am. For processing tweets or texts we’ll need to tokenize emoticons like :)
or hashtags like #nlproc. Some languages, like Chinese, don’t have spaces
between words, so word tokenization becomes more difficult.
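A rough sketch of such a tokenizer using Python’s re module (the pattern is an
illustrative simplification, not a full tokenizer):

```python
import re

# Whitespace splitting alone fails on emoticons and hashtags, so we match
# token shapes directly. A deliberately simplified pattern for illustration.
token_pattern = re.compile(r"""
    \#\w+            # hashtags like \#nlproc
  | :\) | :\(        # a couple of emoticons
  | \w+(?:'\w+)?     # words, optionally with an internal apostrophe (I'm)
  | [^\w\s]          # any remaining punctuation, one character at a time
""", re.VERBOSE)

def tokenize(text):
    return token_pattern.findall(text)

print(tokenize("I'm learning #nlproc :)"))
```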

Another part of text normalization is lemmatization, the task of determining
that two words have the same root, despite their surface differences. For example,
the words sang, sung, and sings are forms of the verb sing. The word sing is the
common lemma of these words, and a lemmatizer maps from all of these to sing.
Lemmatization is essential for processing morphologically complex languages
like Arabic. Stemming refers to a simpler version of lemmatization in which we
mainly just strip suffixes from the end of the word. Text normalization also
includes sentence segmentation: breaking up a text into individual sentences,
using cues like periods or exclamation points.
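The difference between the two tasks is easy to see in code. The toy suffix-stripper
below is a crude stand-in for a real stemmer such as Porter’s; note that no amount
of suffix-stripping can map *sang* or *sung* to *sing*, which is why lemmatization
requires morphological analysis or a lookup table:

```python
def crude_stem(word):
    """Strip one of a few common English suffixes. A toy stand-in for a
    real stemmer; it makes no attempt at irregular forms (it cannot map
    'sang' or 'sung' to 'sing')."""
    for suffix in ("ings", "ing", "ed", "s"):
        # only strip if a stem of at least 3 characters remains
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(crude_stem("sings"))    # strips the -s
print(crude_stem("walking"))  # strips the -ing
print(crude_stem("sang"))     # unchanged: stemming can't reach 'sing'
```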
Finally, we’ll need to compare words and other strings. We’ll introduce a metric
called edit distance that measures how similar two strings are based on the number
of edits (insertions, deletions, substitutions) it takes to change one string into the
other. Edit distance is an algorithm with applications throughout language process-
ing, from spelling correction to speech recognition to coreference resolution.
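As a preview, the standard dynamic-programming computation of edit distance can
be sketched as follows. Unit costs for all three operations are assumed here;
other cost assignments (e.g., a higher cost for substitution) are possible, and the
algorithm itself is developed in Section 2.4:

```python
def min_edit_distance(source, target):
    """Minimum edit distance with unit insertion, deletion, and
    substitution costs, computed bottom-up in a table D where
    D[i][j] is the distance between source[:i] and target[:j]."""
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                  # i deletions turn source[:i] into ""
    for j in range(1, m + 1):
        D[0][j] = j                  # j insertions turn "" into target[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub_cost = 0 if source[i - 1] == target[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,               # deletion
                          D[i][j - 1] + 1,               # insertion
                          D[i - 1][j - 1] + sub_cost)    # substitution
    return D[n][m]

print(min_edit_distance("intention", "execution"))  # -> 5 with unit costs
```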

2.1 Regular Expressions


SIR ANDREW: Her C’s, her U’s and her T’s: why that?
Shakespeare, Twelfth Night
One of the unsung successes in standardization in computer science has been the
regular expression (RE), a language for specifying text search strings. This
practical language is used in every computer language, word processor, and text
processing tool, such as the Unix tool grep or the Emacs editor. Formally, a
regular expression is an algebraic notation for characterizing a set of strings.
Regular expressions are particularly useful for searching in texts, when we have
a pattern to search for and a corpus of texts to search through. A regular
expression search function will search through the corpus, returning all texts
that match the pattern. The corpus can be a single document or a collection. For
example, the Unix command-line tool grep takes a regular expression and returns
every line of the input document that matches the expression.

A search can be designed to return every match on a line, if there are more than
one, or just the first match. In the following examples we underline the exact
part of the pattern that matches the regular expression and show only the first
match. We’ll show regular expressions delimited by slashes, but note that slashes
are not part of the regular expressions.

2.1.1 Basic Regular Expression Patterns


The simplest kind of regular expression is a sequence of simple characters. To search
for woodchuck, we type /woodchuck/. The expression /Buttercup/ matches any
string containing the substring Buttercup; grep with that expression would return the
line I’m called little Buttercup. The search string can consist of a single character
(like /!/) or a sequence of characters (like /urgl/).

RE Example Patterns Matched


/woodchucks/ “interesting links to woodchucks and lemurs”
/a/ “Mary Ann stopped by Mona’s”
/!/ “You’ve left the burglar behind again!” said Nori
Figure 2.1 Some simple regex searches.
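The searches in Fig. 2.1 can be reproduced with Python’s re module (an aside for
the curious; remember that the slashes are just this text’s notation and are not
typed into the pattern):

```python
import re

line = "interesting links to woodchucks and lemurs"

# /woodchucks/ -- a sequence of simple characters
print(re.search(r"woodchucks", line).group())

# /!/ -- a single-character pattern
print(re.search(r"!", "You've left the burglar behind again!").group())

# re.search returns None when the pattern does not occur anywhere
print(re.search(r"woodchucks", "no rodents here"))
```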

Regular expressions are case sensitive; lower case /s/ is distinct from upper
case /S/ (/s/ matches a lower case s but not an upper case S). This means that
the pattern /woodchucks/ will not match the string Woodchucks. We can solve this
problem with the use of the square braces [ and ]. The string of characters inside the
braces specifies a disjunction of characters to match. For example, Fig. 2.2 shows
that the pattern /[wW]/ matches patterns containing either w or W.

RE Match Example Patterns


/[wW]oodchuck/ Woodchuck or woodchuck “Woodchuck”
/[abc]/ ‘a’, ‘b’, or ‘c’ “In uomini, in soldati”
/[1234567890]/ any digit “plenty of 7 to 5”
Figure 2.2 The use of the brackets [] to specify a disjunction of characters.

The regular expression /[1234567890]/ specifies any single digit. While such
classes of characters as digits or letters are important building blocks in
expressions, they can get awkward (e.g., it’s inconvenient to specify
/[ABCDEFGHIJKLMNOPQRSTUVWXYZ]/
to mean “any capital letter”). In cases where there is a well-defined sequence
associated with a set of characters, the brackets can be used with the dash (-)
to specify any one character in a range. The pattern /[2-5]/ specifies any one of
the characters 2, 3, 4, or 5. The pattern /[b-g]/ specifies one of the characters
b, c, d, e, f, or g. Some other examples are shown in Fig. 2.3.

RE Match Example Patterns Matched


/[A-Z]/ an upper case letter “we should call it ‘Drenched Blossoms’ ”
/[a-z]/ a lower case letter “my beans were impatient to be hoed!”
/[0-9]/ a single digit “Chapter 1: Down the Rabbit Hole”
Figure 2.3 The use of the brackets [] plus the dash - to specify a range.
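A quick check of the bracket and range patterns in Python (illustrative only):

```python
import re

# /[wW]oodchuck/ -- a disjunction of characters inside the brackets
print(re.search(r"[wW]oodchuck", "Woodchuck").group())

# /[1234567890]/ and its range shorthand /[0-9]/ behave identically:
# both find the first digit in the line
line = "plenty of 7 to 5"
assert re.search(r"[1234567890]", line).group() == \
       re.search(r"[0-9]", line).group() == "7"

# /[2-5]/ -- any one character in the range 2..5; the 7 is skipped
print(re.search(r"[2-5]", line).group())
```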

The square braces can also be used to specify what a single character cannot be,
by use of the caret ˆ. If the caret ˆ is the first symbol after the open square brace [,
the resulting pattern is negated. For example, the pattern /[ˆa]/ matches any single
character (including special characters) except a. This is only true when the caret
is the first symbol after the open square brace. If it occurs anywhere else, it usually
stands for a caret; Fig. 2.4 shows some examples.
RE        Match (single characters)  Example Patterns Matched
/[ˆA-Z]/  not an upper case letter   “Oyfn pripetchik”
/[ˆSs]/   neither ‘S’ nor ‘s’        “I have no exquisite reason for’t”
/[ˆ\.]/   not a period               “our resident Djinn”
/[eˆ]/    either ‘e’ or ‘ˆ’          “look up ˆ now”
/aˆb/     the pattern ‘aˆb’          “look up aˆ b now”
Figure 2.4 Uses of the caret ˆ for negation or just to mean ˆ. We discuss below the need to escape the period by a backslash.
How can we talk about optional elements, like an optional s in woodchuck and
woodchucks? We can’t use the square brackets, because while they allow us to say
“s or S”, they don’t allow us to say “s or nothing”. For this we use the question mark
/?/, which means “the preceding character or nothing”, as shown in Fig. 2.5.
We can think of the question mark as meaning “zero or one instances of the
previous character”. That is, it’s a way of specifying how many of something we
want, something that is very important in regular expressions.

RE             Match                    Example Patterns Matched
/woodchucks?/  woodchuck or woodchucks  “woodchuck”
/colou?r/      color or colour          “colour”
Figure 2.5 The question mark ? marks optionality of the previous expression.

For example, consider the language of certain sheep, which consists of strings that
look like the following:
baa!
baaa!
baaaa!
baaaaa!
...
This language consists of strings with a b, followed by at least two a’s, followed
by an exclamation point. The set of operators that allows us to say things like “some
number of as” is based on the asterisk or *, commonly called the Kleene * (generally
pronounced “cleany star”). The Kleene star means “zero or more occurrences
of the immediately previous character or regular expression”. So /a*/ means “any
string of zero or more as”. This will match a or aaaaaa, but it will also match Off
Minor since the string Off Minor has zero a’s. So the regular expression for matching
one or more a is /aa*/, meaning one a followed by zero or more as. More complex
patterns can also be repeated. So /[ab]*/ means “zero or more a’s or b’s” (not
“zero or more right square braces”). This will match strings like aaaa or ababab or
bbbb.
For specifying multiple digits (useful for finding prices) we can extend /[0-9]/,
the regular expression for a single digit. An integer (a string of digits) is thus
/[0-9][0-9]*/. (Why isn’t it just /[0-9]*/?)
Sometimes it’s annoying to have to write the regular expression for digits twice,
so there is a shorter way to specify “at least one” of some character. This is the
Kleene +, which means “one or more of the previous character”. Thus, the expression
/[0-9]+/ is the normal way to specify “a sequence of digits”. There are thus
two ways to specify the sheep language: /baaa*!/ or /baa+!/.
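Both renderings of the sheep language can be checked directly; a small sketch in Python (fullmatch requires the whole string to match):

```python
import re

sheep = re.compile(r"baa+!")        # b, one a, one or more further a's, then !
for s in ["baa!", "baaa!", "baaaaa!"]:
    assert sheep.fullmatch(s)
assert not sheep.fullmatch("ba!")   # only one a
assert not sheep.fullmatch("baa")   # missing the exclamation point

# /baaa*!/ is equivalent: two literal a's, then zero or more a's.
for s in ["baa!", "baaa!", "baaaaa!"]:
    assert re.fullmatch(r"baaa*!", s)
```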
One very important special character is the period (/./), a wildcard expression
that matches any single character (except a newline), as shown in Fig. 2.6.
RE       Match                            Example Matches
/beg.n/  any character between beg and n  begin, beg’n, begun
Figure 2.6 The use of the period . to specify any character.
The wildcard is often used together with the Kleene star to mean “any string of
characters”. For example, suppose we want to find any line in which a particular
word, for example, aardvark, appears twice. We can specify this with the regular
expression /aardvark.*aardvark/.
Anchors are special characters that anchor regular expressions to particular places
in a string. The most common anchors are the caret ˆ and the dollar sign $. The caret
ˆ matches the start of a line. The pattern /ˆThe/ matches the word The only at the
start of a line. Thus, the caret ˆ has three uses: to match the start of a line, to in-
dicate a negation inside of square brackets, and just to mean a caret. (What are the
contexts that allow grep or Python to know which function a given caret is supposed
to have?) The dollar sign $ matches the end of a line. So the pattern / $/ (a space
followed by a dollar sign) is a useful pattern for matching a space at the end of a
line, and /ˆThe dog\.$/ matches a line that contains only the phrase The dog. (We
have to use the backslash here since we want the . to mean “period” and not the
wildcard.)
There are also two other anchors: \b matches a word boundary, and \B matches
a non-boundary. Thus, /\bthe\b/ matches the word the but not the word other.
More technically, a “word” for the purposes of a regular expression is defined as any
sequence of digits, underscores, or letters; this is based on the definition of “words”
in programming languages. For example, /\b99\b/ will match the string 99 in
There are 99 bottles of beer on the wall (because 99 follows a space) but not 99 in
There are 299 bottles of beer on the wall (since 99 follows a number). But it will
match 99 in $99 (since 99 follows a dollar sign ($), which is not a digit, underscore,
or letter).
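These anchor behaviors can be verified in Python, whose \b follows the same digits-underscores-letters definition of a word:

```python
import re

# ^ and $ anchor the match to the start and end of the string (or line).
assert re.search(r"^The", "The dog barked")
assert not re.search(r"^The", "I saw The dog")
assert re.fullmatch(r"The dog\.", "The dog.")

# \b matches at a word boundary, \B at a non-boundary.
assert re.search(r"\bthe\b", "the cat sat")
assert not re.search(r"\bthe\b", "other")
assert re.search(r"\b99\b", "There are 99 bottles of beer on the wall")
assert not re.search(r"\b99\b", "There are 299 bottles of beer on the wall")
assert re.search(r"\b99\b", "$99")   # $ is not a digit, underscore, or letter
```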
2.1.2 Disjunction, Grouping, and Precedence
Suppose we need to search for texts about pets; perhaps we are particularly interested
in cats and dogs. In such a case, we might want to search for either the string cat or
the string dog. Since we can’t use the square brackets to search for “cat or dog” (why
can’t we say /[catdog]/?), we need a new operator, the disjunction operator, also
called the pipe symbol |. The pattern /cat|dog/ matches either the string cat or
the string dog.
Sometimes we need to use this disjunction operator in the midst of a larger se-
quence. For example, suppose I want to search for information about pet fish for
my cousin David. How can I specify both guppy and guppies? We cannot simply
say /guppy|ies/, because that would match only the strings guppy and ies. This
is because sequences like guppy take precedence over the disjunction operator |.
To make the disjunction operator apply only to a specific pattern, we need to use the
parenthesis operators ( and ). Enclosing a pattern in parentheses makes it act like
a single character for the purposes of neighboring operators like the pipe | and the
Kleene*. So the pattern /gupp(y|ies)/ would specify that we meant the disjunc-
tion only to apply to the suffixes y and ies.
The parenthesis operator ( is also useful when we are using counters like the
Kleene*. Unlike the | operator, the Kleene* operator applies by default only to
a single character, not to a whole sequence. Suppose we want to match repeated
instances of a string. Perhaps we have a line that has column labels of the form
Column 1 Column 2 Column 3. The expression /Column [0-9]+ */ will not
match any number of columns; instead, it will match a single column followed by
any number of spaces! The star here applies only to the space that precedes it,
not to the whole sequence. With the parentheses, we could write the expression
/(Column [0-9]+ *)*/ to match the word Column, followed by a number and
optional spaces, the whole pattern repeated any number of times.
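A quick Python sketch makes the effect of the parentheses concrete:

```python
import re

# Without parentheses, | applies to the whole sequences on either side.
assert re.fullmatch(r"guppy|ies", "ies")            # matches the bare suffix
assert not re.fullmatch(r"guppy|ies", "guppies")
# Parentheses restrict the disjunction to just the suffixes.
assert re.fullmatch(r"gupp(y|ies)", "guppy")
assert re.fullmatch(r"gupp(y|ies)", "guppies")

# Parentheses also let a counter apply to a whole sequence.
labels = "Column 1 Column 2 Column 3"
assert re.fullmatch(r"(Column [0-9]+ *)*", labels)
```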
This idea that one operator may take precedence over another, requiring us to
sometimes use parentheses to specify what we mean, is formalized by the operator
precedence hierarchy for regular expressions. The following table gives the order
of RE operator precedence, from highest precedence to lowest precedence.
Parenthesis ()
Counters * + ? {}
Sequences and anchors the ˆmy end$
Disjunction |
Thus, because counters have a higher precedence than sequences,
/the*/ matches theeeee but not thethe. Because sequences have a higher prece-
dence than disjunction, /the|any/ matches the or any but not theny.
Patterns can be ambiguous in another way. Consider the expression /[a-z]*/
when matching against the text once upon a time. Since /[a-z]*/ matches zero or
more letters, this expression could match nothing, or just the first letter o, on, onc,
or once. In these cases regular expressions always match the largest string they can;
we say that patterns are greedy, expanding to cover as much of a string as they can.
There are, however, ways to enforce non-greedy matching, using another meaning
of the ? qualifier. The operator *? is a Kleene star that matches as little text as
possible. The operator +? is a Kleene plus that matches as little text as possible.
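The difference is easy to see in Python:

```python
import re

text = "once upon a time"
# Greedy: [a-z]* expands to cover as much as it can from the match point.
assert re.match(r"[a-z]*", text).group() == "once"
# Non-greedy: *? matches as little text as possible (here, the empty string).
assert re.match(r"[a-z]*?", text).group() == ""

# A more typical contrast, matching bracketed spans:
assert re.search(r"<.*>", "<a> then <b>").group() == "<a> then <b>"  # greedy
assert re.search(r"<.*?>", "<a> then <b>").group() == "<a>"          # non-greedy
```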
2.1.3 A Simple Example
Suppose we wanted to write a RE to find cases of the English article the. A simple
(but incorrect) pattern might be:
/the/
One problem is that this pattern will miss the word when it begins a sentence
and hence is capitalized (i.e., The). This might lead us to the following pattern:
/[tT]he/
But we will still incorrectly return texts with the embedded in other words (e.g.,
other or theology). So we need to specify that we want instances with a word bound-
ary on both sides:
/\b[tT]he\b/
Suppose we wanted to do this without the use of /\b/. We might want this since
/\b/ won’t treat underscores and numbers as word boundaries; but we might want
to find the in some context where it might also have underlines or numbers nearby
(the or the25). We need to specify that we want instances in which there are no
alphabetic letters on either side of the the:
/[ˆa-zA-Z][tT]he[ˆa-zA-Z]/
But there is still one more problem with this pattern: it won’t find the word the
when it begins a line. This is because the regular expression [ˆa-zA-Z], which
we used to avoid embedded instances of the, implies that there must be some single
(although non-alphabetic) character before the the. We can avoid this by specify-
ing that before the the we require either the beginning-of-line or a non-alphabetic
character, and the same at the end of the line:
/(ˆ|[ˆa-zA-Z])[tT]he([ˆa-zA-Z]|$)/
The process we just went through was based on fixing two kinds of errors: false
positives, strings that we incorrectly matched like other or there, and false nega-
tives, strings that we incorrectly missed, like The. Addressing these two kinds of
errors comes up again and again in implementing speech and language processing
systems. Reducing the overall error rate for an application thus involves two antag-
onistic efforts:
• Increasing precision (minimizing false positives)
• Increasing recall (minimizing false negatives)
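The final pattern can be tested against the false positives and false negatives that motivated it; a small Python check (the example strings are illustrative):

```python
import re

the_pattern = re.compile(r"(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)")

# True positives: capitalized, sentence-initial, and mid-sentence the.
assert the_pattern.search("The dog ran")
assert the_pattern.search("saw the dog")
assert the_pattern.search("the")            # a line containing only the
# Former false positives are now correctly rejected.
assert not the_pattern.search("other")
assert not the_pattern.search("theology")
```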
2.1.4 A More Complex Example
Let’s try out a more significant example of the power of REs. Suppose we want to
build an application to help a user buy a computer on the Web. The user might want
“any machine with more than 6 GHz and 500 GB of disk space for less than $1000”.
To do this kind of retrieval, we first need to be able to look for expressions like 6
GHz or 500 GB or Mac or $999.99. In the rest of this section we’ll work out some
simple regular expressions for this task.
First, let’s complete our regular expression for prices. Here’s a regular expres-
sion for a dollar sign followed by a string of digits:
/$[0-9]+/
Note that the $ character has a different function here than the end-of-line function
we discussed earlier. Regular expression parsers are in fact smart enough to realize
that $ here doesn’t mean end-of-line. (As a thought experiment, think about how
regex parsers might figure out the function of $ from the context.)
Now we just need to deal with fractions of dollars. We’ll add a decimal point
and two digits afterwards:
/$[0-9]+\.[0-9][0-9]/
This pattern only allows $199.99 but not $199. We need to make the cents
optional and to make sure we’re at a word boundary:
/\b$[0-9]+(\.[0-9][0-9])?\b/
How about specifications for processor speed? Here’s a pattern for that:
/\b[0-9]+ *(GHz|[Gg]igahertz)\b/
Note that we use / */ to mean “zero or more spaces” since there might always
be extra spaces lying around. We also need to allow for optional fractions again (5.5
GB); note the use of ? for making the final s optional:
/\b[0-9]+(\.[0-9]+)? *(GB|[Gg]igabytes?)\b/
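Here is a sketch of close variants of these patterns in Python’s syntax. The leading \b before the dollar sign is dropped here, since \b does not match between a space and a $ (both are non-word characters):

```python
import re

# Dollar amounts with optional cents.
price = re.compile(r"\$[0-9]+(?:\.[0-9][0-9])?\b")
assert price.search("on sale for $199.99 today").group() == "$199.99"
assert price.search("only $199 left").group() == "$199"

# Disk sizes with optional fractions and an optional plural s.
disk = re.compile(r"\b[0-9]+(?:\.[0-9]+)? *(?:GB|[Gg]igabytes?)\b")
assert disk.search("with 500 GB of disk space").group() == "500 GB"
assert disk.search("about 5.5 gigabytes free").group() == "5.5 gigabytes"
```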
2.1.5 More Operators
Figure 2.7 shows some aliases for common ranges, which can be used mainly to
save typing. Besides the Kleene * and Kleene + we can also use explicit numbers as
counters, by enclosing them in curly brackets. The regular expression /{3}/ means
“exactly 3 occurrences of the previous character or expression”. So /a\.{24}z/
will match a followed by 24 dots followed by z (but not a followed by 23 or 25 dots
followed by a z).
RE  Expansion     Match                        First Matches
\d  [0-9]         any digit                    Party of 5
\D  [ˆ0-9]        any non-digit                Blue moon
\w  [a-zA-Z0-9_]  any alphanumeric/underscore  Daiyu
\W  [ˆ\w]         a non-alphanumeric           !!!!
\s  [ \r\t\n\f]   whitespace (space, tab)
\S  [ˆ\s]         Non-whitespace               in Concord
Figure 2.7 Aliases for common sets of characters.
A range of numbers can also be specified. So /{n,m}/ specifies from n to m
occurrences of the previous char or expression, and /{n,}/ means at least n occur-
rences of the previous expression. REs for counting are summarized in Fig. 2.8.
RE     Match
*      zero or more occurrences of the previous char or expression
+      one or more occurrences of the previous char or expression
?      exactly zero or one occurrence of the previous char or expression
{n}    n occurrences of the previous char or expression
{n,m}  from n to m occurrences of the previous char or expression
{n,}   at least n occurrences of the previous char or expression
Figure 2.8 Regular expression operators for counting.
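The counters in Fig. 2.8 behave as follows in Python (a minimal check):

```python
import re

# {n}: exactly n occurrences of the previous expression.
assert re.fullmatch(r"a\.{3}z", "a...z")
assert not re.fullmatch(r"a\.{3}z", "a..z")

# {n,m}: from n to m occurrences; {n,}: at least n.
assert re.fullmatch(r"[0-9]{2,4}", "123")
assert not re.fullmatch(r"[0-9]{2,4}", "12345")
assert re.fullmatch(r"ba{2,}!", "baaaa!")
assert not re.fullmatch(r"ba{2,}!", "ba!")
```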
Finally, certain special characters are referred to by special notation based on the
backslash (\) (see Fig. 2.9). The most common of these are the newline character
\n and the tab character \t. To refer to characters that are special themselves (like
., *, [, and \), precede them with a backslash, (i.e., /\./, /\*/, /\[/, and /\\/).
RE  Match            First Patterns Matched
\*  an asterisk “*”  “K*A*P*L*A*N”
\.  a period “.”     “Dr. Livingston, I presume”
\?  a question mark  “Why don’t they come and lend a hand?”
\n  a newline
\t  a tab
Figure 2.9 Some characters that need to be backslashed.
2.1.6 Regular Expression Substitution, Capture Groups, and ELIZA
An important use of regular expressions is in substitutions. For example, the substi-
tution operator s/regexp1/pattern/ used in Python and in Unix commands like
vim or sed allows a string characterized by a regular expression to be replaced by
another string:
s/colour/color/
It is often useful to be able to refer to a particular subpart of the string matching
the first pattern. For example, suppose we wanted to put angle brackets around all
integers in a text, for example, changing the 35 boxes to the <35> boxes. We’d
like a way to refer to the integer we’ve found so that we can easily add the brackets.
To do this, we put parentheses ( and ) around the first pattern and use the number
operator \1 in the second pattern to refer back. Here’s how it looks:
s/([0-9]+)/<\1>/
The parenthesis and number operators can also specify that a certain string or
expression must occur twice in the text. For example, suppose we are looking for
the pattern “the Xer they were, the Xer they will be”, where we want to constrain
the two X’s to be the same string. We do this by surrounding the first X with the
parenthesis operator, and replacing the second X with the number operator \1, as
follows:
/the (.*)er they were, the \1er they will be/
Here the \1 will be replaced by whatever string matched the first item in paren-
theses. So this will match The bigger they were, the bigger they will be but not The
bigger they were, the faster they will be.
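Both uses translate directly into Python, where re.sub performs the substitution and \1 works the same way in patterns and replacement strings:

```python
import re

# Substitution with a backreference in the replacement string.
assert re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes") == "the <35> boxes"

# A backreference in the pattern forces the two spans to be identical.
pat = re.compile(r"the (.*)er they were, the \1er they will be")
assert pat.search("the bigger they were, the bigger they will be")
assert not pat.search("the bigger they were, the faster they will be")
```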
This use of parentheses to store a pattern in memory is called a capture group.
Every time a capture group is used (i.e., parentheses surround a pattern), the re-
sulting match is stored in a numbered register. If you match two different sets of
parentheses, \2 means whatever matched the second capture group. Thus
/the (.*)er they (.*), the \1er we \2/
will match The faster they ran, the faster we ran but not The faster they ran, the
faster we ate. Similarly, the third capture group is stored in \3, the fourth is \4, and
so on.
Parentheses thus have a double function in regular expressions; they are used to
group terms for specifying the order in which operators should apply, and they are
used to capture something in a register. Occasionally we might want to use parenthe-
ses for grouping, but don’t want to capture the resulting pattern in a register. In that
case we use a non-capturing group, which is specified by putting the commands
?: after the open paren, in the form (?: pattern ).
/(?:some|a few) (people|cats) like some \1/
will match some cats like some cats but not some cats like some a few.
Substitutions and capture groups are very useful in implementing simple chat-
bots like ELIZA (Weizenbaum, 1966). Recall that ELIZA simulates a Rogerian
psychologist by carrying on conversations like the following:
User1 : Men are all alike.
ELIZA1 : IN WHAT WAY
User2 : They’re always bugging us about something or other.
ELIZA2 : CAN YOU THINK OF A SPECIFIC EXAMPLE
User3 : Well, my boyfriend made me come here.
ELIZA3 : YOUR BOYFRIEND MADE YOU COME HERE
User4 : He says I’m depressed much of the time.
ELIZA4 : I AM SORRY TO HEAR YOU ARE DEPRESSED
ELIZA works by having a series or cascade of regular expression substitutions
each of which matches and changes some part of the input lines. The first substitu-
tions change all instances of my to YOUR, and I’m to YOU ARE, and so on. The next
set of substitutions matches and replaces other patterns in the input. Here are some
examples:
s/.* I’M (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
s/.* I AM (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
s/.* all .*/IN WHAT WAY/
s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
Since multiple substitutions can apply to a given input, substitutions are assigned
a rank and applied in order. Creating patterns is the topic of Exercise 2.3, and we
return to the details of the ELIZA architecture in Chapter 29.
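A tiny sketch of such a cascade in Python (the rules and ranking here are illustrative, not the actual ELIZA rule set):

```python
import re

# Rules are tried in rank order; the first one that matches produces the reply.
RULES = [
    (r".* I'[Mm] (depressed|sad) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".* all .*", r"IN WHAT WAY"),
    (r".* always .*", r"CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def respond(line):
    for pattern, template in RULES:
        m = re.fullmatch(pattern, line)
        if m:
            return m.expand(template)   # expand fills in \1 from the match
    return "PLEASE GO ON"

assert respond("Men are all alike.") == "IN WHAT WAY"
assert respond("They're always bugging us about something or other.") == \
    "CAN YOU THINK OF A SPECIFIC EXAMPLE"
assert respond("He says I'm depressed much of the time.") == \
    "I AM SORRY TO HEAR YOU ARE depressed"
```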
2.1.7 Lookahead assertions
Finally, there will be times when we need to predict the future: look ahead in the
text to see if some pattern matches, but not advance the match cursor, so that we can
then deal with the pattern if it occurs.
These lookahead assertions make use of the (? syntax that we saw in the previ-
ous section for non-capture groups. The operator (?= pattern) is true if pattern
occurs, but is zero-width, i.e. the match pointer doesn’t advance. The operator
(?! pattern) only returns true if a pattern does not match, but again is zero-width
and doesn’t advance the cursor. Negative lookahead is commonly used when we
are parsing some complex pattern but want to rule out a special case. For example
suppose we want to match, at the beginning of a line, any single word that doesn’t
start with ”Volcano”. We can use negative lookahead to do this:
/ˆ(?!Volcano)[A-Za-z]+/
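A brief Python check of both operators (Vesuvius here is just an arbitrary test word):

```python
import re

# Negative lookahead: a word at line start that does not begin with "Volcano".
word = re.compile(r"^(?!Volcano)[A-Za-z]+")
assert word.match("Vesuvius erupted").group() == "Vesuvius"
assert word.match("Volcano eruptions") is None
assert word.match("Volcanic rock").group() == "Volcanic"  # "Volcano" is not a prefix here

# Positive lookahead: match digits only when followed by " GB", without consuming it.
m = re.search(r"[0-9]+(?= GB)", "a 500 GB drive")
assert m.group() == "500"
```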
2.2 Words and Corpora
Before we talk about processing words, we need to decide what counts as a word.
Let’s start by looking at a corpus (plural corpora), a computer-readable collection
of text or speech. For example the Brown corpus is a million-word collection of sam-
ples from 500 written texts from different genres (newspaper, fiction, non-fiction,
academic, etc.), assembled at Brown University in 1963–64 (Kučera and Francis,
1967). How many words are in the following Brown sentence?
He stepped out into the hall, was delighted to encounter a water brother.
This sentence has 13 words if we don’t count punctuation marks as words, 15
if we count punctuation. Whether we treat period (“.”), comma (“,”), and so on as
words depends on the task. Punctuation is critical for finding boundaries of things
(commas, periods, colons) and for identifying some aspects of meaning (question
marks, exclamation marks, quotation marks). For some tasks, like part-of-speech
tagging or parsing or speech synthesis, we sometimes treat punctuation marks as if
they were separate words.
The Switchboard corpus of telephone conversations between strangers was col-
lected in the early 1990s; it contains 2430 conversations averaging 6 minutes each,
totaling 240 hours of speech and about 3 million words (Godfrey et al., 1992). Such
corpora of spoken language don’t have punctuation but do introduce other compli-
cations with regard to defining words. Let’s look at one utterance from Switchboard;
an utterance is the spoken correlate of a sentence:
I do uh main- mainly business data processing
This utterance has two kinds of disfluencies. The broken-off word main- is
called a fragment. Words like uh and um are called fillers or filled pauses. Should
we consider these to be words? Again, it depends on the application. If we are
building a speech transcription system, we might want to eventually strip out the
disfluencies.
But we also sometimes keep disfluencies around. Disfluencies like uh or um
are actually helpful in speech recognition in predicting the upcoming word, because
they may signal that the speaker is restarting the clause or idea, and so for speech
recognition they are treated as regular words. Because people use different disflu-
encies they can also be a cue to speaker identification. In fact Clark and Fox Tree
(2002) showed that uh and um have different meanings. What do you think they are?
Are capitalized tokens like They and uncapitalized tokens like they the same
word? These are lumped together in some tasks (speech recognition), while for part-
of-speech or named-entity tagging, capitalization is a useful feature and is retained.
How about inflected forms like cats versus cat? These two words have the same
lemma cat but are different wordforms. A lemma is a set of lexical forms having
the same stem, the same major part-of-speech, and the same word sense. The word-
form is the full inflected or derived form of the word. For morphologically complex
languages like Arabic, we often need to deal with lemmatization. For many tasks in
English, however, wordforms are sufficient.
How many words are there in English? To answer this question we need to
distinguish two ways of talking about words. Types are the number of distinct words
in a corpus; if the set of words in the vocabulary is V, the number of types is the
vocabulary size |V|. Tokens are the total number N of running words. If we ignore
punctuation, the following Brown sentence has 16 tokens and 14 types:
They picnicked by the pool, then lay back on the grass and looked at the stars.
When we speak about the number of words in the language, we are generally
referring to word types.
Corpus                               Tokens = N    Types = |V|
Shakespeare                          884 thousand  31 thousand
Brown corpus                         1 million     38 thousand
Switchboard telephone conversations  2.4 million   20 thousand
COCA                                 440 million   2 million
Google N-grams                       1 trillion    13 million
Figure 2.10 Rough numbers of types and tokens for some corpora. The largest, the Google
N-grams corpus, contains 13 million types, but this count only includes types appearing 40 or
more times, so the true number would be much larger.
Fig. 2.10 shows the rough numbers of types and tokens computed from some
popular English corpora. The larger the corpora we look at, the more word types
we find, and in fact this relationship between the number of types |V | and number
of tokens N is called Herdan’s Law (Herdan, 1960) or Heaps’ Law (Heaps, 1978)
after its discoverers (in linguistics and information retrieval respectively). It is shown
in Eq. 2.1, where k and β are positive constants, and 0 < β < 1.
|V| = kN^β    (2.1)
The value of β depends on the corpus size and the genre, but at least for the
large corpora in Fig. 2.10, β ranges from .67 to .75. Roughly then we can say that
the vocabulary size for a text goes up significantly faster than the square root of its
length in words.
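Counting types and tokens is straightforward; a sketch using the Brown example sentence from above (punctuation removed):

```python
text = "They picnicked by the pool then lay back on the grass and looked at the stars"
tokens = text.lower().split()
types_ = set(tokens)
assert len(tokens) == 16   # N, the number of running words
assert len(types_) == 14   # |V|; "the" occurs three times

# Herdan's/Heaps' law |V| = kN^beta with 0 < beta < 1 says vocabulary grows
# sublinearly in corpus size (k and beta below are illustrative, not fitted).
k, beta = 40, 0.7
V = lambda N: k * N ** beta
assert V(2_000_000) / V(1_000_000) < 2   # doubling N less than doubles |V|
```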
Another measure of the number of words in the language is the number of lem-
mas instead of wordform types. Dictionaries can help in giving lemma counts; dic-
tionary entries or boldface forms are a very rough upper bound on the number of
lemmas (since some lemmas have multiple boldface forms). The 1989 edition of the
Oxford English Dictionary had 615,000 entries.
2.3 Text Normalization
Before almost any natural language processing of a text, the text has to be normal-
ized. At least three tasks are commonly applied as part of any normalization process:
1. Segmenting/tokenizing words from running text
2. Normalizing word formats
3. Segmenting sentences in running text.
In the next sections we walk through each of these tasks.
2.3.1 Unix tools for crude tokenization and normalization
Let’s begin with an easy, if somewhat naive version of word tokenization and nor-
malization (and frequency computation) that can be accomplished solely in a single
UNIX command-line, inspired by Church (1994). We’ll make use of some Unix
commands: tr, used to systematically change particular characters in the input;
sort, which sorts input lines in alphabetical order; and uniq, which collapses and
counts adjacent identical lines.
For example let’s begin with the complete words of Shakespeare in one textfile,
sh.txt. We can use tr to tokenize the words by changing every sequence of non-
alphabetic characters to a newline (’A-Za-z’ means alphabetic, the -c option com-
plements to non-alphabet, and the -s option squeezes all sequences into a single
character):
tr -sc ’A-Za-z’ ’\n’ < sh.txt
The output of this command will be:
THE
SONNETS
by
William
Shakespeare
From
fairest
creatures
We
...
Now that there is one word per line, we can sort the lines, and pass them to uniq
-c which will collapse and count them:
tr -sc ’A-Za-z’ ’\n’ < sh.txt | sort | uniq -c
with the following output:
1945 A
72 AARON
19 ABBESS
25 Aaron
6 Abate
1 Abates
5 Abbess
6 Abbey
3 Abbot
...
Alternatively, we can collapse all the upper case to lower case:
tr -sc ’A-Za-z’ ’\n’ < sh.txt | tr A-Z a-z | sort | uniq -c
whose output is
14725 a
97 aaron
1 abaissiez
10 abandon
2 abandoned
2 abase
1 abash
14 abate
3 abated
3 abatement
...
Now we can sort again to find the frequent words. The -n option to sort means
to sort numerically rather than alphabetically, and the -r option means to sort in
reverse order (highest-to-lowest):
tr -sc ’A-Za-z’ ’\n’ < sh.txt | tr A-Z a-z | sort | uniq -c | sort -n -r
The results show that the most frequent words in Shakespeare, as in any other
corpus, are the short function words like articles, pronouns, prepositions:
27378 the
26084 and
22538 i
19771 to
17481 of
14725 a
13826 you
12489 my
11318 that
11112 in
...
Unix tools of this sort can be very handy in building quick word count statistics
for any corpus.
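The same rough counts can be computed in a few lines of Python (a sketch; the sample string below stands in for the contents of sh.txt):

```python
import re
from collections import Counter

def word_counts(text):
    # tr -sc 'A-Za-z' '\n' plus tr A-Z a-z: keep alphabetic runs, case-folded
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    return Counter(tokens)   # plays the role of sort | uniq -c | sort -n -r

counts = word_counts("THE SONNETS by William Shakespeare The Sonnets")
assert counts["the"] == 2
assert counts["sonnets"] == 2
assert counts["william"] == 1
```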
2.3.2 Word Tokenization and Normalization
The simple UNIX tools above were fine for getting rough word statistics but more
sophisticated algorithms are generally necessary for tokenization, the task of seg-
menting running text into words, and normalization, the task of putting words/tokens
in a standard format.
While the Unix command sequence just removed all the numbers and punctu-
ation, for most NLP applications we’ll need to keep these in our tokenization. We
often want to break off punctuation as a separate token; commas are a useful piece of
information for parsers, periods help indicate sentence boundaries. But we’ll often
want to keep the punctuation that occurs word internally, in examples like m.p.h.,
Ph.D., AT&T, cap’n. Special characters and numbers will need to be kept in prices
($45.55) and dates (01/02/06); we don’t want to segment that price into separate to-
kens of “45” and “55”. And there are URLs (http://www.stanford.edu), Twitter
hashtags (#nlproc), or email addresses ([email protected]).
Number expressions introduce other complications as well; while commas nor-
mally appear at word boundaries, commas are used inside numbers in English, every
three digits: 555,500.50. Languages, and hence tokenization requirements, differ
on this; many continental European languages like Spanish, French, and German, by
contrast, use a comma to mark the decimal point, and spaces (or sometimes periods)
where English puts commas, for example, 555 500,50.
A tokenizer can also be used to expand clitic contractions that are marked by
apostrophes, for example, converting what’re to the two tokens what are, and
we’re to we are. A clitic is a part of a word that can’t stand on its own, and can only
occur when it is attached to another word. Some such contractions occur in other
alphabetic languages, including articles and pronouns in French (j’ai, l’homme).
Depending on the application, tokenization algorithms may also tokenize mul-
tiword expressions like New York or rock ’n’ roll as a single token, which re-
quires a multiword expression dictionary of some sort. Tokenization is thus inti-
mately tied up with named entity detection, the task of detecting names, dates, and
organizations (Chapter 20).
One commonly used tokenization standard is known as the Penn Treebank to-
kenization standard, used for the parsed corpora (treebanks) released by the Lin-
guistic Data Consortium (LDC), the source of many useful datasets. This standard
separates out clitics (doesn’t becomes does plus n’t), keeps hyphenated words to-
gether, and separates out all punctuation:
Input: “The San Francisco-based restaurant,” they said, “doesn’t charge $10”.
Output: “ The San Francisco-based restaurant , ” they
said , “ does n’t charge $ 10 ” .
Tokens can also be normalized, in which a single normalized form is chosen for
words with multiple forms like USA and US or uh-huh and uhhuh. This standard-
ization may be valuable, despite the spelling information that is lost in the normal-
ization process. For information retrieval, we might want a query for US to match a
document that has USA; for information extraction we might want to extract coherent
information that is consistent across differently-spelled instances.
Case folding is another kind of normalization. For tasks like speech recognition
and information retrieval, everything is mapped to lower case. For sentiment anal-
ysis and other text classification tasks, information extraction, and machine transla-
tion, by contrast, case is quite helpful and case folding is generally not done (losing
the difference, for example, between US the country and us the pronoun can out-
weigh the advantage in generality that case folding provides).
In practice, since tokenization needs to be run before any other language pro-
cessing, it is important for it to be very fast. The standard method for tokeniza-
tion/normalization is therefore to use deterministic algorithms based on regular ex-
pressions compiled into very efficient finite state automata. Carefully designed de-
terministic algorithms can deal with the ambiguities that arise, such as the fact that
the apostrophe needs to be tokenized differently when used as a genitive marker (as
in the book’s cover), a quotative as in ‘The other class’, she said, or in clitics like
they’re. We’ll discuss this use of automata in Chapter 3.
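As a rough illustration (not the Penn Treebank standard), a single regular expression can already keep prices and word-internal punctuation together while splitting off other punctuation; the pattern below is a hypothetical sketch:

```python
import re

TOKEN = re.compile(r"""
      \$?\d+(?:\.\d+)?        # prices and numbers: $45.55, 500
    | \w+(?:[.&']\w+)*\.?     # words with internal punctuation: Ph.D., AT&T, cap'n
    | [^\w\s]                 # any other single non-space character
""", re.VERBOSE)

tokens = TOKEN.findall("The Ph.D. costs $45.55, right?")
assert tokens == ["The", "Ph.D.", "costs", "$45.55", ",", "right", "?"]
```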
2.3.3 Word Segmentation in Chinese: the MaxMatch algorithm


Some languages, including Chinese, Japanese, and Thai, do not use spaces to mark
potential word-boundaries, and so require alternative segmentation methods. In Chi-
nese, for example, words are composed of characters known as hanzi. Each charac-
ter generally represents a single morpheme and is pronounceable as a single syllable.
Words are about 2.4 characters long on average. A simple algorithm that does re-
24  CHAPTER 2 • REGULAR EXPRESSIONS, TEXT NORMALIZATION, EDIT DISTANCE

markably well for segmenting Chinese, and often used as a baseline comparison for
more advanced methods, is a version of greedy search called maximum matching
or sometimes MaxMatch. The algorithm requires a dictionary (wordlist) of the
language.
The maximum matching algorithm starts by pointing at the beginning of a string.
It chooses the longest word in the dictionary that matches the input at the current
position. The pointer is then advanced to the end of that word in the string. If
no word matches, the pointer is instead advanced one character (creating a one-
character word). The algorithm is then iteratively applied again starting from the
new pointer position. Fig. 2.11 shows a version of the algorithm.

function MAXMATCH(sentence, dictionary D) returns word sequence W

  if sentence is empty
      return empty list
  for i ← length(sentence) downto 1
      firstword = first i chars of sentence
      remainder = rest of sentence
      if InDictionary(firstword, D)
          return list(firstword, MaxMatch(remainder, D))

  # no word was found, so make a one-character word
  firstword = first char of sentence
  remainder = rest of sentence
  return list(firstword, MaxMatch(remainder, D))

Figure 2.11  The MaxMatch algorithm for word segmentation.
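The greedy loop of Fig. 2.11 can be written iteratively in a few lines of Python. This is a sketch in which the dictionary is simply a set of strings:

```python
def max_match(sentence, dictionary):
    """Greedy longest-match-first segmentation (Fig. 2.11), iterative form."""
    words = []
    while sentence:
        for i in range(len(sentence), 0, -1):
            first = sentence[:i]
            # Take the longest dictionary word at the current position;
            # fall back to a one-character word (i == 1) if nothing matches.
            if first in dictionary or i == 1:
                words.append(first)
                sentence = sentence[i:]
                break
    return words
```

With a toy English wordlist that contains *canon*, this reproduces the failure case on Turing's quote discussed below.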

MaxMatch works very well on Chinese; the following example shows an appli-
cation to a simple Chinese sentence using a simple Chinese lexicon available from
the Linguistic Data Consortium:
Input: 他特别喜欢北京烤鸭 “He especially likes Peking duck”
Output: 他 特别 喜欢 北京烤鸭
He especially likes Peking duck
MaxMatch doesn’t work as well on English. To make the intuition clear, we’ll
create an example by removing the spaces from the beginning of Turing’s famous
quote “We can only see a short distance ahead”, producing “wecanonlyseeashortdis-
tanceahead”. The MaxMatch results are shown below.
Input: wecanonlyseeashortdistanceahead
Output: we canon l y see ash ort distance ahead
On English the algorithm incorrectly chose canon instead of stopping at can,
which left the algorithm confused and having to create single-character words l and
y and use the very rare word ort.
The algorithm works better in Chinese than English, because Chinese has much
shorter words than English. We can quantify how well a segmenter works using a
metric called word error rate. We compare our output segmentation with a perfect
hand-segmented (‘gold’) sentence, seeing how many words differ. The word error
rate is then the normalized minimum edit distance in words between our output and
the gold: the number of word insertions, deletions, and substitutions divided by the
length of the gold sentence in words; we’ll see in Section 2.4 how to compute edit
distance. Even in Chinese, however, MaxMatch has problems, for example dealing

with unknown words (words not in the dictionary) or genres that differ a lot from
the assumptions made by the dictionary builder.
The most accurate Chinese segmentation algorithms generally use statistical se-
quence models trained via supervised machine learning on hand-segmented training
sets; we’ll introduce sequence models in Chapter 10.
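The word error rate just described follows directly from its definition once we can compute word-level edit distance with unit costs. A minimal illustrative sketch:

```python
def word_error_rate(hyp, gold):
    """Word-level minimum edit distance (unit insertion, deletion, and
    substitution costs) divided by the length of the gold segmentation."""
    n, m = len(hyp), len(gold)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if hyp[i - 1] == gold[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + sub)  # substitution
    return D[n][m] / m
```

Applied to the MaxMatch output for Turing's quote against the gold segmentation, the distance is 5 (4 substitutions plus 1 deletion) over 8 gold words.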

2.3.4 Lemmatization and Stemming


Lemmatization is the task of determining that two words have the same root, despite
their surface differences. The words am, are, and is have the shared lemma be; the
words dinner and dinners both have the lemma dinner. Representing a word by its
lemma is important for web search, since we want to find pages mentioning wood-
chucks if we search for woodchuck. This is especially important in morphologically
complex languages like Russian, where for example the word Moscow has different
endings in the phrases Moscow, of Moscow, from Moscow, and so on. Lemmatizing
each of these forms to the same lemma will let us find all mentions of Moscow. The
lemmatized form of a sentence like He is reading detective stories would thus be He
be read detective story.
How is lemmatization done? The most sophisticated methods for lemmatization
involve complete morphological parsing of the word. Morphology is the study of
the way words are built up from smaller meaning-bearing units called morphemes.
Two broad classes of morphemes can be distinguished: stems—the central mor-
pheme of the word, supplying the main meaning—and affixes—adding “additional”
meanings of various kinds. So, for example, the word fox consists of one morpheme
(the morpheme fox) and the word cats consists of two: the morpheme cat and the
morpheme -s. A morphological parser takes a word like cats and parses it into the
two morphemes cat and s, or a Spanish word like amaren (‘if in the future they
would love’) into the morphemes amar ‘to love’, 3PL, and future subjunctive. We’ll
introduce morphological parsing in Chapter 3.

The Porter Stemmer


While using finite-state transducers to build a full morphological parser is the most
general way to deal with morphological variation in word forms, we sometimes
make use of simpler but cruder chopping off of affixes. This naive version of mor-
phological analysis is called stemming, and one of the most widely used stemming
algorithms is the simple and efficient Porter (1980) algorithm. The Porter stemmer
applied to the following paragraph:

This was not the map we found in Billy Bones’s chest, but
an accurate copy, complete in all things—names and heights
and soundings—with the single exception of the red crosses
and the written notes.

produces the following stemmed output:

Thi wa not the map we found in Billi Bone s chest but an
accur copi complet in all thing name and height and sound
with the singl except of the red cross and the written note

The algorithm is based on a series of rewrite rules run in series, as a cascade, in
which the output of each pass is fed as input to the next pass; here is a sampling of

the rules:
ATIONAL → ATE (e.g., relational → relate)
ING → ε if stem contains vowel (e.g., motoring → motor)
SSES → SS (e.g., grasses → grass)
Detailed rule lists for the Porter stemmer, as well as code (in Java, Python, etc.)
can be found on Martin Porter’s homepage; see also the original paper (Porter, 1980).
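The cascade idea can be illustrated with just the three sampled rules above. The mini_stem function here is our own toy sketch of a cascade, nothing like the full Porter rule set:

```python
import re

def mini_stem(word):
    """Apply three Porter-style rewrite rules as a cascade: each rule's
    output feeds the next. Illustrative only; not the real Porter stemmer."""
    if word.endswith("sses"):                          # SSES -> SS
        word = word[:-4] + "ss"
    if word.endswith("ational"):                       # ATIONAL -> ATE
        word = word[:-7] + "ate"
    if word.endswith("ing") and re.search(r"[aeiou]", word[:-3]):
        word = word[:-3]                               # ING -> ε if stem has a vowel
    return word
```

Note that *sing* is left untouched, since the stem *s* contains no vowel.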
Simple stemmers can be useful in cases where we need to collapse across differ-
ent variants of the same lemma. Nonetheless, they do tend to commit errors of both
over- and under-generalizing, as shown in the table below (Krovetz, 1993):

Errors of Commission          Errors of Omission
organization   organ          European   Europe
doing          doe            analysis   analyzes
numerical      numerous       noise      noisy
policy         police         sparse     sparsity

2.3.5 Sentence Segmentation


Sentence segmentation is another important step in text processing. The most use-
ful cues for segmenting a text into sentences are punctuation marks like periods,
question marks, and exclamation points. Question marks and exclamation points are relatively
unambiguous markers of sentence boundaries. Periods, on the other hand, are more
ambiguous. The period character “.” is ambiguous between a sentence boundary
marker and a marker of abbreviations like Mr. or Inc. The previous sentence that
you just read showed an even more complex case of this ambiguity, in which the final
period of Inc. marked both an abbreviation and the sentence boundary marker. For
this reason, sentence tokenization and word tokenization may be addressed jointly.
In general, sentence tokenization methods work by building a binary classifier
(based on a sequence of rules or on machine learning) that decides if a period is part
of the word or is a sentence-boundary marker. In making this decision, it helps to
know if the period is attached to a commonly used abbreviation; thus, an abbrevia-
tion dictionary is useful.
State-of-the-art methods for sentence tokenization are based on machine learning
and are introduced in later chapters.
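A bare-bones version of such a rule-plus-dictionary classifier might look like the sketch below; the abbreviation list is a tiny illustrative stand-in for a real abbreviation dictionary:

```python
# Toy rule-based sentence splitter: treat a final ".", "?", or "!" as a
# sentence boundary unless the token is a known abbreviation.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "inc.", "e.g.", "i.e."}

def split_sentences(text):
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "?", "!")) and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:                      # trailing material with no final period
        sentences.append(" ".join(current))
    return sentences
```

The dictionary lookup is what keeps *Mr.* and *Inc.* from triggering spurious boundaries; a period that is both an abbreviation marker and a sentence end (as in the *Inc.* example above) would still be missed by this simple sketch.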

2.4 Minimum Edit Distance


Much of natural language processing is concerned with measuring how similar two
strings are. For example, in spelling correction, the user typed some erroneous
string—let’s say graffe—and we want to know what the user meant. The user prob-
ably intended a word that is similar to graffe. Among candidate similar words,
the word giraffe, which differs by only one letter from graffe, seems intuitively
to be more similar than, say, grail or graf, which differ in more letters. Another
example comes from coreference, the task of deciding whether two strings such as
the following refer to the same entity:
Stanford President John Hennessy
Stanford University President John Hennessy

Again, the fact that these two strings are very similar (differing by only one word)
seems like useful evidence for deciding that they might be coreferent.
Edit distance gives us a way to quantify both of these intuitions about string sim-
ilarity. More formally, the minimum edit distance between two strings is defined
as the minimum number of editing operations (operations like insertion, deletion,
substitution) needed to transform one string into another.
The gap between intention and execution, for example, is 5 (delete an i, substi-
tute e for n, substitute x for t, insert c, substitute u for n). It’s much easier to see
this by looking at the most important visualization for string distances, an alignment
between the two strings, shown in Fig. 2.12. Given two sequences, an alignment is
a correspondence between substrings of the two sequences. Thus, we say I aligns
with the empty string, N with E, and so on. Beneath the aligned strings is another
representation; a series of symbols expressing an operation list for converting the
top string into the bottom string: d for deletion, s for substitution, i for insertion.

I N T E * N T I O N
| | | | | | | | | |
* E X E C U T I O N
d s s     i s

Figure 2.12 Representing the minimum edit distance between two strings as an alignment.
The final row gives the operation list for converting the top string into the bottom string: d for
deletion, s for substitution, i for insertion.

We can also assign a particular cost or weight to each of these operations. The
Levenshtein distance between two sequences is the simplest weighting factor in
which each of the three operations has a cost of 1 (Levenshtein, 1966)—we assume
that the substitution of a letter for itself, for example, t for t, has zero cost. The Lev-
enshtein distance between intention and execution is 5. Levenshtein also proposed
an alternative version of his metric in which each insertion or deletion has a cost of
1 and substitutions are not allowed. (This is equivalent to allowing substitution, but
giving each substitution a cost of 2 since any substitution can be represented by one
insertion and one deletion). Using this version, the Levenshtein distance between
intention and execution is 8.

2.4.1 The Minimum Edit Distance Algorithm


How do we find the minimum edit distance? We can think of this as a search task, in
which we are searching for the shortest path—a sequence of edits—from one string
to another.

                          i n t e n t i o n
          del ↙               ins ↓              ↘ subst
  n t e n t i o n     i n t e c n t i o n     i n x e n t i o n

Figure 2.13  Finding the edit distance viewed as a search problem

The space of all possible edits is enormous, so we can’t search naively. However,
lots of distinct edit paths will end up in the same state (string), so rather than recom-
puting all those paths, we could just remember the shortest path to a state each time

we saw it. We can do this by using dynamic programming. Dynamic programming
is the name for a class of algorithms, first introduced by Bellman (1957), that apply
a table-driven method to solve problems by combining solutions to sub-problems.
Some of the most commonly used algorithms in natural language processing make
use of dynamic programming, such as the Viterbi and forward algorithms (Chap-
ter 9) and the CKY algorithm for parsing (Chapter 12).
The intuition of a dynamic programming problem is that a large problem can
be solved by properly combining the solutions to various sub-problems. Consider
the shortest path of transformed words that represents the minimum edit distance
between the strings intention and execution shown in Fig. 2.14.

i n t e n t i o n
delete i
n t e n t i o n
substitute n by e
e t e n t i o n
substitute t by x
e x e n t i o n
insert u
e x e n u t i o n
substitute n by c
e x e c u t i o n
Figure 2.14 Path from intention to execution.

Imagine some string (perhaps it is exention) that is in this optimal path (whatever
it is). The intuition of dynamic programming is that if exention is in the optimal
operation list, then the optimal sequence must also include the optimal path from
intention to exention. Why? If there were a shorter path from intention to exention,
then we could use it instead, resulting in a shorter overall path, and the optimal
sequence wouldn’t be optimal, thus leading to a contradiction.
The minimum edit distance algorithm was named by Wagner and Fischer (1974)
but independently discovered by many people (summarized later, in the Historical
Notes section of Chapter 9).
Let’s first define the minimum edit distance between two strings. Given two
strings, the source string X of length n, and target string Y of length m, we’ll define
D(i, j) as the edit distance between X[1..i] and Y [1.. j], i.e., the first i characters of X
and the first j characters of Y . The edit distance between X and Y is thus D(n, m).
We’ll use dynamic programming to compute D(n, m) bottom up, combining so-
lutions to subproblems. In the base case, with a source substring of length i but an
empty target string, going from i characters to 0 requires i deletes. With a target
substring of length j but an empty source going from 0 characters to j characters
requires j inserts. Having computed D(i, j) for small i, j we then compute larger
D(i, j) based on previously computed smaller values. The value of D(i, j) is com-
puted by taking the minimum of the three possible paths through the matrix which
arrive there:

D[i, j] = min( D[i−1, j] + del-cost(source[i]),
               D[i, j−1] + ins-cost(target[j]),
               D[i−1, j−1] + sub-cost(source[i], target[j]) )

If we assume the version of Levenshtein distance in which the insertions and


deletions each have a cost of 1 (ins-cost(·) = del-cost(·) = 1), and substitutions have

a cost of 2 (except substitution of identical letters have zero cost), the computation
for D(i, j) becomes:

D[i, j] = min( D[i−1, j] + 1,
               D[i, j−1] + 1,
               D[i−1, j−1] + { 2  if source[i] ≠ target[j]          (2.2)
                               0  if source[i] = target[j] } )

The algorithm is summarized in Fig. 2.15; Fig. 2.16 shows the results of applying
the algorithm to the distance between intention and execution with the version of
Levenshtein in Eq. 2.2.

function M IN -E DIT-D ISTANCE(source, target) returns min-distance

n ← L ENGTH(source)
m ← L ENGTH(target)
Create a distance matrix D[n+1,m+1]

# Initialization: the zeroth row and column is the distance from the empty string
D[0,0] = 0
for each row i from 1 to n do
D[i,0] ← D[i-1,0] + del-cost(source[i])
for each column j from 1 to m do
D[0,j] ← D[0, j-1] + ins-cost(target[j])

# Recurrence relation:
for each row i from 1 to n do
for each column j from 1 to m do
D[i, j] ← M IN( D[i−1, j] + del-cost(source[i]),
D[i−1, j−1] + sub-cost(source[i], target[j]),
D[i, j−1] + ins-cost(target[j]))
# Termination
return D[n,m]

Figure 2.15 The minimum edit distance algorithm, an example of the class of dynamic
programming algorithms. The various costs can either be fixed (e.g., ∀x, ins-cost(x) = 1)
or can be specific to the letter (to model the fact that some letters are more likely to be in-
serted than others). We assume that there is no cost for substituting a letter for itself (i.e.,
sub-cost(x, x) = 0).
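A direct Python transcription of Fig. 2.15, with the Eq. 2.2 costs as defaults, might look like this sketch:

```python
def min_edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=2):
    """Dynamic-programming minimum edit distance (Fig. 2.15). Defaults use
    the Levenshtein costs of Eq. 2.2; substituting a letter for itself is free."""
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):                 # zeroth column: i deletions
        D[i][0] = D[i - 1][0] + del_cost
    for j in range(1, m + 1):                 # zeroth row: j insertions
        D[0][j] = D[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j] + del_cost,
                          D[i][j - 1] + ins_cost,
                          D[i - 1][j - 1] + sub)
    return D[n][m]
```

With the default substitution cost of 2, this returns 8 for *intention* and *execution*, matching Fig. 2.16; with sub_cost=1 it returns the simple Levenshtein distance of 5.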

Knowing the minimum edit distance is useful for algorithms like finding poten-
tial spelling error corrections. But the edit distance algorithm is important in another
way; with a small change, it can also provide the minimum cost alignment between
two strings. Aligning two strings is useful throughout speech and language process-
ing. In speech recognition, minimum edit distance alignment is used to compute
the word error rate (Chapter 31). Alignment plays a role in machine translation, in
which sentences in a parallel corpus (a corpus with a text in two languages) need to
be matched to each other.
To extend the edit distance algorithm to produce an alignment, we can start by
visualizing an alignment as a path through the edit distance matrix. Figure 2.17
shows this path with the boldfaced cell. Each boldfaced cell represents an alignment
of a pair of letters in the two strings. If two boldfaced cells occur in the same row,

Src\Tar # e x e c u t i o n
# 0 1 2 3 4 5 6 7 8 9
i 1 2 3 4 5 6 7 6 7 8
n 2 3 4 5 6 7 8 7 8 7
t 3 4 5 6 7 8 7 8 9 8
e 4 3 4 5 6 7 8 9 10 9
n 5 4 5 6 7 8 9 10 11 10
t 6 5 6 7 8 9 8 9 10 11
i 7 6 7 8 9 10 9 8 9 10
o 8 7 8 9 10 11 10 9 8 9
n 9 8 9 10 11 12 11 10 9 8
Figure 2.16 Computation of minimum edit distance between intention and execution with
the algorithm of Fig. 2.15, using Levenshtein distance with cost of 1 for insertions or dele-
tions, 2 for substitutions.

there will be an insertion in going from the source to the target; two boldfaced cells
in the same column indicate a deletion.
Figure 2.17 also shows the intuition of how to compute this alignment path. The
computation proceeds in two steps. In the first step, we augment the minimum edit
distance algorithm to store backpointers in each cell. The backpointer from a cell
points to the previous cell (or cells) that we came from in entering the current cell.
We’ve shown a schematic of these backpointers in Fig. 2.17, after a similar diagram
in Gusfield (1997). Some cells have multiple backpointers because the minimum
extension could have come from multiple previous cells. In the second step, we
perform a backtrace. In a backtrace, we start from the last cell (at the final row and
column), and follow the pointers back through the dynamic programming matrix.
Each complete path between the final cell and the initial cell is a minimum distance
alignment. Exercise 2.7 asks you to modify the minimum edit distance algorithm to
store the pointers and compute the backtrace to output an alignment.

      #     e      x      e      c      u      t      i      o      n
#     0    ←1     ←2     ←3     ←4     ←5     ←6     ←7     ←8     ←9
i    ↑1  ↖←↑2   ↖←↑3   ↖←↑4   ↖←↑5   ↖←↑6   ↖←↑7    ↖6     ←7     ←8
n    ↑2  ↖←↑3   ↖←↑4   ↖←↑5   ↖←↑6   ↖←↑7   ↖←↑8    ↑7   ↖←↑8    ↖7
t    ↑3  ↖←↑4   ↖←↑5   ↖←↑6   ↖←↑7   ↖←↑8    ↖7    ←↑8   ↖←↑9    ↑8
e    ↑4   ↖3     ←4    ↖←5     ←6     ←7    ←↑8   ↖←↑9  ↖←↑10    ↑9
n    ↑5    ↑4  ↖←↑5   ↖←↑6   ↖←↑7   ↖←↑8   ↖←↑9  ↖←↑10 ↖←↑11   ↖↑10
t    ↑6    ↑5  ↖←↑6   ↖←↑7   ↖←↑8   ↖←↑9    ↖8     ←9    ←10   ←↑11
i    ↑7    ↑6  ↖←↑7   ↖←↑8   ↖←↑9  ↖←↑10    ↑9     ↖8     ←9    ←10
o    ↑8    ↑7  ↖←↑8   ↖←↑9  ↖←↑10  ↖←↑11   ↑10     ↑9     ↖8     ←9
n    ↑9    ↑8  ↖←↑9  ↖←↑10 ↖←↑11  ↖←↑12   ↑11    ↑10     ↑9     ↖8
Figure 2.17 When entering a value in each cell, we mark which of the three neighboring
cells we came from with up to three arrows. After the table is full we compute an alignment
(minimum edit path) by using a backtrace, starting at the 8 in the lower-right corner and
following the arrows back. The sequence of bold cells represents one possible minimum cost
alignment between the two strings.
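Exercise 2.7 asks for exactly this extension; one possible sketch stores a single backpointer per cell and backtraces from the bottom-right corner (on ties it happens to prefer the diagonal move):

```python
def align(source, target):
    """Minimum-cost alignment via backpointers and a backtrace, using unit
    insert/delete costs and substitution cost 2. Returns (source, target)
    character pairs, with '*' marking insertions and deletions."""
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0], ptr[i][0] = i, "up"
    for j in range(1, m + 1):
        D[0][j], ptr[0][j] = j, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 2
            # min over (cost, direction) tuples; "diag" < "left" < "up"
            # alphabetically, so ties break toward the diagonal.
            D[i][j], ptr[i][j] = min((D[i - 1][j - 1] + sub, "diag"),
                                     (D[i][j - 1] + 1, "left"),
                                     (D[i - 1][j] + 1, "up"))
    # Backtrace: follow the pointers from the final cell to the origin.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if ptr[i][j] == "diag":
            pairs.append((source[i - 1], target[j - 1])); i, j = i - 1, j - 1
        elif ptr[i][j] == "up":
            pairs.append((source[i - 1], "*")); i -= 1     # deletion
        else:
            pairs.append(("*", target[j - 1])); j -= 1     # insertion
    return pairs[::-1]
```

Because we keep only one pointer per cell, this recovers a single minimum-cost alignment; storing all tied pointers, as in Fig. 2.17, would let us enumerate every optimal alignment.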

While we worked our example with simple Levenshtein distance, the algorithm
in Fig. 2.15 allows arbitrary weights on the operations. For spelling correction, for
example, substitutions are more likely to happen between letters that are next to
each other on the keyboard. We’ll discuss how these weights can be estimated in

Ch. 5. The Viterbi algorithm, for example, is an extension of minimum edit distance
that uses probabilistic definitions of the operations. Instead of computing the “mini-
mum edit distance” between two strings, Viterbi computes the “maximum probabil-
ity alignment” of one string with another. We’ll discuss this more in Chapter 9.

2.5 Summary
This chapter introduced a fundamental tool in language processing, the regular ex-
pression, and showed how to perform basic text normalization tasks including
word segmentation and normalization, sentence segmentation, and stemming.
We also introduced the important minimum edit distance algorithm for comparing
strings. Here’s a summary of the main points we covered about these ideas:
• The regular expression language is a powerful tool for pattern-matching.
• Basic operations in regular expressions include concatenation of symbols,
disjunction of symbols ([], |, and .), counters (*, +, and {n,m}), anchors
(^, $) and precedence operators ((,)).
• Word tokenization and normalization are generally done by cascades of
simple regular expression substitutions or finite automata.
• The Porter algorithm is a simple and efficient way to do stemming, stripping
off affixes. It does not have high accuracy but may be useful for some tasks.
• The minimum edit distance between two strings is the minimum number of
operations it takes to edit one into the other. Minimum edit distance can be
computed by dynamic programming, which also results in an alignment of
the two strings.

Bibliographical and Historical Notes


Kleene (1951) and (1956) first defined regular expressions and the finite automaton,
based on the McCulloch-Pitts neuron. Ken Thompson was one of the first to build
regular expressions compilers into editors for text searching (Thompson, 1968). His
editor ed included a command “g/regular expression/p”, or Global Regular Expres-
sion Print, which later became the Unix grep utility.
Text normalization algorithms have been applied since the beginning of the field.
One of the earliest widely-used stemmers was Lovins (1968). Stemming was also
applied early to the digital humanities, by Packard (1973), who built an affix-stripping
morphological parser for Ancient Greek. Currently a wide variety of code for tok-
enization and normalization is available, such as the Stanford Tokenizer
(http://nlp.stanford.edu/software/tokenizer.shtml) or specialized tokenizers for
Twitter (O’Connor et al., 2010), or for sentiment
(http://sentiment.christopherpotts.net/tokenizing.html). See Palmer (2012) for a
survey of text preprocessing.
While the max-match algorithm we describe is commonly used as a segmentation
baseline in languages like Chinese, higher accuracy algorithms like the Stanford
CRF segmenter are based on sequence models; see Tseng et al. (2005a) and Chang
et al. (2008). NLTK is an essential tool that offers both useful Python libraries
(http://www.nltk.org) and textbook descriptions (Bird et al., 2009) of many
algorithms including text normalization and corpus interfaces.

For more on Herdan’s law and Heaps’ Law, see Herdan (1960, p. 28), Heaps
(1978), Egghe (2007) and Baayen (2001); Yasseri et al. (2012) discuss the relation-
ship with other measures of linguistic complexity. For more on edit distance, see the
excellent Gusfield (1997). Our example measuring the edit distance from ‘intention’
to ‘execution’ was adapted from Kruskal (1983). There are various publicly avail-
able packages to compute edit distance, including Unix diff and the NIST sclite
program (NIST, 2005).
In his autobiography Bellman (1984) explains how he originally came up with
the term dynamic programming:
“...The 1950s were not good years for mathematical research. [the]
Secretary of Defense ...had a pathological fear and hatred of the word,
research... I decided therefore to use the word, “programming”. I
wanted to get across the idea that this was dynamic, this was multi-
stage... I thought, let’s ... take a word that has an absolutely precise
meaning, namely dynamic... it’s impossible to use the word, dynamic,
in a pejorative sense. Try thinking of some combination that will pos-
sibly give it a pejorative meaning. It’s impossible. Thus, I thought
dynamic programming was a good name. It was something not even a
Congressman could object to.”

Exercises
2.1 Write regular expressions for the following languages.
1. the set of all alphabetic strings;
2. the set of all lower case alphabetic strings ending in a b;
3. the set of all strings from the alphabet a, b such that each a is immedi-
ately preceded by and immediately followed by a b;
2.2 Write regular expressions for the following languages. By “word”, we mean
an alphabetic string separated from other words by whitespace, any relevant
punctuation, line breaks, and so forth.
1. the set of all strings with two consecutive repeated words (e.g., “Hum-
bert Humbert” and “the the” but not “the bug” or “the big bug”);
2. all strings that start at the beginning of the line with an integer and that
end at the end of the line with a word;
3. all strings that have both the word grotto and the word raven in them
(but not, e.g., words like grottos that merely contain the word grotto);
4. write a pattern that places the first word of an English sentence in a
register. Deal with punctuation.
2.3 Implement an ELIZA-like program, using substitutions such as those described
on page 18. You may choose a different domain than a Rogerian psychologist,
if you wish, although keep in mind that you would need a domain in which
your program can legitimately engage in a lot of simple repetition.
2.4 Compute the edit distance (using insertion cost 1, deletion cost 1, substitution
cost 1) of “leda” to “deal”. Show your work (using the edit distance grid).
2.5 Figure out whether drive is closer to brief or to divers and what the edit dis-
tance is to each. You may use any version of distance that you like.

2.6 Now implement a minimum edit distance algorithm and use your hand-computed
results to check your code.
2.7 Augment the minimum edit distance algorithm to output an alignment; you
will need to store pointers and add a stage to compute the backtrace.
2.8 Implement the MaxMatch algorithm.
2.9 To test how well your MaxMatch algorithm works, create a test set by remov-
ing spaces from a set of sentences. Implement the Word Error Rate metric (the
number of word insertions + deletions + substitutions, divided by the length
in words of the correct string) and compute the WER for your test set.
CHAPTER 3
Finite State Transducers

Placeholder
CHAPTER 4
Language Modeling with N-grams
“You are uniformly charming!” cried he, with a smile of associating and now
and then I bowed and they perceived a chaise and four to wish for.
Random sentence generated from a Jane Austen trigram model

Being able to predict the future is not always a good thing. Cassandra of Troy had
the gift of foreseeing but was cursed by Apollo that her predictions would never be
believed. Her warnings of the destruction of Troy were ignored and to simplify, let’s
just say that things just didn’t go well for her later.
In this chapter we take up the somewhat less fraught topic of predicting words.
What word, for example, is likely to follow
Please turn your homework ...
Hopefully, most of you concluded that a very likely word is in, or possibly over,
but probably not refrigerator or the. In the following sections we will formalize
this intuition by introducing models that assign a probability to each possible next
word. The same models will also serve to assign a probability to an entire sentence.
Such a model, for example, could predict that the following sequence has a much
higher probability of appearing in a text:
all of a sudden I notice three guys standing on the sidewalk

than does this same set of words in a different order:

on guys all I of notice sidewalk three a sudden standing the

Why would you want to predict upcoming words, or assign probabilities to sen-
tences? Probabilities are essential in any task in which we have to identify words
in noisy, ambiguous input, like speech recognition or handwriting recognition. In
the movie Take the Money and Run, Woody Allen tries to rob a bank with a sloppily
written hold-up note that the teller incorrectly reads as “I have a gub”. As Rus-
sell and Norvig (2002) point out, a language processing system could avoid making
this mistake by using the knowledge that the sequence “I have a gun” is far more
probable than the non-word “I have a gub” or even “I have a gull”.
In spelling correction, we need to find and correct spelling errors like Their
are two midterms in this class, in which There was mistyped as Their. A sentence
starting with the phrase There are will be much more probable than one starting with
Their are, allowing a spellchecker to both detect and correct these errors.
Assigning probabilities to sequences of words is also essential in machine trans-
lation. Suppose we are translating a Chinese source sentence:
他 向 记者 介绍了 主要 内容
He to reporters introduced main content
Other documents randomly have
different content
“To bestow your attention in company, upon trifling singularities in
the dress, person, or manners of others, is spending your time to
little purpose. From such a practice you can derive neither pleasure
nor profit; but must unavoidably subject yourselves to the
imputation of incivility and malice.”

Thursday, P. M.
AMUSEMENTS.
“Amusement is impatiently desired, and eagerly sought by young
ladies in general. Forgetful that the noblest entertainment arises
from a placid and well cultivated mind, too many fly from
themselves, from thought and reflection, to fashionable dissipation,
or what they call pleasure, as a mean of beguiling the hours which
solitude and retirement render insupportably tedious.
“An extravagant fondness for company and public resorts is
incompatible with those domestic duties, the faithful discharge of
which ought to be the prevailing object of the sex. In the indulgence
of this disposition, the mind is enervated, and the manners
corrupted, till all relish for those enjoyments, which being simple and
natural, are best calculated to promote health, innocence, and social
delight, is totally lost.
“It is by no means amiss for youth to seek relaxation from severer
cares and labors, in a participation of diversions, suited to their age,
sex, and station in life. But there is great danger of their lively
imaginations’ hurrying them into excess, and detaching their
affections from the ennobling acquisitions of moral improvement,
and refined delicacy. Guard, then against those amusements which
have the least tendency to sully the purity of your minds.
“Loose and immoral books; company, whose manners are licentious,
however gay and fashionable; conversation which is even tinctured
with profaneness or obscenity; plays in which the representation is
immodest, and offensive to the ear of chastity; indeed, pastimes of
every description, from which no advantage can be derived, should
not be countenanced; much less applauded. Why should those
things afford apparent satisfaction in a crowd which would call forth
the blush of indignation in more private circles? This question is
worthy the serious attention of those ladies, who at the theatre, can
hardly restrain their approbation of expressions and actions, which
at their houses, would be intolerably rude and indecent, in their
most familiar friends!
“Cards are so much the taste of the present day, that to caution my
pupils against the too frequent use of them may be thought old
fashioned in the extreme. I believe it, however, to be a fascinating
game, which occupies the time, without yielding any kind of pleasure
or profit. As the satirist humorously observes,
“The love of gaming is the worst of ills;
With ceaseless storms the blacken’d soul it fills;
Inveighs at Heaven, neglects the ties of blood;
Destroys the power and will of doing good;
Kills health, pawns honor, plunges in disgrace;
And, what is still more dreadful—spoils your face.”

“One thing at least is certain; it entirely excludes all rational
conversation. That delightful interchange of sentiment, which the
social meeting of friends is calculated to afford and from which many
advantages might be derived, is utterly excluded.
“Reading, writing, drawing, needle-work, dancing, music, walking,
riding, and conversation are amusements well adapted to yield
pleasure and utility. From either of these, within proper bounds,
there is no danger of injury to the person or mind; though to render
even our diversions agreeable, they must be enjoyed with
moderation, and variously and prudently conducted. Such as are
peculiarly exhilarating to the spirits, however innocent in themselves,
should be more cautiously and sparingly indulged.
“When once the mind becomes too much relaxed by dissipating
pastimes, it is proportionably vitiated, and negligent of those nice
attentions to the rules of reserve and decorum which ought never to
be suspended. Intoxicating is the full draught of pleasure to the
youthful mind; and fatal are the effects of unrestrained passions.
“Flavia was the daughter of a gentleman, whose political principles
obliged him to leave his country at the commencement of the
American revolution. At that time she was at nurse in a neighboring
village; between which and the metropolis all communication being
cut off, he was reduced to the necessity of leaving her to the mercy
of those to whom she was entrusted. Having received her from
pecuniary motives only, they no sooner found themselves deprived
of the profits of their labor and care, than they sought relief by an
application to the town for her support. A wealthy farmer in the
vicinity, who had often seen and been pleased with the dawning
charms of Flavia, pitied her condition, and having no children of his
own, resolved to shelter her from the impending storm, till she could
be better provided for. At his house she was brought up in a homely,
though comfortable manner. The good man and his wife were
excessively fond of her, and gave her every instruction and
advantage in their power. Plain truths were liberally inculcated, and
every exertion made to give her a habit of industry and good nature.
Flavia requited their kindness by an obliging and cheerful, a docile
and submissive deportment. As she advanced in years, she
increased in beauty. Her amiable disposition rendered her beloved,
and her personal accomplishments made her admired by all the
village swains. The approbating smile of Flavia was the reward of
their toils, and the favor of her hand in the rustic dance was
emulously sought.
“In this state, Flavia was happy. Health and innocence were now her
portion; nor had ambition as yet taught her to sigh for pleasure
beyond the reach of her attainment.
“But the arrival of her father, who had been permitted to return, and
re-possess the estate which he had abandoned, put a period to the
simplicity and peace of Flavia’s mind. He sought and found her; and
though sensible of his obligations to her foster-parents for snatching
her from want and distress, still he could not prevail on himself to
make so great a sacrifice to gratitude as they wished, by permitting
his daughter to spend her days in obscurity. The lively fancy of Flavia
was allured by the splendid promises and descriptions of her father;
and she readily consented to leave the friends of her childhood and
youth, and explore the walks of fashionable life.
“When she arrived in town, what new scenes opened upon the
dazzled eyes of the admiring, and admired Flavia!
“Wealth, with its attendant train of splendid forms and ceremonies,
courted her attention, and every species of dissipating amusement,
sanctioned by the name of pleasure, beguiled the hours and
charmed the imagination of the noviciate. Each enchanting scene
she painted to herself in the brightest colours; and her experienced
heart promised her happiness without allay. Flattery gave her a
thousand charms which she was hitherto unconscious of possessing,
and the obsequiousness of the gaudy train around raised her vanity
to the highest pitch of arrogance and pride. Behold Flavia, now,
launched into the whirlpool of fashionable folly. Balls, plays, cards,
and parties engross every portion of her time.
“Her father saw, too late, the imprudence of his unbounded
indulgence; and his egregious mistake, in so immediately reversing
her mode of life, without first furnishing her mind with sufficient
knowledge and strength to repel temptation. He endeavored to
regulate and restrain her conduct; but in vain. She complained of
this, as an abridgment of her liberty, and took advantage of his
doating fondness to practise every excess. Involved in expenses (of
which losses at play composed a considerable part) beyond her
power to defray, in this embarrassing dilemma, she was reduced to
the necessity of accepting the treacherous offer of Marius to advance
money for the support of her extravagance. Obligated by his
apparent kindness, she could not refuse the continuance of his
acquaintance, till his delusive arts had obtained the reward he
proposed to himself, in the sacrifice of her honor. At length she
awoke to a trembling sense of her guilt, and found it fatal to her
peace, reputation, and happiness.
“Wretched Flavia! no art could conceal thy shame! The grief of her
mind, her retirement from company, and the alteration in her
appearance, betrayed her to her father’s observation. Highly
incensed at the ingratitude and baseness of her conduct, he refused
to forgive her; but sent her from the ensnaring pleasures of the
town, to languish out the remainder of life in solitude and obscurity.”

Friday, A. M.
FILIAL AND FRATERNAL AFFECTION.
“The filial and fraternal are the first duties of a single state. The
obligations you are under to your parents cannot be discharged, but
by a uniform and cheerful obedience; an unreserved and ready
compliance with their wishes, added to the most diligent attention to
their ease and happiness. The virtuous and affectionate behaviour of
children is the best compensation, in their power, for that unwearied
care and solicitude which parents, only, know. Upon daughters,
whose situation and employments lead them more frequently into
scenes of domestic tenderness; who are often called to smooth the
pillow of sick and aged parents, and to administer with a skilful and
delicate hand the cordial restorative to decaying nature, an
endearing sensibility, and a dutiful acquiescence in the dispositions,
and even peculiarities of those from whom they have derived
existence, are indispensably incumbent.
“Such a conduct will yield a satisfaction of mind more than
equivalent to any little sacrifices of inclination or humour which may
be required at your hands.
“Pope, among all his admired poetry, has not six lines more
beautifully expressive than the following:
“Me, let the pious office long engage,
To rock the cradle of declining age;
With lenient arts extend a mother’s breath,
Make languor smile, and smooth the bed of death;
Explore the thought, explain the asking eye,
And keep awhile one parent from the sky!”

“Next in rank and importance to filial piety, is fraternal love. This is a
natural affection which you cannot too assiduously cultivate. How
delightful to see children of the same family dwell together in unity;
promoting each other’s welfare, and emulous only to excel in acts of
kindness and good will. Between brothers and sisters the connexion
is equally intimate and endearing. There is such a union of interests,
and such an undivided participation of enjoyments, that every
sensible and feeling mind must value the blessings of family
friendship and peace.
“Strive, therefore, my dear pupils, to promote them, as objects
which deserve your particular attention; as attainments which will
not fail richly to reward your labour.
“Prudelia, beside other amiable endowments of person and mind,
possessed the most lively sensibility, and ardent affections.
“The recommendations of her parents, united to her own wishes,
had induced her to give her hand to Clodius, a gentleman of
distinguished merit. He was a foreigner; and his business required
his return to his native country.
“Prudelia bid a reluctant adieu to her friends, and embarked with
him. She lived in affluence, and was admired and caressed by all
that knew her, while a lovely family was rising around her. Yet these
pleasing circumstances and prospects could not extinguish or
alienate that affection, which still glowed in her breast for the
natural guardians and companions of her childhood and youth.
“With the deepest affliction she heard the news of her father’s
death, and the embarrassed situation in which he had left his affairs.
She was impatient to console her widowed mother, and to minister
to her necessities. For these purposes, she prevailed on her husband
to consent that she should visit her, though it was impossible for him
to attend her. With all the transport of dutiful zeal, she flew to the
arms of her bereaved parent. But how great was her astonishment
and grief, when told that her only sister had been deluded by an
affluent villain, and by his insidious arts, seduced from her duty, her
honor, and her home! The emotions of pity, indignation, regret, and
affection, overwhelmed her, at first; but recollecting herself, and
exerting all her fortitude, she nobly resolved, if possible, to snatch
the guilty, yet beloved Myra, from ruin, rather than revenge her
injured family by abandoning her to the infamy she deserved. To this
intent she wrote her a pathetic letter, lamenting her elopement, but
entreating her, notwithstanding, to return and receive her fraternal
embrace. But Myra, conscious of her crime, and unworthiness of her
sister’s condescension and kindness, and above all, dreading the
superiority of her virtue, refused the generous invitation. Prudelia
was not thus to be vanquished in her benevolent undertaking. She
even followed her to her lodgings, and insisted on an interview. Here
she painted, in the most lively colours, the heinousness of her
offence, and the ignominy and wretchedness that awaited her. Her
affection allured, her reasoning convinced her backsliding sister.
Upon the promise of forgiveness from her mother, Myra consented to
leave her infamous paramour, and re-trace the paths of rectitude
and virtue.
“Her seducer was absent on a journey. She, therefore, wrote him a
farewell letter, couched in terms of sincere penitence for her
transgression, and determined resolution of amendment in future,
and left the house. Thus restored and reconciled to her friends, Myra
appeared in quite another character.
“Prudelia tarried with her mother till she had adjusted her affairs,
and seen her comfortably settled and provided for. Then taking her
reclaimed sister with her, she returned to her anxiously expecting
family. The uprightness and modesty of Myra’s conduct, ever after,
rendered her universally esteemed, though the painful consciousness
of her defection was never extinguished in her own bosom.
“A constant sense of her past misconduct depressed her spirits, and
cast a gloom over her mind; yet she was virtuous, though pensive,
during the remainder of her life.
“With this, and other salutary effects in view, how necessary, how
important are filial and fraternal affection!”

Friday, P. M.
FRIENDSHIP.
“Friendship is a term much insisted on by young people; but, like
many others more frequently used than understood. A friend, with
girls in general, is an intimate acquaintance, whose taste and
pleasures are similar to their own; who will encourage, or at least
connive at their foibles and faults, and communicate with them
every secret; in particular those of love and gallantry, in which those
of the other sex are concerned. By such friends their errors and
stratagems are flattered and concealed, while the prudent advice of
real friendship is neglected, till they find too late, how fictitious a
character, and how vain a dependence they have chosen.
“Augusta and Serena were educated at the same school, resided in
the same neighborhood, and were equally volatile in their tempers,
and dissipated in their manners. Hence every plan of amusement
was concerted and enjoyed together. At the play, the ball, the card-
table and every other party of pleasure, they were companions.
“Their parents saw that this intimacy strengthened the follies of
each; and strove to disengage their affections, that they might turn
their attention to more rational entertainments, and more judicious
advisers. But they gloried in their friendship, and thought it a
substitute for every other virtue. They were the dupes of adulation,
and the votaries of coquetry.
“The attentions of a libertine, instead of putting them on their guard
against encroachments, induced them to triumph in their fancied
conquests, and to boast of resolution sufficient to shield them from
delusion.
“Love, however, which with such dispositions, is the pretty play-thing
of imagination, assailed the tender heart of Serena. A gay youth,
with more wit than sense, more show than substance, more art than
honesty, took advantage of her weakness to ingratiate himself into
her favour, and persuade her they could not live without each other.
Augusta was the confident of Serena. She fanned the flame, and
encouraged her resolution of promoting her own felicity, though at
the expense of every other duty. Her parents suspected her amour,
remonstrated against the man, and forbade her forming any
connexion with him, on pain of their displeasure. She apparently
acquiesced; but flew to Augusta for counsel and relief. Augusta
soothed her anxiety, and promised to assist her in the
accomplishment of all her wishes. She accordingly contrived means
for a clandestine intercourse, both personal and epistolary.
“Aristus was a foreigner, and avowed his purpose of returning to his
native country, urging her to accompany him. Serena had a fortune,
independent of her parents, left her by a deceased relation. This,
with her hand, she consented to give to her lover, and to quit a
country, in which she acknowledged but one friend. Augusta praised
her fortitude, and favored her design. She accordingly eloped, and
embarked. Her parents were almost distracted by her imprudent and
undutiful conduct, and their resentment fell on Augusta, who had
acted contrary to all the dictates of integrity and friendship, in
contributing to her ruin; for ruin it proved. Her ungrateful paramour,
having rioted on the property which she bestowed, abandoned her
to want and despair. She wrote to her parents, but received no
answer. She represented her case to Augusta, and implored relief
from her friendship; but Augusta alleged that she had already
incurred the displeasure of her family on her account and chose not
again to subject herself to censure by the same means.
“Serena at length returned to her native shore, and applied in
person to Augusta, who coolly told her that she wished no
intercourse with a vagabond, and then retired. Her parents refused
to receive her into their house; but from motives of compassion and
charity, granted her a small annuity, barely sufficient to keep her and
her infant from want.
“Too late she discovered her mistaken notions of friendship; and
learned by sad experience, that virtue must be its foundation, or
sincerity and constancy can never be its reward.
“Sincerity and constancy are essential ingredients in virtuous
friendship. It invariably seeks the permanent good of its object; and
in so doing, will advise, caution and reprove, with all the frankness
of undissembled affection. In the interchange of genuine friendship,
flattery is utterly excluded. Yet, even in the most intimate
connexions of this kind, a proper degree of respect, attention and
politeness must be observed. You are not so far to presume on the
partiality of friendship, as to hazard giving offence, and wounding
the feelings of persons, merely because you think their regard for
you will plead your excuse, and procure your pardon. Equally
cautious should you be, of taking umbrage at circumstances which
are undesignedly offensive.
“Hear the excellent advice of the wise son of Sirach, upon this
subject:
“Admonish thy friend; it may be he hath not done it; and if he have
done it, that he do it no more. Admonish thy friend; it may be he
hath not said it; and if he have, that he speak it not again. Admonish
thy friend; for many times it is a slander; and believe not every tale.
There is one that slippeth in his speech, but not from his heart; and
who is he that offendeth not with his tongue?”
“Be not hasty in forming friendships; but deliberately examine the
principles, disposition, temper and manners, of the person you wish
to sustain this important character. Be well assured that they are
agreeable to your own, and such as merit your entire esteem and
confidence, before you denominate her your friend. You may have
many general acquaintances, with whom you are pleased and
entertained; but in the chain of friendship there is a still closer link.
“Reserve will wound it, and distrust destroy,
Deliberate on all things with thy friend:
But since friends grow not thick on every bough
Nor ev’ry friend unrotten at the core,
First on thy friend, deliberate with thyself:
Pause, ponder, first: not eager in the choice,
Nor jealous of the chosen: fixing, fix:
Judge before friendship: then confide till death.”

“But if you would have friends, you must show yourselves friendly;
that is, you must be careful to act the part you wish from another. If
your friend have faults, mildly and tenderly represent them to her;
but conceal them as much as possible from the observation of the
world. Endeavor to convince her of her errors, to rectify her
mistakes, and to confirm and increase every virtuous sentiment.
“Should she so far deviate, as to endanger her reputation and
happiness; and should your admonitions fail to reclaim her, become
not, like Augusta, an abettor of her crimes. It is not the part of
friendship to hide transactions which will end in the ruin of your
friend. Rather acquaint those who ought to have the rule over her of
her intended missteps, and you will have discharged your duty; you
will merit, and very probably may afterwards receive her thanks.
“Narcissa and Florinda were united in the bonds of true and
generous friendship. Narcissa was called to spend a few months with
a relation in the metropolis, where she became acquainted with, and
attached to a man who was much her inferior; but whose specious
manners and appearance deceived her youthful heart, though her
reason and judgment informed her, that her parents would
disapprove the connexion. She returned home, the consciousness of
her fault, the frankness which she owed to her friend, and her
partiality to her lover, wrought powerfully upon her mind, and
rendered her melancholy. Florinda soon explored the cause, and
warmly remonstrated against her imprudence in holding a moment’s
intercourse with a man, whom she knew, would be displeasing to
her parents. She searched out his character, and found it far
inadequate to Narcissa’s merit. This she represented to her in its
true colours, and conjured her not to sacrifice her reputation, her
duty and her happiness, by encouraging his addresses; but to no
purpose were her expostulations. Narcissa avowed the design of
permitting him to solicit the consent of her parents, and the
determination of marrying him without it, if they refused.
“Florinda was alarmed at this resolution; and, with painful anxiety,
saw the danger of her friend. She told her plainly, that the regard
she had for her demanded a counteraction of her design; and that if
she found no other way of preventing its execution, she should
discharge her duty by informing her parents of her proceedings. This
Narcissa resented, and immediately withdrew her confidence and
familiarity; but the faithful Florinda neglected not the watchful
solicitude of friendship; and when she perceived that Narcissa’s
family were resolutely opposed to her projected match and that
Narcissa was preparing to put her rash purpose into execution, she
made known the plan which she had concerted and by that mean
prevented her destruction. Narcissa thought herself greatly injured,
and declared that she would never forgive so flagrant a breach of
fidelity. Florinda endeavoured to convince her of her good intentions,
and the real kindness of her motives; but she refused to hear the
voice of wisdom, till a separation from her lover, and a full proof of
his unworthiness opened her eyes to a sight of her own folly and
indiscretion, and to a lively sense of Florinda’s friendship, in saving
her from ruin without her consent. Her heart overflowed with
gratitude to her generous preserver. She acknowledged herself
indebted to Florinda’s benevolence, for deliverance from the baneful
impetuosity of her own passions. She sought and obtained
forgiveness; and ever after lived in the strictest amity with her
faithful benefactress.”

Saturday, A. M.
LOVE.
“The highest state of friendship which this life admits, is in the
conjugal relation. On this refined affection, love, which is but a more
interesting and tender kind of friendship, ought to be founded. The
same virtues, the same dispositions and qualities which are
necessary in a friend, are still more requisite in a companion for life.
And when these enlivening principles are united, they form the basis
of durable happiness. But let not the mask of friendship, or of love,
deceive you. You are now entering upon a new stage of action
where you will probably admire, and be admired. You may attract
the notice of many, who will select you as objects of adulation, to
discover their taste and gallantry; and perhaps of some whose
affections you have really and seriously engaged. The first class your
penetration will enable you to detect; and your good sense and
virtue will lead you to treat them with the neglect they deserve. It is
disreputable for a young lady to receive and encourage the officious
attentions of those mere pleasure-hunters, who rove from fair to fair,
with no other design than the exercise of their art, addresses, and
intrigue. Nothing can render their company pleasing, but a vanity of
being caressed, and a false pride in being thought an object of
general admiration, with a fondness for flattery which bespeaks a
vitiated mind. But when you are addressed by a person of real merit,
who is worthy your esteem and may justly demand your respect, let
him be treated with honor, frankness and sincerity. It is the part of a
prude, to affect a shyness, reserve, and indifference, foreign to the
heart. Innocence and virtue will rise superior to such little arts, and
indulge no wish which needs disguise.
“Still more unworthy are the insidious and deluding wiles of the
coquette. How disgusting must this character appear to persons of
sentiment and integrity! how unbecoming the delicacy and dignity of
an uncorrupted female!
“As you are young and inexperienced, your affections may possibly
be involuntarily engaged, where prudence and duty forbid a
connexion. Beware, then how you admit the passion of love. In
young minds, it is of all others the most uncontrollable. When fancy
takes the reins, it compels its blinded votary to sacrifice reason,
discretion and conscience to its impetuous dictates. But a passion of
this origin tends not to substantial and durable happiness. To secure
this, it must be quite of another kind, enkindled by esteem, founded
on merit, strengthened by congenial dispositions and corresponding
virtues, and terminating in the most pure and refined affection.
“Never suffer your eyes to be charmed by the mere exterior; nor
delude yourselves with the notion of unconquerable love. The eye, in
this respect, is often deceptious, and fills the imagination with
charms which have no reality. Nip, in the bud, every particular liking,
much more all ideas of love, till called forth by unequivocal tokens as
well as professions of sincere regard. Even then, harbor them not
without a thorough knowledge of the temper, disposition and
circumstances of your lover, the advice of your friends; and, above
all the approbation of your parents. Maturely weigh every
consideration for and against, and deliberately determine with
yourselves, what will be most conducive to your welfare and felicity
in life. Let a rational and discreet plan of thinking and acting,
regulate your deportment, and render you deserving of the affection
you wish to insure. This you will find far more conducive to your
interest, than the indulgence of that romantic passion, which a blind
and misguided fancy paints in such alluring colors to the thoughtless
and inexperienced.
“Recollect the favourite air you so often sing:
“Ye fair, who would be blessed in love,
Take your pride a little lower:
Let the swain that you approve,
Rather like you than adore.

Love that rises into passion,
Soon will end in hate or strife:
But from tender inclination
Flow the lasting joys of life.”

“I by no means undervalue that love which is the noblest principle of
the human mind; but wish only to guard you against the influence of
an ill-placed and ungovernable passion, which is improperly called by
this name.
“A union, formed without a refined and generous affection for its
basis, must be devoid of those tender endearments, reciprocal
attentions, and engaging sympathies, which are peculiarly necessary
to alleviate the cares, dispel the sorrows, and soften the pains of life.
The exercise of that prudence and caution which I have
recommended, will lead you to a thorough investigation of the
character and views of the man by whom you are addressed.
“Without good principles, both of religion and morality, (for the latter
cannot exist independent of the former) you can not safely rely,
either upon his fidelity or his affection. Good principles are the
foundation of a good life.
“If the fountain be pure, the streams which issue from it will be of
the same description.
“Next to this, an amiable temper is essentially requisite. A proud, a
passionate, a revengeful, a malicious, or a jealous temper, will
render your lives uncomfortable, in spite of all the prudence and
fortitude you can exert.
“Beware, then, lest, before marriage, love blind your eyes to those
defects, to a sight of which, grief and disappointment may awaken
you afterwards. You are to consider marriage as a connexion for life;
as the nearest and dearest of all human relations; as involving in it
the happiness or misery of all your days; and as engaging you in a
variety of cares and duties, hitherto unknown. Act, therefore, with
deliberation, and resolve with caution; but, when once you come to
a choice, behave with undeviating rectitude and sincerity.
“Avarice is not commonly a ruling passion in young persons of our
sex. Yet some there are, sordid enough to consider wealth as the
chief good, and to sacrifice every other object to a splendid
appearance. It often happens, that these are miserably disappointed
in their expectations of happiness. They find, by dear bought
experience, that external pomp is but a wretched substitute for
internal satisfaction.
“But I would not have outward circumstances entirely overlooked. A
proper regard should always be had to a comfortable subsistence in
life. Nor can you be justified in suffering a blind passion, under
whatever pretext, to involve you in those embarrassing distresses of
want, which will elude the remedies of love itself, and prove fatal to
the peace and happiness at which you aim.
“In this momentous affair, let the advice and opinion of judicious
friends have their just weight in your minds. Discover, with candor
and frankness, the progress of your amour, so far as is necessary to
enable them to judge aright in the cause; but never relate the love
tales of your suitor, merely for your own, or any other person’s
amusement. The tender themes inspired by love, may be pleasing to
you; but to an uninterested person, must be insipid and disgusting in
the extreme.
“Never boast of the number, nor of the professions of your admirers.
That betrays an unsufferable vanity, and will render you perfectly
ridiculous in the estimation of observers. Besides, it is a most
ungenerous treatment of those who may have entertained, and
expressed a regard for you. Whatever they have said upon this
subject, was doubtless in confidence, and you ought to keep it
sacred, as a secret you have no right to divulge.
“If you disapprove the person, and reject his suit, that will be
sufficiently mortifying, without adding the insult of exposing his
overtures.
“Be very careful to distinguish real lovers from mere gallants. Think
not every man enamoured with you, who is polite and attentive. You
have no right to suppose any man in love with you, till he declares it
in plain, unequivocal and decent terms.
“Never suffer, with impunity, your ear to be wounded by indelicate
expressions, double entendres, and insinuating attempts to seduce
you from the path of rectitude. True love will not seek to degrade its
object, much less to undermine that virtue which ought to be its
basis and support. Let no protestations induce you to believe that
person your friend, who would destroy your dearest interests, and
rob you of innocence and peace. Give no heed to the language of
seduction; but repel the insidious arts of the libertine, with the
dignity and decision of insulted virtue. This practice will raise you
superior to the wiles of deceivers, and render you invulnerable by
the specious flattery of the unprincipled and debauched.
“Think not the libertine worthy of your company and conversation
even as an acquaintance.
“That reformed rakes make the best husbands,” is a common, and I
am sorry to say, a too generally received maxim. Yet I cannot
conceive, that any lady who values, or properly considers her own
happiness, will venture on the dangerous experiment. The term
reformed can, in my opinion, have very little weight; since those,
whose principles are vitiated, and whose minds are debased by a
course of debauchery and excess, seldom change their pursuits, till
necessity, or interest requires it; and, however circumstances may
alter or restrain their conduct, very little dependence can be placed
on men whose disposition is still the same, but only prevented from
indulgence by prudential motives. As a rake is most conversant with
the dissolute and abandoned of both sexes, he doubtless forms his
opinion of others by the standard to which he has been accustomed,
and therefore supposes all women of the same description. Having
been hackneyed in the arts of the baser sort, he cannot form an
idea, that any are in reality superior to them. This renders him
habitually jealous, peevish and tyrannical. Even if his vicious
inclinations be changed, his having passed his best days in vice and
folly, renders him a very unsuitable companion for a person of
delicacy and refinement.
“But whatever inducements some ladies may have to risk themselves
with those who have the reputation of being reformed, it is truly
surprising that any should be so inconsiderate as to unite with such
as are still professed libertines. What hopes of happiness can be
formed with men of this character?
“Vice and virtue can never assimilate; and hearts divided by them
can never coalesce. The former is the parent of discord, disease and
death; the latter, of harmony, health and peace. A house divided
against itself cannot stand; much less can domestic felicity subsist
between such contrasted dispositions.
“But however negligent or mistaken many women of real merit may
be, relative to their own interest, I cannot but wish they would pay
some regard to the honor and dignity of their sex. Custom only has
rendered vice more odious in a woman than in a man. And shall we
give our sanction to a custom, so unjust and destructive in its
operation; a custom which invites and encourages the enemies of
society to seek our ruin? Were those who glory in the seduction of
innocence, to meet with the contempt they deserve, and to be
pointedly neglected by every female of virtue, they would be
ashamed of their evil practices, and impelled to relinquish their
injurious designs.
“But while they are received and caressed in the best companies,
they find restraint altogether needless; and their being men of spirit
and gallantry (as they style themselves) is rather a recommendation
than a reproach!
“I cannot help blushing with indignation, when I see a lady of sense
and character gallanted and entertained by a man who ought to be
banished from society, for having ruined the peace of families, and
blasted the reputation of many, who but for him, might have been
useful and happy in the world; but who by his insidious arts, are
plunged into remediless insignificance, disgrace and misery.”

Saturday, P. M.
RELIGION.
“Having given you my sentiments on a variety of subjects which
demand your particular attention, I come now to the closing and
most important theme; and that is religion. The virtuous education
you have received, and the good principles which have been instilled
into your minds from infancy, will render the enforcement of
Christian precepts and duties a pleasing lesson.
“Religion is to be considered as an essential and durable object; not
as the embellishment of a day; but an acquisition which shall endure
and increase through the endless ages of eternity.
“Lay the foundation of it in youth, and it will not forsake you in
advanced age; but furnish you with an adequate substitute for the
transient pleasures which will then desert you, and prove a source of
rational and refined delight: a refuge from the disappointments and
corroding cares of life, and from the depressions of adverse events.
“Remember now your creator, in the days of your youth, while the
evil days come not, nor the years draw nigh, when you shall say we
have no pleasure in them.” If you wish for permanent happiness,
cultivate the divine favour as your highest enjoyment in life, and
your safest retreat when death shall approach you.
“That even the young are not exempt from the arrest of this
universal conqueror, the tombstone of Amelia will tell you. Youth,
beauty, health and fortune, strewed the path of life with flowers, and
left her no wish ungratified. Love, with its gentlest and purest flame,
animated her heart, and was equally returned by Julius. Their
passion was approved by their parents and friends; the day was
fixed, and preparations were making for the celebration of their
nuptials. At this period Amelia was attacked by a violent cold, which
seating on her lungs, baffled the skill of the most eminent
physicians, and terminated in a confirmed hectic. She perceived her
disorder to be incurable, and with inexpressible regret and concern
anticipated her approaching dissolution. She had enjoyed life too
highly to think much of death; yet die she must! “Oh,” said she,
“that I had prepared, while in health and at ease, for this awful
event! Then should I not be subjected to the keenest distress of
mind, in addition to the most painful infirmities of body! Then should
I be able to look forward with hope, and to find relief in the
consoling expectation of being united beyond the grave, with those
dear and beloved connexions, which I must soon leave behind! Let
my companions and acquaintance learn from me the important
lesson of improving their time to the best of purposes; of acting at
once as becomes mortal and immortal creatures!”
“Hear, my dear pupils, the solemn admonition, and be ye also ready!
“Too many, especially of the young and gay, seem more anxious to
live in pleasure, than to answer the end of their being, by the
cultivation of that piety and virtue which will render them good
members of society, useful to their friends and associates, and
partakers of that heart-felt satisfaction which results from a
conscience void of offence both towards God and man.
“This, however, is an egregious mistake; for in many situations, piety
and virtue are our only source of consolation; and in all, they are
peculiarly friendly to our happiness.
“Do you exult in beauty, and the pride of external charms? Turn your
eyes for a moment, on the miserable Flirtilla.[1] Like her, your
features and complexion may be impaired by disease; and where
then will you find a refuge from mortification and discontent, if
destitute of those ennobling endowments which can raise you
superior to the transient graces of a fair form, if unadorned by that
substantial beauty of mind which can not only ensure respect from
those around you, but inspire you with resignation to the divine will,
and a patient acquiescence in the painful allotments of a holy
Providence. Does wealth await your command, and grandeur with its
fascinating appendages beguile your fleeting moments? Recollect,
that riches often make themselves wings and fly away. A single
instance of mismanagement; a consuming fire, with various other
