
Register Your Book

at ibmpressbooks.com/ibmregister
Upon registration, we will send you electronic sample chapters from two of our popular
IBM Press books. In addition, you will be automatically entered into a monthly drawing
for a free IBM Press book.

Registration also entitles you to:


• Notices and reminders about author appearances, conferences, and online chats
with special guests
• Access to supplemental material that may be available
• Advance notice of forthcoming editions
• Related book recommendations
• Information about special contests and promotions throughout the year
• Chapter excerpts and supplements of forthcoming books

Contact us
If you are interested in writing a book or reviewing manuscripts prior to publication,
please write to us at:
Editorial Director, IBM Press
c/o Pearson Education
800 East 96th Street
Indianapolis, IN 46240
e-mail: [email protected]

Visit us on the Web: ibmpressbooks.com


Related Books of Interest

The IBM Style Guide: Conventions for Writers and Editors
By Francis DeRespinis, Peter Hayward, Jana Jenkins, Amy Laird, Leslie McDonald, and Eric Radzinski
ISBN: 0-13-210130-0

The IBM Style Guide distills IBM wisdom for developing superior content: information that is consistent, clear, concise, and easy to translate. This expert guide contains practical guidance on topic-based writing, writing content for different media types, and writing for global audiences and can help any organization improve and standardize content across authors, delivery mechanisms, and geographic locations.

The IBM Style Guide can help any organization or individual create and manage content more effectively. The guidelines are especially valuable for businesses that have not previously adopted a corporate style guide, for anyone who writes or edits for IBM as an employee or outside contractor, and for anyone who uses modern approaches to information architecture.

DITA Best Practices
By Laura Bellamy, Michelle Carey, and Jenifer Schlotfeldt
ISBN: 0-13-248052-2

Darwin Information Typing Architecture (DITA) is today’s most powerful toolbox for constructing information. By implementing DITA, organizations can gain more value from their technical documentation than ever before. In DITA Best Practices, three DITA pioneers offer the first complete roadmap for successful DITA adoption, implementation, and usage. Drawing on years of experience helping large organizations adopt DITA, the authors answer crucial questions the “official” DITA documents ignore. An indispensable resource for every writer, editor, information architect, manager, or consultant involved with evaluating, deploying, or using DITA.

Sign up for the monthly IBM Press newsletter at ibmpressbooks.com/newsletters
Related Books of Interest

Developing Quality Technical Information, Second Edition
By Gretchen Hargis, Michelle Carey, Ann Kilty Hernandez, Polly Hughes, Deirdre Longo, Shannon Rouiller, and Elizabeth Wilde
ISBN: 0-13-147749-8

Direct from IBM’s own documentation experts, this is the definitive guide to developing outstanding technical documentation—for the Web and for print. Using extensive before-and-after examples, illustrations, and checklists, the authors show exactly how to create documentation that’s easy to find, understand, and use. This edition includes extensive new coverage of topic-based information, simplifying search and retrievability, internationalization, visual effectiveness, and much more.

Data Integration Blueprint and Modeling: Techniques for a Scalable and Sustainable Architecture
By Anthony David Giordano
ISBN: 0-13-708493-5

Making Data Integration Work: How to Systematically Reduce Cost, Improve Quality, and Enhance Effectiveness

This book presents the solution: a clear, consistent approach to defining, designing, and building data integration components to reduce cost, simplify management, enhance quality, and improve effectiveness. Leading IBM data management expert Tony Giordano brings together best practices for architecture, design, and methodology and shows how to do the disciplined work of getting data integration right.

Mr. Giordano begins with an overview of the “patterns” of data integration, showing how to build blueprints that smoothly handle both operational and analytic data integration. Next, he walks through the entire project lifecycle, explaining each phase, activity, task, and deliverable through a complete case study. Finally, he shows how to integrate data integration with other information management disciplines, from data governance to metadata. The book’s appendices bring together key principles, detailed models, and a complete data integration glossary.

Visit ibmpressbooks.com
for all product information
Related Books of Interest
Search Engine Marketing, Inc.
The #1 Step-by-Step Guide to Search Marketing Success...Now Completely Updated with New Techniques, Tools, Best Practices, and Value-Packed Bonus DVD!
By Mike Moran and Bill Hunt
ISBN: 0-13-606868-5

In this book, two world-class experts present today’s best practices, step-by-step techniques, and hard-won tips for using search engine marketing to achieve your sales and marketing goals, whatever they are. Mike Moran and Bill Hunt thoroughly cover both the business and technical aspects of contemporary search engine marketing, walking beginners through all the basics while providing reliable, up-to-the-minute insights for experienced professionals.

Thoroughly updated to fully reflect today’s latest search engine marketing opportunities, this book guides you through profiting from social media marketing, site search, advanced keyword tools, hybrid paid search auctions, and much more.

Do It Wrong Quickly: How the Web Changes the Old Marketing Rules
Moran
ISBN: 0-13-225596-0

Get Bold: Using Social Media to Create a New Type of Social Business
Carter
ISBN: 0-13-261831-1

The Social Factor: Innovate, Ignite, and Win through Mass Collaboration and Social Networking
Azua
ISBN: 0-13-701890-8

Audience, Relevance, and Search: Targeting Web Audiences with Relevant Content
Mathewson, Donatone, Fishel
ISBN: 0-13-700420-6

Making the World Work Better: The Ideas That Shaped a Century and a Company
Maney, Hamm, O’Brien
ISBN: 0-13-275510-6

Listen to the author’s podcast at: ibmpressbooks.com/podcasts

Sign up for the monthly IBM Press newsletter at ibmpressbooks.com/newsletters
Multilingual Natural Language Processing Applications
From Theory to Practice

Edited by Daniel M. Bikel and Imed Zitouni

IBM Press
Pearson plc
Upper Saddle River, NJ • Boston • Indianapolis • San Francisco
New York • Toronto • Montreal • London • Munich • Paris • Madrid
Capetown • Sydney • Tokyo • Singapore • Mexico City

ibmpressbooks.com
The authors and publisher have taken care in the preparation of this book, but make no expressed or
implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed
for incidental or consequential damages in connection with or arising out of the use of the information or
programs contained herein.


© Copyright 2012 by International Business Machines Corporation. All rights reserved.

Note to U.S. Government Users: Documentation related to restricted rights. Use, duplication, or disclosure
is subject to restrictions set forth in GSA ADP Schedule Contract with IBM Corporation.

IBM Press Program Managers: Steven M. Stansel, Ellice Uffer


Cover design: IBM Corporation

Executive Editor: Bernard Goodwin


Marketing Manager: Stephane Nakib
Publicist: Heather Fox
Managing Editor: John Fuller
Designer: Alan Clements
Project Editor: Elizabeth Ryan
Copy Editor: Carol Lallier
Indexer: Jack Lewis
Compositor: LaurelTech
Proofreader: Kelli M. Brooks
Manufacturing Buyer: Dan Uhrig

Published by Pearson plc


Publishing as IBM Press

IBM Press offers excellent discounts on this book when ordered in quantity for bulk purchases or special
sales, which may include electronic versions and/or custom covers and content particular to your business,
training goals, marketing focus, and branding interests. For more information, please contact
U.S. Corporate and Government Sales
1-800-382-3419
[email protected]
For sales outside the United States, please contact
International Sales
[email protected]
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as
trademarks. Where those designations appear in this book, and the publisher was aware of a trademark
claim, the designations have been printed with initial capital letters or in all capitals.

The following terms are trademarks or registered trademarks of International Business Machines Corpora-
tion in the United States, other countries, or both: IBM, the IBM press logo, IBM Watson, ThinkPlace,
WebSphere, and InfoSphere. A current list of IBM trademarks is available on the web at “copyright and
trademark information” at www.ibm.com/legal/copytrade.shtml. Microsoft, Windows, Windows NT, and
the Windows logo are trademarks of the Microsoft Corporation in the United States, other countries, or
both. Java and all Java-based trademarks and logos are trademarks of Oracle and/or its affiliates. Linux
is a registered trademark of Linus Torvalds in the United States, other countries, or both. Other company,
product, or service names may be trademarks or service marks of others.

Library of Congress Cataloging-in-Publication Data is on file with the Library of Congress.

All rights reserved. This publication is protected by copyright, and permission must be obtained from
the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any
form or by any means, electronic, mechanical, photocopying, recording, or likewise. To obtain permission
to use material from this work, please submit a written request to Pearson Education, Inc., Permissions
Department, One Lake Street, Upper Saddle River, New Jersey 07458, or you may fax your request to (201)
236-3290.
ISBN-13: 978-0-13-715144-8
ISBN-10: 0-13-715144-6
Text printed in the United States on recycled paper at Courier in Westford, Massachusetts.
First printing, May 2012
I dedicate this book to
my mother Rita, my brother Robert, my sister-in-law Judi,
my nephew Wolfie, and my niece Freya—Bikels all.
I also dedicate it to Science.
DMB

I dedicate this book to


my parents Ali and Radhia, who taught me the love of science,
my wife Barbara, for her support and encouragement,
my kids Nassim and Ines, for the joy they give me.
I also dedicate it to my grandmother Zohra,
my brother Issam, my sister-in-law Chahnez,
as well as my parents-in-law Alain and Pilar.
IZ
Contents

Preface xxi

Acknowledgments xxv

About the Authors xxvii

Part I In Theory 1
Chapter 1 Finding the Structure of Words 3
1.1 Words and Their Components 4
1.1.1 Tokens 4
1.1.2 Lexemes 5
1.1.3 Morphemes 5
1.1.4 Typology 7
1.2 Issues and Challenges 8
1.2.1 Irregularity 8
1.2.2 Ambiguity 10
1.2.3 Productivity 13
1.3 Morphological Models 15
1.3.1 Dictionary Lookup 15
1.3.2 Finite-State Morphology 16
1.3.3 Unification-Based Morphology 18
1.3.4 Functional Morphology 19
1.3.5 Morphology Induction 21
1.4 Summary 22

Chapter 2 Finding the Structure of Documents 29


2.1 Introduction 29
2.1.1 Sentence Boundary Detection 30
2.1.2 Topic Boundary Detection 32
2.2 Methods 33
2.2.1 Generative Sequence Classification Methods 34
2.2.2 Discriminative Local Classification Methods 36


2.2.3 Discriminative Sequence Classification Methods 38


2.2.4 Hybrid Approaches 39
2.2.5 Extensions for Global Modeling for Sentence Segmentation 40
2.3 Complexity of the Approaches 40
2.4 Performances of the Approaches 41
2.5 Features 41
2.5.1 Features for Both Text and Speech 42
2.5.2 Features Only for Text 44
2.5.3 Features for Speech 45
2.6 Processing Stages 48
2.7 Discussion 48
2.8 Summary 49

Chapter 3 Syntax 57
3.1 Parsing Natural Language 57
3.2 Treebanks: A Data-Driven Approach to Syntax 59
3.3 Representation of Syntactic Structure 63
3.3.1 Syntax Analysis Using Dependency Graphs 63
3.3.2 Syntax Analysis Using Phrase Structure Trees 67
3.4 Parsing Algorithms 70
3.4.1 Shift-Reduce Parsing 72
3.4.2 Hypergraphs and Chart Parsing 74
3.4.3 Minimum Spanning Trees and Dependency Parsing 79
3.5 Models for Ambiguity Resolution in Parsing 80
3.5.1 Probabilistic Context-Free Grammars 80
3.5.2 Generative Models for Parsing 83
3.5.3 Discriminative Models for Parsing 84
3.6 Multilingual Issues: What Is a Token? 87
3.6.1 Tokenization, Case, and Encoding 87
3.6.2 Word Segmentation 89
3.6.3 Morphology 90
3.7 Summary 92

Chapter 4 Semantic Parsing 97


4.1 Introduction 97
4.2 Semantic Interpretation 98
4.2.1 Structural Ambiguity 99
4.2.2 Word Sense 99
4.2.3 Entity and Event Resolution 100
4.2.4 Predicate-Argument Structure 100
4.2.5 Meaning Representation 101
4.3 System Paradigms 101
4.4 Word Sense 102
4.4.1 Resources 104

4.4.2 Systems 105


4.4.3 Software 116
4.5 Predicate-Argument Structure 118
4.5.1 Resources 118
4.5.2 Systems 122
4.5.3 Software 147
4.6 Meaning Representation 147
4.6.1 Resources 148
4.6.2 Systems 149
4.6.3 Software 151
4.7 Summary 152
4.7.1 Word Sense Disambiguation 152
4.7.2 Predicate-Argument Structure 153
4.7.3 Meaning Representation 153

Chapter 5 Language Modeling 169


5.1 Introduction 169
5.2 n-Gram Models 170
5.3 Language Model Evaluation 170
5.4 Parameter Estimation 171
5.4.1 Maximum-Likelihood Estimation and Smoothing 171
5.4.2 Bayesian Parameter Estimation 173
5.4.3 Large-Scale Language Models 174
5.5 Language Model Adaptation 176
5.6 Types of Language Models 178
5.6.1 Class-Based Language Models 178
5.6.2 Variable-Length Language Models 179
5.6.3 Discriminative Language Models 179
5.6.4 Syntax-Based Language Models 180
5.6.5 MaxEnt Language Models 181
5.6.6 Factored Language Models 183
5.6.7 Other Tree-Based Language Models 185
5.6.8 Bayesian Topic-Based Language Models 186
5.6.9 Neural Network Language Models 187
5.7 Language-Specific Modeling Problems 188
5.7.1 Language Modeling for Morphologically Rich Languages 189
5.7.2 Selection of Subword Units 191
5.7.3 Modeling with Morphological Categories 192
5.7.4 Languages without Word Segmentation 193
5.7.5 Spoken versus Written Languages 194
5.8 Multilingual and Crosslingual Language Modeling 195
5.8.1 Multilingual Language Modeling 195
5.8.2 Crosslingual Language Modeling 196
5.9 Summary 198

Chapter 6 Recognizing Textual Entailment 209


6.1 Introduction 209
6.2 The Recognizing Textual Entailment Task 210
6.2.1 Problem Definition 210
6.2.2 The Challenge of RTE 212
6.2.3 Evaluating Textual Entailment System Performance 213
6.2.4 Applications of Textual Entailment Solutions 214
6.2.5 RTE in Other Languages 218
6.3 A Framework for Recognizing Textual Entailment 219
6.3.1 Requirements 219
6.3.2 Analysis 220
6.3.3 Useful Components 220
6.3.4 A General Model 224
6.3.5 Implementation 227
6.3.6 Alignment 233
6.3.7 Inference 236
6.3.8 Training 238
6.4 Case Studies 238
6.4.1 Extracting Discourse Commitments 239
6.4.2 Edit Distance-Based RTE 240
6.4.3 Transformation-Based Approaches 241
6.4.4 Logical Representation and Inference 242
6.4.5 Learning Alignment Independently of Entailment 244
6.4.6 Leveraging Multiple Alignments for RTE 245
6.4.7 Natural Logic 245
6.4.8 Syntactic Tree Kernels 246
6.4.9 Global Similarity Using Limited Dependency Context 247
6.4.10 Latent Alignment Inference for RTE 247
6.5 Taking RTE Further 248
6.5.1 Improve Analytics 248
6.5.2 Invent/Tackle New Problems 249
6.5.3 Develop Knowledge Resources 249
6.5.4 Better RTE Evaluation 251
6.6 Useful Resources 252
6.6.1 Publications 252
6.6.2 Knowledge Resources 252
6.6.3 Natural Language Processing Packages 253
6.7 Summary 253

Chapter 7 Multilingual Sentiment and Subjectivity Analysis 259


7.1 Introduction 259
7.2 Definitions 260
7.3 Sentiment and Subjectivity Analysis on English 262
7.3.1 Lexicons 262

7.3.2 Corpora 262


7.3.3 Tools 263
7.4 Word- and Phrase-Level Annotations 264
7.4.1 Dictionary-Based 264
7.4.2 Corpus-Based 267
7.5 Sentence-Level Annotations 270
7.5.1 Dictionary-Based 270
7.5.2 Corpus-Based 271
7.6 Document-Level Annotations 272
7.6.1 Dictionary-Based 272
7.6.2 Corpus-Based 274
7.7 What Works, What Doesn’t 274
7.7.1 Best Scenario: Manually Annotated Corpora 274
7.7.2 Second Best: Corpus-Based Cross-Lingual Projections 275
7.7.3 Third Best: Bootstrapping a Lexicon 275
7.7.4 Fourth Best: Translating a Lexicon 276
7.7.5 Comparing the Alternatives 276
7.8 Summary 277

Part II In Practice 283


Chapter 8 Entity Detection and Tracking 285
8.1 Introduction 285
8.2 Mention Detection 287
8.2.1 Data-Driven Classification 287
8.2.2 Search for Mentions 289
8.2.3 Mention Detection Features 291
8.2.4 Mention Detection Experiments 294
8.3 Coreference Resolution 296
8.3.1 The Construction of Bell Tree 297
8.3.2 Coreference Models: Linking and Starting Model 298
8.3.3 A Maximum Entropy Linking Model 300
8.3.4 Coreference Resolution Experiments 302
8.4 Summary 303

Chapter 9 Relations and Events 309


9.1 Introduction 309
9.2 Relations and Events 310
9.3 Types of Relations 311
9.4 Relation Extraction as Classification 312
9.4.1 Algorithm 312
9.4.2 Features 313
9.4.3 Classifiers 316

9.5 Other Approaches to Relation Extraction 317


9.5.1 Unsupervised and Semisupervised Approaches 317
9.5.2 Kernel Methods 319
9.5.3 Joint Entity and Relation Detection 320
9.6 Events 320
9.7 Event Extraction Approaches 320
9.8 Moving Beyond the Sentence 323
9.9 Event Matching 323
9.10 Future Directions for Event Extraction 326
9.11 Summary 326

Chapter 10 Machine Translation 331


10.1 Machine Translation Today 331
10.2 Machine Translation Evaluation 332
10.2.1 Human Assessment 332
10.2.2 Automatic Evaluation Metrics 334
10.2.3 WER, BLEU, METEOR, . . . 335
10.3 Word Alignment 337
10.3.1 Co-occurrence 337
10.3.2 IBM Model 1 338
10.3.3 Expectation Maximization 339
10.3.4 Alignment Model 340
10.3.5 Symmetrization 340
10.3.6 Word Alignment as Machine Learning Problem 341
10.4 Phrase-Based Models 343
10.4.1 Model 343
10.4.2 Training 344
10.4.3 Decoding 345
10.4.4 Cube Pruning 347
10.4.5 Log-Linear Models and Parameter Tuning 348
10.4.6 Coping with Model Size 349
10.5 Tree-Based Models 350
10.5.1 Hierarchical Phrase-Based Models 350
10.5.2 Chart Decoding 351
10.5.3 Syntactic Models 352
10.6 Linguistic Challenges 354
10.6.1 Lexical Choice 354
10.6.2 Morphology 355
10.6.3 Word Order 356
10.7 Tools and Data Resources 356
10.7.1 Basic Tools 357
10.7.2 Machine Translation Systems 357
10.7.3 Parallel Corpora 358

10.8 Future Directions 358


10.9 Summary 359

Chapter 11 Multilingual Information Retrieval 365


11.1 Introduction 366
11.2 Document Preprocessing 366
11.2.1 Document Syntax and Encoding 367
11.2.2 Tokenization 369
11.2.3 Normalization 370
11.2.4 Best Practices for Preprocessing 371
11.3 Monolingual Information Retrieval 372
11.3.1 Document Representation 372
11.3.2 Index Structures 373
11.3.3 Retrieval Models 374
11.3.4 Query Expansion 376
11.3.5 Document A Priori Models 377
11.3.6 Best Practices for Model Selection 377
11.4 CLIR 378
11.4.1 Translation-Based Approaches 378
11.4.2 Machine Translation 380
11.4.3 Interlingual Document Representations 381
11.4.4 Best Practices 382
11.5 MLIR 382
11.5.1 Language Identification 383
11.5.2 Index Construction for MLIR 383
11.5.3 Query Translation 384
11.5.4 Aggregation Models 385
11.5.5 Best Practices 385
11.6 Evaluation in Information Retrieval 386
11.6.1 Experimental Setup 387
11.6.2 Relevance Assessments 387
11.6.3 Evaluation Measures 388
11.6.4 Established Data Sets 389
11.6.5 Best Practices 391
11.7 Tools, Software, and Resources 391
11.8 Summary 393

Chapter 12 Multilingual Automatic Summarization 397


12.1 Introduction 397
12.2 Approaches to Summarization 399
12.2.1 The Classics 399
12.2.2 Graph-Based Approaches 401
12.2.3 Learning How to Summarize 406
12.2.4 Multilingual Summarization 409

12.3 Evaluation 412


12.3.1 Manual Evaluation Methodologies 413
12.3.2 Automated Evaluation Methods 415
12.3.3 Recent Development in Evaluating Summarization Systems 418
12.3.4 Automatic Metrics for Multilingual Summarization 419
12.4 How to Build a Summarizer 420
12.4.1 Ingredients 422
12.4.2 Devices 423
12.4.3 Instructions 423
12.5 Competitions and Datasets 424
12.5.1 Competitions 424
12.5.2 Data Sets 425
12.6 Summary 426

Chapter 13 Question Answering 433


13.1 Introduction and History 433
13.2 Architectures 435
13.3 Source Acquisition and Preprocessing 437
13.4 Question Analysis 440
13.5 Search and Candidate Extraction 443
13.5.1 Search over Unstructured Sources 443
13.5.2 Candidate Extraction from Unstructured Sources 445
13.5.3 Candidate Extraction from Structured Sources 449
13.6 Answer Scoring 450
13.6.1 Overview of Approaches 450
13.6.2 Combining Evidence 452
13.6.3 Extension to List Questions 453
13.7 Crosslingual Question Answering 454
13.8 A Case Study 455
13.9 Evaluation 460
13.9.1 Evaluation Tasks 460
13.9.2 Judging Answer Correctness 461
13.9.3 Performance Metrics 462
13.10 Current and Future Challenges 464
13.11 Summary and Further Reading 465

Chapter 14 Distillation 475


14.1 Introduction 475
14.2 An Example 476
14.3 Relevance and Redundancy 477
14.4 The Rosetta Consortium Distillation System 479
14.4.1 Document and Corpus Preparation 480
14.4.2 Indexing 483
14.4.3 Query Answering 483

14.5 Other Distillation Approaches 488


14.5.1 System Architectures 488
14.5.2 Relevance 488
14.5.3 Redundancy 489
14.5.4 Multimodal Distillation 490
14.5.5 Crosslingual Distillation 490
14.6 Evaluation and Metrics 491
14.6.1 Evaluation Metrics in the GALE Program 492
14.7 Summary 495

Chapter 15 Spoken Dialog Systems 499


15.1 Introduction 499
15.2 Spoken Dialog Systems 499
15.2.1 Speech Recognition and Understanding 500
15.2.2 Speech Generation 503
15.2.3 Dialog Manager 504
15.2.4 Voice User Interface 505
15.3 Forms of Dialog 509
15.4 Natural Language Call Routing 510
15.5 Three Generations of Dialog Applications 510
15.6 Continuous Improvement Cycle 512
15.7 Transcription and Annotation of Utterances 513
15.8 Localization of Spoken Dialog Systems 513
15.8.1 Call-Flow Localization 514
15.8.2 Prompt Localization 514
15.8.3 Localization of Grammars 516
15.8.4 The Source Data 516
15.8.5 Training 517
15.8.6 Test 519
15.9 Summary 520

Chapter 16 Combining Natural Language Processing Engines 523


16.1 Introduction 523
16.2 Desired Attributes of Architectures for Aggregating Speech and
NLP Engines 524
16.2.1 Flexible, Distributed Componentization 524
16.2.2 Computational Efficiency 525
16.2.3 Data-Manipulation Capabilities 526
16.2.4 Robust Processing 526
16.3 Architectures for Aggregation 527
16.3.1 UIMA 527
16.3.2 GATE: General Architecture for Text Engineering 529
16.3.3 InfoSphere Streams 530

16.4 Case Studies 531


16.4.1 The GALE Interoperability Demo System 531
16.4.2 Translingual Automated Language Exploitation
System (TALES) 538
16.4.3 Real-Time Translation Services (RTTS) 538
16.5 Lessons Learned 540
16.5.1 Segmentation Involves a Trade-off between Latency and
Accuracy 540
16.5.2 Joint Optimization versus Interoperability 540
16.5.3 Data Models Need Usage Conventions 540
16.5.4 Challenges of Performance Evaluation 541
16.5.5 Ripple-Forward Training of Engines 541
16.6 Summary 542
16.7 Sample UIMA Code 542

Index 551
Preface
Almost everyone on the planet, it seems, has been touched in some way by advances in
information technology and the proliferation of the Internet. Recently, multimedia infor-
mation sources have become increasingly popular. Nevertheless, the sheer volume of raw
natural language text keeps increasing, and this text is being generated in all the major
languages on Earth. For example, the English Wikipedia reports that 101 language-specific
Wikipedias exist with at least 10,000 articles each. There is therefore a pressing need for
countries, companies, and individuals to analyze this massive amount of text, translate it,
and synthesize and distill it.
Previously, to build robust and accurate multilingual natural language processing (NLP)
applications, a researcher or developer had to consult several reference books and dozens,
if not hundreds, of journal and conference papers. Our aim for this book is to provide a
“one-stop shop” that offers all the requisite background and practical advice for building
such applications. Although it is quite a tall order, we hope that, at a minimum, you find
this book a useful resource.
In the last two decades, NLP researchers have developed exciting algorithms for process-
ing large amounts of text in many different languages. By far, the dominant approach has
been to build a statistical model that can learn from examples. In this way, a model can be
robust to changes in the type of text and even the language of text on which it operates.
With the right design choices, the same model can be trained to work in a new domain or
new language simply by providing new examples in that domain. This approach also obvi-
ates the need for researchers to lay out, in a painstaking fashion, all the rules that govern
the problem at hand and the manner in which those rules must be combined. Rather, a sta-
tistical system typically allows for researchers to provide an abstract expression of possible
features of the input, where the relative importance of those features can be learned during
the training phase and can be applied to new text during the decoding, or inference, phase.
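To make the training and decoding phases concrete, here is a minimal sketch, in Python, of such a feature-based model. It is our illustration rather than anything drawn from the chapters, and the toy task, feature function, and data are assumptions made only for the example: the researcher supplies the feature function, the relative importance of each feature is learned from labeled examples during training, and the learned weights are applied to new text during decoding.

    # A toy perceptron classifier: hand-specified features, learned weights.
    def features(text):
        """An abstract expression of possible features: here, lowercased words."""
        return text.lower().split()

    def train(examples, epochs=10):
        """Training phase: learn the relative importance (weight) of each feature."""
        weights = {}
        for _ in range(epochs):
            for text, label in examples:              # label is +1 or -1
                score = sum(weights.get(f, 0.0) for f in features(text))
                if (1 if score >= 0 else -1) != label:
                    for f in features(text):          # mistake-driven update
                        weights[f] = weights.get(f, 0.0) + label
        return weights

    def decode(weights, text):
        """Decoding (inference) phase: apply the learned weights to new text."""
        score = sum(weights.get(f, 0.0) for f in features(text))
        return 1 if score >= 0 else -1

    examples = [("the film was wonderful", 1), ("the film was terrible", -1)]
    weights = train(examples)
    print(decode(weights, "a wonderful film"))        # prints 1

Retraining the same code on labeled examples from a new domain or language, with no change to the model itself, is precisely the portability described above.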
The field of statistical NLP is rapidly changing. Part of the change is due to the field’s
growth. For example, one of the main conferences in the field is that of the Association for
Computational Linguistics, where conference attendance has doubled in the last five years.
Also, the share of NLP papers in the IEEE speech and language processing conferences and
journals more than doubled in the last decade; IEEE constitutes one of the world’s largest
professional associations for the advancement of technology. Not only are NLP researchers
making inherent progress on the various subproblems of the field, but NLP continues to ben-
efit (and borrow) heavily from progress in the machine learning community and linguistics
alike. This book devotes some attention to cutting-edge algorithms and techniques, but its
primary purpose is to be a thorough explication of best practices in the field. Furthermore,
every chapter describes how the techniques discussed apply in a multilingual setting.
This book is divided into two parts. Part I, In Theory, includes the first seven chapters
and lays out the various core NLP problems and algorithms to attack those problems. The


first three chapters focus on finding structure in language at various levels of granularity.
Chapter 1 introduces the important concept of morphology, the study of the structure of
words, and ways to process the diverse array of morphologies present in the world’s lan-
guages. Chapter 2 discusses the methods by which documents may be decomposed into
more manageable parts, such as sentences and larger units related by topic. Finally, in this
initial trio of chapters, Chapter 3 investigates the various methods of uncovering a sentence’s
internal structure, or syntax. Syntax has long been a dominant area of research in linguistics,
and that dominance has been mirrored in the field of NLP as well. The dominance, in part,
stems from the fact that the structure of a sentence bears relation to the sentence’s meaning,
so uncovering syntactic structure can serve as a first step toward a full “understanding” of
a sentence.
Finding a structured meaning representation for a sentence, or for some other unit of
text, is often called semantic parsing, which is the concern of Chapter 4. That chapter covers,
inter alia, a related subproblem that has garnered much attention in recent years known
as semantic role labeling, which attempts to find the syntactic phrases that constitute the
arguments to some verb or predicate. By identifying and classifying a verb’s arguments,
we come one step closer to producing a logical form for a sentence, which is one way to
represent a sentence’s meaning in such a way as to be readily processed by machine, using
the rich array of tools available from logic that mankind has been developing since ancient
times.
But what if we do not want or need the deep syntactico-semantic structure that seman-
tic parsing would provide? What if our problem is simply to decide which among many
candidate sentences is the most likely sentence a human would write or speak? One way to
do so would be to develop a model that could score each sentence according to its gram-
maticality and pick the sentence with the highest score. The problem of producing a score
or probability estimate for a sequence of word tokens is known as language modeling and is
the subject of Chapter 5.
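To illustrate, here is a minimal sketch, again ours rather than the book’s, of a bigram language model that scores candidate sentences and picks the most likely one; the toy corpus and the add-one smoothing are assumptions made only for the example.

    # Estimate a bigram language model from a toy corpus, then use it to
    # score candidate sentences and select the most probable one.
    import math
    from collections import Counter

    corpus = ["<s> the cat sat on the mat </s>",
              "<s> the dog sat on the rug </s>"]

    bigrams, unigrams = Counter(), Counter()
    for sent in corpus:
        toks = sent.split()
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))

    VOCAB = len(unigrams)  # vocabulary size, used for add-one smoothing

    def log_prob(sentence):
        """Add-one-smoothed bigram log-probability of a sentence."""
        toks = ["<s>"] + sentence.split() + ["</s>"]
        return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + VOCAB))
                   for a, b in zip(toks, toks[1:]))

    candidates = ["the cat sat on the mat", "mat the on sat cat the"]
    print(max(candidates, key=log_prob))  # prints the grammatical candidate

Chapter 5 develops far better estimators than this one, but the pattern of scoring whole token sequences and choosing the highest-scoring candidate is exactly the one shown here.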
Representing meaning and judging a sentence’s grammaticality are only two of many
possible first steps toward processing language. Moving further toward some sense of under-
standing, we might wish to have an algorithm make inferences about facts expressed in
a piece of text. For example, we might want to know if a fact mentioned in one sentence
is entailed by some previous sentence in a document. This sort of inference is known as
recognizing textual entailment and is the subject of Chapter 6.
Finding which facts or statements are entailed by others is clearly important to the
automatic understanding of text, but there is also the nature of those statements. Under-
standing which statements are subjective and the polarity of the opinion expressed is the
subject matter of Chapter 7. Given how often people express opinions, this is clearly an
important problem area, all the more so in an age when social networks are fast becoming
the dominant form of person-to-person communication on the Internet. This chapter rounds
out Part I of our book.
Part II, In Practice, takes the various core areas of NLP described in Part I and explains
how to apply them to the diverse array of real-world NLP applications. Engineering is often
about trade-offs, say, between time and space, and so the chapters in this applied part of our
book explore the trade-offs in making various algorithmic and design choices when building
a robust, multilingual NLP application.

Chapter 8 describes ways to identify and classify named entities and other mentions
of those entities in text, as well as methods to identify when two or more entity mentions
corefer. These two problems are typically known as mention detection and coreference res-
olution; they are two of the core parts of a larger application area known as information
extraction.
Chapter 9 continues the information extraction discussion, exploring techniques for find-
ing out how two entities are related to each other, known as relation extraction, and identi-
fying and classifying events, or event extraction. An event, in this case, is when something
happens involving multiple entities, and we would like a machine to uncover who the par-
ticipants are and what their roles are. In this way, event extraction is closely related to the
core NLP problem of semantic role labeling.
Chapter 10 describes one of the oldest problems in the field, and one of the few that
is an inherently multilingual NLP problem: machine translation, or MT. Automatically
translating from one language to another has long been a holy grail of NLP research, and in
recent years the community has developed techniques and can obtain hardware that make
MT a practical reality, reaping rewards after decades of effort.
It is one thing to translate text, but how do we make sense of all the text out there
in seemingly limitless quantity? Chapters 8 and 9 make some headway in this regard by
helping us automatically produce structured records of information in text. Another way to
tackle the quantity problem is to narrow down the scope by finding the few documents,
or subparts of documents, that are relevant based on a search query. This problem is
known as information retrieval and is the subject of Chapter 11. In many ways, com-
mercial search engines such as Google are large-scale information retrieval systems. Given
the popularity of search engines, this is clearly an important NLP problem—all the more
so given the number of corpora that are not public and therefore searchable by commercial
engines.
Another way we might tackle the sheer quantity of text is by automatically summarizing
it, which is the topic of Chapter 12. This very difficult problem involves either finding
the sentences, or bits of sentences, that contribute to providing a relevant summary of a
larger quantity of text, or else ingesting the text, summarizing its meaning in some internal
representation, and then generating the text that constitutes a summary, much as a human
might do.
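The first, extractive strategy can be sketched in a few lines; the sketch below is our illustration, not a method from Chapter 12, and its stopword list, tokenizer, and summary length are assumptions made only for the example. It scores each sentence by the document-wide frequency of its content words and keeps the top-scoring sentences in their original order.

    # A toy extractive summarizer: frequent content words mark central sentences.
    from collections import Counter

    STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "was", "in", "it"}

    def content_words(sentence):
        return [w.strip(".,").lower() for w in sentence.split()
                if w.strip(".,").lower() not in STOPWORDS]

    def summarize(sentences, n=1):
        freq = Counter(w for s in sentences for w in content_words(s))
        # Score each sentence by the total document frequency of its content words.
        ranked = sorted(sentences, reverse=True,
                        key=lambda s: sum(freq[w] for w in content_words(s)))
        best = set(ranked[:n])
        return [s for s in sentences if s in best]  # preserve original order

    doc = ["Machine translation maps text between languages.",
           "Statistical models dominate machine translation today.",
           "The weather was pleasant."]
    print(summarize(doc, n=1))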
Often, humans would like machines to process text automatically because they have
questions they seek to answer. These questions can range from simple, factoid-like questions,
such as “When was John F. Kennedy born?” to more complex questions such as “What is
the largest city in Bavaria, Germany?” Chapter 13 discusses ways to build systems to answer
these types of questions automatically.
What if the types of questions we might like to answer are even more complex? Our
queries might have multiple answers, such as “Name all the foreign heads of state President
Barack Obama met with in 2010.” These types of queries are handled by a relatively new
subdiscipline within NLP known as distillation, the subject of Chapter 14. In a very real way, distillation combines the
techniques of information retrieval with information extraction and adds a few of its own.
In many cases, we might like to have machines process language in an interactive way,
making use of speech technology that both recognizes and synthesizes speech. Such systems
are known as dialog systems and are covered in Chapter 15. Due to advances in speech

recognition, dialog management, and speech synthesis, such systems are becoming increas-
ingly practical and are seeing widespread, real-world deployment.
Finally, we, as NLP researchers and engineers, might like to build systems using diverse
arrays of components developed across the world. This aggregation of processing engines
is described in Chapter 16. Although it is the final chapter of our book, in some ways it
represents a beginning, not an end, to processing text, for it describes how a common
infrastructure can be used to produce a combinatorically diverse array of processing
pipelines.
As much as we hope this book is self-contained, we also hope that for you it serves as
the beginning and not an end. Each chapter has a long list of relevant work upon which it
is based, allowing you to explore any subtopic in great detail. The large community of NLP
researchers is growing throughout the world, and we hope you join us in our exciting efforts
to process text automatically and that you interact with us at universities, at industrial
research labs, at conferences, in blogs, on social networks, and elsewhere. The multilingual
NLP systems of the future are going to be even more exciting than the ones we have now,
and we look forward to all your contributions!
Acknowledgments
This book was, from its inception, designed as a highly collaborative effort. We are immensely
grateful for the encouraging support obtained from the beginning from IBM Press/Prentice
Hall, especially from Bernard Goodwin and all the others at IBM Press who helped us get
this project off the ground and see it to completion. A book of this kind would also not have
been possible without the generous time, effort, and technical acumen of our fellow chapter
authors, so we owe huge thanks to Otakar Smrž, Hyun-Jo You, Dilek Hakkani-Tür, Gokhan
Tur, Benoit Favre, Elizabeth Shriberg, Anoop Sarkar, Sameer Pradhan, Katrin Kirchhoff,
Mark Sammons, V.G.Vinod Vydiswaran, Dan Roth, Carmen Banea, Rada Mihalcea, Janyce
Wiebe, Xiaoqiang Luo, Philipp Koehn, Philipp Sorg, Philipp Cimiano, Frank Schilder, Liang
Zhou, Nico Schlaefer, Jennifer Chu-Carroll, Vittorio Castelli, Radu Florian, Roberto Pierac-
cini, David Suendermann, John F. Pitrelli, and Burn Lewis. Daniel M. Bikel is also grateful
to Google Research, especially to Corinna Cortes, for her support during the final stages of
this project. Finally, we—Daniel M. Bikel and Imed Zitouni—would like to express our great
appreciation for the backing of IBM Research, with special thanks to Ellen Yoffa, without
whom this project would not have been possible.

About the Authors
Daniel M. Bikel ([email protected]) is a senior research scientist
at Google. He graduated with honors from Harvard in 1993 with a
degree in Classics–Ancient Greek and Latin. From 1994 to 1997, he
worked at BBN on several natural language processing problems,
including development of the first high-accuracy stochastic name-
finder, for which he holds a patent. He received M.S. and Ph.D.
degrees in computer science from the University of Pennsylvania, in
2000 and 2004 respectively, discovering new properties of statisti-
cal parsing algorithms. From 2004 through 2010, he was a research
staff member at IBM Research, working on a wide variety of natu-
ral language processing problems, including parsing, semantic role
labeling, information extraction, machine translation, and question answering. Dr. Bikel
has been a reviewer for the Computational Linguistics journal, and has been on the pro-
gram committees of the ACL, NAACL, EACL, and EMNLP conferences. He has published
numerous peer-reviewed papers in the leading conferences and journals and has built soft-
ware tools that have seen widespread use in the natural language processing community.
In 2008, he won a Best Paper Award (Outstanding Short Paper) at the ACL-08: HLT
conference. Since 2010, Dr. Bikel has been doing natural language processing and speech
processing research at Google.

Imed Zitouni ([email protected]) is a senior researcher working for IBM since 2004. He received his M.Sc. and Ph.D. in computer
science with honors from University of Nancy, France in 1996 and
2000 respectively. In 1995, he obtained an MEng degree in com-
puter science from Ecole Nationale des Sciences de l’Informatique,
a prestigious national computer institute in Tunisia. Before joining
IBM, he was a principal scientist at a startup company, DIALOCA,
in 1999 and 2000. He then joined Bell Laboratories Lucent-Alcatel
between 2000 and 2004 as a research staff member. His research
interests include natural language processing, language modeling,
spoken dialog systems, speech recognition, and machine learning.
Dr. Zitouni is a member of the IEEE Speech and Language Technical Committee in 2009–
2011. He is the associate editor of the ACM Transactions on Asian Language Information
Processing and the information officer of the Association for Computational Linguistics
(ACL) Special Interest Group on Computational Approaches to Semitic Languages. He
is a senior member of IEEE and member of ISCA and ACL. He served on the program


committee and as a chair for several peer-review conferences and journals. He holds several
patents in the field and authored more than seventy-five papers in peer-review conferences
and journals.

Carmen Banea ([email protected]) is a doctoral student in the Department of Computer Science and Engineering, Univer-
sity of North Texas. She is working in the field of natural language
processing. Her research work focuses primarily on multilingual app-
roaches to subjectivity and sentiment analysis, where she developed
both dictionary- and corpus-based methods that leverage languages
with rich resources to create tools and data in other languages.
Carmen has authored papers in major natural language processing
conferences, including the Association for Computational Linguis-
tics, Empirical Methods in Natural Language Processing, and the
International Conference on Computational Linguistics. She served
as a program committee member in numerous large conferences and was also a reviewer for
the Computational Linguistics Journal and the Journal of Natural Language Engineering.
She cochaired the TextGraphs 2010 Workshop collocated with ACL 2010 and was one of
the organizers of the University of North Texas site of the North American Computational
Linguistics Olympiad in 2009 to 2011.

Vittorio Castelli ([email protected]) received a Laurea degree in electrical engineering from Politecnico di Milano in 1988, an M.S.
in electrical engineering in 1990, an M.S. in statistics in 1994, and
a Ph.D. in electrical engineering in 1995, with a dissertation on
information theory and statistical classification. In 1995 he joined
the IBM T. J. Watson Research Center. His recent work is in nat-
ural language processing, specifically in information extraction; he
has worked on the DARPA GALE and machine reading projects.
Vittorio previously started the Personal Wizards project, aimed at
capturing procedural knowledge from observation of experts per-
forming a task. He has also done work on foundations of informa-
tion theory, memory compression, time series prediction and indexing, performance analysis,
methods for improving the reliability and serviceability of computer systems, and digital
libraries for scientific imagery. From 1996 to 1998 he was coinvestigator of the NASA/CAN
project no. NCC5-101. His main research interests include information theory, probability
theory, statistics, and statistical pattern recognition. From 1998 to 2005 he was an adjunct
assistant professor at Columbia University, teaching information theory and statistical pat-
tern recognition. He is a member of Sigma Xi, of the IEEE IT Society, and of the Amer-
ican Statistical Association. Vittorio has published papers on natural language processing
computer-assisted instruction, statistical classification, data compression, image processing,
multimedia databases, database mining and multidimensional indexing structures, intelli-
gent user interfactes, and foundational problems in information theory, and he coedited
Image Databases: Search and Retrieval of Digital Imagery (Wiley, 2002).

Jennifer Chu-Carroll ([email protected]) is a research staff member in the Semantic Analysis and Integration Department at
the IBM T. J. Watson Research Center. Before joining IBM in 2001,
she spent five years as a member of technical staff at Lucent Tech-
nologies Bell Labratories. Her research interests include question
answering, semantic search, discourse processing, and spoken dialog
management.

Philipp Cimiano ([email protected]) is a professor in computer science at the University of Bielefeld, Germany. He leads
the Semantic Computing Group that is affiliated with the Cognitive
Interaction Technology Excellence Center, funded by the Deutsche
Forschungsgemeinschaft in the framework of the excellence initia-
tive. Philipp Cimiano graduated in computer science (major) and
computational linguistics (minor) from the University of Stuttgart.
He obtained his doctoral degree (summa cum laude) from the Uni-
versity of Karlsruhe. His main research interest lies in the combi-
nation of natural language with semantic technologies. In the last
several years, he has focused on multilingual information access. He
has been involved as main investigator in a number of European (Dot.Kom, X-Media, Mon-
net) as well as national research projects such as SmartWeb (BMBF) and Multipla (DFG).

Benoit Favre ([email protected]) is an associate professor at Aix-Marseille Université, Marseille, France. He is a researcher
in the field of natural language understanding. His research inter-
ests are in speech and text understanding with a focus on machine
learning approaches. He received his Ph.D. from the University of
Avignon, France, in 2007 on the topic of automatic speech summa-
rization. Benoit was a teaching assistant at University of Avignon
between 2003 and 2007 and a research engineer at Thales Land &
Joint Systems, Paris, during the same period. Between 2007 and
2009, Benoit held a postdoctoral position at the International Computer Science Institute (Berkeley, CA), working with the speech group. From
2009 to 2010, he held a postdoctoral position at University of Le Mans, France. Since
2010, he has been a tenured associate professor at Aix-Marseille Université and a member of Laboratoire
d’Informatique Fondamentale. Benoit is the coauthor of more than thirty refereed papers in
international conferences and journals. He was a reviewer for major conferences in the field
(ICASSP, Interspeech, ACL, EMNLP, Coling, NAACL) and for the IEEE Transactions on
Speech and Language Processing. He is a member of the International Speech Communica-
tion Association and IEEE.

Radu Florian ([email protected]) is the manager of the Statistical Content Analytics (Information Extraction) group at IBM. He
received his Ph.D. in 2002 from Johns Hopkins University, when
he joined the Multilingual NLP group at IBM. At IBM, he has
worked on a variety of research projects in the area of informa-
tion extraction: mention detection, coreference resolution, relation
extraction, cross-document coreference, and targeted information
retrieval. Radu led research groups participating in several DARPA
programs (GALE Distillation, MRP) and NIST-organized evalua-
tions (ACE, TAC-KBP) and joint development programs with IBM
partners for text mining in the medical domain (with Nuance), and
contributed to the Watson Jeopardy! project.

Dilek Hakkani-Tür ([email protected]) is a principal scientist at Microsoft. Before joining Microsoft, she was with
the International Computer Science Institute (ICSI) speech group
(2006–2010) and AT&T Labs–Research (2001–2005). She received
her B.Sc. degree from Middle East Technical University in 1994,
and M.Sc. and Ph.D. degrees from Bilkent University, department of
computer engineering, in 1996 and 2000 respectively. Her Ph.D. the-
sis is on statistical language modeling for agglutinative languages.
She worked on machine translation at Carnegie Mellon University,
Language Technologies Institute, in 1997 and at Johns Hopkins
University in 1998. Between 1998 and 1999, Dilek worked on using
lexical and prosodic information for information extraction from speech at SRI Interna-
tional. Her research interests include natural language and speech processing, spoken dialog
systems, and active and unsupervised learning for language processing. She holds 13 patents
and has coauthored over one hundred papers in natural language and speech processing. She
was an associate editor of IEEE Transactions on Audio, Speech and Language Processing
between 2005 and 2008 and currently serves as an elected member of the IEEE Speech and
Language Technical Committee (2009–2012).

Katrin Kirchhoff ([email protected]) is a research associate professor in electrical engineering at the University of Washington.
Her main research interests are automatic speech recognition, natu-
ral language processing, and human–computer interfaces, with par-
ticular emphasis on multilingual applications. She has authored over
seventy peer-reviewed publications and is coeditor of Multilingual
Speech Processing. Katrin currently serves as a member of the IEEE
Speech Technical Committee and on the editorial boards of Com-
puter, Speech and Language and Speech Communication.

Philipp Koehn ([email protected]) is a reader at the University of Edinburgh. He received his Ph.D. from the University of South-
ern California, where he was a research assistant at the Information
Sciences Institute from 1997 to 2003. He was a postdoctoral research
associate at the Massachusetts Institute of Technology in 2004 and
joined the University of Edinburgh as a lecturer in 2005. His research
centers on statistical machine translation, but he has also worked
on speech, text classification, and information extraction. His major
contributions to the machine translation community are the prepara-
tion and release of the Europarl corpus as well as the Pharaoh and
Moses decoder. He is president of the ACL Special Interest Group on
Machine Translation and author of Statistical Machine Translation (Cambridge University
Press, 2010).

Burn L. Lewis ([email protected]) is a member of the computer science department at the IBM Thomas J. Watson Research Center.
He received B.E. and M.E. degrees in electrical engineering from the
University of Auckland in 1967 and 1968, respectively, and a Ph.D.
in electrical engineering and computer science from the University
of California–Berkeley in 1974. He subsequently joined IBM at the
T. J. Watson Research Center, where he has worked on speech recog-
nition and unstructured information management.

Xiaoqiang Luo ([email protected]) is a research staff member at IBM T. J. Watson Research Center. He has extensive experience in human language technology, including speech recognition,
spoken dialog systems, and natural language processing. He is a
major contributor to IBM’s success in many government-sponsored
projects in the area of speech and language technology. He received
the prestigious IBM Outstanding Technical Achievement Award in
2007, IBM ThinkPlace Bravo Award in 2006, and numerous inven-
tion achievement awards. Dr. Luo received his Ph.D. and M.S. in
electrical engineering from Johns Hopkins University in 1999 and
1995, respectively, and B.A. in electrical engineering from Univer-
sity of Science and Technology of China in 1990. Dr. Luo is a member of the Association for
Computational Linguistics and has served as a program committee member for major tech-
nical conferences in the area of human language and artificial intelligence. He is a board
member of the Chinese Association for Science and Technology (Greater New York Chap-
ter). He served as an associate editor for ACM Transactions on Asian Language Information
Processing (TALIP) from 2007 to 2010.

Rada Mihalcea ([email protected]) is an associate professor in the Department of Computer Science and Engineering, University of
North Texas. Her research interests are in computational linguis-
tics, with a focus on lexical semantics, graph-based algorithms for
natural language processing, and multilingual natural language pro-
cessing. She is currently involved in a number of research projects,
including word sense disambiguation, monolingual and crosslingual
semantic similarity, automatic keyword extraction and text summa-
rization, emotion and sentiment analysis, and computational humor.
Rada serves or has served on the editorial boards of the journals
Computational Linguistics, Language Resources and Evaluation,
Natural Language Engineering, and Research on Language and Computation. Her research has
been funded by the National Science Foundation, Google, the National Endowment for the
Humanities, and the State of Texas. She is the recipient of a National Science Foundation
CAREER award (2008) and a Presidential Early Career Award for Scientists and Engineers
(PECASE, 2009).

Roberto Pieraccini (www.robertopieraccini.com) is chief technology officer of SpeechCycle Inc. Roberto graduated in electrical engi-
neering at the University of Pisa, Italy, in 1980. In 1981 he started
working as a speech recognition researcher at CSELT, the research
institution of the Italian telephone operating company. In 1990 he
joined Bell Laboratories (Murray Hill, NJ) as a member of technical
staff where he was involved in speech recognition and spoken lan-
guage understanding research. He then joined AT&T Labs in 1996,
where he started working on spoken dialog research. In 1999 he was
director of R&D for SpeechWorks International. In 2003 he joined
IBM T. J. Watson Research where he managed the Advanced Con-
versational Interaction Technology department, and then joined SpeechCycle in 2005 as their
CTO. Roberto Pieraccini is the author of more than one hundred twenty papers and articles
on speech recognition, language modeling, character recognition, language understanding,
and automatic spoken dialog management. He is an ISCA and IEEE Fellow, a member of the
editorial board of the IEEE Signal Processing Magazine and of the International Journal
of Speech Technology. He is also a member of the Applied Voice Input Output Society and
Speech Technology Consortium boards.

John F. Pitrelli ([email protected]) is a member of the Multilingual Natural Language Processing department at the IBM
T. J. Watson Research Center in Yorktown Heights, New York. He
received S.B., S.M., and Ph.D. degrees in electrical engineering and
computer science from the Massachusetts Institute of Technology
in 1983, 1985, and 1990 respectively, with graduate work in speech
recognition and synthesis. Before his current position, he worked in
the Speech Technology Group at NYNEX Science & Technology,
Inc., in White Plains, New York; was a member of the IBM Pen
Technologies Group; and worked on speech synthesis and prosody
in the Human Language Technologies group at Watson. John’s
research interests include natural language processing, speech synthesis, speech recognition,
handwriting recognition, statistical language modeling, prosody, unstructured information
management, and confidence modeling for recognition. He has published forty papers and
holds four patents.

Sameer Pradhan ([email protected]) is a scientist at BBN Technologies in Cambridge, Massachusetts. He is the author
of a number of widely cited articles and chapters in the field of com-
putational semantics. He is currently creating the next generation
of semantic analysis engines and their applications, through algo-
rithmic innovation, wide distribution of research tools such as Au-
tomatic Statistical SEmantic Role Tagger (ASSERT), and through
the generation of rich, multilayer, multilingual, integrated resources,
such as OntoNotes, that serve as a platform. Eventually these mod-
els of semantics should replace the currently impoverished, mostly
word-based models, prevalent in most application domains, and help
take the area of language understanding to a new level of richness. Sameer received his Ph.D.
from the University of Colorado in 2005, and since then has been working at BBN Technolo-
gies developing the OntoNotes corpora as part of the DARPA Global Autonomous Language
Exploitation program. He is a member of ACL, and is a founding member of ACL’s Special
Interest Group for Annotation, promoting innovation in the area of annotation. He has reg-
ularly been on the program committees of various natural language processing conferences
and workshops such as ACL, HLT, EMNLP, CoNLL, COLING, LREC, and LAW. He is
also an accomplished chef.

Dan Roth ([email protected]) is a professor in the department of computer science and the Beckman Institute at the University of
Illinois at Urbana-Champaign. He is a Fellow of AAAI, a Univer-
sity of Illinois Scholar, and holds faculty positions at the statistics
and linguistics departments and at the Graduate School of Library
and Information Science. Professor Roth’s research spans theoretical
work in machine learning and intelligent reasoning with a specific
focus on learning and inference in natural language processing and
intelligent access to textual information. He has published over two
hundred papers in these areas and his papers have received mul-
tiple awards. He has developed advanced machine learning-based
tools for natural language applications that are being used widely by the research commu-
nity, including an award-winning semantic parser. He was the program chair of AAAI’11,
CoNLL’02, and ACL’03, and is or has been on the editorial board of several journals in his
research areas. He is currently an associate editor for the Journal of Artificial Intelligence
Research and the Machine Learning Journal. Professor Roth got his B.A. summa cum laude
in mathematics from the Technion, Israel, and his Ph.D. in computer science from Harvard
University.

Mark Sammons ([email protected]) is a principal research scientist working with the Cognitive Computation Group at the Uni-
versity of Illinois at Urbana-Champaign. His primary interests are
in natural language processing and machine learning, with a focus
on integrating diverse information sources in the context of textual
entailment. His work has focused on developing a textual entail-
ment framework that can easily incorporate new resources, design-
ing appropriate inference procedures for recognizing entailment, and
identifying and developing automated approaches to recognize and
represent implicit content in natural language text. Mark received
his M.Sc. in computer science from the University of Illinois in 2004
and his Ph.D. in mechanical engineering from the University of Leeds, England, in 2000.

Anoop Sarkar (www.cs.sfu.ca/∼anoop) is an associate professor
of computing science at Simon Fraser University in British
Columbia, Canada, where he codirects the Natural Language Lab-
oratory (http://natlang.cs.sfu.ca). He received his Ph.D. from the
Department of Computer and Information Sciences at the University
of Pennsylvania under Professor Aravind Joshi for his work on semi-
supervised statistical parsing and parsing for tree-adjoining gram-
mars. Anoop’s current research is focused on statistical parsing and
machine translation (exploiting syntax or morphology, or both). His
interests also include formal language theory and stochastic gram-
mars, in particular tree automata and tree-adjoining grammars.

Frank Schilder ([email protected]) is a lead
research scientist at the Research & Development department of
Thomson Reuters. He joined Thomson Reuters in 2004, where he
has been doing applied research on summarization technologies and
information extraction systems. His summarization work has been
implemented as the snippet generator for search results of West-
LawNext, the new legal research system produced by Thomson
Reuters. His current research activities involve the participation in
different research competitions such as the Text Analysis Conference
carried out by the National Institute of Standards and Technology.
He obtained a Ph.D. in cognitive science from the University of
Edinburgh, Scotland, in 1997. From 1997 to 2003, he was employed by the Department for
Informatics at the University of Hamburg, Germany, first as a postdoctoral researcher and
later as an assistant professor. Frank has authored several journal articles and book chapters,
including “Natural Language Processing: Overview” from the Encyclopedia of Language and
Linguistics (Elsevier, 2006), coauthored with Peter Jackson, the chief scientist of Thomson
Reuters. In 2011, he jointly won the Thomson Reuters Innovation challenge. He serves as
reviewer for journals in computational linguistics and as program committee member of
various conferences organized by the Association for Computational Linguistics.

Nico Schlaefer ([email protected]) is a Ph.D. candidate in the
School of Computer Science at Carnegie Mellon University and an
IBM Ph.D. Fellow. His research focus is the application of machine
learning techniques to natural language processing tasks. Schlae-
fer developed algorithms that enable question-answering systems to
find correct answers, even if the original information sources contain
little relevant content, and a flexible architecture that supports the
integration of such algorithms. Schlaefer is the primary author of
OpenEphyra, one of the most widely used open-source question-
answering systems. Nico also contributed a statistical source
expansion approach to Watson, the computer that won against
human champions in the Jeopardy! quiz show. His approach automatically extends knowl-
edge sources with related content from the Web and other large text corpora, making it
easier for Watson to find answers and supporting evidence.

Elizabeth Shriberg ([email protected]) is currently a prin-
cipal scientist at Microsoft; previously she was at SRI International
(Menlo Park, CA). She is also affiliated with the International
Computer Science Institute (Berkeley, CA) and CASL (University
of Maryland). She received a B.A. from Harvard (1987) and a
Ph.D. from the University of California–Berkeley (1994). Elizabeth’s
main interest is in modeling spontaneous speech using both lexi-
cal and prosodic information. Her work aims to combine linguistic
knowledge with corpora and techniques from automatic speech and
speaker recognition to advance both scientific understanding and
technology. She has published roughly two hundred papers in speech
science and technology and has served as associate editor of Language and Speech, on the
boards of Speech Communication and Computational Linguistics, on a range of conference
and workshop boards, on the ISCA Advisory Council, and on the ICSLP Permanent Coun-
cil. She has organized workshops and served on boards for the National Science Foundation,
the European Commission, and NWO (Netherlands), and has reviewed for an interdisciplinary
range of conferences, workshops, and journals (e.g., IEEE Transactions on Speech and
Audio Processing, Journal of the Acoustical Society of America, Nature, Journal of Pho-
netics, Computer Speech and Language, Journal of Memory and Language, Memory and
Cognition, Discourse Processes). In 2009 she received the ISCA Fellow Award. In 2010 she
became a Fellow of SRI.

Otakar Smrž ([email protected]) is a postdoctoral research
associate at Carnegie Mellon University in Qatar. He focuses on
methods of learning from comparable corpora to improve statisti-
cal machine translation from and into Arabic. Otakar completed his
doctoral studies in mathematical linguistics at Charles University in
Prague. He designed and implemented the ElixirFM computational
model of Arabic morphology using functional programming and has
developed other open source software for natural language process-
ing. He has been the principal investigator of the Prague Arabic
Dependency Treebank. Otakar used to work as a research scientist
at IBM Czech Republic, where he explored unsupervised semantic
parsing as well as acoustic modeling for multiple languages. Otakar is a cofounder of the
Džám-e Džam Language Institute in Prague.

Philipp Sorg ([email protected]) is a Ph.D. student at the
Karlsruhe Institute of Technology, Germany. He has a researcher
position at the Institute of Applied Informatics and Formal De-
scription Methods. Philipp graduated in computer science at the
University of Karlsruhe. His main research interest lies in multilin-
gual information retrieval. His special focus is the exploitation of
social semantics in the context of the Web 2.0. He has been involved
in the European research project Active, as well as in the national
research project Multipla (DFG).

David Suendermann ([email protected]) is the principal
speech scientist at SpeechCycle Labs (New York). Dr. Suendermann
has been working on various fields of speech technology research for
the last ten years. He worked at multiple industrial and academic
institutions including Siemens (Munich), Columbia University (New
York), University of Southern California (Los Angeles), Universitat
Politècnica de Catalunya (Barcelona), and Rheinisch-Westfälische
Technische Hochschule (Aachen, Germany). He has authored more
than sixty publications and patents, including a book and five book
chapters, and holds a Ph.D. from the Bundeswehr University in
Munich.

Gokhan Tur ([email protected]) is currently with Microsoft
working as a principal scientist. He received his B.S., M.S., and
Ph.D. from the Department of Computer Science, Bilkent Uni-
versity, Turkey, in 1994, 1996, and 2000, respectively. Between
1997 and 1999, Tur visited the Center for Machine Translation of
Carnegie Mellon University, then the Department of Computer Sci-
ence of Johns Hopkins University, and then the Speech Technol-
ogy and Research Lab of SRI International. He worked at AT&T
Labs–Research from 2001 to 2006 and at the Speech Technology
and Research Lab of SRI International from 2006 to 2010. His
research interests include spoken language understanding, speech
and language processing, machine learning, and information retrieval and extraction. Tur
has coauthored more than one hundred papers published in refereed journals or books and
presented at international conferences. He is the editor of Spoken Language Understanding:
Systems for Extracting Semantic Information from Speech (Wiley, 2011). Dr. Tur is a se-
nior member of IEEE, ACL, and ISCA, was a member of the IEEE Signal Processing Society
(SPS) Speech and Language Technical Committee (SLTC) for 2006–2008, and is currently
an associate editor for IEEE Transactions on Audio, Speech, and Language Processing.

V. G. Vinod Vydiswaran ([email protected]) is currently
a Ph.D. student in the Department of Computer Science at the
University of Illinois, Urbana-Champaign. His thesis is on modeling
information trustworthiness on the Web, and he is advised by profes-
sors ChengXiang Zhai and Dan Roth. His research interests include
text informatics, natural language processing, machine learning, and
information extraction. V. G. Vinod’s work has included developing
a textual entailment system and applying textual entailment to rela-
tion extraction and information retrieval. He received his M.S. from
Indian Institute of Technology-Bombay in 2004, where he worked on
conditional models for information extraction with Professor Sunita
Sarawagi. Later, he worked at Yahoo! Research & Development Center at Bangalore, India,
on scaling information extraction technologies over the Web.

Janyce Wiebe ([email protected]) is a professor of computer sci-
ence and codirector of the Intelligent Systems Program at the Uni-
versity of Pittsburgh. Her research with students and colleagues has
been in discourse processing, pragmatics, word-sense disambigua-
tion, and probabilistic classification in natural language processing.
A major concentration of her research is subjectivity analysis, rec-
ognizing and interpreting expressions of opinions and sentiments
in text, to support natural language processing applications such as
question answering, information extraction, text categorization, and
summarization. Janyce’s current and past professional roles include
ACL program cochair, NAACL program chair, NAACL executive
board member, editorial board member of Computational Linguistics and Language Resources
and Evaluation, AAAI workshop cochair, ACM special interest group on artificial intelligence
(SIGART) vice-chair, and ACM-SIGART/AAAI doctoral consortium chair.

Hyun-Jo You ([email protected]) is currently a lecturer in the
Department of Linguistics, Seoul National University. He received his
Ph.D. from Seoul National University. His research interests include
quantitative linguistics, statistical language modeling, and comput-
erized corpus analysis. He is especially interested in studying the
morpho-syntactic and discourse structure in morphologically rich,
free word order languages such as Korean, Czech, and Russian.

Liang Zhou ([email protected]) is a research scientist at Thomson
Reuters Corporation. She has extensive knowledge in natural lan-
guage processing, including sentiment analysis, automated text sum-
marization, text understanding, information extraction, question
answering, and information distillation. During her graduate stud-
ies at the Information Sciences Institute, she was actively involved
in various government-sponsored projects, such as the NIST Document
Understanding Conferences and the DARPA Global Autonomous Lan-
guage Exploitation. Dr. Zhou received her Ph.D. from the Univer-
sity of Southern California in 2006, M.S. from Stanford University
in 2001, and B.S. from the University of Tennessee in 1999, all in
computer science.
Chapter 1
Finding the Structure of Words
Otakar Smrž and Hyun-Jo You

Human language is a complicated thing. We use it to express our thoughts, and through
language, we receive information and infer its meaning. Linguistic expressions are not unor-
ganized, though. They show structure of different kinds and complexity and consist of more
elementary components whose co-occurrence in context refines the notions they refer to in
isolation and implies further meaningful relations between them.
Trying to understand language en bloc is not a viable approach. Linguists have developed
whole disciplines that look at language from different perspectives and at different levels of
detail. The point of morphology, for instance, is to study the variable forms and functions
of words, while syntax is concerned with the arrangement of words into phrases, clauses,
and sentences. Word structure constraints due to pronunciation are described by phonology,
whereas conventions for writing constitute the orthography of a language. The meaning of
a linguistic expression is its semantics, and etymology and lexicology cover especially the
evolution of words and explain the semantic, morphological, and other links among them.
Words are perhaps the most intuitive units of language, yet they are in general tricky to
define. Knowing how to work with them allows, in particular, the development of syntactic
and semantic abstractions and simplifies other advanced views on language. Morphology is
an essential part of language processing, and in multilingual settings, it becomes even more
important.
In this chapter, we explore how to identify words of distinct types in human languages,
and how the internal structure of words can be modeled in connection with the grammatical
properties and lexical concepts the words should represent. The discovery of word structure
is morphological parsing.
How difficult can such tasks be? It depends. In many languages, words are delimited in
the orthography by whitespace and punctuation. But in many other languages, the writing
system leaves it up to the reader to tell words apart or determine their exact phonologi-
cal forms. Some languages use words whose form need not change much with the varying
context; others are highly sensitive about the choice of word forms according to particular
syntactic and semantic constraints and restrictions.


1.1 Words and Their Components


Words are defined in most languages as the smallest linguistic units that can form a com-
plete utterance by themselves. The minimal parts of words that deliver aspects of meaning
to them are called morphemes. Depending on the means of communication, morphemes are
spelled out via graphemes—symbols of writing such as letters or characters—or are realized
through phonemes, the distinctive units of sound in spoken language.1 It is not always easy
to decide and agree on the precise boundaries discriminating words from morphemes and
from phrases [1, 2].

1.1.1 Tokens
Suppose, for a moment, that words in English are delimited only by whitespace and punc-
tuation [3], and consider Example 1–1:
Example 1–1: Will you read the newspaper? Will you read it? I won’t read it.
If we confront our assumption with insights from etymology and syntax, we notice two
words here: newspaper and won’t. Being a compound word, newspaper has an interesting
derivational structure. We might wish to describe it in more detail, once there is a lexicon or
some other linguistic evidence on which to build the possible hypotheses about the origins of
the word. In writing, newspaper and the associated concept is distinguished from the isolated
news and paper. In speech, however, the distinction is far from clear, and identification of
words becomes an issue of its own.
For reasons of generality, linguists prefer to analyze won’t as two syntactic words, or
tokens, each of which has its independent role and can be reverted to its normalized form.
The structure of won’t could be parsed as will followed by not. In English, this kind of
tokenization and normalization may apply to just a limited set of cases, but in other
languages, these phenomena have to be treated in a less trivial manner.
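
To make this concrete, here is a minimal Python sketch of such tokenization and normalization for English contractions. The contraction table and the n't rule are illustrative simplifications of our own, not a complete treatment:

```python
import re

# A tiny hand-written table of exceptional forms; mapping won't to
# will + not follows the analysis above, the other entries are
# illustrative.
SPECIAL = {"won't": ["will", "not"], "can't": ["can", "not"]}

def tokenize(sentence):
    """Split on whitespace and punctuation, then normalize contractions."""
    tokens = []
    for word in re.findall(r"[\w']+|[^\w\s]", sentence):
        low = word.lower()
        if low in SPECIAL:
            tokens.extend(SPECIAL[low])
        elif low.endswith("n't"):          # didn't -> did + not
            tokens.extend([word[:-3], "not"])
        else:
            tokens.append(word)
    return tokens

print(tokenize("Will you read it? I won't read it."))
# ['Will', 'you', 'read', 'it', '?', 'I', 'will', 'not', 'read', 'it', '.']
```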
In Arabic or Hebrew [4], certain tokens are concatenated in writing with the preceding or
the following ones, possibly changing their forms as well. The underlying lexical or syntactic
units are thereby blurred into one compact string of letters and no longer appear as distinct
words. Tokens behaving in this way can be found in various languages and are often called
clitics.
In the writing systems of Chinese, Japanese [5], and Thai, whitespace is not used to
separate words. The units that are delimited graphically in some way are sentences or
clauses. In Korean, character strings are called eojeol ‘word segment’ and roughly correspond
to speech or cognitive units, which are usually larger than words and smaller than clauses [6],
as shown in Example 1–2:
Example 1–2: 학생들에게만 주셨는데
hak.sayng.tul.ey.key.man cwu.syess.nun.te 2
haksayng-tul-eykey-man cwu-si-ess-nunte
student+plural +dative+only give+honorific+past+while
while (he/she) gave (it) only to the students

1. Signs used in sign languages are composed of elements denoted as phonemes, too.
2. We use the Yale romanization of the Korean script and indicate its original characters by dots. Hyphens
mark morphological boundaries, and tokens are separated by plus symbols.

Nonetheless, the elementary morphological units are viewed as having their own syntactic
status [7]. In such languages, tokenization, also known as word segmentation, is the
fundamental step of morphological analysis and a prerequisite for most language processing
applications.

1.1.2 Lexemes
By the term word, we often denote not just the one linguistic form in the given context
but also the concept behind the form and the set of alternative forms that can express
it. Such sets are called lexemes or lexical items, and they constitute the lexicon of a lan-
guage. Lexemes can be divided by their behavior into the lexical categories of verbs, nouns,
adjectives, conjunctions, particles, or other parts of speech. The citation form of a lexeme,
by which it is commonly identified, is also called its lemma.
When we convert a word into its other forms, such as turning the singular mouse into
the plural mice or mouses, we say we inflect the lexeme. When we transform a lexeme into
another one that is morphologically related, regardless of its lexical category, we say we
derive the lexeme: for instance, the nouns receiver and reception are derived from the verb
to receive.
Example 1–3: Did you see him? I didn’t see him. I didn’t see anyone.

Example 1–3 presents the problem of tokenization of didn’t and the investigation of
the internal structure of anyone. In the paraphrase I saw no one, the lexeme to see would
be inflected into the form saw to reflect its grammatical function of expressing positive
past tense. Likewise, him is the oblique case form of he or even of a more abstract lexeme
representing all personal pronouns. In the paraphrase, no one can be perceived as the
minimal word synonymous with nobody. The difficulty with the definition of what counts as
a word need not pose a problem for the syntactic description if we understand no one as
two closely connected tokens treated as one fixed element.
In the Czech translation of Example 1–3, the lexeme vidět ‘to see’ is inflected for past
tense, in which forms comprising two tokens are produced in the second and first person
(i.e., viděla jsi ‘you-fem-sg saw’ and neviděla jsem ‘I-fem-sg did not see’). Negation in
Czech is an inflectional parameter rather than just syntactic and is marked both in the verb
and in the pronoun of the latter response, as in Example 1–4:
Example 1–4: Vidělas ho? Neviděla jsem ho. Neviděla jsem nikoho.
saw+you-are him? not-saw I-am him. not-saw I-am no-one.

Here, vidělas is the contracted form of viděla jsi ‘you-fem-sg saw’. The s of jsi ‘you are’
is a clitic, and due to free word order in Czech, it can be attached to virtually any part of
speech. We could thus ask a question like Nikohos neviděla? ‘Did you see no one?’ in which
the pronoun nikoho ‘no one’ is followed by this clitic.

1.1.3 Morphemes
Morphological theories differ on whether and how to associate the properties of word forms
with their structural components [8, 9, 10, 11]. These components are usually called seg-
ments or morphs. The morphs that by themselves represent some aspect of the meaning
of a word are called morphemes of some function.

Human languages employ a variety of devices by which morphs and morphemes are
combined into word forms. The simplest morphological process concatenates morphs one by
one, as in dis-agree-ment-s, where agree is a free lexical morpheme and the other elements
are bound grammatical morphemes contributing some partial meaning to the whole word.
In a more complex scheme, morphs can interact with each other, and their forms may
become subject to additional phonological and orthographic changes denoted as morpho-
phonemic. The alternative forms of a morpheme are termed allomorphs.
Examples of morphological alternation and phonologically dependent choice of the form
of a morpheme are abundant in the Korean language. In Korean, many morphemes change
their forms systematically with the phonological context. Example 1–5 lists the allomorphs
-ess-, -ass-, -yess- of the temporal marker indicating past tense. The first two alter according
to the phonological condition of the preceding verb stem; the last one is used especially for
the verb ha- ‘do’. The appropriate allomorph is merely concatenated after the stem, or it can
be further contracted with it, as was -si-ess- into -syess- in Example 1–2. During morpho-
logical parsing, normalization of allomorphs into some canonical form of the morpheme is
desirable, especially because the contraction of morphs interferes with simple segmentation:
Example 1–5: concatenated contracted
(a) 보았- po-ass- 봤- pwass- ‘have seen’
(b) 가지었- ka.ci-ess- 가졌- ka.cyess- ‘have taken’
(c) 하였- ha-yess- 했- hayss- ‘have done’
(d) 되었- toy-ess- 됐- twayss- ‘have become’
(e) 놓았- noh-ass- 놨- nwass- ‘have put’
Contractions (a, b) are ordinary but require attention because two characters are reduced
into one. Other types (c, d, e) are phonologically unpredictable, or lexically dependent. For
example, coh-ass- ‘have been good’ may never be contracted, whereas noh- and -ass- are
merged into nwass- in (e).
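
The selection among such allomorphs can be sketched in code. The following Python fragment, working on Yale romanization, chooses the past-tense allomorph by a simplified vowel-harmony rule (-ass- after a bright last vowel, -yess- for ha- ‘do’, -ess- otherwise); the vowel inventory and the rule are our own simplifications, and contraction is not modeled:

```python
import re

# Yale vowel digraphs are matched before single vowel letters.
VOWEL = re.compile(r"way|wey|ay|ey|oy|uy|wa|we|wi|wu|ya|ye|yo|yu|a|e|i|o|u")
BRIGHT = {"a", "o", "wa", "ya", "yo"}      # bright (yang) vowels select -ass-

def past_tense(stem):
    """Concatenate the appropriate past-tense allomorph after a verb stem."""
    if stem == "ha":                       # lexical exception: ha- 'do'
        return stem + "-yess-"
    vowels = VOWEL.findall(stem)
    return stem + ("-ass-" if vowels and vowels[-1] in BRIGHT else "-ess-")

for stem in ["po", "kaci", "ha", "toy", "noh"]:
    print(past_tense(stem))
# po-ass-  kaci-ess-  ha-yess-  toy-ess-  noh-ass-
```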
There are yet other linguistic devices of word formation to account for, as the morpho-
logical process itself can get less trivial. The concatenation operation can be complemented
with infixation or intertwining of the morphs, which is common, for instance, in Arabic.
Nonconcatenative inflection by modification of the internal vowel of a word occurs even in
English: compare the sounds of mouse and mice, see and saw, read and read.
Notably in Arabic, internal inflection takes place routinely and has a yet different quality.
The internal parts of words, called stems, are modeled with root and pattern morphemes.
Word structure is then described by templates abstracting away from the root but showing
the pattern and all the other morphs attached to either side of it.

Example 1–6: hl stqrO h*h AljrA}d? 3
             hal sa-taqraʾu hāḏihi ’l-ǧarāʾida?
             whether will+you-read this the-newspapers?
             hl stqrWhA? ln OqrOhA.
             hal sa-taqraʾuhā? lan aqraʾahā.
             whether will+you-read+it? not-will I-read+it.

3. The original Arabic script is transliterated using Buckwalter notation. For readability, we also provide
the standard phonological transcription, which reduces ambiguity.

The meaning of Example 1–6 is similar to that of Example 1–1, only the phrase
hāḏihi ’l-ǧarāʾida refers to ‘these newspapers’. While sa-taqraʾu ‘you will read’ combines
the future marker sa- with the imperfective second-person masculine singular verb taqraʾu
in the indicative mood and active voice, sa-taqraʾuhā ‘you will read it’ also adds the cliticized
feminine singular personal pronoun in the accusative case.4
The citation form of the lexeme to which taqraʾu ‘you-masc-sg read’ belongs is qaraʾa,
roughly ‘to read’. This form is classified by linguists as the basic verbal form represented
by the template faʿal merged with the consonantal root q r ʾ, where the f ʿ l symbols of the
template are substituted by the respective root consonants. Inflections of this lexeme can
modify the pattern faʿal of the stem of the lemma into fʿal and concatenate it, under rules
of morphophonemic changes, with further prefixes and suffixes. The structure of taqraʾu is
thus parsed into the template ta-fʿal-u and the invariant root.
The word al-ǧarāʾida ‘the newspapers’ in the accusative case and definite state is another
example of internal inflection. Its structure follows the template al-faʿāʾil-a with the root ǧ
r d. This word is the plural of ǧarīdah ‘newspaper’ with the template faʿīl-ah. The links
between singular and plural templates are subject to convention and have to be declared in
the lexicon.
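
A root-and-pattern analysis of this kind is easy to prototype. The sketch below interdigitates a consonantal root into a template; the digit-slot notation (1, 2, 3 standing for the root consonants, in place of the f ʿ l placeholders used above) is a convention of our own for readability:

```python
def interdigitate(root, pattern):
    """Substitute the root consonants into the digit slots of a pattern."""
    return "".join(root[int(ch) - 1] if ch in "123" else ch for ch in pattern)

ROOT = ("q", "r", "ʾ")                     # the root of qaraʾa 'to read'

print(interdigitate(ROOT, "1a2a3"))        # qaraʾ     (stem, pattern faʿal)
print(interdigitate(ROOT, "ta-12a3-u"))    # ta-qraʾ-u (template ta-fʿal-u)
```

A full analyzer would of course run this in reverse, hypothesizing a root and pattern from a surface form and validating them against the lexicon.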
Irrespective of the morphological processes involved, some properties or features of a
word need not be apparent explicitly in its morphological structure. Its existing structural
components may be paired with and depend on several functions simultaneously but may
have no particular grammatical interpretation or lexical meaning.
The -ah suffix of ǧarīdah ‘newspaper’ corresponds with the inherent feminine gender of
the lexeme. In fact, the -ah morpheme is commonly, though not exclusively, used to mark the
feminine singular forms of adjectives: for example, ǧadīd becomes ǧadīdah ‘new’. However,
the -ah suffix can be part of words that are not feminine, and there its function can be seen
as either emptied or overridden [12]. In general, linguistic forms should be distinguished
from functions, and not every morph can be assumed to be a morpheme.

1.1.4 Typology
Morphological typology divides languages into groups by characterizing the prevalent mor-
phological phenomena in those languages. It can consider various criteria, and during the
history of linguistics, different classifications have been proposed [13, 14]. Let us outline the
typology that is based on quantitative relations between words, their morphemes, and their
features:

Isolating, or analytic, languages include no or relatively few words that would comprise
more than one morpheme (typical members are Chinese, Vietnamese, and Thai; ana-
lytic tendencies are also found in English).
Synthetic languages can combine more morphemes in one word and are further divided
into agglutinative and fusional languages.
Agglutinative languages have morphemes associated with only a single function at a time
(as in Korean, Japanese, Finnish, Tamil, etc.).

4. The logical plural of things is formally treated as feminine singular in Arabic.



Fusional languages are defined by a feature-per-morpheme ratio higher than one (as in
Arabic, Czech, Latin, Sanskrit, German, etc.).

In accordance with the notions about word formation processes mentioned earlier, we
can also discern:

Concatenative languages linking morphs and morphemes one after another.


Nonlinear languages allowing structural components to merge nonsequentially to apply
tonal morphemes or change the consonantal or vocalic templates of words.

While some morphological phenomena, such as orthographic collapsing, phonological
contraction, or complex inflection and derivation, are more dominant in some languages
than in others, in principle, we can find, and should be able to deal with, instances of these
phenomena across different language families and typological classes.

1.2 Issues and Challenges


Morphological parsing tries to eliminate or alleviate the variability of word forms to provide
higher-level linguistic units whose lexical and morphological properties are explicit and well
defined. It attempts to remove unnecessary irregularity and give limits to ambiguity, both
of which are present inherently in human language.
By irregularity, we mean existence of such forms and structures that are not described
appropriately by a prototypical linguistic model. Some irregularities can be understood by
redesigning the model and improving its rules, but other lexically dependent irregularities
often cannot be generalized.
Ambiguity is indeterminacy in interpretation of expressions of language. Next to acci-
dental ambiguity and ambiguity due to lexemes having multiple senses, we note the issue of
syncretism, or systematic ambiguity.
Morphological modeling also faces the problem of productivity and creativity in language,
by which unconventional but perfectly meaningful new words or new senses are coined.
Usually, though, words that are not licensed in some way by the lexicon of a morphological
system will remain completely unparsed. This unknown word problem is particularly
severe in speech or writing that gets out of the expected domain of the linguistic model,
such as when special terms or foreign names are involved in the discourse or when multiple
languages or dialects are mixed together.

1.2.1 Irregularity
Morphological parsing is motivated by the quest for generalization and abstraction in the
world of words. Immediate descriptions of given linguistic data may not be the ultimate
ones, due to either their inadequate accuracy or inappropriate complexity, and better for-
mulations may be needed. The design principles of the morphological model are therefore
very important.
In Arabic, the deeper study of the morphological processes that are in effect during
inflection and derivation, even for the so-called irregular words, is essential for mastering the
whole morphological and phonological system. With the proper abstractions made, irregular
morphology can be seen as merely enforcing some extended rules, the nature of which is
phonological, over the underlying or prototypical regular word forms [15, 16].
Example 1–7: hl rOyth? lm Orh. lm Or OHdA.
             hal raʾaytihi? lam arahu. lam ara aḥadan.
             whether you-saw+him? not-did I-see+him. not-did I-see anyone.
In Example 1–7, raʾayti is the second-person feminine singular perfective verb in active
voice, a member of the raʾā ‘to see’ lexeme of the r ʾ y root. The prototypical, regularized
pattern for this citation form is faʿal, as we saw with qaraʾa in Example 1–6. Alternatively,
we could assume the pattern of raʾā to be faʿā, thereby asserting in a compact way that
the final root consonant and its vocalic context are subject to the particular phonological
change, resulting in raʾā like faʿā instead of raʾay like faʿal. The occurrence of this change
in the citation form may have possible implications for the morphological behavior of the
whole lexeme.
Table 1–1 illustrates differences between a naive model of word structure in Arabic and
the model proposed in Smrž [12] and Smrž and Bielický [17] where morphophonemic merge
rules and templates are involved. Morphophonemic templates capture morphological pro-
cesses by just organizing stem patterns and generic affixes without any context-dependent
variation of the affixes or ad hoc modification of the stems. The merge rules, indeed very
terse, then ensure that such structured representations can be converted into exactly the
surface forms, both orthographic and phonological, used in the natural language. Applying
the merge rules is independent of and irrespective of any grammatical parameters or infor-
mation other than that contained in a template. Most morphological irregularities are thus
successfully removed.

Table 1–1: Discovering the regularity of Arabic morphology using
morphophonemic templates, where uniform structural operations apply to
different kinds of stems. In rows, surface forms S of qaraʾa ‘to read’ and raʾā
‘to see’ and their inflections are analyzed into immediate I and
morphophonemic M templates, in which dashes mark the structural boundaries
where merge rules are enforced. The outer columns of the table correspond to
P perfective and I imperfective stems declared in the lexicon; the inner columns
treat active verb forms of the following morphosyntactic properties: I indicative,
S subjunctive, J jussive mood; 1 first, 2 second, 3 third person; M masculine, F
feminine gender; S singular, P plural number

P-stem   P−3MS     P−2FS      P−3MP     II2MS       IS1−S      IJ1−S     I-stem

qaraʾ    qaraʾa    qaraʾti    qaraʾū    taqraʾu     aqraʾa     aqraʾ     qraʾ     S
faʿal    faʿal-a   faʿal-ti   faʿal-ū   ta-fʿal-u   a-fʿal-a   a-fʿal    fʿal     I
faʿal    faʿal-a   faʿal-ti   faʿal-ū   ta-fʿal-u   a-fʿal-a   a-fʿal-   fʿal     M
...      ...-a     ...-ti     ...-ū     ta-...-u    a-...-a    a-...-    ...
faʿā     faʿā-a    faʿā-ti    faʿā-ū    ta-fʿā-u    a-fʿā-a    a-fʿā-    fʿā      M
faʿā     faʿā      faʿal-ti   faʿ-aw    ta-fʿā      a-fʿā      a-fʿa     fʿā      I
raʾā     raʾā      raʾayti    raʾaw     tarā        arā        ara       rā       S

Table 1–2: Examples of major Korean irregular verb classes compared
with regular verbs

Base Form (-e) Meaning Comment


집- cip- 집어 cip.e ‘pick’ regular
깁- kip- 기워 ki.we ‘sew’ p-irregular
믿- mit- 믿어 mit.e ‘believe’ regular
싣- sit- 실어 sil.e ‘load’ t-irregular
씻- ssis- 씻어 ssis.e ‘wash’ regular
잇- is- 이어 i.e ‘link’ s-irregular
낳- nah- 낳아 nah.a ‘bear’ regular
까맣- kka.mah- 까매 kka.may ‘be black’ h-irregular
치르- chi.lu- 치러 chi.le ‘pay’ regular u-ellipsis
이르- i.lu- 이르러 i.lu.le ‘reach’ le-irregular
흐르- hu.lu- 흘러 hul.le ‘flow’ lu-irregular

In contrast, some irregularities are bound to particular lexemes or contexts, and can-
not be accounted for by general rules. Korean irregular verbs provide examples of such
irregularities.
Korean shows exceptional constraints on the selection of grammatical morphemes. It
is hard to find irregular inflection in other agglutinative languages: two irregular verbs
in Japanese [18], one in Finnish [19]. These languages are abundant with morphological
alternations that are formalized by precise phonological rules. Korean additionally features
lexically dependent stem alternation. As in many other languages, i- ‘be’ and ha- ‘do’ have
unique irregular endings. Other irregular verbs are classified by the stem final phoneme.
Table 1–2 compares major irregular verb classes with regular verbs in the same phonological
condition.

1.2.2 Ambiguity
Morphological ambiguity is the possibility that word forms be understood in multiple ways
out of the context of their discourse. Word forms that look the same but have distinct
functions or meaning are called homonyms.
Ambiguity is present in all aspects of morphological processing and language processing
at large. Morphological parsing is not concerned with complete disambiguation of words in
their context, however; it can effectively restrict the set of valid interpretations of a given
word form [20, 21].
In Korean, homonyms are one of the most problematic objects in morphological analysis
because they are pervasive among frequent lexical items. Table 1–3 arranges homonyms on
the basis of their behavior with different endings. Example 1–8 shows homonymy spanning
nouns and verbs.

Table 1–3: Systematic homonyms arise as verbs combined with endings
in Korean

(-ko) (-e) (-un) Meaning


묻고 mwut.ko 묻어 mwut.e 묻은 mwut.un ‘bury’
묻고 mwut.ko 물어 mwul.e 물은 mwul.un ‘ask’
물고 mwul.ko 물어 mwul.e 문 mwun ‘bite’
걷고 ket.ko 걷어 ket.e 걷은 ket.un ‘roll up’
걷고 ket.ko 걸어 kel.e 걸은 kel.un ‘walk’
걸고 kel.ko 걸어 kel.e 건 ken ‘hang’
굽고 kwup.ko 굽어 kwup.e 굽은 kwup.un ‘be bent’
굽고 kwup.ko 구워 kwu.we 구운 kwu.wun ‘bake’
이르고 i.lu.ko 이르러 i.lu.le 이른 i.lun ‘reach’
이르고 i.lu.ko 일러 il.le 이른 i.lun ‘say’

Example 1–8: 난 ‘orchid’ ← 난 nan ‘orchid’
난 ‘I’ ← 나 na ‘I’ + -n (topic)
난 ‘which flew’ ← 날- nal- ‘fly’ + -n (relative, past)
난 ‘which got out’ ← 나- na- ‘get out’ + -n (relative, past)

We could also consider ambiguity in the senses of the noun nan, according to the Standard
Korean Language Dictionary: nan1 ‘egg’, nan2 ‘revolt’, nan5 ‘section (in newspaper)’, nan6
‘orchid’, plus several infrequent readings.
Arabic is a language of rich morphology, both derivational and inflectional. Because
Arabic script usually does not encode short vowels and omits yet some other diacritical
marks that would record the phonological form exactly, the degree of its morphological
ambiguity is considerably increased. In addition, Arabic orthography collapses certain word
forms together. The problem of morphological disambiguation of Arabic encompasses not
only the resolution of the structural components of words and their actual morphosyntactic
properties (i.e., morphological tagging [22, 23, 24]) but also tokenization and normalization
[25], lemmatization, stemming, and diacritization [26, 27, 28].
When inflected syntactic words are combined in an utterance, additional phonological
and orthographic changes can take place, as shown in Figure 1–1. In Sanskrit, one such
euphony rule is known as external sandhi [29, 30]. Inverting sandhi during tokenization is
usually nondeterministic in the sense that it can provide multiple solutions. In any language,
tokenization decisions may impose constraints on the morphosyntactic properties of the
tokens being reconstructed, which then have to be respected in further processing. The
tight coupling between morphology and syntax has inspired proposals for disambiguating
them jointly rather than sequentially [4].
Czech is a highly inflected fusional language. Unlike agglutinative languages, inflec-
tional morphemes often represent several functions simultaneously, and there is no partic-
ular one-to-one correspondence between their forms and functions. Inflectional paradigms

dirāsatī (drAsty)       → dirāsatu ī (drAsp y)
                        → dirāsati ī (drAsp y)
                        → dirāsata ī (drAsp y)
muʿallimīya (mElmy)     → muʿallimū ī (mElmw y)
                        → muʿallimī ī (mElmy y)
katabtumūhā (ktbtmwhA)  → katabtum hā (ktbtm hA)
iǧrāʾuhu (IjrAWh)       → iǧrāʾu hu (IjrA' h)
iǧrāʾihi (IjrA}h)       → iǧrāʾi hu (IjrA' h)
iǧrāʾahu (IjrA'h)       → iǧrāʾa hu (IjrA' h)
li-’l-asafi (llOsf)     → li ’l-asafi (l AlOsf)

Figure 1–1: Complex tokenization and normalization of euphony in Arabic, with Buckwalter
transliteration in parentheses. Three nominal cases are expressed by the same word form with
dirāsatī ‘my study’ and muʿallimīya ‘my teachers’, but the original case endings are distinct. In
katabtumūhā ‘you-masc-pl wrote them’, the liaison vowel ū is dropped when tokenized. Special
attention is needed to normalize some orthographic conventions, such as the interaction of iǧrāʾ
‘carrying out’ and the cliticized hu ‘his’ respecting the case ending or the merge of the definite
article of asaf ‘regret’ with the preposition li ‘for’

(i.e., schemes for finding the form of a lexeme associated with the required properties) in
Czech are of numerous kinds, yet they tend to include nonunique forms in them.
Table 1–4 lists the paradigms of several common Czech words. Inflectional paradigms
for nouns depend on the grammatical gender and the phonological structure of a lexeme.
The individual forms in a paradigm vary with grammatical number and case, which are the
free parameters imposed only by the context in which a word is used.
Looking at the morphological variation of the word stavení ‘building’, we might wonder
why we should distinguish all the cases for it when this lexeme can take only four different
forms. Is the detail of the case system appropriate? The answer is yes, because we can find
linguistic evidence that leads to this case category abstraction. Just consider other words of
the same meaning in place of stavení in various contexts. We conclude that there is indeed
a case distinction made by the underlying system, but it need not necessarily be expressed
clearly and uniquely in the form of words.
The morphological phenomenon that some words or word classes show instances of
systematic homonymy is called syncretism. In particular, homonymy can occur due to
neutralization and uninflectedness with respect to some morphosyntactic parameters.
These cases of morphological syncretism are distinguished by the ability of the context to
demand the morphosyntactic properties in question, as stated by Baerman, Brown, and
Corbett [10, p. 32]:
Whereas neutralization is about syntactic irrelevance as reflected in morphology,
uninflectedness is about morphology being unresponsive to a feature that is
syntactically relevant.
For example, it seems fine for syntax in Czech or Arabic to request the personal pronoun
of the first-person feminine singular, equivalent to ‘I’, despite it being homonymous with

Table 1–4: Morphological paradigms of the Czech words dům ‘house’,
budova ‘building’, stavba ‘building’, stavení ‘building’. Despite systematic
ambiguities in them, the space of inflectional parameters could not be
reduced without losing the ability to capture all distinct forms elsewhere: S
singular, P plural number; 1 nominative, 2 genitive, 3 dative, 4 accusative, 5
vocative, 6 locative, 7 instrumental case

     Masculine inanimate   Feminine    Feminine   Neuter

S1   dům                   budova      stavba     stavení
S2   domu                  budovy      stavby     stavení
S3   domu                  budově      stavbě     stavení
S4   dům                   budovu      stavbu     stavení
S5   dome                  budovo      stavbo     stavení
S6   domu / domě           budově      stavbě     stavení
S7   domem                 budovou     stavbou    stavením
P1   domy                  budovy      stavby     stavení
P2   domů                  budov       staveb     stavení
P3   domům                 budovám     stavbám    stavením
P4   domy                  budovy      stavby     stavení
P5   domy                  budovy      stavby     stavení
P6   domech                budovách    stavbách   staveních
P7   domy                  budovami    stavbami   staveními

the first-person masculine singular. The reason is that for some other values of the person
category, the forms of masculine and feminine gender are different, and there exist syntactic
dependencies that do take gender into account. It is not the case that the first-person singular
pronoun would have no gender nor that it would have both. We just observe uninflectedness
here. On the other hand, we might claim that in English or Korean, the gender category is
syntactically neutralized if it ever was present, and the nuances between he and she, him
and her, his and hers are only semantic.
With the notion of paradigms and syncretism in mind, we should ask what is the minimal
set of combinations of morphosyntactic inflectional parameters that covers the inflectional
variability in a language. Morphological models that would like to define a joint system of
underlying morphosyntactic properties for multiple languages would have to generalize the
parameter space accordingly and neutralize any systematically void configurations.

1.2.3 Productivity
Is the inventory of words in a language finite, or is it unlimited? This question leads
directly to discerning two fundamental approaches to language, summarized in the dis-
tinction between langue and parole by Ferdinand de Saussure, or in the competence versus
performance duality by Noam Chomsky.
In one view, language can be seen as simply a collection of utterances (parole) actually
pronounced or written (performance). This ideal data set can in practice be approximated
by linguistic corpora, which are finite collections of linguistic data that are studied with
empirical methods and can be used for comparison when linguistic models are developed.

Yet, if we consider language as a system (langue), we discover in it structural devices
like recursion, iteration, or compounding that allow us to produce (competence) an infinite set
of concrete linguistic utterances. This general potential holds for morphological processes as
well and is called morphological productivity [31, 32].
We denote the set of word forms found in a corpus of a language as its vocabulary. The
members of this set are word types, whereas every original instance of a word form is a word
token.
The distribution of words [33] or other elements of language follows the “80/20 rule,”
also known as the law of the vital few. It says that most of the word tokens in a given corpus
can be identified with just a small fraction of the word types in its vocabulary, and words from the rest
of the vocabulary occur much less commonly if not rarely in the corpus. Furthermore, new,
unexpected words will always appear as the collection of linguistic data is enlarged.
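
This distribution is easy to observe empirically. The following Python sketch counts word types and their token coverage; the toy corpus stands in for any tokenized text:

```python
from collections import Counter

tokens = "the cat sat on the mat and the dog sat on the log".split()
counts = Counter(tokens)

total, covered = sum(counts.values()), 0
for rank, (word_type, freq) in enumerate(counts.most_common(), start=1):
    covered += freq
    print(f"rank {rank:2d}  {word_type:4s}  freq {freq}  coverage {covered/total:.0%}")
```

On realistic corpora, the printed coverage climbs steeply over the first few ranks and then flattens into a long tail of rare types.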
In Czech, negation is a productive morphological operation. Verbs, nouns, adjectives, and
adverbs can be prefixed with ne- to define the complementary lexical concept. In Example
1–9, budeš ‘you will be’ is the second-person singular of být ‘to be’, and nebudu ‘I will not
be’ is the first-person singular of nebýt, the negated být. We could easily have číst ‘to read’
and nečíst ‘not to read’, or we could create an adverbial phrase like noviny nenoviny that
would express ‘indifference to newspapers’ in general:

Example 1–9: Budeš číst ty noviny? Budeš je číst? Nebudu je číst.
             you-will read the newspaper? you-will it read? not-I-will it read.

Example 1–9 has the meaning of Example 1–1 and Example 1–6. The word noviny
‘newspaper’ exists only in plural whether it signifies one piece of newspaper or many of
them. We can literally translate noviny as the plural of novina ‘news’ to see the origins of
the word as well as the fortunate analogy with English.
It is conceivable to include all negated lexemes into the lexicon and thereby again achieve
a finite number of word forms in the vocabulary. Generally, though, the richness of a mor-
phological system of a language can make this approach highly impractical.
Most languages contain words that allow some of their structural components to repeat
freely. Consider the prefix pra- related to a notion of ‘generation’ in Czech and how it can
or cannot be iterated, as shown in Example 1–10:

Example 1–10: vnuk ‘grandson’      pravnuk ‘great-grandson’
              prapra...vnuk ‘great-great-...grandson’
              les ‘forest’         prales ‘jungle’, ‘virgin forest’
              zdroj ‘source’       prazdroj ‘urquell’, ‘original source’
              starý ‘old’          prastarý ‘time-honored’, ‘dateless’

In creative language, such as in blogs, chats, and emotive informal communication,


iteration is often used to accent intensity of expression. Creativity may, of course, go beyond
the rules of productivity itself [32].
Let us give an example where creativity, productivity, and the issue of unknown words
meet nicely. According to Wikipedia, the word googol is a made-up word denoting the
number “one followed by one hundred zeros,” and the name of the company Google is an
inadvertent misspelling thereof. Nonetheless, both of these words successfully entered the
lexicon of English where morphological productivity started working, and we now know the
verb to google and nouns like googling or even googlish or googleology [34].
The original names have been adopted by other languages, too, and their own morpho-
logical processes have been triggered. In Czech, one says googlovat, googlit ‘to google’ or
vygooglovat, vygooglit ‘to google out’, googlování ‘googling’, and so on. In Arabic, the names
are transcribed as ǧūǧūl ‘googol’ and ǧūǧil ‘Google’. The latter one got transformed to the
verb ǧawǧal ‘to google’ through internal inflection, as if there were a genuine root ǧ w ǧ l,
and the corresponding noun ǧawǧalah ‘googling’ exists as well.

1.3 Morphological Models


There are many possible approaches to designing and implementing morphological models.
Over time, computational linguistics has witnessed the development of a number of for-
malisms and frameworks, in particular grammars of different kinds and expressive power,
with which to address whole classes of problems in processing natural as well as formal
languages.
Various domain-specific programming languages have been created that allow us to
implement the theoretical problem using hopefully intuitive and minimal programming
effort. These special-purpose languages usually introduce idiosyncratic notations of programs
and are interpreted using some restricted model of computation. The motivation for such
approaches may partly lie in the fact that, historically, computational resources were too
limited compared to the requirements and complexity of the tasks being solved. Other
motivations are theoretical given that finding a simple but accurate and yet generalizing
model is the point of scientific abstraction.
There are also many approaches that do not resort to domain-specific programming.
They, however, have to take care of the runtime performance and efficiency of the computa-
tional model themselves. It is up to the choice of the programming methods and the design
style whether such models turn out to be pure, intuitive, adequate, complete, reusable,
elegant, or not.
Let us now look at the most prominent types of computational approaches to morphology.
Needless to say, this typology is not strictly exclusive in the sense that comprehensive
morphological models and their applications can combine various distinct implementational
aspects, discussed next.

1.3.1 Dictionary Lookup


Morphological parsing is a process by which word forms of a language are associated with
corresponding linguistic descriptions. Morphological systems that specify these associations
by merely enumerating them case by case do not offer any generalization means. Likewise
for systems in which analyzing a word form is reduced to looking it up verbatim in word
lists, dictionaries, or databases, unless they are constructed by and kept in sync with more
sophisticated models of the language.
In this context, a dictionary is understood as a data structure that directly enables
obtaining some precomputed results, in our case word analyses. The data structure can
be optimized for efficient lookup, and the results can be shared. Lookup operations are
relatively simple and usually quick. Dictionaries can be implemented, for instance, as lists,
binary search trees, tries, hash tables, and so on.
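
The following sketch shows one such structure, a trie storing precomputed analyses; the forms and analyses in it are invented for illustration:

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.analyses = []                 # analyses stored at full forms

def insert(root, form, analysis):
    node = root
    for ch in form:
        node = node.children.setdefault(ch, TrieNode())
    node.analyses.append(analysis)

def lookup(root, form):
    node = root
    for ch in form:
        node = node.children.get(ch)
        if node is None:
            return []                      # unknown form: nothing to offer
    return node.analyses

root = TrieNode()
insert(root, "mice", "mouse [+plural]")
insert(root, "saw", "see [+past]")
insert(root, "saw", "saw [noun +singular]")
print(lookup(root, "saw"))     # both analyses of the homonym
print(lookup(root, "mouses"))  # [] -- enumeration does not generalize
```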
Because the set of associations between word forms and their desired descriptions is
declared by plain enumeration, the coverage of the model is finite and the generative
potential of the language is not exploited. Developing as well as verifying the association list
is tedious, liable to errors, and likely inefficient and inaccurate unless the data are retrieved
automatically from large and reliable linguistic resources.
Despite all that, an enumerative model is often sufficient for the given purpose, deals eas-
ily with exceptions, and can implement even complex morphology. For instance, dictionary-
based approaches to Korean [35] depend on a large dictionary of all possible combinations
of allomorphs and morphological alternations. These approaches do not allow development
of reusable morphological rules, though [36].
The word list or dictionary-based approach has been used frequently in various
ad hoc implementations for many languages. We could assume that with the availability of
immense online data, extracting a high-coverage vocabulary of word forms is feasible these
days [37]. The question remains how the associated annotations are constructed and how
informative and accurate they are. References to the literature on the unsupervised learn-
ing and induction of morphology, which are methods resulting in structured and therefore
nonenumerative models, are provided later in this chapter.

1.3.2 Finite-State Morphology


By finite-state morphological models, we mean those in which the specifications written
by human programmers are directly compiled into finite-state transducers. The two most
popular tools supporting this approach, which have been cited in literature and for which
example implementations for multiple languages are available online, include XFST (Xerox
Finite-State Tool) [9] and LexTools [11].5
Finite-state transducers are computational devices extending the power of finite-state
automata. They consist of a finite set of nodes connected by directed edges labeled with
pairs of input and output symbols. In such a network or graph, nodes are also called states,
while edges are called arcs. Traversing the network from the set of initial states to the set
of final states along the arcs is equivalent to reading the sequences of encountered input
symbols and writing the sequences of corresponding output symbols.
The set of possible sequences accepted by the transducer defines the input language;
the set of possible sequences emitted by the transducer defines the output language. For
example, a finite-state transducer could translate the infinite regular language consisting
of the words vnuk, pravnuk, prapravnuk, . . . to the matching words in the infinite regular
language defined by grandson, great-grandson, great-great-grandson, . . .
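
A transducer of this kind can be sketched directly as a state-transition table. In the toy Python fragment below, arcs are labeled with input/output pairs (whole morphs rather than single symbols, for brevity); the encoding is our own illustration, not the XFST or LexTools notation:

```python
# arcs: (state, input morph) -> (next state, output morph)
ARCS = {
    (0, "pra"):  (0, "great-"),            # loop: each pra- emits great-
    (0, "vnuk"): (1, "grandson"),
}
FINAL = {1}

def transduce(word, morphs=("pra", "vnuk")):
    state, pos, output = 0, 0, []
    while pos < len(word):
        for m in morphs:
            if word.startswith(m, pos) and (state, m) in ARCS:
                state, out = ARCS[(state, m)]
                output.append(out)
                pos += len(m)
                break
        else:
            return None                    # input rejected
    return "".join(output) if state in FINAL else None

print(transduce("prapravnuk"))             # great-great-grandson
```

Running the same arcs with input and output labels exchanged would generate Czech from English, illustrating the invertibility of the relation discussed below.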

5. See http://www.fsmbook.com/ and http://compling.ai.uiuc.edu/catms/ respectively.



The role of finite-state transducers is to capture and compute regular relations on sets
[38, 9, 11].6 That is, transducers specify relations between the input and output languages.
In fact, it is possible to invert the domain and the range of a relation, that is, exchange the
input and the output. In finite-state computational morphology, it is common to refer to the
input word forms as surface strings and to the output descriptions as lexical strings, if
the transducer is used for morphological analysis, or vice versa, if it is used for morphological
generation.
The linguistic descriptions we would like to give to the word forms and their components
can be rather arbitrary and are obviously dependent on the language processed as well as
on the morphological theory followed. In English, a finite-state transducer could analyze the
surface string children into the lexical string child [+plural], for instance, or generate women
from woman [+plural]. For other examples of possible input and output strings, consider
Example 1–8 or Figure 1–1.
Relations on languages can also be viewed as functions. Let us have a relation R, and
let us denote by [Σ] the set of all sequences over some set of symbols Σ, so that the domain
and the range of R are subsets of [Σ]. We can then consider R as a function mapping an
input string into a set of output strings, formally denoted by this type signature, where [Σ]
equals String:

R :: [Σ] → {[Σ]}        R :: String → {String}        (1.1)
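
Viewed this way, a morphological analyzer is simply a function returning a set, with ambiguity showing up as a set of several lexical strings; in this toy Python sketch the entries are invented:

```python
ANALYSES = {
    "children": {"child [+plural]"},
    "saw": {"see [+past]", "saw [noun +singular]"},
}

def R(surface):
    """R :: String -> {String}"""
    return ANALYSES.get(surface, set())

print(R("children"))   # {'child [+plural]'}
print(R("saw"))        # two lexical strings: systematic ambiguity
```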

Finite-state transducers have been studied extensively for their formal algebraic proper-
ties and have proven to be suitable models for miscellaneous problems [9]. Their applications
encoding the surface rather than lexical string associations as rewrite rules of phonology
and morphology have been around since the two-level morphology model [39], further pre-
sented in Computational Approaches to Morphology and Syntax [11] and Morphology and
Computation [40].
Morphological operations and processes in human languages can, in the overwhelming
number of cases and to a sufficient degree, be expressed in finite-state terms. Beesley and
Karttunen [9] stress concatenation of transducers as the method for factoring surface and
lexical languages into simpler models and propose a somewhat unsystematic compile-
replace transducer operation for handling nonconcatenative phenomena in morphology.
Roark and Sproat [11], however, argue that building morphological models in general using
transducer composition, which is pure, is a more universal approach.
A theoretical limitation of finite-state models of morphology is the problem of capturing
reduplication of words or their elements (e.g., to express plurality) found in several human
languages. A formal language that contains only words of the form λ^(1+k), where λ is some
arbitrary sequence of symbols from an alphabet and k ∈ {1, 2, . . . } is an arbitrary natural
number indicating how many times λ is repeated after itself, is not a regular language, not
even a context-free language. General reduplication of strings of unbounded length is thus
not a regular-language operation. Coping with this problem in the framework of finite-state
transducers is discussed by Roark and Sproat [11].

6. Regular relations and regular languages are restricted in their structure by the limited memory of the
device (i.e., the finite set of configurations in which it can occur). Unlike with regular languages, intersection
of regular relations can in general yield nonregular results [38].

Finite-state technology can be applied to the morphological modeling of isolating and
agglutinative languages in a quite straightforward manner. Korean finite-state models are
discussed by Kim et al. [41], Lee and Rim [42], and Han [43], to mention a few. For treat-
ments of nonconcatenative morphology using finite-state frameworks, see especially Kay [44],
Beesley [45], Kiraz [46], and Habash, Rambow, and Kiraz [47]. For finite-state models of
the rich morphology of Czech, compare Skoumalová [48] and Sedláček and Smrž [49].
Implementing a refined finite-state morphological model requires careful fine-tuning of
its lexicons, rewrite rules, and other components, while extending the code can lead to
unexpected interactions within it, as noted by Oflazer [50]. Convenient specification languages
like those mentioned previously are needed because encoding the finite-state transducers
directly would be extremely arduous, error prone, and unintelligible.
Finite-state tools are available in most general-purpose programming languages in the
form of support for regular expression matching and substitution. While these may not
be the ultimate choice for building full-fledged morphological analyzers or generators of a
natural language, they are very suitable for developing tokenizers and morphological guessers
capable of suggesting at least some structure for words that are formed correctly but cannot
be identified with concrete lexemes during full morphological parsing [9].
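As a minimal sketch of such a guesser, the following Haskell fragment proposes analyses
for unknown words by stripping a few suffixes; the rules are invented for illustration, and
a practical guesser would rather use the host language's regular expression support with
a much richer rule set:

    import Data.List (isSuffixOf)

    -- Propose rough analyses for an unidentified word by suffix stripping.
    -- The rules are illustrative and do not restore stem alternations.
    guess :: String -> [String]
    guess w =
      [ stem ++ " [" ++ tag ++ "]"
      | (suf, tag) <- rules
      , suf `isSuffixOf` w
      , let stem = take (length w - length suf) w
      , not (null stem) ]
      where
        rules = [("ing", "+prog"), ("ed", "+past"), ("s", "+plural")]

For example, guess "walking" yields ["walk [+prog]"], a useful hint even when walking is
absent from the lexicon.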
1.3.3 Unification-Based Morphology
Unification-based approaches to morphology have been inspired by advances in various for-
mal linguistic frameworks aiming at enabling complete grammatical descriptions of human
languages, especially head-driven phrase structure grammar (HPSG) [51], and by develop-
ment of languages for lexical knowledge representation, especially DATR [52]. The concepts
and methods of these formalisms are often closely connected to those of logic programming.
In the excellent thesis by Erjavec [53], the scientific context is discussed extensively and
profoundly; refer also to the monographs by Carpenter [54] and Shieber [55].
In finite-state morphological models, both surface and lexical forms are by themselves
unstructured strings of atomic symbols. In higher-level approaches, linguistic information is
expressed by more appropriate data structures that can include complex values or can be
recursively nested if needed. Morphological parsing P thus associates linear forms φ with
alternatives of structured content ψ, cf. (1.1):
P :: φ → {ψ}        P :: form → {content}        (1.2)
Erjavec [53] argues that for morphological modeling, word forms are best captured by
regular expressions, while the linguistic content is best described through typed feature
structures. Feature structures can be viewed as directed acyclic graphs. A node in a feature
structure comprises a set of attributes whose values can be feature structures again. Nodes
are associated with types, and atomic values are attributeless nodes distinguished by their
type. Rather than duplicating values, references can be used to establish the identity of
value instances. Feature structures are usually displayed as attribute-value matrices
or as nested symbolic expressions.
Unification is the key operation by which feature structures can be merged into a more
informative feature structure. Unification of feature structures can also fail, which means
that the information in them is mutually incompatible. Depending on the flavor of the
processing logic, unification can be monotonic (i.e., information-preserving), or it can allow
inheritance of default values and their overriding. In either case, information in a model can
be efficiently shared and reused by means of inheritance hierarchies defined on the feature
structure types.
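A minimal sketch of the operation, assuming untyped feature structures represented as
attribute-value maps (a real system would add types, default inheritance, and structure
sharing via references), might look as follows:

    import qualified Data.Map as Map

    data FS = Atom String               -- attributeless node distinguished by its type
            | Node (Map.Map String FS)  -- attributes with feature-structure values
            deriving (Eq, Show)

    -- Monotonic unification: merge the information in two feature
    -- structures, or fail (Nothing) if they are mutually incompatible.
    unify :: FS -> FS -> Maybe FS
    unify (Atom a) (Atom b) = if a == b then Just (Atom a) else Nothing
    unify (Node m) (Node n) =
      Node <$> sequence (Map.unionWith merge (Map.map Just m) (Map.map Just n))
      where merge mx my = do { x <- mx; y <- my; unify x y }
    unify _ _ = Nothing

Unifying [number: sg] with [person: 3] yields [number: sg, person: 3], whereas unifying
[number: sg] with [number: pl] fails.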
Morphological models of this kind are typically formulated as logic programs, and unifi-
cation is used to solve the system of constraints imposed by the model. Advantages of this
approach include better abstraction possibilities for developing a morphological grammar as
well as elimination of redundant information from it.
However, morphological models implemented in DATR can, under certain assumptions,
be converted to finite-state machines and are thus formally equivalent to them in the range
of morphological phenomena they can describe [11]. Interestingly, one-level phonology [56],
which formulates phonological constraints as logic expressions, can be compiled into
finite-state automata; these can then be intersected with morphological transducers to
exclude phonologically invalid surface strings [cf. 57, 53].
Unification-based models have been implemented for Russian [58], Czech [59], Slovene
[53], Persian [60], Hebrew [61], Arabic [62, 63], and other languages. Some rely on DATR;
some adopt, adapt, or develop other unification engines.
1.3.4 Functional Morphology
This group of morphological models includes not only the ones following the methodology
of functional morphology [64] but also those related to it, such as morphological resource
grammars of Grammatical Framework [65]. Functional morphology defines its models using
principles of functional programming and type theory. It treats morphological operations
and processes as pure mathematical functions and organizes the linguistic as well as abstract
elements of a model into distinct types of values and type classes.
Though functional morphology is not limited to modeling particular types of mor-
phologies in human languages, it is especially useful for fusional morphologies. Linguistic
notions like paradigms, rules and exceptions, grammatical categories and parameters, lex-
emes, morphemes, and morphs can be represented intuitively and succinctly in this ap-
proach. Designing a morphological system in an accurate and elegant way is encouraged by
the computational setting, which supports logical decoupling of subproblems and reinforces
the semantic structure of a program by strong type checking.
Functional morphology implementations are intended to be reused as programming
libraries capable of handling the complete morphology of a language and to be incorporated
into various kinds of applications. Morphological parsing is just one usage of the system,
the others being morphological generation, lexicon browsing, and so on. Next to parsing
(1.2), we can describe inflection I, derivation D, and lookup L as functions of these generic
types:
I :: lexeme → {parameter} → {form}        (1.3)
D :: lexeme → {parameter} → {lexeme}        (1.4)
L :: content → {lexeme}        (1.5)
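As an illustration of the style, the following Haskell sketch implements inflection for a
Latin-like first-declension paradigm; the parameter types and endings are chosen purely
for exposition:

    data Number = Sg | Pl   deriving (Show, Eq, Enum, Bounded)
    data Case   = Nom | Acc deriving (Show, Eq, Enum, Bounded)

    type Lexeme = String    -- here simply a stem such as "ros"

    -- Inflection I: a pure function from a lexeme and parameters to a form.
    inflect :: Lexeme -> (Number, Case) -> String
    inflect stem (n, c) = stem ++ ending n c
      where
        ending Sg Nom = "a"
        ending Sg Acc = "am"
        ending Pl Nom = "ae"
        ending Pl Acc = "as"

    -- Enumerating the full paradigm of a lexeme then comes for free.
    paradigm :: Lexeme -> [((Number, Case), String)]
    paradigm l = [ ((n, c), inflect l (n, c))
                 | n <- [minBound ..], c <- [minBound ..] ]

Here paradigm "ros" produces rosa, rosam, rosae, and rosas, each paired with its
parameters.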
A functional morphology model can be compiled into finite-state transducers if needed,
but can also be used interactively in an interpreted mode, for instance. Computation within
a model may exploit lazy evaluation and employ alternative methods of efficient parsing,
lookup, and so on [see 66, 12].
Many functional morphology implementations are embedded in a general-purpose pro-
gramming language, which gives programmers more freedom with advanced programming
techniques and allows them to develop full-featured, real-world applications for their mod-
els. The Zen toolkit for Sanskrit morphology [67, 68] is written in OCaml. It influenced
the functional morphology framework [64] in Haskell, with which morphologies of Latin,
Swedish, Spanish, Urdu [69], and other languages have been implemented.
In Haskell, in particular, developers can take advantage of its syntactic flexibility and
design their own notation for the functional constructs that model the given problem. The
notation then constitutes a so-called domain-specific embedded language, which makes pro-
gramming even more fun. Figure 1–2 illustrates how the ElixirFM implementation of Ara-
bic morphology [12, 17] captures the structure of words and defines the lexicon. Although
the entries are highly informative, their format remains similar to that found in printed
dictionaries. Operators like >|, |<, |<< and labels like verb are just infix functions; patterns
and affixes like FaCY, FCI, At are data constructors.
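To give a flavor of how such notation can be set up, the following simplified Haskell sketch
declares stand-ins for a few of these constructs; the names echo ElixirFM's, but the
definitions are illustrative, not the actual library:

    data Entry = Entry { morphs :: String, pos :: String, glosses :: [String] }
      deriving Show

    -- Backticked functions read as infix labels inside entry declarations.
    verb, noun, adj :: String -> [String] -> Entry
    verb m gs = Entry m "verb" gs
    noun m gs = Entry m "noun" gs
    adj  m gs = Entry m "adj"  gs

    -- A custom operator attaching a suffix to a pattern.
    (|<) :: String -> String -> String
    stem |< suffix = stem ++ suffix

    example :: Entry
    example = ("FiCAL" |< "aT") `noun` [ "knowledge", "knowing" ]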
    |> "d r y" <| [
        FaCY   `verb` [ "know", "notice" ]
               `imperf` FCI,
        FACY   `verb` [ "flatter", "deceive" ],
        HaFCY  `verb` [ "inform", "let know" ],
        lA >| "'a" >>| FCI |<< "Iy"
               `adj`  [ "agnostic" ],
        FiCAL |< aT  `noun` [ "knowledge", "knowing" ],
        MuFACY |< aT `noun` [ "flattery" ]
               `plural` MuFACY |< At,
        FACI   `adj`  [ "aware", "knowing" ] ]

[The dictionary layout generated from these entries, pairing each form with its Arabic
script, is not reproduced here: darā "know, notice," dirāyah "knowledge, knowing," dārā
"flatter, deceive," mudārāh (plural mudārayāt) "flattery," adrā "inform, let know,"
lā-adrīy "agnostic," dārin "aware, knowing."]

Figure 1–2: Excerpt from the ElixirFM lexicon and a layout generated from it. The source code of
the entries nested under the d r y root is shown in monospace font. Note the custom notation and
how economical yet informative the declarations are.
Even without the options provided by general-purpose programming languages, func-
tional morphology models achieve high levels of abstraction. Morphological grammars in
Grammatical Framework [65] can be extended with descriptions of the syntax and seman-
tics of a language. Grammatical Framework itself supports multilinguality, and models of
more than a dozen languages are available in it as open-source software [70, 71].
Grammars in the OpenCCG project [72] can be viewed as functional models, too.
Their formalism distinguishes declarations of features, categories, and families that provide type-
system-like means for representing structured values and inheritance hierarchies on them.
The grammars rely heavily on parametrized macros to minimize redundancy in the model
and to express the required generalizations. Expansion of macros in
the source code has effects similar to inlining of functions. The original text of the gram-
mar is reduced to associations between word forms and their morphosyntactic and lexical
properties.
1.3.5 Morphology Induction
We have focused on finding the structure of words in diverse languages, supposing we know
what we are looking for. We have not yet considered the problem of discovering and induc-
ing word structure without human insight (i.e., in an unsupervised or semi-supervised
manner). The motivation for such approaches lies in the fact that for many languages,
linguistic expertise might be unavailable or limited, and implementations adequate to a
purpose may not exist at all. Automatically acquired morphological and lexical informa-
tion, even if imperfect, can also be reused for bootstrapping and improving the classical
morphological models.
Let us skim over the directions of research in this domain. The literature on unsuper-
vised learning of morphology is reviewed in detail by Hammarström [73] and Gold-
smith [74]. Hammarström divides the numerous approaches into three main groups. Some
works compare and cluster words based on their similarity according to miscellaneous
metrics [75, 76, 77, 78]; others try to identify the prominent features of word forms that
distinguish them from unrelated ones. Most of the published approaches cast morphology
induction as the problem of word boundary and morpheme boundary detection, sometimes
also acquiring lexicons and paradigms [79, 80, 81, 82, 83].7
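As one concrete illustration of the boundary-detection view, the following Haskell sketch
implements Harris's classic successor-variety heuristic, which is not one of the cited
systems but conveys the idea: positions after which unusually many distinct letters can
follow in the corpus suggest morpheme boundaries:

    import qualified Data.Set as Set
    import Data.List (inits, isPrefixOf)

    -- For each proper, nonempty prefix of w, count how many distinct
    -- characters follow that prefix across the corpus; peaks in the
    -- counts suggest morpheme boundaries.
    successorVariety :: [String] -> String -> [(String, Int)]
    successorVariety corpus w =
      [ (p, Set.size successors)
      | p <- drop 1 (init (inits w))
      , let successors = Set.fromList
              [ v !! length p
              | v <- corpus, p `isPrefixOf` v, length v > length p ] ]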
Deducing word structure just from word forms and their context raises several challeng-
ing issues. These are caused by ambiguity [76] and irregularity [75] in morphology, as
well as by orthographic and phonological alternations [85] and nonlinear morphological
processes [86, 87].
In order to improve the chances of statistical inference, parallel learning of morphologies
for multiple languages is proposed by Snyder and Barzilay [88], resulting in discovery of
abstract morphemes. The discriminative log-linear model of Poon, Cherry, and Toutanova
[89] enhances its generalization options by employing overlapping contextual features when
making segmentation decisions [cf. 90].
7. Compare these with a semisupervised approach to word hyphenation [84].
1.4 Summary
In this chapter, we learned that morphology can be looked at from opposing viewpoints:
one that tries to find the structural components from which words are built versus a more
syntax-driven perspective wherein the functions of words are the focus of the study. Another
distinction can be made between the analytic and generative aspects of morphology, or
between man-made morphological frameworks and systems for unsupervised induction
of morphology. Still other issues concern how well and how easily morphological models
can be implemented.
We described morphological parsing as the formal process of recovering structured infor-
mation from a linear sequence of symbols, a process in which ambiguity is present and
multiple interpretations should be expected.
We explored interesting morphological phenomena in different types of languages and
offered several hints with respect to multilingual processing and model development.
With Korean as a language where agglutination moderated by phonological rules is the
dominant morphological process, we saw that a viable model of word decomposition can
work at the morpheme level, regardless of whether the morphemes are lexical or grammatical.
In Czech and Arabic, fusional languages with intricate systems of inflectional and
derivational parameters and lexically dependent word stem variation, such factorization is
not useful. Morphology is better described via paradigms associating the possible forms of
lexemes with their corresponding properties.
We discussed various options for implementing either of these models using modern
programming techniques.
Acknowledgment
We would like to thank Petr Novák for his valuable comments on an earlier draft of this
chapter.
Bibliography
[1] M. Liberman, “Morphology.” Linguistics 001, Lecture 7, University of Pennsylvania,
2009. http://www.ling.upenn.edu/courses/Fall_2009/ling001/morphology.html.
[2] M. Haspelmath, “The indeterminacy of word segmentation and the nature of mor-
phology and syntax,” Folia Linguistica, vol. 45, 2011.
[3] H. Kučera and W. N. Francis, Computational Analysis of Present-Day American
English. Providence, RI: Brown University Press, 1967.
[4] S. B. Cohen and N. A. Smith, “Joint morphological and syntactic disambiguation,”
in Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Lan-
guage Processing and Computational Natural Language Learning (EMNLP-CoNLL),
pp. 208–217, 2007.
[5] T. Nakagawa, “Chinese and Japanese word segmentation using word-level and
character-level information,” in Proceedings of 20th International Conference on Com-
putational Linguistics, pp. 466–472, 2004.
[6] H. Shin and H. You, “Hybrid n-gram probability estimation in morphologically rich
languages,” in Proceedings of the 23rd Pacific Asia Conference on Language, Infor-
mation and Computation, 2009.
[7] D. Z. Hakkani-Tür, K. Oflazer, and G. Tür, “Statistical morphological disambiguation
for agglutinative languages,” in Proceedings of the 18th Conference on Computational
Linguistics, pp. 285–291, 2000.
[8] G. T. Stump, Inflectional Morphology: A Theory of Paradigm Structure. Cambridge
Studies in Linguistics, New York: Cambridge University Press, 2001.
[9] K. R. Beesley and L. Karttunen, Finite State Morphology. CSLI Studies in Compu-
tational Linguistics, Stanford, CA: CSLI Publications, 2003.
[10] M. Baerman, D. Brown, and G. G. Corbett, The Syntax-Morphology Interface. A Study
of Syncretism. Cambridge Studies in Linguistics, New York: Cambridge University
Press, 2006.
[11] B. Roark and R. Sproat, Computational Approaches to Morphology and Syntax. Oxford
Surveys in Syntax and Morphology, New York: Oxford University Press, 2007.
[12] O. Smrž, “Functional Arabic morphology. Formal system and implementation,” PhD
thesis, Charles University in Prague, 2007.
[13] H. Eifring and R. Theil, Linguistics for Students of Asian and African Languages.
Universitetet i Oslo, 2005.
[14] B. Bickel and J. Nichols, “Fusion of selected inflectional formatives & exponence of
selected inflectional formatives,” in The World Atlas of Language Structures Online
(M. Haspelmath, M. S. Dryer, D. Gil, and B. Comrie, eds.), ch. 20 & 21, Munich: Max
Planck Digital Library, 2008.
[15] W. Fischer, A Grammar of Classical Arabic. Trans. Jonathan Rodgers. Yale Language
Series, New Haven, CT: Yale University Press, 2002.
[16] K. C. Ryding, A Reference Grammar of Modern Standard Arabic. New York: Cam-
bridge University Press, 2005.
[17] O. Smrž and V. Bielický, “ElixirFM.” Functional Arabic Morphology, SourceForge.net,
2010. http://sourceforge.net/projects/elixir-fm/.
[18] T. Kamei, R. Kōno, and E. Chino, eds., The Sanseido Encyclopedia of Linguistics,
Volume 6 Terms (in Japanese). Sanseido, 1996.
[19] F. Karlsson, Finnish Grammar. Helsinki: Werner Söderström Osakeyhtiö, 1987.
[20] J. Hajič and B. Hladká, “Tagging inflective languages: Prediction of morphological cat-
egories for a rich, structured tagset,” in Proceedings of COLING-ACL 1998, pp. 483–
490, 1998.
[21] J. Hajič, “Morphological tagging: Data vs. dictionaries,” in Proceedings of NAACL-
ANLP 2000, pp. 94–101, 2000.
[22] N. Habash and O. Rambow, “Arabic tokenization, part-of-speech tagging and mor-
phological disambiguation in one fell swoop,” in Proceedings of the 43rd Annual
Meeting of the Association for Computational Linguistics (ACL’05), pp. 573–580,
2005.
[23] N. A. Smith, D. A. Smith, and R. W. Tromble, “Context-based morphological dis-
ambiguation with random fields,” in Proceedings of HLT/EMNLP 2005, pp. 475–482,
2005.
[24] J. Hajič, O. Smrž, T. Buckwalter, and H. Jin, “Feature-based tagger of approximations
of functional Arabic morphology,” in Proceedings of the 4th Workshop on Treebanks
and Linguistic Theories (TLT 2005), pp. 53–64, 2005.
[25] T. Buckwalter, “Issues in Arabic orthography and morphology analysis,” in COLING
2004 Computational Approaches to Arabic Script-based Languages, pp. 31–34, 2004.
[26] R. Nelken and S. M. Shieber, “Arabic diacritization using finite-state transducers,”
in Proceedings of the ACL Workshop on Computational Approaches to Semitic Lan-
guages, pp. 79–86, 2005.
[27] I. Zitouni, J. S. Sorensen, and R. Sarikaya, “Maximum entropy based restoration of
Arabic diacritics,” in Proceedings of the 21st International Conference on Compu-
tational Linguistics and 44th Annual Meeting of the Association for Computational
Linguistics, pp. 577–584, 2006.
[28] N. Habash and O. Rambow, “Arabic diacritization through full morphological tag-
ging,” in Human Language Technologies 2007: The Conference of the North American
Chapter of the Association for Computational Linguistics; Companion Volume, Short
Papers, pp. 53–56, 2007.
[29] G. Huet, “Lexicon-directed segmentation and tagging of Sanskrit,” in Proceedings of
the XIIth World Sanskrit Conference, pp. 307–325, 2003.
[30] G. Huet, “Formal structure of Sanskrit text: Requirements analysis for a mechanical
Sanskrit processor,” in Sanskrit Computational Linguistics: First and Second Inter-
national Symposia (G. Huet, A. Kulkarni, and P. Scharf, eds.), vol. 5402 of LNAI,
pp. 162–199, Berlin: Springer, 2009.
[31] F. Katamba and J. Stonham, Morphology. Basingstoke: Palgrave Macmillan, 2006.
[32] L. Bauer, Morphological Productivity, Cambridge Studies in Linguistics. New York:
Cambridge University Press, 2001.
[33] R. H. Baayen, Word Frequency Distributions, Text, Speech and Language Technology.
Boston: Kluwer Academic Publishers, 2001.
[34] A. Kilgarriff, “Googleology is bad science,” Computational Linguistics, vol. 33, no. 1,
pp. 147–151, 2007.
[35] H.-C. Kwon and Y.-S. Chae, “A dictionary-based morphological analysis,” in Proceed-
ings of Natural Language Processing Pacific Rim Symposium, pp. 178–185, 1991.
[36] D.-B. Kim, K.-S. Choi, and K.-H. Lee, “A computational model of Korean morphologi-
cal analysis: A prediction-based approach,” Journal of East Asian Linguistics, vol. 5,
no. 2, pp. 183–215, 1996.
[37] A. Halevy, P. Norvig, and F. Pereira, “The unreasonable effectiveness of data,” IEEE
Intelligent Systems, vol. 24, no. 2, pp. 8–12, 2009.
[38] R. M. Kaplan and M. Kay, “Regular models of phonological rule systems,” Computa-
tional Linguistics, vol. 20, no. 3, pp. 331–378, 1994.
[39] K. Koskenniemi, “Two-level morphology: A general computational model for word
form recognition and production,” PhD thesis, University of Helsinki, 1983.
[40] R. Sproat, Morphology and Computation. ACL–MIT Press Series in Natural Language
Processing. Cambridge, MA: MIT Press, 1992.
[41] D.-B. Kim, S.-J. Lee, K.-S. Choi, and G.-C. Kim, “A two-level morphological analysis
of Korean,” in Proceedings of the 15th International Conference on Computational
Linguistics, pp. 535–539, 1994.
[42] S.-Z. Lee and H.-C. Rim, “Korean morphology with elementary two-level rules and
rule features,” in Proceedings of the Pacific Association for Computational Linguistics,
pp. 182–187, 1997.
[43] N.-R. Han, “Klex: A finite-state transducer lexicon of Korean,” in Finite-state Meth-
ods and Natural Language Processing: 5th International Workshop, FSMNLP 2005,
pp. 67–77, Springer, 2006.
[44] M. Kay, “Nonconcatenative finite-state morphology,” in Proceedings of the Third Con-
ference of the European Chapter of the ACL (EACL-87), pp. 2–10, ACL, 1987.
[45] K. R. Beesley, “Arabic morphology using only finite-state operations,” in COLING-
ACL’98 Proceedings of the Workshop on Computational Approaches to Semitic lan-
guages, pp. 50–57, 1998.
[46] G. A. Kiraz, Computational Nonlinear Morphology with Emphasis on Semitic Lan-
guages. Studies in Natural Language Processing, Cambridge: Cambridge University
Press, 2001.
[47] N. Habash, O. Rambow, and G. Kiraz, “Morphological analysis and generation for
Arabic dialects,” in Proceedings of the ACL Workshop on Computational Approaches
to Semitic Languages, pp. 17–24, 2005.
[48] H. Skoumalová, “A Czech morphological lexicon,” in Proceedings of the Third Meeting
of the ACL Special Interest Group in Computational Phonology, pp. 41–47, 1997.
[49] R. Sedláček and P. Smrž, “A new Czech morphological analyser ajka,” in Text, Speech
and Dialogue, vol. 2166, pp. 100–107, Berlin: Springer, 2001.
[50] K. Oflazer, “Computational morphology.” ESSLLI 2006 European Summer School in
Logic, Language, and Information, 2006.
[51] C. Pollard and I. A. Sag, Head-Driven Phrase Structure Grammar. Chicago: University
of Chicago Press, 1994.
[52] R. Evans and G. Gazdar, “DATR: A language for lexical knowledge representation,”
Computational Linguistics, vol. 22, no. 2, pp. 167–216, 1996.
[53] T. Erjavec, “Unification, inheritance, and paradigms in the morphology of natural
languages,” PhD thesis, University of Ljubljana, 1996.
[54] B. Carpenter, The Logic of Typed Feature Structures. Cambridge Tracts in Theoretical
Computer Science 32, New York: Cambridge University Press, 1992.
[55] S. M. Shieber, Constraint-Based Grammar Formalisms: Parsing and Type Inference
for Natural and Computer Languages. Cambridge, MA: MIT Press, 1992.
[56] S. Bird and T. M. Ellison, “One-level phonology: Autosegmental representations and
rules as finite automata,” Computational Linguistics, vol. 20, no. 1, pp. 55–90, 1994.
[57] S. Bird and P. Blackburn, “A logical approach to Arabic phonology,” in Proceedings
of the 5th Conference of the European Chapter of the Association for Computational
Linguistics, pp. 89–94, 1991.
[58] G. G. Corbett and N. M. Fraser, “Network morphology: A DATR account of Russian
nominal inflection,” Journal of Linguistics, vol. 29, pp. 113–142, 1993.
[59] J. Hajič, “Unification morphology grammar. Software system for multilanguage mor-
phological analysis,” PhD thesis, Charles University in Prague, 1994.
[60] K. Megerdoomian, “Unification-based Persian morphology,” in Proceedings of CICLing
2000, 2000.
[61] R. Finkel and G. Stump, “Generating Hebrew verb morphology by default inheritance
hierarchies,” in Proceedings of the Workshop on Computational Approaches to Semitic
Languages, pp. 9–18, 2002.
[62] S. R. Al-Najem, “Inheritance-based approach to Arabic verbal root-and-pattern mor-
phology,” in Arabic Computational Morphology. Knowledge-based and Empirical Meth-
ods (A. Soudi, A. van den Bosch, and G. Neumann, eds.), vol. 38, pp. 67–88, Berlin:
Springer, 2007.
[63] S. Köprü and J. Miller, “A unification based approach to the morphological analy-
sis and generation of Arabic,” in CAASL-3: Third Workshop on Computational Ap-
proaches to Arabic Script-based Languages, 2009.
[64] M. Forsberg and A. Ranta, “Functional morphology,” in Proceedings of the 9th
ACM SIGPLAN International Conference on Functional Programming, ICFP 2004,
pp. 213–223, 2004.
[65] A. Ranta, “Grammatical Framework: A type-theoretical grammar formalism,” Journal
of Functional Programming, vol. 14, no. 2, pp. 145–189, 2004.
[66] P. Ljunglöf, “Pure functional parsing. An advanced tutorial,” Licenciate thesis,
Göteborg University & Chalmers University of Technology, 2002.
[67] G. Huet, “The Zen computational linguistics toolkit,” ESSLLI 2002 European Summer
School in Logic, Language, and Information, 2002.
[68] G. Huet, “A functional toolkit for morphological and phonological processing,
application to a Sanskrit tagger,” Journal of Functional Programming, vol. 15, no. 4,
pp. 573–614, 2005.
[69] M. Humayoun, H. Hammarström, and A. Ranta, “Urdu morphology, orthography and
lexicon extraction,” in CAASL-2: Second Workshop on Computational Approaches to
Arabic Script-based Languages, pp. 59–66, 2007.
[70] A. Dada and A. Ranta, “Implementing an open source Arabic resource grammar in
GF,” in Perspectives on Arabic Linguistics (M. A. Mughazy, ed.), vol. XX, pp. 209–
231, John Benjamins, 2007.
[71] A. Ranta, “Grammatical Framework.” Programming Language for Multilingual Gram-
mar Applications, http://www.grammaticalframework.org/, 2010.
[72] J. Baldridge, S. Chatterjee, A. Palmer, and B. Wing, “DotCCG and VisCCG: Wiki
and programming paradigms for improved grammar engineering with OpenCCG,” in
Proceedings of the Workshop on Grammar Engineering Across Frameworks, 2007.
[73] H. Hammarström, “Unsupervised learning of morphology and the languages of the
world,” PhD thesis, Chalmers University of Technology and University of Gothenburg,
2009.
[74] J. A. Goldsmith, “Segmentation and morphology,” in Computational Linguistics and
Natural Language Processing Handbook (A. Clark, C. Fox, and S. Lappin, eds.),
pp. 364–393, Chichester: Wiley-Blackwell, 2010.
[75] D. Yarowsky and R. Wicentowski, “Minimally supervised morphological analysis by
multimodal alignment,” in Proceedings of the 38th Meeting of the Association for
Computational Linguistics, pp. 207–216, 2000.
[76] P. Schone and D. Jurafsky, “Knowledge-free induction of inflectional morphologies,”
in Proceedings of the North American Chapter of the Association for Computational
Linguistics, pp. 183–191, 2001.
[77] S. Neuvel and S. A. Fulop, “Unsupervised learning of morphology without mor-
phemes,” in Proceedings of the ACL-02 Workshop on Morphological and Phonological
Learning, pp. 31–40, 2002.
[78] N. Hathout, “Acquisition of the morphological structure of the lexicon based on lexical
similarity and formal analogy,” in Coling 2008: Proceedings of the 3rd Textgraphs
Workshop on Graph-based Algorithms for Natural Language Processing, pp. 1–8,
2008.
[79] J. Goldsmith, “Unsupervised learning of the morphology of a natural language,” Com-
putational Linguistics, vol. 27, no. 2, pp. 153–198, 2001.
[80] H. Johnson and J. Martin, “Unsupervised learning of morphology for English and
Inuktitut,” in Companion Volume of the Proceedings of the Human Language Tech-
nologies: The Annual Conference of the North American Chapter of the Association
for Computational Linguistics 2003: Short Papers, pp. 43–45, 2003.
[81] M. Creutz and K. Lagus, “Induction of a simple morphology for highly-inflecting
languages,” in Proceedings of the 7th Meeting of the ACL Special Interest Group in
Computational Phonology, pp. 43–51, 2004.
[82] M. Creutz and K. Lagus, “Unsupervised models for morpheme segmentation and
morphology learning,” ACM Transactions on Speech and Language Processing, vol. 4,
no. 1, pp. 1–34, 2007.
[83] C. Monson, J. Carbonell, A. Lavie, and L. Levin, “ParaMor: Minimally supervised
induction of paradigm structure and morphological analysis,” in Proceedings of Ninth
Meeting of the ACL Special Interest Group in Computational Morphology and Phonol-
ogy, pp. 117–125, 2007.
[84] F. M. Liang, “Word Hy-phen-a-tion by Com-put-er,” PhD thesis, Stanford University,
1983.
[85] V. Demberg, “A language-independent unsupervised model for morphological segmen-
tation,” in Proceedings of the 45th Annual Meeting of the Association of Computational
Linguistics, pp. 920–927, 2007.
[86] A. Clark, “Supervised and unsupervised learning of Arabic morphology,” in Ara-
bic Computational Morphology. Knowledge-based and Empirical Methods (A. Soudi,
A. van den Bosch, and G. Neumann, eds.), vol. 38, pp. 181–200, Berlin: Springer, 2007.
[87] A. Xanthos, Apprentissage automatique de la morphologie: le cas des structures racine-
schème. Sciences pour la communication, Bern: Peter Lang, 2008.
[88] B. Snyder and R. Barzilay, “Unsupervised multilingual learning for morphological
segmentation,” in Proceedings of ACL-08: HLT, pp. 737–745, 2008.
[89] H. Poon, C. Cherry, and K. Toutanova, “Unsupervised morphological segmentation
with log-linear models,” in Proceedings of Human Language Technologies: Annual Con-
ference of the North American Chapter of the Association for Computational Linguis-
tics, pp. 209–217, 2009.
[90] S. Della Pietra, V. Della Pietra, and J. Lafferty, “Inducing features of random fields,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 4,
pp. 380–393, 1997.
Index
. (period), sentence segmentation markers, 30 overview of, 527
“” (Quotation marks), sentence segmentation UIMA, 527–529
markers, 30 Aggregation models, for MLIR, 385
! (Exclamation point), as sentence Agreement feature, of coreference models, 301
segmentation marker, 30 Air Travel Information System (ATIS)
? (Question mark), sentence segmentation as resource for meaning representation, 148
markers, 30 rule-based systems for semantic parsing,
80/20 rule (vital few), 14 150
supervised systems for semantic parsing,
150–151
a priori models, in document retrieval, 377 Algorithms. See by individual types
Abbreviations, punctuation marks in, 30 Alignment-error rate (AER), 343
Absity parser, rule-based semantic parsing, Alignment, in RTE
122 implementing, 233–236
Abstracts latent alignment inference, 247–248
in automatic summarization, 397 learning alignment independently of
defined, 400 entailment, 244–245
Accumulative vector space model, for leveraging multiple alignments, 245
document retrieval, 374–375 modeling, 226
Accuracy, in QA, 462 Allomorphs, 6
ACE. See Automatic content extraction “almost-parsing” language model, 181
(ACE) Ambiguity
Acquis corpus disambiguation problem in morphology, 91
for evaluating IR systems, 390 in interpretation of expressions, 10–13
for machine translation, 358 issues with morphology induction, 21
Adequacy, of translation, 334 PCFGs and, 80–83
Adjunctive arguments, PropBank verb resolution in parsing, 80
predicates, 119–120 sentence segmentation markers and, 30
AER (Alignment-error rate), 343 structural, 99
AEs (Analysis engines), UIMA, 527 in syntactic analysis, 61
Agglutinative languages types of, 8
finite-state technology applied to, 18 word sense and. See Disambiguation
linear decomposition of words, 192 systems, word sense
morphological typology and, 7 Analysis engines (AEs), UIMA, 527
parsing issues related to morphology, 90–91 Analysis, in RTE framework
Aggregate processor, combining NLP engines, annotators, 219
523 improving, 248–249
Aggregation architectures, for NLP. See also multiview representation of, 220–222
Natural language processing (NLP), overview of, 220
combining engines for Analysis stage, of summarization system
GATE, 529–530 building a summarization system and, 421
InfoSphere Streams, 530–531 overview of, 400
551
552 Index

Anaphora resolution. See also Coreference polarity analysis of words and phrases, 269
resolution productivity/creativity in, 15
automatic summarization and, 398 regional dialects not in written form, 195
cohesion of, 401 RTE in, 218
multilingual automatic summarization and, stem-matching features for capturing
410 morphological similarities, 301
QA architectures and, 438–439 TALES case study, 538
zero anaphora resolution, 249, 444 tokens in, 4
Anchored speech recognition, 490 translingual summarization, 398–399,
Anchors, in SSTK, 246 424–426
Annotation/annotation guidelines unification-based models, 19
entity detection and, 293 Architectures
in GALE, 478 aggregation architectures for NLP, 527–529
Penn Treebank and, 87–88 for question answering (QA), 435–437
phrase structure trees and, 68–69 of spoken dialog systems, 505
QA architectures and, 439–440 system architectures for distillation, 488
in RTE, 219, 222–224 system architectures for semantic parsing,
snippet processing and, 485 101–102
for treebanks, 62 types of EDT architectures, 286–287
of utterances based on rule-based Arguments
grammars, 502–503
consistency of argument identification, 323
of utterances in spoken dialog systems, 513
event extraction and, 321–322
Answers, in QA
in GALE distillation initiative, 475
candidate answer extraction. See Candidate
in RTE systems, 220
answer extraction, in QA
Arguments, predicate-argument recognition
candidate answer generation. See
argument sequence information, 137–138
Candidate answer generation, in QA
classification and identification, 139–140
evaluating correctness of, 461–462
scores for, 450–453, 458–459 core and adjunctive, 119
scoring component for, 435 disallowing overlaps, 137
type classification of, 440–442 discontiguous, 121
Arabic identification and classification, 123
ambiguity in, 11–12 noun arguments, 144–146
corpora for relation extraction, 317 ART (artifact) relation class, 312
distillation, 479, 490–491 ASCII
EDT and, 286 as encoding scheme, 368
ElixirFM lexicon, 20 parsing issues related, 89
encoding and script, 368 Asian Federation of Natural Language
English-to-Arabic machine translation, 114 Processing, 218
as fusional language, 8 Asian languages. See also by individual Asian
GALE IOD and, 532, 534–536 languages
IR and, 371 multilingual IR and, 366, 390
irregularity in, 8–9 QA and, 434, 437, 455, 460–461, 466
language modeling, 189–191, 193 Ask.com, 435
mention detection experiments, 294–296 ASR (automatic speech recognition)
morphemes in, 6 sentence boundary annotation, 29
morphological analysis of, 191 sentence segmentation markers, 31
multilingual issues in predicate-argument ASSERT (Automatic Statistical SEmantic
structures, 146–147 Role Tagger), 147, 447
Index 553

ATIS. See Air Travel Information System Base phrase chunks, 132–133
(ATIS) BASEBALL system, in history of QA
Atomic events, summarization and, 418 systems, 434
Attribute features, in coreference models, 301 Basic Elements (BE)
Automatic content extraction (ACE) automatic evaluation of summarization,
coreference resolution experiments, 302–303 417–419
event extraction and, 320–321 metrics in, 420
mention detection and, 287, 294 Bayes rule, for sentence or topic
relation extraction and, 311–312 segmentation, 39–40
in Rosetta Consortium distillation system, Bayes theorem, maximum-likelihood
480–481 estimation and, 376
Automatic speech recognition (ASR) Bayesian parameter estimation, 173–174
sentence boundary annotation, 29 Bayesian topic-based language models,
sentence segmentation markers, 31 186–187
Automatic Statistical SEmantic Role Tagger BBN, event extraction and, 322
(ASSERT), 147, 447 BE (Basic Elements)
Automatic summarization automatic evaluation of summarization,
bibliography, 427–432 417–419
coherence and cohesion in, 401–404 metrics in, 420
extraction and modification processes in, BE with Transformations for Evaluation
399–400 (BEwTE), 419–420
graph-based approaches, 401 Beam search
history of, 398–399 machine translation and, 346
introduction to, 397–398 reducing search space using, 290–291
learning how to summarize, 406–409 Bell tree, for coreference resolution, 297–298
LexPageRank, 406
Bengali. See Indian languages
multilingual. See Multilingual automatic
Berkeley word aligner, in machine translation,
summarization
357
stages of, 400
Bibliographic summaries, in automatic
summary, 426–427
summarization, 397
surface-based features used in, 400–401
Bilingual latent semantic analysis (bLSA),
TextRank, 404–406
197–198
Automatic Summary Evaluation based on
Binary classifier, in event matching, 323–324
n-gram graphs (AutoSummENG),
Binary conditional model, for probability of
419–420
mention links, 297–300
BLEU
Babel Fish machine translation metrics, 334, 336
crosslingual question answering and, 455 mention detection experiments and, 295
Systran, 331 ROUGE compared with, 415–416
Backend services, of spoken dialog system, Block comparison method, for topic
500 segmentation, 38
Backoff smoothing techniques bLSA (bilingual latent semantic analysis),
generalized backoff strategy, 183–184 197–198
in language model estimation, 172 BLUE (Boeing Language Understanding
nonnormalized form, 175 Engine), 242–244
parallel backoff, 184 BM25 model, in document retrieval, 375
Backus-Naur form, of context-free grammar, BNC (British National Corpus), 118
59 Boeing Language Understanding Engine
BananaSplit, IR preprocessing and, 392 (BLUE), 242–244
554 Index

Boolean models Canonization, deferred in RTE multiview


for document representation in monolingual representation, 222
IR, 372 Capitalization (Uppercase), sentence
for document retrieval, 374 segmentation markers, 30
Boolean named entity flags, in PSG, 126 CAS (Common analysis structure), UIMA,
Bootstrapping 527, 536
building subjectivity lexicons, 266–267 Cascading systems, types of EDT
corpus-based approach to subjectivity and architectures, 286–287
sentiment analysis, 269 Case
dictionary-based approach to subjectivity parsing issues related to, 88
and sentiment analysis, 273 sentence segmentation markers, 30
ranking approaches to subjectivity and Catalan, 109
sentiment analysis, 275–276 Categorical ambiguity, word sense and, 104
semisupervised approach to relation Cause-and-effect relations, causal reasoning
extraction, 318 and, 250
Boundary classification problems CCG (Combinatory Categorical Grammar),
overview of, 33 129–130
sentence boundaries. See Sentence CFGs. See Context-free grammar (CFGs)
boundary detection Character n-gram models, 370
topic boundaries. See Topic segmentation Chart decoding, tree-based models for
British National Corpus (BNC), 118 machine translation, 351–352
Brown Corpus, as resource for semantic Chart parsing, worst-case parsing algorithm
parsing, 104 for CFGs, 74–79
Buckwalter Morphological Analyzer, 191 Charts, IXIR distillation system, 488–489
CHILL (Constructive Heuristics Induction for
Language Learning), 151
C-ASSERT, software programs for semantic Chinese
role labeling, 147 anaphora frequency in, 444
Call-flow challenges of sentence and topic
localization of, 514 segmentation, 30
strategy of dialog manager, 504 corpora for relation extraction, 317
voice user interface (VUI) and, 505–506 corpus-based approach to subjectivity and
Call routing, natural language and, 510 sentiment analysis, 274–275
Canadian Hansards crosslingual language modeling, 197–198
corpora for IR, 391 data sets related to summarization, 424–426
corpora for machine translation, 358 dictionary-based approach to subjectivity
Candidate answer extraction, in QA and sentiment analysis, 272–273
answer scores, 450–453 distillation, 479, 490–491
combining evidence, 453–454 EDT and, 286
structural matching, 446–448 HowNet lexicon for, 105
from structured sources, 449–450 human assessment of word meaning, 333
surface patterns, 448–449 IR and, 366, 390
type-based, 446 isolating (analytic) languages, 7
from unstructured sources, 445 as isolating or analytic language, 7
Candidate answer generation, in QA language modeling in without word
components in QA architectures, 435 segmentation, 193–194
overview of, 443 lingPipe for word segmentation, 423
Candidate boundaries, processing stages of machine translation and, 322, 354, 358
segmentation tasks, 48 mention detection experiments, 294–296
Index 555

multilingual issues in predicate-argument Classifiers


structures, 146–147 in event matching, 323–324
phrase structure treebank, 70 localization of grammars and, 516
polarity analysis of words and phrases, 269 maximum entropy classifiers, 37, 39–40
preprocessing best practices in IR, 372 in mention detection, 292–293
QA and, 461, 464 pipeline of, 321
QA architectures and, 437–438 in relation extraction, 313, 316–317
resources for semantic parsing, 122 in subjectivity and sentiment analysis,
RTE in, 218 270–272, 274
scripts not using whitespace, 369 Type classifier in QA systems, 440–442
subjectivity and sentiment analysis, in word disambiguation, 110
259–260 CLASSIFY function, 313
TALES case study, 538 ClearTK tool, for building summarization
translingual summarization, 399, 410 system, 423
word segmentation and parsing, 89–90 CLIR. See Crosslingual information retrieval
word segmentation in, 4–5 (CLIR)
word sense annotation in, 104 Clitics
Chomsky, Noam, 13, 98–99 Czech example, 5
Chunk-based systems, 132–133 defined, 4
Chunks Co-occurence, of words between languages,
defined, 292 337–338
meaning chunks in semantic parsing, 97 Coarse to fine parsing, 77–78
CIDR algorithm, for multilingual Code switchers
summarization, 411 impact on sentence segmentation, 31
Citations multilingual language modeling and,
evaluation in distillation, 493 195–196
in GALE distillation initiative, 477 COGEX, for answer scores in QA, 451
CKY algorithm, worst-case parsing for CFGs, Coherence, sentence-sentence connections
76–78 and, 402
Class-based language models, 178–179 Cohesion, anaphora resolution and, 401–402
Classes Collection language, in CLIR, 365
language modeling using morphological Combination hypothesis, combining classifiers
categories, 193 to boost performance, 293
of relations, 311 Combinatory Categorical Grammar (CCG),
Classification 129–130
of arguments, 123, 139–140 Common analysis structure (CAS), UIMA,
data-driven, 287–289 527, 536
dynamic class context in PSG, 128 Communicator program, for meaning
event extraction and, 321–322 representation, 148–150
overcoming independence assumption, Comparators, RTE, 219, 222–223
137–138 Competence vs. performance, Chomsky on, 13
paradigms, 133–137 Compile/replace transducer (Beesley and
problems related to sentence boundaries. Karttunen), 17
See Sentence boundary detection Componentization of design, for NLP
problems related to topic boundaries. See aggregation, 524–525
Topic segmentation Components of words
relation extraction and, 312–316 lexemes, 5
Classification tag lattice (trellis), searching morphemes, 5–7
for mentions, 289 morphological typology and, 7–8
556 Index

Compound slitting Context-free grammar (CFGs)


BananaSplit tool, 392 for analysis of natural language syntax,
normalization for fusional languages, 371 60–61
Computational efficiency dependency graphs in syntax analysis,
desired attributes of NLP aggregation, 65–67
525–526 rules of syntax, 59
in GALE IOD, 537 shift-reduce parsing, 72–73
in GATE, 530 worst-case parsing algorithm, 74–78
in InfoSphere Streams, 530–531 Contextual subjectivity analysis, 261
in UIMA, 528 Contradiction, in textual entailment, 211
Computational Natural Language Learning Conversational speech, sentence segmentation
(CoNLL), 132 in, 31
Concatenative languages, 8 Core arguments, PropBank verb predicates,
Concept space, interlingual document 119
representations, 381 Coreference resolution. See also Anaphora
Conceptual density, as measure of semantic resolution
similarity, 112 automatic summarization and, 398
Conditional probability, MaxEnt formula for, Bell tree for, 297–298
316 experiments in, 302–303
Conditional random fields (CRFs) information extraction and, 100, 285–286
in discriminative parsing model, 84 MaxEnt model applied to, 300–301
machine learning and, 342 models for, 298–300
measuring token frequency, 369 overview of, 295–296
mention detection and, 287 as relation extraction system, 311
relation extraction and, 316 in RTE, 212, 227
sentence or topic segmentation and, 39–40 Corpora
Confidence weighted score (CWS), in QA, 463 for distillation, 480–483
CoNLL (Computational Natural Language for document-level annotations, 274
Learning), 132 Europarl (European Parliament), 295, 345
Constituents for IR systems, 390–391
atomic events and, 418 for machine translation (MT), 358
in PSG, 127 for relation extraction, 317
Constituents, in RTE for semantic parsing, 104–105
comparing annotation constituents, 222–224 for sentence-level annotations, 271–272
multiview representation of analysis and, for subjectivity and sentiment analysis,
220 262–263, 274–275
numerical quantities (NUM), 221, 233 for summarization, 406, 425
Constraint-based language models, 177 for word/phrase-level annotations, 267–269
Constructive Heuristics Induction for Coverage rate criteria, in language model
Language Learning (CHILL), 151 evaluation, 170
Content Analysis Toolkit (Tika), for Cranfield paradigm, 387
preprocessing IR documents, 392 Creativity/productivity, and the unknown
Content word, in PSG, 125–126 word problem, 13–15
Context, as measure of semantic similarity, CRFs. See Conditional random fields (CRFs)
112 Cross-Language Evaluation Forum (CLEF)
Context-dependent process, in GALE IOD, applying to RTE to non-English languages,
536–537 218
Context features, of Rosetta Consortium IR and, 377, 390
distillation system, 486 QA and, 434, 454, 460–464
Index 557

Cross-language mention propagation, 293, 295 Data-manipulation capabilities


Cross-lingual projections, 275 desired attributes of NLP aggregation, 526
Crossdocument coreference (XDC), in in GATE, 530
Rosetta Consortium distillation in InfoSphere Streams, 531
system, 482–483 in UIMA, 528–529
Crossdocument Structure Theory Bank Data reorganization, speech-to-text (STT)
(CSTBank), 425 and, 535–536
Crossdocument structure theory (CST), 425 Data sets
Crosslingual distillation, 490–491 for evaluating IR systems, 389–391
Crosslingual information retrieval (CLIR) for multilingual automatic summarization,
best practices, 382 425–426
interlingual document representations, Data types
381–382 GALE Type System (GTS), 534–535
machine translation, 380–381 usage conventions for NLP aggregation,
overview of, 365, 378 540–541
translation-based approaches, 378–380 Databases
Crosslingual language modeling, 196–198 of entity relations and events, 309–310
Crosslingual question answering, 454–455 relational, 449
Crosslingual summarization, 398 DATR, unification-based morphology and,
CST (Crossdocument structure theory), 425 18–19
CSTBank (Crossdocument Structure Theory DBpedia, 449
Bank), 425 de Saussure, Ferdinand, 13
Cube pruning, decoding phrase-based models, Decision trees, for sentence or topic
347–348 segmentation, 39–40
CWS (Confidence weighted score), in QA, 463 Decoding phrase-based models
Cyrillic alphabet, 371 cube pruning approach, 347–348
Czech
overview of, 345–347
ambiguity in, 11–13
Deep representation, in semantic
dependency graphs in syntax analysis,
interpretation, 101
62–65
Deep semantic parsing
dependency parsing in, 79
coverage in, 102
finite-state models, 18
overview of, 98
as fusional language, 8
Defense Advanced Research Projects Agency
language modeling, 193
(DARPA)
morphological richness of, 355
GALE distillation initiative, 475–476
negation indicated by inflection, 5
GALE IOD case study. See Interoperability
parsing issues related to morphology, 91
Demo (IOD), GALE case study
productivity/creativity in, 14–15
Topic Detection and Tracking (TDT)
syntactic features used in sentence and
program, 32–33
topic segmentation, 43
unification-based models, 19 Definitional questions, QA and, 433
Deletions metrics, machine translation, 335
Dependencies
DAMSL (Dialog Act Markup in Several global similarity in RTE and, 247
Layers), 31 high-level features in event matching,
Data-driven 324–326
machine translation, 331 Dependency graphs
mention detection, 287–289 phrase structure trees compared with, 69–70
Data formats, challenges in NLP aggregation, in syntactic analysis, 63–67
524 in treebank construction, 62
558 Index

Dependency parsing semi-supervised, 114–116


implementing RTE and, 227 software programs for, 116–117
Minipar and Stanford Parser, 456 supervised, 109–112
MST algorithm for, 79–80 unsupervised, 112–114
shift-reduce parsing algorithm for, 73 Discontiguous arguments, PropBank verb
structural matching and, 447 predicates, 121
tree edit distance based on, 240–241 Discourse commitments (beliefs), RTE system
worst-case parsing algorithm for CFGs, 78 based on, 239–240
Dependency trees Discourse connectives, relating sentences by,
non projective, 65–67 29
overview of, 130–132 Discourse features
patterns used in relation extraction, 318 relating sentences by discourse connectives,
projective, 64–65 29
Derivation, parsing and, 71–72 in sentence and topic segmentation, 44
Devanagari, preprocessing best practices in Discourse segmentation. See Topic
IR, 371 segmentation
Dialog Act Markup in Several Layers Discourse structure
(DAMSL), 31 automatic summarization and, 398, 410
Dialog manager RTE applications and, 249
directing speech generation, 499–500 Discovery of inference rules from text (DIRT),
overview of, 504–505 242
Dialog module (DM) Discriminative language models
call-flow localization and, 514 modeling using morphological categories,
voice user interface and, 507–508 192–193
Dialogs modeling without word segmentation, 194
forms of, 509–510 overview of, 179–180
rules of, 499–500 Discriminative local classification methods,
Dictionary-based approach, in subjectivity for sentence/topic boundary detection,
and sentiment analysis 36–38
document-level annotations, 272–273 Discriminative parsing models
sentence-level annotations, 270–271 morphological information in, 91–92
word/phrase-level annotations, 264–267 overview of, 84–87
Dictionary-based morphology, 15–16 Discriminative sequence classification
Dictionary-based translations methods
applying to CLIR, 380 complexity of, 40–41
crosslingual modeling and, 197 overview of, 34
Directed dialogs, 509 performance of, 41
Directed graphs, 79–80 for sentence/topic boundary detection,
Dirichlet distribution 38–39
Hierarchical Dirichlet process (HDP), 187 Distance-based reordering model, in machine
language models and, 174 translation, 344
Latent Dirichlet allocation (LDA) model, Distance, features of coreference models, 301
186 Distillation
DIRT (Discovery of inference rules from text), bibliography, 495–497
242 crosslingual, 490–491
Disambiguation systems, word sense document and corpus preparation, 480–483
overview of, 105 evaluation and metrics, 491–494
rule-based, 105–109 example, 476–477
semantic parsing and, 152–153 indexing and, 483
Index 559

introduction to, 475–476 topic boundary detection (segmentation),


multimodal, 490 32–33
query answers and, 483–487 typographical and structural features for
redundancy reduction, 489–490 segmentation, 44–45
relevance and redundancy and, 477–479 Document Understanding Conference (DUC),
relevance detection, 488–489 404, 424
Rosetta Consortium system, 479–480 Documents, in distillation systems
summary, 495 indexing, 483
system architectures for, 488 preparing, 480–483
DM (Dialog module) retrieving, 483–484
call-flow localization and, 514 Documents, in IR
interlingual representation, 381–382
voice user interface and, 507–508
monolingual representation, 372–373
Document-level annotations, for subjectivity
preprocessing, 366–367
and sentiment analysis
a priori models, 377
corpus-based, 274
reducing MLIR to CLIR, 383–384
dictionary-based, 272–273
syntax and encoding, 367–368
overview of, 272
translating entire collection, 379
Document retrieval system, INDRI, 323 Documents, QA searches, 444
Document structure Domain dependent scope, for semantic
bibliography, 49–56 parsing, 102
comparing segmentation methods, 40–41 Domain independent scope, for semantic
discourse features of segmentation methods, parsing, 102
44 Dominance relations, 325
discriminative local classification method DSO Corpus, of Sense-Tagged English, 104
for segmentation, 36–38 DUC (Document Understanding Conference),
discriminative sequence classification 404, 424
method for segmentation, 38–39 Dutch
discussion, 48–49 IR and, 390–391
extensions for global modeling sentence normalization and, 371
segmentation, 40 QA and, 439, 444, 461
features of segmentation methods, 41–42 RTE in, 218
generative sequence classification method
for segmentation, 34–36
hybrid methods for segmentation, 39–40 Edit distance, features of coreference models,
introduction to, 29–30 301
lexical features of segmentation methods, Edit Distance Textual Entailment Suite
42–43 (EDITS), 240–241
methods for detecting probable sentence or EDT. See Entity detection and tracking
topic boundaries, 33–34 (EDT)
performance of segmentation methods, 41 Elaborative summaries, in automatic
processing stages of segmentation tasks, 48 summarization, 397
prosodic features for segmentation, 45–48 ElixirFM lexicon, 20
sentence boundary detection Ellipsis, linguistic supports for cohesion, 401
(segmentation), 30–32 EM algorithm. See Expectation-maximization
speech-related features for segmentation, 45 (EM) algorithm
summary, 49 Encoding
syntactic features of segmentation methods, of documents in information retrieval, 368
43–44 parsing issues related to, 89
560 Index

English relations. See Relations


call-flow localization and, 514 resolution in semantic interpretation, 100
co-occurrence of words between languages, Entity detection and tracking (EDT)
337–339 Bell tree for, 297–298
corpora for relation extraction, 317 bibliography, 303–307
corpus-based approach to subjectivity and combining entity and relation detection, 320
sentiment analysis, 271–272 coreference models, 298–300
crosslingual language modeling, 197–198 coreference resolution, 295–296
dependency graphs in syntax analysis, 65 data-driven classification, 287–289
discourse parsers for, 403 experiments in coreference resolution,
distillation, 479, 490–491 302–303
finite-state transducer applied to English experiments in mention detection, 294–295
example, 17 features for mention detection, 291–294
GALE IOD and, 532, 534–536 introduction to, 285–287
IR and, 390 MaxEnt model applied to, 300–301
as isolating or analytic language, 7 mention detection task, 287
machine translation and, 322, 354, 358 searching for mentions, 289–291
manually annotated corpora for, 274 summary, 303
mention detection, 287 Equivalent terms, in GALE distillation
mention detection experiments, 294–296 initiative, 475
multilingual issues in predicate-argument Errors
structures, 146–147 machine translation, 335–337, 343, 349
normalization and, 370 parsing, 141–144
phrase structure trees in syntax analysis, 62 sentence and topic segmentation, 41
polarity analysis of words and phrases, 269 ESA (Explicit semantic analysis), for
productivity/creativity in, 14–15 interlingual document representation,
QA and, 444, 461 382
QA architectures and, 437 Europarl (European Parliament) corpus
RTE in, 218 evaluating co-occurrence of word between
sentence segmentation markers, 30 languages, 337
subjectivity and sentiment analysis, for IR systems, 391
259–260, 262 for machine translation, 358
as SVO language, 356 phrase translation tables, 345
TALES case study, 538 European Language Resources Association,
tokenization and, 410 218
translingual summarization, 398–399, European languages. See also by individual
424–426 languages
word order and, 356 crosslingual question answering and, 455
WordNet and, 109 QA architectures and, 437
Enrichment, in RTE whitespace use in, 369
implementing, 228–231 European Parliament Plenary Speech corpus,
modeling, 225 295
Ensemble clustering methods, in relation EVALITA, applying to RTE to non-English
extraction, 317–318 languages, 218
Entities Evaluation, in automatic summarization
classifiers, 292–293 automated evaluation methodologies,
entity-based relation extraction, 314–315 415–418
events. See Events manual evaluation methodologies, 413–415
Index 561

  overview of, 412–413
  recent developments in, 418–419
Evaluation, in distillation
  citation checking, 493
  GALE and, 492
  metrics, 493–494
  overview of, 491–492
  relevance and redundancy and, 492–493
Evaluation, in IR
  best practices, 391
  data sets for, 389–390
  experimental setup for, 387
  measures in, 388–389
  overview of, 386–387
  relevance assessment, 387–388
  trec-eval tool for, 393
Evaluation, in MT
  automatic evaluation, 334–335
  human assessment, 332–334
  meaning and, 332
  metrics for, 335–337
Evaluation, in QA
  answer correctness, 461–462
  performance metrics, 462–464
  tasks, 460–461
Evaluation, in RTE
  general model and, 224
  improving, 251–252
  performance evaluation, 213–214
Evaluation, of aggregated NLP, 541
Evaluative summaries, in automatic summarization, 397
Events. See also Entities
  extraction, 320–322
  future directions in extraction, 326
  matching, 323–326
  moving beyond sentence processing, 323
  overview of, 320
  resolution in semantic interpretation, 100
Exceptions
  challenges in NLP aggregation, 524
  functional morphology models and, 19
Exclamation point (!), as sentence segmentation marker, 30
Existence classifier, in relation extraction, 313
Expansion documents, query expansion and, 377
Expansion rules, features of predicate-argument structures, 145
Expectation-maximization (EM) algorithm
  split-merge over trees using, 83
  symmetrization and, 340–341
  word alignment between languages and, 339–340
Experiments
  in coreference resolution, 302–303
  in mention detection, 294–295
  setting up for IR evaluation, 387
Explicit semantic analysis (ESA), for interlingual document representation, 382
eXtended WordNet (XWN), 451
Extraction
  in automatic summarization, 399–400
  as classification problem, 312–313
  of events, 320–322, 326
  of relations, 310–311
Extraction, in QA
  candidate extraction from structured sources, 449–450
  candidate extraction from unstructured sources, 445–449
  candidate extraction techniques in QA, 443
Extracts
  in automatic summarization, 397
  defined, 400
Extrinsic evaluation, of summarization, 412
F-measure, in mention detection, 294
Factoid QA systems
  answer correctness, 461
  answer scores and, 450–453
  baseline, 443
  candidate extraction or generation and, 435
  challenges in, 464–465
  crosslingual question answering and, 454
  evaluation tasks, 460–461
  extracting using high-level searches, 445
  extracting using structural matching, 446
  MURAX and, 434
  performance metrics, 462–463
  questions, 433
  type classification of, 440
Factoids, in manual evaluation of summarization, 413
Factored (cascaded) model, 313
Factored language models (FLM)
  machine translation and, 355
  morphological categories in, 193
  overview of, 183–184
Feature extractors
  building summarization systems, 423
  distillation and, 485–486
  summarization and, 406
Features
  in mention detection system, 291–294
  typed feature structures and unification, 18–19
  in word disambiguation system, 110–112
Features, in sentence or topic segmentation
  defined, 33
  discourse features, 44
  lexical features, 42–43
  overview of, 41–42
  predictions based on, 29
  prosodic features, 45–48
  speech-related features, 45
  syntactic features, 43–44
  typographical and structural features, 44–45
Fertility, word alignment and, 340
File types, document syntax and, 367–368
Finite-state morphology, 16–18
Finite-state transducers, 16–17, 20
Finnish
  as agglutinative language, 7
  IR and, 390–391
  irregular verbs, 10
  language modeling, 189–191
  parsing issues related to morphology, 91
  summarization and, 399
FIRE (Forum for Information Retrieval Evaluation), 390
Flexible, distributed componentization
  desired attributes of NLP aggregation, 524–525
  in GATE, 530
  in InfoSphere Streams, 530
  in UIMA, 528
FLM. See Factored language models (FLM)
Fluency, of translation, 334
Forum for Information Retrieval Evaluation (FIRE), 390
FraCaS corpus, applying natural logic to RTE, 246
Frame elements
  in PSG, 126
  semantic frames in FrameNet, 118
FrameNet
  limitation of, 122–123
  resources, 122
  resources for predicate-argument recognition, 118–122
Freebase, 449
French
  automatic speech recognition (ASR), 179
  dictionary-based approach to subjectivity and sentiment analysis, 267
  human assessment of translation English to, 332–333
  IR and, 378, 390–391
  language modeling, 188
  localization of spoken dialog systems, 513
  machine translation and, 350, 353–354, 358
  phrase structure trees in syntax analysis, 62
  polarity analysis of words and phrases, 269
  QA and, 454, 461
  RTE in, 217–218
  translingual summarization, 398
  word segmentation and, 90
  WordNet and, 109
Functional morphology, 19–21
Functions, viewing language relations as, 17
Fusional languages
  functional morphology models and, 19
  morphological typology and, 8
  normalization and, 371
  preprocessing best practices in IR, 371
GALE. See Global Autonomous Language Exploitation (GALE)
GALE Type System (GTS), 534–535
GATE. See General Architecture for Text Engineering (GATE)
Gazetteer, features of mention detection system, 293
GEN-AFF (general-affiliation), relation class, 312
Gender
  ambiguity resolution, 13
  multilingual approaches to grammatical gender, 398
General Architecture for Text Engineering (GATE)
  attributes of, 530
  history of summarization systems, 399
  overview of, 529–530
  summarization frameworks, 422
General Inquirer, subjectivity and sentiment analysis lexicon, 262
Generalized backoff strategy, in FLM, 183–184
Generative parsing models, 83–84
Generative sequence classification methods
  complexity of, 40
  overview of, 34
  performance of, 41
  for sentence/topic boundary detection, 34–36
Geometric vector space model, for document retrieval, 375
GeoQuery
  resources for meaning representation, 149
  supervised systems for semantic parsing, 151
German
  co-occurrence of words between languages, 337–339
  dictionary-based approach to subjectivity and sentiment analysis, 265–266, 273
  discourse parsers for, 403
  as fusional language, 8
  IR and, 390–392
  language modeling, 189
  mention detection, 287
  morphological richness of, 354–355
  normalization, 370–371
  OOV rate in, 191
  phrase-based model for decoding, 345
  polarity analysis of words and phrases, 269
  QA and, 461
  RTE in, 218
  subjectivity and sentiment analysis, 259, 276
  summarization and, 398, 403–404, 420
  WordNet and, 109
Germanic languages, language modeling for, 189
GetService process, of voice user interface (VUI), 506–507
Giza, machine translation program, 423
GIZA toolkit, for machine translation, 357
Global Autonomous Language Exploitation (GALE)
  distillation initiative of DARPA, 475–476
  evaluation in distillation, 492
  Interoperability Demo case study. See Interoperability Demo (IOD), GALE case study
  metrics for evaluating distillation, 494
  relevance and redundancy in, 477–479
Global linear model, discriminative approach to learning, 84
Good-Turing
  machine translation and, 345
  smoothing techniques in language model estimation, 172
Google, 435
Google Translate, 331, 455
Grammars
  Combinatory Categorial Grammar (CCG), 129–130
  context-free. See Context-free grammar (CFGs)
  head-driven phrase structure grammar (HPSG), 18
  localization of, 514, 516–517
  morphological resource grammars, 19, 21
  phrase structure. See Phrase Structure Grammar (PSG)
  probabilistic context-free. See Probabilistic context-free grammars (PCFGs)
  rule-based grammars in speech recognition, 501–503
  Tree-Adjoining Grammar (TAG), 130
  voice user interface (VUI), 508–509
Grammatical Framework, 19, 21
Graph-based approaches, to automatic summarization
  applying RST to summarization, 402–404
  coherence and cohesion and, 401–402
  LexPageRank, 406
  overview of, 401
  TextRank, 404–406
Graph generation, in RTE
  implementing, 231–232
  modeling, 226
Graphemes, 4
Greedy best-fit decoding, in mention detection, 322
Groups, aligning views in RTE, 233
Grow-diag-final method, for word alignment, 341
GTS (GALE Type System), 534–535
Gujarati. See Indian languages
HDP (Hierarchical Dirichlet process), 187
Head-driven phrase structure grammar (HPSG), 18
Head word
  dependency trees and, 131
  in Phrase Structure Grammar (PSG), 124
Headlines, typographical and structural features for sentence and topic segmentation, 44–45
Hebrew
  encoding and script, 368
  preprocessing best practices in IR, 371
  tokens in, 4
  unification-based models, 19
HELM (hidden event language model)
  applied to sentence segmentation, 36
  methods for sentence or topic segmentation, 40
Hidden event language model (HELM)
  applied to sentence segmentation, 36
  methods for sentence or topic segmentation, 40
Hidden Markov model (HMM)
  applied to topic and sentence segmentation, 34–36
  measuring token frequency, 369
  mention detection and, 287
  methods for sentence or topic segmentation, 39
  word alignment between languages and, 340
Hierarchical Dirichlet process (HDP), 187
Hierarchical phrase-based models, in machine translation, 350–351
Hierarchical phrase pairs, in machine translation, 351
High-level features, in event matching, 324
Hindi. See also Indian languages
  IR and, 390
  resources for semantic parsing, 122
  translingual summarization, 399
History, conditional context of probability, 83
HMM. See Hidden Markov model (HMM)
Homonymy
  in Korean, 10
  word sense ambiguities and, 104
HowNet
  dictionary-based approach to subjectivity and sentiment analysis, 272–273
  semantic parsing resource, 105
HTML Parser, preprocessing IR documents, 392
Hunalign tool, for machine translation, 357
Hungarian
  dependency graphs in syntax analysis, 65
  IR and, 390
  morphological richness of, 355
Hybrid methods, for segmentation, 39–40
Hypergraphs, worst-case parsing algorithm for CFGs, 74–79
Hypernyms, 442
Hyponymy, 310
Hypotheses, machine translation and, 346
IBM Models, for machine translation, 338–341
IDF. See Inverse document frequency (IDF)
IE. See Information extraction (IE)
ILP (Integer linear programming), 247
Implementation process, in RTE
  alignment, 233–236
  enrichment, 228–231
  graph generation, 231–232
  inference, 236–238
  overview of, 227
  preprocessing, 227–228
  training, 238
IMS (It Makes Sense), program for word sense disambiguation, 117
Independence assumption
  document retrieval and, 372
  overcoming in predicate-argument structure, 137–138
Indexes
  of documents in distillation system, 483
  for IR generally, 366
  latent semantic indexing (LSI), 381
  for monolingual IR, 373–374
  for multilingual IR, 383–384
  phrase indices, 366, 369–370
  positional indices, 366
  translating MLIR queries, 384
Indian languages, IR and. See also Hindi, 390
INDRI document retrieval system, 323
Inexact retrieval models, for monolingual information retrieval, 374
InfAP metrics, for IR performance, 389
Inference, textual. See Textual inference
Inflectional paradigms
  in Czech, 11–12
  in morphologically rich languages, 189
Information context, as measure of semantic similarity, 112
Information extraction (IE). See also Entity detection and tracking (EDT)
  defined, 285
  entity and event resolution and, 100
Information retrieval (IR)
  bibliography, 394–396
  crosslingual. See Crosslingual information retrieval (CLIR)
  data sets used in evaluation of, 389–391
  distillation compared with, 475
  document preprocessing for, 366–367
  document syntax and encoding, 367–368
  evaluation in, 386–387, 391
  introduction to, 366
  key word searches in, 433
  measures in, 388–389
  monolingual. See Monolingual information retrieval
  multilingual. See Multilingual information retrieval (MLIR)
  normalization and, 370–371
  preprocessing best practices, 371–372
  redundancy problem and, 488
  relevance assessment, 387–388
  summary, 393
  tokenization and, 369–370
  tools, software, and resources, 391–393
  translingual, 491
Informative summaries, in automatic summarization, 401–404
InfoSphere Streams, 530–531
Insertion metric, in machine translation, 335
Integer linear programming (ILP), 247
Interactive voice response (IVR), 505, 511
Interoperability Demo (IOD), GALE case study
  computational efficiency, 537
  flexible application building with, 537
  functional description, 532–534
  implementing, 534–537
  overview of, 531–532
Interoperability, in aggregated NLP, 540
Interpolation, language model adaptation and, 176
Intrinsic evaluation, of summarization, 412
Inverse document frequency (IDF)
  answer scores in QA and, 450–451
  document representation in monolingual IR, 373
  relationship questions and, 488
  searching over unstructured sources, 445
Inverted indexes, for monolingual information retrieval, 373–374
IOD case study. See Interoperability Demo (IOD), GALE case study
IR. See Information retrieval (IR)
Irregularity
  defined, 8
  issues with morphology induction, 21
  in linguistic models, 8–10
IRSTLM toolkit, for machine translation, 357
Isolating (analytic) languages
  finite-state technology applied to, 18
  morphological typology and, 7
It Makes Sense (IMS), program for word sense disambiguation, 117
Italian
  dependency graphs in syntax analysis, 65
  IR and, 390–391
  normalization and, 371
  polarity analysis of words and phrases, 269
  QA and, 461
  RTE in, 218
  summarization and, 399
  WordNet and, 109
IVR (interactive voice response), 505, 511
IXIR distillation system, 488–489
Japanese
  as agglutinative language, 7
  anaphora frequency in, 444
  call-flow localization and, 514
  crosslingual QA, 455
  discourse parsers for, 403
  EDT and, 286
  GeoQuery corpus translated into, 149
  IR and, 390
  irregular verbs, 10
  language modeling, 193–194
  polarity analysis of words and phrases, 269
  preprocessing best practices in IR, 371–372
  QA architectures and, 437–438, 461, 464
  semantic parsing, 122, 151
  subjectivity and sentiment analysis, 259, 267–271
  word order and, 356
  word segmentation in, 4–5
JAVELIN system, for QA, 437
Joint inference, NLP and, 320
Joint systems
  optimization vs. interoperability in aggregated NLP, 540
  types of EDT architectures, 286
Joshua machine translation program, 357, 423
JRC-Acquis corpus
  for evaluating IR systems, 390
  for machine translation, 358
KBP (Knowledge Base Population), of Text Analysis Conferences (TAC), 481–482
Kernel functions, SVM mapping and, 317
Kernel methods, for relation extraction, 319
Keyword searches
  in IR, 433
  searching over unstructured sources, 443–445
KL-ONE system, for predicate-argument recognition, 122
Kneser-Ney smoothing technique, in language model estimation, 172
Knowledge Base Population (KBP), of Text Analysis Conferences (TAC), 481–482
Korean
  as agglutinative language, 7
  ambiguity in, 10–11
  dictionary-based approach in, 16
  EDT and, 286
  encoding and script, 368
  finite-state models, 18
  gender, 13
  generative parsing model, 92
  IR and, 390
  irregular verbs, 10
  language modeling, 190
  language modeling using subword units, 192
  morphemes in, 6–7
  polarity analysis of words and phrases, 269
  preprocessing best practices in IR, 371–372
  resources for semantic parsing, 122
  word segmentation in, 4–5
KRISPER program, for rule-based semantic parsing, 151
Language identification, in MLIR, 383
Language models
  adaptation, 176–178
  Bayesian parameter estimation, 173–174
  Bayesian topic-based, 186–187
  bibliography, 199–208
  class-based, 178–179
  crosslingual, 196–198
  discriminative, 179–180
  for document retrieval, 375–376
  evaluation of, 170–171
  factored, 183–184
  introduction to, 169
  language-specific problems, 188–189
  large-scale models, 174–176
  MaxEnt, 181–183
  maximum-likelihood estimation and smoothing, 171–173
  morphological categories in, 192–193
  for morphologically rich languages, 189–191
  multilingual, 195–196
  n-gram approximation, 170
  neural network, 187–188
  spoken vs. written languages and, 194–195
  subword unit selection, 191–192
  summary, 198
  syntax-based, 180–181
  tree-based, 185–186
  types of, 178
  variable-length, 179
  word segmentation and, 193–194
The Language Understanding Annotated Corpus, 425
langue and parole (de Saussure), 13
Latent Dirichlet allocation (LDA) model, 186
Latent semantic analysis (LSA)
  bilingual (bLSA), 197–198
  language model adaptation and, 176–177
  probabilistic (PLSA), 176–177
Latent semantic indexing (LSI), 381
Latin
  as fusional language, 8
  morphologies of, 20
  preprocessing best practices in IR, 371
  transliteration of scripts to, 368
Latvian
  IR and, 390
  summarization and, 399
LDA (Latent Dirichlet allocation) model, 186
LDC. See Linguistic Data Consortium (LDC)
LDOCE (Longman Dictionary of Contemporary English), 104
LEA. See Lexical entailment algorithm (LEA)
Learning, discriminative approach to, 84
Lemmas
  defined, 5
  machine translation metrics and, 336
  mapping terms to, 370
Lemmatizers
  mapping terms to lemmas, 370
  preprocessing best practices in IR, 371
Lemur IR framework, 392
Lesk algorithm, 105–106
Lexemes
  functional morphology models and, 19
  overview of, 5
Lexical chains, in topic segmentation, 38, 43
Lexical choice, in machine translation, 354–355
Lexical collocation, 401
Lexical entailment algorithm (LEA)
  alignment stage of RTE model, 236
  enrichment stage of RTE model, 228–231
  inference stage of RTE model, 237
  preprocessing stage in RTE model, 227–228
  training stage of RTE model, 238
Lexical features
  context as, 110
  in coreference models, 301
  in event matching, 324
  in mention detection, 292
  of relation extraction systems, 314
  in sentence and topic segmentation, 42–43
Lexical matching, 212–213
Lexical ontologies, relation extraction and, 310
Lexical strings, 17, 18
Lexicon, of languages
  building, 265–266
  dictionary-based approach to subjectivity and sentiment analysis, 270, 273
  ElixirFM lexicon of Arabic, 20
  sets of lexemes constituting, 5
  subjectivity and sentiment analysis with, 262, 275–276
LexPageRank, approach to automatic summarization, 406, 411
LexTools, for finite-state morphology, 16
Linear model interpolation, for smoothing language model estimates, 173
LinearRank algorithm, learning summarization, 408
lingPipe tool, for summarization, 423
Linguistic challenges, in MT
  lexical choice, 354–355
  morphology and, 355
  word order and, 356
Linguistic Data Consortium (LDC)
  corpora for machine translation, 358
  evaluating co-occurrence of word between languages, 337
  history of summarization systems, 399
  OntoNotes corpus, 104
  on sentence segmentation markers in conversational speech, 31
  summarization frameworks, 422
List questions
  extension to, 453
  QA and, 433
Local collocations, features of supervised systems, 110–111
Localization, of spoken dialog systems
  call-flow localization, 514
  localization of grammars, 516–517
  overview of, 513–514
  prompt localization, 514–516
  testing, 519–520
  training, 517–519
Log-linear models, phrase-based models for MT, 348–349
Logic-based representation, applying to RTE, 242–244
Logographic scripts, preprocessing best practices in IR, 371
Long-distance dependencies, syntax-based language models for, 180–181
Longman Dictionary of Contemporary English (LDOCE), 104
Lookup operations, dictionaries and, 16
Loudness, prosodic cues, 45–47
Low-level features, in event matching, 324
Lucene
  document indexing with, 483
  document retrieval with, 483–484
  IR frameworks, 392
LUNAR QA system, 434
Machine learning. See also Conditional random fields (CRFs)
  event extraction and, 322
  measuring token frequency, 369
  summarization and, 406–409
  word alignment as learning problem, 341–343
Machine translation (MT)
  alignment models, 340
  automatic evaluation, 334–335
  bibliography, 360–363
  chart decoding, 351–352
  CLIR applied to, 380–381
  co-occurrence of words and, 337–338
  coping with model size, 349–350
  corpora for, 358
  crosslingual QA and, 454
  cube pruning approach to decoding, 347–348
  data reorganization and, 536
  data resources for, 356–357
  decoding phrase-based models, 345–347
  expectation maximization (EM) algorithm, 339–340
  future directions, 358–359
  in GALE IOD, 532–533
  hierarchical phrase-based models, 350–351
  history and current state of, 331–332
  human assessment and, 332–334
  IBM Model 1, 338–339
  lexical choice, 354–355
  linguistic choices, 354
  log-linear models and parameter tuning, 348–349
  meaning evaluation, 332
  metrics, 335–337
  morphology and, 355
  multilingual automatic summarization and, 410
  overview of, 331
  paraphrasing and, 59
  phrase-based models, 343–344
  programs for, 423
  RTE applied to, 217–218
  in RTTS, 538
  sentences as processing unit in, 29
  statistical. See Statistical machine translation (SMT)
  summary, 359
  symmetrization, 340–341
  syntactic models, 352–354
  systems for, 357–358
  in TALES, 538
  tools for, 356–357, 392
  training issues, 197
  training phrase-based models, 344–345
  translation-based approach to CLIR, 378–380
  tree-based models, 350
  word alignment and, 337, 341–343
  word order and, 356
MAP (maximum a posteriori)
  Bayesian parameter estimation and, 173–174
  language model adaptation and, 177–178
MAP (Mean average precision), metrics for IR systems, 389
Marathi, 390
Margin infused relaxed algorithm (MIRA)
  methods for sentence or topic segmentation, 39
  unsupervised approaches to machine learning, 342
Markov model. See also Hidden Markov model (HMM), 34–36
Matches, machine translation metrics, 335
Matching events, 323–326
Mate retrieval setup, relevance assessment and, 388
MaxEnt model
  applied to distillation, 480
  classifiers for relation extraction, 316–317
  classifiers for sentence or topic segmentation, 37, 39–40
  coreference resolution with, 300–301
  language model adaptation and, 177
  memory-based learning compared with, 322
  mention detection, 287–289
  modeling using morphological categories, 193
  modeling without word segmentation, 194
  overview of, 181–183
  subjectivity and sentiment analysis with, 274
  unsupervised approaches to machine learning, 342
Maximal marginal relevance (MMR), in automatic summarization, 399
Maximum a posteriori (MAP)
  Bayesian parameter estimation and, 173–174
  language model adaptation and, 177–178
Maximum-likelihood estimation
  Bayesian parameter estimation and, 173–174
  as parameter estimation language model, 171–173
  used with document models in information retrieval, 375–376
MEAD system, for automatic summarization, 410–411, 423
Mean average precision (MAP), metrics for IR systems, 389
Mean reciprocal rank (MRR), metrics for QA systems, 462–463
Meaning chunks, semantic parsing and, 97
Meaning of words. See Word meaning
Meaning representation
  Air Travel Information System (ATIS), 148
  Communicator program, 148–149
  GeoQuery, 149
  overview of, 147–148
  RoboCup, 149
  rule-based systems for, 150
  semantic interpretation and, 101
  software programs for, 151
  summary, 153–154
  supervised systems for, 150–151
Measures. See Metrics
Media Resource Control Protocol (MRCP), 504
Meeting Recorder Dialog Act (MRDA), 31
Memory-based learning, 322
MEMT (multi-engine machine translation), in GALE IOD, 532–533
Mention detection
  Bell tree and, 297
  computing probability of mention links, 297–300
  data-driven classification, 287–289
  experiments in, 294–295
  features for, 291–294
  greedy best-fit decoding, 322
  MaxEnt model applied to entity-mention relationships, 301
  mention-matching features in event matching, 324
  overview of, 287
  problems in information extraction, 285–286
  in Rosetta Consortium distillation system, 480–481
  searching for mentions, 289–291
Mention-synchronous process, 297
Mentions
  entity relations and, 310–311
  named, nominal, prenominal, 287
Meronymy, 310
MERT (minimum error rate training), 349
METEOR, metrics for machine translation, 336
METONYMY class, ACE, 312
Metrics
  distillation, 491–494
  graph generation and, 231
  IR, 388
  machine translation, 335–337
  magnitude of RTE metrics, 233
  for multilingual automatic summarization, 419–420
  QA, 462–464
  RTE annotation constituents, 222–224
Microsoft, history of QA systems and, 435
Minimum error rate training (MERT), 349
Minimum spanning trees (MSTs), 79–80
Minipar
  dependency parsing with, 456
  rule-based dependency parser, 131–132
MIRA (margin infused relaxed algorithm)
  methods for sentence or topic segmentation, 39
  unsupervised approaches to machine learning, 342
Mixed initiative dialogs, in spoken dialog systems, 509
MLIR. See Multilingual information retrieval (MLIR)
MLIS-MUSI summarization system, 399
MMR (maximal marginal relevance), in automatic summarization, 399
Models, information retrieval
  monolingual, 374–376
  selection best practices, 377–378
Models, word alignment
  EM algorithm, 339–340
  IBM Model 1, 338–339
  improvements on IBM Model 1, 340
Modern Standard Arabic (MSA), 189–191
Modification processes, in automatic summarization, 399–400
Modifier word, dependency trees and, 131
Monolingual information retrieval. See also Information retrieval (IR)
  document a priori models, 377
  document representation, 372–373
  index structures, 373–374
  model selection best practices, 377–378
  models for, 374–376
  overview of, 372
  query expansion technique, 376–377
Monotonicity
  applying natural logic to RTE, 246
  defined, 224
Morfessor package, for identifying morphemes, 191–192
Morphemes
  abstract in morphology induction, 21
  automatic algorithms for identifying, 191–192
  defined, 4
  examples of, 6–7
  functional morphology models and, 19
  Japanese text segmented into, 438
  language modeling for morphologically rich languages, 189
  overview of, 5–6
  parsing issues related to, 90–91
  typology and, 7–8
Morphological models
  automating (morphology induction), 21
  dictionary-based, 15–16
  finite-state, 16–18
  functional, 19–21
  overview of, 15
  unification-based, 18–19
Morphological parsing
  ambiguity and, 10–13
  dictionary lookup and, 15
  discovery of word structure by, 3
  irregularity and, 8–10
  issues and challenges, 8
Morphology
  categories in language models, 192–193
  compared with syntax and phonology and orthography, 3
  induction, 21
  language models for morphologically rich languages, 189–191
  linguistic challenges in machine translation, 355
  parsing issues related to, 90–92
  typology, 7–8
Morphs (segments)
  data-sparseness problem and, 286
  defined, 5
  functional morphology models and, 19
  not all morphs can be assumed to be morphemes, 7
  typology and, 8
Moses system
  grow-diag-final method, 341
  machine translation, 357, 423
MPQA corpus
  manually annotated corpora for English, 274
  subjectivity and sentiment analysis, 263, 272
MRCP (Media Resource Control Protocol), 504
MRDA (Meeting Recorder Dialog Act), 31
MRR (Mean reciprocal rank), metrics for QA systems, 462–463
MSA (Modern Standard Arabic), 189–191
MSE (Multilingual Summarization Evaluation), 399, 425
MSTs (minimum spanning trees), 79–80
Multext Dataset, corpora for evaluating IR systems, 390
Multi-engine machine translation (MEMT), in GALE IOD, 532–533
Multilingual automatic summarization
  automated evaluation methodologies, 415–418
  building a summarization system, 420–421, 423–424
  challenges in, 409–410
  competitions related to, 424–425
  data sets for, 425–426
  devices/tools for, 423
  evaluating quality of summaries, 412–413
  frameworks summarization system can be implemented in, 422–423
  manual evaluation methodologies, 413–415
  metrics for, 419–420
  recent developments, 418–419
  systems for, 410–412
Multilingual information retrieval (MLIR)
  aggregation models, 385
  best practices, 385–386
  defined, 382
  index construction, 383–384
  language identification, 383
  overview of, 365
  query translation, 384
Multilingual language modeling, 195–196
Multilingual Summarization Evaluation (MSE), 399, 425
Multimodal distillation, 490
Multiple reference translations, 336
Multiple views, overcoming parsing errors, 142–144
MURAX, 434
n-gram
  localization of grammars and, 516
  trigrams, 502–503
n-gram approximation
  language model evaluation and, 170–171
  language-specific modeling problems, 188–189
  maximum-likelihood estimation, 171–172
  smoothing techniques in language model estimation, 172
  statistical language models using, 170
  subword units used with, 192
n-gram models. See also Phrase indices
  AutoSummENG graph, 419
  character models, 370
  defined, 369–370
  document representation in monolingual IR, 372–373
Naïve Bayes
  classifiers for relation extraction, 316
  subjectivity and sentiment analysis, 274
Named entity recognition (NER)
  aligning views in RTE, 233
  automatic summarization and, 398
  candidate answer generation and, 449
  challenges in RTE, 212
  enrichment stage of RTE model, 229–230
  features of supervised systems, 112
  graph generation stage of RTE model, 231
  impact on searches, 444
  implementing RTE and, 227
  information extraction and, 100
  mention detection related to, 287
  in PSG, 125–126
  QA architectures and, 439
  in Rosetta Consortium distillation system, 480
  in RTE, 221
National Institute of Standards and Technology (NIST)
  BLEU score, 295
  relation extraction and, 311
  summarization frameworks, 422
  textual entailment and, 211, 213
Natural language
  call routing, 510
  parsing, 57–59
Natural language generation (NLG), 503–504
Natural language processing (NLP)
  applications of syntactic parsers, 59
  applying to non-English languages, 218
  distillation and. See Distillation
  extraction of document structure as aid in, 29
  joint inference, 320
  machine translation and, 331
  minimum spanning trees (MST) and, 79
  multiview representation of analysis, 220–222
  packages for, 253
  problems in information extraction, 286
  relation extraction and, 310
  RTE applied to NLP problems, 214
  RTE as subfield of. See Recognizing textual entailment (RTE)
  syntactic analysis of natural language, 57
  textual inference, 209
Natural language processing (NLP), combining engines for
  aggregation architectures, 527
  bibliography, 548–549
  computational efficiency, 525–526
  data-manipulation capacity, 526
  flexible, distributed componentization, 524–525
  GALE Interoperability Demo case study, 531–537
  General Architecture for Text Engineering (GATE), 529–530
  InfoSphere Streams, 530–531
  introduction to, 523–524
  lessons learned, 540–542
  robust processing, 526–527
  RTTS case study, 538–540
  summary, 542
  TALES case study, 538
  Unstructured Information Management Architecture (UIMA), 527–529, 542–547
Natural Language Toolkit (NLTK), 422
Natural language understanding (NLU), 209
Natural logic-based representation, applying to RTE, 245–246
NDCG (Normalized discounting cumulative gain), 389
NER. See Named entity recognition (NER)
Neural network language models (NNLMs)
  language modeling using morphological categories, 193
  overview of, 187–188
Neural networks, approach to machine learning, 342
Neutralization, homonyms and, 12
The New York Times Annotated Corpus, 425
NewsBlaster, for automatic summarization, 411–412
NII Test Collection for IR Systems (NTCIR)
  answer scores in QA and, 453
  data sets for evaluating IR systems, 390
  evaluation of QA, 460–464
  history of QA systems and, 434
NIST. See National Institute of Standards and Technology (NIST)
NLG (natural language generation), 503–504
NLP. See Natural language processing (NLP)
NLTK (Natural Language Toolkit), 422
NNLMs (neural network language models)
  language modeling using morphological categories, 193
  overview of, 187–188
NOMinalization LEXicon (NOMLEX), 121
Non projective dependency trees, 65–66
Nonlinear languages, morphological typology and, 8
Normalization
  Arabic, 12
  overview of, 370–371
  tokens and, 4
  Z-score normalization, 385
Normalized discounting cumulative gain (NDCG), 389
Norwegian, 461
Noun arguments, 144–146
Noun head, of prepositional phrases in PSG, 127
NTCIR. See NII Test Collection for IR Systems (NTCIR)
Numerical quantities (NUM) constituents, in RTE, 221, 233
Objective word senses, 261
OCR (Optical character recognition), 31
One vs. All (OVA) approach, 136–137
OntoNotes corpus, 104
OOV (out of vocabulary)
  coverage rates in language models, 170
  morphologically rich languages and, 189–190
OOV rate
  in Germanic languages, 191
  inventorying morphemes and, 192
  language modeling without word segmentation, 194
Open-domain QA systems, 434
Open Standard by the Organization for the Advancement of Structured Information Standards (OASIS), 527
OpenCCG project, 21
openNLP, 423
Opinion questions, QA and, 433
OpinionFinder
  as rule-based system, 263
  subjectivity and sentiment analysis, 271–272, 275–276
  subjectivity and sentiment analysis lexicon, 262
Optical character recognition (OCR), 31
OPUS project, corpora for machine translation, 358
Ordinal constituent position, in PSG, 127
ORG-AFF (organization-affiliation) class, 311–312
Orthography
  Arabic, 11
  issues with morphology induction, 21
Out of vocabulary (OOV)
  coverage rates in language models, 170
  morphologically rich languages and, 189–190
PageRank
  automatic summarization, 401
  LexPageRank compared with, 406
  TextRank compared with, 404
Paradigms
  classification, 133–137
  functional morphology models and, 19
  inflectional paradigms in Czech, 11–12
  inflectional paradigms in morphologically rich languages, 189
ParaEval
  automatic evaluation of summarization, 418
  metrics in, 420
Paragraphs, sentences forming, 29
Parallel backoff, 184
Parameter estimation language models
  Bayesian parameter estimation, 173–174
  large-scale models, 174–176
  maximum-likelihood estimation and smoothing, 171–173
Parameter tuning, 348–349
Parameters, functional morphology models and, 19
Paraphrasing, parsing natural language and, 58–59
Parasitic gap recovery, in RTE, 249
parole and langue (de Saussure), 13
Parsing
  algorithms for, 70–72
  ambiguity resolution in, 80
  defined, 97
  dependency parsing, 79–80
  discriminative models, 84–87
  generative models, 83–84
  hypergraphs and chart parsing, 74–79
  natural language, 57–59
  semantic parsing. See Semantic parsing
  sentences as processing unit in, 29
  shift-reduce parsing, 72–73
Part of speech (POS)
  class-based language models and, 178
  features of supervised systems, 110
  implementing RTE and, 227
  natural language grammars and, 60
  in PSG, 125–127
  QA architectures and, 439
  in Rosetta Consortium distillation system, 480
  for sentence segmentation, 43
  syntactic analysis of natural language, 57–58
PART-WHOLE relation class, 311
Partial order method, for ranking sentences, 407
Particle language model, subword units in, 192
Partition function, in MaxEnt formula, 316
PASCAL. See Pattern Analysis, Statistical Modelling and Computational Learning (PASCAL)
Path
  in CCG, 130
  in PSG, 124, 128–129
  in TAG, 130
  for verb sense disambiguation, 112
Pattern Analysis, Statistical Modelling and Computational Learning (PASCAL)
  evaluating textual entailment, 213
  RTE challenge, 451–452
  textual entailment and, 211
Pauses, prosodic cues, 45–47
Peer surveys, in evaluation of summarization, 412
Penn Treebank
  dependency trees and, 130–132
  parsing issues and, 87–89
  performance degradation and, 147
  phrase structure trees in, 68, 70
  PropBank and, 123
PER (Position-independent error rate), 335
PER-SOC (personal-social) relation class, 311
Performance
  of aggregated NLP, 541
  combining classifiers to boost (Combination hypothesis), 293
  competence vs. performance (Chomsky), 13
  of document segmentation methods, 41
  evaluating IR, 389
  evaluating QA, 462–464
  evaluating RTE, 213–214
  feature performance in predicate-argument structure, 138–140
  Penn Treebank, 147
Period (.), sentence segmentation markers, 30
Perplexity
  criteria in language model evaluation, 170–171
  inventorying morphemes and, 192
  language modeling using morphological categories, 193
  language modeling without word segmentation, 194
Persian
  IR and, 390
  unification-based models, 19
Phoenix, 150
Phonemes, 4
Phonology
  compared with morphology and syntax and orthography, 3
  issues with morphology induction, 21
Phrasal verb collocations, in PSG, 126
Phrase-based models, for MT
  coping with model size, 349–350
  cube pruning approach to decoding, 347–348
  decoding, 345–347
  hierarchical phrase-based models, 350–351
  log-linear models and parameter tuning, 348–349
  overview of, 343–344
  training, 344–345
Phrase feature, in PSG, 124
Phrase indices, tokenization and, 366, 369–370
Phrase-level annotations, for subjectivity and sentiment analysis
  corpus-based, 267–269
  dictionary-based, 264–267
  overview of, 264
Phrase Structure Grammar (PSG), 124–129
Phrase structure trees
  examples of, 68–70
  morphological information in, 91
  in syntactic analysis, 67
  treebank construction and, 62
Phrases
  early approaches to summarization and, 400
  types in CCG, 129–130
PHYS (physical) relation class, 311
Pipeline approach, to event extraction, 320–321
Pitch, prosodic cues, 45–47
Pivot language, translation-based approach to CLIR, 379–380
Polarity
  corpus-based approach to subjectivity and sentiment analysis, 269
  relationship to monotonicity, 246
  word sense classified by, 261
Polysemy, 104
Portuguese
  IR and, 390–391
  QA and, 461
  RTE in, 218
POS. See Part of speech (POS)
Position-independent error rate (PER), 335
Positional features, approaches to summarization and, 401
Positional indices, tokens and, 366
Posting lists, term relationships in document retrieval, 373–374
Pre-reordering, word order in machine translation, 356
Preboundary lengthening, in sentence segmentation, 47
Precision, IR evaluation measure, 388
Predicate-argument structure
  base phrase chunks, 132–133
  classification paradigms, 133–137
  Combinatory Categorial Grammar (CCG), 129–130
  dependency trees, 130–132
  feature performance, salience, and selection, 138–140
  FrameNet resources, 118–119
  multilingual issues, 146–147
  noun arguments, 144–146
  other resources, 121–122
  overcoming parsing errors, 141–144
  overcoming the independence assumption, 137–138
  Phrase Structure Grammar (PSG), 124–129
  PropBank resources, 119–121
  robustness across genres, 147
  semantic interpretation and, 100
  semantic parsing. See Predicate-argument structure
  semantic role labeling, 118
  sizing training data, 140–141
  software programs for, 147
  structural matching and, 447–448
  summary, 153
  syntactic representation, 123–124
  systems, 122–123
  Tree-Adjoining Grammar, 130
Predicate context, in PSG, 129
Predicate feature, in Phrase Structure Grammar (PSG), 124
Prepositional phrase adjunct, features of supervised systems, 111
Preprocessing, in IR
  best practices, 371–372
  documents for information retrieval, 366–367
  tools for, 392
Preprocessing, in RTE
  implementing, 227–228
  modeling, 224–225
Preprocessing queries, 483
Preterminals. See Part of speech (POS)
Previous role, in PSG, 126
PRF (Pseudo relevance feedback)
  as alternative to query expansion, 445
  overview of, 377
Private states. See also Subjectivity and sentiment analysis, 260
Probabilistic context-free grammars (PCFGs)
  for ambiguity resolution, 80–83
  dependency graphs in syntax analysis, 66–67
  generative parsing models, 83–84
  parsing techniques, 78
Probabilistic latent semantic analysis (PLSA), 176–177
Probabilistic models
  document a priori models, 377
  for document retrieval, 375
Probability
  history of, 83
  MaxEnt formula for conditional probability, 316
Productivity/creativity, and the unknown word problem, 13–15
Projective dependency trees
  overview of, 64–65
  worst-case parsing algorithm for CFGs, 78
Projectivity
  in dependency analysis, 64
  non projective dependency trees, 65–67
  projective dependency trees, 64–65
Prompt localization, spoken dialog systems, 514–516
PropBank
  annotation of, 447
  dependency trees and, 130–132
  limitation of, 122
  Penn Treebank and, 123
  as resource for predicate-argument recognition, 119–122
  tagging text with arguments, 124
Prosody
  defined, 45
  sentence and topic segmentation, 45–48
Pseudo relevance feedback (PRF)
  as alternative to query expansion, 445
  overview of, 377
PSG (Phrase Structure Grammar), 124–129
Publications, resources for RTE, 252
Punctuation
  in PSG, 129
  typographical and structural features for sentence and topic segmentation, 44–45
PUNDIT, 122
Pushdown automaton, in CFGs, 72
Pyramid, for manual evaluation of summarization, 413–415
QA. See Question answering (QA)
QUALM QA system, 434
Queries
  evaluation in distillation, 492
  preprocessing, 483
  QA architectures and, 439
  searching unstructured sources, 443–445
  translating CLIR queries, 379
  translating MLIR queries, 384
Query answering distillation system
  document retrieval, 483–484
  overview of, 483
  planning stage, 487
  preprocessing queries, 483
  snippet filtering, 484
  snippet processing, 485–487
Query expansion
  applying to CLIR queries, 380
  for improving information retrieval, 376–377
  searching over unstructured sources, 445
Query generation, in QA architectures, 435
Query language, in CLIR, 365
Question analysis, in QA, 435, 440–443
Question answering (QA)
  answer scores, 450–453
  architectures, 435–437
  bibliography, 467–473
  candidate extraction from structured sources, 449–450
  candidate extraction from unstructured sources, 445–449
  case study, 455–460
  challenges in, 464–465
  crosslingual, 454–455
  evaluating answer correctness, 461–462
  evaluation tasks, 460–461
  introduction to and history of, 433–435
  IR compared with, 366
  performance metrics, 462–464
  question analysis, 440–443
  RTE applied to, 215
  searching over unstructured sources, 443–445
  source acquisition and preprocessing, 437–440
  summary, 465–467
Question mark (?), sentence segmentation markers, 30
Questions, in GALE distillation initiative, 475
Quotation marks (“”), sentence segmentation markers, 30
R
  summarization frameworks, 422
RandLM toolkit, for machine translation, 357
Random forest language models (RFLMs)
  modeling using morphological categories, 193
  tree-based modeling, 185–186
Ranks methods, for sentences, 407
RDF (Resource Description Framework), 450
Real-Time Translation Services (RTTS), 538–540
Realization stage, of summarization systems
  building a summarization system and, 421
  overview of, 400
Recall, IR evaluation measures, 388
Recall-Oriented Understudy for Gisting Evaluation (ROUGE)
  automatic evaluation of summarization, 415–418
  metrics in, 420
Recognizing textual entailment (RTE)
  alignment, 233–236
  analysis, 220
  answer scoring and, 464
  applications of, 214
  bibliography, 254–258
  case studies, 238–239
  challenge of, 212–213
  comparing constituents in, 222–224
  developing knowledge resources for, 249–251
  discourse commitments extraction case study, 239–240
  enrichment, 228–231
  evaluating performance of, 213–214
  framework for, 219
  general model for, 224–227
  graph generation, 231–232
  implementation of, 227
  improving analytics, 248–249
  improving evaluation, 251–252
  inference, 236–238
  introduction to, 209–210
  investing/applying to new problems, 249
  latent alignment inference, 247–248
  learning alignment independently of entailment, 244–245
  leveraging multiple alignments, 245
  limited dependency context for global similarity, 247
  logical representation and inference, 242–244
  machine translation, 217–218
  multiview representation, 220–222
  natural logic and, 245–246
  in non-English languages, 218–219
  PASCAL challenge, 451
  preprocessing, 227–228
  problem definition, 210–212
  QA and, 215, 433–434
  requirements for RTE framework, 219–220
  resources for, 252–253
  searching for relations, 215–217
  summary, 253–254
  Syntactic Semantic Tree Kernels (SSTKs), 246–247
  training, 238
  transformation-based approaches to, 241–242
  tree edit distance case study, 240–241
Recombination, machine translation and, 346
Recursive transition networks (RTNs), 150
Redundancy, in distillation
  detecting, 492–493
  overview of, 477–479
  reducing, 489–490
Redundancy, in IR, 488
Reduplication of words, limits of finite-state models, 17
Reference summaries, 412, 419
Regular expressions
  surface patterns for extracting candidate answers, 449
  in type-based candidate extraction, 446
Regular relations, finite-state transducers capturing and computing, 17
Related terms, in GALE distillation initiative, 475
Relation extraction systems
  classification approach, 312–313
  coreference resolution as, 311
  features of classification-based systems, 313–316
  kernel methods for, 319
  overview of, 310
  supervised and unsupervised, 317–319
Relational databases, 449
Relations
  bibliography, 327–330
  classifiers for, 316
  combining entity and relation detection, 320
  between constituents in RTE, 220
  detection in Rosetta Consortium distillation system, 480–482
  extracting, 310–313
  features of classification-based extractors, 313–316
  introduction to, 309–310
  kernel methods for extracting, 319
  recognition impacting searches, 444
  summary, 326–327
  supervised and unsupervised approaches to extracting, 317–319
  transitive closure of, 324–326
  types of, 311–312
Relationship questions, QA and, 433, 488
Relevance, feedback and query expansion, 376–377
Relevance, in distillation
  analysis of, 492–493
  detecting, 488–489
  examples of irrelevant answers, 477
  overview of, 477–479
  redundancy reduction and, 488–490
Relevance, in IR
  assessment, 387–388
  evaluation, 386
Remote operation, challenges in NLP aggregation, 524
Resource Description Framework (RDF), 450
Resources, for RTE
  developing knowledge resources, 249–251
  overview of, 252–253
Restricted domains, history of QA systems, 434
Result pooling, relevance assessment and, 387
Rewrite rules (in phonology and morphology), 17
RFLMs (Random forest language models)
  modeling using morphological categories, 193
  tree-based modeling, 185–186
Rhetorical structure theory (RST), applying to summarization, 401–404
RoboCup, for meaning representation, 149
Robust processing
  desired attributes of NLP aggregation, 526–527
  in GATE, 529
  in InfoSphere Streams, 531
  in UIMA, 529
Robust risk minimization (RRM), mention detection and, 287
Roget’s Thesaurus
  semantic parsing, 104
  word sense disambiguation, 106–107
Role extractors, classifiers for relation extraction, 316
Romanian
  approaches to subjectivity and sentiment analysis, 276–277
  corpus-based approach to subjectivity and sentiment analysis, 271–272
  cross-lingual projections, 275
  dictionary-based approach to subjectivity and sentiment analysis, 264–266, 270
  IR and, 390
  QA and, 461
  subjectivity and sentiment analysis, 259
  summarization and, 399
Romanization, transliteration of scripts to Latin (Roman) alphabet, 368
Rosetta Consortium system
  document and corpus preparation, 480–483
  indexing and, 483
  overview of, 479–480
  query answers and, 483–487
ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
  automatic evaluation of summarization, 415–418
  metrics in, 420
RRM (robust risk minimization), mention detection and, 287
RST (rhetorical structure theory), applying to summarization, 401–404
RTNs (recursive transition networks), 150
RTTS (Real-Time Translation Services), 538–540
Rule-based grammars, in speech recognition, 501–502
Rule-based sentence segmentation, 31–32
Rule-based systems
  dictionary-based approach to subjectivity and sentiment analysis, 270
  for meaning representation, 150
  statistical models compared with, 292
  subjectivity and sentiment analysis, 267
  word and phrase-level annotations in subjectivity and sentiment analysis, 263
  for word sense disambiguation, 105–109
Rules, functional morphology models and, 19
Russian
  language modeling using subword units, 192
  parsing issues related to morphology, 91
  unification-based models, 19
SALAAM algorithms, 114–115
SALSA project, for predicate-argument recognition, 122
Sanskrit
  ambiguity in, 11
  as fusional language, 8
  Zen toolkit for morphology of, 20
SAPT (semantically augmented parse tree), 151
Scalable entailment relation recognition (SERR), 215–217
SCGIS (Sequential conditional generalized iterative scaling), 289
Scores
  ranking answers in QA, 435, 450–453, 458–459
  ranking sentences, 407
  sentence relevance in distillation systems, 485–486
Scripts
  preprocessing best practices in IR, 371–372
  transliteration and direction of, 368
SCUs (summarization content units), in Pyramid method, 414–415
Search component, in QA architectures, 435
Searches
  broadening to overcome parsing errors, 144
  in mention detection, 289–291
  over unstructured sources in QA, 443–445
  QA architectures and, 439
  QA vs. IR, 433
  reducing search space using beam search, 290–291
  for relations, 215–217
SEE (Summary Evaluation Environment), 413
Seeds, unsupervised systems and, 112
Segmentation
  in aggregated NLP, 540
  sentence boundaries. See Sentence boundary detection
  topic boundaries. See Topic segmentation
Semantic concordance (SEMCOR) corpus, WordNet, 104
Semantic interpretation
  entity and event resolution, 100
  meaning representation, 101
  overview of, 98–99
  predicate-argument structure and, 100
  structural ambiguity and, 99
  word sense and, 99–100
Semantic parsing
  Air Travel Information System (ATIS), 148
  bibliography, 154–167
  Communicator program, 148–149
  corpora for, 104–105
  entity and event resolution, 100
  GeoQuery, 149
  introduction to, 97–98
  meaning representation, 101, 147–148
  as part of semantic interpretation, 98–99
  predicate-argument structure. See Predicate-argument structure
  resource availability for disambiguation of word sense, 104–105
  RoboCup, 149
  rule-based systems, 105–109, 150
  semi-supervised systems, 114–116
  software programs for, 116–117, 151
  structural ambiguity and, 99
  summary, 151
  supervised systems, 109–112, 150–151
  system paradigms, 101–102
  unsupervised systems, 112–114
  word sense and, 99–100, 102–105
Semantic role labeling (SRL). See also Predicate-argument structure
  challenges in RTE and, 212
  combining dependency parsing with, 132
  implementing RTE and, 227
  overcoming independence assumption, 137–138
  predicate-argument structure training, 447
  in Rosetta Consortium distillation system, 480
  in RTE, 221
  sentences as processing unit in, 29
  for shallow semantic parsing, 118
Semantically augmented parse tree (SAPT), 151
Semantics
  defined, 97
  explicit semantic analysis (ESA), 382
  features of classification-based relation extraction systems, 315–316
  finding entity relations, 310
  latent semantic indexing (LSI), 381
  QA and, 439–440
  structural matching and, 446–447
  topic detection and, 33
SEMCOR (semantic concordance) corpus, WordNet, 104
SEMEVAL, 263
Semi-supervised systems, for word sense disambiguation, 114–116
Semistructured data, candidate extraction from, 449–450
SemKer system, applying syntactic tree kernels to RTE, 246
Sense induction, unsupervised systems and, 112
SENSEVAL, for word sense disambiguation, 105–107
Sentence boundary detection
  comparing segmentation methods, 40–41
  detecting probable sentence or topic boundaries, 33–34
  discourse features, 44
  discriminative local classification method for, 36–38
  discriminative sequence classification method for, 38–39
  extensions for global modeling, 40
  features of segmentation methods, 41–42
  generative sequence classification method, 34–36
  hybrid methods, 39–40
  implementing RTE and, 227
  introduction to, 29
  lexical features, 42–43
  overview of, 30–32
  performance of, 41
  processing stages of, 48
  prosodic features, 45–48
  speech-related features, 45
  syntactic features, 43–44
  typographical and structural features, 44–45
Sentence-level annotations, for subjectivity and sentiment analysis
  corpus-based approach, 271–272
  dictionary-based approach, 270–271
  overview of, 269
Sentence splitters, tools for building summarization systems, 423
Sentences
  coherence of sentence-sentence connections, 402
  extracting within-sentence relations, 310
  methods for learning rank of, 407
  parasitic gap recovery, 249
  processing for event extraction, 323
  relevance in distillation systems, 485–486
  units in sentence segmentation, 33
  unsupervised approaches to selection, 489
Sentential complement, features of supervised systems, 111
Sentential forms, parsing and, 71–72
Sentiment analysis. See Subjectivity and sentiment analysis
SentiWordNet, 262
Sequential conditional generalized iterative scaling (SCGIS), 289
SERR (scalable entailment relation recognition), 215–217
Shallow semantic parsing
  coverage in semantic parsing, 102
  overview of, 98
  semantic role labeling for, 118
  structural matching and, 447
Shalmaneser program, for semantic role labeling, 147
Shift-reduce parsing, 72–73
SHRDLU QA system, 434
SIGHAN, Chinese word segmentation, 194
SIGLEX (Special Group on LEXicon), 103
Similarity enablement, relation extraction and, 310
Slovene unification-based model, 19
SLU (statistical language understanding)
  continuous improvement cycle in dialog systems, 512–513
  generations of dialog systems, 511–512
Smoothing techniques
  Laplace smoothing, 174
  machine translation and, 345
  n-gram approximation, 172–173
SMT. See Statistical machine translation (SMT)
Snippets, in distillation
  crosslingual distillation and, 491
  evaluation, 492–493
  filtering, 484
  main and supporting, 477–478
  multimodal distillation and, 490
  planning and, 487
  processing, 485–487
Snowball Stemmer, 392
Software programs
  for meaning representation, 151
  for predicate-argument structure, 147
  for semantic parsing, 116–117
Sort expansion, machine translation phrase decoding, 347–348
Sources, in QA
  acquiring, 437–440
  candidate extraction from structured, 449–450
  candidate extraction from unstructured, 445–449
  searching over unstructured, 443–445
Spanish
  code switching example, 31, 195–196
  corpus-based approach to subjectivity and sentiment analysis, 272
  discriminative approach to parsing, 91–92
  GeoQuery corpus translated into, 149
  IR and, 390–391
  localization of spoken dialog systems, 513–514, 517–520
  mention detection experiments, 294–296
  morphologies of, 20
  polarity analysis of words and phrases, 269
  QA and, 461
  resources for semantic parsing, 122
  RTE in, 218
  semantic parser for, 151
  summarization and, 398
  TAC and, 424
  TALES case study, 538
  WordNet and, 109
Special Group on LEXicon (SIGLEX), 103
Speech
  discourse features in topic or sentence segmentation, 44
  lexical features in sentence segmentation, 42
  prosodic features for sentence or topic segmentation, 45–48
  sentence segmentation accuracy, 41
Speech generation
  dialog manager directing, 499–500
  spoken dialog systems and, 503–504
Speech recognition
  anchored speech recognition, 490
  automatic speech recognition (ASR), 29, 31
  language modeling using subword units, 192
  MaxEnt model applied to, 181–183
  Morfessor package applied to, 191–192
  neural network language models applied to, 188
  rule-based grammars in, 501–502
  spoken dialog systems and, 500–503
Speech Recognition Grammar Specification (SRGS), 501–502
Speech-to-text (STT)
  data reorganization and, 535–536
  in GALE IOD, 532–533
  NLP and, 523–524
  in RTTS, 538
Split-head concept, in parsing, 78
Spoken dialog systems
  architecture of, 505
  bibliography, 521–522
  call-flow localization, 514
  continuous improvement cycle in, 512–513
  dialog manager, 504–505
  forms of dialogs, 509–510
  functional diagram of, 499–500
  generations of, 510–512
  introduction to, 499
  localization of, 513–514
  localization of grammars, 516–517
  natural language call routing, 510
  prompt localization, 514–516
  speech generation, 503–504
  speech recognition and understanding, 500–503
  summary, 520–521
  testing, 519–520
  training, 517–519
  transcription and annotation of utterances, 513
  voice user interface (VUI), 505–509
Spoken languages, vs. written languages and language models, 194–195
SRGS (Speech Recognition Grammar Specification), 501–502
SRILM (Stanford Research Institute Language Modeling)
  overview of, 184
  SRILM toolkit for machine translation, 357
SRL. See Semantic role labeling (SRL)
SSI (Structural semantic interconnections) algorithm, 107–109
SSTKs (Syntactic Semantic Tree Kernels), 246–247
Stacks, of hypotheses in machine translation, 346
Stanford Parser, dependency parsing with, 456
Stanford Research Institute Language Modeling (SRILM)
  overview of, 184
  SRILM toolkit for machine translation, 357
START QA system, 435–436
Static knowledge, in textual entailment, 210
Statistical language models
  n-gram approximation, 170–171
  overview of, 169
  rule-based systems compared with, 292
  spoken vs. written languages and, 194–195
  translation with, 331
Statistical language understanding (SLU)
  continuous improvement cycle in dialog systems, 512–513
  generations of dialog systems, 511–512
Statistical machine translation (SMT)
  applying to CLIR, 381
  cross-language mention propagation, 293–294
  evaluating co-occurrence of words, 337–338
  mention detection experiments, 293–294
Stemmers
  mapping terms to stems, 370
  preprocessing best practices in IR, 371
  Snowball Stemmer, 392
Stems, mapping terms to, 370
Stop-words, removing in normalization, 371
Structural ambiguity, 99
Structural features
  of classification-based relation extraction systems, 314
  sentence and topic segmentation, 44–45
Structural matching, for candidate extraction in QA, 446–448
Structural semantic interconnections (SSI) algorithm, 107–109
Structure
  of documents. See Document structure
  of words. See Word structure
Structured data
  candidate extraction from structured sources, 449–450
  candidate extraction from unstructured sources, 445–449
Structured knowledge, 434
Structured language model, 181
Structured queries, 444
STT (Speech-to-text). See Speech-to-text (STT)
Subcategorization
  in PSG, 125
  in TAG, 130
  for verb sense disambiguation, 112
Subclasses, of relations, 311
Subject/object presence, features of supervised systems, 111
Subject, object, verb (SOV) word order, 356
Subjectivity, 260
Subjectivity analysis, 260
Subjectivity and sentiment analysis
  applied to English, 262
  bibliography, 278–281
  comparing approaches to, 276–277
  corpora for, 262–263
  definitions, 260–261
  document-level annotations, 272–274
  introduction to, 259–260
  lexicons and, 262
  ranking approaches to, 274–276
  sentence-level annotations, 269, 270–272
  summary, 277
  tools for, 263–264
  word and phrase level annotations, 264–269
Substitution, linguistic supports for cohesion, 401
Subword units, selecting for language models, 191–192
SUMMA
  history of summarization systems, 399
  for multilingual automatic summarization, 411
  summarization frameworks, 423
SUMMARIST, 398
Summarization, automatic. See Automatic summarization
Summarization content units (SCUs), in Pyramid method, 414–415
Summary Evaluation Environment (SEE), 413
SummBank
  history of summarization systems, 399
  summarization data set, 425
Supertags, in TAG, 130
Supervised systems
  for meaning representation, 150–151
  for relation extraction, 317–319
  for sentence segmentation, 37
  for word sense disambiguation, 109–112
Support vector machines (SVMs)
  classifiers for relation extraction, 316–317
  corpus-based approach to subjectivity and sentiment analysis, 272, 274
  mention detection and, 287
  methods for sentence or topic segmentation, 37–39
  training and test software, 135–137
  unsupervised approaches to machine learning, 342
Surface-based features, in automatic summarization, 400–401
Surface patterns, for candidate extraction in QA, 448–449
Surface strings
  input words in input/output language relations, 17
  unification-based morphology and, 18
SVMs. See Support vector machines (SVMs)
SVO (subject, verb, object) word order, 356
Swedish
  IR and, 390–391
  morphologies of, 20
  semantic parsing and, 122
  summarization and, 399
SwiRL program, for semantic role labeling, 147
Syllabic scripts, 371
Symmetrization, word alignment and, 340–341
Syncretism, 8
Synonyms
  answers in QA systems and, 442
  machine translation metrics and, 336
Syntactic features
  of classification-based relation extraction systems, 315
  of coreference models, 301
  of mention detection system, 292
  in sentence and topic segmentation, 43–44
Syntactic models, for machine translation, 352–354
Syntactic pattern, in PSG, 126
Syntactic relations, features of supervised systems, 111
Syntactic representation, in predicate-argument structure, 123–124
Syntactic roles, in TAG, 130
Syntactic Semantic Tree Kernels (SSTKs), 246–247
Syntactic Structures (Chomsky), 98–99
Syntax
  ambiguity resolution, 80
  bibliography, 92–95
  compared with morphology and phonology and orthography, 3
  context-free grammars (CFGs) and, 59–61
  dependency graphs for analysis of, 63–67
  discriminative parsing models, 84–87
  of documents in IR, 367–368
  generative parsing models, 83–84
  introduction to, 57
  minimum spanning trees and dependency parsing, 79–80
  morphology and, 90–92
  parsing algorithms for, 70–72
  parsing natural language, 57–59
  phrase structure trees for analysis of, 67–70
  probabilistic context-free grammars, 80–83
  QA and, 439–440
  shift-reduce parsing, 72–73
  structural matching and, 446–447
  summary, 92
  tokenization, case, and encoding and, 87–89
  treebanks data-driven approach to, 61–63
  word segmentation and, 89–90
  worst-case parsing algorithm for CFGs, 74–79
Syntax-based language models, 180–181
Synthetic languages, morphological typology and, 7
System architectures
  for distillation, 488
  for semantic parsing, 101–102
System paradigms, for semantic parsing, 101–102
Systran’s Babelfish program, 331
TAC. See Text Analysis Conferences (TAC)
TAG (Tree-Adjoining Grammar), 130
TALES (Translingual Automated Language Exploitation System), 538
Tamil
  as agglutinative language, 7
  IR and, 390
Task-based evaluation, of translation, 334
TBL (transformation-based learning), for sentence segmentation, 37
TDT (Topic Detection and Tracking) program, 32–33, 42, 425–426
Telugu, 390
Templates, in GALE distillation initiative, 475
Temporal cue words, in PSG, 127–128
TER (Translation-error rate), 337
Term-document matrix, document representation in monolingual IR, 373
Term frequency-inverse document frequency (TF-IDF)
  multilingual automatic summarization and, 411
  QA scoring and, 450–451
  unsupervised approaches to sentence selection, 489
Term frequency (TF)
  TF document model, 373
  unsupervised approaches to sentence selection, 489
Terms
  applying RTE to unknown, 217
  early approaches to summarization and, 400
  in GALE distillation initiative, 475
  mapping term vectors to topic vectors, 381
  mapping to lemmas, 370
  posting lists, 373–374
Terrier IR framework, 392
Text Analysis Conferences (TAC)
  competitions related to summarization, 424
  data sets related to summarization, 425
  evaluation of QA systems, 460–464
  history of QA systems, 434
  Knowledge Base Population (KBP), 481–482
  learning summarization, 408
Text REtrieval Conference (TREC)
  data sets for evaluating IR systems, 389–390
  evaluation of QA systems, 460–464
  history of QA systems, 434
  redundancy reduction, 489
Text Tiling method (Hearst)
  sentence segmentation, 42
  topic segmentation, 37–38
Text-to-speech (TTS)
  architecture of spoken dialog systems, 505
  history of dialog managers, 504
  localization of grammars and, 514
  in RTTS, 538
  speech generation, 503–504
TextRank, graphical approaches to automatic summarization, 404–406
Textual entailment. See also Recognizing textual entailment (RTE)
  contradiction in, 211
  defined, 210
  entailment pairs, 210
Textual inference
  implementing, 236–238
  latent alignment inference, 247–248
  modeling, 226–227
  NLP and, 209
  RTE and, 242–244
TF-IDF (term frequency-inverse document frequency)
  multilingual automatic summarization and, 411
  QA scoring and, 450–451
  unsupervised approaches to sentence selection, 489
TF (term frequency)
  TF document model, 373
  unsupervised approaches to sentence selection, 489
Thai
  as isolating or analytic language, 7
  word segmentation in, 4–5
Thot program, for machine translation, 423
Tika (Content Analysis Toolkit), for preprocessing IR documents, 392
TinySVM software, for SVM training and testing, 135–136
Token streams, 372–373
Tokenization
  Arabic, 12
  character n-gram models and, 370
  multilingual automatic summarization and, 410
  normalization and, 370–371
  parsing issues related to, 87–88
  phrase indices and, 369–370
  in Rosetta Consortium distillation system, 480
  word segmentation and, 369
Tokenizers, tools for building summarization systems, 423
Tokens
  lexical features in sentence segmentation, 42–43
  mapping between scripts (normalization), 370–371
  MLIR indexes and, 384
  output from information retrieval, 366
  processing stages of segmentation tasks, 48
  in sentence segmentation, 30
  translating MLIR queries, 384
  in word structure, 4–5
Top-k models, for monolingual information retrieval, 374
Topic-dependent language model adaptation, 176
Topic Detection and Tracking (TDT) program, 32–33, 42, 425–426
Topic or domain, features of supervised systems, 111
Topic segmentation
  comparing segmentation methods, 40–41
  discourse features, 44
  discriminative local classification method, 36–38
  discriminative sequence classification method, 38–39
  extensions for global modeling, 40
  features of, 41–42
  generative sequence classification method, 34–36
  hybrid methods, 39–40
  introduction to, 29
  lexical features, 42–43
  methods for detecting probable topic boundaries, 33–34
  overview of, 32–33
  performance of, 41
  processing stages of segmentation tasks, 48
  prosodic features, 45–48
  speech-related features, 45
  syntactic features, 43–44
  typographical and structural features, 44–45
Topics, mapping term vectors to topic vectors, 381
Traces nodes, Treebanks, 120–121
Training
  issues related to machine translation (MT), 197
  minimum error rate training (MERT), 349
  phrase-based models, 344–345
  predicate-argument structure, 140–141, 447
  recognizing textual entailment (RTE), 238
  in RTE, 238
  spoken dialog systems, 517–519
  stage of RTE model, 238
  support vector machines (SVMs), 135–137
Transcription
  of utterances based on rule-based grammars, 502–503
  of utterances in spoken dialog systems, 513
Transducers, finite-state, 16–17
Transformation-based approaches, applying to RTE, 241–242
Transformation-based learning (TBL), for sentence segmentation, 37
Transformation stage, of summarization systems, 400, 421
Transitive closure, of relations, 324–326
Translation
  human assessment of word meaning, 333–334
  by machines. See Machine translation (MT)
  translation-based approach to CLIR, 378–380
Translation-error rate (TER), 337
Translingual Automated Language Exploitation System (TALES), 538
Translingual information retrieval, 491
Translingual summarization. See also Automatic summarization, 398
Transliteration, mapping text between scripts, 368
TREC. See Text REtrieval Conference (TREC)
trec-eval, evaluation of IR systems, 393
Tree-Adjoining Grammar (TAG), 130
Tree-based language models, 185–186
Tree-based models, for MT
  chart decoding, 351–352
  hierarchical phrase-based models, 350–351
  linguistic choices and, 354
  overview of, 350
  syntactic models, 352–354
Tree edit distance, applying to RTE, 240–241
Treebanks
  data-driven approach to syntactic analysis, 61–63
  dependency graphs in syntax analysis, 63–67
  phrase structure trees in syntax analysis, 67–70
  traces nodes marked as arguments in PropBank, 120–121
  worst-case parsing algorithm for CFGs, 77
Trigger models, dynamic self-adapting language models, 176–177
Triggers
  consistency of, 323
  finding event triggers, 321–322
Trigrams, 502–503
Troponymy, 310
Tuning sets, 348
Turkish
  dependency graphs in syntax analysis, 62, 65
  GeoQuery corpus translated into, 149
  language modeling for morphologically rich languages, 189–191
  language modeling using morphological categories, 192–193
  machine translation and, 354
  morphological richness of, 355
  parsing issues related to morphology, 90–91
  semantic parser for, 151
  syntactic features used in sentence and topic segmentation, 43
Type-based candidate extraction, in QA, 446, 451
Type classifier
  answers in QA systems, 440–442
  in relation extraction, 313
Type system, GALE Type System (GTS), 534–535
Typed feature structures, unification-based morphology and, 18–19
Typographical features, sentence and topic segmentation, 44–45
Typology, morphological, 7–8
UCC (UIMA Component Container), 537
UIMA. See Unstructured Information Management Architecture (UIMA)
Understanding, spoken dialog systems and, 500–503
Unicode (UTF-8/UTF-16)
  encoding and script, 368
  parsing issues related to encoding systems, 89
Unification-based morphology, 18–19
Unigram models (Yamron), 35–36
Uninflectedness, homonyms and, 12
Units of thought, interlingual document representations, 381
Unknown terms, applying RTE to, 217
Unknown word problem, 8, 13–15
Unstructured data, candidate extraction from, 445–449
Unstructured Information Management Architecture (UIMA)
  attributes of, 528–529
  GALE IOD and, 535, 537
  overview of, 527–528
  RTTS and, 538–540
  sample code, 542–547
  summarization frameworks, 422
  UIMA Component Container (UCC), 537
Unstructured text, history of QA systems and, 434
Unsupervised adaptation, language model adaptation and, 177
Unsupervised systems
  machine learning, 342
  relation extraction, 317–319
  sentence selection, 489
  subjectivity and sentiment analysis, 264
  word sense disambiguation, 112–114
Update summarization, in automatic summarization, 397
Uppercase (capitalization), sentence segmentation markers, 30
UTF-8/UTF-16 (Unicode)
  encoding and script, 368
  parsing issues related to encoding systems, 89
Utterances, in spoken dialog systems
  rule-based approach to transcription and annotation, 502–503
  transcription and annotation of, 513
Variable-length language models, 179
Vector space model
  document representation in monolingual IR, 372–373
  for document retrieval, 374–375
Verb clustering, in PSG, 125
Verb sense, in PSG, 126–127
Verb, subject, object (VSO) word order, 356
VerbNet, resources for predicate-argument recognition, 121
Verbs
  features of predicate-argument structures, 145
  relation extraction and, 310
Vietnamese
  as isolating or analytic language, 7
  NER task in, 287
Views
  in GALE IOD, 534
  RTE systems, 220
Vital few (80/20 rule), 14
Viterbi algorithm
  applied to Rosetta Consortium distillation system, 480
  methods for sentence or topic segmentation, 39–40
  searching for mentions, 291
Vocabulary
  indexing IR output, 366
  language models and, 169
  in morphologically rich languages, 190
  productivity/creativity and, 14
  topic segmentation methods, 38
Voice Extensible Markup Language. See VoiceXML (Voice Extensible Markup Language)
Voice feature, in PSG, 124
Voice of sentence, features of supervised systems, 111
Voice quality, prosodic modeling and, 47
Voice user interface (VUI)
  call-flow, 505–506
  dialog module (DM) of, 507–508
  GetService process of, 506–507
  grammars of, 508–509
  VUI completeness principle, 509–510
VoiceXML (Voice Extensible Markup Language)
  architecture of spoken dialog systems, 505
  generations of dialog systems, 511–512
  history of dialog managers, 504
VUI. See Voice user interface (VUI)
W3C (World Wide Web Consortium), 504
WASP program, for rule-based semantic parsing systems, 151
Web 2.0, accelerating need for crosslingual retrieval, 365
WER (word-error rate), machine translation metrics and, 336–337
Whitespace
  preprocessing best practices in IR, 371
  in word separation, 369
Wikipedia
  answer scores in QA and, 452
  for automatic word sense disambiguation, 115–116
  crosslingual question answering and, 455
  as example of explicit semantic analysis, 382
  predominance of English in, 438
WikiRelate! program, for word sense disambiguation, 117
Wiktionary
  crosslingual question answering and, 455
  as example of explicit semantic analysis, 382
Witten-Bell smoothing technique, in language model estimation, 172
Wolfram Alpha QA system, 435
Word alignment, cross-language mention propagation, 293
Word alignment, in MT
  alignment models, 340
  Berkeley word aligner, 357
  co-occurrence of words between languages, 337–338
  EM algorithm, 339–340
  IBM Model 1, 338–339
  as machine learning problem, 341–343
  overview of, 337
  symmetrization, 340–341
Word boundary detection, 227
Word-error rate (WER), machine translation metrics and, 336–337
Word lists. See Dictionary-based morphology
Word meaning
  automatic evaluation, 334–335
  evaluation of, 332
  human assessment of, 332–334
Word order, 356
Word/phrase-level annotations, for subjectivity and sentiment analysis
  corpus-based approach, 267–269
  dictionary-based approach, 264–267
  overview of, 264
Word segmentation
  in Chinese, Japanese, Thai, and Korean writing systems, 4–5
  languages lacking, 193–194
  phrase indices based on, 369–370
  preprocessing best practices in IR, 371
  syntax and, 89–90
  tokenization and, 369
Word sense
  classifying according to subjectivity and polarity, 261
  disambiguation, 105, 152–153
  overview of, 102–104
  resources, 104–105
  rule-based systems, 105–109
  semantic interpretation and, 99–100
  semi-supervised systems, 114–116
  software programs for, 116–117
  supervised systems, 109–112
  unsupervised systems, 112–114
Word sequence, 169
Word structure
  ambiguity in interpretation of expressions, 10–13
  automated morphology (morphology induction), 21
  bibliography, 22–28
  dictionary-based morphology, 15–16
  finite-state morphology, 16–18
  functional morphology, 19–21
  introduction to, 3–4
  irregularity in linguistic models, 8–10
  issues and challenges, 8
  lexemes, 5
  morphemes, 5–7
  morphological models, 15
  morphological typology, 7–8
  productivity/creativity and the unknown word problem, 13–15
  summary, 22
  tokens and, 4–5
  unification-based morphology, 18–19
  units in sentence segmentation, 33
WordNet
  classifying word sense according to subjectivity and polarity, 261
  eXtended WordNet (XWN), 451
  features of supervised systems, 112
  hierarchical concept information in, 109
  QA answer scores and, 452
  as resource for domain-specific information, 122
  RTE applied to machine translation, 218
  SEMCOR (semantic concordance) corpus, 104–105
  subjectivity and sentiment analysis lexicons, 262
  synonyms, 336
  word sense disambiguation and, 117
World Wide Web Consortium (W3C), 504
Written languages, vs. spoken languages in language modeling, 194–195
WSJ, 147
XDC (Crossdocument coreference), in Rosetta Consortium distillation system, 482–483
Xerox Finite-State Tool (XFST), 16
XWN (eXtended WordNet), 451
YamCha software, for SVM training and testing, 135–136
Yarowsky algorithm, for word sense disambiguation, 114–116
Z-score normalization, for MLIR aggregation, 385
Zen toolkit for morphology, applying to Sanskrit, 20
Zero anaphora resolution, 249, 444
