Natural Language Processing - 2
at ibmpressbooks.com/ibmregister
Upon registration, we will send you electronic sample chapters from two of our popular
IBM Press books. In addition, you will be automatically entered into a monthly drawing
for a free IBM Press book.
Contact us
If you are interested in writing a book or reviewing manuscripts prior to publication,
please write to us at:
Editorial Director, IBM Press
c/o Pearson Education
800 East 96th Street
Indianapolis, IN 46240
e-mail: [email protected]
Developing Quality Technical Information, Second Edition
By Gretchen Hargis, Michelle Carey, Ann Kilty Hernandez, Polly Hughes, Deirdre Longo, Shannon Rouiller, and Elizabeth Wilde
ISBN: 0-13-147749-8
Direct from IBM’s own documentation experts, this is the definitive guide to developing outstanding technical documentation—for the Web and for print. Using extensive before-and-after examples, illustrations, and checklists, the authors show exactly how to create documentation that’s easy to find, understand, and use. This edition includes extensive new coverage of topic-based information, simplifying search and retrievability, internationalization, visual effectiveness, and much more.

Data Integration Blueprint and Modeling: Techniques for a Scalable and Sustainable Architecture
By Anthony David Giordano
ISBN: 0-13-708493-5
Making Data Integration Work: How to Systematically Reduce Cost, Improve Quality, and Enhance Effectiveness
This book presents the solution: a clear, consistent approach to defining, designing, and building data integration components to reduce cost, simplify management, enhance quality, and improve effectiveness. Leading IBM data management expert Tony Giordano brings together best practices for architecture, design, and methodology and shows how to do the disciplined work of getting data integration right.
Mr. Giordano begins with an overview of the “patterns” of data integration, showing how to build blueprints that smoothly handle both operational and analytic data integration. Next, he walks through the entire project lifecycle, explaining each phase, activity, task, and deliverable through a complete case study. Finally, he shows how to integrate data integration with other information management disciplines, from data governance to metadata. The book’s appendices bring together key principles, detailed models, and a complete data integration glossary.
Visit ibmpressbooks.com
for all product information
Related Books of Interest

Do It Wrong Quickly: How the Web Changes the Old Marketing Rules
Moran
ISBN: 0-13-225596-0

Search Engine Marketing, Inc.
By Mike Moran and Bill Hunt
ISBN: 0-13-606868-5

Get Bold: Using Social Media to Create a New Type of Social Business
Carter
ISBN: 0-13-261831-1
IBM Press
Pearson plc
Upper Saddle River, NJ • Boston • Indianapolis • San Francisco
New York • Toronto • Montreal • London • Munich • Paris • Madrid
Capetown • Sydney • Tokyo • Singapore • Mexico City
ibmpressbooks.com
The authors and publisher have taken care in the preparation of this book, but make no expressed or
implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed
for incidental or consequential damages in connection with or arising out of the use of the information or
programs contained herein.
© Copyright 2012 by International Business Machines Corporation. All rights reserved.
Note to U.S. Government Users: Documentation related to restricted rights. Use, duplication, or disclosure
is subject to restrictions set forth in GSA ADP Schedule Contract with IBM Corporation.
IBM Press offers excellent discounts on this book when ordered in quantity for bulk purchases or special
sales, which may include electronic versions and/or custom covers and content particular to your business,
training goals, marketing focus, and branding interests. For more information, please contact
U.S. Corporate and Government Sales
1-800-382-3419
[email protected]
For sales outside the United States, please contact
International Sales
[email protected]
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as
trademarks. Where those designations appear in this book, and the publisher was aware of a trademark
claim, the designations have been printed with initial capital letters or in all capitals.
The following terms are trademarks or registered trademarks of International Business Machines Corpora-
tion in the United States, other countries, or both: IBM, the IBM Press logo, IBM Watson, ThinkPlace, WebSphere, and InfoSphere. A current list of IBM trademarks is available on the web at “copyright and trademark information” at www.ibm.com/legal/copytrade.shtml. Microsoft, Windows, Windows NT, and
the Windows logo are trademarks of the Microsoft Corporation in the United States, other countries, or
both. Java and all Java-based trademarks and logos are trademarks of Oracle and/or its affiliates. Linux
is a registered trademark of Linus Torvalds in the United States, other countries, or both. Other company,
product, or service names may be trademarks or service marks of others.
All rights reserved. This publication is protected by copyright, and permission must be obtained from
the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any
form or by any means, electronic, mechanical, photocopying, recording, or likewise. To obtain permission
to use material from this work, please submit a written request to Pearson Education, Inc., Permissions
Department, One Lake Street, Upper Saddle River, New Jersey 07458, or you may fax your request to (201)
236-3290.
ISBN-13: 978-0-13-715144-8
ISBN-10: 0-13-715144-6
Text printed in the United States on recycled paper at Courier in Westford, Massachusetts.
First printing, May 2012
I dedicate this book to
my mother Rita, my brother Robert, my sister-in-law Judi,
my nephew Wolfie, and my niece Freya—Bikels all.
I also dedicate it to Science.
DMB
Preface xxi
Acknowledgments xxv
Part I In Theory 1
Chapter 1 Finding the Structure of Words 3
1.1 Words and Their Components 4
1.1.1 Tokens 4
1.1.2 Lexemes 5
1.1.3 Morphemes 5
1.1.4 Typology 7
1.2 Issues and Challenges 8
1.2.1 Irregularity 8
1.2.2 Ambiguity 10
1.2.3 Productivity 13
1.3 Morphological Models 15
1.3.1 Dictionary Lookup 15
1.3.2 Finite-State Morphology 16
1.3.3 Unification-Based Morphology 18
1.3.4 Functional Morphology 19
1.3.5 Morphology Induction 21
1.4 Summary 22
Chapter 3 Syntax 57
3.1 Parsing Natural Language 57
3.2 Treebanks: A Data-Driven Approach to Syntax 59
3.3 Representation of Syntactic Structure 63
3.3.1 Syntax Analysis Using Dependency Graphs 63
3.3.2 Syntax Analysis Using Phrase Structure Trees 67
3.4 Parsing Algorithms 70
3.4.1 Shift-Reduce Parsing 72
3.4.2 Hypergraphs and Chart Parsing 74
3.4.3 Minimum Spanning Trees and Dependency Parsing 79
3.5 Models for Ambiguity Resolution in Parsing 80
3.5.1 Probabilistic Context-Free Grammars 80
3.5.2 Generative Models for Parsing 83
3.5.3 Discriminative Models for Parsing 84
3.6 Multilingual Issues: What Is a Token? 87
3.6.1 Tokenization, Case, and Encoding 87
3.6.2 Word Segmentation 89
3.6.3 Morphology 90
3.7 Summary 92
Index 551
Preface
Almost everyone on the planet, it seems, has been touched in some way by advances in
information technology and the proliferation of the Internet. Recently, multimedia infor-
mation sources have become increasingly popular. Nevertheless, the sheer volume of raw
natural language text keeps increasing, and this text is being generated in all the major
languages on Earth. For example, the English Wikipedia reports that 101 language-specific
Wikipedias exist with at least 10,000 articles each. There is therefore a pressing need for
countries, companies, and individuals to analyze this massive amount of text, translate it,
and synthesize and distill it.
Previously, to build robust and accurate multilingual natural language processing (NLP)
applications, a researcher or developer had to consult several reference books and dozens,
if not hundreds, of journal and conference papers. Our aim for this book is to provide a
“one-stop shop” that offers all the requisite background and practical advice for building
such applications. Although it is quite a tall order, we hope that, at a minimum, you find
this book a useful resource.
In the last two decades, NLP researchers have developed exciting algorithms for process-
ing large amounts of text in many different languages. By far, the dominant approach has
been to build a statistical model that can learn from examples. In this way, a model can be
robust to changes in the type of text and even the language of text on which it operates.
With the right design choices, the same model can be trained to work in a new domain or
new language simply by providing new examples in that domain. This approach also obvi-
ates the need for researchers to lay out, in a painstaking fashion, all the rules that govern
the problem at hand and the manner in which those rules must be combined. Rather, a sta-
tistical system typically allows for researchers to provide an abstract expression of possible
features of the input, where the relative importance of those features can be learned during
the training phase and can be applied to new text during the decoding, or inference, phase.
The field of statistical NLP is rapidly changing. Part of the change is due to the field’s
growth. For example, one of the main conferences in the field is that of the Association for Computational Linguistics, where conference attendance has doubled in the last five years.
Also, the share of NLP papers in the IEEE speech and language processing conferences and
journals more than doubled in the last decade; IEEE constitutes one of the world’s largest
professional associations for the advancement of technology. Not only are NLP researchers
making inherent progress on the various subproblems of the field, but NLP continues to ben-
efit (and borrow) heavily from progress in the machine learning community and linguistics
alike. This book devotes some attention to cutting-edge algorithms and techniques, but its
primary purpose is to be a thorough explication of best practices in the field. Furthermore,
every chapter describes how the techniques discussed apply in a multilingual setting.
This book is divided into two parts. Part I, In Theory, includes the first seven chapters
and lays out the various core NLP problems and algorithms to attack those problems. The
first three chapters focus on finding structure in language at various levels of granularity.
Chapter 1 introduces the important concept of morphology, the study of the structure of
words, and ways to process the diverse array of morphologies present in the world’s lan-
guages. Chapter 2 discusses the methods by which documents may be decomposed into
more manageable parts, such as sentences and larger units related by topic. Finally, in this
initial trio of chapters, Chapter 3 investigates the various methods of uncovering a sentence’s
internal structure, or syntax. Syntax has long been a dominant area of research in linguistics,
and that dominance has been mirrored in the field of NLP as well. The dominance, in part,
stems from the fact that the structure of a sentence bears relation to the sentence’s meaning,
so uncovering syntactic structure can serve as a first step toward a full “understanding” of
a sentence.
Finding a structured meaning representation for a sentence, or for some other unit of
text, is often called semantic parsing, which is the concern of Chapter 4. That chapter covers,
inter alia, a related subproblem that has garnered much attention in recent years known
as semantic role labeling, which attempts to find the syntactic phrases that constitute the
arguments to some verb or predicate. By identifying and classifying a verb’s arguments,
we come one step closer to producing a logical form for a sentence, which is one way to
represent a sentence’s meaning in such a way as to be readily processed by machine, using
the rich array of tools available from logic that mankind has been developing since ancient
times.
But what if we do not want or need the deep syntactico-semantic structure that seman-
tic parsing would provide? What if our problem is simply to decide which among many
candidate sentences is the most likely sentence a human would write or speak? One way to
do so would be to develop a model that could score each sentence according to its gram-
maticality and pick the sentence with the highest score. The problem of producing a score
or probability estimate for a sequence of word tokens is known as language modeling and is
the subject of Chapter 5.
Representing meaning and judging a sentence’s grammaticality are only two of many
possible first steps toward processing language. Moving further toward some sense of under-
standing, we might wish to have an algorithm make inferences about facts expressed in
a piece of text. For example, we might want to know if a fact mentioned in one sentence
is entailed by some previous sentence in a document. This sort of inference is known as
recognizing textual entailment and is the subject of Chapter 6.
Finding which facts or statements are entailed by others is clearly important to the
automatic understanding of text, but there is also the nature of those statements. Under-
standing which statements are subjective and the polarity of the opinion expressed is the
subject matter of Chapter 7. Given how often people express opinions, this is clearly an
important problem area, all the more so in an age when social networks are fast becoming
the dominant form of person-to-person communication on the Internet. This chapter rounds
out Part I of our book.
Part II, In Practice, takes the various core areas of NLP described in Part I and explains
how to apply them to the diverse array of real-world NLP applications. Engineering is often
about trade-offs, say, between time and space, and so the chapters in this applied part of our
book explore the trade-offs in making various algorithmic and design choices when building
a robust, multilingual NLP application.
Chapter 8 describes ways to identify and classify named entities and other mentions
of those entities in text, as well as methods to identify when two or more entity mentions
corefer. These two problems are typically known as mention detection and coreference res-
olution; they are two of the core parts of a larger application area known as information
extraction.
Chapter 9 continues the information extraction discussion, exploring techniques for find-
ing out how two entities are related to each other, known as relation extraction, and identi-
fying and classifying events, or event extraction. An event, in this case, is when something
happens involving multiple entities, and we would like a machine to uncover who the par-
ticipants are and what their roles are. In this way, event extraction is closely related to the
core NLP problem of semantic role labeling.
Chapter 10 describes one of the oldest problems in the field, and one of the few that
is an inherently multilingual NLP problem: machine translation, or MT. Automatically
translating from one language to another has long been a holy grail of NLP research, and in recent years the community has developed techniques, and gained access to hardware, that make MT a practical reality, reaping rewards after decades of effort.
It is one thing to translate text, but how do we make sense of all the text out there
in seemingly limitless quantity? Chapters 8 and 9 make some headway in this regard by
helping us automatically produce structured records of information in text. Another way to
tackle the quantity problem is to narrow down the scope by finding the few documents,
or subparts of documents, that are relevant based on a search query. This problem is
known as information retrieval and is the subject of Chapter 11. In many ways, com-
mercial search engines such as Google are large-scale information retrieval systems. Given
the popularity of search engines, this is clearly an important NLP problem—all the more so given the number of corpora that are not public and therefore not searchable by commercial engines.
Another way we might tackle the sheer quantity of text is by automatically summarizing
it, which is the topic of Chapter 12. This very difficult problem involves either finding
the sentences, or bits of sentences, that contribute to providing a relevant summary of a
larger quantity of text, or else ingesting the text, summarizing its meaning in some internal
representation, and then generating the text that constitutes a summary, much as a human
might do.
Often, humans would like machines to process text automatically because they have
questions they seek to answer. These questions can range from simple, factoid-like questions,
such as “When was John F. Kennedy born?” to more complex questions such as “What is
the largest city in Bavaria, Germany?” Chapter 13 discusses ways to build systems to answer
these types of questions automatically.
What if the types of questions we might like to answer are even more complex? Our
queries might have multiple answers, such as “Name all the foreign heads of state President
Barack Obama met with in 2010.” These types of queries are handled by a relatively new
subdiscipline within NLP known as distillation. In a very real way, distillation combines the
techniques of information retrieval with information extraction and adds a few of its own.
In many cases, we might like to have machines process language in an interactive way,
making use of speech technology that both recognizes and synthesizes speech. Such systems
are known as dialog systems and are covered in Chapter 15. Due to advances in speech
recognition, dialog management, and speech synthesis, such systems are becoming increas-
ingly practical and are seeing widespread, real-world deployment.
Finally, we, as NLP researchers and engineers, might like to build systems using diverse
arrays of components developed across the world. This aggregation of processing engines
is described in Chapter 16. Although it is the final chapter of our book, in some ways it
represents a beginning, not an end, to processing text, for it describes how a common
infrastructure can be used to produce a combinatorically diverse array of processing
pipelines.
As much as we hope this book is self-contained, we also hope that for you it serves as
the beginning and not an end. Each chapter has a long list of relevant work upon which it
is based, allowing you to explore any subtopic in great detail. The large community of NLP
researchers is growing throughout the world, and we hope you join us in our exciting efforts
to process text automatically and that you interact with us at universities, at industrial
research labs, at conferences, in blogs, on social networks, and elsewhere. The multilingual
NLP systems of the future are going to be even more exciting than the ones we have now,
and we look forward to all your contributions!
Acknowledgments
This book was, from its inception, designed as a highly collaborative effort. We are immensely
grateful for the encouraging support obtained from the beginning from IBM Press/Prentice
Hall, especially from Bernard Goodwin and all the others at IBM Press who helped us get
this project off the ground and see it to completion. A book of this kind would also not have
been possible without the generous time, effort, and technical acumen of our fellow chapter
authors, so we owe huge thanks to Otakar Smrž, Hyun-Jo You, Dilek Hakkani-Tür, Gokhan
Tur, Benoit Favre, Elizabeth Shriberg, Anoop Sarkar, Sameer Pradhan, Katrin Kirchhoff,
Mark Sammons, V.G. Vinod Vydiswaran, Dan Roth, Carmen Banea, Rada Mihalcea, Janyce Wiebe, Xiaoqiang Luo, Philipp Koehn, Philipp Sorg, Philipp Cimiano, Frank Schilder, Liang
Zhou, Nico Schlaefer, Jennifer Chu-Carroll, Vittorio Castelli, Radu Florian, Roberto Pierac-
cini, David Suendermann, John F. Pitrelli, and Burn Lewis. Daniel M. Bikel is also grateful
to Google Research, especially to Corinna Cortes, for her support during the final stages of
this project. Finally, we—Daniel M. Bikel and Imed Zitouni—would like to express our great
appreciation for the backing of IBM Research, with special thanks to Ellen Yoffa, without
whom this project would not have been possible.
About the Authors
Daniel M. Bikel ([email protected]) is a senior research scientist
at Google. He graduated with honors from Harvard in 1993 with a
degree in Classics–Ancient Greek and Latin. From 1994 to 1997, he
worked at BBN on several natural language processing problems,
including development of the first high-accuracy stochastic name-
finder, for which he holds a patent. He received M.S. and Ph.D.
degrees in computer science from the University of Pennsylvania, in
2000 and 2004 respectively, discovering new properties of statisti-
cal parsing algorithms. From 2004 through 2010, he was a research
staff member at IBM Research, working on a wide variety of natu-
ral language processing problems, including parsing, semantic role
labeling, information extraction, machine translation, and question answering. Dr. Bikel
has been a reviewer for the Computational Linguistics journal, and has been on the pro-
gram committees of the ACL, NAACL, EACL, and EMNLP conferences. He has published
numerous peer-reviewed papers in the leading conferences and journals and has built soft-
ware tools that have seen widespread use in the natural language processing community.
In 2008, he won a Best Paper Award (Outstanding Short Paper) at the ACL-08: HLT
conference. Since 2010, Dr. Bikel has been doing natural language processing and speech
processing research at Google.
committee and as a chair for several peer-reviewed conferences and journals. He holds several patents in the field and has authored more than seventy-five papers in peer-reviewed conferences and journals.
Chapter 1
Finding the Structure of Words

Human language is a complicated thing. We use it to express our thoughts, and through
language, we receive information and infer its meaning. Linguistic expressions are not unor-
ganized, though. They show structure of different kinds and complexity and consist of more
elementary components whose co-occurrence in context refines the notions they refer to in
isolation and implies further meaningful relations between them.
Trying to understand language en bloc is not a viable approach. Linguists have developed
whole disciplines that look at language from different perspectives and at different levels of
detail. The point of morphology, for instance, is to study the variable forms and functions
of words, while syntax is concerned with the arrangement of words into phrases, clauses,
and sentences. Word structure constraints due to pronunciation are described by phonology,
whereas conventions for writing constitute the orthography of a language. The meaning of
a linguistic expression is its semantics, and etymology and lexicology cover especially the
evolution of words and explain the semantic, morphological, and other links among them.
Words are perhaps the most intuitive units of language, yet they are in general tricky to
define. Knowing how to work with them allows, in particular, the development of syntactic
and semantic abstractions and simplifies other advanced views on language. Morphology is
an essential part of language processing, and in multilingual settings, it becomes even more
important.
In this chapter, we explore how to identify words of distinct types in human languages,
and how the internal structure of words can be modeled in connection with the grammatical
properties and lexical concepts the words should represent. The discovery of word structure
is morphological parsing.
How difficult can such tasks be? It depends. In many languages, words are delimited in
the orthography by whitespace and punctuation. But in many other languages, the writing
system leaves it up to the reader to tell words apart or determine their exact phonologi-
cal forms. Some languages use words whose form need not change much with the varying
context; others are highly sensitive about the choice of word forms according to particular
syntactic and semantic constraints and restrictions.
1.1 Words and Their Components
1.1.1 Tokens
Suppose, for a moment, that words in English are delimited only by whitespace and punc-
tuation [3], and consider Example 1–1:
Example 1–1: Will you read the newspaper? Will you read it? I won’t read it.
If we confront our assumption with insights from etymology and syntax, we notice two
words here: newspaper and won’t. Being a compound word, newspaper has an interesting
derivational structure. We might wish to describe it in more detail, once there is a lexicon or
some other linguistic evidence on which to build the possible hypotheses about the origins of
the word. In writing, newspaper and the associated concept are distinguished from the isolated
news and paper. In speech, however, the distinction is far from clear, and identification of
words becomes an issue of its own.
For reasons of generality, linguists prefer to analyze won’t as two syntactic words, or
tokens, each of which has its independent role and can be reverted to its normalized form.
The structure of won’t could be parsed as will followed by not. In English, this kind of
tokenization and normalization may apply to just a limited set of cases, but in other
languages, these phenomena have to be treated in a less trivial manner.
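To make the idea concrete, here is a minimal sketch, assuming a small hand-written table of English contractions, of how such tokenization and normalization might be carried out; the table and function names are illustrative and not part of any particular toolkit.

```python
import re

# A tiny, illustrative table of English contractions mapping a surface
# token to its normalized syntactic words; real systems use larger lexicons.
CONTRACTIONS = {
    "won't": ["will", "not"],
    "didn't": ["did", "not"],
    "i'm": ["i", "am"],
    "it's": ["it", "is"],
}

def tokenize(text):
    """Split on whitespace and punctuation, then expand known contractions."""
    tokens = []
    for surface in re.findall(r"[\w']+|[^\w\s]", text.lower()):
        tokens.extend(CONTRACTIONS.get(surface, [surface]))
    return tokens

print(tokenize("Will you read the newspaper? I won't read it."))
# ['will', 'you', 'read', 'the', 'newspaper', '?', 'i', 'will', 'not', 'read', 'it', '.']
```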
In Arabic or Hebrew [4], certain tokens are concatenated in writing with the preceding or
the following ones, possibly changing their forms as well. The underlying lexical or syntactic
units are thereby blurred into one compact string of letters and no longer appear as distinct
words. Tokens behaving in this way can be found in various languages and are often called
clitics.
In the writing systems of Chinese, Japanese [5], and Thai, whitespace is not used to
separate words. The units that are delimited graphically in some way are sentences or
clauses. In Korean, character strings are called eojeol ‘word segment’ and roughly correspond
to speech or cognitive units, which are usually larger than words and smaller than clauses [6],
as shown in Example 1–2:
Example 1–2: 학생들에게만 주셨는데
hak.sayng.tul.ey.key.man cwu.syess.nun.te²
haksayng-tul-eykey-man cwu-si-ess-nunte
student+plural+dative+only give+honorific+past+while
‘while (he/she) gave (it) only to the students’
1. Signs used in sign languages are composed of elements denoted as phonemes, too.
2. We use the Yale romanization of the Korean script and indicate its original characters by dots. Hyphens
mark morphological boundaries, and tokens are separated by plus symbols.
Nonetheless, the elementary morphological units are viewed as having their own syntactic
status [7]. In such languages, tokenization, also known as word segmentation, is the
fundamental step of morphological analysis and a prerequisite for most language processing
applications.
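For scripts that do not delimit words with whitespace, even a crude segmenter illustrates why a lexicon is a prerequisite. The sketch below uses greedy longest match against a toy vocabulary; the vocabulary and the example string are made up for illustration, and production segmenters are statistical rather than greedy.

```python
def segment(text, vocab, max_len=6):
    """Greedy longest-match word segmentation over a string without spaces.

    At each position, take the longest substring found in the vocabulary;
    fall back to a single character when nothing matches.
    """
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

# Toy vocabulary; a real system would use a large lexicon or a trained model.
vocab = {"北京", "大学", "北京大学", "学生"}
print(segment("北京大学学生", vocab))  # ['北京大学', '学生']
```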
1.1.2 Lexemes
By the term word, we often denote not just the one linguistic form in the given context
but also the concept behind the form and the set of alternative forms that can express
it. Such sets are called lexemes or lexical items, and they constitute the lexicon of a lan-
guage. Lexemes can be divided by their behavior into the lexical categories of verbs, nouns,
adjectives, conjunctions, particles, or other parts of speech. The citation form of a lexeme,
by which it is commonly identified, is also called its lemma.
When we convert a word into its other forms, such as turning the singular mouse into
the plural mice or mouses, we say we inflect the lexeme. When we transform a lexeme into
another one that is morphologically related, regardless of its lexical category, we say we
derive the lexeme: for instance, the nouns receiver and reception are derived from the verb
to receive.
Example 1–3: Did you see him? I didn’t see him. I didn’t see anyone.
Example 1–3 presents the problem of tokenization of didn’t and the investigation of
the internal structure of anyone. In the paraphrase I saw no one, the lexeme to see would
be inflected into the form saw to reflect its grammatical function of expressing positive
past tense. Likewise, him is the oblique case form of he or even of a more abstract lexeme
representing all personal pronouns. In the paraphrase, no one can be perceived as the
minimal word synonymous with nobody. The difficulty with the definition of what counts as
a word need not pose a problem for the syntactic description if we understand no one as
two closely connected tokens treated as one fixed element.
In the Czech translation of Example 1–3, the lexeme vidět ‘to see’ is inflected for past
tense, in which forms comprising two tokens are produced in the second and first person
(i.e., viděla jsi ‘you-fem-sg saw’ and neviděla jsem ‘I-fem-sg did not see’). Negation in
Czech is an inflectional parameter rather than just syntactic and is marked both in the verb
and in the pronoun of the latter response, as in Example 1–4:
Example 1–4: Vidělas ho? Neviděla jsem ho. Neviděla jsem nikoho.
saw+you-are him? not-saw I-am him. not-saw I-am no-one.
Here, vidělas is the contracted form of viděla jsi ‘you-fem-sg saw’. The s of jsi ‘you are’
is a clitic, and due to free word order in Czech, it can be attached to virtually any part of
speech. We could thus ask a question like Nikohos neviděla? ‘Did you see no one?’ in which
the pronoun nikoho ‘no one’ is followed by this clitic.
1.1.3 Morphemes
Morphological theories differ on whether and how to associate the properties of word forms
with their structural components [8, 9, 10, 11]. These components are usually called seg-
ments or morphs. The morphs that by themselves represent some aspect of the meaning
of a word are called morphemes of some function.
Human languages employ a variety of devices by which morphs and morphemes are
combined into word forms. The simplest morphological process concatenates morphs one by
one, as in dis-agree-ment-s, where agree is a free lexical morpheme and the other elements
are bound grammatical morphemes contributing some partial meaning to the whole word.
In a more complex scheme, morphs can interact with each other, and their forms may
become subject to additional phonological and orthographic changes denoted as morpho-
phonemic. The alternative forms of a morpheme are termed allomorphs.
Examples of morphological alternation and phonologically dependent choice of the form
of a morpheme are abundant in the Korean language. In Korean, many morphemes change
their forms systematically with the phonological context. Example 1–5 lists the allomorphs
-ess-, -ass-, -yess- of the temporal marker indicating past tense. The first two alter according
to the phonological condition of the preceding verb stem; the last one is used especially for
the verb ha- ‘do’. The appropriate allomorph is merely concatenated after the stem, or it can
be further contracted with it, as was -si-ess- into -syess- in Example 1–2. During morpho-
logical parsing, normalization of allomorphs into some canonical form of the morpheme is
desirable, especially because the contraction of morphs interferes with simple segmentation:
Example 1–5: concatenated contracted
(a) 보았- po-ass- 봤- pwass- ‘have seen’
(b) 가지었- ka.ci-ess- 가졌- ka.cyess- ‘have taken’
(c) 하였- ha-yess- 했- hayss- ‘have done’
(d) 되었- toy-ess- 됐- twayss- ‘have become’
(e) 놓았- noh-ass- 놨- nwass- ‘have put’
Contractions (a, b) are ordinary but require attention because two characters are reduced
into one. Other types (c, d, e) are phonologically unpredictable, or lexically dependent. For
example, coh-ass- ‘have been good’ may never be contracted, whereas noh- and -ass- are
merged into nwass- in (e).
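A morphological analyzer therefore benefits from normalizing such allomorphs and undoing contractions before segmentation. The following sketch, using the romanized forms from Examples 1–2 and 1–5, hard-codes a few contraction rules purely for illustration; the rule table is hypothetical and far from complete.

```python
# Illustrative contraction table: contracted (romanized) form -> canonical morphs.
# Entries follow Examples 1-2 and 1-5; a real analyzer derives many of these
# from phonological rules and lists only the lexically dependent cases.
CONTRACTIONS = {
    "pwass": ["po", "ass"],      # (a) 'have seen'
    "kacyess": ["kaci", "ess"],  # (b) 'have taken'
    "hayss": ["ha", "yess"],     # (c) 'have done'
    "twayss": ["toy", "ess"],    # (d) 'have become'
    "nwass": ["noh", "ass"],     # (e) 'have put'
    "syess": ["si", "ess"],      # honorific + past, as in cwu-syess-nunte
}

def normalize(morphs):
    """Replace contracted morphs with their canonical components."""
    out = []
    for m in morphs:
        out.extend(CONTRACTIONS.get(m, [m]))
    return out

print(normalize(["cwu", "syess", "nunte"]))
# ['cwu', 'si', 'ess', 'nunte']
```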
There are yet other linguistic devices of word formation to account for, as the morpho-
logical process itself can get less trivial. The concatenation operation can be complemented
with infixation or intertwining of the morphs, which is common, for instance, in Arabic.
Nonconcatenative inflection by modification of the internal vowel of a word occurs even in
English: compare the sounds of mouse and mice, see and saw, read and read.
Notably in Arabic, internal inflection takes place routinely and has a yet different quality.
The internal parts of words, called stems, are modeled with root and pattern morphemes.
Word structure is then described by templates abstracting away from the root but showing
the pattern and all the other morphs attached to either side of it.
3. The original Arabic script is transliterated using Buckwalter notation. For readability, we also provide
the standard phonological transcription, which reduces ambiguity.
The meaning of Example 1–6 is similar to that of Example 1–1, only the phrase hāḏihi ’l-ǧarāʾida refers to ‘these newspapers’. While sa-taqraʾu ‘you will read’ combines the future marker sa- with the imperfective second-person masculine singular verb taqraʾu in the indicative mood and active voice, sa-taqraʾuhā ‘you will read it’ also adds the cliticized feminine singular personal pronoun in the accusative case.4
The citation form of the lexeme to which taqraʾu ‘you-masc-sg read’ belongs is qaraʾa, roughly ‘to read’. This form is classified by linguists as the basic verbal form represented by the template faʿal merged with the consonantal root q r ʾ, where the f ʿ l symbols of the template are substituted by the respective root consonants. Inflections of this lexeme can modify the pattern faʿal of the stem of the lemma into fʿal and concatenate it, under rules of morphophonemic changes, with further prefixes and suffixes. The structure of taqraʾu is thus parsed into the template ta-fʿal-u and the invariant root.
The word al-ǧarāʾida ‘the newspapers’ in the accusative case and definite state is another example of internal inflection. Its structure follows the template al-faʿāʾil-a with the root ǧ r d. This word is the plural of ǧarīdah ‘newspaper’ with the template faʿīl-ah. The links between singular and plural templates are subject to convention and have to be declared in the lexicon.
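The template-and-root view lends itself to a simple computational reading: a template is a string over the placeholder symbols f, ʿ, l plus affix material, and generating a stem amounts to substituting the root consonants for the placeholders. The sketch below illustrates this interdigitation on the forms discussed above; it deliberately ignores the morphophonemic merge rules that a real system such as ElixirFM applies on top, and its placeholder notation is ours, not the book's.

```python
def interdigitate(template, root):
    """Substitute root consonants for the placeholder symbols f, c, l in a template.

    The middle placeholder is written 'c' here (for the ʿayn slot) so the
    template string stays plain ASCII; this is an illustrative convention.
    """
    slots = dict(zip("fcl", root))
    return "".join(slots.get(ch, ch) for ch in template)

root_qr = ("q", "r", "ʾ")          # the root of qaraʾa 'to read'
print(interdigitate("facal", root_qr))      # qaraʾ, the perfective stem
print(interdigitate("ta-fcal-u", root_qr))  # ta-qraʾ-u -> taqraʾu 'you read'

root_grd = ("ǧ", "r", "d")         # the root of ǧarīdah 'newspaper'
print(interdigitate("facāʾil", root_grd))   # ǧarāʾid, the broken plural stem
```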
Irrespective of the morphological processes involved, some properties or features of a
word need not be apparent explicitly in its morphological structure. Its existing structural
components may be paired with and depend on several functions simultaneously but may
have no particular grammatical interpretation or lexical meaning.
The -ah suffix of ǧarīdah ‘newspaper’ corresponds with the inherent feminine gender of the lexeme. In fact, the -ah morpheme is commonly, though not exclusively, used to mark the feminine singular forms of adjectives: for example, ǧadīd becomes ǧadīdah ‘new’. However,
the -ah suffix can be part of words that are not feminine, and there its function can be seen
as either emptied or overridden [12]. In general, linguistic forms should be distinguished
from functions, and not every morph can be assumed to be a morpheme.
1.1.4 Typology
Morphological typology divides languages into groups by characterizing the prevalent mor-
phological phenomena in those languages. It can consider various criteria, and during the
history of linguistics, different classifications have been proposed [13, 14]. Let us outline the
typology that is based on quantitative relations between words, their morphemes, and their
features:
Isolating, or analytic, languages include no or relatively few words that would comprise
more than one morpheme (typical members are Chinese, Vietnamese, and Thai; ana-
lytic tendencies are also found in English).
Synthetic languages can combine more morphemes in one word and are further divided
into agglutinative and fusional languages.
Agglutinative languages have morphemes associated with only a single function at a time
(as in Korean, Japanese, Finnish, and Tamil, etc.).
Fusional languages are defined by their feature-per-morpheme ratio higher than one (as in
Arabic, Czech, Latin, Sanskrit, German, etc.).
In accordance with the notions about word formation processes mentioned earlier, we
can also discern:
1.2.1 Irregularity
Morphological parsing is motivated by the quest for generalization and abstraction in the
world of words. Immediate descriptions of given linguistic data may not be the ultimate
ones, due to either their inadequate accuracy or inappropriate complexity, and better for-
mulations may be needed. The design principles of the morphological model are therefore
very important.
In Arabic, the deeper study of the morphological processes that are in effect during
inflection and derivation, even for the so-called irregular words, is essential for mastering the
whole morphological and phonological system. With the proper abstractions made, irregular
morphology can be seen as merely enforcing some extended rules, the nature of which is
phonological, over the underlying or prototypical regular word forms [15, 16].
Example 1–7: hl rOyth? lm Orh. lm Or OHdA.
hal raʾaytihi? lam ʾarahu. lam ʾara ʾaḥadan.
whether you-saw+him? not-did I-see+him. not-did I-see anyone.
In Example 1–7, raʾayti is the second-person feminine singular perfective verb in active voice, a member of the raʾā ‘to see’ lexeme of the r ʾ y root. The prototypical, regularized pattern for this citation form is faʿal, as we saw with qaraʾa in Example 1–6. Alternatively, we could assume the pattern of raʾā to be faʿā, thereby asserting in a compact way that the final root consonant and its vocalic context are subject to the particular phonological change, resulting in raʾā like faʿā instead of raʾay like faʿal. The occurrence of this change in the citation form may have possible implications for the morphological behavior of the whole lexeme.
Table 1–1 illustrates differences between a naive model of word structure in Arabic and
the model proposed in Smrž [12] and Smrž and Bielický [17] where morphophonemic merge
rules and templates are involved. Morphophonemic templates capture morphological pro-
cesses by just organizing stem patterns and generic affixes without any context-dependent
variation of the affixes or ad hoc modification of the stems. The merge rules, indeed very
terse, then ensure that such structured representations can be converted into exactly the
surface forms, both orthographic and phonological, used in the natural language. Applying
the merge rules is independent of and irrespective of any grammatical parameters or infor-
mation other than that contained in a template. Most morphological irregularities are thus
successfully removed.
In contrast, some irregularities are bound to particular lexemes or contexts, and can-
not be accounted for by general rules. Korean irregular verbs provide examples of such
irregularities.
Korean shows exceptional constraints on the selection of grammatical morphemes. It
is hard to find irregular inflection in other agglutinative languages: two irregular verbs
in Japanese [18], one in Finnish [19]. These languages are abundant with morphological
alternations that are formalized by precise phonological rules. Korean additionally features
lexically dependent stem alternation. As in many other languages, i- ‘be’ and ha- ‘do’ have
unique irregular endings. Other irregular verbs are classified by the stem final phoneme.
Table 1–2 compares major irregular verb classes with regular verbs in the same phonological
condition.
1.2.2 Ambiguity
Morphological ambiguity is the possibility that word forms be understood in multiple ways
out of the context of their discourse. Word forms that look the same but have distinct functions or meanings are called homonyms.
Ambiguity is present in all aspects of morphological processing and language processing
at large. Morphological parsing is not concerned with complete disambiguation of words in
their context, however; it can effectively restrict the set of valid interpretations of a given
word form [20, 21].
In Korean, homonyms are among the most problematic objects in morphological analysis because they are pervasive among frequent lexical items. Table 1–3 arranges homonyms on the basis of their behavior with different endings. Example 1–8 shows homonymy across nouns and verbs.
We could also consider ambiguity in the senses of the noun nan, according to the Standard
Korean Language Dictionary: nan1 ‘egg’, nan2 ‘revolt’, nan5 ‘section (in newspaper)’, nan6
‘orchid’, plus several infrequent readings.
Arabic is a language of rich morphology, both derivational and inflectional. Because
Arabic script usually does not encode short vowels and omits some other diacritical
marks that would record the phonological form exactly, the degree of its morphological
ambiguity is considerably increased. In addition, Arabic orthography collapses certain word
forms together. The problem of morphological disambiguation of Arabic encompasses not
only the resolution of the structural components of words and their actual morphosyntactic
properties (i.e., morphological tagging [22, 23, 24]) but also tokenization and normalization
[25], lemmatization, stemming, and diacritization [26, 27, 28].
When inflected syntactic words are combined in an utterance, additional phonological
and orthographic changes can take place, as shown in Figure 1–1. In Sanskrit, one such
euphony rule is known as external sandhi [29, 30]. Inverting sandhi during tokenization is
usually nondeterministic in the sense that it can provide multiple solutions. In any language,
tokenization decisions may impose constraints on the morphosyntactic properties of the
tokens being reconstructed, which then have to be respected in further processing. The
tight coupling between morphology and syntax has inspired proposals for disambiguating
them jointly rather than sequentially [4].
Czech is a highly inflected fusional language. Unlike agglutinative languages, inflectional morphemes often represent several functions simultaneously, and there is no particular one-to-one correspondence between their forms and functions.

Figure 1–1: Complex tokenization and normalization of euphony in Arabic. Three nominal cases are expressed by the same word form with dirāsatī ‘my study’ and muʿallimīya ‘my teachers’, but the original case endings are distinct. In katabtumūhā ‘you-masc-pl wrote them’, the liaison vowel ū is dropped when tokenized. Special attention is needed to normalize some orthographic conventions, such as the interaction of iǧrāʾ ‘carrying out’ and the cliticized hu ‘his’ respecting the case ending, or the merge of the definite article of ʾasaf ‘regret’ with the preposition li ‘for’.

Inflectional paradigms
(i.e., schemes for finding the form of a lexeme associated with the required properties) in
Czech are of numerous kinds, yet they tend to include nonunique forms in them.
Table 1–4 lists the paradigms of several common Czech words. Inflectional paradigms
for nouns depend on the grammatical gender and the phonological structure of a lexeme.
The individual forms in a paradigm vary with grammatical number and case, which are the
free parameters imposed only by the context in which a word is used.
Looking at the morphological variation of the word stavení ‘building’, we might wonder why we should distinguish all the cases for it when this lexeme can take only four different forms. Is the detail of the case system appropriate? The answer is yes, because we can find linguistic evidence that leads to this case category abstraction. Just consider other words of the same meaning in place of stavení in various contexts. We conclude that there is indeed
a case distinction made by the underlying system, but it need not necessarily be expressed
clearly and uniquely in the form of words.
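Concretely, a paradigm can be represented as a mapping from morphosyntactic parameters to forms; syncretism then shows up as several parameter combinations sharing one form. The fragment below illustrates this for stavení ‘building’, using a partial, illustrative paradigm rather than the full Table 1–4.

```python
# Partial, illustrative paradigm of Czech 'stavení' (building), neuter:
# (number, case) -> form. Many cells share the same form, i.e. syncretism.
STAVENI = {
    ("sg", "nom"): "stavení", ("sg", "gen"): "stavení",
    ("sg", "dat"): "stavení", ("sg", "loc"): "stavení",
    ("sg", "ins"): "stavením",
    ("pl", "nom"): "stavení", ("pl", "gen"): "stavení",
    ("pl", "dat"): "stavením", ("pl", "ins"): "staveními",
}

def analyses(form, paradigm):
    """Return all (number, case) readings a surface form may express."""
    return [feats for feats, f in paradigm.items() if f == form]

print(analyses("stavení", STAVENI))
# [('sg', 'nom'), ('sg', 'gen'), ('sg', 'dat'), ('sg', 'loc'), ('pl', 'nom'), ('pl', 'gen')]
```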
The morphological phenomenon that some words or word classes show instances of
systematic homonymy is called syncretism. In particular, homonymy can occur due to
neutralization and uninflectedness with respect to some morphosyntactic parameters.
These cases of morphological syncretism are distinguished by the ability of the context to
demand the morphosyntactic properties in question, as stated by Baerman, Brown, and
Corbett [10, p. 32]:
Whereas neutralization is about syntactic irrelevance as reflected in morphology,
uninflectedness is about morphology being unresponsive to a feature that is
syntactically relevant.
For example, it seems fine for syntax in Czech or Arabic to request the personal pronoun
of the first-person feminine singular, equivalent to ‘I’, despite it being homonymous with
the first-person masculine singular. The reason is that for some other values of the person
category, the forms of masculine and feminine gender are different, and there exist syntactic
dependencies that do take gender into account. It is not the case that the first-person singular
pronoun would have no gender nor that it would have both. We just observe uninflectedness
here. On the other hand, we might claim that in English or Korean, the gender category is
syntactically neutralized if it ever was present, and the nuances between he and she, him
and her, his and hers are only semantic.
With the notion of paradigms and syncretism in mind, we should ask what is the minimal
set of combinations of morphosyntactic inflectional parameters that covers the inflectional
variability in a language. Morphological models that would like to define a joint system of
underlying morphosyntactic properties for multiple languages would have to generalize the
parameter space accordingly and neutralize any systematically void configurations.
1.2.3 Productivity
Is the inventory of words in a language finite, or is it unlimited? This question leads
directly to discerning two fundamental approaches to language, summarized in the dis-
tinction between langue and parole by Ferdinand de Saussure, or in the competence versus
performance duality by Noam Chomsky.
In one view, language can be seen as simply a collection of utterances (parole) actually
pronounced or written (performance). This ideal data set can in practice be approximated
by linguistic corpora, which are finite collections of linguistic data that are studied with
empirical methods and can be used for comparison when linguistic models are developed.
Example 1–9 has the meaning of Example 1–1 and Example 1–6. The word noviny
‘newspaper’ exists only in plural whether it signifies one piece of newspaper or many of
them. We can literally translate noviny as the plural of novina ‘news’ to see the origins of
the word as well as the fortunate analogy with English.
It is conceivable to include all negated lexemes into the lexicon and thereby again achieve
a finite number of word forms in the vocabulary. Generally, though, the richness of a mor-
phological system of a language can make this approach highly impractical.
Most languages contain words that allow some of their structural components to repeat
freely. Consider the prefix pra- related to a notion of ‘generation’ in Czech and how it can
or cannot be iterated, as shown in Example 1–10:
inadvertent misspelling thereof. Nonetheless, both of these words successfully entered the
lexicon of English where morphological productivity started working, and we now know the
verb to google and nouns like googling or even googlish or googleology [34].
The original names have been adopted by other languages, too, and their own morpho-
logical processes have been triggered. In Czech, one says googlovat, googlit ‘to google’ or
vygooglovat, vygooglit ‘to google out’, googlování ‘googling’, and so on. In Arabic, the names
are transcribed as ǧūǧūl ‘googol’ and ǧūǧil ‘Google’. The latter one got transformed to the
verb ǧawǧal ‘to google’ through internal inflection, as if there were a genuine root ǧ w ǧ l,
and the corresponding noun ǧawǧalah ‘googling’ exists as well.
lists, dictionaries, or databases, unless they are constructed by and kept in sync with more
sophisticated models of the language.
In this context, a dictionary is understood as a data structure that directly enables
obtaining some precomputed results, in our case word analyses. The data structure can
be optimized for efficient lookup, and the results can be shared. Lookup operations are
relatively simple and usually quick. Dictionaries can be implemented, for instance, as lists,
binary search trees, tries, hash tables, and so on.
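A minimal enumerative model is nothing more than a finite map from word forms to their precomputed analyses. The sketch below uses a plain hash table (a Python dict); the entries are a tiny illustrative sample, not drawn from any actual resource.

```python
# Enumerative morphological dictionary: surface form -> list of analyses.
# Each analysis pairs a lemma with a tag; the entries are illustrative only.
ANALYSES = {
    "mice":  [("mouse", "noun+plural")],
    "saw":   [("see", "verb+past"), ("saw", "noun+singular")],
    "women": [("woman", "noun+plural")],
}

def lookup(form):
    """Return all stored analyses of a form, or [] if out of vocabulary.

    Coverage is exactly the set of enumerated keys; nothing is generalized.
    """
    return ANALYSES.get(form.lower(), [])

print(lookup("saw"))    # [('see', 'verb+past'), ('saw', 'noun+singular')]
print(lookup("geese"))  # [] -- not enumerated, hence not analyzable
```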
Because the set of associations between word forms and their desired descriptions is
declared by plain enumeration, the coverage of the model is finite and the generative
potential of the language is not exploited. Developing as well as verifying the association list
is tedious, liable to errors, and likely inefficient and inaccurate unless the data are retrieved
automatically from large and reliable linguistic resources.
Despite all that, an enumerative model is often sufficient for the given purpose, deals eas-
ily with exceptions, and can implement even complex morphology. For instance, dictionary-
based approaches to Korean [35] depend on a large dictionary of all possible combinations
of allomorphs and morphological alternations. These approaches do not allow development
of reusable morphological rules, though [36].
The word list or dictionary-based approach has been used frequently in various
ad hoc implementations for many languages. We could assume that with the availability of
immense online data, extracting a high-coverage vocabulary of word forms is feasible these
days [37]. The question remains how the associated annotations are constructed and how
informative and accurate they are. References to the literature on the unsupervised learn-
ing and induction of morphology, which are methods resulting in structured and therefore
nonenumerative models, are provided later in this chapter.
The role of finite-state transducers is to capture and compute regular relations on sets
[38, 9, 11].6 That is, transducers specify relations between the input and output languages.
In fact, it is possible to invert the domain and the range of a relation, that is, exchange the
input and the output. In finite-state computational morphology, it is common to refer to the
input word forms as surface strings and to the output descriptions as lexical strings, if
the transducer is used for morphological analysis, or vice versa, if it is used for morphological
generation.
The linguistic descriptions we would like to give to the word forms and their components
can be rather arbitrary and are obviously dependent on the language processed as well as
on the morphological theory followed. In English, a finite-state transducer could analyze the
surface string children into the lexical string child [+plural], for instance, or generate women
from woman [+plural]. For other examples of possible input and output strings, consider
Example 1–8 or Figure 1–1.
Relations on languages can also be viewed as functions. Let us have a relation R, and let us denote by [Σ] the set of all sequences over some set of symbols Σ, so that the domain and the range of R are subsets of [Σ]. We can then consider R as a function mapping an input string into a set of output strings; with [Σ] equal to String, its type signature takes one String to a set of Strings.
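As a toy illustration of that shape, a function from a surface string to the set of its possible lexical strings, consider the sketch below; the lexical-string notation with [+plural] mirrors the children and women examples above, and the tiny relation is hand-written rather than compiled from any transducer.

```python
# A hand-written stand-in for the regular relation R: surface -> lexical strings.
# A real system would compile this relation from lexicon and rule transducers.
RELATION = {
    "children": {"child [+plural]"},
    "women":    {"woman [+plural]"},
    "read":     {"read [+present]", "read [+past]"},  # ambiguity gives a set
}

def analyze(surface):
    """Map a surface string to the set of lexical strings it may realize."""
    return RELATION.get(surface, set())

def generate(lexical):
    """The inverted relation: a lexical string back to its surface strings."""
    return {s for s, lex in RELATION.items() if lexical in lex}

print(analyze("children"))          # {'child [+plural]'}
print(generate("woman [+plural]"))  # {'women'}
```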
Finite-state transducers have been studied extensively for their formal algebraic proper-
ties and have proven to be suitable models for miscellaneous problems [9]. Their applications
encoding the surface rather than lexical string associations as rewrite rules of phonology
and morphology have been around since the two-level morphology model [39], further pre-
sented in Computational Approaches to Morphology and Syntax [11] and Morphology and
Computation [40].
Morphological operations and processes in human languages can, in the overwhelming
number of cases and to a sufficient degree, be expressed in finite-state terms. Beesley and
Karttunen [9] stress concatenation of transducers as the method for factoring surface and
lexical languages into simpler models and propose a somewhat unsystematic compile-
replace transducer operation for handling nonconcatenative phenomena in morphology.
Roark and Sproat [11], however, argue that building morphological models in general using
transducer composition, which is pure, is a more universal approach.
A theoretical limitation of finite-state models of morphology is the problem of capturing
reduplication of words or their elements (e.g., to express plurality) found in several human
languages. A formal language that contains only words of the form λ1+k , where λ is some
arbitrary sequence of symbols from an alphabet and k ∈ {1, 2, . . . } is an arbitrary natural
number indicating how many times λ is repeated after itself, is not a regular language, not
even a context-free language. General reduplication of strings of unbounded length is thus
not a regular-language operation. Coping with this problem in the framework of finite-state
transducers is discussed by Roark and Sproat [11].
6. Regular relations and regular languages are restricted in their structure by the limited memory of the
device (i.e., the finite set of configurations in which it can occur). Unlike with regular languages, intersection
of regular relations can in general yield nonregular results [38].
that the information in them is mutually incompatible. Depending on the flavor of the
processing logic, unification can be monotonic (i.e., information-preserving), or it can allow
inheritance of default values and their overriding. In either case, information in a model can
be efficiently shared and reused by means of inheritance hierarchies defined on the feature
structure types.
Morphological models of this kind are typically formulated as logic programs, and unifi-
cation is used to solve the system of constraints imposed by the model. Advantages of this
approach include better abstraction possibilities for developing a morphological grammar as
well as elimination of redundant information from it.
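The core operation can be sketched very compactly for flat feature structures: unification succeeds only where the two structures do not assign conflicting values to the same feature. The function below is a monotonic, non-recursive toy version; real unification grammars operate over typed, nested structures and inheritance hierarchies.

```python
def unify(fs1, fs2):
    """Monotonically unify two flat feature structures (dicts).

    Returns the merged structure, or None if any shared feature has
    conflicting values (the structures are mutually incompatible).
    """
    result = dict(fs1)
    for feature, value in fs2.items():
        if feature in result and result[feature] != value:
            return None  # unification fails
        result[feature] = value
    return result

stem = {"lemma": "vidět", "cat": "verb"}
suffix = {"tense": "past", "gender": "fem", "number": "sg"}
clash = {"cat": "noun"}

print(unify(stem, suffix))  # combines information from both structures
print(unify(stem, clash))   # None: 'cat' values verb vs. noun are incompatible
```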
However, morphological models implemented in DATR can, under certain assumptions,
be converted to finite-state machines and are thus formally equivalent to them in the range
of morphological phenomena they can describe [11]. Interestingly, one-level phonology [56], which formulates phonological constraints as logic expressions, can be compiled into finite-state automata, which can then be intersected with morphological transducers to exclude any disturbing phonologically invalid surface strings [cf. 57, 53].
Unification-based models have been implemented for Russian [58], Czech [59], Slovene
[53], Persian [60], Hebrew [61], Arabic [62, 63], and other languages. Some rely on DATR;
some adopt, adapt, or develop other unification engines.
Figure 1–2: Excerpt from the ElixirFM lexicon and a layout generated from it. The source code of entries nested under the d r y root is shown in monospace font. Note the custom notation and the economical yet informative style of the declaration.
1.4 Summary
In this chapter, we learned that morphology can be looked at from opposing viewpoints:
one that tries to find the structural components from which words are built versus a more
syntax-driven perspective wherein the functions of words are the focus of the study. Another
distinction can be made between analytic and generative aspects of morphology or can
consider man-made morphological frameworks versus systems for unsupervised induction
of morphology. Yet other kinds of issues are raised about how well and how easily the
morphological models can be implemented.
We described morphological parsing as the formal process recovering structured infor-
mation from a linear sequence of symbols, where ambiguity is present and where multiple
interpretations should be expected.
We explored interesting morphological phenomena in different types of languages and
mentioned several hints in respect to multilingual processing and model development.
With Korean as a language where agglutination moderated by phonological rules is the
dominant morphological process, we saw that a viable model of word decomposition can
work at the level of morphemes, regardless of whether they are lexical or grammatical.
In Czech and Arabic, fusional languages with intricate systems of inflectional and
derivational parameters and lexically dependent word stem variation, such factorization is
not useful. Morphology is better described via paradigms associating the possible forms of
lexemes with their corresponding properties.
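The contrast can be made concrete with a minimal sketch; the word forms, glosses, and feature bundles below are invented for illustration only.

# Morpheme-level factorization, in the spirit of agglutination in Korean:
# a word form is a concatenation of lexical and grammatical morphemes.
segments = ["hakgyo", "eseo"]               # 'school' + locative/source marker
print("-".join(segments))

# Paradigm-based description, in the spirit of fusional Czech or Arabic:
# a lexeme plus a bundle of properties maps directly to a whole word form.
paradigm = {
    ("hrad", ("sg", "nom")): "hrad",
    ("hrad", ("sg", "gen")): "hradu",
    ("hrad", ("pl", "ins")): "hrady",
}
print(paradigm[("hrad", ("sg", "gen"))])    # -> hradu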
We discussed various options for implementing either of these models using modern
programming techniques.
Acknowledgment
We would like to thank Petr Novák for his valuable comments on an earlier draft of this
chapter.
Bibliography
[1] M. Liberman, “Morphology.” Linguistics 001, Lecture 7, University of Pennsylvania,
2009. http://www.ling.upenn.edu/courses/Fall 2009/ling001/morphology.html.
[2] M. Haspelmath, “The indeterminacy of word segmentation and the nature of mor-
phology and syntax,” Folia Linguistica, vol. 45, 2011.
[3] H. Kučera and W. N. Francis, Computational Analysis of Present-Day American
English. Providence, RI: Brown University Press, 1967.
[4] S. B. Cohen and N. A. Smith, “Joint morphological and syntactic disambiguation,”
in Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Lan-
guage Processing and Computational Natural Language Learning (EMNLP-CoNLL),
pp. 208–217, 2007.
[5] T. Nakagawa, “Chinese and Japanese word segmentation using word-level and
character-level information,” in Proceedings of 20th International Conference on Com-
putational Linguistics, pp. 466–472, 2004.
[6] H. Shin and H. You, “Hybrid n-gram probability estimation in morphologically rich
languages,” in Proceedings of the 23rd Pacific Asia Conference on Language, Infor-
mation and Computation, 2009.
[7] D. Z. Hakkani-Tür, K. Oflazer, and G. Tür, “Statistical morphological disambiguation
for agglutinative languages,” in Proceedings of the 18th Conference on Computational
Linguistics, pp. 285–291, 2000.
[8] G. T. Stump, Inflectional Morphology: A Theory of Paradigm Structure. Cambridge
Studies in Linguistics, New York: Cambridge University Press, 2001.
[9] K. R. Beesley and L. Karttunen, Finite State Morphology. CSLI Studies in Compu-
tational Linguistics, Stanford, CA: CSLI Publications, 2003.
[10] M. Baerman, D. Brown, and G. G. Corbett, The Syntax-Morphology Interface. A Study
of Syncretism. Cambridge Studies in Linguistics, New York: Cambridge University
Press, 2006.
[11] B. Roark and R. Sproat, Computational Approaches to Morphology and Syntax. Oxford
Surveys in Syntax and Morphology, New York: Oxford University Press, 2007.
[12] O. Smrž, “Functional Arabic morphology. Formal system and implementation,” PhD
thesis, Charles University in Prague, 2007.
[13] H. Eifring and R. Theil, Linguistics for Students of Asian and African Languages.
Universitetet i Oslo, 2005.
[14] B. Bickel and J. Nichols, “Fusion of selected inflectional formatives & exponence of
selected inflectional formatives,” in The World Atlas of Language Structures Online
(M. Haspelmath, M. S. Dryer, D. Gil, and B. Comrie, eds.), ch. 20 & 21, Munich: Max
Planck Digital Library, 2008.
[15] W. Fischer, A Grammar of Classical Arabic. Trans. Jonathan Rodgers. Yale Language
Series, New Haven, CT: Yale University Press, 2002.
[16] K. C. Ryding, A Reference Grammar of Modern Standard Arabic. New York: Cam-
bridge University Press, 2005.
[17] O. Smrž and V. Bielický, “ElixirFM.” Functional Arabic Morphology, SourceForge.net,
2010. http://sourceforge.net/projects/elixer-fm/.
[18] T. Kamei, R. Kōno, and E. Chino, eds., The Sanseido Encyclopedia of Linguistics,
Volume 6 Terms (in Japanese). Sanseido, 1996.
[19] F. Karlsson, Finnish Grammar. Helsinki: Werner Söderström Osakeyhtiö, 1987.
[20] J. Hajič and B. Hladká, “Tagging inflective languages: Prediction of morphological cat-
egories for a rich, structured tagset,” in Proceedings of COLING-ACL 1998, pp. 483–
490, 1998.
[35] H.-C. Kwon and Y.-S. Chae, “A dictionary-based morphological analysis,” in Proceed-
ings of Natural Language Processing Pacific Rim Symposium, pp. 178–185, 1991.
[36] D.-B. Kim, K.-S. Choi, and K.-H. Lee, “A computational model of Korean morphologi-
cal analysis: A prediction-based approach,” Journal of East Asian Linguistics, vol. 5,
no. 2, pp. 183–215, 1996.
[37] A. Halevy, P. Norvig, and F. Pereira, “The unreasonable effectiveness of data,” IEEE
Intelligent Systems, vol. 24, no. 2, pp. 8–12, 2009.
[38] R. M. Kaplan and M. Kay, “Regular models of phonological rule systems,” Computa-
tional Linguistics, vol. 20, no. 3, pp. 331–378, 1994.
[39] K. Koskenniemi, “Two-level morphology: A general computational model for word
form recognition and production,” PhD thesis, University of Helsinki, 1983.
[40] R. Sproat, Morphology and Computation. ACL–MIT Press Series in Natural Language
Processing. Cambridge, MA: MIT Press, 1992.
[41] D.-B. Kim, S.-J. Lee, K.-S. Choi, and G.-C. Kim, “A two-level morphological analysis
of Korean,” in Proceedings of the 15th International Conference on Computational
Linguistics, pp. 535–539, 1994.
[42] S.-Z. Lee and H.-C. Rim, “Korean morphology with elementary two-level rules and
rule features,” in Proceedings of the Pacific Association for Computational Linguistics,
pp. 182–187, 1997.
[43] N.-R. Han, “Klex: A finite-state transducer lexicon of Korean,” in Finite-state Meth-
ods and Natural Language Processing: 5th International Workshop, FSMNLP 2005,
pp. 67–77, Springer, 2006.
[44] M. Kay, “Nonconcatenative finite-state morphology,” in Proceedings of the Third Con-
ference of the European Chapter of the ACL (EACL-87), pp. 2–10, ACL, 1987.
[45] K. R. Beesley, “Arabic morphology using only finite-state operations,” in COLING-
ACL’98 Proceedings of the Workshop on Computational Approaches to Semitic lan-
guages, pp. 50–57, 1998.
[46] G. A. Kiraz, Computational Nonlinear Morphology with Emphasis on Semitic Lan-
guages. Studies in Natural Language Processing, Cambridge: Cambridge University
Press, 2001.
[47] N. Habash, O. Rambow, and G. Kiraz, “Morphological analysis and generation for
Arabic dialects,” in Proceedings of the ACL Workshop on Computational Approaches
to Semitic Languages, pp. 17–24, 2005.
[48] H. Skoumalová, “A Czech morphological lexicon,” in Proceedings of the Third Meeting
of the ACL Special Interest Group in Computational Phonology, pp. 41–47, 1997.
[49] R. Sedláček and P. Smrž, “A new Czech morphological analyser ajka,” in Text, Speech
and Dialogue, vol. 2166, pp. 100–107, Berlin: Springer, 2001.
[80] H. Johnson and J. Martin, “Unsupervised learning of morphology for English and
Inuktitut,” in Companion Volume of the Proceedings of the Human Language Tech-
nologies: The Annual Conference of the North American Chapter of the Association
for Computational Linguistics 2003: Short Papers, pp. 43–45, 2003.
[81] M. Creutz and K. Lagus, “Induction of a simple morphology for highly-inflecting
languages,” in Proceedings of the 7th Meeting of the ACL Special Interest Group in
Computational Phonology, pp. 43–51, 2004.
[82] M. Creutz and K. Lagus, “Unsupervised models for morpheme segmentation and
morphology learning,” ACM Transactions on Speech and Language Processing, vol. 4,
no. 1, pp. 1–34, 2007.
[83] C. Monson, J. Carbonell, A. Lavie, and L. Levin, “ParaMor: Minimally supervised
induction of paradigm structure and morphological analysis,” in Proceedings of Ninth
Meeting of the ACL Special Interest Group in Computational Morphology and Phonol-
ogy, pp. 117–125, 2007.
[84] F. M. Liang, “Word Hy-phen-a-tion by Com-put-er,” PhD thesis, Stanford University,
1983.
[85] V. Demberg, “A language-independent unsupervised model for morphological segmen-
tation,” in Proceedings of the 45th Annual Meeting of the Association of Computational
Linguistics, pp. 920–927, 2007.
[86] A. Clark, “Supervised and unsupervised learning of Arabic morphology,” in Ara-
bic Computational Morphology. Knowledge-based and Empirical Methods (A. Soudi,
A. van den Bosch, and G. Neumann, eds.), vol. 38, pp. 181–200, Berlin: Springer, 2007.
[87] A. Xanthos, Apprentissage automatique de la morphologie: le cas des structures racine-
schème. Sciences pour la communication, Bern: Peter Lang, 2008.
[88] B. Snyder and R. Barzilay, “Unsupervised multilingual learning for morphological
segmentation,” in Proceedings of ACL-08: HLT, pp. 737–745, 2008.
[89] H. Poon, C. Cherry, and K. Toutanova, “Unsupervised morphological segmentation
with log-linear models,” in Proceedings of Human Language Technologies: Annual Con-
ference of the North American Chapter of the Association for Computational Linguis-
tics, pp. 209–217, 2009.
[90] S. Della Pietra, V. Della Pietra, and J. Lafferty, “Inducing features of random fields,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 4,
pp. 380–393, 1997.