Information Extraction
Full text available at: http://dx.doi.org/10.1561/1500000003
Sunita Sarawagi
Boston – Delft
ISBN: 978-1-60198-188-2
© 2007 S. Sarawagi
All rights reserved. No part of this publication may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, mechanical, photocopying, recording
or otherwise, without prior written permission of the publishers.
Photocopying. In the USA: This journal is registered at the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923. Authorization to photocopy items for internal or personal use, or the internal or personal use of specific clients, is granted by now Publishers Inc for users registered with the Copyright Clearance Center (CCC). The 'services' for users can be found on the internet at: www.copyright.com
For those organizations that have been granted a photocopy license, a separate system of payment has been arranged. Authorization does not extend to other kinds of copying, such as that for general distribution, for advertising or promotional purposes, for creating new collective works, or for resale. In the rest of the world: Permission to photocopy must be obtained from the copyright owner. Please apply to now Publishers Inc., PO Box 1024, Hanover, MA 02339, USA; Tel. +1-781-871-0245; www.nowpublishers.com; [email protected]
now Publishers Inc. has an exclusive license to publish this material worldwide. Permission
to use this content must be obtained from the copyright license holder. Please apply to now
Publishers, PO Box 179, 2600 AD Delft, The Netherlands, www.nowpublishers.com; e-mail:
[email protected]
Editor-in-Chief:
Joseph M. Hellerstein
Computer Science Division
University of California, Berkeley
Berkeley, CA
USA
[email protected]
Editors
Surajit Chaudhuri (Microsoft Research)
Ronald Fagin (IBM Research)
Minos Garofalakis (Intel Research)
Johannes Gehrke (Cornell University)
Alon Halevy (Google)
Jeffrey Naughton (University of Wisconsin)
Jignesh Patel (University of Michigan)
Raghu Ramakrishnan (Yahoo! Research)
Information Extraction
Sunita Sarawagi
Abstract
The automatic extraction of information from unstructured sources has opened up new avenues for querying, organizing, and analyzing data by drawing upon the clean semantics of structured databases and the abundance of unstructured data. The field of information extraction has its genesis in the natural language processing community, where the primary impetus came from competitions centered around the recognition of named entities, such as people and organization names, from news articles. As society became more data-oriented, with easy online access to both structured and unstructured data, new applications of structure extraction emerged. There is now interest in converting our personal desktops to structured databases, converting the knowledge in scientific publications to structured records, and harnessing the Internet for structured fact-finding queries. Consequently, many different communities of researchers bring techniques from machine learning, databases, information retrieval, and computational linguistics to bear on various aspects of the information extraction problem.
This review is a survey of over two decades of information extraction research from these diverse communities. We create a taxonomy of the field along various dimensions derived from the nature of the extraction task, the techniques used for extraction, the variety of input resources exploited, and the type of output produced. We elaborate on rule-based and statistical methods for entity and relationship extraction. In each case we highlight the different kinds of models for capturing the diversity of clues driving the recognition process, and the algorithms for training and efficiently deploying the models. We survey techniques for optimizing the various steps in an information extraction pipeline, adapting to dynamic data, integrating with existing entities, and handling uncertainty in the extraction process.
Contents
1 Introduction 1
1.1 Applications 2
1.2 Organization of the Survey 6
1.3 Types of Structure Extracted 7
1.4 Types of Unstructured Sources 11
1.5 Input Resources for Extraction 12
1.6 Methods of Extraction 16
1.7 Output of Extraction Systems 17
1.8 Challenges 17
4 Relationship Extraction 55
4.1 Predicting the Relationship Between a Given Entity Pair 56
4.2 Extracting Entity Pairs Given a Relationship Type 65
6 Concluding Remarks 99
Acknowledgments 103
References 105
1 Introduction
1.1 Applications
Structure extraction is useful in a diverse set of applications. We list a
representative subset of these, categorized along whether the applica-
tions are enterprise, personal, scientific, or Web-oriented.
The popular MUC [57, 100, 198] and ACE [1] competitions are based on the extraction of structured entities, like people and company names, and relations between them, such as "is-CEO-of". Other popular tasks are tracking disease outbreaks [99] and terrorist events from news sources. Consequently, there are several research publications [71, 98, 209] and many research prototypes [10, 73, 99, 181] that target the extraction of named entities and their relationships from news articles. Two recent applications of information extraction on news articles are the automatic creation of multimedia news by integrating video and pictures of the entities and events annotated in the news articles,1 and the hyperlinking of news articles to background information on people, locations, and companies.2
Customer Care: Any customer-oriented enterprise collects many forms of unstructured data from customer interactions; for effective management these have to be closely integrated with the enterprise's own structured databases and business ontologies. This has given rise to many interesting extraction problems, such as the identification of product names and product attributes from customer emails, the linking of customer emails to specific transactions in a sales database [19, 44], the extraction of merchant names and addresses from sales invoices [226], the extraction of repair records from insurance claim forms [168], the extraction of customer moods from phone conversation transcripts [112], and the extraction of product attribute-value pairs from textual product descriptions [97].
Data Cleaning: An essential step in all data warehouse cleaning processes is converting addresses stored as flat strings into structured forms such as road name, city, and state. Large customer-oriented organizations like banks, telephone companies, and universities store millions of addresses. In their original form, these addresses have little explicit structure, and often there are different address records stored in different databases for the same person. During warehouse construction, it is necessary to put all these addresses in a standard canonical format in which all the different fields are identified and duplicates
1 http://spotlight.reuters.com/.
2 http://www.linkedfacts.com.
removed. An address record broken into its structured fields not only enables better querying; it also provides a more robust basis for deduplication and householding, a process that identifies all addresses belonging to the same household [3, 8, 25, 187].
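As a concrete sketch of the canonicalization-then-grouping step described above, the toy code below lowercases segmented address fields, expands a few abbreviations, and groups records that share a canonical key, roughly the spirit of householding. The field names, the tiny abbreviation table, and the grouping key are illustrative assumptions, not the methods of the cited systems.

```python
# Toy sketch: canonicalize segmented address records, then group likely
# household duplicates by a normalized (street, city) key. All rules
# here are assumptions made for illustration only.
from collections import defaultdict

ABBREV = {"rd": "road", "st": "street", "ave": "avenue"}  # assumed expansions

def canonicalize(record):
    """Lowercase every field and expand a few common abbreviations."""
    return {
        field: " ".join(ABBREV.get(t, t) for t in value.lower().split())
        for field, value in record.items()
    }

def household_groups(records):
    """Group records that share the same canonical (street, city) key."""
    groups = defaultdict(list)
    for rec in records:
        canon = canonicalize(rec)
        groups[(canon["street"], canon["city"])].append(rec)
    return groups

records = [
    {"name": "R. Fagin", "street": "222 Rosewood Dr", "city": "Danvers"},
    {"name": "Ronald Fagin", "street": "222 rosewood dr", "city": "danvers"},
]
groups = household_groups(records)
print(len(groups))  # both records fall under one household key
```

Real systems replace the toy normalization with learned segmentation models and approximate string matching, but the canonical-key structure of the computation is the same.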
Classified Ads: Classified ads and other listings, such as restaurant lists, are another domain with implicit structure that, when exposed, can be invaluable for querying. Many researchers have specifically targeted such record-oriented data in their extraction research [150, 156, 157, 195].
1.3.1 Entities
Entities are typically noun phrases and comprise one to a few tokens in the unstructured text. The most popular form of entities is named entities, like the names of persons, locations, and companies, as popularized in the MUC [57, 100], ACE [1, 159], and CoNLL [206] competitions. Named entity recognition was first introduced in the sixth MUC [100] and consisted of three subtasks: proper names and acronyms of persons, locations, and organizations (ENAMEX), absolute temporal terms (TIMEX), and monetary and other numeric expressions (NUMEX). The term entities has since been expanded to also include generics like disease names, protein names, paper titles, and journal names. The ACE competition for entity relationship extraction from natural language text lists more than 100 different entity types.
Figures 1.1 and 1.2 present examples of entity extractions: Figure 1.1 shows the classical IE task of extracting person, organization, and location entities from news articles; Figure 1.2 shows an example where entity extraction can be treated as a problem of segmenting a text record into structured entities. In this case an address string is segmented so as to identify six structured entities. More examples of segmentation of addresses from diverse geographical locations appear in Table 1.1.
Fig. 1.1 Traditional named entity and relationship extraction from plain text (in this case a news article). The extracted entities are bold-faced with the entity type annotated alongside.
Fig. 1.2 Text segmentation as an example of entity extraction from address records.
We cover techniques for entity extraction in Sections 2 and 3.
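A minimal illustration of treating address extraction as segmentation is the hand-written pattern below. Real extractors use the learned rule-based or statistical models covered in Sections 2 and 3; the single pattern here fits only a simple US-style format and is an assumption made purely for illustration.

```python
# Toy sketch: segment an address string into structured fields with one
# hand-written pattern. Only a simple US-style layout is handled; the
# pattern is an illustrative assumption, not a real system's grammar.
import re

ADDRESS = re.compile(
    r"(?P<house_number>\d+)\s+"
    r"(?P<street>.+?),\s*"
    r"(?P<city>[A-Za-z ]+),\s*"
    r"(?P<state>[A-Z]{2})\s+"
    r"(?P<zip>\d{5})"
)

def segment_address(address):
    """Split an address into five structured fields, or None on no match."""
    m = ADDRESS.match(address)
    return m.groupdict() if m else None

print(segment_address("222 Rosewood Drive, Danvers, MA 01923"))
```

The brittleness of such patterns on the diverse international formats of Table 1.1 is precisely what motivates the learned segmentation models discussed later.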
1.3.2 Relationships
Relationships are defined over two or more entities related in a predefined way. Examples are the "is employee of" relationship between a person and an organization, the "is acquired by" relationship between pairs of companies, the "location of outbreak" relationship between a disease and a location, and the "is price of" relationship between a product name and a currency amount on a web page. Figure 1.1 shows instances of the extraction of two relationships from a news article. The extraction of relationships differs from the extraction of entities in one significant way: whereas entities refer to a sequence of words in the source and can be expressed as annotations on the source, relationships are not annotations on a subset of words. Instead, they express the associations between two separate text snippets representing the entities.
Table 1.1 Sample addresses from different countries. The first line shows the unformatted address and the second line shows the address broken into its elements.
The extraction of multi-way relationships is often referred to as record extraction. A popular subtype of record extraction is event extraction. For example, for an event such as a disease outbreak we extract a multi-way relationship involving the "disease name", "location of the outbreak", "number of people affected", "number of people killed", and "date of outbreak". Some record extraction tasks are trivial because the unstructured string implies a fixed set of relationships. For example, for addresses, the relation "is located in" is implied between an extracted street name and city name.
In Section 4, we cover techniques for relationship extraction, concentrating mostly on binary relationships.
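One classic way to extract binary relationships is with surface patterns in the style of Hearst's hyponym patterns [105]. The sketch below implements a single "X such as Y" pattern; the pattern and the crude noun matching are simplifications for illustration, not a faithful reimplementation of the cited work.

```python
# Toy sketch of pattern-based binary relationship extraction in the
# style of Hearst's hyponym patterns [105]. One pattern, crude noun
# matching: simplifications chosen for illustration only.
import re

SUCH_AS = re.compile(r"((?:\w+ )?\w+) such as ((?:\w+, )*\w+(?: and \w+)?)")

def extract_isa_pairs(sentence):
    """Return (instance, class) pairs matched by the 'such as' pattern."""
    m = SUCH_AS.search(sentence)
    if not m:
        return []
    cls, instance_list = m.group(1), m.group(2)
    instances = re.split(r", and |, | and ", instance_list)
    return [(inst, cls) for inst in instances if inst]

print(extract_isa_pairs("named entities such as persons, locations and companies"))
```

Note how a single sentence yields several relationship instances at once; statistical extractors replace the fixed pattern with learned models over many such contextual clues.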
Another form of multi-way relationship popular in the natural language community is Semantic Role Labeling [124], where, given a sentence and a target verb, the task is to identify the phrases that serve as arguments of the verb and label them with their semantic roles.
Open-Ended Sources: Recently [14, 37, 86, 192], there has been interest in extracting instances of relationships and entities from open domains such as the Web, where little can be expected in terms of homogeneity or consistency. In such settings, one important technique is to exploit the redundancy of the extracted information across many different sources. We discuss extraction from such sources in the context of relationship extraction in Section 4.2.
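A minimal sketch of the redundancy idea: count how many independent sources yield the same candidate fact and rank candidates by that count. The triple representation and the once-per-source counting rule are illustrative assumptions, not the scoring model of any cited system.

```python
# Toy sketch of redundancy as a confidence signal for open-domain
# extraction: a candidate fact extracted independently from many
# sources ranks higher. Counting each fact at most once per source
# guards against one page repeating an error many times.
from collections import Counter

def rank_by_redundancy(extractions_per_source):
    """Rank candidate facts by how many distinct sources yield them."""
    counts = Counter()
    for facts in extractions_per_source:
        counts.update(set(facts))  # at most one vote per source
    return counts.most_common()

sources = [
    {("Paris", "capital-of", "France")},
    {("Paris", "capital-of", "France"), ("Lyon", "capital-of", "France")},
    {("Paris", "capital-of", "France")},
]
print(rank_by_redundancy(sources)[0])  # the Paris triple wins with 3 votes
```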
analysis community [139] and elsewhere [40, 85, 157, 191]. We will not
discuss these in this survey.
1.8 Challenges
Large-scale deployments of information extraction models raise many challenges of accuracy, performance, maintainability, and usability that we elaborate on next.
1.8.1 Accuracy
The foremost challenge facing the research community, in spite of more
than two decades of research in the field, is designing models that
achieve high accuracy of extraction. We list some of the factors that
contribute to the difficulty of achieving high accuracy in extraction
tasks.
Diversity of Clues: The inherent complexity of the recognition task makes it crucial to combine evidence from a diverse set of clues, each of which could individually be very weak. Even the simplest and most well-explored of tasks, named entity recognition, depends on a myriad of clues, including the orthographic properties of the words, their parts of speech, similarity with an existing database of entities, the presence of specific signature words, and so on. Optimally combining these different clues is a challenge in itself.
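The clues above are typically encoded as per-token features that a statistical model can weigh against one another. The sketch below computes a few orthographic and dictionary features; the feature set and the word lists are illustrative assumptions, not a complete real-world configuration.

```python
# Toy sketch: a few weak per-token clues of the kinds listed above
# (orthography, dictionary match, signature words). The feature set
# and word lists are assumptions chosen purely for illustration.
SIGNATURE_WORDS = {"inc", "corp", "university", "mr", "dr"}  # assumed list

def token_features(token, entity_dictionary):
    """Return a handful of weak clues about one token."""
    return {
        "is_capitalized": token[:1].isupper(),
        "is_all_caps": token.isupper(),
        "has_digit": any(c.isdigit() for c in token),
        "in_entity_dictionary": token.lower() in entity_dictionary,
        "is_signature_word": token.lower() in SIGNATURE_WORDS,
    }

print(token_features("Google", {"google", "ibm"}))
```

Each feature alone is unreliable (capitalization also marks sentence starts, dictionaries are incomplete), which is exactly why the combination has to be learned rather than hand-tuned.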
Section Layout
The rest of the survey is organized as follows. In Section 2, we cover
rule-based techniques for entity extraction. In Section 3, we present
an overview of statistical methods for entity extraction. In Section 4,
we cover statistical and rule-based techniques for relationship extraction. In Section 5, we discuss work on handling various performance
and systems issues associated with creating an operational extraction
system.
References
[1] ACE, "Annotation guidelines for entity detection and tracking," 2004.
[2] E. Agichtein, “Extracting relations from large text collections,” PhD thesis,
Columbia University, 2005.
[3] E. Agichtein and V. Ganti, “Mining reference tables for automatic text
segmentation,” in Proceedings of the Tenth ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, Seattle, USA,
2004.
[4] E. Agichtein and L. Gravano, “Snowball: Extracting relations from large plain-
text collections,” in Proceedings of the 5th ACM International Conference on
Digital Libraries, 2000.
[5] E. Agichtein and L. Gravano, “Querying text databases for efficient informa-
tion extraction,” in ICDE, 2003.
[6] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo, “Fast
discovery of association rules,” in Advances in Knowledge Discovery and Data
Mining, (U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy,
eds.), ch. 12, pp. 307–328, AAAI/MIT Press, 1996.
[7] J. Aitken, “Learning information extraction rules: An inductive logic program-
ming approach,” in Proceedings of the 15th European Conference on Artificial
Intelligence, pp. 355–359, 2002.
[8] R. Ananthakrishna, S. Chaudhuri, and V. Ganti, "Eliminating fuzzy duplicates in data warehouses," in International Conference on Very Large Databases (VLDB), 2002.
[9] R. Ando and T. Zhang, “A framework for learning predictive structures from
multiple tasks and unlabeled data,” Journal of Machine Learning Research,
vol. 6, pp. 1817–1853, 2005.
[90] I. P. Fellegi and A. B. Sunter, "A theory for record linkage," Journal of the American Statistical Association, vol. 64, pp. 1183–1210, 1969.
[91] D. Ferrucci and A. Lally, "UIMA: An architectural approach to unstructured information processing in the corporate research environment," Natural Language Engineering, vol. 10, nos. 3–4, pp. 327–348, 2004.
[92] J. R. Finkel, T. Grenager, and C. Manning, "Incorporating non-local information into information extraction systems by Gibbs sampling," in Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), 2005.
[93] J. R. Finkel, T. Grenager, and C. D. Manning, "Incorporating non-local information into information extraction systems by Gibbs sampling," in ACL, 2005.
[94] G. W. Flake, E. J. Glover, S. Lawrence, and C. L. Giles, "Extracting query modifications from nonlinear SVMs," in WWW, pp. 317–324, 2002.
[95] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby, “Selective sampling using
the query by committee algorithm,” Machine Learning, vol. 28, nos. 2–3,
pp. 133–168, 1997.
[96] W. Gatterbauer, P. Bohunsky, M. Herzog, B. Krüpl, and B. Pollak, “Towards
domain-independent information extraction from web tables,” in WWW ’07:
Proceedings of the 16th International Conference on World Wide Web, pp. 71–
80, ACM, 2007.
[97] R. Ghani, K. Probst, Y. Liu, M. Krema, and A. Fano, “Text mining for product
attribute extraction,” SIGKDD Explorations Newsletter, vol. 8, pp. 41–48,
2006.
[98] R. Grishman, “Information extraction: Techniques and challenges,” in SCIE,
1997.
[99] R. Grishman, S. Huttunen, and R. Yangarber, “Information extraction for
enhanced access to disease outbreak reports,” Journal of Biomedical Infor-
matics, vol. 35, pp. 236–246, 2002.
[100] R. Grishman and B. Sundheim, "Message understanding conference-6: A brief history," in Proceedings of the 16th Conference on Computational Linguistics, pp. 466–471, Morristown, NJ, USA: Association for Computational Linguistics, 1996.
[101] R. Gupta, A. A. Diwan, and S. Sarawagi, “Efficient inference with cardinality-
based clique potentials,” in Proceedings of the 24th International Conference
on Machine Learning (ICML), USA, 2007.
[102] R. Gupta and S. Sarawagi, “Curating probabilistic databases from information
extraction models,” in Proceedings of the 32nd International Conference on
Very Large Databases (VLDB), 2006.
[103] J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha, and A. Crespo, "Extracting semistructured information from the web," in Workshop on Management of Semistructured Data, 1997.
[104] B. He, M. Patel, Z. Zhang, and K. C.-C. Chang, “Accessing the deep web,”
Communications on ACM, vol. 50, pp. 94–101, 2007.
[105] M. A. Hearst, “Automatic acquisition of hyponyms from large text corpora,”
in Proceedings of the 14th Conference on Computational Linguistics, pp. 539–
545, 1992.
[106] C.-N. Hsu and M.-T. Dung, “Generating finite-state transducers for semistruc-
tured data extraction from the web,” Information Systems Special Issue on
Semistructured Data, vol. 23, 1998.
[107] J. Huang, T. Chen, A. Doan, and J. Naughton, On the Provenance of Non-
answers to Queries Over Extracted Data.
[108] J. Huang, A. Smola, A. Gretton, K. Borgwardt, and B. Schölkopf, “Correcting
sample selection bias by unlabeled data,” in Advances in Neural Information
Processing Systems 20, Cambridge, MA: MIT Press, 2007.
[109] M. Hurst, “The interpretation of tables in texts,” PhD thesis, University of
Edinburgh, School of Cognitive Science, Informatics, University of Edinburgh,
2000.
[110] P. G. Ipeirotis, E. Agichtein, P. Jain, and L. Gravano, “Towards a query
optimizer for text-centric tasks,” ACM Transactions on Database Systems,
vol. 32, 2007.
[111] N. Ireson, F. Ciravegna, M. E. Califf, D. Freitag, N. Kushmerick, and
A. Lavelli, “Evaluating machine learning for information extraction,” in
ICML, pp. 345–352, 2005.
[112] M. Jansche and S. P. Abney, "Information extraction from voicemail transcripts," in EMNLP '02: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, pp. 320–327, Morristown, NJ, USA: Association for Computational Linguistics, 2002.
[113] T. S. Jayram, R. Krishnamurthy, S. Raghavan, S. Vaithyanathan, and H. Zhu,
“Avatar information extraction system,” IEEE Data Engineering Bulletin,
vol. 29, pp. 40–48, 2006.
[114] J. Jiang and C. Zhai, “A systematic exploration of the feature space for rela-
tion extraction,” in Human Language Technologies 2007: The Conference of
the North American Chapter of the Association for Computational Linguistics;
Proceedings of the Main Conference, pp. 113–120, 2007.
[115] N. Kambhatla, "Combining lexical, syntactic and semantic features with maximum entropy models for information extraction," in The Companion Volume to the Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pp. 178–181, Barcelona, Spain: Association for Computational Linguistics, July 2004.
[116] S. Khaitan, G. Ramakrishnan, S. Joshi, and A. Chalamalla, “Rad: A scalable
framework for annotator development,” in ICDE, pp. 1624–1627, 2008.
[117] M.-S. Kim, K.-Y. Whang, J.-G. Lee, and M.-J. Lee, “n-gram/2l: A space
and time efficient two-level n-gram inverted index structure,” in VLDB ’05:
Proceedings of the 31st International Conference on Very Large Data Bases,
pp. 325–336, 2005.
[118] D. Klein and C. D. Manning, “Conditional structure versus conditional estima-
tion in NLP models,” in Workshop on Empirical Methods in Natural Language
Processing (EMNLP), 2002.
[119] D. Koller and N. Friedman, "Structured probabilistic models," Under preparation, 2007.
[120] V. Krishnan and C. D. Manning, “An effective two-stage model for exploiting
non-local dependencies in named entity recognition,” in ACL-COLING, 2006.
[138] I. Mansuri and S. Sarawagi, “A system for integrating unstructured data into
relational databases,” in Proceedings of the 22nd IEEE International Confer-
ence on Data Engineering (ICDE), 2006.
[139] S. Mao, A. Rosenfeld, and T. Kanungo, “Document structure analysis algo-
rithms: A literature survey,” Document Recognition and Retrieval X, vol. 5010,
pp. 197–207, 2003.
[140] B. Marthi, B. Milch, and S. Russell, “First-order probabilistic models
for information extraction,” in Working Notes of the IJCAI-2003 Work-
shop on Learning Statistical Models from Relational Data (SRL-2003),
(L. Getoor and D. Jensen, eds.), pp. 71–78, Acapulco, Mexico, August 11
2003.
[141] D. Maynard, V. Tablan, C. Ursu, H. Cunningham, and Y. Wilks, “Named
entity recognition from diverse text types,” Recent Advances in Natural Lan-
guage Processing 2001 Conference, Tzigov Chark, Bulgaria, 2001.
[142] A. McCallum, “Information extraction: Distilling structured data from
unstructured text,” ACM Queue, vol. 3, pp. 48–57, 2005.
[143] A. McCallum, D. Freitag, and F. Pereira, “Maximum entropy markov models
for information extraction and segmentation,” in Proceedings of the Interna-
tional Conference on Machine Learning (ICML-2000), pp. 591–598, Palo Alto,
CA, 2000.
[144] A. McCallum, K. Nigam, J. Reed, J. Rennie, and K. Seymore, Cora: Computer
Science Research Paper Search Engine, http://cora.whizbang.com/, 2000.
[145] A. McCallum and B. Wellner, “Toward conditional models of identity uncer-
tainty with application to proper noun coreference,” in Proceedings of the
IJCAI-2003 Workshop on Information Integration on the Web, pp. 79–86,
Acapulco, Mexico, August 2003.
[146] A. K. McCallum, Mallet: A Machine Learning for Language Toolkit.
http://mallet.cs.umass.edu, 2002.
[147] D. McDonald, H. Chen, H. Su, and B. Marshall, "Extracting gene pathway relations using a hybrid grammar: The Arizona relation parser," Bioinformatics, vol. 20, pp. 3370–3378, 2004.
[148] R. McDonald, K. Crammer, and F. Pereira, “Flexible text segmentation with
structured multilabel classification,” in HLT/EMNLP, 2005.
[149] G. Mecca, P. Merialdo, and P. Atzeni, "Araneus in the era of XML," in IEEE Data Engineering Bulletin, Special Issue on XML, IEEE, September 1999.
[150] M. Michelson and C. A. Knoblock, “Semantic annotation of unstructured and
ungrammatical text,” in Proceedings of the 19th International Joint Confer-
ence on Artificial Intelligence (IJCAI), pp. 1091–1098, 2005.
[151] M. Michelson and C. A. Knoblock, “Creating relational data from unstruc-
tured and ungrammatical data sources,” Journal of Artificial Intelligence
Research (JAIR), vol. 31, pp. 543–590, 2008.
[152] E. Minkov, R. C. Wang, and W. W. Cohen, “Extracting personal names from
email: Applying named entity recognition to informal text,” in HLT/EMNLP,
2005.
[153] R. J. Mooney and R. C. Bunescu, “Mining knowledge from text using infor-
mation extraction,” SIGKDD Explorations, vol. 7, pp. 3–10, 2005.