Search Based Applications At the Confluence of Search and Database Technologies 1st Edition Gregory Grefenstette pdf download

The document discusses the emergence of Search-Based Applications (SBAs) that integrate search engine and database technologies to address modern information management challenges. It highlights the potential of SBAs to aggregate and analyze large volumes of both structured and unstructured data, making it accessible in a contextual manner. The book aims to introduce SBAs through case studies and explore the converging trends in search and database disciplines.

Search-Based Applications
At the Confluence of Search and Database Technologies
Synthesis Lectures on
Information Concepts,
Retrieval, and Services
Editor
Gary Marchionini, University of North Carolina, Chapel Hill
Synthesis Lectures on Information Concepts, Retrieval, and Services is edited by Gary Marchionini of
the University of North Carolina. The series will publish 50- to 100-page publications on topics
pertaining to information science and applications of technology to information discovery, production,
distribution, and management. The scope will largely follow the purview of premier information and
computer science conferences, such as ASIST, ACM SIGIR, ACM/IEEE JCDL, and ACM CIKM.
Potential topics include, but are not limited to: data models, indexing theory and algorithms,
classification, information architecture, information economics, privacy and identity, scholarly
communication, bibliometrics and webometrics, personal information management, human
information behavior, digital libraries, archives and preservation, cultural informatics, information
retrieval evaluation, data fusion, relevance feedback, recommendation systems, question answering,
natural language processing for retrieval, text summarization, multimedia retrieval, multilingual
retrieval, and exploratory search.

Search-Based Applications - At the Confluence of Search and Database Technologies
Gregory Grefenstette and Laura Wilber
2010

Information Concepts: From Books to Cyberspace Identities
Gary Marchionini
2010

Estimating the Query Difficulty for Information Retrieval
David Carmel and Elad Yom-Tov
2010

iRODS Primer: Integrated Rule-Oriented Data System
Arcot Rajasekar, Reagan Moore, Chien-Yi Hou, Christopher A. Lee, Richard Marciano, Antoine de
Torcy, Michael Wan, Wayne Schroeder, Sheau-Yen Chen, Lucas Gilbert, Paul Tooby, and Bing Zhu
2010

Collaborative Web Search: Who, What, Where, When, and Why
Meredith Ringel Morris and Jaime Teevan
2009

Multimedia Information Retrieval
Stefan Rueger
2009

Online Multiplayer Games
William Sims Bainbridge
2009

Information Architecture: The Design and Integration of Information Spaces
Wei Ding and Xia Lin
2009

Reading and Writing the Electronic Book
Catherine C. Marshall
2009

Hypermedia Genes: An Evolutionary Perspective on Concepts, Models, and Architectures
Nuno M. Guimarães and Luís M. Carriço
2009

Understanding User-Web Interactions via Web Analytics
Bernard J. (Jim) Jansen
2009

XML Retrieval
Mounia Lalmas
2009

Faceted Search
Daniel Tunkelang
2009

Introduction to Webometrics: Quantitative Web Research for the Social Sciences
Michael Thelwall
2009

Exploratory Search: Beyond the Query-Response Paradigm
Ryen W. White and Resa A. Roth
2009

New Concepts in Digital Reference
R. David Lankes
2009

Automated Metadata in Multimedia Information Systems: Creation, Refinement, Use in
Surrogates, and Evaluation
Michael G. Christel
2009
Copyright © 2011 by Morgan & Claypool

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations in
printed reviews, without the prior permission of the publisher.

Search-Based Applications - At the Confluence of Search and Database Technologies


Gregory Grefenstette and Laura Wilber
www.morganclaypool.com

ISBN: 9781608455072 paperback


ISBN: 9781608455089 ebook

DOI 10.2200/S00320ED1V01Y201012ICR017

A Publication in the Morgan & Claypool Publishers series


SYNTHESIS LECTURES ON INFORMATION CONCEPTS, RETRIEVAL, AND SERVICES

Lecture #17
Series Editor: Gary Marchionini, University of North Carolina, Chapel Hill
Series ISSN
Synthesis Lectures on Information Concepts, Retrieval, and Services
Print 1947-945X Electronic 1947-9468
Search-Based Applications
At the Confluence of Search and Database Technologies

Gregory Grefenstette and Laura Wilber


Exalead, S.A.

SYNTHESIS LECTURES ON INFORMATION CONCEPTS, RETRIEVAL, AND SERVICES #17

Morgan & Claypool Publishers
ABSTRACT
We are poised at a major turning point in the history of information management via computers.
Recent evolutions in computing, communications, and commerce are fundamentally reshaping the
ways in which we humans interact with information, and generating enormous volumes of electronic
data along the way. As a result of these forces, what will data management technologies, and their
supporting software and system architectures, look like in ten years? It is difficult to say, but we can
see the future taking shape now in a new generation of information access platforms that combine
strategies and structures of two familiar – and previously quite distinct – technologies, search engines
and databases, and in a new model for software applications, the Search-Based Application (SBA),
which offers a pragmatic way to solve both well-known and emerging information management
challenges today. Search engines are the world’s most familiar and widely deployed information
access tools, used by hundreds of millions of people every day to locate information on the Web, but
few are aware they can now also be used to provide precise, multidimensional information access
and analysis that is hard to distinguish from current database applications, yet endowed with the
usability and massive scalability of Web search. In this book, we hope to introduce Search-Based
Applications to a wider audience, using real case studies to show how this flexible technology can be
used to intelligently aggregate large volumes of unstructured data (like Web pages) and structured
data (like database content), and to make that data available in a highly contextual, quasi real-time
manner to a wide base of users for a varied range of purposes. We also hope to shed light on the
general convergences underway in search and database disciplines, convergences that make SBAs
possible, and which serve as harbingers of information management paradigms and technologies to
come.

KEYWORDS
search-based applications, search engines, semantic technologies, natural language processing,
human-computer information retrieval, data retrieval, online analytical processing, OLAP,
data integration, alternative data access platforms, unified information access, NoSQL,
mash-up technologies

Contents
Acknowledgments

Glossary

1 Search Based Applications
1.1 Introduction
1.1.1 What is a Search Based Application?
1.2 High Impact, Low Risk Solution for Businesses
1.3 Fertile Ground for Interdisciplinary Research
1.4 A Valuable Tool for Database Administrators
1.5 New Opportunities for Search Specialists
1.6 New Flexibility for Software Developers
1.6.1 Lecture Roadmap

2 Evolving Business Information Access Needs
2.1 Changing Times
2.2 The Need for High Performance and Scalability
2.3 The Need for Unified Access to Global Information
2.4 The Need for Simple Yet Secure Access

3 Origins and Histories
3.1 Search Engines
3.2 Databases
3.3 What has Changed Recently
3.3.1 Search Engines Enter the Enterprise
3.3.2 Databases Go Online
3.3.3 Structural and Conceptual Changes

4 Data Models & Storage
4.1 Search Engines
4.1.1 Conceptual Data Model
4.1.2 Data Storage
4.1.3 Storage Framework
4.2 Databases
4.2.1 Conceptual Data Model
4.2.2 Data Storage
4.2.3 Storage Framework
4.3 What has Changed Recently
4.3.1 Search Engines
4.3.2 Databases

5 Data Collection/Population
5.1 Search Engines
5.1.1 Collection
5.1.2 Updating
5.2 Databases
5.2.1 Creation/Collection
5.2.2 Updating
5.3 What has Changed
5.3.1 Search Engines
5.3.2 Databases

6 Data Processing
6.1 Search Engines
6.1.1 Natural Language Processing
6.1.2 Relevancy Criteria
6.2 Databases
6.3 What has Changed
6.3.1 Search Engines
6.3.2 Databases

7 Data Retrieval
7.1 Search Engines
7.1.1 Querying
7.1.2 Output
7.2 Databases
7.2.1 Querying
7.2.2 Output
7.3 What’s Changed?
7.3.1 Search Engines
7.3.2 Databases

8 Data Security, Usability, Performance, Cost
8.1 Search Engines
8.2 Databases
8.3 What has Changed
8.3.1 Search Engines

9 Summary Evolutions and Convergences
9.1 SBA-Enabling Search Engine Evolutions
9.1.1 Data Model
9.1.2 Data Storage
9.1.3 Data Collection
9.1.4 Data Processing
9.1.5 Data Retrieval & Output
9.1.6 Data Security, Usability, Performance, Cost
9.2 Convergence

10 SBA Platforms
10.1 What is an SBA Platform?
10.2 Information Access Platforms
10.3 SBA Platforms: Market Leaders
10.4 SBA Platforms: Other Vendors
10.5 SBA Vendors: COTS Applications

11 SBA Uses & Preconditions
11.1 When Are SBAs Used?
11.2 How Are SBAs Used?

12 Anatomy of a Search Based Application
12.1 SBAs for Structured Data
12.1.1 Data Collection
12.1.2 Data Processing
12.1.3 Data Updates
12.1.4 Data Retrieval & Analysis
12.2 SBAs for Unstructured Content
12.2.1 Data Collection
12.2.2 Data Processing
12.2.3 Data Updates
12.2.4 Data Retrieval & Analysis
12.3 SBAs for Hybrid Content

13 Case Study: GEFCO
13.1 Background
13.2 A Track & Trace Solution
13.3 Existing Drawbacks
13.4 Opting for a Search Based Application
13.5 First prototypes
13.6 Deployment
13.7 Future

14 Case Study: Urbanizer
14.1 Background
14.2 The Urbanizer Solution
14.3 How Urbanizer Works
14.4 What’s Next

15 Case Study: National Postal Agency
15.1 Customer Service SBA
15.1.1 Background
15.1.2 Deployment
15.2 Operational Business Intelligence (OBI) SBA
15.2.1 Background
15.2.2 Deployment
15.3 Sales Information SBA for Telemarketing
15.3.1 Background
15.3.2 Deployment

16 Future Directions
16.1 The Influence of the Deep Web
16.1.1 Surfacing Structured Data
16.1.2 Opening Access to Multimedia Content
16.2 The Influence of the Semantic Web
16.3 The Influence of the Mobile Web
16.3.1 Mission-Based IR
16.3.2 Innovation in Visualization
16.4 ...And Continuing Database/Search Convergence

Bibliography

Authors’ Biographies


Acknowledgments
We would like to thank Gary Marchionini and Diane Cerra for inviting us to participate in
this timely and important lecture series, with a special thank you to Diane for her assistance and
patience in guiding us through the publication process. We would also like to thank Morgan &
Claypool’s reviewers, including Susan Feldman, Stephen Arnold and John Tait, for their thoughtful
suggestions and comments on our manuscript. Ms. Feldman and Mr. Arnold are constant sources
of insight for all of us working in search and information access-related disciplines, and we welcome
Mr. Tait’s remarks based on his long IR research experience at the University of Sunderland and his
more recent efforts at advancing research in IR for patents and other large scale collections at the
Information Retrieval Facility.
In addition, we are grateful to our colleagues and managers at Exalead for allowing us time
to work on this lecture, and for providing valuable feedback on our draft manuscript, especially
Olivier Astier, Stéphane Donzé and David Thoumas. We would also like to thank our partners
and customers. They are the source of the examples provided in this book, and they have played a
pioneering role in expanding the boundaries of applied search technologies, in general, and search-
based applications, in particular.
Finally, we would like to thank our families. Their love sustains us in all we do, and we dedicate
this book to them.

Gregory Grefenstette and Laura Wilber


December 2010
Glossary
ACID Constraints on a database for achieving Atomicity, Consistency, Isolation
and Durability

Agility The ease with which a computer application can be altered, improved, or
extended

API Application Programming Interface, specifies how to call a computer program,
what arguments to use, and what you can expect as output

Application layer Part of the Open System Interconnection model, in which an
application interacts with a human user, or another application

Atomicity The idea that a database transaction either succeeds or fails in its entirety

Availability The percentage of time that data can be read or used.

Batch A computer task that is programmed to run at a certain time (usually at
night) with no human intervention

B2C Business to Customer; B2C websites offer goods or services directly to users

B+ tree A block-oriented data structure for efficient insertion and removal of data
nodes

BI Business Intelligence, views on data that aid users with business planning
and decision making

BigTable An internal data storage system used by Google, handles multidimensional
key-value pairs

BSON Binary JSON


Business application Any information processing application used in running a business

Cache A rapid computer memory where frequently or recently used data is
temporarily stored

CAP theorem One cannot achieve Consistency, Availability, and Partition tolerance
at the same time

Category A flat or hierarchic semantic dimension added to a document, or part of a
document

Categorization Assigning, usually through statistical means, one or more categories to text

CDM Customer Data Management

Cloud services Computer applications that are executed on computers outside the
enterprise rather than in-house. Examples are SalesForce, Google Apps, Yahoo
mail, etc.

Clustering Grouping documents according to content similarity

CMS Content Management System

Consistency A quality of an information system in which only valid data is recorded; that
is, there are not two conflicting versions of the same data

Connector A program that extracts information from a certain file format, or from a
database

Consolidation Making all the data concerning one entity available in one output

COTS Commercial off-the-shelf software

Crawl Fetching web pages for indexing by following URLs found in each page

CRM Customer Relationship Management, applications used by businesses to
interact with customers

CSIS Customer Service Information System

Data integration Merging data from different data sources or different information systems

Data mart A subset of data found in an enterprise information system, relevant for a
specific group or purpose

Data warehouse A database which is used to consolidate data from disparate sources

DBA Database administrator, the person who is responsible for maintaining (and
often designing) an organization’s database(s)

Deep Web Web pages that are dynamically generated as a result of form input and/or
database querying

Directory A listing of the files or websites in a particular storage system

DIS Decision Intelligence System, a computer-based system for helping decision


making

Document model A model of seeing a database entity as a single persistent document,
composed of typed fields and categories corresponding to the entity’s attributes

Dublin Core Metadata A standard for metadata associated with documents, such as Title,
Creator, Publisher, etc.

Durability A database quality that means that successfully completed transactions must
persist (or be recoverable) in the case of a system failure

EDI Electronic Data Interchange, an early database communication system

ETL Extract-Transform-Load, any method for extracting all or part of a database
and storing it in another database

Enterprise Search Searching access-controlled, structured and unstructured data found
within the enterprise

ERP Enterprise Resource Planning

Evolutive Data Model Model that can be easily extended with new fields or data types
without rebuilding the entire data structure

Facet A dimension of meaning that can be used for restricting search; for example,
shirts and coats are two facets that could be found on a shopping site

Field A labeled part of a document in a search engine. Fields can be typed to
contain text, numbers, dates, GPS coordinates, or categories

Firewall A computer-implemented protection that isolates internal company data
from outside access

File server A service that provides sequential or direct access to computer files

Full-text engine A system for searching any of the words found in documents, rather
than just a set of manually assigned keywords

Garbage collection A process for recovering memory, usually by recognizing deleted or
out-of-date data

Gartner An information technology research and advisory firm that reports on
technology issues

GPS Global Positioning System, a system of satellites for geolocating a point on
the globe

Hash table Hashing converts a data item into a single number, and the hash table maps
this number to a list of items
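
The definition above can be illustrated with a minimal Python sketch. The hash function (a sum of character codes), the bucket count, and the names `bucket_index` and `TinyHashTable` are illustrative assumptions, not taken from the book; real hash tables use stronger hash functions and resize as they fill.

```python
def bucket_index(item: str, n_buckets: int = 8) -> int:
    """Convert a data item into a single number (its bucket position)."""
    # Assumption: a deliberately simple, deterministic hash for illustration.
    return sum(ord(ch) for ch in item) % n_buckets


class TinyHashTable:
    """Maps each hash number to a list of the items that share it."""

    def __init__(self, n_buckets: int = 8):
        self.n_buckets = n_buckets
        self.buckets = [[] for _ in range(n_buckets)]

    def add(self, item: str) -> None:
        bucket = self.buckets[bucket_index(item, self.n_buckets)]
        if item not in bucket:
            bucket.append(item)

    def __contains__(self, item: str) -> bool:
        # Only one bucket is scanned, not the whole collection.
        return item in self.buckets[bucket_index(item, self.n_buckets)]


table = TinyHashTable()
for word in ["search", "database", "index"]:
    table.add(word)
```

Items with the same hash number (here, for instance, "ab" and "ba") land in the same list, which is why each bucket holds a list rather than a single item.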

Heuristics Methods based more on demonstrated performance than on theory; weighting
words by their inverse frequency in a collection is an example
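
As a rough illustration of the inverse-frequency example, the Python sketch below weights each word by log(N / df), where N is the number of documents and df is the number of documents containing the word. This particular formula and the name `inverse_frequency_weights` are assumptions chosen for illustration; the glossary does not specify a formula, and several variants exist.

```python
import math


def inverse_frequency_weights(docs):
    """Return a weight per word: rarer words in the collection weigh more."""
    n_docs = len(docs)
    doc_freq = {}
    for text in docs:
        # Count each word once per document, however often it occurs there.
        for word in set(text.lower().split()):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    # Assumed formulation: log(N / df); a word in every document gets weight 0.
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


docs = ["the search engine", "the database", "the index slice"]
weights = inverse_frequency_weights(docs)
```

In this toy collection, "the" occurs in every document and so receives the lowest weight, while words found in a single document receive the highest.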

HTTP HyperText Transfer Protocol, an application layer protocol for accessing
web pages

IDC International Data Corporation, a global provider of market intelligence
and analysis concerning information technology

ILM Information Lifecycle Management

IMAP Internet Message Access Protocol, a protocol for retrieving emails from a mail server

Index, inverted A data structure that contains lists of words with pointers to where
the words are found in documents
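
A minimal Python sketch of such a structure follows. The whitespace tokenization, the (document id, word position) pointer format, and the name `build_inverted_index` are assumptions made for illustration; production search engines use far more compact encodings.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to a list of (doc_id, position) pointers."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        # Assumption: lowercase whitespace tokenization is good enough here.
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return index


docs = {
    1: "search engines index documents",
    2: "databases index rows",
}
index = build_inverted_index(docs)
```

Looking up `index["index"]` returns the pointers for both documents without scanning either text again.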

Index slice One section of an inverted index which can be distributed over many
different computer stores

Intranet A secure network that gives authorized users Web-style access to an
organization’s information assets (e.g., internal documents and web pages)

IR Information Retrieval, the study of how to index and retrieve information,
usually from unstructured text

IS Information System, a generic term for any computer system for storing and
retrieving information

Isolation The database constraint specifying that data involved in a transaction are
isolated from (inaccessible to) other transactions until the transaction is
completed to avoid conflicts and overwrites

IT Information Technology, a generic term covering all aspects of using
computers to store and manipulate information

JDBC Java Database Connectivity, a Java version of ODBC


Join In a relational database, gathering together data contained in different tables

JSON JavaScript Object Notation, a standard for exchanging data between systems

Key-value store A data storage and retrieval system in which a key (identifying an
entity) is linked to the one or more values associated with that entity. This allows
rapid lookup of values associated with an entity, but does not allow joins on
other fields
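
The trade-off described above can be sketched in a few lines of Python, using an in-memory dictionary as a stand-in for a key-value store. The key format "customer:42" and the helper names `put`, `get`, and `find_by_field` are hypothetical, chosen only for illustration.

```python
store = {}


def put(key, **values):
    """Link a key (identifying an entity) to one or more values."""
    store.setdefault(key, {}).update(values)


def get(key):
    """Rapid lookup: a single access by key, no scan of other entries."""
    return store.get(key, {})


def find_by_field(field, wanted):
    """Without an index on other fields, the only option is a full scan."""
    return [key for key, values in store.items() if values.get(field) == wanted]


put("customer:42", name="Ada", city="Paris")
put("customer:42", plan="gold")
put("customer:7", name="Bob", city="Lyon")
```

Here `get("customer:42")` is a direct lookup, whereas `find_by_field("city", "Paris")` must visit every entry, which is the cost the definition alludes to.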

Mash-up A software application that dynamically aggregates information from many
different sources, or output from many processes, in a single screen

MDM Master Data Management, a system of policies, processes and technologies
designed to maintain the accuracy and consistency of essential data across
many data silos

Metadata Typed data associated with a document, for example, Author, Date, Category

Mobile Web Web pages accessible through a mobile device such as a smartphone

MySQL A popular open source relational database

Normalized relational schema A model for a relational database that is designed to
prevent redundancies that can cause anomalies when inserting, updating, and deleting data

NoSQL Not Only SQL, an umbrella term for large scale data storage and retrieval
systems that use structures and querying methodologies that are different
from those of relational database systems

OBI Operational Business Intelligence, data reporting and analysis that supports
decision making concerning routine, day-to-day operations

OCR Optical Character Recognition, a technology used for converting paper
documents or text encapsulated in images into electronic text, usually with some
noise caused by the conversion

ODBC Open Database Connectivity, a middleware for enabling and managing
exchanges between databases

Offloading Extracting information from a database application and storing it in a search
engine application

OLAP Online Analytical Processing, tools for analyzing data in databases

OLTP Online Transaction Processing

Ontology A taxonomy with rules that can deduce links not necessarily present in the
taxonomy

Partition tolerance Means that a distributed database can still function if some of
its nodes are no longer available

Performance The measure of a computer application’s rapidity, throughput, availability,
or resource utilization

PHP PHP: Hypertext Preprocessor, a language for programming web pages

PLM Product Lifecycle Management, systems which allow for the management
of a product from design to retirement

Plug-and-play Modules that can be used without any reprogramming, “out of the box”

POC Proof of concept, an application that proves that something can be done,
though it may not be optimized for performance

Portal A web interface to a data source

Primary key In a relational database, a value corresponding to a unique entity, that allows
tables to be joined for a given entity

RDBMS Relational database management system

Redundancy Storing the same data in two different places in a database or information
system. This can cause problems of consistency if one of the values is changed
and not the other

Relational model A model for databases in which data is represented as tables. Some
values, called primary keys, link tables together

Relevancy For a given query, a heuristically determined score of the supposed pertinence
of a document to the query

REST Representational State Transfer, an architectural style used in web services, in
which no state is preserved between requests, and in which every operation of reading
or writing is self-sufficient

RFID Radio Frequency Identification, systems using embedded chips to transmit
information

RSS Really Simple Syndication, an XML format for transmitting frequently
updated data

R tree An efficient data structure for storing GPS-indexed points and finding all
the points in a given radius around a point

RDF Resource Description Framework, a format for representing data as sets of
triples, used in semantic web representations

SBA Search Based Application, an information access or analysis application
built on a search engine, rather than on a database

SCM Supply Chain Management

Scalability The desirable quality of being able to treat larger and larger data sets without
a decrease in performance, or rise in cost

Search engine A computer program for indexing and searching in documents

Semantic Web Collection of web pages that are annotated with machine readable
descriptions of their content

Semi-structured data Data found in places where the data type can be surmised, such
as in explicitly labeled metadata, or in structured tables on web pages

SEO Search engine optimization, strategies that help a web page owner to
improve a site’s ranking in common web search engines

SERP Search engine results page, the output of a query to a search engine

Silo An imagery-filled term for an isolated information system

SMART system An early search engine developed by Gerard Salton at Cornell

SOAP Simple Object Access Protocol, a format for transmitting data between
services

Social media Data uploaded by identified users, such as in YouTube, Facebook, Flickr

SQL Structured Query Language, commonly used language for manipulating
relational databases

Structured data Data organized according to an explicit schema and broken down into
discrete units of meaning, with units represented using consistent data types
and formats (databases, log files, spreadsheets)

SVM Support vector machine, used in classification

Table Part of a relational database, a body of related information. Each row of the
table corresponds to one entity, and each column, to some attribute of this
entity

Taxonomy A hierarchically typed system of entities, such as mammals being part of
animals being part of living beings

TCO Total cost of ownership, how much an application costs when all implicit
and explicit costs are factored in over time

Timestamp A chronological value indicating when some data was created

Top-k The k highest ranked responses in a database system that can rank answers
to a query
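
For illustration only, a top-k selection over already-scored answers can be sketched in Python with the standard library's `heapq.nlargest`. The document names, the scores, and the function name `top_k` are made up for this example; how a real system computes the scores is outside the scope of the sketch.

```python
import heapq


def top_k(scored_answers, k):
    """Return the k highest-scoring (answer, score) pairs, best first."""
    return heapq.nlargest(k, scored_answers, key=lambda pair: pair[1])


# Hypothetical answers to one query, each paired with a relevance score.
results = [("doc1", 0.2), ("doc2", 0.9), ("doc3", 0.5), ("doc4", 0.7)]
best_two = top_k(results, 2)
```

Using a heap avoids sorting the full answer list when only the first few results are needed.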

Transaction In databases, a sequence of actions that should be performed as an
uninterruptible unit, for example, purchasing a seat on a flight

Unstructured data Data that is not formally or consistently organized, such as textual
data (email, reports, documents) and multimedia content

URL Uniform Resource Locator, the address of a web page

