Search Based Applications At the Confluence of Search and Database Technologies 1st Edition Gregory Grefenstette pdf download
Search-Based Applications
At the Confluence of Search and Database Technologies
Synthesis Lectures on
Information Concepts,
Retrieval, and Services
Editor
Gary Marchionini, University of North Carolina, Chapel Hill
Synthesis Lectures on Information Concepts, Retrieval, and Services is edited by Gary Marchionini of
the University of North Carolina. The series will publish 50- to 100-page publications on topics
pertaining to information science and applications of technology to information discovery, production,
distribution, and management. The scope will largely follow the purview of premier information and
computer science conferences, such as ASIST, ACM SIGIR, ACM/IEEE JCDL, and ACM CIKM.
Potential topics include, but are not limited to: data models, indexing theory and algorithms,
classification, information architecture, information economics, privacy and identity, scholarly
communication, bibliometrics and webometrics, personal information management, human
information behavior, digital libraries, archives and preservation, cultural informatics, information
retrieval evaluation, data fusion, relevance feedback, recommendation systems, question answering,
natural language processing for retrieval, text summarization, multimedia retrieval, multilingual
retrieval, and exploratory search.
XML Retrieval
Mounia Lalmas
2009
Faceted Search
Daniel Tunkelang
2009
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations in
printed reviews, without the prior permission of the publisher.
DOI 10.2200/S00320ED1V01Y201012ICR017
Lecture #17
Series Editor: Gary Marchionini, University of North Carolina, Chapel Hill
Series ISSN
Synthesis Lectures on Information Concepts, Retrieval, and Services
Print 1947-945X Electronic 1947-9468
Search-Based Applications
At the Confluence of Search and Database Technologies
Morgan & Claypool Publishers
ABSTRACT
We are poised at a major turning point in the history of information management via computers.
Recent evolutions in computing, communications, and commerce are fundamentally reshaping the
ways in which we humans interact with information, and generating enormous volumes of electronic
data along the way. As a result of these forces, what will data management technologies, and their
supporting software and system architectures, look like in ten years? It is difficult to say, but we can
see the future taking shape now in a new generation of information access platforms that combine
strategies and structures of two familiar – and previously quite distinct – technologies, search engines
and databases, and in a new model for software applications, the Search-Based Application (SBA),
which offers a pragmatic way to solve both well-known and emerging information management
challenges as of now. Search engines are the world’s most familiar and widely deployed information
access tool, used by hundreds of millions of people every day to locate information on the Web, but
few are aware they can now also be used to provide precise, multidimensional information access
and analysis that is hard to distinguish from current database applications, yet endowed with the
usability and massive scalability of Web search. In this book, we hope to introduce Search-Based
Applications to a wider audience, using real case studies to show how this flexible technology can be
used to intelligently aggregate large volumes of unstructured data (like Web pages) and structured
data (like database content), and to make that data available in a highly contextual, quasi real-time
manner to a wide base of users for a varied range of purposes. We also hope to shed light on the
general convergences underway in search and database disciplines, convergences that make SBAs
possible, and which serve as harbingers of information management paradigms and technologies to
come.
KEYWORDS
search-based applications, search engines, semantic technologies, natural language processing,
human-computer information retrieval, data retrieval, online analytical processing,
OLAP, data integration, alternative data access platforms, unified information
access, NoSQL, mash-up technologies
Contents
Acknowledgments
Glossary
5 Data Collection/Population
5.1 Search Engines
5.1.1 Collection
5.1.2 Updating
5.2 Databases
5.2.1 Creation/Collection
5.2.2 Updating
5.3 What has Changed
5.3.1 Search Engines
5.3.2 Databases
6 Data Processing
6.1 Search Engines
6.1.1 Natural Language Processing
6.1.2 Relevancy Criteria
6.2 Databases
6.3 What has Changed
6.3.1 Search Engines
6.3.2 Databases
7 Data Retrieval
7.1 Search Engines
7.1.1 Querying
7.1.2 Output
7.2 Databases
7.2.1 Querying
7.2.2 Output
7.3 What’s Changed?
7.3.1 Search Engines
7.3.2 Databases
10 SBA Platforms
10.1 What is an SBA Platform?
10.2 Information Access Platforms
10.3 SBA Platforms: Market Leaders
10.4 SBA Platforms: Other Vendors
10.5 SBA Vendors: COTS Applications
Bibliography
Glossary
Agility: The ease with which a computer application can be altered, improved, or extended
Atomicity: The idea that a database transaction either succeeds or fails in its entirety
B2C: Business to Customer; B2C websites offer goods or services directly to users
B+ tree: A block-oriented data structure for efficient insertion and removal of data nodes
BI: Business Intelligence, views on data that aid users with business planning and decision making
Cache: A rapid computer memory where frequently or recently used data is temporarily stored
CAP theorem: One cannot achieve Consistency, Availability, and Partition tolerance at the same time
Categorization: Assigning, usually through statistical means, one or more categories to text
Cloud services: Computer applications that are executed on computers outside the enterprise rather than in-house. Examples are SalesForce, Google Apps, Yahoo mail, etc.
Consistency: A quality of an information system in which only valid data is recorded; that is, there are not two conflicting versions of the same data
Connector: A program that extracts information from a certain file format, or from a database
Consolidation: Making all the data concerning one entity available in one output
Crawl: Fetching web pages for indexing by following URLs found in each page
Data integration: Merging data from different data sources or different information systems
DBA: Database administrator, the person who is responsible for maintaining (and often designing) an organization’s database(s)
Deep Web: Web pages that are dynamically generated as a result of form input and/or database querying
Dublin Core Metadata: A standard for metadata associated with documents, such as Title, Creator, Publisher, etc.
Durability: A database quality that means that successfully completed transactions must persist (or be recoverable) in the case of a system failure
Evolutive Data Model: Model that can be easily extended with new fields or data types without rebuilding the entire data structure
Facet: A dimension of meaning that can be used for restricting search; for example, shirts and coats are two facets that could be found on a shopping site
File server: A service that provides sequential or direct access to computer files
Full-text engine: A system for searching any of the words found in documents, rather than just a set of manually assigned keywords
Gartner: An information technology research and advisory firm that reports on technology issues
Hash table: Hashing converts a data item into a single number, and the hash table maps this number to a list of items
Index, inverted: A data structure that contains lists of words with pointers to where the words are found in documents
Index slice: One section of an inverted index which can be distributed over many different computer stores
Intranet: A secure network that gives authorized users Web-style access to an organization’s information assets (e.g., internal documents and web pages)
IS: Information System, a generic term for any computer system for storing and retrieving information
Isolation: The database constraint specifying that data involved in a transaction are isolated from (inaccessible to) other transactions until the transaction is completed to avoid conflicts and overwrites
JSON: JavaScript Object Notation, a standard for exchanging data between systems
Key-value store: A data storage and retrieval system in which a key (identifying an entity) is linked to the one or more values associated with that entity. This allows rapid lookup of values associated with an entity, but does not allow joins on other fields
Metadata: Typed data associated with a document, for example, Author, Date, Category
Mobile Web: Web pages accessible through a mobile device such as a smartphone
NoSQL: Not Only SQL, an umbrella term for large scale data storage and retrieval systems that use structures and querying methodologies that are different from those of relational database systems
OBI: Operational Business Intelligence, data reporting and analysis that supports decision making concerning routine, day-to-day operations
OCR: Optical Character Recognition, a technology used for converting paper documents or text encapsulated in images into electronic text, usually with some noise caused by the conversion
ODBC: Open Database Connectivity, a middleware for enabling and managing exchanges between databases
Offloading: Extracting information from a database application and storing it in a search engine application
Ontology: A taxonomy with rules that can deduce links not necessarily present in the taxonomy
Partition tolerance: Means that a distributed database can still function if some of its nodes are no longer available
PLM: Product Lifecycle Management, systems which allow for the management of a product from design to retirement
Plug-and-play: Modules that can be used without any reprogramming, “out of the box”
POC: Proof of concept, an application that proves that something can be done, though it may not be optimized for performance
Primary key: In a relational database, a value corresponding to a unique entity, that allows tables to be joined for a given entity
Redundancy: Storing the same data in two different places in a database, or information system. This can cause problems of consistency if one of the values is changed and not the other
Relational model: A model for databases in which data is represented as tables. Some values, called primary keys, link tables together
Relevancy: For a given query, a heuristically determined score of the supposed pertinence of a document to the query
R tree: An efficient data structure for storing GPS-indexed points and finding all the points in a given radius around a point
Scalability: The desirable quality of being able to treat larger and larger data sets without a decrease in performance, or rise in cost
Semantic Web: Collection of web pages that are annotated with machine readable descriptions of their content
Semi-structured data: Data found in places where the data type can be surmised, such as in explicitly labeled metadata, or in structured tables on web pages
SEO: Search engine optimization, strategies that help a web page owner to improve a site’s ranking in common web search engines
SERP: Search engine results page, the output of a query to a search engine
Structured data: Data organized according to an explicit schema and broken down into discrete units of meaning, with units represented using consistent data types and formats (databases, log files, spreadsheets)
Table: Part of a relational database, a body of related information. Each row of the table corresponds to one entity, and each column, to some attribute of this entity
TCO: Total cost of ownership, how much an application costs when all implicit and explicit costs are factored in over time
Top-k: The k highest ranked responses in a database system that can rank answers to a query
Unstructured data: Data that is not formally or consistently organized, such as textual data (email, reports, documents) and multimedia content
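
The glossary entry for an inverted index can be made concrete with a small sketch. The Python below is an illustration only, not taken from the book: the function names build_inverted_index and search, the sample documents, and the whitespace tokenization are all assumptions chosen for brevity. It records only which documents contain each word, whereas a full inverted index as defined above would also point to where in each document the word occurs.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased word to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenization (assumption): lowercase, split on whitespace.
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def search(index, query):
    """Return ids of documents that contain every word of the query (AND semantics)."""
    words = query.lower().split()
    if not words:
        return []
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))
    return sorted(result)


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "search engines index web pages",
    2: "databases store structured data",
    3: "search based applications combine search engines and databases",
}
index = build_inverted_index(docs)
```

Looking up a word is a single dictionary access, which is why full-text engines can answer keyword queries without scanning every document. Here, search(index, "search databases") intersects the lists for the two words and returns only document 3.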