Advanced Database Query Systems:
Techniques, Applications and Technologies
Li Yan
Northeastern University, China
Zongmin Ma
Northeastern University, China
Senior Editorial Director: Kristin Klinger
Director of Book Publications: Julia Mosemann
Editorial Director: Lindsay Johnston
Acquisitions Editor: Erika Carter
Development Editor: Michael Killian
Production Coordinator: Jamie Snavely
Typesetters: Michael Brehm, and Milan Vracarich Jr.
Cover Design: Nick Newcomer
Copyright © 2011 by IGI Global. All rights reserved. No part of this publication may be reproduced, stored or distributed in
any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher.
Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or
companies does not indicate a claim of ownership by IGI Global of the trademark or registered trademark.
All work contributed to this book is new, previously-unpublished material. The views expressed in this book are those of the
authors, but not necessarily of the publisher.
Editorial Advisory Board
Reda Alhajj, University of Calgary, Canada
Gloria Bordogna, CNR-IDPA, Dalmine (Bg), Italy
Alfredo Cuzzocrea, University of Calabria, Italy
Guy De Tré, Ghent University, Belgium
Janusz Kacprzyk, Polish Academy of Sciences, Poland
Xiangfu Meng, Liaoning Technical University, China
Tadeusz Pankowski, Poznan University of Technology, Poland
Slobodan Ribarić, University of Zagreb, Croatia
Leonid Tineo, Universidad Simón Bolívar, Venezuela
Sławomir Zadrożny, Polish Academy of Sciences, Poland
Table of Contents
Acknowledgment
Section 1
Chapter 1
Automatic Categorization of Web Database Query Results
Xiangfu Meng, Liaoning Technical University, China
Li Yan, Northeastern University, China
Z. M. Ma, Northeastern University, China
Chapter 2
Practical Approaches to the Many-Answer Problem
Mounir Bechchi, LINA-University of Nantes, France
Guillaume Raschia, LINA-University of Nantes, France
Noureddine Mouaddib, LINA-University of Nantes, Morocco
Chapter 3
Concept-Oriented Query Language for Data Modeling and Analysis
Alexandr Savinov, SAP Research Center Dresden, Germany
Chapter 4
Evaluating Top-k Skyline Queries Efficiently
Marlene Goncalves, Universidad Simón Bolívar, Venezuela
María Esther Vidal, Universidad Simón Bolívar, Venezuela
Chapter 5
Remarks on a Fuzzy Approach to Flexible Database Querying, its Extension and Relation
to Data Mining and Summarization
Janusz Kacprzyk, Polish Academy of Sciences, Poland
Guy De Tré, Ghent University, Belgium
Sławomir Zadrożny, Polish Academy of Sciences, Poland
Chapter 6
Flexible Querying of Imperfect Temporal Metadata in Spatial Data Infrastructures
Gloria Bordogna, CNR-IDPA, Italy
Francesco Bucci, CNR-IREA, Italy
Paola Carrara, CNR-IREA, Italy
Monica Pepe, CNR-IREA, Italy
Anna Rampini, CNR-IREA, Italy
Chapter 7
Fuzzy Querying Capability at Core of a RDBMS
Ana Aguilera, Universidad de Carabobo, Venezuela
José Tomás Cadenas, Universidad Simón Bolívar, Venezuela
Leonid Tineo, Universidad Simón Bolívar, Venezuela
Chapter 8
An Extended Relational Model & SQL for Fuzzy Multidatabases
Awadhesh Kumar Sharma, M.M.M. Engg College, India
A. Goswami, IIT Kharagpur, India
D. K. Gupta, IIT Kharagpur, India
Section 2
Chapter 9
Pattern-Based Schema Mapping and Query Answering in Peer-to-Peer XML
Data Integration System
Tadeusz Pankowski, Poznan University of Technology, Poland
Chapter 10
Deciding Query Entailment in Fuzzy OWL Lite Ontologies
Jingwei Cheng, Northeastern University, China
Z. M. Ma, Northeastern University, China
Li Yan, Northeastern University, China
Chapter 11
Relational Techniques for Storing and Querying RDF Data: An Overview
Sherif Sakr, University of New South Wales, Australia
Ghazi Al-Naymat, University of New South Wales, Australia
Section 3
Chapter 12
Making Query Coding in SQL Easier by Implementing the SQL Divide Keyword:
An Experimental Query Rewriter in Java
Eric Draken, University of Calgary, Canada
Shang Gao, University of Calgary, Canada
Reda Alhajj, University of Calgary, Canada & Global University, Lebanon
Chapter 13
Querying Graph Databases: An Overview
Sherif Sakr, University of New South Wales, Australia
Ghazi Al-Naymat, University of New South Wales, Australia
Chapter 14
Querying Multimedia Data by Similarity in Relational DBMS
Maria Camila Nardini Barioni, Federal University of ABC, Brazil
Daniel dos Santos Kaster, University of Londrina, Brazil
Humberto Luiz Razente, Federal University of ABC, Brazil
Agma Juci Machado Traina, University of São Paulo at São Carlos, Brazil
Caetano Traina Júnior, University of São Paulo at São Carlos, Brazil
Index
Detailed Table of Contents
Acknowledgment
Section 1
Chapter 1
Automatic Categorization of Web Database Query Results
Xiangfu Meng, Liaoning Technical University, China
Li Yan, Northeastern University, China
Z. M. Ma, Northeastern University, China
This chapter proposes a novel categorization approach which consists of two steps. The first step ana-
lyzes query history of all users in the system offline and generates a set of clusters over the tuples,
where each cluster represents one type of user preference. When a user issues a query, the second step
presents to the user a category tree over the clusters generated in the first step such that the user can
easily select the subset of query results matching his needs. The chapter develops heuristic algorithms
to compute the min-cost categorization. The efficiency and effectiveness of the proposed approach are
demonstrated by experimental results.
Chapter 2
Practical Approaches to the Many-Answer Problem
Mounir Bechchi, LINA-University of Nantes, France
Guillaume Raschia, LINA-University of Nantes, France
Noureddine Mouaddib, LINA-University of Nantes, Morocco
This chapter reviews and discusses several research efforts that have attempted to provide users with
effective and efficient ways to access databases. The focus is on a simple but useful strategy for retriev-
ing relevant answers accurately and quickly without being distracted by irrelevant ones. The chapter
presents a very recent but promising approach to quickly provide users with structured and approximate
representations of their query results, a must-have for decision support systems. The underlying algorithm
operates on pre-computed knowledge-based summaries of the queried data, instead of the raw data
themselves.
Chapter 3
Concept-Oriented Query Language for Data Modeling and Analysis
Alexandr Savinov, SAP Research Center Dresden, Germany
This chapter describes a novel query language, called the concept-oriented query language, and dem-
onstrates how it can be used for data modeling and analysis. The query language is based on a novel
construct, called concept, and two relations between concepts, inclusion and partial order. Concepts
generalize conventional classes and are used for describing domain-specific identities. The inclusion relation
generalizes inheritance and is used for describing hierarchical address spaces. Partial order among
concepts is used to define two main operations: projection and de-projection. The chapter demonstrates
how these constructs are used to solve typical tasks in data modeling and analysis such as logical navi-
gation, multidimensional analysis and inference.
Chapter 4
Evaluating Top-k Skyline Queries Efficiently
Marlene Goncalves, Universidad Simón Bolívar, Venezuela
María Esther Vidal, Universidad Simón Bolívar, Venezuela
This chapter describes existing solutions and proposes to use the TKSI algorithm for the Top-k Skyline
problem. TKSI reduces the search space by computing only a subset of the Skyline that is required to
produce the top-k objects. In addition, the Skyline Frequency Metric is implemented to discriminate
among the Skyline objects those that best meet the multidimensional criteria. The chapter empirically
studies the quality of TKSI, and the experimental results show that TKSI may be able to speed up the
computation of the Top-k Skyline by at least 50% relative to state-of-the-art solutions.
Chapter 5
Remarks on a Fuzzy Approach to Flexible Database Querying, its Extension and Relation
to Data Mining and Summarization
Janusz Kacprzyk, Polish Academy of Sciences, Poland
Guy De Tré, Ghent University, Belgium
Sławomir Zadrożny, Polish Academy of Sciences, Poland
This chapter is meant to revive the line of research in flexible querying languages based on the use of
fuzzy logic. Details of a basic technique of flexible fuzzy querying are recalled and some of the newest
developments in this area are discussed. Moreover, it is shown how other relevant tasks may be implemented
in the framework of such a query interface. In particular, the chapter considers fuzzy queries with
linguistic quantifiers and shows their intrinsic relation with linguistic data summarization. Finally,
so-called bipolar queries are mentioned and advocated as the next relevant breakthrough in flexible querying
based on fuzzy logic and possibility theory.
Chapter 6
Flexible Querying of Imperfect Temporal Metadata in Spatial Data Infrastructures
Gloria Bordogna, CNR-IDPA, Italy
Francesco Bucci, CNR-IREA, Italy
Paola Carrara, CNR-IREA, Italy
Monica Pepe, CNR-IREA, Italy
Anna Rampini, CNR-IREA, Italy
This chapter discusses the limitations of current temporal metadata in discovery services of spatial data
infrastructures (SDIs) and proposes some solutions. A formal and operational method is proposed
to represent imperfect temporal metadata values and to allow users to express flexible search
conditions, i.e., conditions tolerant to under-satisfaction. In doing so, discovery services can apply partial matching
mechanisms between the “desired” metadata, expressed by the user, and the archived metadata: this
would allow retrieving geodata in decreasing order of relevance to the user needs, as it usually occurs
on the Web when using search engines. Finally, the chapter illustrates the proposal with an example.
Chapter 7
Fuzzy Querying Capability at Core of a RDBMS
Ana Aguilera, Universidad de Carabobo, Venezuela
José Tomás Cadenas, Universidad Simón Bolívar, Venezuela
Leonid Tineo, Universidad Simón Bolívar, Venezuela
This chapter concentrates on incorporating fuzzy capabilities into an open-source relational database
management system (RDBMS). The fuzzy capabilities include connectors, modifiers, comparators,
quantifiers and queries. The extensions provide more flexible DDL and DML languages. The aim
is to show the design and implementation details in the RDBMS PostgreSQL. For this, a fuzzy query
processor and fuzzy access mechanism are designed and implemented. The physical fuzzy relational
operators are also defined and implemented. The flow of a fuzzy query through the different modules
(parser, planner, optimizer and executor) is shown. The chapter includes some experimental results to
demonstrate the performance of the proposed solution. These results show that the extensions do not
decrease the performance of the RDBMS.
Chapter 8
An Extended Relational Model & SQL for Fuzzy Multidatabases
Awadhesh Kumar Sharma, M.M.M. Engg College, India
A. Goswami, IIT Kharagpur, India
D. K. Gupta, IIT Kharagpur, India
This chapter investigates the problems in integration of fuzzy relational databases and extends the
relational data model to support fuzzy multidatabases of type-2 that contain integrated fuzzy relational
databases. The extended model, named the fuzzy tuple source (FTS) relational data model, is provided with a
set of FTS relational operations to manipulate the global relations, called FTS relations, from such fuzzy
multidatabases. The chapter proposes and implements a full set of FTS relational algebraic operations
capable of manipulating an extensive set of fuzzy relational multidatabases of type-2 that include fuzzy
data values in their instances. To facilitate the formulation of global fuzzy queries over FTS relations in such
fuzzy multidatabases, SQL is extended accordingly, yielding the fuzzy tuple source structured
query language (FTS-SQL).
Section 2
Chapter 9
Pattern-Based Schema Mapping and Query Answering in Peer-to-Peer XML
Data Integration System
Tadeusz Pankowski, Poznan University of Technology, Poland
This chapter discusses a method for schema mapping and query reformulation in a P2P XML data
integration system. The discussed formal approach enables us to specify schemas, schema constraints,
schema mappings, and queries in a uniform and precise way. Based on this approach, the chapter
defines some basic operations used for query reformulation and data merging, and proposes algorithms for
the automatic generation of XQuery programs that perform these operations in practice. Issues concerning
query propagation strategies and merging modes are discussed for the case where missing data is to be discovered
in the P2P integration process. The approach is implemented in the 6P2P system. Its general architecture
is presented, and the way queries and answers are sent across the P2P environment is sketched.
Chapter 10
Deciding Query Entailment in Fuzzy OWL Lite Ontologies
Jingwei Cheng, Northeastern University, China
Z. M. Ma, Northeastern University, China
Li Yan, Northeastern University, China
This chapter focuses on fuzzy (threshold) conjunctive queries over knowledge bases encoded in the fuzzy
DL SHIF(D), the logic counterpart of the fuzzy OWL Lite language. The decidability of fuzzy query
entailment in this setting is shown by providing a corresponding tableau-based algorithm. It is also shown
that the data complexity of answering fuzzy conjunctive queries in fuzzy SHIF(D) is in coNP, as long
as only simple roles occur in the query. Regarding combined complexity, the chapter proves a
co3NExpTime upper bound in the size of the knowledge base and the query.
Chapter 11
Relational Techniques for Storing and Querying RDF Data: An Overview
Sherif Sakr, University of New South Wales, Australia
Ghazi Al-Naymat, University of New South Wales, Australia
This chapter concentrates on using relational query processors to store and query RDF data. An over-
view of the different approaches is given and these approaches are classified according to the storage
and query evaluation strategies.
Section 3
Chapter 12
Making Query Coding in SQL Easier by Implementing the SQL Divide Keyword:
An Experimental Query Rewriter in Java
Eric Draken, University of Calgary, Canada
Shang Gao, University of Calgary, Canada
Reda Alhajj, University of Calgary, Canada & Global University, Lebanon
This chapter aims to provide a SQL expression equivalent to explicit relational algebra division (with
a static divisor). The goal is to implement a SQL query rewriter in Java that takes as input a divide
grammar and rewrites it to an efficient query using current SQL keywords.
Chapter 13
Querying Graph Databases: An Overview
Sherif Sakr, University of New South Wales, Australia
Ghazi Al-Naymat, University of New South Wales, Australia
This chapter provides an overview of different techniques for indexing and querying graph databases.
An overview of several proposed graph query languages is also given, and a set of guidelines for
future research directions is provided.
Chapter 14
Querying Multimedia Data by Similarity in Relational DBMS
Maria Camila Nardini Barioni, Federal University of ABC, Brazil
Daniel dos Santos Kaster, University of Londrina, Brazil
Humberto Luiz Razente, Federal University of ABC, Brazil
Agma Juci Machado Traina, University of São Paulo at São Carlos, Brazil
Caetano Traina Júnior, University of São Paulo at São Carlos, Brazil
This chapter presents an already validated strategy that adds similarity queries to SQL, supporting a
powerful set of similarity operators. The chapter also describes techniques to store and retrieve
multimedia objects in an efficient way and shows existing DBMS alternatives to execute similarity queries
over multimedia data.
Index
Preface
Databases are designed to support the data storage, processing, and retrieval activities related to data
management. The wide usage of databases in various applications has resulted in an enormous wealth
of data, which populates various types of databases around the world. One can find many types of
database systems, for example, relational databases, object-oriented databases, object-relational data-
bases, deductive databases, parallel databases, distributed databases, multidatabase systems, Web data-
bases, XML databases, multimedia databases, temporal/spatial databases, spatiotemporal databases, and
uncertain databases. As a result, databases have become the repositories of large volumes of data.
Database querying is closely related to data management. Database query processing is the procedure
by which database management systems (DBMSs) obtain the information needed by users from the
databases according to the users’ requirements, organize it, and then provide it to the users.
It is critical to cope with the sheer volume of data and retrieve the worthwhile information for
effective problem solving and decision making. This is especially true when a variety of database types,
data types, and users’ requirements, as well as large volumes of data, are involved. Database query
techniques are challenging today’s database systems and promoting their evolution. There is no
doubt that database query systems play an important role in data management, and data management
requires database query support.
The research and development of information queries over a variety of databases are receiving increasing
attention. By means of query technology, large volumes of information in databases can be retrieved,
and information systems are built on top of databases to support problem solving and
decision making. Database queries are therefore a field that must be investigated by academic researchers
together with developers and users from both the database and industry areas.
This book focuses on the following issues of advanced database query systems: the technologies
and methodologies of database queries, XML and metadata queries, and applications of database query
systems, aiming at providing a single account of technologies and practices in advanced database query
systems. The objective of the book is to provide state-of-the-art information to academics, researchers,
and industry practitioners who are involved or interested in the study, use, design, and development of
advanced and emerging database queries, with the ultimate aim of empowering individuals and organizations
in building competencies for exploiting the opportunities of the data and knowledge society. This book
presents the latest research and application results in advanced database query systems. The different
chapters in the book have been contributed by different authors and provide possible solutions for the
different types of technological problems concerning database queries.
This book, which consists of fourteen chapters, is organized into three major sections. The first sec-
tion discusses the technologies and methodologies of database queries, over the first eight chapters. The
next three chapters covering XML and metadata queries comprise the second section. The third section,
containing the final three chapters, focuses on the design and applications of database query systems.
First of all, we take a look at the issues of the technologies and methodologies of database queries.
Web database queries are often exploratory. The users often find that their queries return too many
answers and many of them may be irrelevant. Based on different kinds of user preferences, Xiangfu
Meng, Li Yan and Z. M. Ma propose a novel categorization approach which consists of two steps. The
first step analyzes query history of all users in the system offline and generates a set of clusters over
the tuples, where each cluster represents one type of user preference. When a user issues a query, the
second step presents to the user a category tree over the clusters generated in the first step such that
the user can easily select the subset of query results matching his needs. The problem of constructing
a category tree is a cost optimization problem and the authors develop heuristic algorithms to compute
the min-cost categorization. The efficiency and effectiveness of their approach are demonstrated by
experimental results.
Database systems are increasingly used for interactive and exploratory data retrieval. In such retrievals,
user queries often result in too many answers, so users waste significant time and effort sifting
and sorting through these answers to find the relevant ones. Mounir Bechchi, Guillaume Raschia and
Noureddine Mouaddib first review and discuss several research efforts that have attempted to provide
users with effective and efficient ways to access databases. Then, they focus on a simple but useful
strategy for retrieving relevant answers accurately and quickly without being distracted by irrelevant
ones. They present a very recent but promising approach to quickly provide users with structured and
approximate representations of their query results, a must-have for decision support systems. The
underlying algorithm operates on pre-computed knowledge-based summaries of the queried data, instead
of the raw data themselves. This first-class data structure is therefore also presented.
Alexandr Savinov describes a novel query language, called the concept-oriented query language
(COQL), and demonstrates how it can be used for data modeling and analysis. The query language is
based on a novel construct, called concept, and two relations between concepts, inclusion and partial
order. Concepts generalize conventional classes and are used for describing domain-specific identities.
The inclusion relation generalizes inheritance and is used for describing hierarchical address spaces.
Partial order among concepts is used to define two main operations: projection and de-projection. Sa-
vinov demonstrates how these constructs are used to solve typical tasks in data modeling and analysis
such as logical navigation, multidimensional analysis, and inference.
Criteria that induce a Skyline naturally represent users’ preference conditions, which are useful for discarding
irrelevant data in large datasets. However, in the presence of high-dimensional Skyline spaces, the size
of the Skyline can still be very large. To identify the best k points among the Skyline, the Top-k Skyline
approach has been proposed. Marlene Goncalves and María-Esther Vidal describe existing solutions
and propose to use the TKSI algorithm for the Top-k Skyline problem. TKSI reduces the search space
by computing only a subset of the Skyline that is required to produce the top-k objects. In addition, the
Skyline Frequency Metric is implemented to discriminate among the Skyline objects those that best
meet the multidimensional criteria. They empirically study the quality of TKSI, and their experimental
results show that TKSI may be able to speed up the computation of the Top-k Skyline by at least 50%
relative to state-of-the-art solutions.
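The notions in this paragraph can be illustrated with a small sketch. It is not the TKSI algorithm: it computes the whole Skyline by pairwise dominance checks, which is exactly the work TKSI is said to avoid, and then ranks Skyline points by a naive skyline frequency, taken here to mean the number of non-empty dimension subsets in which a point stays in the Skyline. The function names, the sample data, and the "smaller is better" convention are assumptions made for illustration only.

```python
from itertools import combinations


def dominates(a, b, dims):
    """True if a is no worse than b on every dimension in dims and strictly
    better on at least one (assumption: smaller values are better)."""
    return all(a[d] <= b[d] for d in dims) and any(a[d] < b[d] for d in dims)


def skyline(points, dims):
    """Points not dominated by any other point, restricted to dims."""
    dims = tuple(dims)
    return [p for p in points if not any(dominates(q, p, dims) for q in points)]


def top_k_skyline(points, k):
    """Rank full-space Skyline points by a naive skyline frequency."""
    n = len(points[0])
    full = skyline(points, range(n))
    freq = {p: 0 for p in full}
    for size in range(1, n + 1):
        for dims in combinations(range(n), size):
            members = set(skyline(points, dims))
            for p in full:
                freq[p] += p in members
    # sorted() is stable, so ties keep their input order.
    return sorted(full, key=lambda p: -freq[p])[:k]


# Hypothetical hotels as (price, distance_to_beach) tuples.
hotels = [(50, 8), (60, 5), (80, 2), (90, 6), (70, 9)]
print(skyline(hotels, range(2)))   # [(50, 8), (60, 5), (80, 2)]
print(top_k_skyline(hotels, 2))    # [(50, 8), (80, 2)]
```

The exhaustive enumeration of dimension subsets is exponential in the number of dimensions, so this form is only practical for toy inputs.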
Janusz Kacprzyk, Guy De Tré, and Sławomir Zadrożny briefly present the concept of, a rationale for
and various approaches to the use of fuzzy logic in flexible querying. They first discuss some historical
developments, and then the main issues related to fuzzy querying. Next, they concentrate on fuzzy queries
with linguistic quantifiers, and discuss in more detail their FQUERY for Access fuzzy querying system.
They indicate not only the straightforward power of that fuzzy querying system but also its great potential
as a tool to implement linguistic data summaries that may provide an ultimately human-consistent way
of data mining and data summarization. Also, they briefly mention the concept of bipolar queries that
may reflect positive and negative preferences of the user, and may be a breakthrough in fuzzy querying.
In the context of fuzzy querying and linguistic summarization, they point out the considerable potential of
their recent proposals to explicitly use, in linguistic data summarization, some elements of natural
language generation (NLG) and some NLG-related elements of Halliday’s systemic
functional linguistics (SFL). They argue that this may be a promising direction for future research.
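A fuzzy query with a linguistic quantifier can be sketched in a few lines. The sketch below is not FQUERY for Access. It assumes a piecewise-linear membership function for the fuzzy term "low price" and a Zadeh-style definition of the quantifier "most", and evaluates the summary "most records have a low price" as the quantifier applied to the mean membership degree. All function names, breakpoints, and data are hypothetical.

```python
def low_price(price):
    """Hypothetical membership for 'low price': 1 up to 100, 0 from 200,
    linear in between."""
    if price <= 100:
        return 1.0
    if price >= 200:
        return 0.0
    return (200 - price) / 100


def most(proportion):
    """Hypothetical 'most' quantifier: 0 up to 0.3, 1 from 0.8, linear in between."""
    if proportion <= 0.3:
        return 0.0
    if proportion >= 0.8:
        return 1.0
    return (proportion - 0.3) / 0.5


def fuzzy_select(rows, membership, threshold=0.0):
    """Return (row, degree) pairs with degree above threshold, best match first."""
    scored = [(row, membership(row)) for row in rows]
    matches = [pair for pair in scored if pair[1] > threshold]
    return sorted(matches, key=lambda pair: -pair[1])


def truth_of_most(rows, membership):
    """Degree of truth of 'most rows satisfy the fuzzy condition'."""
    degrees = [membership(row) for row in rows]
    return most(sum(degrees) / len(degrees))


prices = [80, 100, 150, 250]
print(fuzzy_select(prices, low_price))   # rows ranked by matching degree
print(truth_of_most(prices, low_price))  # approximately 0.65
```

The same truth value can be read either as the degree to which a quantified query is satisfied or as the validity of a linguistic summary of the data, which is the relation between fuzzy querying and summarization that the paragraph mentions.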
Gloria Bordogna et al. discuss the limitations of current temporal metadata in discovery services of
Spatial Data Infrastructures (SDIs) and propose some solutions. They present their proposal of a formal
and operational method to represent imperfect temporal metadata values and allow users to express
flexible search conditions, i.e. tolerant to under-satisfaction. In doing so, discovery services can apply
partial matching mechanisms between the “desired” metadata, expressed by the user, and the archived
metadata: this would allow retrieving geodata in decreasing order of relevance to the user needs, as it
usually occurs on the Web when using search engines. The proposal is finally illustrated with an example.
Ana Aguilera, José Tomás Cadenas and Leonid Tineo concentrate on incorporating fuzzy
capabilities into an open-source relational database management system (RDBMS). The fuzzy capabilities
include connectors, modifiers, comparators, quantifiers, and queries. The extensions provide more
flexible DDL and DML languages. The aim is to show the design and implementation details in the
RDBMS PostgreSQL. For this, they design and implement a fuzzy query processor and fuzzy access
mechanism. Also, they define and implement the physical fuzzy relational operators. They show the flow
of a fuzzy query through the different modules (parser, planner, optimizer, and executor). They include
some experimental results to demonstrate the performance of the proposed solution. These results show
that the extensions do not decrease the performance of the RDBMS.
Awadhesh Kumar Sharma, A. Goswami, and D.K. Gupta investigate the problems in integration
of fuzzy relational databases and extend the relational data model to support fuzzy multidatabases of
type-2 that contain integrated fuzzy relational databases. The extended model is given the name fuzzy
tuple source (FTS) relational data model which is provided with a set of FTS relational operations to
manipulate the global relations called FTS relations from such fuzzy multidatabases. They propose and
implement a full set of FTS relational algebraic operations capable of manipulating an extensive set of
fuzzy relational multidatabases of type-2 that include fuzzy data values in their instances. To facilitate
the formulation of global fuzzy queries over FTS relations in such fuzzy multidatabases, SQL is extended
accordingly, yielding the fuzzy tuple source structured query language (FTS-SQL).
The second section deals with the issues of XML and metadata queries.
Tadeusz Pankowski addresses the problem of data integration in a P2P environment, where each peer
stores the schema of its local data, mappings between the schemas, and some schema constraints. The goal
of the integration is to answer queries formulated against a chosen peer. The answer must consist of
data stored in the queried peer as well as data of its direct and indirect partners. Pankowski focuses on
defining and using mappings, schema constraints, query propagation across the P2P system, and query
answering in such a scenario. Schemas, mappings, constraints (functional dependencies) and queries are
all expressed using a unified approach based on tree-pattern formulas. He discusses how functional de-
pendencies can be exploited to increase information content of answers (by discovering missing values)
and to control merging operations and propagation strategies. He proposes algorithms for translating
high-level specifications of mappings and queries into XQuery programs, and shows how the discussed
method has been implemented in the SixP2P (or 6P2P) system.
Significant research efforts in the Semantic Web community have recently been directed toward the
representation and reasoning with fuzzy ontologies. Description logics (DLs) are the logical foundations
of standard Web ontology languages. Conjunctive queries are deemed as an expressive reasoning service
for DLs. Jingwei Cheng, Z. M. Ma, and Li Yan focus on fuzzy (threshold) conjunctive queries over
knowledge bases encoded in fuzzy DL SHIF(D), the logical counterpart of the fuzzy OWL Lite language. They
show decidability of fuzzy query entailment in this setting by providing a corresponding tableau-based
algorithm. Also they show data complexity for answering fuzzy conjunctive queries in fuzzy SHIF(D)
is in coNP, as long as only simple roles occur in the query. Regarding combined complexity, they prove
a co3NExpTime upper bound in the size of the knowledge base and the query.
The Resource Description Framework (RDF) is a flexible model for representing information about
resources in the Web. With the increasing amount of RDF data which is becoming available, efficient
and scalable management of RDF data has become a fundamental challenge to achieve the Semantic
Web vision. The RDF model has attracted attention in the database community, and many researchers
have proposed different solutions to store and query RDF data efficiently. Sherif Sakr and Ghazi Al-
Naymat concentrate on using relational query processors to store and query RDF data. They give an
overview of the different approaches and classify these approaches according to the storage and query
evaluation strategies.
In the third section, we see the design and application aspects of database query systems.
Relational Algebra (RA) and structured query language (SQL) are supposed to have a bijective re-
lationship by having the same expressive power. That is, each operation in SQL can be mapped to one
RA equivalent and vice versa. RA has an explicit relational division symbol (÷) whereas SQL does not
have a corresponding explicit division keyword. Division is implemented using a combination of four
core operations, namely cross product, difference, selection, and projection. The work described by
Eric Draken, Shang Gao, and Reda Alhajj is intended to provide SQL expression equivalent to explicit
relational algebra division (with static divisor). The goal is to implement a SQL query rewriter in Java
which takes as input a divide grammar and rewrites it to an efficient query using current SQL keywords.
The developed approach could be adapted as a front-end or as a wrapper to an existing SQL query system.
Recently, there has been a lot of interest in the application of graphs in different domains. Graphs
have been widely used for data modeling in different application domains such as: chemical compounds,
protein networks, social networks, and Semantic Web. Given a query graph, the task of retrieving related
graphs as a result of the query from a large graph database is a key issue in any graph-based application.
This has raised a crucial need for efficient graph indexing and querying techniques. Sherif Sakr and
Ghazi Al-Naymat provide an overview of different techniques for indexing and querying graph databases.
They also give an overview of several proposals of graph query language. Finally, they provide a set of
guidelines for future research directions.
Multimedia objects–such as images, audio, and video–do not present the total ordering relationship,
so the relational operators are not suitable to compare them. Therefore, similarity queries are the most
useful, and often the only types of queries adequate to search multimedia objects stored in a database.
Unfortunately, the ubiquitous query language SQL–the most widely employed language in Database
Management Systems (DBMS)–does not provide effective support for similarity queries. Maria Camila
Nardini Barioni et al. present an already validated strategy that adds similarity queries to SQL, supporting
a powerful set of similarity operators. They also describe techniques to store and retrieve multimedia
objects in an efficient way and show existing DBMS alternatives to executing similarity queries over
multimedia data.
Li Yan
Northeastern University, China
Zongmin Ma
Northeastern University, China
Acknowledgment
The editors wish to thank all of the authors for their insights and excellent contributions to this book and
would like to acknowledge the help of all involved in the collation and review process of the book,
without whose support, the project could not have been satisfactorily completed. Most of the authors of
chapters included in this book also served as referees for chapters written by other authors. Thanks go
to all those who provided constructive and comprehensive reviews.
A further special note of thanks goes to all the staff at IGI Global, whose contributions throughout
the whole process from inception of the initial idea to final publication have been invaluable. Special
thanks also go to the publishing team at IGI Global. This book would not have been possible without
the ongoing professional support from IGI Global.
The idea of editing this volume stems from the initial research work that the editors did in the past several
years. The research work of the editors was supported by the National Natural Science Foundation
of China (60873010 and 61073139), the Fundamental Research Funds for the Central Universities
(N090504005, N100604017 and N090604012), and the Program for New Century Excellent Talents in
University (NCET-05-0288).
Li Yan
Northeastern University, China
Zongmin Ma
Northeastern University, China
June 2010
Section 1
Chapter 1
Automatic Categorization of
Web Database Query Results
Xiangfu Meng
Liaoning Technical University, China
Li Yan
Northeastern University, China
Z. M. Ma
Northeastern University, China
ABSTRACT
Web database queries are often exploratory. The users often find that their queries return too many
answers and many of them may be irrelevant. Based on different kinds of user preferences, this chapter
proposes a novel categorization approach which consists of two steps. The first step analyzes query his-
tory of all users in the system offline and generates a set of clusters over the tuples, where each cluster
represents one type of user preference. When a user issues a query, the second step presents to the user
a category tree over the clusters generated in the first step such that the user can easily select the subset
of query results matching his needs. The problem of constructing a category tree is a cost optimization
problem and heuristic algorithms were developed to compute the min-cost categorization. The efficiency
and effectiveness of our approach are demonstrated by experimental results.
INTRODUCTION
As the Internet becomes ubiquitous, many people search for their favorite cars, houses, stocks, etc. over
Web databases. However, Web database queries are often exploratory. Users often find that
their queries return too many answers, a problem commonly referred to as “information overload”. For
example, when a user submits a query to the MSN House&Home Web site to search for a house located in
Seattle with a price between $200,000 and $300,000, 1,256 tuples are returned. Information overload
makes it hard for the user to separate the interesting items from the uninteresting ones, and thereby leads
to a huge waste of the user’s time and effort. In such a situation, the user typically poses a broad query in the
beginning to avoid excluding potentially interesting results, and then iteratively refines the query
until a few answers matching their preferences are returned. However, this iterative procedure is
time-consuming and many users will give up before they reach the final stage.
DOI: 10.4018/978-1-60960-475-2.ch001
Copyright © 2011, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.
In order to resolve the problem of “information overload”, two types of solutions have been proposed.
The first type categorizes the query results into a category tree (Chakrabarti, Chaudhuri & Hwang, 2004;
Chen & Li, 2007), and the second type ranks the results (Agrawal, Chaudhuri, Das & Gionis, 2003; Agrawal,
Rantzau & Terzi, 2006; Bruno, Gravano & Marian, 2002; Chaudhuri, Das, Hristidis & Weikum, 2004;
Das, Hristidis, Kapoor & Sudarshan, 2006). The success of both approaches depends on the utilization
of user preferences. However, these approaches assume that all users have the same preferences,
whereas in real life different users often have different preferences. Let us look at the following example.
Example 1. Consider a real estate searching Web site. Figure 1 and Figure 2 respectively show a fraction
of the category trees generated by using the methods of Greedy (Chakrabarti, Chaudhuri & Hwang, 2004)
and C4.5-Categorization (Chen & Li, 2007) over 214 houses returned by a query with the condition
“Price between 250000 and 350000 ∧ City = Seattle”. Each tree node specifies the range or equality
conditions on an attribute, and the number in parentheses is the number of tuples satisfying all
conditions from the root to the current node. Users can use this tree to select the houses they are interested in.
Consider three users U1, U2, and U3. Assume that U1 prefers houses with large square footage, U2 prefers
houses with water views, and U3 prefers both water views and the Burien living area. The Greedy method
assumes that all users have the same preferences. As a result, attributes “Livingarea” and “Schooldistrict”
are placed at the first two levels of the tree because more users are concerned with “Livingarea” and
“Schooldistrict” than with other attributes. However, there may be some users (such as U2 and U3) who want
to first visit the large square footage and water view houses. They then have to visit many nodes if they follow
the tree built in Figure 1. Considering the diversity of user preferences and the cost of visiting both
intermediate nodes and leaf nodes, the C4.5-Categorization method took advantage of the C4.5 algorithm
to create the navigational tree. But the created category tree (Figure 2) has two drawbacks: (i) the tuples
under the intermediate nodes cannot be explored by the users, i.e., users can only access the tuples under
the leaf nodes but cannot examine the tuples in the intermediate nodes; (ii) the cost of visiting the tuples
of an intermediate node is not considered if the user chooses to explore the tuples of that node.
User preferences are often difficult to obtain because users do not want to spend extra effort to specify
their preferences. Thus there are two major challenges in addressing the diversity of user preferences:
(i) how to summarize different kinds of user preferences from the behavior of all users already in the
system, and (ii) how to categorize or rank the query results according to the specific user preferences.
Query history has been widely applied to infer the preferences of all users in the system (Agrawal,
Chaudhuri, Das & Gionis, 2003; Chaudhuri, Das, Hristidis & Weikum, 2004; Chakrabarti, Chaudhuri
& Hwang, 2004; Das, Hristidis, Kapoor & Sudarshan, 2006).
In this chapter, we present techniques to automatically categorize the results of user queries on Web
databases in order to reduce information overload. We propose a two-step approach to address both
challenges for the categorization case. The first step analyzes query history of all users already in the
system offline and then generates a set of clusters over the data. Each cluster corresponds to one type of
user preferences and is associated with a probability that users may be interested in the cluster. Assume
that an individual user’s preference can be represented as a subset of these clusters. When a specific user
submits a query, the second step first computes the similarity between the query and the representative
queries in the query clusters, and then the data clusters the user may be interested in can be inferred by
the query. Next, the set of data clusters generated in the first step is intersected with the query answers
and then a labeled hierarchical category structure is generated automatically based on the contents of the
tuples in the answer set. Consequently, a category tree is automatically constructed over these intersected
clusters on the fly. This tree is finally presented to the user.
This chapter presents a domain-independent approach to addressing the information overload problem.
The contributions are summarized as follows:
• We propose a clustering approach to cluster queries and summarize preferences of all users in the
system using the query history. This approach uses query pruning, pre-computation and query
clustering to deal with large query histories and large data sets.
• We propose a cost-based algorithm to construct a category tree over these clusters pre-formulated
in the offline processing phase. Unlike the existing categorization and decision tree construction
approaches, our approach shows tuples for intermediate nodes and considers the cost for users to
visit both intermediate nodes and leaves.
The rest of this chapter is organized as follows. Section 2 reviews some related work. Section 3
formally defines some notions. Section 4 describes the queries and tuples clustering method. Section 5
proposes the algorithm for the category tree construction. Section 6 shows the experimental results. The
chapter is concluded in Section 7.
RELATED WORK
Two kinds of automatic categorization approaches have been proposed by Chakrabarti et al. (Chakrabarti,
Chaudhuri & Hwang, 2004) and Chen et al. (Chen & Li, 2007), respectively. Chakrabarti et al.
proposed a greedy algorithm to construct a category tree. This algorithm uses the query history of all users
in the system to infer an overall user preference as the probabilities that users are interested in each attribute.
Taking advantage of the C4.5 decision tree construction algorithm, Chen and Li (2007) proposed
a two-step solution which first clusters the user query history and then constructs the navigational tree for
resolving the user’s personalized query. We make use of some of these ideas, but enhance the category
tree with the feature of showing tuples in the intermediate nodes and focus on how the clusters of query
history and the cost of visiting both intermediate nodes and leaves affect the categorization.
To provide query personalization, several approaches have been proposed to define a user profile
for each user and use the profile to decide his preferences (Koutrika & Ioannidis, 2004; Kießling, 2002).
However, as Chen and Li (2007) pointed out, in real life user profiles may not be available
because users do not want to or cannot specify their preferences (if they could, they could form the
appropriate query and there would be no need for either ranking or categorizing). The profile may be derived from the
query history of a certain user, but this method does not work if the user is new to the system, which is
exactly when the user needs help.
There has been a rich body of work on categorizing text documents (Dhillon, Mallela & Kumar,
2002; Joachims, 1998; Koller & Sahami, 1997) and Web search results (Liu, Yu & Meng, 2002; Zeng,
He, Chen, Ma & Ma, 2004). But categorizing relational data presents unique challenges and
opportunities. First, relational data contains numerical values while text categorization methods treat documents
as bags of words. Second, this chapter tries to minimize the overhead for users to navigate the generated tree
(defined in Section 3), which is not considered in the existing text categorization methods.
Also there has been a rich body of work on information visualization techniques (Card, MacKinlay
& Shneiderman, 1999). Two popular techniques are dynamic query slider (Ahlberg & Shneiderman,
1994) and brushing histogram (Tweedie, Spence, Williams & Bhogal, 1994). The former allows users
to visualize dynamic query results by using sliders to represent range search conditions, and the latter
employs interactive histograms to represent each attribute and helps users explore correlations between
attributes. Note that they do not take query history into account. Furthermore, information visualization
techniques require users to specify what information to visualize (e.g., by setting the slider or selecting
histogram buckets). Since our approach generates the information to visualize, i.e., the category tree,
our approach is complementary to visualization techniques. If the leaf of the tree still contains many
tuples, for example, a query slider or brushing histogram can be used to further narrow down the scope.
Concerning the ranked retrieval from databases, user relevance feedback (Rui, Huang & Merhotra,
1997; Wu, Faloutsos, Sycara & Payne, 2000) is employed to learn the similarity between a result tuple
and the query, which is used to rank the query results in relational multimedia databases. The SQL query
language is extended to allow the user to specify the ranking function according to their preference for
the attributes (Kießling, 2002; Roussos, Stavrakas & Pavlaki, 2005). Also, the importance scores of
result tuples (Agrawal, Chaudhuri, Das & Gionis, 2003; Chaudhuri, Das, Hristidis & Weikum, 2004;
Geerts, Mannila & Terzim, 2004) are extracted automatically by analyzing the past workloads, which
can reveal what users are looking for and what they consider as important. According to the scores, the
tuples can be ranked. Ranking is complementary to categorization. We can use ranking in addition to
our techniques (e.g., we rank tuples stored in the intermediate nodes and leaves). However, most exist-
ing work does not consider the diversity issue of user preferences. In contrast, we focus on addressing
the diversity issue of user preferences for the categorization approach.
There has also been a lot of work on information retrieval (Card, MacKinlay & Shneiderman, 1999;
Shen, Tan & Zhai, 2005; Finkelstein & Gabrilovich, 2001; Joachims, 2006; Sugiyama & Hatano, 2004)
using query history or other implicit feedback. However, this work focuses on searching text
documents, while this chapter focuses on searching relational data. In addition, these studies typically rank
query results, while this chapter categorizes the results. Of course, one could use the existing
hierarchical clustering techniques (Mitchell, 1997) to create the category tree. But the generated trees are not
easy for users to navigate. For example, how do we describe the tuples contained in a node? We can
use a representative tuple, but such a tuple may contain many attributes and is therefore difficult for users to read.
In contrast, the category tree used in this chapter is easy to understand because each node uses just
one attribute.
BASICS OF CATEGORIZATION
This section first introduces the query history, and then defines the category tree and the category cost.
The categorical space and exploration model are finally described.
Query History
Consider a database relation D with n tuples D = {t1,..., tn} with schema R {A1,...,Am}. Let Dom(Ai)
represent the active domain of attribute Ai.
Let H be a query history {(Q1, U1, F1),..., (Qk, Uk, Fk)} in chronological order, where Qi is a query,
Ui is a session ID (a session starts when a user connects to the database and ends when the user discon-
nects), and Fi is the importance weight of the query, which is evaluated by the frequency of the query in
H. Assume that the queries in the same session are asked by the same user, which will be used later to
prune queries. The query history can be collected using the query log of commercial database systems.
We assume that all queries only contain point or range conditions, and the query is of the form: Q =
∧i∈m(Ai θ ai), where ai ∈ Dom(Ai), θ ∈ {>, <, =, ≥, ≤, between, in}. Note that if θ is the operator between,
Ai θ ai has the format of “Ai between ai1 and ai2” or “ai1 ≤ Ai ≤ ai2”, where ai1, ai2 ∈ Dom(Ai).
D can be partitioned into a set of disjoint preference-based clusters C = {C1,..., Cq}, where each
cluster Cj corresponds to one type of user preferences. Each Cj is associated with a probability Pj that
users are interested in Cj. This set of clusters over D is inferred from the query history. We assume that
the dataset D is fixed. But, in practice D may get modified from time to time. For the purpose of this
chapter, we will assume that the clusters are generated periodically (e.g., once a month) as the set of
queries evolves and the database is updated.
Category Tree
Definition 1 (Category tree). A category tree T (V, E, L) consists of a node set V, an edge set E, and a
label set L. Each node v ∈ V has a label lab(v) ∈ L which specifies the condition on an attribute such that
the following are satisfied: (i) such conditions are point or range conditions, and the bounds in
the range conditions are called partition points; (ii) v contains a set of tuples N(v) that satisfy all
conditions on its ancestors including itself, in other words, N(v) is the subset of tuples in D that satisfies the
conjunction of the category labels of all nodes on the path from the root to v; (iii) the conditions associated with
the subcategories of an intermediate node v are on the same attribute (called the partition attribute), and define
a partition of the tuples in v.
The label of a category (or a node), therefore, solely and unambiguously describes to the user which
tuples, among those in the tuple set of the parent of v, appear under v. Hence, the user can determine whether
v contains any item that is relevant to her or not by looking just at the label and hence decide whether
to explore or ignore v. The lab(v) has the following structure:
If the categorizing attribute A is a categorical attribute: lab(v) is of the form ‘A ∈ S’ where S ⊂
Dom(A) (Dom(A) denotes the domain of values of attribute A in D). A tuple t satisfies the predicate
lab(v) if t.A ∈ S; otherwise it does not (t.A denotes the value of tuple t on attribute A).
If the categorizing attribute A is a numeric attribute: lab(v) is of the form ‘a1 ≤ A < a2’ where a1,
a2 ∈ Dom(A). A tuple t satisfies the predicate lab(v) if a1 ≤ t.A < a2; otherwise it does not.
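The two label forms and the tuple set N(v) of Definition 1 can be illustrated with a minimal Python sketch. The names `Node`, `categorical_label`, `numeric_label`, and `tuples_under`, as well as the sample attributes and values, are hypothetical and chosen only for illustration; they are not part of the chapter.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def categorical_label(attr: str, values: set) -> Callable[[Row], bool]:
    """lab(v) of the form 'A in S', where S is a set of values of A."""
    return lambda t: t[attr] in values


def numeric_label(attr: str, a1: float, a2: float) -> Callable[[Row], bool]:
    """lab(v) of the form 'a1 <= A < a2'."""
    return lambda t: a1 <= t[attr] < a2


@dataclass
class Node:
    label: Callable[[Row], bool]
    children: List["Node"] = field(default_factory=list)


def tuples_under(path: List[Node], rows: List[Row]) -> List[Row]:
    """N(v): rows satisfying the labels of every node on the root-to-v path."""
    return [t for t in rows if all(n.label(t) for n in path)]


# Illustrative data only.
houses = [
    {"id": 1, "Neighborhood": "Burien", "Price": 260000},
    {"id": 2, "Neighborhood": "Burien", "Price": 320000},
    {"id": 3, "Neighborhood": "Ballard", "Price": 280000},
]
root = Node(label=lambda t: True)          # the root accepts every tuple
burien = Node(categorical_label("Neighborhood", {"Burien"}))
cheap = Node(numeric_label("Price", 250000, 300000))
root.children.append(burien)
burien.children.append(cheap)

ids_under_burien = [t["id"] for t in tuples_under([root, burien], houses)]
ids_under_cheap = [t["id"] for t in tuples_under([root, burien, cheap], houses)]
print(ids_under_burien, ids_under_cheap)
```

Representing each label as a predicate keeps the categorical and numeric cases uniform: N(v) is simply the conjunction of the predicates along the path.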
Exploration Model
Given a category tree T over the query results, the user starts the exploration by exploring the root node.
Suppose that she has decided to explore the node v, if v is an intermediate node, she non-deterministically
(i.e., not known in advance) chooses one of the two options:
Option ‘ShowTuples’: Browse through the tuples in N(v). Note that the user needs to examine all
tuples in N(v) to make sure that she finds every tuple relevant to her.
Option ‘ShowCat’: Examine the labels of all the n subcategories of v, exploring the ones relevant
to her and ignoring the rest. More specifically, she examines the label of each subcategory vi of v,
starting from the first subcategory, and non-deterministically chooses to either explore it or ignore it. If she
chooses to ignore vi, she simply proceeds and examines the next label (of vi+1). If she chooses to explore
vi, she does so recursively based on the same exploration model, i.e., by choosing either ‘ShowTuples’
or ‘ShowCat’ if it is an intermediate node or by choosing ‘ShowTuples’ if it is a leaf node. After she
finishes the exploration of vi, she goes ahead and examines the label of the next subcategory of v (of
vi+1). When the user reaches the end of the subcategory list, she is done. Note that we assume that the
user examines the subcategories in the order in which they appear under v; it can be from top to bottom or from left
to right depending on how the tree is rendered by the user interface.
Category Cost
We assume that a user visits T in a top-down fashion, and stops at a node (intermediate node or leaf
node) that contains the tuples that she is interested in.
Let v be a node (intermediate node or leaf node) of T with N(v) tuples and Cj be a cluster in C. Cj ∩
v ≠ ∅ denotes that v contains tuples in Cj. Anc(v) denotes the set of ancestors of v including v itself, but
excluding the root. Sib(v) denotes the set of nodes at the same level as the node v including itself. Let
K1 and K2 represent the weights of visiting a tuple in the node and visiting an intermediate tree node,
respectively. Let Pj be the probability that users will be interested in cluster Cj, and let Pst be the
probability that the user goes for option ‘ShowTuples’ for an intermediate node v given that she explores v. The
category cost is defined as follows.
\[
Cost(T, C) = \sum_{v \in Node(T)} \; \sum_{C_j \cap v \neq \emptyset} P_j \Big( K_1 |N(v)| + K_2 \sum_{v_i \in Anc(v)} \Big( |Sib(v_i)| + \sum_{j=1}^{|Sib(v_i)|} P_{st}(N(v_j)) \Big) \Big) \qquad (1)
\]
The category cost of a leaf node v consists of three terms: the cost of visiting tuples in leaf node v,
the cost of visiting intermediate nodes, and the cost of visiting tuples in intermediate nodes if the user
chooses to explore them. Users need to examine the labels of all sibling nodes to select a node on the path
from the root to v, thus users have to visit $\sum_{v_i \in Anc(v)} |Sib(v_i)|$ intermediate tree nodes. Users may also
like to examine the tuples of some sibling nodes on the path from the root to v, thus users have to visit
$\sum_{v_i \in Anc(v)} \sum_{j=1}^{|Sib(v_i)|} P_{st}(N(v_j))$ tuples of intermediate tree nodes. When users reach the node v which they
would like to explore, they have to look at N(v) tuples in v. Pst is the probability that the user explores
v using ‘ShowTuples’, Pst = N(Av)/N, where N(Av) denotes the number of queries in the query history
that contain a selection condition on attribute A of node v and N is the total number of queries in the
query history. Definition 2 computes the expected cost over all clusters and nodes.
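As a rough illustration of how the category cost could be evaluated on a small tree, the following Python sketch follows Equation (1) term by term. It rests on several assumptions that are not stated in the text: Sib(v) is read as the children of v's parent (v included), the term Pst(N(vj)) is read as the ShowTuples probability of vj multiplied by |N(vj)|, and each node's Pst is supplied directly rather than derived from the query history. All class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional, Set


@dataclass
class Node:
    name: str
    tuples: Set[int]                  # N(v), stored as a set of tuple ids
    p_st: float = 0.0                 # P_st of this node (assumed to be given)
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def all_nodes(root: Node) -> Iterator[Node]:
    stack = [root]
    while stack:
        v = stack.pop()
        yield v
        stack.extend(v.children)


def anc(v: Node) -> List[Node]:
    """Anc(v): v and its ancestors, excluding the root."""
    out = []
    while v.parent is not None:
        out.append(v)
        v = v.parent
    return out


def sib(v: Node) -> List[Node]:
    """Sib(v): assumed here to be the children of v's parent, v included."""
    return v.parent.children


def category_cost(root: Node, clusters: Dict[str, Set[int]],
                  probs: Dict[str, float], k1: float, k2: float) -> float:
    total = 0.0
    for v in all_nodes(root):
        # Labels examined, plus expected tuples browsed, along the path to v.
        nav = sum(
            len(sib(vi)) + sum(s.p_st * len(s.tuples) for s in sib(vi))
            for vi in anc(v)
        )
        per_visit = k1 * len(v.tuples) + k2 * nav
        for name, members in clusters.items():
            if members & v.tuples:    # C_j and v share at least one tuple
                total += probs[name] * per_visit
    return total


# Illustrative tree: a root with two subcategories.
root = Node("root", {1, 2, 3, 4})
a = root.add(Node("a", {1, 2}, p_st=0.5))
b = root.add(Node("b", {3, 4}, p_st=0.0))
clusters = {"C1": {1, 2}, "C2": {3, 4}}
probs = {"C1": 0.6, "C2": 0.4}
cost = category_cost(root, clusters, probs, k1=1.0, k2=1.0)
print(cost)
```

Under these assumptions, a tree whose frequently requested clusters sit near the root and have few siblings yields a lower cost, which is the quantity the tree construction step tries to minimize.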
To resolve the problem of query result categorization, we propose a solution which consists of two
steps: the offline data clustering step and the online category tree construction step. In this section, we
first describe data clustering and then present the category tree construction approach.
Data Clustering
We generate preference-based clusters as follows. We first define a binary relationship R over tuples such
that (ri, rj) ∈ R if and only if two tuples ri and rj appear in the results of exactly the same set of queries
in H. If (ri, rj) ∈ R, then according to the query history ri and rj are not distinguishable, because each user that
requests ri also requests rj and vice versa. Clearly, R is reflexive, symmetric, and transitive. Thus R is an
equivalence relation and it partitions D into equivalence classes {C1,..., Cq}, where tuples equivalent to
each other are put into the same class. Those tuples not selected by any query also form a cluster
associated with zero probability (since no users are interested in them). Thus, we can define the data
clustering problem as follows.
Problem 1. Given database D and query history H, find a set of disjoint clusters C = {C1,…, Cq} such that for
any tuples ri and rj ∈ Cl, 1 ≤ l ≤ q, (ri, rj) ∈ R, and for any tuples ri and rj not in the same cluster, (ri, rj) ∉ R.
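The equivalence relation R suggests a direct way to compute the clusters of Problem 1: give each tuple a signature consisting of the set of queries that select it, and group tuples by signature. The following Python sketch shows this under the assumption that each query in H can be evaluated as a predicate on a tuple; the function name `cluster_tuples` and the sample data are hypothetical, and the sketch ignores the query clustering step that the chapter applies first.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, FrozenSet, List

Row = Dict[str, Any]
Predicate = Callable[[Row], bool]


def cluster_tuples(rows: List[Row],
                   queries: List[Predicate]) -> Dict[FrozenSet[int], List[Row]]:
    """Group rows by the exact set of queries whose results contain them."""
    groups: Dict[FrozenSet[int], List[Row]] = defaultdict(list)
    for t in rows:
        signature = frozenset(i for i, q in enumerate(queries) if q(t))
        groups[signature].append(t)
    return dict(groups)


# Illustrative data only.
houses = [
    {"id": 1, "Price": 260000, "View": "Water"},
    {"id": 2, "Price": 270000, "View": "Water"},
    {"id": 3, "Price": 320000, "View": "Water"},
    {"id": 4, "Price": 400000, "View": "None"},
]
queries = [
    lambda t: t["Price"] < 300000,     # Q1
    lambda t: t["View"] == "Water",    # Q2
]
clusters = cluster_tuples(houses, queries)
for signature, members in clusters.items():
    print(sorted(signature), [t["id"] for t in members])
```

Tuples 1 and 2 fall into the same cluster because both queries select them; tuple 4, selected by no query, ends up in the cluster with the empty signature, which corresponds to the zero-probability cluster mentioned above.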
Since the query history H may contain many queries, we need to cluster the queries in H first, and
then cluster the tuples based on the clusters of the query history. We will propose the algorithm for
query history and data clustering in the next section.
Problem 2. Given D, C, Q, find a tree T(V, E, L) such that (i) it contains all tuples in the results of Q,
and (ii) there does not exist another tree T’ satisfying (i) and with Cost (T’, C) < Cost (T, C).
The above problem can be proved to be NP-hard in a way similar to proving that the problem of
finding an optimal decision tree with a certain average length is NP-hard. Section 5 will present an ap-
proximate solution. The category tree construction algorithm is shown in Algorithm 1.
This section describes the algorithm to cluster tuples using the query history. We propose preprocessing
steps to refine the query history, which include pruning unimportant queries and clustering the queries. Based
on different kinds of user preferences, we then propose a method for generating tuple clusters.
Query Prune
The query pruning algorithm is based on the following heuristics: (i) queries with empty answers are
not useful; (ii) in the same session, a user often starts with a query that has general conditions and returns
many answers, and then continuously refines the previous query until the query returns a few interesting
answers. Therefore, only the last query in such a refinement sequence is important. The query pruning
algorithm is shown in Algorithm 2.
We identify the relationship “⊆” between queries by using the following method. Let Qi’s condition
on attribute Ai (i = 1,…,m) be ai1 ≤ Ai ≤ ai2, and Qj’s condition on attribute Ai be a’i1 ≤ Ai ≤ a’i2. Then Qi ⊆ Qj if,
for every condition in Qi, Qj either does not contain any condition on Ai or has a condition a’i1 ≤ Ai ≤
a’i2 such that a’i1 ≤ ai1 ≤ Ai ≤ ai2 ≤ a’i2. For simplicity, we use H to denote the pruned query history H’
in the following sections.
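The two pruning heuristics and the containment test can be sketched in Python as follows. This is not a reproduction of Algorithm 2, which is not listed here; it is one plausible reading in which a query is dropped when it has no answers, or when a later query in the same session refines it. The representation of a query as a mapping from attribute to a closed range, the history record layout (query, session ID, answer count), and the function names are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Query = Dict[str, Tuple[float, float]]        # attribute -> (low, high)
Record = Tuple[Query, str, int]               # (query, session id, answer count)


def contained_in(qi: Query, qj: Query) -> bool:
    """Qi is contained in Qj, following the test stated in the text: each
    range of Qi lies inside Qj's range on the same attribute, or Qj has no
    condition on that attribute."""
    for attr, (lo, hi) in qi.items():
        if attr in qj:
            lo_j, hi_j = qj[attr]
            if not (lo_j <= lo and hi <= hi_j):
                return False
    return True


def prune(history: List[Record]) -> List[Record]:
    """Drop empty-answer queries and queries refined later in their session."""
    keep = [count > 0 for _, _, count in history]
    last_in_session: Dict[str, int] = {}
    for idx, (query, session, count) in enumerate(history):
        if count == 0:
            continue                           # heuristic (i)
        prev = last_in_session.get(session)
        if prev is not None and contained_in(query, history[prev][0]):
            keep[prev] = False                 # heuristic (ii): prev was refined
        last_in_session[session] = idx
    return [rec for rec, k in zip(history, keep) if k]


# Illustrative history, in chronological order.
history = [
    ({"Price": (200000, 400000)}, "s1", 1256),
    ({"Price": (250000, 350000)}, "s1", 214),
    ({"Price": (250000, 300000), "Bedrooms": (3, 4)}, "s1", 12),
    ({"Price": (900000, 950000)}, "s2", 0),
    ({"Bedrooms": (2, 3)}, "s2", 40),
]
pruned = prune(history)
print([count for _, _, count in pruned])
```

In session s1 only the final, most refined query survives, while in session s2 the empty-answer query is discarded and the remaining query is kept.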
Query Clustering
Since there are too many queries in the query history, we should cluster the similar queries into the same
cluster and find the representative queries.
In order to quantify the similarity between the query Q1 and Q2, we adopt a typical definition of similar-
ity, the cosine similarity. For this to be defined we first need to form the vector representations of query
Q1 and Q2. Consider the set Δ of all distinct <attribute, attribute-value> pairs appearing in D, that is,
Δ = {<Ai, ai> | ∀i ∈ {1,…, d} and ∀ai ∈ Dom(Ai)}. Since Dom(Ai) is the active domain of attribute Ai, the
cardinality of this set is finite. Let it be N = |Δ| and let OD be an arbitrary but fixed order on the pairs
appearing in Δ. We refer to the i-th element of Δ based on the ordering OD by Δ[i]. A vector representation
of query Q1 = ∧j∈m(Aj θ aj) is a binary vector VQ1 of size N. The i-th element of the vector corresponds to
pair Δ[i]. If Δ[i] is contained in the conjuncts of Q1, then VQ1[i] = 1; otherwise it is 0. Analogously,
the vector representation of a query Q2 is a binary vector VQ2 of size N. The i-th element of the vector
corresponds to pair Δ[i]. If Δ[i] is contained in the conjuncts of Q2, then VQ2[i] = 1; otherwise it is 0.
Now, we can define the similarity between Q1 and Q2 using their vector representations VQ1 and VQ2
as follows:
\[
Sim(Q_1, Q_2) = \cos(V_{Q_1}, V_{Q_2}) = \frac{V_{Q_1} \cdot V_{Q_2}}{|V_{Q_1}| \, |V_{Q_2}|} \qquad (2)
\]
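Equation (2) can be checked on a toy example with a short Python sketch. The ordered list `delta` stands in for Δ under the fixed order OD, each query is given as the set of <attribute, attribute-value> pairs in its conjuncts, and only point conditions are used; how a range condition maps to pairs is not specified here, so the sketch leaves that case out. The function names and sample values are hypothetical.

```python
import math
from typing import List, Set, Tuple

Pair = Tuple[str, str]


def query_vector(query_pairs: Set[Pair], delta: List[Pair]) -> List[int]:
    """Binary vector V_Q: the i-th entry is 1 iff delta[i] occurs in the query."""
    return [1 if pair in query_pairs else 0 for pair in delta]


def cosine_similarity(v1: List[int], v2: List[int]) -> float:
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0      # treat an empty query as dissimilar


# Illustrative pairs only; the list order plays the role of the fixed order OD.
delta = [("City", "Seattle"), ("City", "Bellevue"),
         ("View", "Water"), ("View", "None")]
q1 = {("City", "Seattle"), ("View", "Water")}
q2 = {("City", "Seattle")}

v1 = query_vector(q1, delta)
v2 = query_vector(q2, delta)
sim = cosine_similarity(v1, v2)
print(v1, v2, round(sim, 4))
```

Because the vectors are binary, the dot product is the number of pairs the two queries share, and each norm is the square root of the number of pairs in the corresponding query.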
In order to quantify how well a query Q1 is represented by another query Q2, we need to define a
distance measure between two queries. Based on the similarity mentioned above, the distance between
Q1 and Q2 can be defined as
Based on the definitions above, the query clustering problem can be defined. Let H be the set of m
queries in the query history: H = {Q1,…, Qm}. Then we need to find a set of k queries Hk = {Q̄1,…, Q̄k} (k
< m) such that:
We call the queries in set Hk representative queries and associate with each representative query Q̄j
a set of queries QCj = {Qi | Q̄j = arg min j’ d(Qi, Q̄j’)}.
The problem of queries clustering is the same as the k-median problem, which is well known to be
NP-hard. An instance of the metric k-median problem consists of a metric space χ = (X, c),
where X is a set of points and c is a distance function (also called the cost) that specifies the distance cxy
≥ 0 between any pair of points x, y ∈ X. The distance function is reflexive, symmetric, and satisfies the
triangle inequality. Given a set of points F ⊆ X, the cost of F is defined by cost(F) = ∑_{x∈X} c_{xF}, where
c_{xF} = min_{f∈F} c_{xf} for x ∈ X. The objective is to find a k-element set F ⊆ X that minimizes cost(F)
(Chrobak, Kenyon & Young, 2005). The queries clustering problem can therefore be treated as a k-median
problem and is also NP-hard, so we have to resort to approximation algorithms for solving it.
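To make the objective concrete, the sketch below evaluates cost(F) for a candidate set of centers and finds the best k-element set by exhaustive search. The points and the absolute-difference distance are hypothetical; the exhaustive search is only practical for tiny inputs, which is why an approximation algorithm is needed in general.

```python
from itertools import combinations

def kmedian_cost(points, centers, dist):
    """cost(F): each point pays the distance to its closest center in F."""
    return sum(min(dist(x, f) for f in centers) for x in points)

def brute_force_kmedian(points, k, dist):
    """Try every k-element subset of the points (exponential in general)."""
    return min(combinations(points, k),
               key=lambda centers: kmedian_cost(points, centers, dist))

# Hypothetical one-dimensional instance.
points = [0, 1, 2, 10, 11]
dist = lambda x, y: abs(x - y)

best = brute_force_kmedian(points, 2, dist)
best_cost = kmedian_cost(points, best, dist)
```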
Input: A pruned query history H with m queries: H = {Q1,…, Qm}, a set of all
Stars: U = {⟨Qi, QCi⟩ | Qi ∈ H, QCi ⊆ H}, k
Output: A set of k query clusters Hk = {⟨Q1, QC1⟩,…, ⟨Qk, QCk⟩}
1. Let B = {} be a buffer that can hold m Stars ⟨Qi, QCi⟩
2. While H ≠ ∅ and k > 0 Do
3.   B ← ∅
4.   For each Qi ∈ H Do
5.     Pick si = ⟨Qi, QCi⟩ with minimum rsi from Ui = {⟨Qi, QCi⟩ | QCi ⊆ H, |QCi| ∈ [2, |H| − k + 1]}
6.     B ← B + {si}
7.   End For
8.   Pick s = ⟨Qi, QCi⟩ with minimum rs from B
9.   H ← H − QCi − {Qi}, Hk ← Hk + s, k ← k − 1
10. End While
11. Return Hk
Algorithmic Solution
For clustering queries, we propose a novel approach that can find a near-globally optimal solution
with low time complexity. The approach is described as follows.
Observing the solution of the queries clustering problem, we find that every representative query is
connected with some other queries of H, and these connections form star structures. We call such a
connection a Star. We can then re-define the queries clustering problem as follows. Let U be the set of all
Stars, i.e., U = {⟨Qi, QCi⟩ | Qi ∈ H, QCi ⊆ H}. The cost of each Star s = ⟨Qi, QCi⟩ ∈ U is denoted as
cs = ∑_{Qj∈QCi} d(Qi, Qj). Let rs = cs/|QCi| be the performance-price ratio. Our objective is to find a set of
Stars S ⊆ U that minimizes the cost, such that there are k representative queries in S and every original
query Qj ∈ H appears at least once in some Star s ∈ S.
For solving this problem, we propose an approach which consists of two parts: a pre-processing part
and a processing part. In the pre-processing part, we build a sequential permutation ki = {Qi1, Qi2,..., Qim} over
H for each query Qi ∈ H, where Qi1, Qi2,..., Qim ∈ H and the queries in ki are arranged in non-decreasing
order of their cost with respect to Qi, that is, d(Qi, Qi1) ≤ d(Qi, Qi2) ≤ ... ≤ d(Qi, Qim). Such permutations
allow us to consider only the first l queries in ki, rather than all queries in ki, when we build the Star
for Qi. Note that the number l should be chosen appropriately. The complexity of the pre-processing
part is O(|H|² log|H|), where |H| denotes the number of queries in H.
The task of the processing part is to cluster the queries by using the Greedy-Refine algorithm (Algorithm 3)
based on the Stars formed in the pre-processing part. The input is the set of all Stars formed in the
pre-processing part. For each Qi ∈ H, the algorithm picks the Star si with the minimal rs in Ui (the set of
all Stars in U corresponding to Qi, Ui ⊆ U) and puts it in the set B. From the set B, the algorithm chooses
the Star s with the minimal rs and adds it to the objective set Hk. The algorithm then removes Qi and QCi
from H. The algorithm stops when the set Hk has k elements. The output is a set of k pairs of the form
⟨Qi, QCi⟩, where Qi is a representative query (i.e., the center of the cluster QCi) and QCi is the
corresponding query cluster. The time complexity of the processing part is O(|H|k), and thus the
algorithm runs in polynomial time (Meng & Ma, 2008).
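The two parts can be sketched in Python as follows. This is one possible reading of the pseudocode, not the authors' implementation, and it rests on several assumptions: a Star is taken to be a center plus its nearest remaining neighbours, the Star size counts the center, ties are broken by query index, a singleton Star is used when no larger Star is permitted, and, after the k representatives are chosen, every query is assigned to its nearest representative. The numeric "queries" and the absolute-difference distance are hypothetical stand-ins for query vectors and the query distance.

```python
def greedy_refine(queries, k, dist):
    n = len(queries)
    # Pre-processing: for each query, all other queries by non-decreasing distance.
    order = {
        i: sorted((j for j in range(n) if j != i),
                  key=lambda j, i=i: dist(queries[i], queries[j]))
        for i in range(n)
    }
    remaining = set(range(n))
    reps = []
    # Processing: repeatedly pick the Star with the smallest ratio r_s.
    while remaining and len(reps) < k:
        k_left = k - len(reps)
        max_size = len(remaining) - k_left + 1  # largest Star allowed, center included
        best = None  # (ratio, center, members)
        for i in sorted(remaining):
            neighbours = [j for j in order[i] if j in remaining]
            cand = (0.0, i, [i])  # assumed fallback: singleton Star
            cost, members = 0.0, [i]
            for idx, j in enumerate(neighbours[:max(max_size - 1, 0)]):
                cost += dist(queries[i], queries[j])
                members.append(j)
                ratio = cost / len(members)
                if idx == 0 or ratio < cand[0]:
                    cand = (ratio, i, list(members))
            if best is None or cand[0] < best[0]:
                best = cand
        _, center, members = best
        reps.append(center)
        remaining -= set(members)
    # Assumed final step: attach every query to its nearest representative.
    clusters = {r: [] for r in reps}
    for i in range(n):
        nearest = min(reps, key=lambda r: dist(queries[i], queries[r]))
        clusters[nearest].append(i)
    return clusters

# Hypothetical data: three well-separated pairs.
queries = [0, 1, 50, 51, 90, 91]
clusters = greedy_refine(queries, 3, lambda a, b: abs(a - b))
```

On this input the sketch returns three clusters keyed by the index of their representative query.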
After queries pruning and clustering, we get a set of query clusters QC1,…, QCk. For each tuple ti, we
generate a set Si consisting of the query clusters in which at least one query returns ti. That is,
Si = {QCp | ∃ Qj ∈ QCp such that ti is returned by Qj}. We then group tuples according to their Si, and each
group forms a cluster. Each cluster is assigned a class label. The probability of users being interested in
cluster Ci is computed as the sum of probabilities that a user asks a query in Si. This equals the sum of
frequencies of queries in Si divided by the sum of frequencies of all queries in the pruned query history H.
Example 2. Suppose that there are four queries Q1, Q2, Q3, and Q4 and 15 tuples r1, r2, …, r15. Q1 returns
first 10 tuples r1, r2, …, r10, Q2 returns the first 9 tuples r1, r2, …, r9, and r14, Q3 returns r11, r12 and r14,
and Q4 returns r15. Obviously, the first 9 tuples r1, r2, …, r9 are equivalent to each other since they are
returned by both Q1 and Q2. The data can be divided into five clusters {r1, r2, …, r9} (returned by Q1,
Q2), {r10} (returned by Q1 only), {r11, r12, r14} (returned by Q3), {r15} (returned by Q4), and {r13} (not
returned by any query).
In Example 2, after clustering we get three query clusters {Q1, Q2}, {Q3} and {Q4}. Four clusters C1, C2, C3,
and C4 will be generated. The cluster C1 corresponds to Q1 and Q2 and contains the first 10 tuples, with
probability P1 = 2/4 = 0.5. The cluster C2 corresponds to Q3 and contains r11, r12, r14, with probability P2
= 1/4 = 0.25. The cluster C3 corresponds to Q4 and contains r15, with probability P3 = 1/4 = 0.25. The
cluster C4 contains r13, with probability 0, because r13 is not returned by any query. The data clusters
generating algorithm is shown in Algorithm 4.
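The grouping step can be sketched as below. This is an illustrative reading of the description, not Algorithm 4 itself. The data follows Example 2 with two assumptions: every query has frequency 1, and r14 is treated as returned by Q3 only, which is how the cluster listing in the example places it.

```python
from collections import defaultdict

def cluster_tuples(tuples, query_results, query_clusters, freq):
    """Group tuples by S_i, the set of query clusters that return them."""
    total = sum(freq.values())
    groups = defaultdict(list)
    for t in tuples:
        s_i = frozenset(
            label for label, queries in query_clusters.items()
            if any(t in query_results[q] for q in queries)
        )
        groups[s_i].append(t)
    clusters = {}
    for s_i, members in groups.items():
        # Probability: frequency of the queries in S_i over all frequencies.
        asked = sum(freq[q] for label in s_i for q in query_clusters[label])
        clusters[tuple(sorted(s_i))] = (members, asked / total)
    return clusters

tuples = [f"r{i}" for i in range(1, 16)]
query_results = {
    "Q1": {f"r{i}" for i in range(1, 11)},
    "Q2": {f"r{i}" for i in range(1, 10)},   # assumption: r14 left to Q3 only
    "Q3": {"r11", "r12", "r14"},
    "Q4": {"r15"},
}
query_clusters = {"QC1": ["Q1", "Q2"], "QC2": ["Q3"], "QC3": ["Q4"]}
freq = {"Q1": 1, "Q2": 1, "Q3": 1, "Q4": 1}

clusters = cluster_tuples(tuples, query_results, query_clusters, freq)
```

Under these assumptions the sketch reproduces the four clusters and the probabilities 0.5, 0.25, 0.25 and 0 given in the example.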
This section proposes the category tree construction algorithm. Section 5.1 gives an overview of our
algorithm. Section 5.2 presents a novel partitioning criterion that considers the cost of visiting both
intermediate and leaf nodes.
Algorithm Overview
A category tree is very similar to a decision tree. There are many well-known decision tree construction
algorithms such as ID3 (Quinlan, 1986), C4.5 (Quinlan, 1993), and CART (Breiman, Friedman & Stone,
1984). However, the existing decision tree construction algorithms aim at minimizing the impurity of
data (Quinlan, 1993) (represented by information gain, etc.). Our goal is to minimize the category cost,
which includes both the cost of visiting intermediate tree nodes (and the cost of visiting tuples in the
intermediate nodes if the user explores them) and the cost of visiting tuples stored in leaf nodes.
For building the category tree, we make use of some ideas from the solution presented by Chen and Li
(Chen & Li, 2007) and propose improved algorithms. The problems our algorithm has to resolve
include (i) eliminating a subset of relatively unattractive attributes without considering any of their
partitions, and (ii) for every attribute selected above, obtaining a good partition efficiently instead of
enumerating all the possible partitions. Finally, we construct the category tree by choosing the attribute
and its partition that have the least cost.
Since the presence of a selection condition on an attribute in the query history reflects the user’s interest in
that attribute, attributes that occur infrequently in the query history can be omitted while constructing
the category tree. Let N(A) be the number of queries in the query history that contain a selection condition
on attribute A and N be the total number of queries in the query history. We eliminate the uninteresting
attributes using the following rule: if an attribute A occurs in less than a fraction x of the queries in the query
history, i.e., N(A)/N < x, we eliminate A. The threshold x needs to be specified by the system domain
expert. For example, for the real estate searching application, if we use x = 0.4, only 7 attributes, namely
Price, SqFt, Livingarea, View, Neighborhood, Schooldistrict, and Bedrooms, are retained from among 25
attributes in the MSN House&Home dataset. The algorithm for eliminating the uninteresting attributes
is shown in Algorithm 5.
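A minimal sketch of this elimination rule is given below. It is not Algorithm 5 itself. Each query is represented only by the set of attributes it constrains, and the small query history is hypothetical.

```python
def retained_attributes(query_history, x):
    """Keep attribute A only if N(A)/N >= x, i.e., eliminate it when N(A)/N < x."""
    n = len(query_history)
    counts = {}
    for query in query_history:
        for attr in set(query):  # count each attribute once per query
            counts[attr] = counts.get(attr, 0) + 1
    return {attr for attr, c in counts.items() if c / n >= x}

# Hypothetical history: the attributes constrained by each of five queries.
history = [
    {"Price", "View"},
    {"Price", "SqFt"},
    {"Price", "View"},
    {"Price"},
    {"Garage"},
]
kept = retained_attributes(history, 0.4)
```

Here Price occurs in 4 of 5 queries and View in 2 of 5, so both reach the threshold, while SqFt and Garage (1 of 5 each) are eliminated.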
For a categorical attribute, a new subcategory will be created with one branch for each value of the
attribute, and the information gain (we will discuss how to compute the information gain in Section 5.2)
will be computed over that subcategory. If a categorical attribute has too many values and thus generates
too many branches, we can add intermediate levels to that attribute. A categorical attribute can only
generate one possible partition, and it will be removed from AR if it is selected as the partition attribute.
For a numeric attribute Ai, we use a binary partition, i.e., Ai ≤ v or Ai > v. The
algorithm will generate one subcategory for every possible partition point and compute the information
gain for that partition point. The best partition point will be selected and the gain of the best partition
is the gain of the attribute. If the gain-ratio of the attribute with the maximal gain-ratio exceeds a pre-
defined threshold λ, the tree will be expanded by adding the selected subcategory to the current root.
Algorithm Solution
Based on the solutions mentioned above, we can now describe how we construct a category tree. Since
the problem of finding a tree with minimal category cost is NP-hard, we propose an approximate algo-
rithm (see Algorithm 6).
After building the category tree, the user can follow the branches of the tree to find the interesting
answers. As mentioned above, the user can explore the category tree in two modes, i.e., showing tuples
(option ‘ShowTuples’) and showing categories (option ‘ShowCat’). When the user chooses the option
‘ShowTuples’ on a node (an intermediate node or a leaf node), the system will provide the items satisfying
all conditions from the root to the current node. The category tree accessing algorithm is shown in
Algorithm 7.
We first describe how to compute the information gain which acts as the partition criteria, and then we
give the cost estimation of visiting the intermediate nodes and the leaves.
Partition Criteria
Existing decision tree construction algorithms such as C4.5 compute an information gain to measure
how well an attribute classifies data. Given a decision tree T with N tuples and n classes, where each
class Ci in T has Ni tuples, the entropy can be defined as follows:
$$E(T) = -\sum_{i=1}^{n} \frac{N_i}{N} \log \frac{N_i}{N} \tag{6}$$
In real applications, there may be several distinct values in the domain of an attribute A. For each
attribute value v of A, let NTi be the number of tuples with the attribute value v of A in class Ci, and thus
the conditional entropy can be defined as
$$E_A(v) = \sum_{i=1}^{n} \frac{N_{T_i}}{N} \times E(T_i) \tag{7}$$
For example, consider a fraction of the results (shown in Table 1) returned by the MSN House&Home Web
database for a query with the condition “Price between 250000 and 350000 and City = Seattle”. We
use it to describe how to obtain the best partition attribute by using the formulas defined above.
Here, we assume the decision attributes are View, Schooldistrict, Livingarea, and SqFt. We first
compute the entropy of tree T,
$$E(T) = E(C_1, C_2, C_3) = -\left(\frac{5}{15}\log\frac{5}{15} + \frac{6}{15}\log\frac{6}{15} + \frac{4}{15}\log\frac{4}{15}\right) = 0.471293.$$
Then, we compute the entropy of each decision attribute. The attribute “View” has four
distinct values, ‘Water’, ‘Mountain’, ‘GreenBelt’, and ‘Street’, and the entropy of each value is
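$$E_{View}(\mathrm{Water}) = -\frac{5}{6}\log\frac{5}{6} - \frac{1}{6}\log\frac{1}{6} = 0.195676,$$
$$E_{View}(\mathrm{Mountain}) = -\frac{0}{3}\log\frac{0}{3} - \frac{2}{3}\log\frac{2}{3} - \frac{1}{3}\log\frac{1}{3} = 0.276434,$$
$$E_{View}(\mathrm{GreenBelt}) = -\frac{0}{3}\log\frac{0}{3} - \frac{2}{3}\log\frac{2}{3} - \frac{1}{3}\log\frac{1}{3} = 0.276434,$$
$$E_{View}(\mathrm{Street}) = -\frac{0}{3}\log\frac{0}{3} - \frac{2}{3}\log\frac{2}{3} - \frac{1}{3}\log\frac{1}{3} = 0.276434.$$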
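The entropy values above can be checked with a few lines of Python. The printed figures are consistent with a base-10 logarithm, so the sketch assumes base 10; it also assumes the usual convention that a class with zero tuples contributes nothing to the sum.

```python
import math

def entropy(class_counts):
    """Entropy of a class distribution given as tuple counts per class."""
    total = sum(class_counts)
    return -sum((c / total) * math.log10(c / total)
                for c in class_counts if c > 0)  # 0 * log 0 is taken as 0

e_tree = entropy([5, 6, 4])        # E(T) for classes of 5, 6 and 4 tuples
e_water = entropy([5, 1])          # E_View(Water)
e_mountain = entropy([0, 2, 1])    # E_View(Mountain)
```

With these inputs the sketch gives approximately 0.4713, 0.1957 and 0.2764, which agree with the values in the text to the precision shown here.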
Next,
Analogously,
Cost Estimation
$$N(t)\sum_{C_l \cap t \neq \emptyset} P_l - \sum_{j=1,2} N(t_j)\Bigl(\sum_{C_i \cap t_j \neq \emptyset} P_i\Bigr) \tag{9}$$
The decision tree construction algorithms do not consider the cost of visiting leaf tuples. For example,
consider a partition that generates two nodes that contain tuples with labels (C1, C2, C1) and (C2), and
a partition that generates two nodes that contain tuples with labels (C2, C1, C2) and (C1). According to
the discussion in Section 5.4.1, these two partitions have the same information gain. However, if P1 =
0.5 and P2 = 0, then the category cost for the first partition is smaller because the cost is 1.5 for the first
partition and is 2 for the second partition.
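The two costs in this example can be reproduced with a short sketch. It assumes, following the second term of Formula (9), that the cost of visiting the leaf tuples of a partition is the sum over its nodes of N(tj) times the total probability of the classes present in tj.

```python
def leaf_visit_cost(partition, prob):
    """Sum over nodes of (number of tuples) * (probability of classes present)."""
    return sum(len(node) * sum(prob[c] for c in set(node)) for node in partition)

prob = {"C1": 0.5, "C2": 0.0}
first = [["C1", "C2", "C1"], ["C2"]]
second = [["C2", "C1", "C2"], ["C1"]]

cost_first = leaf_visit_cost(first, prob)    # 3 * 0.5 + 1 * 0.0
cost_second = leaf_visit_cost(second, prob)  # 3 * 0.5 + 1 * 0.5
```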
$$\sum_{1 \le i \le k} \frac{N_i}{N}(\log N - \log N_i) = -\sum_{i=1}^{k} \frac{N_i}{N} \log \frac{N_i}{N} \tag{10}$$
This is exactly the entropy E(t). Note that most existing decision tree algorithms choose the partition
that maximizes information gain. Information gain is the reduction of entropy due to a partition and is
represented in the following formula,
$$IGain(t, t_1, t_2) = E(t) - \frac{N_1}{N} E(t_1) - \frac{N_2}{N} E(t_2) \tag{11}$$
Thus a partition with a high information gain will generate a tree with a low entropy, and this tree
will have short root-to-leaf paths as well. Since the cost of visiting intermediate nodes equals the
product of path lengths and fan-out in Definition 2, if we assume the average fan-out is about the same for
all trees, then the cost of visiting intermediate nodes is proportional to the length of root-to-leaf paths.
Therefore, information gain can be used to estimate the cost reduction of visiting intermediate nodes.
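Formula (11) can be sketched directly in Python. The example reuses the two partitions discussed earlier, with labels (C1, C2, C1) and (C2) versus (C2, C1, C2) and (C1), and checks that they have the same information gain. The base-10 logarithm is an assumption carried over from the printed entropy values.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of the class labels of the tuples in a node."""
    n = len(labels)
    return -sum((c / n) * math.log10(c / n) for c in Counter(labels).values())

def info_gain(t, t1, t2):
    """IGain(t, t1, t2) = E(t) - N1/N * E(t1) - N2/N * E(t2)."""
    n = len(t)
    return entropy(t) - len(t1) / n * entropy(t1) - len(t2) / n * entropy(t2)

t = ["C1", "C2", "C1", "C2"]
gain_first = info_gain(t, ["C1", "C2", "C1"], ["C2"])
gain_second = info_gain(t, ["C2", "C1", "C2"], ["C1"])
```

Both partitions yield the same gain, so information gain alone cannot distinguish them; the cost of visiting leaf tuples is what separates them.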
Combining Costs
The remaining problem is how to combine the three types of costs. Here we take a normalization ap-
proach, which uses the following formula to estimate the gain of partitioning t into t1 and t2,
$$\frac{IGain(t, t_1, t_2) / E(t)}{\Bigl(\bigl(\sum_{j=1,2} N(t_j) \sum_{C_i \cap t_j \neq \emptyset} P_i\bigr) / \bigl(N(t) \sum_{C_l \cap t \neq \emptyset} P_l\bigr)\Bigr) \ast P_{st} N(t)} \tag{12}$$
The denominator is the product of the cost of visiting leaf tuples after the partition, normalized by the
cost before the partition, and the cost of visiting the tuples in t. A partition always reduces the cost of
visiting tuples (the proof is straightforward). Thus the denominator lies in (0, 1]. The numerator is
the information gain normalized by the entropy of t. We compute a ratio between these two terms rather
than the sum of the numerator and (1 − denominator) because in practice the numerator (information gain)
is often quite small. Thus the ratio is more sensitive to the numerator when the denominator is similar.
Complexity Analysis
Let n be the number of tuples in query results, m be the number of attributes, and k be the number of
classes. The gain in Formula 5 can be computed in O(k) time. C4.5 also uses several optimizations such
as computing the gains for all partition points of an attribute in one pass, sorting all tuples on different
attribute values beforehand, and reusing the sort order. The cost of sorting tuples on different attribute
values is O(mnlogn), and the cost of computing gains for all possible partitions at one node is O(mnk)
because there are at most m partition attributes and n possible partition points, and each gain can be
computed in O(k) time. If we assume the generated tree has O(logn) levels, the total time is O(mnklogn).
EXPERIMENTAL EVALUATION
In this section, we describe our experiments, report the experimental results and compare our approach
with several existing approaches.
Experimental Setup
We used Microsoft SQL Server 2005 RDBMS on a P4 3.2-GHz PC with 1 GB of RAM for our experi-
ments.
Dataset: For our evaluation, we set up a real estate database HouseDB (Price, SqFt, Bedrooms,
Bathrooms, Livingarea, Schooldistrict, View, Neighborhood, Boat, Garage, Buildyear…) containing
1,700,000 tuples extracted from the MSN House&Home Web site. There are 27 attributes, 10 numerical
and 17 categorical. The total data size is 20 MB.
Query history: In our experiments, we requested 40 subjects to behave as different kinds of house
buyers, such as rich people, clerks, workers, women, young couples, etc., and post queries against the
database. We collected 2000 queries for the database and these queries are used as the query history.
Each subject was asked to submit 15 queries for HouseDB; each query had 2 to 6 conditions and
4.2 specified attributes on average. We assume each query has equal weight. We did observe that users
started with a general query which returned many answers, and then gradually refined the query until it
returned a small number of answers.
Algorithm: We implemented all algorithms in C# and connected to the RDBMS through ADO.
The clusters are stored by adding a column to the data table to store the class labels of each tuple. The
stopping threshold λ in build tree algorithm is set to 0.002. We have developed an interface that allows
users to classify query results using generated trees.
Comparison: We compare our tree construction algorithm (henceforth referred to as the Cost-based algorithm)
with the algorithm proposed by Chakrabarti et al. (Chakrabarti, Chaudhuri & Hwang, 2004) (henceforth
referred to as the Greedy algorithm). It differs from our algorithm in two aspects: (i) it does not consider
different user preferences, and (ii) it does not consider the cost of intermediate nodes generated by
future partitions. We also compare with the algorithm proposed by Chen et al. (Chen & Li, 2007) (henceforth
referred to as the C4.5-Categorization algorithm), which first uses the merging queries step to generate data
clusters and corresponding labels, and then uses a modified C4.5 to create the navigational tree. It differs
from our algorithm in two aspects: (i) it needs to execute queries on the dataset to evaluate the similarity
of queries and then merge the similar queries, and (ii) it cannot expand the intermediate nodes to
show tuples and thus does not consider the cost of visiting tuples of intermediate nodes.
Setup of user study: We conducted an empirical study by asking 5 subjects (with no overlap with
the 40 users submitting the query history) to use this interface. The subjects were randomly selected
colleagues, students, etc. Each subject was given a tutorial about how to use this interface. Next, each
subject was given the results of the 5 queries listed in Table 2, which do not appear in the query history.
For each such query, the subject was asked to navigate the trees generated by the three algorithms
mentioned above, and to select 5-10 houses that he or she would like to buy.
The experiment aims at comparing the cost of three categorization algorithms and showing the efficiency
of our categorization approach. The actual category cost is defined as follows:
$$\mathrm{ActCost} = \sum_{\text{leaf } v \text{ visited by a subject}} \Bigl( K_1 N(v) + K_2 \sum_{v_i \in \mathrm{Anc}(v)} \Bigl( |\mathrm{Sib}(v_i)| + \sum_{j=1}^{|\mathrm{Sib}(v_i)|} P_{st}(N(v_j)) \Bigr) \Bigr) \tag{13}$$
Unlike the category cost in Definition 2, this cost is the real count of intermediate nodes (including siblings)
and tuples visited by a subject. We assume the weights for visiting intermediate nodes and visiting tuples
are equal, i.e., K1 = K2 = 1. In general, the lower the total category cost, the better the categorization method.
Figure 3 shows the total actual cost, averaged over all the subjects, for the Cost-based, C4.5-Categorization,
and Greedy algorithms. Figure 4 reports the average number of houses selected by each subject. Figure
5 reports the average category cost per selected house for these algorithms.
The results show that the category trees generated by the Cost-based algorithm have the lowest actual
cost and the lowest average cost per selected house (the number of query clusters k was set to 30). Users
also found more houses worth considering with our algorithm than with the other two algorithms,
suggesting that our method makes it easier for users to find interesting houses. The tree generated by the Greedy
algorithm has the worst results. This is expected because the Greedy algorithm ignores different user
preferences and does not consider future partitions when generating category trees. The
C4.5-Categorization algorithm also has a higher cost than our method. The reason is that our algorithm uses a
partitioning criterion that considers the cost of visiting the tuples in intermediate nodes, while the
C4.5-Categorization algorithm does not. Moreover, our algorithm can use a few clusters to represent a large
number of tuples without losing accuracy (this will be tested in the next experiment).
The results show that using our approach, on average a subject only needs to visit no more than 8
tuples or intermediate nodes for queries Q1, Q2, Q3, and Q4 to find the first relevant tuple, and needs to
visit about 18 tuples or intermediate nodes for Q5. The total navigational cost for our algorithm is less
than 45 for the former four queries, and is less than 80 for Q5. At the end of the study, we asked subjects
which categorization algorithm worked the best for them among all the queries they tried. The result of
that survey is reported in Table 3 and shows that a majority of subjects considered our algorithm the best.
This experiment aims at testing the quality of the queries clustering algorithm, whose accuracy
has a great impact on the accuracy of the tuple clusters. We first translated each query in the
query history into its corresponding vector representation, and then adopted the following strategy
to generate synthetic datasets. Every dataset is characterized by 4 parameters: n, m, l, and noise. Here n
is the number of vector elements, i.e., the number of <attribute, value> pairs in the set Δ (note that each
query in the query history is translated into its vector representation), m is the number of input queries,
and l is the number of true underlying clusters. We set ‘0’ and ‘1’ at random on the n elements and then
generate l random queries by sampling at random the space of all possible permutations of n elements.
These initial queries form the centers around which we build each of the clusters. The task of the algorithms is
to rediscover the clustering model used for the data generation. Given a cluster center, each query from
the same cluster is generated by adding to the center a specified amount of noise of a specific type. We
consider two types of noise: swaps and shifts. A swap means that a ‘0’ element and a ‘1’ element of the initial
order are picked and their positions in the order are exchanged. For a shift, we pick a random element
and move it to a new position, either earlier or later in the order. All elements between the
new and the old positions of the element are shifted one position down (or up). The amount of noise is
the number of swaps or shifts we make.
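A rough sketch of this generation procedure with swap noise is shown below. It is not the generator used in the experiments. As a simplification, each center is drawn as a vector of independent random bits instead of a random permutation of a fixed 0/1 assignment, queries are distributed over the clusters round-robin, and the function names and the fixed seed are hypothetical.

```python
import random

def random_center(n, rng):
    """A random binary query vector of length n (simplified center)."""
    return [rng.randint(0, 1) for _ in range(n)]

def add_swap_noise(center, swaps, rng):
    """Exchange the positions of a '0' and a '1' element, `swaps` times."""
    noisy = list(center)
    zeros = [i for i, bit in enumerate(noisy) if bit == 0]
    ones = [i for i, bit in enumerate(noisy) if bit == 1]
    if not zeros or not ones:
        return noisy  # nothing to swap in an all-0 or all-1 vector
    for _ in range(swaps):
        zi, oi = rng.randrange(len(zeros)), rng.randrange(len(ones))
        z, o = zeros[zi], ones[oi]
        noisy[z], noisy[o] = 1, 0
        zeros[zi], ones[oi] = o, z  # keep the index lists consistent
    return noisy

def generate_dataset(n, m, l, noise, seed=0):
    """Return l centers and m (cluster id, noisy query vector) pairs."""
    rng = random.Random(seed)
    centers = [random_center(n, rng) for _ in range(l)]
    data = [(q % l, add_swap_noise(centers[q % l], noise, rng)) for q in range(m)]
    return centers, data

centers, data = generate_dataset(n=20, m=12, l=3, noise=2)
```

Each swap changes at most two positions, so a query generated with a given amount of noise differs from its center in at most twice that many elements while keeping the same number of ‘1’ entries.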
We experiment with datasets generated for the following parameters: n = 300, m = 600, l = {4, 8, 16,
32}, noise = {2, 4, 8,…, 128} for swaps. Figure 6 shows the performance of the algorithms as a function
of the amount of noise. The y axis is the ratio F(A)/F(INP), for A = {Greedy, Greedy-Refine}, where
the Greedy-Refine algorithm is proposed in this chapter (Algorithm 3), while the Greedy algorithm
is proposed in [2]. We compare them here since they both aim at solving the same problem (the clustering
problem). F(A) is the total cost of the solution provided by algorithm A when the distance shown
in Equation (3) is used as the distance measure between queries. F(INP) corresponds to the cost of
the clustering structure (Equation 4) used in the data generation process.
From Figure 6 we can see that the Greedy-Refine algorithm performs much better than the Greedy
algorithm. The reason is that Greedy-Refine is executed on queries that were arranged according
to their cost in the pre-processing phase, and it makes two greedy selections in the processing phase, so that it
can obtain a near-globally optimal solution.
Performance Report
Figure 7 reports the tree construction time of our algorithm for the 5 test queries (since the execution time
of Q5 is much longer than that of the first 4 queries, we do not show its histogram in the figure). Our algorithm
took no more than 2.4 seconds for the first 4 queries, which returned several hundred results. It
took about 4 seconds for the 5th query, which returned 16,213 tuples. Thus our algorithm can be used in an
interactive environment.
CONCLUSION
This chapter proposed a categorization approach to address diverse user preferences, which can help
users navigate many query results. This approach first summarized preferences of all users in the system
by clustering the query history, and then divided tuples into clusters using the different kinds of user
preferences. When a specific user issues a query, our approach creates a category tree over the clusters
appearing in the results of the query to help users navigate these results. Our approach differs from the
several existing approaches in two aspects: (i) our approach does not require a user profile or a meaning-
ful query when deciding the user preferences for a specific user, and (ii) the category tree construction
algorithm proposed in this chapter considers both the cost of visiting intermediate nodes (including the
cost of visiting the tuples in intermediate nodes) and the cost of visiting the tuples in leaf nodes. In the
future, we will investigate how to accommodate the dynamic nature of user preferences and how to
integrate the ranking approach into our approach.
REFERENCES
Agrawal, R., Rantzau, R., & Terzi, E. (2006). Context-sensitive ranking. Proceedings of the ACM SIG-
MOD International Conference on Management of Data, (pp. 383-394).
Agrawal, S., Chaudhuri, S., Das, G., & Gionis, A. (2003). Automated ranking of database query results.
ACM Transactions on Database Systems, 28(2), 140–174.
Ahlberg, C., & Shneiderman, B. (1994). Visual information seeking: Tight coupling of dynamic query
filters with starfield displays. Proceedings on Human Factors in Computing Systems, (pp. 313–317).
Breiman, L., Friedman, J., Stone, C. J., & Olshen, R. (1984). Classification and regression trees. Boca
Raton, FL: CRC Press.
Bruno, N., Gravano, L., & Marian, A. (2002). Evaluating top-k queries over Web-accessible databases.
Proceedings of the 18th International Conference on Data Engineering, (pp. 369-380).
Card, S., MacKinlay, J., & Shneiderman, B. (1999). Readings in information visualization: using vision
to think. Morgan Kaufmann.
Chakrabarti, K., Chaudhuri, S., & Hwang, S. (2004). Automatic categorization of query results. Proceed-
ings of the ACM SIGMOD International Conference on Management of Data, (pp. 755–766).
Chaudhuri, S., Das, G., Hristidis, V., & Weikum, G. (2004). Probabilistic ranking of database query
results. Proceedings of the 30th International Conference on Very Large Data Base, (pp. 888–899).
Chen, Z. Y., & Li, T. (2007). Addressing diverse user preferences in SQL-Query-Result navigation.
Proceedings of the ACM SIGMOD International Conference on Management of Data, (pp. 641-652).
Chrobak, M., Kenyon, C., & Young, N. (2005). The reverse greedy algorithm for the metric k-median
problem. Information Processing Letters, 97, 68–72. doi:10.1016/j.ipl.2005.09.009
Das, G., Hristidis, V., Kapoor, N., & Sudarshan, S. (2006). Ordering the attributes of query results.
Proceedings of the ACM SIGMOD International Conference on Management of Data, (pp. 395-406).
Dhillon, I. S., Mallela, S., & Kumar, R. (2002). Enhanced word clustering for hierarchical text classifica-
tion. Proceedings of the 8th ACM SIGKDD International Conference, (pp. 191–200).
Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., et al. (2001). Placing
search in context: The concept revisited. Proceedings of the 9th International World Wide Web Confer-
ence, (pp. 406–414).
Geerts, F., Mannila, H., & Terzi, E. (2004). Relational link-based ranking. Proceedings of the 30th
International Conference on Very Large Data Base, (pp. 552-563).
Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant
features. Proceedings of the European Conference on Machine Learning, (pp. 137–142).
Joachims, T. (2002). Optimizing search engines using clickthrough data. Proceedings of the ACM Con-
ference on Knowledge Discovery and Data Mining, (pp. 133–142).
Kießling, W. (2002). Foundations of preferences in database systems. Proceedings of the 28th Interna-
tional Conference on Very Large Data Bases, (pp. 311-322).
Koller, D., & Sahami, M. (1997). Hierarchically classifying documents using very few words. Proceed-
ings of the 14th International Conference on Machine Learning, (pp. 170–178).
Koutrika, G., & Ioannidis, Y. (2004). Personalization of queries in database systems. Proceedings of the
20th International Conference on Database Engineering, (pp. 597-608).
Liu, F., Yu, C., & Meng, W. (2002). Personalized Web search by mapping user queries to categories.
Proceedings of the ACM International Conference on Information and Knowledge Management, (pp.
558-565).
Meng, X. F., & Ma, Z. M. (2008). A context-sensitive approach for Web database query results ranking.
Proceedings of IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent
Technology, (pp. 836-839).
Mitchell, T. (1997). Machine learning. McGraw Hill.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106. doi:10.1007/
BF00116251
Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Francisco: Morgan Kaufmann Pub-
lishers Inc.
Roussos, Y., Stavrakas, Y., & Pavlaki, V. (2005). Towards a context-aware relational model. Proceedings
of the International Workshop on Context Representation and Reasoning, Paris, (pp. 101-106).
Rui, Y., Huang, T. S., & Mehrotra, S. (1997). Content-based image retrieval with relevance feedback in
MARS. Proceedings of the IEEE International Conference on Image Processing, (pp. 815-818).
Shen, X., Tan, B., & Zhai, C. (2005). Context-sensitive information retrieval using implicit feedback.
Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development
in Information Retrieval, (pp. 43–50).
Sugiyama, K., Hatano, K., & Yoshikawa, M. (2004). Adaptive Web search based on user profile con-
structed without any effort from users. Proceedings of the 13th International World Wide Web Confer-
ence, (pp. 975-990).
Tweedie, L., Spence, R., Williams, D., & Bhogal, R. S. (1994). The attribute explorer. Proceedings of
the International Conference on Human Factors in Computing Systems, (pp. 435–436).
Wu, L., Faloutsos, C., Sycara, K., & Payne, T. (2000). FALCON: Feedback adaptive loop for content-based
retrieval. Proceedings of the 26th International Conference on Very Large Data Bases, (pp. 297-306).
Zeng, H. J., He, Q. C., Chen, Z., Ma, W. Y., & Ma, J. (2004). Learning to cluster Web search results.
Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development
in Information Retrieval, (pp. 210–217).
26
Discovering Diverse Content Through
Random Scribd Documents
Guatemala—Medina Defeated and Overthrown
—Céleo Arias Succeeds Him—His Liberal Policy
—He is Beset by the Conservatives—His Former
Supporters Depose Him—Ponciano Leiva
Becomes President—His Course Displeases
Barrios, Who Sets Medina against Him—He is
Forced to Resign—Marco Aurelio Soto Made
President by Barrios—Attempted Revolt of Ex-
president Medina—His Trial and Execution—
Soto's Administration—He Goes Abroad—His
Quarrel with Barrios, and Resignation—
President Bogran—Filibustering Schemes
CHAPTER XXIII.
POLITICAL AFFAIRS IN NICARAGUA.
1867-1885.
President Fernando Guzman—Insurrection—
Misconduct of Priests—Defeats of the
Insurgents—Foreign Mediation—Generosity of
the Government—President Vicente Quadra—
Inception of the Jesuits—Aims of Parties—
Internal and Foreign Complications—Costa
Rica's Hostility and Tinoco's Invasion—
Presidents Chamorro and Zavala—More Political
Troubles—Jesuits the Promoters—Their
Expulsion—Peace Restored—Progress of the
Country—President Adan Cárdenas—Resistance
to President Barrios' Plan of Forced
Reconstruction
CHAPTER XXIV.
INDEPENDENCE OF THE ISTHMUS.
1801-1822.
Administration under Spain—Influence of Events
in Europe and Spanish America on the Isthmus
—Hostilities in Nueva Granada—Constitutional
Government—General Hore's Measures to Hold
the Isthmus for Spain—MacGregor's Insurgent
Expedition at Portobello—Reëstablishment of
the Constitution—Captain-general Murgeon's
Rule—The Isthmus is Declared Independent—
Its Incorporation with Colombia—José Fábrega
in Temporary Command—José María Carreño
Appointed Intendente and Comandante General
—Abolition of African Slavery
CHAPTER XXV.
DIVERS PHASES OF SELF-GOVERNMENT.
1819-1863.
Panamá Congress—Provincial Organizations—
Alzuru's Rebellion and Execution—Secession
from Colombia and Reincorporation—
Differences with Foreign Governments—Crime
Rampant—Summary Treatment of Criminals—
Riots and Massacre of Foreign Passengers—
Attempts to Rob Treasure Trains—Neutrality
Treaties—Establishment of Federal System—
Panamá as a State—Revolutionary Era Begins—
A Succession of Governors—Seditious Character
of the Negro Population—Revolution against
Governor Guardia and his Death—Another
Political Organization—Estado Soberano de
Panamá—Liberal Party in Full Control—Stringent
Measures
CHAPTER XXVI.
FURTHER WARS AND REVOLUTIONS.
1863-1885.
Presidents Goitia, Santa Coloma, and Calancha—
Undue Interference of Federal Officials—
Colunje's Administration—President Olarte's
Energy—Enmity of the Arrabal's Negroes—Short
and Disturbed Rules of Diaz and Ponce—
President Correoso—Negro Element in the
Ascendent—Conservatives Rebel, and are
Discomfited—Armed Peace for a Time—Feverish
Rules of Neira, Miró, Aizpuru, Correoso, and
Casorla—Cervera's Long Tenure—Temporary
Rule of Vives Leon—President Santodomingo
Vila—Obtains Leave of Absence—Is Succeeded
by Pablo Arosemena—Aizpuru's Revolution—
Arosemena Flees and Resigns—Outrages at
Colon—American Forces Protect Panamá—
Collapse of the Revolution—Aizpuru and
Correoso Imprisoned—Chief Causes of
Disturbances on the Isthmus
CHAPTER XXVII.
CENTRAL AMERICAN INSTITUTIONS.
1886.
Extent of the Country—Climate—Mountains and
Volcanoes—Earthquakes—Rivers and Lakes—
Costa Rica's Area, Possessions, and Political
Division and Government—Her Chief Cities—
Nicaragua, her Territory, Towns, and Municipal
Administration—Honduras' Extent, Islands,
Cities, and Local Government—Salvador, her
Position, Area, Towns, and Civil Rule—
Guatemala's Extent and Possessions—Her Cities
and Towns—Internal Administration—Isthmus
of Panamá—Area, Bays, Rivers, and Islands—
Department and District Rule—The Capital and
Other Towns—Population—Character and
Customs—Education—Epidemics and Other
Calamities
CHAPTER XXVIII.
THE PEOPLE OF COSTA RICA, NICARAGUA, AND SALVADOR.
1800-1887.
Central American Population—Its Divisions—
General Characteristics and Occupations—Land
Grants—Efforts at Colonization—Failure of
Foreign Schemes—Rejection of American
Negroes—Character of the Costa Rican People
—Dwellings—Dress—Food—Amusements—
Nicaraguan Men and Women—Their Domestic
Life—How They Amuse Themselves—People of
Salvador—Their Character and Mode of Living
CHAPTER XXIX.
THE PEOPLE OF HONDURAS AND GUATEMALA.
1800-1887.
Amalgamation in Honduras—Possible War of
Races—Xicaques and Payas—Zambos or
Mosquitos—Pure and Black Caribs—
Distinguishing Traits—Ladinos—Their Mode of
Life—Guatemala and her People—Different
Classes—Their Vocations—Improved Condition
of the Lower Classes—Mestizos—Pure Indians—
Lacandones—White and Upper Class—Manners
and Customs—Prevailing Diseases—Epidemics—
Provision for the Indigent
CHAPTER XXX.
INTELLECTUAL ADVANCEMENT.
1800-1887.
Public Education—Early Efforts at Development—
Costa Rica's Measures—Small Success—
Education in Nicaragua—Schools and Colleges—
Nicaraguan Writers—Progress in Salvador and
Honduras—Brilliant Results in Guatemala—
Polytechnic School—Schools of Science, Arts,
and Trades—Institute for the Deaf, Dumb, and
Blind—University—Public Writers—Absence of
Public Libraries—Church History in Central
America and Panamá—Creation of Dioceses of
Salvador and Costa Rica—Immorality of Priests
—Their Struggles for Supremacy—Efforts to
Break their Power—Banishments of Prelates—
Expulsion of Jesuits—Suppression of Monastic
Orders—Separation of Church and State—
Religious Freedom
CHAPTER XXXI.
JUDICIAL AND MILITARY.
1887.
Judicial System of Guatemala—Jury Trials in the
Several States—Courts of Honduras—Absence
of Codes in the Republic—Dilatory Justice—
Impunity of Crime in Honduras and Nicaragua—
Salvador's Judiciary—Dilatory Procedure—
Codification of Laws in Nicaragua—Costa Rican
Administration—Improved Codes—Panamá
Courts—Good Codes—Punishments for Crime in
the Six States—Jails and Penitentiaries—Military
Service—Available Force of Each State—How
Organized—Naval—Expenditures—Military
Schools—Improvements
CHAPTER XXXII.
INDUSTRIAL PROGRESS.
1800-1887.
Early Agriculture—Protection of the Industry—
Great Progress Attained—Communal Lands—
Agricultural Wealth—Decay of Cochineal—
Development of Other Staples—Indigo, Coffee,
Sugar, Cacao, and Tobacco—Food and Other
Products—Precious Woods and Medicinal Plants
—Live-stock—Value of Annual Production in
Each State—Natural Products of Panamá—
Neglect of Agriculture—Mineral Wealth—Yield of
Precious Metals—Mining in Honduras, Salvador,
and Nicaragua—Deposits of Guatemala and
Costa Rica—Mints—Former Yield of Panamá—
Mining Neglected on the Isthmus—Incipiency of
Manufactures—Products for Domestic Use
CHAPTER XXXIII.
COMMERCE AND FINANCE.
1801-1887.
Early State of Trade—Continued Stagnation after
Independence—Steam on the Coasts—Its
Beneficial Effects—Variety of Staples—Ports of
Entry and Tariffs—Imports and Exports—Fairs—
Accessory Transit Company—Internal
Navigation—Highways—Money—Banking—
Postal Service—Panamá Railway Traffic—Local
Trade of the Isthmus—Pearl Fishery—Colonial
Revenue in Finances of the Federation—Sources
of Revenue of Each State—Their Receipts and
Expenditures—Foreign and Internal Debts
CHAPTER XXXIV.
INTEROCEANIC COMMUNICATION.
1801-1887.
Ancient Ideas on the North-west Passage—From
Peru to La Plata—Cape Horn Discovered—Arctic
Regions—McClure's Successful Voyage—
Crozier's Discovery—Franklin's Attempts—
Finding by Nordenskiöld of the North-east
Passage—Projects to Unite the Atlantic and
Pacific Oceans across the Isthmuses—Plans
about Tehuantepec—Explorations for a Ship-
canal Route in Nicaragua, Panamá, and Darien
—The Nicaragua Accessory Transit Company—
Construction of the Panamá Railway, and its
Great Benefits—Further Efforts for a Canal—
Organization of a French Company—A Ship-
canal under Construction across the Isthmus of
Panamá—Difficulties and Expectations—Central
American Railroads and Telegraphs—Submarine
Cables
HISTORY
OF
CENTRAL AMERICA.
CHAPTER I.
LAST DAYS OF SPANISH RULE.
1801-1818.
EXPEDITION TO OAJACA.
After the fall of Oajaca during the Mexican war of independence, the patriot chief Morelos
regarded the rear of his military operations as secure. Sympathizing
messages had reached him from men of weight in Guatemala, which
lulled him into the belief that attack need not be apprehended from
this quarter. To Ignacio Rayon he wrote: "Good news from
Guatemala; they have asked for the plan of government, and I'll
send them the requisite information." It was all a mistake. His cause
had friends in Central America, and enemies likewise. Among the
most prominent of the latter were Captain-general Bustamante and
Archbishop Casaus. The ecclesiastic, with a number of Spanish
merchants from Oajaca who had sought refuge in Guatemala,
prompted the general, then anxious to avenge the execution of his
predecessor, to fit out an expedition, invade Oajaca, and harass the
insurgents even at the gates of the city.
About 700 men, mostly raw recruits, were accordingly put in the
field, early in 1813, under the command of Lieutenant-colonel
Dambrini, a man of little ability and unsavory record, and crossed
the line into Tehuantepec. Dambrini could not abandon his money-
making propensities; and having been led to believe he would
encounter but little or no resistance, took along a large quantity of
merchandise for trading. On the 25th of February a small insurgent
force was captured in Niltepec, and Dambrini had its commander,
together with a Dominican priest and twenty-eight others, shot the
next day. This was the usual treatment of prisoners by both
belligerents. But on April 20th the Guatemalans were flanked and
routed at Tonalá by the enemy under Matamoros. Dambrini fled, and
his men dispersed, leaving in the victors' possession their arms,
ammunition, and Dambrini's trading goods. The fugitives were
pursued some distance into Guatemalan territory.[I-25]
FANATICISM.
We have seen how the first steps toward
independence failed. Nor could any other result have been expected
from the degraded condition, socially and intellectually, of the
masses. The people were controlled by fanaticism, in abject
submission to king and clergy. Absurd doctrines and miracles were
implicitly believed in; and every effort made to draw the ignorant
people out of that slough was in their judgment treason and
sacrilege, a violation of the laws of God, an attempt to rob the king
of his rights; certain to bring on a disruption of social ties, and the
wrath of heaven. The lower orders had been taught that freedom
signified the reign of immorality and crime, while fealty to the
sovereign was held a high virtue. Hence the daily exhibitions of
humble faithfulness, the kneeling before the images of the monarch
and before their bishops, and the more substantial proof of money
gifts to both church and crown.[I-38]