Browsing and Querying On XML Data Sources

The document summarizes Urmila Kelkar's M.Tech dissertation on developing a system for browsing and querying XML data sources. The system provides a directory tree interface for nested XML browsing along with IDREF links. It allows custom document views through element dropping and matching. Keyword search returns browsable results trees. The system aims to scale to large XML documents using incremental techniques.

Uploaded by

Avneesh Kumar
Copyright
© Attribution Non-Commercial (BY-NC)

M.Tech.

Dissertation

Browsing and Querying on XML Data Sources

submitted in partial fulfillment of the requirements for the degree of


Master of Technology

By

Urmila Kelkar
Roll No : 00305402

Under the guidance of

Prof. S. Sudarshan

Department of Computer Science and Engineering


Indian Institute of Technology, Bombay
Mumbai
January 17, 2002
Dissertation Approval Sheet

This is to certify that the dissertation entitled Browsing and Querying in XML Data Sources by Urmila
Kelkar is approved for the award of the degree of Master of Technology.

Prof. S. Sudarshan
(Guide)

Internal Examiner

External Examiner

Chairman

Date :
Acknowledgement

I would like to thank my guide, Dr. S. Sudarshan, for his untiring support and encouragement throughout
my M.Tech project. I would also like to acknowledge those who have contributed their ideas and energies
from behind the scenes. Special thanks to all my colleagues from the Informatics lab.

Urmila Kelkar
January 17, 2002

Contents

1 Introduction

2 Related Work
2.1 Browsing
2.1.1 Blended Browsing and Querying by BBQ
2.1.2 XML based information mediation with MIX
2.1.3 BANKS
2.2 Keyword Search
2.2.1 BANKS
2.2.2 DataSpot

3 Scalable Browsing of XML documents
3.1 Design Issues
3.1.1 Incremental browsing of XML documents
3.1.2 Mapping from an XML document to the Foldertree
3.1.3 IDREF to ID links
3.1.4 Sending serialized objects over HttpConnection
3.2 Implementation Details
3.2.1 Interaction between the Servlet and the Applet
3.2.2 Browser setup
3.2.3 Data structures
3.3 Scalability of the system
3.4 Extensions to the Foldertree
3.4.1 Styling enhancements
3.4.2 Interactive Foldertree
3.5 Select Interface
3.5.1 Working of Select Interface
3.5.2 Extensions

4 Integrating Keyword Search with Browsing
4.1 Keyword Search
4.2 Browsing search results
4.3 Browsing Extensions

5 Conclusions and Future work

Abstract

The Web is extensively used by information seekers. Search engines such as Google retrieve information
from HTML documents. They allow users to get the desired information by just typing a few words and
following hyperlinks. The goal of our project is to design and implement a system that provides a powerful
way of extracting information from XML documents, using browsing and keyword search.

Our system provides a directory-tree-like interface to browse through nested XML data, coupled with the
use of IDREF to ID links and the use of stylesheets. To facilitate customized views of the same XML
document, we provide menus to drop elements, to find matching elements, to drop a subtree and so on.
Additionally, we provide keyword search, where users can just type a few keywords to get the desired
information from the XML data source.
Chapter 1

Introduction

XML is an evolving technology, which is becoming important because of its standardized data representation
format. XML documents focus on the semantics of data; they do not provide information about how to display
the data contained in the document. XML portrays a semistructured data model, which is likely to be used
to publish heterogeneous data.

Consider the example of electronic patient records (EPR) used by public health services. Many European
countries aim to introduce EPR as a standard way of maintaining patient records. XML, which provides
users with a flexible way to mark up nested data, appears well suited for maintaining EPRs. Administrative
committees in a number of hospitals are now incorporating XML within their prescribed standards for
maintaining patient records. This emphasizes the need for a system that facilitates information retrieval
from the underlying XML documents.

We aim to develop a system that facilitates browsing of XML documents and provides keyword search
as well. Further, in browsing, we focus on the two access patterns that will be most commonly used:
document traversal and pattern matching queries. The nesting structure of an XML document is used to
provide navigation through the document in a tree format, called a foldertree. Users can choose a
particular style for displaying XML documents. Our system also provides a menu to drop the selected
subtree, to expand the selected subtree, to drop a particular element and to highlight matching elements.
Since XML documents can be very large, it is important to make the system scalable. Our system
facilitates scalable browsing using an incremental approach. We also provide an interface called the
‘Select Interface’, which helps users retrieve information from XML documents using pattern matching
queries. There are a few systems, like Blended Browsing and Querying (BBQ) [MLP99] and XML based
information mediation with MIX (MIX) [BGL+99], that support querying and browsing of XML documents.
BBQ uses an incremental, on-demand approach.

Our system additionally provides keyword search, which is not supported by BBQ and MIX. The keyword
search module constructs a graph from the XML documents available in the data source. We use the least
common ancestor technique to compute the answer trees. An answer tree can be browsed like any other
XML document.

Chapter 2 gives an overview of related work in the area of browsing XML data sources. The browsing
interface offered by our system is described in Chapter 3. Chapter 4 briefly describes the keyword search
module and browsing of search results. The detailed approach of keyword search is described by Megha
Meshram in her dissertation [Mes02]. Chapter 5 outlines conclusions and describes future work.

Chapter 2

Related Work

The Web has introduced a new paradigm of browsing and keyword search to retrieve information from
HTML documents. Our goal is to provide a similar system to browse through XML documents. Techniques
used for information retrieval from HTML documents cannot be directly used for XML documents.
This chapter gives an overview of the systems supporting browsing, keyword search and querying of XML
documents.

2.1 Browsing
Search engines such as Google help a naive user to browse through information using hyperlinks. There are
a few systems that provide support for browsing through XML documents. The following sections describe
these systems.

2.1.1 Blended Browsing and Querying by BBQ


Blended Browsing and Querying (BBQ) [MLP99] provides a Document Type Definition (DTD) based graphical
user interface, which facilitates XML query construction and browsing of results by non-expert users.
It is used as a front end to the virtual source exported by the MIX mediator system. The virtual source
may be an actual XML source or an XML view created by the mediator.

BBQ assigns each document a separate window with its data and schema displayed side by side. Both
data and schema can be navigated using directory-like structures. BBQ querying is schema driven.
Underneath, it uses the XMAS (XML Matching And Structuring language) query language, which supports
joins, filtering and constraints on leaf nodes. BBQ assumes that users cannot come up with a focused
query right at the first step, and so it facilitates iterative query refinement. It supports queries on
multiple DTDs. Users can specify the structure of the query by dragging and dropping elements from the
source DTDs, by introducing new elements, or by grouping elements according to the value of other
elements. Further, once execution of the query starts, users can browse through partial results. BBQ uses
the Document Object Model Application Programming Interface (DOM API) and an incremental approach for
browsing potentially very large documents and subsequent query results.

2.1.2 XML based information mediation with MIX


Mediation of information from heterogeneous data sources becomes crucial as data from different sources,
like HTML documents, XML documents, relational databases and legacy data, is published over the
Web. MIX [BGL+99] employs XML as a semistructured data model to provide a uniform and flexible
representation of arbitrary source data. Since XML may also become a stumbling block when formulating
meaningful queries against semistructured databases, MIX uses the XML DTD as a structural description of
the data exchanged by the components of the mediator.

MIX focuses on valid XML documents, which conform to a DTD. XML queries are written in a high-level,
declarative query language known as XMAS, which evolved from ideas in XML-QL and other XML
query languages. It allows pattern matching as well as group-by queries. To facilitate querying of
heterogeneous sources, XML wrappers are provided which export data in a uniform format to the mediator.
MIX uses BBQ as its graphical interface. XMAS queries generated by BBQ are sent to the mediator for
execution and the results are displayed using BBQ.

2.1.3 BANKS
BANKS - an acronym for Browsing ANd Keyword Search [BHN+01] - facilitates browsing and keyword
search in relational databases. Earlier, we had worked on a module of the BANKS system. To retrieve
information from the underlying relational databases, users would normally need to know SQL (Structured
Query Language). BANKS instead uses the paradigm of keyword search introduced by the Web to retrieve the
desired information.

BANKS uses a tabular format to display information retrieved from the database. The foreign key - primary
key (FK-PK) relationship is used to provide a link from one table to another in the form of hyperlinks.
BANKS also provides a menu, using JavaScript, for sorting the records of a table on a particular column,
grouping records based on a column and so on. It also provides a ‘Select Interface’ to get records
matching the specified values. BANKS also provides templates such as crosstab, nested, foldertree,
bar-chart and pie-chart to display information from the database in a graphical manner.

Browsing and keyword search in XML can also be thought of as an extension to BANKS. XML is considered
semistructured data, while BANKS uses a relational (structured) database as its back end. Hence, the
strategies and approaches used for browsing and keyword search in XML are different, but the underlying
need for browsing and keyword search is the same.

XML documents can be displayed using a foldertree structure like the one used in BANKS. Menus such
as drop elements, select matching elements and drop subtree can be used to interact with the foldertree.
We can also provide a group-by selection interface wherein users select a group-by element and a result
element to get a graphical view of the data, like the bar-chart used in BANKS. For example, rather than
viewing the whole collection of CDs at once, users may prefer a year-wise or artist-wise collection of
CDs, where we specify CD as the result element and the year or the artist as the group-by element. We
can also facilitate querying of XML documents, using some query language at the back end.

2.2 Keyword Search


Keyword search on documents is widely used for finding desired information on the Internet. The keyword
search paradigm is equally important for XML documents. The following sections describe prior work on
keyword search in databases.

2.2.1 BANKS
BANKS supports keyword search to retrieve information from the underlying relational database, and
users are not required to know the schema details. Just as keyword search on the Internet returns
documents containing the given words, BANKS returns tuples containing the given words. BANKS exploits the
foreign key - primary key relationships between tables of the database to form a graph called the
meta-data graph. Further, it also uses a tuple-level graph to identify particular tuples. Given a set of
words, two words are considered to be close to each other if, in a table, they are in the same row and
same column, or in different columns of a row, or in rows of different tables linked by foreign keys.
The details of the keyword search algorithm are given in [HBN+01]. Search results are ranked and sorted
before they are presented.

To facilitate keyword search in XML documents, we can use the built-in hierarchical structure of an XML
document to form a graph. We can make an entry in the text index for all words from all documents,
except for stop words. The text index can store DOM Node references, which will help us to traverse the
graph to find the shortest paths between keyword nodes. The simplest way to find a tree containing all
the words is to use the least common ancestor technique. The result tree has the common intersection node
as its root, while its leaf nodes refer to the search terms. The search results can be ranked based on
the depth of the tree or, say, the longest edge of the tree. A result tree can be browsed using the
foldertree like any other XML document.
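
The least common ancestor step described above can be illustrated with a minimal sketch over the standard org.w3c.dom API. The class and method names (LcaSketch, findText, lcaName) are hypothetical, the sketch handles only two keywords, and it scans the document for each keyword instead of consulting a text index, so it shows the idea and is not the search module of this system.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustrative sketch only: class and method names are hypothetical.
class LcaSketch {

    // Nodes on the path from the document root down to n (inclusive).
    static List<Node> pathFromRoot(Node n) {
        List<Node> path = new ArrayList<>();
        for (Node cur = n; cur != null; cur = cur.getParentNode()) {
            path.add(0, cur);
        }
        return path;
    }

    // Deepest node that lies on the root paths of both a and b.
    static Node leastCommonAncestor(Node a, Node b) {
        List<Node> pa = pathFromRoot(a);
        List<Node> pb = pathFromRoot(b);
        Node lca = null;
        int n = Math.min(pa.size(), pb.size());
        for (int i = 0; i < n && pa.get(i).isSameNode(pb.get(i)); i++) {
            lca = pa.get(i);
        }
        return lca;
    }

    // First text node, in document order, containing the keyword (null if none).
    // Assumption: a linear scan stands in for the text index lookup.
    static Node findText(Node n, String keyword) {
        if (n.getNodeType() == Node.TEXT_NODE && n.getNodeValue().contains(keyword)) {
            return n;
        }
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
            Node hit = findText(c, keyword);
            if (hit != null) return hit;
        }
        return null;
    }

    // Parses xml and returns the name of the node rooting the answer tree,
    // or null when one of the keywords does not occur.
    static String lcaName(String xml, String k1, String k2) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Node a = findText(doc, k1);
            Node b = findText(doc, k2);
            if (a == null || b == null) return null;
            return leastCommonAncestor(a, b).getNodeName();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

For a CD collection, two keywords that occur inside the same cd element yield that cd element as the root of the answer tree, while keywords from different cd elements yield their enclosing collection element.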

2.2.2 DataSpot
DataSpot [DEGP98] introduces a new approach to database query and information retrieval by providing end
users with the capability of exploring databases using free-form queries and navigation. This capability
is based on a novel, schema-less representation of data, called a Hyperbase. The DataSpot representation
and search technology is the foundation of the DataSpot system. DataSpot has since been renamed Mercado
and is available as a commercial product.

A DataSpot Hyperbase is a graph structure comprised of nodes, edges and node labels. Nodes are related
by directed edges of two types. A simple edge connects a parent node to a child node. The children of a
node are ordered. An identification edge indicates that the reference node uniquely identifies the
subject node. A DataSpot query is an associative search over a Hyperbase. The input to a query is a set
of nodes called the query sources. The result of a query is a list of answers, where each answer
is a connected Hyperbase containing the query sources. The answers to the query are ranked. Users can
view answer records in detail, navigate to related records, or submit continuation queries from
the current record or from a set of records.

Chapter 3

Scalable Browsing of XML documents

The primary motivation for developing a browsing system is to ease the end user’s task. Naive users
should be able to get the desired information from XML documents with just a few clicks rather than by
writing complex queries. We have developed a system which provides such an interface for extracting the
desired information from XML documents.

The central idea is to provide a scalable browsing system to navigate through XML documents. Since
an XML document consists of markup tags and not formatting tags, we need some mechanism to convert
XML-encoded information into the true data model and to make it presentable. We use a foldertree
structure to display XML documents. A foldertree is a simple hierarchical structure, like the directory
tree structure used in Windows. Consider the document order.xml shown in Figure 3.1. Our system provides
users with a foldertree view of this XML document, as shown in Figure 3.2.

3.1 Design Issues


This section describes various design issues in our system: the incremental browsing approach, the mapping
of an XML document to the foldertree, IDREF to ID links, and the serialized object used for communication
between the client end and the server end of the system.

3.1.1 Incremental browsing of XML documents


The approach used for displaying XML documents brings nodes on demand (i.e. as requested by users) and
displays the tree incrementally. To indicate to users that a particular node has more child nodes yet to
be retrieved, a dummy node called “More...” is displayed.

This incremental approach makes the design scalable, since a user does not have to wait for all the
child nodes to arrive before the tree is displayed. The following sections explain our approach in
detail.

Approach I
At the initial stages of the work, we implemented the following non-incremental model. The model consists
of a servlet running at the server end and an applet working at the client end, communicating with each
other. The steps are as follows:
• The servlet parses the whole XML document and gets an in-memory DOM tree.

• The applet sends a request over an HttpConnection for a particular XML document.

<OrderData endDate="2/11/2001" startDate="1/11/2001">
<Customer customerID="c1" name="PCS" ..... postalcode="476799">
</Customer>
<Customer customerID="c2" name="HP" ..... postalcode="476799">
</Customer>
<Invoice orderDate="20/11/2001" .... customerIDREF="c1">
<LineItem quantity="10" price="10" partIDREF="p1">
</LineItem>
<LineItem quantity="50" price="50" partIDREF="p2">
</LineItem>
<LineItem quantity="20" price="30" partIDREF="p3">
</LineItem>
<LineItem quantity="30" price="20" partIDREF="p3">
</LineItem>
</Invoice>
<Part partID="p1" ... description="Deskjet printer">
</Part>
<Part partID="p2" ... description="Desktop Computer">
</Part>
<Part partID="p3" ... description="Laptop Computer">
</Part>
</OrderData>

Figure 3.1: Original XML document

• The servlet traverses the in-memory DOM tree in BFS (Breadth First Search) order, creates a newObject
corresponding to each DOM Node and sends all the objects one by one over the HttpConnection. Refer to
Section 3.1.4 for the details of newObject.

• The applet receives the newObject corresponding to every DOM Node and attaches each newObject to
its respective parent to form a tree.

• The applet displays the tree using Java Swing.


This approach did not work well for huge XML documents because sending the whole DOM tree at once
was not feasible. Later, we came up with an incremental model as described below.
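
The breadth-first traversal of Approach I can be sketched as follows. This is an illustration under stated assumptions and not the servlet code: the class name BfsFlattenSketch is hypothetical, only element nodes are visited, each record is reduced to a "myID,parentID,name" string, ids are assigned in visiting order starting from 1, and the root is given parent id 0.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustrative sketch only: names and the record format are hypothetical.
class BfsFlattenSketch {

    // Returns one "myID,parentID,name" record per element, in BFS order.
    static List<String> flatten(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        List<String> records = new ArrayList<>();
        Deque<Node> nodes = new ArrayDeque<>();        // nodes waiting to be visited
        Deque<Integer> parentIds = new ArrayDeque<>(); // parent id of each waiting node
        nodes.add(doc.getDocumentElement());
        parentIds.add(0);                              // assumption: the root has parent id 0
        int nextId = 1;
        while (!nodes.isEmpty()) {
            Node n = nodes.remove();
            int parentId = parentIds.remove();
            int myId = nextId++;
            records.add(myId + "," + parentId + "," + n.getNodeName());
            for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
                if (c.getNodeType() == Node.ELEMENT_NODE) {
                    nodes.add(c);
                    parentIds.add(myId);
                }
            }
        }
        return records;
    }
}
```

Because a parent is always emitted before its children, the receiving side can attach every record to a parent it has already seen. The whole document still crosses the connection in one response, which is the cost that motivated the incremental model.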

Approach II
The servlet, running in the background, sends DOM Nodes on demand, as requested by users at the client
end. This makes the servlet stateless. The applet saves the state of the request and sends the next
request as per the user’s navigation. The following steps are taken:
• The servlet parses the whole XML document and gets an in-memory DOM tree.

• Initially the applet requests the root node, identified by id=1, along with a few child nodes; the
number of child nodes is specified in the configuration file.

• In general, the applet sends a request with the node id of a parent node and the range of child nodes
wanted, identified by the ‘from’ and ‘to’ parameters in the request. It also includes a docName parameter
identifying the XML document. The object called newObject, described in Section 3.1.4, is used for
communication between the servlet and the applet.

Figure 3.2: Foldertree view of order.xml document displaying IDREF to ID links

• In response to the request from the applet, the servlet sends the root node if id is equal to 1, or
else only the child nodes numbered from ‘from’ to ‘to’ of the requested XML document.

• The servlet sends a special object with myID=0 as a demarcating object, to indicate the end
of the response.

• The applet updates the tree by appending the received child nodes to their respective parent nodes and
displays the tree.

• Once the tree is displayed, the applet waits for a user request. A user can request the
child nodes of a particular node by just clicking on that node; a request asking for the child nodes of
that node is then sent to the servlet.

• A dummy node named “More...” is used to indicate that the parent node has more child nodes
yet to be retrieved. Users can click on the “More...” node to request those child nodes, or they can
click on the parent node itself.
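
The client-side bookkeeping behind these steps can be sketched as below. The parameter names docName, id, from and to come from the description above; the class name, the query-string layout, the 1-based inclusive numbering of children and the page size argument are assumptions made for illustration.

```java
// Illustrative sketch only: the request format and numbering are assumptions.
class IncrementalRequestSketch {

    // True when a "More..." node should be shown under the parent.
    static boolean hasMore(int numChildren, int numChildrenAtClient) {
        return numChildrenAtClient < numChildren;
    }

    // Builds the parameters for the next batch of children of a parent node.
    // Assumes children are numbered from 1 and that docName needs no URL escaping.
    static String nextRequest(String docName, int parentId, int numChildren,
                              int numChildrenAtClient, int pageSize) {
        int from = numChildrenAtClient + 1;
        int to = Math.min(numChildrenAtClient + pageSize, numChildren);
        return "docName=" + docName + "&id=" + parentId + "&from=" + from + "&to=" + to;
    }
}
```

With 25 children, 10 already at the client and a page size of 10, the next request asks for children 11 to 20 and a “More...” node remains until all 25 have arrived. Keeping these counters at the client is what allows the servlet to stay stateless.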

3.1.2 Mapping from an XML document to the Foldertree


This section describes the steps taken to map an XML document to the foldertree, while displaying it
incrementally.

The foldertree is chosen to display the document since it is suitable for displaying hierarchical, nested
structures. The JAXP [JAX] DOM parser is used to parse XML documents. It gives us an in-memory tree
corresponding to a document, where every node is a DOM Node. We only consider nodes of the types
DOCUMENT_NODE, DOCUMENT_TYPE_NODE, ELEMENT_NODE, ATTRIBUTE_NODE and TEXT_NODE.
The Document node is the root of the document, while the Document type node is used to identify the DTD
associated with the document. The Document node corresponds to the root of the foldertree. Element nodes
and Attribute nodes correspond to foldertree nodes (FTNs) and leaf nodes in a foldertree. Element nodes,
Text nodes and Attribute nodes are transformed into the foldertree structure as follows:

ELEMENT_NODE : The name of the Element node (i.e. the markup tag) in the XML document is assigned
to the corresponding foldertree node (FTN) in the foldertree. Refer to Figure 3.1 and Figure 3.2, which
demonstrate the mapping from the XML document to the foldertree.

TEXT_NODE : A Text node in the XML document corresponds to a Leaf node of the foldertree. A Text node
carries the actual data, and we assume that an Element node has only a single Text node as its child.
While mapping the XML document to the foldertree, we therefore append the value of the Leaf node
(corresponding to the Text node in the XML document) to its parent foldertree node and
remove the Leaf node from the foldertree, as it is redundant.

ATTRIBUTE_NODE : The ‘Attribute Name=Attribute Value’ pair for every Attribute node in the XML document
is appended to the FTN corresponding to the Element node that carries the attribute. For
example, as shown in Figure 3.1 and Figure 3.2, the Element node OrderData has the Attribute node
‘startDate=1/11/2001’. We append ‘startDate=1/11/2001’ to the FTN corresponding to OrderData in the
foldertree.
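
A minimal sketch of this mapping for a single element is given below. It is illustrative only: the class and method names are hypothetical, the pieces of the label are joined with single spaces (the actual display format is not specified here), and the attributes are sorted by name so that the output does not depend on the order in which the parser stores them.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustrative sketch only: names and the label format are hypothetical.
class FolderLabelSketch {

    // Builds the text shown for one foldertree node: tag name, then
    // name=value pairs for the attributes, then any non-blank text content.
    static String label(Element e) {
        StringBuilder sb = new StringBuilder(e.getNodeName());
        NamedNodeMap attrs = e.getAttributes();
        Map<String, String> sorted = new TreeMap<>(); // fixed order for display
        for (int i = 0; i < attrs.getLength(); i++) {
            Node a = attrs.item(i);
            sorted.put(a.getNodeName(), a.getNodeValue());
        }
        for (Map.Entry<String, String> en : sorted.entrySet()) {
            sb.append(' ').append(en.getKey()).append('=').append(en.getValue());
        }
        // Fold text children into the parent label instead of showing leaf nodes.
        for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE) {
                String text = c.getNodeValue().trim();
                if (!text.isEmpty()) sb.append(' ').append(text);
            }
        }
        return sb.toString();
    }

    // Convenience for trying the mapping on the root element of a small document.
    static String rootLabel(String xml) {
        try {
            return label(DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement());
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }
}
```

Whitespace-only text between child elements is skipped, so only elements that actually carry data get a value appended to their label.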

An Attribute node can have different types, such as “CDATA”, “ID”, “IDREF”, “ENTITY” etc. Currently our
system supports only attributes of type “ID” and “IDREF”. IDREF to ID links are created by checking the
attribute type. Suppose an Element node in the XML document has an attribute of type IDREF. We then
identify the corresponding Element node having an attribute of type ID (i.e. the ID node) whose ID value
matches the IDREF value. In the foldertree, the FTN corresponding to the Element node having the ID is
attached as a child of the FTN corresponding to the Element node having the IDREF. In an XML document,
an ID node and an IDREF node can lie far apart. Our system provides a way to browse from an IDREF node to
its ID node.

IDREF to ID links are thus created as part of the initialization of the system, and the in-memory DOM tree
corresponding to the XML document is updated to include IDREF to ID links, as described in the following
section.

3.1.3 IDREF to ID links


An XML document contains elements, and an element can have attributes. An attribute of type ID identifies
an element uniquely in a document. The IDREF to ID relationship is considered analogous to the foreign
key - primary key relationship. An attribute of type IDREF indicates that the element refers to another
element having an attribute of type ID, wherein both attribute values are the same. Here, we assume that
the value of an ID attribute is unique over the entire document.

The DOM parser doesn’t provide an API to identify attributes of a particular type, say ID or IDREF, while
the SAX parser API provides support for such identification. Since using the SAX parser would add overhead
(one more SAX parser scan in addition to the DOM parser scan), we use a DTD parser [DTD]. The DTD
parser helps us to identify IDREF elements and ID elements. We support identification of IDREF to ID links
only if the document has a DTD associated with it. The in-memory DOM tree is updated to include IDREF to
ID links as follows:

If the XML document has a DTD associated with it,

• Parse the DTD.

• For every Element, check whether any attribute is of type ID or of type IDREF.

– For an attribute of type IDREF, enter the ElementName-AttributeName pair into the IDREF hashtable.
– For an attribute of type ID, enter the ElementName-AttributeName pair into the ID hashtable.

• When finished with the DTD, parse the XML document to construct an in-memory DOM tree.

– If an ElementName-AttributeName pair in the document matches some entry in the ID
hashtable, enter the id value of the ID Node in the ID array.
– If it matches some entry in the IDREF hashtable, enter the idref value of the IDREF Node in the IDREF
array.

• After the DOM tree is completely constructed, for every entry in the IDREF array do the following:

– Get the ID Node with the matching value.
– Clone the ID Node to get a duplicate-ID Node, since a DOM node can occupy only one position in the
tree. If we do not clone the ID Node, it gets removed from its original place in the
XML document and gets appended to the child list of the IDREF Node. Because we want to keep
the ID Node as it is in the original XML document, and additionally want to append it to the
child list of the IDREF Node, we clone it.
– Append the duplicate-ID Node to the child list of the IDREF Node.

Figure 3.2 shows IDREF to ID links in order.xml document where Invoice refers to Customer and LineItem
refers to Part. We append Customer to the child list of Invoice and Part to the child list of LineItem for
matching IDREF and ID values in the document.
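
The linking step can be sketched with the standard DOM calls cloneNode and appendChild. The sketch is a simplification under stated assumptions: the class and method names are hypothetical, and the names of the ID and IDREF attributes are passed in directly, whereas the system described above obtains ElementName-AttributeName pairs from the DTD.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative sketch only: names are hypothetical, attribute types are supplied by the caller.
class IdrefLinkSketch {

    // Appends a deep clone of each referenced ID element under the referring element.
    static void link(Document doc, Collection<String> idAttrs, Collection<String> idrefAttrs) {
        // Snapshot the elements first: the NodeList is live and would grow as clones are added.
        NodeList live = doc.getElementsByTagName("*");
        List<Element> elements = new ArrayList<>();
        for (int i = 0; i < live.getLength(); i++) elements.add((Element) live.item(i));

        Map<String, Element> byId = new HashMap<>(); // ID value -> element carrying it
        for (Element e : elements) {
            for (String a : idAttrs) {
                if (e.hasAttribute(a)) byId.put(e.getAttribute(a), e);
            }
        }
        for (Element e : elements) {
            for (String a : idrefAttrs) {
                Element target = e.hasAttribute(a) ? byId.get(e.getAttribute(a)) : null;
                // Clone, so that the ID element also stays at its original position.
                if (target != null) e.appendChild(target.cloneNode(true));
            }
        }
    }

    // Links the document, then lists the element children of the first element
    // with the given name, joined by commas.
    static String childNames(String xml, Collection<String> idAttrs,
                             Collection<String> idrefAttrs, String elementName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            link(doc, idAttrs, idrefAttrs);
            Node parent = doc.getElementsByTagName(elementName).item(0);
            StringBuilder sb = new StringBuilder();
            for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
                if (c.getNodeType() != Node.ELEMENT_NODE) continue;
                if (sb.length() > 0) sb.append(',');
                sb.append(c.getNodeName());
            }
            return sb.toString();
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }
}
```

On a cut-down order.xml, calling link with customerID and partID as ID attributes and customerIDREF and partIDREF as IDREF attributes leaves Customer and Part in place under OrderData and adds a copy of Customer under Invoice and a copy of Part under LineItem.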

3.1.4 Sending serialized objects over HttpConnection


The system consists of a client (browser) end and a server end. XML documents are stored in a data source
at the server end, while the foldertree is displayed at the client end. This section describes the object
used to send an XML document from the server end to the client end.

Users at the client end can send a request for a particular XML document to be browsed, or they may
request the child nodes of an FTN, numbered from, say, ten to twenty. This request is sent over an
HttpConnection to the servlet running at the back end. In response to this request, the servlet sends the
requested nodes to the client end. The client end redisplays the tree by attaching the received nodes to
their corresponding parents. The DOM parser gives us an in-memory DOM tree for the document. But since a
DOM Node is not serializable, we cannot send it over the HttpConnection. We have constructed our own
object, which stores the DOM Node information along with a few tags so that it is easier to attach those
nodes to their respective parent nodes at the client end. The class newObject describes the serializable
object used for sending node information from the server end to the client end.

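import java.io.Serializable;

class newObject implements Serializable {
    String folderValue;       // text displayed for this node in the foldertree
    int isLeaf;               // 1 = leaf, 0 = non-leaf
    int myID;                 // unique ID of this node
    int parentID;             // ID of this node's parent
    int numChildren;          // number of child nodes of this node
    int numChildrenAtClient;  // child nodes already received by the applet
}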

The newObject carries the following parameters, required for reconstruction of the tree.

• folderValue - a variable of type String holding the value that is displayed in the foldertree.

• isLeaf - a variable of type integer that indicates whether the Node is a leaf node or a non-leaf node
(1-leaf, 0-non-leaf). Nodes with an isLeaf value equal to 0 have to be stored temporarily on the client
side, so that whenever we get child nodes, we can append them to the parent node. We need not keep
nodes with an isLeaf value of 1, since those are leaf nodes.

• myID - a variable of type integer, used to assign a unique ID to the Node.

• parentID - also an integer, used to store the ID of the parent of the Node. It helps in appending child
nodes to the correct parent Node.

• numChildren - an integer used to store the number of child nodes of a Node.

• numChildrenAtClient - an integer indicating the number of child nodes received by the applet at the
client end.
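
How such objects might travel over a connection can be sketched with Java object serialization. The sketch makes several assumptions: NodeRecord is a hypothetical stand-in for newObject with the same six fields, in-memory byte streams stand in for the streams of the HttpConnection, and the end of a response is marked by a record with myID equal to 0, as described for the incremental approach in Section 3.1.1.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: names are hypothetical, byte arrays replace the network.
class NodeStreamSketch {

    // Stand-in for newObject, carrying the same six fields.
    static class NodeRecord implements Serializable {
        private static final long serialVersionUID = 1L;
        String folderValue;
        int isLeaf, myID, parentID, numChildren, numChildrenAtClient;

        NodeRecord(String folderValue, int isLeaf, int myID, int parentID, int numChildren) {
            this.folderValue = folderValue;
            this.isLeaf = isLeaf;
            this.myID = myID;
            this.parentID = parentID;
            this.numChildren = numChildren;
        }
    }

    // Server side: write every record, then a record with myID=0 as end marker.
    static byte[] writeAll(List<NodeRecord> nodes) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                for (NodeRecord n : nodes) out.writeObject(n);
                out.writeObject(new NodeRecord("", 1, 0, 0, 0));
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Client side: read records until the end marker arrives.
    static List<NodeRecord> readAll(byte[] data) {
        List<NodeRecord> received = new ArrayList<>();
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            while (true) {
                NodeRecord n = (NodeRecord) in.readObject();
                if (n.myID == 0) break;
                received.add(n);
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
        return received;
    }
}
```

Since each record carries its own parentID, the client can attach it to the right parent as soon as it is read, without waiting for the rest of the response.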

3.2 Implementation Details


We are working on a sample data source containing saved XML documents. The system is developed using
Java [JDK]. It uses a Java Servlet [JS] at the back end, a Swing Applet [JDK] at the client end and
standard interfaces like a DTD parser [DTD], DOM [JAX] and XQL [XQL], as shown in Figure 3.3. Since XML is
a document format and not a data format, we need to preprocess it. An XML parser is used to retrieve the
actual data from XML documents by preprocessing them. XML parsers currently available include the JAXP
parser (DOM, the Document Object Model, and SAX, the Simple API for XML) [JAX], Xerces (Apache’s parser in
Java) and libxml in C. We are using the JAXP parser. XML documents can have a DTD (Document Type
Definition) associated with them. A few XML documents in the data source have a DTD describing the schema
of the XML document. Our system does not need a DTD, but if one is available, it helps while browsing.

3.2.1 Interaction between the Servlet and the Applet


Figure 3.3: Overview of the system (client end: browsing interface as a foldertree applet, and select
interface; network; back end: servlet using the DTD parser, DOM and XQL over the XML data source)

The servlet is set up at the URL given by the entry "servletRoot" in the configuration file. All XML
documents to be browsed are placed in a directory called "XMLDocs" inside the public_html directory,
since applets can read only files stored in public areas; otherwise the applet would have to be signed,
which is a somewhat complex procedure. The applet acts as the master, asking for a particular XML
document, while the servlet runs in the background serving the applet's requests. The servlet uses the
DOM API to parse the XML document and constructs an in-memory DOM tree from it. The servlet sends node
data over the HttpConnection using an output stream, and the applet reads it from the HttpConnection
using an input stream. The problem is that a DOM Node is not serializable, so it cannot be sent as a
stream over the HttpConnection. We therefore use an object designed to store the DOM Node information
in a serializable format. The object is described in Section 3.1.4. The same object is used at the back
end as well as at the client end.
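A minimal sketch of such a serializable carrier and its transfer is given below. The field names follow the ones listed for the object in Section 3.1.4; the class names NodeInfo and NodeTransfer, the field types, and the use of plain Java object streams are assumptions made for illustration and are not taken from the actual implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

/* Hypothetical serializable stand-in for the data of one DOM Node. */
class NodeInfo implements Serializable {
    private static final long serialVersionUID = 1L;
    String folderValue;       /* text shown for the node in the foldertree */
    boolean isLeaf;
    int myID;
    int parentID;
    int numChildren;
    int numChildrenAtClient;  /* starts at 0: nothing fetched yet */

    NodeInfo(String folderValue, boolean isLeaf, int myID, int parentID, int numChildren) {
        this.folderValue = folderValue;
        this.isLeaf = isLeaf;
        this.myID = myID;
        this.parentID = parentID;
        this.numChildren = numChildren;
    }
}

class NodeTransfer {
    /* Servlet side: write the object to the response stream. */
    static void send(NodeInfo info, OutputStream out) {
        try {
            ObjectOutputStream oos = new ObjectOutputStream(out);
            oos.writeObject(info);
            oos.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /* Applet side: read the object back from the connection's input stream. */
    static NodeInfo receive(InputStream in) {
        try {
            return (NodeInfo) new ObjectInputStream(in).readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        /* A byte array stands in for the HttpConnection in this sketch. */
        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        send(new NodeInfo("PLAY", false, 0, -1, 5), wire);
        NodeInfo copy = receive(new ByteArrayInputStream(wire.toByteArray()));
        System.out.println(copy.folderValue + " has " + copy.numChildren + " children");
    }
}
```

Because the same class is loaded at both ends, the receiving side can cast the deserialized object directly, which is why both ends need compatible class definitions.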

All XML documents in a data source are parsed to get the corresponding in-memory DOM trees. Our system
keeps a reference to the root node of each in-memory DOM tree to retrieve requested nodes quickly. At
initialization, we update each in-memory DOM tree by attaching text nodes, attaching attributes, and
attaching ID nodes as children of IDREF nodes, as described in Sections 3.1.2 and 3.1.3.

3.2.2 Browser setup


We use the Swing Applet [JDK] to display an XML document in foldertree format. Netscape 4.7 does not
support Java Swing. One option is to set up a JRE (Java Runtime Environment) with the path set for
plugins, which is somewhat complex. A simpler option is to place the swingall.jar file in the
/usr/lib/netscape/java/classes path of the Unix environment, which makes the browser Swing-enabled.
Netscape 4.7 and earlier versions do not support stylesheets associated with XML documents. Our system
embeds style information (font and colour) in the foldertree and displays it using Swing. A better
option is therefore to use Netscape 6.1 or a higher version.

3.2.3 Data structures


The system is composed of the servlet at the back end and the applet at the client end. It is assumed
that the client end has the same version of the JVM (Java Virtual Machine) installed as the servlet
side. To understand the system in detail, we need to understand the control flow and the data
structures used at both ends.

Data structures at Servlet end


/* initialized at the startup of system */
static int numOfDocs;
static String docNames[];

/* hashtable of document references */
static Hashtable hashNode_to_id;
static Hashtable hashid_to_Node;

/* per request */
static String docName;
static int reqid;
static int from;
static int to;

At initialization, the system reads the available documents from the configuration file. numOfDocs
indicates the number of documents in the data source. An array stores the DOM reference to the root
node of the in-memory tree for each document. hashNode_to_id and hashid_to_Node are Hashtables of
Hashtables. hashNode_to_id has the document number (the index in the docNames array) as key and, as
value, a hashtable for that document which stores each DOM Node as a key and the unique value assigned
to that node as a value. hashid_to_Node likewise has the document number as key and, as value, the
hashtable for the reverse mapping. The doInit() function of the servlet does this initialization work.

The servlet stores four variables, docName, reqid, from and to, indicating that the applet is asking
for the XML document named docName. reqid, from and to indicate that the applet requires the child
nodes numbered from 'from' to 'to' of the node with id value equal to reqid. The servlet accepts these
parameters in doGet() and sends the response over the HttpConnection accordingly.
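The following sketch illustrates the idea of the two per-document maps and of answering a (reqid, from, to) request. It is only an illustration under assumptions: the class name DocIndex, the preorder id assignment with the root as id 0, zero-based inclusive child numbering, and the restriction to element nodes are all hypothetical choices, and modern generic maps replace the Hashtables listed above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/* Hypothetical index for ONE document; the servlet would keep one per docName. */
class DocIndex {
    final Map<Node, Integer> nodeToId = new HashMap<>();
    final Map<Integer, Node> idToNode = new HashMap<>();

    DocIndex(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            assignIds(doc.getDocumentElement());
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse document", e);
        }
    }

    /* Preorder walk over element nodes; ids are handed out in visiting order. */
    private void assignIds(Node node) {
        if (node.getNodeType() != Node.ELEMENT_NODE) return;
        int id = idToNode.size();
        nodeToId.put(node, id);
        idToNode.put(id, node);
        NodeList kids = node.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) assignIds(kids.item(i));
    }

    /* Element children numbered from..to (inclusive) of node reqid, as "name#id". */
    List<String> children(int reqid, int from, int to) {
        List<String> names = new ArrayList<>();
        Node parent = idToNode.get(reqid);
        if (parent == null) return names;
        int pos = 0;
        NodeList kids = parent.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node kid = kids.item(i);
            if (kid.getNodeType() != Node.ELEMENT_NODE) continue;
            if (pos >= from && pos <= to) names.add(kid.getNodeName() + "#" + nodeToId.get(kid));
            pos++;
        }
        return names;
    }

    public static void main(String[] args) {
        DocIndex idx = new DocIndex("<PLAY><TITLE/><ACT><SCENE/></ACT><ACT/></PLAY>");
        System.out.println(idx.children(0, 1, 2)); /* second and third child of the root */
    }
}
```

Keeping both directions of the mapping lets the servlet translate an incoming id to a Node and label outgoing children with their ids without re-walking the tree.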

Data structures at Applet end


/* hashtables */
Hashtable hashid_to_Path;
Hashtable hashPath_to_id;
Hashtable hashid_to_DMTN;
Hashtable hashid_to_Object;

/* parameters sent in request to servlet */


String docName;
int reqid;
int fromChild;
int toChild;

To browse through an XML document, the servlet creates a new applet. Every applet stores the four
hashtables mentioned above for constructing the tree. hashid_to_Path is a hashtable with the myID of
newObject as key and, as value, the parentPath with the folderValue of newObject appended. On the
applet side, we need to identify the id of a node when a user clicks on it. To identify any node
uniquely, we use the whole path to that node as the key; hence the hashtable hashPath_to_id, providing
the map from path to id, is used. hashid_to_DMTN is used to get the actual tree node (a
DefaultMutableTreeNode) to which the received child nodes are appended. Further, hashid_to_Object is
used to retrieve the newObject corresponding to an id value, since fields like isLeaf, myID and
parentID are stored in newObject.

The four parameters docName, reqid, fromChild and toChild constitute the request sent from the applet
to the servlet, as described above.
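The bookkeeping described above can be sketched as follows. The hashtable names are the ones from the listing; the class name TreeBook, the "/" path separator, the convention that a parentID of -1 marks the root, and the assumption that sibling values are distinct (so that paths are unique) are hypothetical.

```java
import java.util.Hashtable;
import javax.swing.tree.DefaultMutableTreeNode;

/* Hypothetical client-side bookkeeping for the foldertree. */
class TreeBook {
    final Hashtable<Integer, String> hashid_to_Path = new Hashtable<>();
    final Hashtable<String, Integer> hashPath_to_id = new Hashtable<>();
    final Hashtable<Integer, DefaultMutableTreeNode> hashid_to_DMTN = new Hashtable<>();

    /* Registers a received node and appends it to its parent's tree node. */
    DefaultMutableTreeNode add(int myID, int parentID, String folderValue) {
        String parentPath = parentID < 0 ? "" : hashid_to_Path.get(parentID);
        String path = parentPath + "/" + folderValue;
        DefaultMutableTreeNode node = new DefaultMutableTreeNode(folderValue);
        hashid_to_Path.put(myID, path);
        hashPath_to_id.put(path, myID);
        hashid_to_DMTN.put(myID, node);
        if (parentID >= 0) hashid_to_DMTN.get(parentID).add(node);
        return node;
    }

    /* Rebuilds the path of a clicked tree node and maps it back to its id. */
    int idOf(DefaultMutableTreeNode clicked) {
        StringBuilder path = new StringBuilder();
        for (Object part : clicked.getUserObjectPath()) path.append("/").append(part);
        return hashPath_to_id.get(path.toString());
    }

    public static void main(String[] args) {
        TreeBook book = new TreeBook();
        book.add(0, -1, "PLAY");
        book.add(1, 0, "TITLE");
        book.add(2, 0, "ACT");
        DefaultMutableTreeNode scene = book.add(3, 2, "SCENE");
        System.out.println(book.hashid_to_Path.get(3) + " -> id " + book.idOf(scene));
    }
}
```

The id recovered from the clicked path is what the applet would place in reqid when asking the servlet for further children.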

Figure 3.4: Foldertree for an XML document (Shakespeare's play, The Tragedy of Richard III). The figure
also displays the menu provided by the system to interact with the foldertree.

3.3 Scalability of the system


Taking potentially huge XML documents into consideration, we take care of scalability at the front end.
The XML documents are stored in a data source on the server side. When a user asks for a particular
document, only part of the document is brought to the client side and displayed. The remaining portion
of the document is brought on demand.

At the back end, all XML documents are parsed in the initialization phase, and all parsed documents
stay in memory all the time. This may become a memory overhead for huge documents in a data source. As
of now we do not handle this issue. Back-end scalability can be improved by bringing a partial DOM tree
into memory and discarding it when it is no longer needed. Further, external indices can be used to
fetch the nodes required to form a partial view of the tree and send it to the Swing applet.
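On the client side, the on-demand behaviour amounts to computing which range of children to request next from the two counters numChildren and numChildrenAtClient. A small sketch follows; the class name FetchPlanner, the batch size of 10, and zero-based inclusive numbering are assumptions, not values from the system.

```java
/* Hypothetical helper: which children should the applet request next? */
class FetchPlanner {
    static final int BATCH = 10; /* assumed number of children fetched per request */

    /* Returns {from, to} (zero-based, inclusive), or null when nothing is left to fetch. */
    static int[] nextRange(int numChildren, int numChildrenAtClient) {
        if (numChildrenAtClient >= numChildren) return null;
        int from = numChildrenAtClient;
        int to = Math.min(numChildren, from + BATCH) - 1;
        return new int[] { from, to };
    }

    public static void main(String[] args) {
        int[] range = nextRange(25, 20);
        System.out.println("request children " + range[0] + ".." + range[1]);
    }
}
```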

3.4 Extensions to the Foldertree


This section describes a few features of the foldertree, such as the use of style information and menus
that facilitate getting different views of the same document.

3.4.1 Styling enhancements


Construction of a foldertree is the basic step. To make the foldertree more presentable, we use
stylesheet information. The applet cannot read files. To allow the applet to read a file, we need to
sign the applet for reading a particular file on the local machine, or we need to provide the whole
file as input to the applet. Thus, the approach becomes somewhat complex.

Our system uses style information to display the folder nodes and leaf nodes of a foldertree in the
style currently set as default. Users can change the default style using the Select Style option in
the menu. On the servlet end, the style chosen by the user is saved in the configuration file.
Whenever the applet at the client end asks for a particular document, the servlet sends the current
style settings to the applet over the stream. The approach can be extended further to display the
foldertree according to the stylesheets associated with XML documents.

3.4.2 Interactive Foldertree


The system provides users with a mouse-over menu to interact with the foldertree and get customized
views of the same XML document. Figure 3.4 shows how the system provides navigation in foldertree
format. It also displays the menu provided to facilitate interaction with the foldertree.

• Find matching - Highlights the elements having the same value as the selected element.

• Expand subtree - Displays the child nodes of the selected element. Currently only one level of
expansion is provided. This feature can be extended to expand a node up to the number of levels asked
for by the user.

• Drop subtree - Drops the whole subtree below the selected element.

• Drop element - Drops the elements with the same name as the selected element.
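The menu operations map naturally onto operations on Swing tree nodes. The sketch below is illustrative only: the class name TreeMenu is hypothetical, and each node's user object is assumed to serve as both its name and its value, which simplifies the distinction the system makes between the two.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.swing.tree.DefaultMutableTreeNode;

/* Hypothetical implementations of three foldertree menu entries. */
class TreeMenu {
    /* Find matching: other nodes whose value equals that of the selected node. */
    static List<DefaultMutableTreeNode> findMatching(DefaultMutableTreeNode root,
                                                     DefaultMutableTreeNode selected) {
        List<DefaultMutableTreeNode> hits = new ArrayList<>();
        for (Object o : Collections.list(root.preorderEnumeration())) {
            DefaultMutableTreeNode n = (DefaultMutableTreeNode) o;
            if (n != selected && selected.getUserObject().equals(n.getUserObject())) hits.add(n);
        }
        return hits;
    }

    /* Drop subtree: remove everything below the selected node. */
    static void dropSubtree(DefaultMutableTreeNode selected) {
        selected.removeAllChildren();
    }

    /* Drop element: detach every non-root node named like the selected one. */
    static int dropElement(DefaultMutableTreeNode root, DefaultMutableTreeNode selected) {
        Object name = selected.getUserObject();
        int dropped = 0;
        /* Snapshot the nodes first; removing while enumerating is unsafe. */
        for (Object o : Collections.list(root.preorderEnumeration())) {
            DefaultMutableTreeNode n = (DefaultMutableTreeNode) o;
            if (n.getParent() != null && name.equals(n.getUserObject())) {
                n.removeFromParent();
                dropped++;
            }
        }
        return dropped;
    }

    public static void main(String[] args) {
        DefaultMutableTreeNode root = new DefaultMutableTreeNode("PLAY");
        DefaultMutableTreeNode act = new DefaultMutableTreeNode("ACT");
        root.add(act);
        root.add(new DefaultMutableTreeNode("ACT"));
        root.add(new DefaultMutableTreeNode("TITLE"));
        System.out.println("matching: " + findMatching(root, act).size());
        System.out.println("dropped: " + dropElement(root, act));
    }
}
```

In a displayed JTree, the tree model would also have to be notified after such structural changes so that the view is refreshed.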

Figure 3.5: Query: Get articlesTuple from sigmod.xml containing ’Donald’

Figure 3.6: Query Result: Get articlesTuple from sigmod.xml containing ’Donald’

3.5 Select Interface


Our system provides a Select interface to select a particular element from an XML document. Users who
know exactly what they want can specify pattern matching queries using the Select interface to get the
desired result. We plan to extend it to support more complex group-by and nested queries.

We currently use the XQL engine provided by GMD-IPSI [XQL], which provides XQL APIs to run basic
queries on XML documents.

3.5.1 Working of Select Interface


XPath expressions are needed to specify a query in XQL. Not every XML document has a DTD associated
with it. Hence, we derive the hierarchy of elements from the XML document itself, which helps us to
query a particular element of the document. At the initialization step, this nesting of elements,
along with their XPath expressions, is stored in a file. Queries take the form of a "contains" clause.
Users type values into the Select interface and get the elements containing the specified values as
the result of the query. For example, while browsing sigmod.xml, which contains a collection of
articles from SIGMOD, users might want the articles written by author 'Donald'. Here, users can
specify 'Donald' in the field 'articlesTuple' to get a detailed list of articles written by author
'Donald'. The result of the query is displayed in tabular format using HTML. Indentation is used to
portray the nesting of elements.

At the initialization of the system, we create a .xpath file for every XML document in the data
source. Our system provides an XPATH module for that purpose. These files are used to form a query
when users specify values in the Select interface.

Figure 3.5 shows the Query interface for specifying values. sigmod.xml is a set of SIGMOD articles,
each having fields like volume number, title, authors etc., as shown in Figure 3.5. The query looks
like: articlesTuple contains "Donald". The result is shown in Figure 3.6. The document contains two
entries with author value equal to "Donald". Both of these tuples are presented to the user as the
result of the query.

Figure 3.7: Bar graph displaying the year-wise distribution of CDs for an XML document containing a
collection of CDs (x-axis: Year, 1995 to 2000; y-axis: Number of CDs)
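To make the form of a "contains" query concrete, the sketch below evaluates one with the XPath API that ships with the JDK (javax.xml.xpath). This is a substitute for illustration and is not the GMD-IPSI XQL engine used by the system; the class name ContainsQuery, the sample document, and all element names other than articlesTuple are made up.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/* Hypothetical translation of "<element> contains <value>" into XPath. */
class ContainsQuery {
    static String toXPath(String element, String value) {
        /* Assumes value holds no single quote; a real system must escape it. */
        return "//" + element + "[contains(., '" + value + "')]";
    }

    /* Number of <element> nodes whose text content contains value. */
    static int countMatches(String xml, String element, String value) {
        try {
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                    toXPath(element, value),
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalArgumentException("query failed", e);
        }
    }

    public static void main(String[] args) {
        /* Made-up miniature document, not the real sigmod.xml. */
        String xml = "<articles>"
                + "<articlesTuple><title>T1</title><author>Donald A</author></articlesTuple>"
                + "<articlesTuple><title>T2</title><author>B C</author></articlesTuple>"
                + "</articles>";
        System.out.println(toXPath("articlesTuple", "Donald"));
        System.out.println(countMatches(xml, "articlesTuple", "Donald") + " matching tuple(s)");
    }
}
```

The contains(., ...) test looks at the concatenated text of the whole tuple, so a match in any nested field (title, author and so on) selects the tuple.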

3.5.2 Extensions
The select interface currently provided can be enhanced as described below:

• The select interface is not yet integrated with the browsing system. We can use the same foldertree
as used for browsing an XML document to display query results, with query-result nodes in expanded
state and the remaining portion of the document in collapsed state.

• The select interface supports pattern matching queries on element nodes of the document. We can
extend it to include attribute nodes too.

• We have not taken scalability into consideration. For potentially large documents, query execution
takes quite a long time, since the XQL engine traverses the whole document again to find matching
elements. An incremental approach could run the query in the background and display partial results,
if the query language provides that feature. This would require replacing the current query engine
with a better one.

• Currently we have focussed only on implementing the "contains" clause. We can provide group-by
queries, in which users select a group-by element and a result element. This would be similar to the
concept of group-by templates provided in BANKS. For example, for a document portraying a collection
of CDs, users may want to see the list of CDs grouped by year. They would give the year as the
group-by element and the CD as the result element, and get the list of CDs grouped by year as the
result. Further, this information can be displayed graphically, as shown in Figure 3.7. The
rectangular bar representing a year can carry a hyperlink to the actual records giving the details of
the CDs of that year.
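The proposed group-by query can be sketched with the standard DOM API. The code is a hypothetical illustration of the extension, not part of the implemented system; the class name GroupBy and the element names CATALOG, CD and YEAR in the sample document are assumptions.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/* Hypothetical group-by: count result elements per value of a group-by element. */
class GroupBy {
    static Map<String, Integer> count(String xml, String resultElement, String groupByElement) {
        Map<String, Integer> counts = new TreeMap<>(); /* sorted by group value */
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList results = doc.getElementsByTagName(resultElement);
            for (int i = 0; i < results.getLength(); i++) {
                NodeList keys = ((Element) results.item(i)).getElementsByTagName(groupByElement);
                if (keys.getLength() == 0) continue; /* result without a group value */
                String key = keys.item(0).getTextContent().trim();
                counts.merge(key, 1, Integer::sum);
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot group document", e);
        }
        return counts;
    }

    public static void main(String[] args) {
        String xml = "<CATALOG>"
                + "<CD><YEAR>1995</YEAR></CD>"
                + "<CD><YEAR>1996</YEAR></CD>"
                + "<CD><YEAR>1995</YEAR></CD>"
                + "</CATALOG>";
        /* Each entry corresponds to the height of one bar in a year-wise bar graph. */
        System.out.println(count(xml, "CD", "YEAR"));
    }
}
```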

Chapter 4

Integrating Keyword Search with Browsing

The primary motivation behind keyword search is to provide an interface that helps naive users extract
information from an XML data source just by typing a few keywords.

4.1 Keyword Search


The keyword search routine constructs a graph from the XML documents, where nodes in the graph
correspond to nodes of the documents. Parent-child edges and IDREF-ID edges form the edges of the
graph. The keyword search algorithm runs on the preconstructed graph to get answer results. The
algorithm can be described as follows:

• Construct the graph from the XML data source.

• Create an in-memory text index containing all words from all documents, except for stop words. Stop
words are words such as 'a', 'an', 'the', which occur very frequently.

• Take search terms as input from the user.

• Traverse the graph starting from the nodes containing the search terms.

• Follow the backward edges to find an intersection node common to all search terms.

• The intersection node is the root node of the answer result, with the leaf nodes representing the
search terms.

• If there is more than one answer result, sort them according to relevance. The details of how the
relevance score is calculated for a result tree are given in [Mes02].

• Return the list of answer results to the user. An answer result is not only the name of the XML
document in which the word lies but the relevant portion of that document.

Construction of the graph, construction of the in-memory text index, the search technique used, and
the ranking of answer results according to relevance are described in detail by Megha Meshram in her
dissertation [Mes02].
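The backward traversal in the steps above can be illustrated with a much simplified sketch. It is not the algorithm of [Mes02]: the class name BackwardSearch is hypothetical, substring matching replaces the text index and stop-word handling, only the single best root is returned, and the sum of edge distances to the search terms is an assumed stand-in for the relevance score.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/* Hypothetical, simplified backward search over a document graph. */
class BackwardSearch {
    final List<String> text = new ArrayList<>();            /* text of each node */
    final List<List<Integer>> backward = new ArrayList<>(); /* node -> nodes pointing at it */

    int addNode(String value) {
        text.add(value.toLowerCase());
        backward.add(new ArrayList<>());
        return text.size() - 1;
    }

    /* Forward edge from -> to (parent-child or IDREF-ID), stored reversed. */
    void addEdge(int from, int to) {
        backward.get(to).add(from);
    }

    /* Breadth-first walk along backward edges from all nodes containing term. */
    private Map<Integer, Integer> expand(String term) {
        Map<Integer, Integer> dist = new HashMap<>();
        Deque<Integer> queue = new ArrayDeque<>();
        for (int n = 0; n < text.size(); n++) {
            if (text.get(n).contains(term.toLowerCase())) {
                dist.put(n, 0);
                queue.add(n);
            }
        }
        while (!queue.isEmpty()) {
            int n = queue.poll();
            for (int p : backward.get(n)) {
                if (!dist.containsKey(p)) {
                    dist.put(p, dist.get(n) + 1);
                    queue.add(p);
                }
            }
        }
        return dist;
    }

    /* Id of the intersection node closest to all terms, or -1 if there is none. */
    int bestRoot(String... terms) {
        List<Map<Integer, Integer>> perTerm = new ArrayList<>();
        for (String t : terms) perTerm.add(expand(t));
        int best = -1;
        int bestCost = Integer.MAX_VALUE;
        for (int n = 0; n < text.size(); n++) {
            int cost = 0;
            boolean reachesAll = true;
            for (Map<Integer, Integer> d : perTerm) {
                Integer c = d.get(n);
                if (c == null) { reachesAll = false; break; }
                cost += c;
            }
            if (reachesAll && cost < bestCost) { best = n; bestCost = cost; }
        }
        return best;
    }

    public static void main(String[] args) {
        BackwardSearch g = new BackwardSearch();
        int root = g.addNode("citations");
        int cit = g.addNode("citation");
        g.addEdge(root, cit);
        g.addEdge(cit, g.addNode("dunkel"));
        g.addEdge(cit, g.addNode("a rabbit model"));
        System.out.println("answer root: node " + g.bestRoot("dunkel", "rabbit", "model"));
    }
}
```

Preferring the intersection node with the smallest total distance favours compact answer trees, here the citation node rather than the document root.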

Figure 4.1: Figure shows answer result tree for query : ’dunkel rabbit model’.

4.2 Browsing search results


The keyword search routine returns the list of answer results in decreasing order of relevance. Since
a result has a hierarchical format, we have chosen the foldertree to display the answer result tree.
Answer results are also displayed using the incremental, on-demand approach.

The keyword search module accepts keywords from the user. The step-by-step execution of the algorithm
given above leads to the generation of answer results. Answer results are ranked and sorted before
being displayed to the user. Users can view an answer result in foldertree format, as shown in
Figure 4.1.

We use searchObject for sending a search result node over the HttpConnection. The searchObject carries
the parameters required for reconstructing the answer result at the client end. The class searchObject
is described below:

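/* one node of an answer result tree, as sent to the applet */
class searchObject {
    String folderValue;     /* value displayed for the node */
    boolean isLeaf;         /* leaf or nonleaf node */
    int no_of_children;     /* total number of children */
    int got_children;       /* children already brought to the client */
    boolean hasKeyword;     /* node contains at least one search term */
    String docName;         /* document the node belongs to */
}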

Every searchObject carries the following parameters:

• folderValue - a String representing the actual value of the FTN.

• isLeaf - a boolean indicating whether the node is a leaf node or a nonleaf node.

• no_of_children - an integer indicating the number of children of the search node.

• got_children - an integer indicating the number of child nodes currently brought to the client end.
It helps while displaying search results.

• hasKeyword - a boolean indicating whether the node contains at least one search term. An answer
result may contain nodes that do not contain any search term. This flag is used to display nodes
containing search terms in a different colour.

• docName - a String storing the name of the document the search term belongs to. This information is
used to browse to the respective XML document from a particular search node in the answer result.

Search results are provided with a mouse-over menu, where users can get the name of the XML document
the search term belongs to. Further, users can browse through that document by clicking on the menu.

Figure 4.2: Figure shows answer result tree for query : ’hepatitis antibodies australia’.

Consider an example where the user wants to know whether there is any medical citation written by
author 'dunkel' containing 'rabbit model' in its title. Here, the search query is 'dunkel rabbit
model'. Figure 4.1 shows the answer result for this query. Consider another example, the search query
'hepatitis antibodies australia', which finds medical citations in Australia on 'hepatitis
antibodies'. Figure 4.2 shows the answer result for this query.

4.3 Browsing Extensions
The current approach used for browsing search results is a naive one. A few features can be added to
the system to display search results in a more interesting manner.

• The current approach provides a hyperlink from a search node to the root of the XML document to
which the search node belongs. Users would rather go to the corresponding node in the XML document
than to the root of the document. This can be done by finding the node in the XML document that
corresponds to the search node and displaying the whole subtree below it.

• Currently, the browsing interface provides only the option of getting the child nodes of a
particular node. We can extend this with an option for getting the parent node of a particular node.
This feature may lead to interesting browsing patterns while browsing keyword search results.

Chapter 5

Conclusions and Future work

We have developed a complete system that facilitates browsing and keyword search in XML documents. The
system works with any XML data source at the back end. It will help naive users get information from
XML data sources containing, say, SIGMOD papers or a collection of articles. XML helps information
providers portray different views of the same information according to the user's interest. Our system
provides a foldertree for navigating through XML documents, starting from the root down to the leaf
nodes. It provides menus to get different views of the same XML document. The styling feature provided
by our system helps users view XML documents in customized styles. The system also provides a select
interface to extract desired information using pattern matching queries. The system can be enhanced in
many ways. Some of the short-term extensions include:

• Creating an external index - Currently we use an in-memory DOM, which is kept in memory all the
time. This may become an overhead for potentially huge XML documents. A better option would be to use
a persistent DOM [XQL], where we can create a disk-based index that helps in fetching a node given its
unique id, without the cost of parsing the XML document or building an in-memory DOM tree.

• Facilitating complete support for querying of XML documents - Currently we provide an interface
that supports only simple pattern matching queries using XQL. It can be extended to support more
complex group-by and nested queries.

• Using HTML to display XML documents using hyperlinks - Instead of using the foldertree, we can
generate stylesheets on the fly. XML documents can be displayed using style sheets. Hyperlinks may be
used for inter-document references.

• Exploiting features of XML documents - Currently our system supports only element nodes, along with
attribute nodes and text nodes. Most of the time, extracting information from these nodes is
sufficient. The system can be extended further to include the use of entities, entity references,
processing instructions and CDATA sections.

• Displaying keyword search results - Every search node in an answer result tree contains a parameter
called docName, which is used to browse through the actual XML document to which the search term
belongs. This feature can be extended to display only a partial view of the XML document, containing
the nodes relevant to the search term, instead of displaying the whole XML document.

• Extending Keyword Search - Including metadata search will be quite interesting for XML, since an XML
document identifies data using tags. For example, users can type in 'CATALOG.AUTHOR:Adams' to search
for a particular author named 'Adams' in catalog.xml. More precisely, we can restrict the search
domain to only those paths specified by the user. This feature will help users get more accurate
results, since XML identifies the semantics of the data contained in the document.
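One possible reading of such a path-qualified keyword is sketched below. The syntax handling is an assumption made for illustration: the class name PathKeyword is hypothetical, the first name in the path is taken to be the root element, each later name is taken to match any descendant, and an unqualified keyword is searched in every element.

```java
/* Hypothetical parser for queries such as CATALOG.AUTHOR:Adams. */
class PathKeyword {
    final String[] path; /* element names restricting the search domain */
    final String term;   /* the keyword itself */

    PathKeyword(String query) {
        int colon = query.indexOf(':');
        if (colon < 0) {
            path = new String[0];
            term = query;
        } else {
            path = query.substring(0, colon).split("\\.");
            term = query.substring(colon + 1);
        }
    }

    /* Assumed mapping: first name is the root element, later names are descendants. */
    String toXPath() {
        StringBuilder xp = new StringBuilder();
        for (int i = 0; i < path.length; i++) {
            xp.append(i == 0 ? "/" : "//").append(path[i]);
        }
        if (path.length == 0) xp.append("//*");
        /* Assumes term holds no single quote; a real system must escape it. */
        return xp + "[contains(., '" + term + "')]";
    }

    public static void main(String[] args) {
        System.out.println(new PathKeyword("CATALOG.AUTHOR:Adams").toXPath());
        System.out.println(new PathKeyword("Adams").toXPath());
    }
}
```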

Longer-term future work would include the following:

• Combined system for structured and unstructured data - In e-commerce or marketplace systems, where
scalability and performance play critical roles, it is desirable to have a system supporting
structured as well as unstructured querying. Using an RDBMS leads to degraded performance, while the
unstructured paradigm does not support structured components. For example, consider the query: get
documents containing "XML", in the "English" language, of type "pdf". Here "XML" is an unstructured
component, while the other two are structured components.

Figure 5.1: XML document portrayed as a graph (nodes: OrderData, Customer C1, Customer C2, Part P1,
Part P2, Part P3, Invoice, LineItem L1, LineItem L2)

• Navigation through XML using a graph model - Our system uses a tree model to represent an XML
document. An XML document can instead be mapped to a graph model, which will make browsing more
interesting.

The graph of an XML document can be constructed such that the element nodes of the XML document form
the nodes of the graph, while parent-child edges and IDREF-ID edges form the edges of the graph. To
portray the graph structure, we can do the following:

– For IDREF to ID references in the XML document, rather than replicating the ID nodes as children
of IDREF nodes, we can have only one instance of each ID node, to which every IDREF node can
refer. Figure 5.1 portrays an XML document as a graph, where dotted lines represent IDREF to ID
edges, while solid lines represent parent-child edges. The original XML document is displayed in
Figure 3.1.
– We can merge identical text values or entire subelements of the XML document.

The graph structure will help compress the document by removing the replication present in the tree
format. Most importantly, it will be of immense use for keyword search. The current keyword search
module explicitly searches for the ID node once it receives an IDREF node, whereas this approach
would provide a built-in graph with all IDREF-ID links already contained in it. We can use a package
called Grappa [LR], a graph package written in Java provided by AT&T Research Labs.

• Implications of the graph model on browsing as compared to the foldertree - The foldertree
facilitates in-order browsing through an XML document, starting from the root of the document down to
the leaf level. We can map an XML document to a connected graph as mentioned above. To browse through
a particular document, we can provide users with customized views according to the starting node
specified. For example, if users specify OrderData as the root node, they can view the whole document
as shown in Figure 5.1. If Invoice is specified as the starting node, users view only the part of the
document that is reachable from Invoice, directly or indirectly.

Bibliography

[BGL+99] Chaitanya Baru, Amarnath Gupta, Bertram Ludascher, Richard Marciano, Yannis Papakonstantinou,
and Pavel Velikhov. XML-Based Information Mediation with MIX. In ACM SIGMOD 1999, exhibition
program, University of California, San Diego, La Jolla, CA 92093, 1999.

[BHN+01] Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, and S. Sudarshan. Keyword
searching and browsing in databases using BANKS. In Proc. of ICDE, 2001.

[DEGP98] Shaul Dar, Gadi Entin, Shai Geva, and Eran Palmon. DataSpot: Database Exploration Using
Plain Language. In Proc. of the 24th VLDB Conference, Data Technologies Ltd., 1998.

[DTD] Java DTD Parser. Available online at http://www.wutka.com/dtdparserdownload.html.

[HBN+01] Arvind Hulgeri, Gaurav Bhalotia, Charuta Nakhe, Soumen Chakrabarti, and S. Sudarshan. Keyword
searching and browsing in databases using BANKS. In IEEE Data Engineering Bulletin, September 2001.

[JAX] JAXP API for XML parsing 1.1.1. Available online at
http://java.sun.com/xml/jaxp/dist/1.1/docs/api/overview-summary.html.

[JDK] Java API 1.2.2. Available online at http://java.sun.com/products/jdk/1.2/docs/api/index.html.

[JS] Java Servlet API. Available online at
http://java.sun.com/products/servlet/2.2/javadoc/index.html.

[LR] AT&T Labs-Research. Grappa - A Java Graph Package. Available online at
http://www.research.att.com/sw/tools/graphviz/packages/grappa.html.

[Mes02] Megha Meshram. Keyword Searching in XML Documents. Master's thesis, Computer Science and
Engineering Department, IIT Bombay, 2002.

[MLP99] Kevin D. Munroe, Bertram Ludascher, and Yannis Papakonstantinou. Blended Browsing and
Querying of XML in a Lazy Mediator System. In VDB 2000, University of California, San Diego, La
Jolla, CA 92093, 1999.

[XQL] GMD-IPSI XQL Engine. Available online at http://xml.darmstadt.gmd.de/xql/xql-examples.html.
