Browsing and Querying On XML Data Sources
Dissertation
By
Urmila Kelkar
Roll No : 00305402
Prof. S. Sudarshan
This is to certify that the dissertation entitled Browsing and Querying in XML Data Sources by Urmila
Kelkar is approved for the award of the degree of Master of Technology.
Prof. S. Sudarshan
(Guide)
Internal Examiner
External Examiner
Chairman
Date :
Acknowledgement
I would like to thank my guide, Dr. S. Sudarshan, for his untiring support and encouragement throughout
my M.Tech project. I would also like to acknowledge those who, from behind the scenes, have contributed
their ideas and energies. Special thanks to all my colleagues from the Informatics lab.
Urmila Kelkar
January 17, 2002
Contents
1 Introduction
2 Related Work
2.1 Browsing
2.1.1 Blended Browsing and Querying by BBQ
2.1.2 XML based information mediation with MIX
2.1.3 BANKS
2.2 Keyword Search
2.2.1 BANKS
2.2.2 DataSpot
Abstract
The Web is extensively used by every information seeker. Search engines such as Google retrieve informa-
tion from HTML documents. They allow users to get desired information by just typing a few words and
following hyperlinks. The goal of our project is to design and implement a system providing a powerful
way of extracting information from XML documents, using browsing and keyword search.
Our system provides a directory-tree-like interface for browsing nested XML data, coupled with the use
of IDREF to ID links and the use of stylesheets. To facilitate customized views of the same XML document,
we provide menus to drop elements, to find matching elements, to drop a subtree and so on. Additionally,
we provide keyword search, where users can just type a few keywords to get the desired information from the
XML data source.
Chapter 1
Introduction
XML is an evolving technology that is becoming important because of its standardized data representation
format. XML documents focus on the semantics of data; they do not provide information about how to display
the data contained in the document. XML embodies a semistructured data model, which is likely to be used
to publish heterogeneous data.
Consider the example of electronic patient records (EPR) used by public health services. Many European
countries aim to introduce EPR as a standard way of maintaining patient records. XML, which provides
users with a flexible way to mark up nested data, appears well suited for maintaining EPRs. Administrative
committees in a number of hospitals are now incorporating XML within their prescribed standards for
maintaining patient records. This emphasizes the need for a system that facilitates information retrieval from the
underlying XML documents.
We aim to develop a system that facilitates browsing of XML documents and provides keyword search
as well. In browsing, we focus on the two access patterns that will be most commonly used: document
traversal and pattern matching queries. The nesting structure of an XML document is used to provide
navigation through the document in a tree format, called a foldertree. Users can choose a particular style
for displaying XML documents. Our system also provides a menu to drop the selected subtree, to expand
the selected subtree, to drop a particular element and to highlight matching elements. Since XML documents
can be very large, it is important to make the system scalable. Our system facilitates scalable browsing
using an incremental approach. We also provide an interface called the 'Select Interface', which helps
users retrieve information from XML documents using pattern matching queries. A few systems, such as
Blended Browsing and Querying (BBQ) [MLP99] and XML based information mediation with MIX (MIX) [BGL+99],
support querying and browsing of XML documents. BBQ uses an incremental on-demand approach.
Our system additionally provides keyword search, which is not supported by BBQ and MIX. The keyword
search module constructs a graph from the XML documents available in the data source. We use the least
common ancestor technique to find answer results. The answer result tree can be browsed like any other
XML document.
Chapter 2 gives an overview of related work in the area of browsing XML data sources. The browsing
interface offered by our system is described in Chapter 3. Chapter 4 briefly describes the keyword search
module and browsing of search results. The detailed approach of keyword search is described by Megha
Meshram in her dissertation [Mes02]. Chapter 5 outlines conclusions and describes future work.
Chapter 2
Related Work
The Web has introduced a new paradigm of browsing and keyword search to retrieve information from
HTML documents. Our goal is to provide a similar system to browse through XML documents. Techniques
used for information retrieval from HTML documents cannot be directly used for XML documents.
This chapter gives an overview of the systems supporting browsing, keyword search and querying of XML
documents.
2.1 Browsing
Search engines such as Google help a naive user to browse through information using hyperlinks. There are
a few systems that provide support for browsing through XML documents. The following sections describe
these systems.
2.1.1 Blended Browsing and Querying by BBQ
BBQ assigns each document a separate window with its data and schema displayed side by side. Both
data and schema can be navigated using directory-like structures. BBQ querying is schema driven. Underneath,
it uses the XMAS (XML Matching And Structuring language) query language, which supports joins, filtering
and constraints on leaf nodes. BBQ assumes that users cannot come up with a focused query at the first
step and therefore facilitates iterative query refinement. It supports queries on multiple DTDs. Users can
specify the structure of the query by dragging and dropping elements from the source DTDs, by introducing
new elements, or by grouping elements according to the value of other elements. Further, once execution of
the query starts, users can browse through partial results. BBQ uses the Document Object Model Application
Programming Interface (DOM API) and an incremental approach for browsing potentially very large documents
and subsequent query results.
2.1.2 XML based information mediation with MIX
representation of arbitrary source data. Since XML may also become a stumbling block while formulating
meaningful queries against semistructured databases, it uses XML DTDs as a structural description of the data
exchanged by the components of the mediator.
MIX focuses on valid XML documents, which conform to a DTD. XML queries are written in a high-level,
declarative query language known as XMAS, which evolved from the ideas of XML-QL and other XML
query languages. It allows pattern matching as well as group-by queries. To facilitate querying of
heterogeneous sources, XML wrappers are provided which export data in a uniform format to the mediator. MIX
uses BBQ as its graphical interface. XMAS queries generated by BBQ are sent to the mediator for execution
and the results are displayed using BBQ.
2.1.3 BANKS
BANKS, an acronym for Browsing ANd Keyword Search [BHN+01], facilitates browsing and keyword
search in relational databases. Earlier we had worked on a module of the BANKS system. Ordinarily, to retrieve
information from an underlying relational database, users need to know SQL (Structured Query Language). BANKS
instead uses the keyword search paradigm introduced by the Web to retrieve the desired information.
BANKS uses a tabular format to display information retrieved from the database. The foreign key - primary key
(FK-PK) relationship is used to provide links from one table to another in the form of hyperlinks. BANKS
also provides menus, using JavaScript, for sorting the records of a table on a particular column, grouping
records based on a column and so on. It also provides a 'Select Interface' to get records matching the specified
values. BANKS also provides templates such as crosstab, nested, foldertree, bar-chart and pie-chart to display
information from the database in a graphical manner.
Browsing and keyword search in XML can also be thought of as an extension to BANKS. XML is
considered semi-structured data, while BANKS uses a relational (structured) database as its back-end. Hence, the
strategies and approaches used for browsing and keyword search in XML are different, but the underlying need
for browsing and keyword search is the same.
XML documents can be displayed using a foldertree structure like the one used in BANKS. Menus such
as drop elements, select matching elements and drop subtree can be used to interact with the foldertree. We can
also provide a group-by selection interface wherein users select a group-by element and a result element
to get a graphical view of the data, like the bar-chart used in BANKS. For example, rather than viewing the whole
collection of CDs at once, users may prefer a year-wise or artist-wise collection of CDs, where we
specify CD as the result element and the year or the artist as the group-by element. We can also facilitate
querying of XML documents, using some query language at the back-end.
2.2 Keyword Search
2.2.1 BANKS
BANKS supports keyword search to retrieve information from the underlying relational database, where
users are not required to know the schema details. Just as keyword search on the Internet returns documents
containing the given words, BANKS returns tuples containing the given words. BANKS exploits the foreign key -
primary key relationships between the tables of a database to form a graph called the meta-data graph. Further,
it also uses a tuple-level graph to identify particular tuples. Given a set of words, two words are considered to be
close to each other if, in a table, they are in the same row and same column, or in different columns of a row,
or in rows of different tables linked by foreign keys. The details of the keyword search algorithm are given
in [HBN+01]. Search results are ranked and sorted before being presented.
To facilitate keyword search in XML documents, we can use the built-in hierarchical structure of an XML
document to form a graph. We can make an entry in the text index for every word from every document,
except for stop words. The text index can store DOM Node references, which will help us traverse the graph
to find the shortest paths between keyword nodes. The simplest way to find a tree containing all words
is to use the least common ancestor technique. The result will have the common intersection node as its
root, while the leaf nodes will refer to the search terms. The search results can be ranked based on the depth of
the tree or, say, the longest edge of the tree. The result tree can be browsed using the foldertree like any other
XML document.
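As a rough illustration of the text index idea, the following sketch maps every non-stop word to the DOM
element nodes whose text contains it. The class name TextIndexSketch, the stop-word list and the tokenization
rule are assumptions made only for this example; the index actually used is described in [Mes02].

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TextIndexSketch {
    // Illustrative stop-word list (assumption); a real list would be longer.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "an", "the"));

    // Parse an XML string into a DOM Document; checked exceptions are wrapped.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
    }

    // Walk the tree and index each word of each text node under its parent element.
    static Map<String, List<Node>> buildIndex(Node root) {
        Map<String, List<Node>> index = new HashMap<>();
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node n = stack.pop();
            if (n.getNodeType() == Node.TEXT_NODE) {
                for (String w : n.getNodeValue().toLowerCase().split("\\W+")) {
                    if (!w.isEmpty() && !STOP_WORDS.contains(w)) {
                        index.computeIfAbsent(w, k -> new ArrayList<>()).add(n.getParentNode());
                    }
                }
            }
            NodeList kids = n.getChildNodes();
            for (int i = 0; i < kids.getLength(); i++) {
                stack.push(kids.item(i));
            }
        }
        return index;
    }
}
```

Storing node references rather than document names is what lets a later step walk from each keyword node
towards the root of its document.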
2.2.2 DataSpot
DataSpot [DEGP98] introduces a new approach to database query and information retrieval by providing end
users with the capability of exploring databases using free-form queries and navigation. This capability
is based on a novel, schema-less representation of data, called a Hyperbase. The DataSpot representation
and search technology is the foundation of the DataSpot system. DataSpot has since been renamed Mercado
and is available as a commercial product.
A DataSpot Hyperbase is a graph structure comprising nodes, edges and node labels. Nodes are related
by directed edges of two types. A simple edge connects a parent node to a child node. The
set of children of a node is ordered. An identification edge indicates that a reference node uniquely
identifies the subject node. A DataSpot query is an associative search over a Hyperbase. The input to a
query is a set of nodes called the query sources. The result of a query is a list of answers, where each answer
is a connected Hyperbase containing the query sources. The answers to the query are ranked. Users can
view answer records in detail, navigate to related records, or submit continuation queries from
the current record or from a set of records.
Chapter 3
The primary motivation for developing a system for browsing is to ease the end user's task. Naive users
should be able to get the desired information from XML documents with just a few clicks rather than by writing
complex queries. We have developed a system which provides such an interface for extracting the desired
information from XML documents.
The central idea is to provide a scalable browsing system to navigate through XML documents. Since
an XML document consists of markup tags rather than formatting tags, we need some mechanism to convert
XML-encoded information into the true data model and to make it presentable. We use a foldertree
structure to display XML documents. A foldertree is a simple hierarchical structure like the directory tree
structure used in Windows. Consider the document order.xml shown in Figure 3.1. Our system provides
users with a foldertree view of the XML document, as shown in Figure 3.2.
This incremental approach makes the design scalable since a user is not required to spend time waiting
for child nodes to arrive. The following sections explain our approach in detail.
Approach I
In the initial stages of the work, we implemented the following non-incremental model. The model consists of
a servlet running at the server end and an applet working at the client end, communicating with each other. The
steps are as follows:
• The servlet parses the whole XML document and gets an in-memory DOM tree.
<OrderData endDate="2/11/2001" startDate="1/11/2001">
<Customer customerID="c1" name="PCS" ..... postalcode="476799">
</Customer>
<Customer customerID="c2" name="HP" ..... postalcode="476799">
</Customer>
<Invoice orderDate="20/11/2001" .... customerIDREF="c1">
<LineItem quantity="10" price="10" partIDREF="p1">
</LineItem>
<LineItem quantity="50" price="50" partIDREF="p2">
</LineItem>
<LineItem quantity="20" price="30" partIDREF="p3">
</LineItem>
<LineItem quantity="30" price="20" partIDREF="p3">
</LineItem>
</Invoice>
<Part partID="p1" ... description="Deskjet printer">
</Part>
<Part partID="p2" ... description="Desktop Computer">
</Part>
<Part partID="p3" ... description="Laptop Computer">
</Part>
</OrderData>
• The servlet traverses the in-memory DOM tree in BFS (Breadth First Search) order, creates a newObject
corresponding to each DOM Node and sends all the objects one by one over the HttpConnection. See Section
3.1.4 for the details of newObject.
• The applet receives the newObject corresponding to every DOM Node and attaches each newObject to
its parent to form a tree.
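The breadth-first flattening step of Approach I might look roughly like the sketch below. It is only an
illustration: NodeRecord is a reduced stand-in for newObject, the ids are assigned in visit order starting from 1,
and a parentID of 0 for the root is an assumption of this example. Only element nodes are visited, and the
transport over the HttpConnection is left out.

```java
import java.io.Serializable;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class BfsSketch {
    // Reduced, hypothetical stand-in for newObject.
    static class NodeRecord implements Serializable {
        final String folderValue;
        final int myID;
        final int parentID;
        NodeRecord(String folderValue, int myID, int parentID) {
            this.folderValue = folderValue;
            this.myID = myID;
            this.parentID = parentID;
        }
    }

    // Parse an XML string and return its root element; checked exceptions are wrapped.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
    }

    // Visit element nodes breadth first and emit one record per node, parents before children.
    static List<NodeRecord> flatten(Element root) {
        List<NodeRecord> out = new ArrayList<>();
        Map<Node, Integer> ids = new IdentityHashMap<>();
        Queue<Node> queue = new ArrayDeque<>();
        ids.put(root, 1);
        out.add(new NodeRecord(root.getNodeName(), 1, 0));
        queue.add(root);
        while (!queue.isEmpty()) {
            Node parent = queue.remove();
            NodeList kids = parent.getChildNodes();
            for (int i = 0; i < kids.getLength(); i++) {
                Node kid = kids.item(i);
                if (kid.getNodeType() != Node.ELEMENT_NODE) {
                    continue;
                }
                int id = ids.size() + 1;
                ids.put(kid, id);
                out.add(new NodeRecord(kid.getNodeName(), id, ids.get(parent)));
                queue.add(kid);
            }
        }
        return out;
    }
}
```

Because every parent is emitted before its children, the receiving side can attach each record to an already
known parent as soon as it arrives.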
Approach II
The servlet, running in the background, sends DOM Nodes on demand, as requested by users at the client end.
This makes the servlet stateless. The applet saves the state of the request and sends the next request as per user
navigation. The following steps are taken:
• The servlet parses the whole XML document and gets an in-memory DOM tree.
• Initially the applet requests the root node, identified by id=1, along with a few child nodes; the number
is specified in the Configuration file.
• In general, the applet sends a request with the node id of a parent node and the range of child nodes,
identified by the 'from' and 'to' parameters in the request. It also includes a docName parameter identifying the
XML document. The object called newObject, described in Section 3.1.4, is used for communication
between the servlet and the applet.
Figure 3.2: Foldertree view of order.xml document displaying IDREF to ID links
• In response to the request from the applet, the servlet sends the root node if id is equal to 1, or else only the
child nodes numbered from 'from' to 'to' of the requested XML document.
• The servlet sends a special object with myID=0 as a demarcating object, to indicate the end
of the response.
• The applet updates the tree by appending the received child nodes to their respective parent nodes and
displays the tree.
• On the applet side, once the tree is displayed, the applet waits for a user request. A user can request the
child nodes of a particular node by just clicking on that node; the request is then sent to the servlet
asking for the child nodes of that node.
• A dummy node named "More..." is used to indicate that the parent node has some more child nodes
yet to be retrieved. Users can click on the "More..." node to request those child nodes, or they can
click on the parent node itself to ask for child nodes.
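The request handling in the steps above can be sketched as follows. This is one possible reading, not the
system's code: the class and method names are hypothetical, 'from' and 'to' are assumed to be 1-based and
inclusive, and the root record is assumed to be sent together with its first children when the requested id is 1.

```java
import java.util.ArrayList;
import java.util.List;

class RangeResponseSketch {
    // myID value of the demarcating object that ends a response.
    static final int END_MARKER_ID = 0;

    // Reduced, hypothetical stand-in for newObject.
    static class NodeRecord {
        final String folderValue;
        final int myID;
        final int parentID;
        NodeRecord(String folderValue, int myID, int parentID) {
            this.folderValue = folderValue;
            this.myID = myID;
            this.parentID = parentID;
        }
    }

    // Server side: build the response for children from..to of the given parent.
    static List<NodeRecord> respond(NodeRecord parent, List<NodeRecord> children, int from, int to) {
        List<NodeRecord> out = new ArrayList<>();
        if (parent.myID == 1) {
            out.add(parent); // the root itself is sent when id is 1
        }
        int last = Math.min(to, children.size());
        for (int i = Math.max(from, 1); i <= last; i++) {
            out.add(children.get(i - 1));
        }
        out.add(new NodeRecord("", END_MARKER_ID, parent.myID)); // end of response
        return out;
    }

    // Client side: a "More..." node is shown while some children are still unfetched.
    static boolean needsMoreNode(int numChildren, int numChildrenAtClient) {
        return numChildrenAtClient < numChildren;
    }
}
```

Since each request names the document, the parent id and the child range, the server side needs no memory of
earlier requests, which is what keeps the servlet stateless.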
The foldertree is chosen to display the document since it is suitable for displaying hierarchical, nested
structures. The JAXP [JAX] DOM parser is used to parse XML documents. It gives us an in-memory tree
corresponding to a document, where every node is a DOM Node. We only consider nodes of the types
DOCUMENT_NODE, DOCUMENT_TYPE_NODE, ELEMENT_NODE, ATTRIBUTE_NODE and TEXT_NODE.
The Document node is the root of the document, while the Document type node is used to identify the DTD
associated with the document. The Document node corresponds to the root of the foldertree. Element nodes and
Attribute nodes correspond to foldertree nodes (FTN) and leaf nodes in a foldertree. Element, Text
and Attribute nodes are transformed into the foldertree structure as follows:
ELEMENT_NODE : The name of the Element node (i.e. the markup tag) in the XML document is assigned
to the corresponding foldertree node (FTN) in the foldertree. Figure 3.1 and Figure 3.2 demonstrate
the mapping from the XML document to the foldertree.
TEXT_NODE : A Text node in the XML document corresponds to a Leaf node of the foldertree. Since the Text node
carries the actual data, and an Element node in the XML document can only have a
single Text node as its child, while mapping the XML document to the foldertree we append the value
of the Leaf node (corresponding to the Text node in the XML document) to its parent foldertree node and
remove the Leaf node from the foldertree, as it is redundant.
ATTRIBUTE_NODE : An 'Attribute Name=Attribute Value' pair for every Attribute node in the XML document
is appended to the FTN corresponding to the Element node associated with the attribute. For
example, as shown in Figure 3.1 and Figure 3.2, the Element node OrderData has the Attribute node
'startDate=1/11/2001'. We append 'startDate=1/11/2001' to the FTN corresponding to OrderData in the
foldertree.
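The mapping rules above can be illustrated with a small sketch that builds the label of one foldertree node
from a DOM Element. The class name FolderLabelSketch and the separators used in the label (a space before each
attribute pair, " : " before the text value) are assumptions of this example, not the system's actual format.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class FolderLabelSketch {
    // Parse an XML string and return its root element; checked exceptions are wrapped.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
    }

    // Element name, then each Name=Value attribute pair, then the text value if any.
    static String label(Element e) {
        StringBuilder sb = new StringBuilder(e.getTagName());
        NamedNodeMap attrs = e.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            sb.append(' ').append(attrs.item(i).getNodeName())
              .append('=').append(attrs.item(i).getNodeValue());
        }
        NodeList kids = e.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node kid = kids.item(i);
            // Whitespace-only text between child elements is ignored.
            if (kid.getNodeType() == Node.TEXT_NODE && !kid.getNodeValue().trim().isEmpty()) {
                sb.append(" : ").append(kid.getNodeValue().trim());
            }
        }
        return sb.toString();
    }
}
```

With this sketch, the OrderData element carrying the attribute startDate="1/11/2001" gets the label
"OrderData startDate=1/11/2001", and its Text child, if present, is folded into the same label instead of
becoming a separate leaf.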
An Attribute node can have different types such as "CDATA", "ID", "IDREF", "ENTITY" etc. Currently our
system supports only attributes of type "ID" and "IDREF". IDREF to ID links are created by checking the
attribute type. Suppose an Element node in the XML document has an attribute of type IDREF; we then
identify the corresponding Element node having an attribute of type ID (i.e. the ID node) whose ID value matches
the IDREF value. In the foldertree, the FTN corresponding to the Element node having the ID is attached as a child
of the FTN corresponding to the Element node having the IDREF. In an XML document, the ID node and the IDREF
node can lie far apart. Our system provides a way to browse from the IDREF node to the ID node.
IDREF to ID links are thus created as part of the initialization of the system, and the in-memory DOM tree
corresponding to the XML document is updated to include IDREF to ID links, as described in the following section.
The DOM parser does not provide an API to identify attributes of a particular type, say ID or IDREF, while the SAX
parser API provides support for such identification. Since using a SAX parser would add overhead, as one
more SAX parser scan would be required in addition to the DOM parser scan, we use a DTD parser [DTD]. The DTD
parser helps us identify IDREF elements and ID elements. We support identification of IDREF to ID links
only if the document has a DTD associated with it. The in-memory DOM tree is updated to include IDREF to
ID links as follows:
• Parse the DTD associated with the document,
– For an attribute of type IDREF, enter the ElementName-AttributeName pair into the IDREF hashtable.
– For an attribute of type ID, enter the ElementName-AttributeName pair into the ID hashtable.
• When finished with the DTD, parse the XML document to construct an in-memory DOM tree,
– If an ElementName-AttributeName pair in the document matches some entry in the ID
hashtable, enter the id value of the ID Node in the ID array.
– If it matches some entry in the IDREF hashtable, enter the idref value of the IDREF Node in the IDREF
array.
• After the DOM tree is completely constructed, for every entry in the IDREF array do the following :
Figure 3.2 shows IDREF to ID links in order.xml document where Invoice refers to Customer and LineItem
refers to Part. We append Customer to the child list of Invoice and Part to the child list of LineItem for
matching IDREF and ID values in the document.
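A minimal sketch of the matching step follows. It assumes the DTD parser has already produced two sets of
"ElementName/AttributeName" keys, one for ID attributes and one for IDREF attributes; the key format, the class
name IdrefLinkSketch and the use of a list of pairs are assumptions of this example. The sketch records each
link as a pair (IDREF node, ID node) rather than modifying the DOM tree, because appending a DOM node to a new
parent removes it from its old one.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class IdrefLinkSketch {
    // Parse an XML string into a DOM Document; checked exceptions are wrapped.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
    }

    // Returns one {idrefNode, idNode} pair per IDREF attribute whose value matches some ID.
    static List<Element[]> link(Document doc, Set<String> idPairs, Set<String> idrefPairs) {
        Map<String, Element> idNodes = new HashMap<>();  // id value -> ID node
        List<Element> idrefNodes = new ArrayList<>();    // IDREF nodes in document order
        List<String> idrefValues = new ArrayList<>();    // idref value of each IDREF node
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            NamedNodeMap attrs = e.getAttributes();
            for (int j = 0; j < attrs.getLength(); j++) {
                String key = e.getTagName() + "/" + attrs.item(j).getNodeName();
                String value = attrs.item(j).getNodeValue();
                if (idPairs.contains(key)) {
                    idNodes.put(value, e);
                }
                if (idrefPairs.contains(key)) {
                    idrefNodes.add(e);
                    idrefValues.add(value);
                }
            }
        }
        List<Element[]> links = new ArrayList<>();
        for (int i = 0; i < idrefNodes.size(); i++) {
            Element target = idNodes.get(idrefValues.get(i));
            if (target != null) {
                links.add(new Element[] { idrefNodes.get(i), target });
            }
        }
        return links;
    }
}
```

Collecting all ID values before resolving any IDREF means that a reference may point forwards or backwards in
the document, as in order.xml where Invoice refers back to Customer and LineItem refers forward to Part.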
Users at the client end can send a request for a particular XML document to be browsed, or they may
request the child nodes of an FTN, numbered from say ten to twenty. This request is sent over an HttpConnection
to the servlet running at the back-end. In response, the servlet sends the requested nodes to the
client end. The client end redisplays the tree by attaching the received nodes to their corresponding parents.
The DOM parser gives us an in-memory DOM tree for the document. But since a DOM Node is not serializable, we
cannot send it over the HttpConnection. We have constructed our own object, which stores the DOM Node
information along with a few tags so that it is easier to attach those nodes to their respective parent nodes at the
client end. The class newObject describes the serializable object used for sending node information from
the server end to the client end.
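import java.io.Serializable;

/* Serializable carrier for one DOM Node, sent from the servlet to the applet. */
class newObject implements Serializable {
    String folderValue;       /* value displayed for the node in the foldertree */
    int isLeaf;               /* 1 = leaf, 0 = non-leaf */
    int myID;                 /* id of this node; 0 marks the end of a response */
    int parentID;             /* id of the parent of this node */
    int numChildren;          /* number of child nodes */
    int numChildrenAtClient;  /* child nodes already received by the applet */
}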
The newObject carries the following parameters required for reconstruction of the tree.
• isLeaf - is a variable of type integer that indicates whether the Node is a leaf node or a non-leaf node
(1-leaf, 0-non-leaf). The nodes with isLeaf value equal to 0 have to be stored temporarily on the client
side, so that whenever we get child nodes, we can append them to the parent node. We can discard
nodes with isLeaf value 1, since those are leaf nodes.
• parentID - is also an integer, used to store the ID of the parent of the Node. It helps in appending child
nodes to the correct parent Node.
• numChildrenAtClient - is an integer indicating the number of child nodes received by the applet at the
client end.
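To show why a plain serializable object is enough for the transfer, the sketch below writes a record to a byte
stream and reads it back using standard Java object serialization. NodeRecord mirrors the fields of newObject;
the byte array stands in for the HttpConnection stream, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class SerializeSketch {
    // Hypothetical stand-in with the same fields as newObject.
    static class NodeRecord implements Serializable {
        private static final long serialVersionUID = 1L;
        String folderValue;
        int isLeaf;
        int myID;
        int parentID;
        int numChildren;
        int numChildrenAtClient;
    }

    // Sender side: serialize one record into bytes.
    static byte[] toBytes(NodeRecord r) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(r);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Receiver side: rebuild the record from the bytes.
    static NodeRecord fromBytes(byte[] bytes) {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (NodeRecord) ois.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("cannot deserialize node record", e);
        }
    }
}
```

The receiver gets back the label and the id fields, which is all it needs to place the node under the right
parent; no reference to the server's DOM tree crosses the network.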
[Figure: diagram with components Browsing Interface, Select Interface (Foldertree Applet), Network and Servlet]
A DOM Node is not serializable, so it is not possible to send the DOM Node as a stream over the HttpConnection.
We therefore use an object which is designed to store the DOM Node information in a serializable format.
The object is described in Section 3.1.4. The same object is used at the back end as well as at the client end.
All XML documents in a data source are parsed to get the corresponding in-memory DOM trees. Our system
maintains a reference to the root node of each in-memory DOM tree to retrieve requested nodes quickly. At
the initialization of the system, we update the in-memory DOM tree by attaching text nodes, attaching attributes
and attaching ID nodes as children of IDREF nodes, as described in Sections 3.1.2 and 3.1.3.
/* hashtable of document references */
static Hashtable hashNode_to_id;
static Hashtable hashid_to_Node;
/* per request */
static String docName;
static int reqid;
static int from;
static int to;
At initialization, the system reads the available documents from the Configuration file. numOfDocs
indicates the number of documents in the data source. An array stores the DOM reference to the root
node of the in-memory tree for each document. hashNode_to_id and hashid_to_Node are Hashtables of
Hashtables. hashNode_to_id is a hashtable with the document number (index in the docNames array) as key and
the node-to-id hashtable for that document as value. The node-to-id hashtable stores the DOM Node as key and
the unique value assigned to that node as value. hashid_to_Node is a hashtable with the document number as key
and the corresponding id-to-node hashtable as value. The doInit() function of the servlet does this initialization
work.
The servlet stores four variables - docName, reqid, from and to. docName indicates that the applet is asking for
the XML document named docName. reqid, from and to indicate that the applet requires the child nodes numbered
from 'from' to 'to' of the node with id value equal to reqid. The servlet accepts these parameters in
doGet() and accordingly sends the response over the HttpConnection.
To browse through the XML document, the servlet creates a new applet. Every applet stores four hashtables
for constructing the tree. The first is a hashtable with the myID of a newObject as key
and the parentPath appended by the folderValue of the newObject as value. On the applet side, we need to identify
the id of a node when a user clicks on it. To uniquely identify any node, we use the
whole path to that node as a key value. Hence a second hashtable, providing the map from path to id, is used.
A third hashtable is used to get the actual tree node (a DefaultMutableTreeNode) to which the
received child nodes are appended. Finally, a fourth hashtable is used to retrieve the newObject corresponding
to an id value, since tags like isLeaf, myID and parentID are stored in the newObject.
The four parameters docName, reqid, fromChild and toChild constitute the request sent from the applet
to the servlet, as described above.
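The client-side bookkeeping can be sketched with Swing's DefaultMutableTreeNode as below. The map names
idToNode, idToPath and pathToId are hypothetical, as is the "/" path separator; the fourth table (id to
newObject) is omitted for brevity. Note that in this sketch two siblings with identical labels would share a
path, so a real implementation would need some way to tell them apart.

```java
import java.util.HashMap;
import java.util.Map;
import javax.swing.tree.DefaultMutableTreeNode;

class TreeRebuildSketch {
    final Map<Integer, DefaultMutableTreeNode> idToNode = new HashMap<>(); // id -> tree node
    final Map<Integer, String> idToPath = new HashMap<>();                 // id -> path from root
    final Map<String, Integer> pathToId = new HashMap<>();                 // path -> id

    // Attach a received node under its parent (if the parent is known) and record it in the maps.
    DefaultMutableTreeNode attach(int myID, int parentID, String folderValue) {
        DefaultMutableTreeNode node = new DefaultMutableTreeNode(folderValue);
        DefaultMutableTreeNode parent = idToNode.get(parentID);
        String parentPath = (parent == null) ? "" : idToPath.get(parentID);
        String path = parentPath + "/" + folderValue;
        if (parent != null) {
            parent.add(node);
        }
        idToNode.put(myID, node);
        idToPath.put(myID, path);
        pathToId.put(path, myID);
        return node;
    }
}
```

When the user clicks a node, its path can be looked up in pathToId to obtain the reqid for the next request,
and idToNode then gives the tree node to which the returned children are appended.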
Figure 3.4: Foldertree for an XML document (Shakespeare's play The Tragedy of Richard III). The figure
also displays the menu provided by the system to interact with the foldertree.
On the back-end side, all XML documents are parsed in the initialization phase and all
the parsed documents stay in memory all the time. This may become a memory overhead for huge documents
in a data source. As of now we are not handling this issue. Back-end scalability can be improved by
bringing only a partial DOM tree into memory and discarding it when it is no longer needed. Further, external
indices can be used to fetch the required nodes to form a partial view of the tree and send it to the Swing applet.
plet for reading a particular file on local machine or we need to provide the whole file, as input to the applet.
Thus, the approach becomes somewhat complex.
Our system uses style information to display the foldertree nodes and Leaf nodes of a foldertree in the style that
is currently set as default. Users can change the default style using the 'select Style' option given in the menu. In
our system, on the servlet end, the style chosen by the user is saved in the Configuration file. Whenever the
applet at the client end asks for a particular document, the servlet sends the current style settings to the applet
over the stream. The approach can be extended further to display the foldertree according to the stylesheets
associated with XML documents.
• Find matching - This menu item helps users highlight elements having the same value as
the selected element.
• Expand subtree - Currently we provide only one level of expansion, in which the child nodes of the selected
element are displayed. This feature can be extended to expand a particular node up to the number of levels
requested by the user.
• Drop subtree - Drops the whole subtree below the selected element.
• Drop element - Drops the elements with the same name as the selected element.
Figure 3.6: Query Result: Get articlesTuple from sigmod.xml containing ’Donald’
We are currently using the XQL engine provided by GMD-IPSI [XQL], which provides XQL APIs to run basic
queries on XML documents.
At the initialization of the system, we create a .xpath file for every XML document in the data source. Our
system provides an XPATH module for that purpose. These files are used to form a query when users specify
values in the Select Interface.
Figure 3.5 shows the Query interface for specifying values. sigmod.xml is a set of SIGMOD articles, each
having fields like volume number, title, authors etc., as shown in Figure 3.5. The query looks like
articlesTuple contains "Donald". The result is shown in Figure 3.6. The document contains two entries with
author value equal to "Donald". Both of these tuples are presented to the user as the result of the query.
Figure 3.7: Bar graph displaying year-wise distribution of CDs for an XML document containing a collection
of CDs (axis label: Number of CDs)
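For readers without access to the XQL engine, a query of the "contains" kind can be approximated with the
standard XPath API that ships with the JDK. The sketch below is such a substitute, not the GMD-IPSI XQL API;
the class name, the XPath expression form and the tiny sample document used to exercise it are assumptions of
this example and do not reproduce sigmod.xml.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ContainsQuerySketch {
    // Return the text content of every <element> whose text contains the keyword.
    // The keyword must not contain a single quote in this simple sketch.
    static List<String> select(String xml, String element, String keyword) {
        List<String> hits = new ArrayList<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            String expr = "//" + element + "[contains(., '" + keyword + "')]";
            NodeList nodes = (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
            for (int i = 0; i < nodes.getLength(); i++) {
                hits.add(nodes.item(i).getTextContent());
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("query failed", e);
        }
        return hits;
    }
}
```

A call such as select(xml, "articlesTuple", "Donald") returns one entry per matching tuple, which corresponds
to the list of result tuples presented to the user.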
3.5.2 Extensions
The select interface currently provided can be enhanced as described below :
• The select interface is not yet integrated with the browsing system. We can use the same
foldertree, as used for browsing an XML document, to display query results, with query-result nodes
in expanded state and the remaining portion of the document in collapsed state.
• The select interface supports pattern matching queries on element nodes of the document. We can
extend it to include attribute nodes too.
• We have not yet taken scalability into consideration. For potentially large documents, query execution
takes quite a long time, since the XQL engine traverses the whole document again to find
matching elements. An incremental approach could be used to run the query in the background and
display partial results, if some query language provides that feature. We would need to replace the current
querying engine with a better one.
• Currently we have focused only on implementing the "contains" clause. We can provide group-by queries
wherein users select a group-by element and a result element. This would be similar to the concept of
group-by templates provided in BANKS. For example, for a document describing a collection of CDs,
users may like to see the list of CDs grouped by year. Here users give the year as the group-by element and
the CD as the result element, and get the list of CDs grouped by year as the result. Further, this information
can be displayed in a graphical manner, as shown in Figure 3.7. The rectangular bar representing a year
can have a hyperlink to the actual records giving the details of the CDs in that particular year.
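The proposed group-by extension could be prototyped along the following lines. This is a sketch under
assumptions: the class name GroupBySketch is hypothetical, the group-by element is assumed to be a descendant
of the result element, and only the count per group is computed, which is the quantity a bar graph of the kind
in Figure 3.7 would plot.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class GroupBySketch {
    // Count result elements per value of their first group-by descendant, sorted by group value.
    static Map<String, Integer> countBy(String xml, String resultElement, String groupByElement) {
        Map<String, Integer> counts = new TreeMap<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList results = doc.getElementsByTagName(resultElement);
            for (int i = 0; i < results.getLength(); i++) {
                Element result = (Element) results.item(i);
                Node key = result.getElementsByTagName(groupByElement).item(0);
                if (key == null) {
                    continue; // result element without a group-by value is skipped
                }
                counts.merge(key.getTextContent().trim(), 1, Integer::sum);
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("group-by failed", e);
        }
        return counts;
    }
}
```

For a CD collection, countBy(xml, "cd", "year") yields one entry per year with the number of CDs in that year;
keeping the matching result elements instead of a count would give the records a bar could link to.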
Chapter 4
The primary motivation behind keyword search is to provide an interface that helps naive users extract
information from an XML data source just by typing a few keywords.
• Create an in-memory text index containing all words from all documents, except for stop words. Stop
words are words such as 'a', 'an', 'the' which occur very frequently.
• Follow the backward edges to find an intersection node common to all search terms.
• The intersection node is the root node of the answer result, with the leaf nodes representing the search
terms.
• If there is more than one answer result, sort them according to relevance. The details of
how the relevance score is calculated for a result tree are given in [Mes02].
• Return the list of answer results to the user. An answer result is not just the name of the XML document
where the word lies but the relevant portion of the XML document where the word lies.
Construction of graph, construction of in-memory textindex, the search technique used, the ranking of
answer results according to the relevance, are described in detail by Megha Meshram in her dissertation
[Mes02].
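The step of following backward edges to an intersection node can be illustrated on a DOM tree, where the
backward edge of a node is its parent link. The sketch below is a simplified illustration with hypothetical
names; it handles one keyword node per search term within a single document and ignores ranking and IDREF
edges, which are covered in [Mes02].

```java
import java.io.StringReader;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class LcaSketch {
    // Parse an XML string into a DOM Document; checked exceptions are wrapped.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
    }

    // True if candidate is n itself or lies on the path from n up to the root.
    static boolean isAncestorOrSelf(Node candidate, Node n) {
        for (Node p = n; p != null; p = p.getParentNode()) {
            if (p.isSameNode(candidate)) {
                return true;
            }
        }
        return false;
    }

    // Climb from the first keyword node until a node is found that covers all keyword nodes.
    static Node lowestCommonAncestor(List<Node> keywordNodes) {
        for (Node c = keywordNodes.get(0); c != null; c = c.getParentNode()) {
            boolean coversAll = true;
            for (Node n : keywordNodes) {
                if (!isAncestorOrSelf(c, n)) {
                    coversAll = false;
                    break;
                }
            }
            if (coversAll) {
                return c;
            }
        }
        return null; // nodes from different documents share no ancestor
    }
}
```

The returned node plays the role of the root of the answer result; the subtree paths from it down to the
keyword nodes form the portion of the document shown to the user.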
Figure 4.1: Answer result tree for the query 'dunkel rabbit model'.
The keyword search module accepts keywords from the user. Step-by-step execution of the algorithm
given above leads to the generation of answer results. Answer results are ranked and sorted before being
displayed to the user. Users can view an answer result in foldertree format, as shown in Figure 4.1.
We use searchObject for sending search result nodes over the HttpConnection. The searchObject carries the
parameters required for reconstruction of the answer result at the client end. The class searchObject is described
below:
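/* Serializable carrier for one node of an answer result. */
class searchObject implements java.io.Serializable {
    String folderValue;    /* actual value of the FTN */
    boolean isLeaf;        /* whether the node is a leaf node */
    int no_of_children;    /* number of children of the search node */
    int got_children;      /* child nodes currently brought to the client end */
    boolean hasKeyword;    /* whether the node contains at least one search term */
    String docName;        /* document the search term belongs to */
}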
• folderValue - is a String representing the actual value of the FTN.
• isLeaf - is a boolean which indicates whether the node is a leaf node or a nonleaf node.
• no_of_children - is an integer which indicates the number of children of the search node.
• got_children - is an integer which indicates the number of child nodes currently brought to the client
end. It helps while displaying search results.
• hasKeyword - is a boolean which indicates whether the node contains at least one search term.
An answer result may contain nodes which do not contain any search term. This flag is used to display
nodes containing search terms in a different colour.
• docName - is a String storing the name of the document the search term belongs to. This information
is used to browse to the respective XML document from a particular search node in the answer result.
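To illustrate how these parameters could be used to reconstruct an answer result at the client end, the following Java sketch rebuilds a tree from a sequence of searchObject records. It is an illustration only: it assumes that the records arrive in pre-order (a parent before its children), which the text does not state, and the names SearchTreeSketch, TreeNode and rebuild are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class SearchTreeSketch {
    /** Mirrors the fields of searchObject described in the text. */
    static class searchObject {
        String folderValue;
        boolean isLeaf;
        int no_of_children;
        int got_children;
        boolean hasKeyword;
        String docName;

        searchObject(String folderValue, int no_of_children, boolean hasKeyword, String docName) {
            this.folderValue = folderValue;
            this.no_of_children = no_of_children;
            this.isLeaf = (no_of_children == 0);
            this.hasKeyword = hasKeyword;
            this.docName = docName;
        }
    }

    /** Hypothetical client-side tree node wrapping one received record. */
    static class TreeNode {
        final searchObject data;
        final List<TreeNode> children = new ArrayList<>();
        TreeNode(searchObject data) { this.data = data; }
    }

    /** Rebuild a subtree from records in pre-order (assumed), counting children as they arrive. */
    static TreeNode rebuild(Iterator<searchObject> preOrder) {
        searchObject so = preOrder.next();
        TreeNode node = new TreeNode(so);
        so.got_children = 0;
        while (so.got_children < so.no_of_children && preOrder.hasNext()) {
            node.children.add(rebuild(preOrder));
            so.got_children++;
        }
        return node;
    }

    /** Render the tree as indented text; nodes containing a search term are marked with '*'. */
    static void print(TreeNode n, String indent, StringBuilder out) {
        out.append(indent).append(n.data.hasKeyword ? "* " : "  ")
           .append(n.data.folderValue).append('\n');
        for (TreeNode c : n.children) print(c, indent + "  ", out);
    }

    public static void main(String[] args) {
        List<searchObject> received = Arrays.asList(
                new searchObject("Citation", 2, false, "citations.xml"),
                new searchObject("Author: dunkel", 0, true, "citations.xml"),
                new searchObject("Title: rabbit model", 0, true, "citations.xml"));
        StringBuilder out = new StringBuilder();
        print(rebuild(received.iterator()), "", out);
        System.out.print(out);
    }
}
```

When fewer records have arrived than no_of_children announces, got_children stays below no_of_children, which tells the client how many children still have to be fetched for that node.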
Each search result is provided with a mouse-over menu from which users can get the name of the XML document
the search term belongs to. Further, users can browse through the document by clicking on the menu.
Figure 4.2: Answer result tree for the query ’hepatitis antibodies australia’.
Consider an example where the user wants to know whether there is any medical citation written by the author
’dunkel’ containing ’rabbit model’ in its title. Here, the search query is ’dunkel rabbit model’.
Figure 4.1 shows the answer result for this query. As another example, consider the search query
’hepatitis antibodies australia’, which finds medical citations from Australia on ’hepatitis antibodies’.
Figure 4.2 shows the answer result for this query.
4.3 Browsing Extensions
The current approach to browsing search results is a naive one. A few features can be added to the
system to display search results in a more interesting manner.
• The current approach provides a hyperlink from a search node to the root of the XML document to which
the search node belongs. Users would prefer to go to the corresponding node in the XML document rather
than to the root of the document. This can be done by finding the node in the XML document that
corresponds to the search node and displaying the whole subtree below it.
• Currently, the browsing interface provides only the option of getting the child nodes of a particular node.
We can extend this to provide an option for getting the parent node of a particular node. This feature
may lead to interesting browsing patterns while browsing keyword search results.
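Both extensions amount to simple DOM navigation. The following Java sketch, using the standard JAXP and DOM interfaces, locates the element corresponding to a search node and steps from it to its parent. It assumes, purely for illustration, that a search node records its position as a list of child-element indexes from the document root; this addressing scheme and the names NodeLocatorSketch, locate and parentOf are hypothetical and not part of the current system.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeLocatorSketch {
    /** Parse an XML string into a DOM document; parse failures are rethrown unchecked. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    /** Return the index-th element child of parent (text nodes are skipped), or null. */
    static Element childElement(Element parent, int index) {
        int seen = 0;
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Element) {
                if (seen == index) return (Element) c;
                seen++;
            }
        }
        return null;
    }

    /** Walk from the document root along the given child indexes (hypothetical addressing). */
    static Element locate(Document doc, int[] path) {
        Element cur = doc.getDocumentElement();
        for (int i : path) {
            if (cur == null) return null;
            cur = childElement(cur, i);
        }
        return cur;
    }

    /** Return the parent element, or null when e is the document root. */
    static Element parentOf(Element e) {
        Node p = e.getParentNode();
        return (p instanceof Element) ? (Element) p : null;
    }

    public static void main(String[] args) {
        Document doc = parse("<CATALOG><CD><TITLE>First</TITLE></CD>"
                + "<CD><TITLE>Second</TITLE></CD></CATALOG>");
        Element title = locate(doc, new int[] {1, 0});
        System.out.println(title.getTextContent());
        System.out.println(parentOf(title).getTagName());
    }
}
```

Once the element has been located, displaying the subtree below it or moving up to its parent needs no further parsing of the document.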
Chapter 5
We have developed a complete system which facilitates browsing and keyword search in XML documents.
The system works with any XML data source at the back-end. The system will help naive users get
information from XML data sources containing, say, SIGMOD papers or a collection of articles. XML will help
information providers portray different views of the same information as per the user’s interest. Our system
provides a foldertree for navigating through XML documents, starting from the root, down to the leaf nodes. To
get different views of the same XML document, it provides menus. The styling feature provided by our system
helps users view XML documents in customized styles. The system also provides a select interface to
extract desired information using pattern matching queries. The system can be enhanced in many ways.
Some of the short-term extensions include:
• Creating an external index - Currently we use an in-memory DOM which is kept in memory all
the time. This may become an overhead for potentially huge XML documents. A better option would
be to use a persistent DOM [XQL], where we can create a disk-based index which helps in fetching
a node given its unique id, without the cost of parsing the XML document or building an in-memory DOM
tree.
• Facilitating complete support for querying of XML documents - Currently we provide an interface
which supports only simple pattern matching queries using XQL. It can be extended to support
complex group-by and nested queries as well.
• Using HTML to display XML documents using hyperlinks - Instead of using the foldertree, we can
generate stylesheets on the fly. XML documents can then be displayed using these stylesheets. Hyperlinks
may be used for inter-document references.
• Exploiting features of XML documents - Currently our system supports only element nodes, along
with attribute nodes and text nodes. Most of the time, extracting information from these nodes is
sufficient. Further, the system can be extended to include the use of entities, entity references, processing
instructions and CDATA sections.
• Displaying keyword search results - Every search node in an answer result tree contains a parameter
called docName, which is used to browse through the actual XML document to which the search term
belongs. Further, this feature can be extended to display only a partial view of the XML document containing
the nodes relevant to the search term, instead of displaying the whole XML document.
• Extending Keyword Search - Including metadata search will be quite interesting for XML, since an XML
document identifies data using tags. For example, users can type in ’CATALOG.AUTHOR:Adams’
to search for a particular author named ’Adams’ in catalog.xml. More precisely, we can restrict the
search domain to only those paths specified by the user. This feature will help users get more
accurate results, since XML identifies the semantics of the data contained in the document.
• Combined system for structured and unstructured data - In e-commerce or marketplace systems,
where scalability and performance play critical roles, it is desirable to have a system providing
support for structured as well as unstructured querying. Using an RDBMS leads to degraded performance,
while the unstructured paradigm doesn’t provide support for some structured components. For example,
consider the query: Get documents containing "XML", in "English" language, of type "pdf". Here "XML"
is an unstructured component while the other two are structured components.
[Figure 5.1: XML document portrayed as a graph; node labels include OrderData, Customer C2, Part P2, LineItem L1 and LineItem L2]
• Navigation through XML using a graph model - Our system uses a tree model to represent an XML
document. An XML document can instead be mapped to a graph model, which will
make browsing more interesting.
The graph of an XML document can be constructed such that the element nodes of the
XML document are the nodes of the graph, while parent-child edges and IDREF-ID edges
are the edges of the graph. To portray the graph structure, we can do the following:
– For IDREF to ID references in the XML document, rather than replicating the ID node as a child
of each IDREF node, we can have only one instance of the ID node, to which every IDREF node can
refer. Figure 5.1 portrays an XML document as a graph, where dotted lines represent
IDREF to ID edges, while solid lines represent parent-child edges. The original XML document
is displayed in Figure 3.1.
– We can merge identical text values or entire subelements of the XML document.
The graph structure will help compress the document by removing the replication present in the tree format.
Most importantly, it will be of immense use for keyword search. The current keyword search module
explicitly searches for the ID node once it receives an IDREF node, whereas the graph approach provides a
built-in graph with all IDREF-ID links already embedded in it. We can use a package called Grappa [LR],
a graph package written in Java and provided by AT&T Labs-Research.
• Implications of the graph model on browsing as compared to the foldertree - The foldertree facilitates
in-order browsing through an XML document, starting from the root of the document, down to the leaf level. We
can map an XML document to a connected graph as mentioned above. To browse through a particular
document, we can provide users with customized views according to the starting node specified. For
example, if users specify OrderData as the root node, they can view the whole document as shown in
Figure 5.1. If Invoice is specified as the starting node, users can view just the part of the document that
is reachable from Invoice, directly or indirectly.
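The graph model and the starting-node views described in the last two items can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not a design for the system: without a DTD the parser cannot tell which attributes are of type ID or IDREF, so the sketch simply treats attributes named id and ref as such. The sample document, the class XmlGraphSketch and its methods are all hypothetical, and Grappa is not used here.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class XmlGraphSketch {
    /** Graph node for one XML element; out holds parent-child and IDREF-ID edges. */
    static final class GNode {
        final String label;
        final List<GNode> out = new ArrayList<>();
        GNode(String label) { this.label = label; }
    }

    private final Map<String, GNode> byId = new HashMap<>();
    private final List<GNode> refSources = new ArrayList<>();
    private final List<String> refTargets = new ArrayList<>();

    /** Build the graph for an XML string and return the node of the root element. */
    static GNode fromXml(String xml) {
        try {
            Element rootElement = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
            XmlGraphSketch g = new XmlGraphSketch();
            GNode root = g.build(rootElement);
            g.resolveRefs();
            return root;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    /** First pass: one graph node per element, parent-child edges, and pending references. */
    private GNode build(Element e) {
        String id = e.getAttribute("id");     // assumed ID attribute; "" when absent
        String ref = e.getAttribute("ref");   // assumed IDREF attribute; "" when absent
        GNode n = new GNode(id.isEmpty() ? e.getTagName() : e.getTagName() + " " + id);
        if (!id.isEmpty()) byId.put(id, n);
        if (!ref.isEmpty()) { refSources.add(n); refTargets.add(ref); }
        for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Element) n.out.add(build((Element) c));
        }
        return n;
    }

    /** Second pass: every IDREF points at the single instance of its ID node. */
    private void resolveRefs() {
        for (int i = 0; i < refSources.size(); i++) {
            GNode target = byId.get(refTargets.get(i));
            if (target != null) refSources.get(i).out.add(target);
        }
    }

    /** Labels of all nodes reachable from start, directly or indirectly, in breadth-first order. */
    static List<String> reachable(GNode start) {
        Set<GNode> visited = new LinkedHashSet<>();
        Deque<GNode> queue = new ArrayDeque<>();
        visited.add(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            for (GNode next : queue.remove().out) {
                if (visited.add(next)) queue.add(next);
            }
        }
        List<String> labels = new ArrayList<>();
        for (GNode n : visited) labels.add(n.label);
        return labels;
    }

    public static void main(String[] args) {
        GNode root = fromXml("<OrderData><Customer id='C2'/><Part id='P2'/>"
                + "<Invoice ref='C2'><LineItem id='L1' ref='P2'/></Invoice></OrderData>");
        System.out.println(reachable(root));             // view from the document root
        System.out.println(reachable(root.out.get(2)));  // view starting at Invoice
    }
}
```

Because each ID node exists only once, a view that starts at Invoice reaches the customer and the part through the reference edges without any replicated subtrees.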
Bibliography
[BGL 99] Chaitanya Baru, Amarnath Gupta, Bertram Ludäscher, Richard Marciano, Yannis Papakonstantinou,
and Pavel Velikhov. XML-Based Information Mediation with MIX. In ACM SIGMOD
1999, exhibition program, University of California, San Diego, La Jolla, CA 92093, 1999.
[BHN 01] Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, and S. Sudarshan. Keyword
searching and browsing in databases using BANKS. In Proc. of ICDE, 2001.
[DEGP98] Shaul Dar, Gadi Entin, Shai Geva, and Eran Palmon. DataSpot: Database Exploration Using
Plain Language. In Proc. of the 24th VLDB Conference, Data Technologies Ltd., 1998.
[HBN 01] Arvind Hulgeri, Gaurav Bhalotia, Charuta Nakhe, Soumen Chakrabarti, and S. Sudarshan. Keyword
searching and browsing in databases using BANKS. In IEEE Data Engineering Bulletin,
September 2001.
[JAX] JAXP API for XML parsing 1.1.1. Available online at http://java.sun.com/xml/
jaxp/dist/1.1/docs/api/overview-summary.html.
[LR] AT&T Labs-Research. Grappa - A Java Graph Package. Available online at http://www.
research.att.com/sw/tools/graphviz/packages/grappa.html.
[Mes02] Megha Meshram. Keyword Searching in XML Documents. Master’s thesis, Computer Science
and Engineering Department, IIT Bombay, 2002.
[MLP99] Kevin D. Munroe, Bertram Ludäscher, and Yannis Papakonstantinou. Blended Browsing and
Querying of XML in a Lazy Mediator System. In VDB 2000, University of California, San
Diego, La Jolla, CA 92093, 1999.