Classification
Umara Noor
Department of Computer Science, International Islamic University (DCS, IIUI), Islamabad, Pakistan
[email protected]

Zahid Rashid
School of Electrical Engineering and Computer Science (SEECS, NUST), Islamabad, Pakistan
[email protected]

Azhar Rauf
Department of Computer Science, University of Peshawar (UOP), Peshawar, Pakistan
[email protected]
ABSTRACT
Today, the deep web comprises a large part of web content. Because of this large volume of data, technologies related to the deep web have gained increasing attention in recent years. The deep web mostly consists of online domain-specific databases that are accessed through web query interfaces. These highly relevant domain-specific databases are well suited to satisfying the information needs of users. To make the extraction of relevant information easier, deep web databases need to be classified into subject-specific, self-descriptive categories. In this paper we present TODWEB, a novel training-less classification approach based on common-sense world knowledge (in the form of an ontology or any external lexical resource) for automatic deep web source classification, which helps in building highly scalable, domain-focused and efficient semantic information retrieval systems (e.g. metasearch engines and search engine directories). An important aspect of this approach is that the classification method is completely training-less and uses the Wikipedia category network and domain-independent ontologies to analyze the semantics in the meta-information of deep web sources. The large number of fine-grained Wikipedia categories is employed to analyze semantic relatedness among concepts, and finally the URL of a deep web search source is mapped to the category hierarchy offered by Wikipedia. Experiments conducted on a collection of search sources show that this approach results in a highly accurate and fine-grained classification compared to existing approaches, nearly identical to the results achieved by manual classification.
Categories and Subject Descriptors
H.2.4 [Database Management]: Systems - Distributed Databases; H.2.5 [Database Management]: Heterogeneous Databases; H.3.3 [Information Systems]: Information Search and Retrieval - information filtering
General Terms
Algorithms, Performance, Design, Experimentation, Verification.
Keywords
Deep Web, Ontology, Training-less Classification, Semantic
Information Retrieval, Semantic Search, Wikipedia, UMBEL
1. INTRODUCTION
The information hidden in the deep web is much larger than that of the surface web. Statistics show exponential growth in both the mass and the number of newly emerging deep web sites. In a survey [1], the estimated number of deep web sites was 43,000-96,000, with an estimated mass of 7,500 terabytes. These numbers show that the volume of the deep web is about 400 times larger than the data contained in the surface web. Recent studies [2] report an increase of 25 million in the number of deep web databases.
Different studies have observed that the quality of deep web content is much higher than that of the surface web; it is estimated to be at least 1,000 to 2,000 times greater [3]. Furthermore, the relevancy factor is also relatively high with respect to the information needs of users, as more than half of the deep web content resides in topic- or domain-specific databases. These databases focus on data in a particular confined domain, e.g. information about health, sports, movies, actors, songs, singers, books, travel, hotels, property, vehicles, etc. The content of these hidden web databases is naturally clustered, which greatly improves users' satisfaction with the searched content.
In order to bring these hidden information sources to the surface of the web, their contents need to be crawled and indexed by search engines. Several techniques have been proposed to crawl and index these hidden information sources [4-7]. Besides other shortcomings, these techniques are not able to fully address the problems of information relevancy and coverage of hidden web content. To cope with these issues, the research community has proposed the automatic classification of hidden web sources into domain categories as a first step [8-17]. All of these classification techniques are training oriented: they employ statistical measures and require a pre-classified training data set.
In this paper we propose a novel training-less approach that automatically classifies deep websites into the hierarchical categories of Wikipedia. These categories are merely nodes for organizing the articles in Wikipedia, and the quality of the categorized documents is continuously improved by the huge community of Wikipedia authors. Articles in Wikipedia are not strictly categorized, i.e. one article may belong to more than one category. According to [18] there are almost 400,000 categories in the English Wikipedia, with an average of 19 articles and two subcategories each. This huge number of categories has been used for identifying document topics and for measuring semantic relatedness among text snippets [19], [20].
The most attractive features of our technique are the utilization of knowledge that can be offered by any domain-independent ontology, and the extraction of the semantics buried in the links between Wikipedia categories. We choose UMBEL as the domain-independent ontology; it is an open-source, lightweight ontology whose structure is described in SKOS and OWL-Full. It is linked with many external ontologies and vocabularies, which enables the reuse of properties. It contains more than 20,000 subject concepts and was built with the aim of establishing context among data on the web. This ontology has already been used for ontological reasoning and as a resource of linked open data in [21], [22].
We implemented our technique as a web application and evaluated the results. We observed that our technique improves on existing techniques in the following respects: 1) it is training-less, which removes the drawback of requiring a representative training data set; 2) it uses a domain-independent ontology to semantically classify deep web databases, instead of building a category ontology for each and every category domain as in [10]; 3) it uses the self-descriptive Wikipedia category hierarchy, keeping in view the diverse nature of the deep web, instead of a limited set of non-descriptive category domains; and 4) it can classify both structured and unstructured deep websites.
The rest of the paper is organized as follows. Section 2 discusses
the limitations in existing works. Section 3 explains the proposed
classification technique. Section 4 formally formulates the
problem. Section 5 presents experimental results that validate the
computational efficiency of our methodology. Section 6 discusses
the related work. Finally, Section 7 concludes the study and
discusses the future work.
2. LIMITATIONS IN EXISTING WORKS
Several deep web classification methods have been proposed in the literature [8-17]; a detailed survey can be found in [23]. From that survey it can be concluded that a good deep web classification method is one that classifies both structured and unstructured deep web sources, handles both simple and advanced query interfaces, and performs fine-grained, subject-specific classification. Existing classification methods do not completely satisfy this heuristic. In this section we discuss the specific issues and limitations of existing approaches that motivated this research work. In the existing works, limitations are observed from two perspectives: the classification approach and the content representative extraction methodology.
2.1 Content Representative Extraction
To classify a deep web database into a domain, we need to understand what kind of content is contained in its archives. The content representative is the piece of information that indicates the domain of the database. In connection with the classification of deep web sources, several techniques have been proposed in the literature to find the domain of online data sources. A strong representative, which briefly captures all the major aspects of a data source and is understandable to a computing resource, helps in efficient classification. The strength of a content representative therefore depends on three factors: 1) its ease of extraction, considering the large volume of the deep web; 2) its content coverage; and 3) its understandability to automated classification machines.
Currently two approaches are adopted to extract the content representative from deep web sources: query probing and visible form features. The first approach peeps inside the deep web database, while the second tries to find the representative on the surface of the deep web source from its attributes.
2.1.1 Query Probing
Query probing is a very common method adopted by most classification techniques. In this approach, queries are designed through special techniques and are probed against the deep web databases. The results retrieved for the probed queries determine the content representative of the web database. The result of probing is a count, i.e. the number of results matched by the probed query. Query probing has several limitations:
- Automatic query design requires extensive effort to cover the breadth of a deep web database, which cannot be accomplished by merely training classifiers with the query form interfaces of the training databases.
- The counts obtained through query probing do not serve as the right representative for the classification process, as the probability of noisy results is very high.
- Forming queries from the titles of the category domains focuses on the topic of the domain object (e.g. art, science, shopping) instead of the fields to be queried (e.g. book no., author, item, price).
- Query probing is effective for simple, keyword-based query interfaces, but it cannot easily be adopted for multi-attribute structured query interfaces [15].
- Query probing covers only unstructured data sources, so its coverage is limited.
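For illustration only, the probing process described above can be sketched as follows. The sketch assumes a hypothetical search endpoint that accepts a keyword parameter q and reports a textual result count; the probe vocabulary and the regular expression are illustrative assumptions, not part of any surveyed technique.

```python
import re
import urllib.parse
import urllib.request

def probe_count(search_url, term):
    """Send a single probe query and return the reported number of matches.

    Assumes a hypothetical source exposing a GET parameter `q` whose result
    page prints a phrase like "1,234 results" somewhere in the HTML.
    """
    url = f"{search_url}?q={urllib.parse.quote(term)}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="ignore")
    match = re.search(r"([\d,]+)\s+results", html)
    return int(match.group(1).replace(",", "")) if match else 0

def classify_by_probing(search_url, category_probes):
    """Pick the category whose probe terms yield the largest total count."""
    totals = {
        category: sum(probe_count(search_url, term) for term in terms)
        for category, terms in category_probes.items()
    }
    return max(totals, key=totals.get), totals

# Illustrative probe vocabulary (category -> probe terms).
probes = {
    "Health": ["diagnosis", "symptom", "treatment"],
    "Books": ["author", "isbn", "paperback"],
}
# classify_by_probing("http://example.org/search", probes)
```

The sketch makes the limitations above tangible: the probe vocabulary must be designed per domain, and the returned counts are easily inflated by noisy matches.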
2.1.2 Visible Form Features
Obtaining the content representative through visible form features involves dealing with the meta-information present on the surface of the deep web form and site. The meta-information obtained is textual, so the deep web classification problem is directed into the discipline of text/document classification. The meta-information considered includes query schema attributes and the textual content in the neighborhood of forms and in the HTML code of a deep web site. The existing approaches suffer from the following limitations:
- Query schemas represent only a general description of the deep web content. For example, for a car database the attributes of the query schema will be make, model, price, new, color, etc. This information is incomplete for describing the domain of interest unless it is guided by some external common-sense knowledge.
- Query schema matching has several intrinsic drawbacks. When the query schemas of deep web forms are used as content representatives for classifying deep web databases, a similarity match is performed between the input query schema and the query schemas in the training set. Such similarity matching is also called semantic mapping. Semantic mapping between query schemas is not an easy job, as discussed in the context of data integration in [24].
- We cannot apply document/text classification heuristics to the problem of deep web classification when the schema matching approach is used to extract the content representative.
2.2 Training-based Classification
All the techniques proposed so far for deep web classification are training based: they require a training set of pre-classified deep web sources with which the classifier is trained. The trained classifier is then used to classify previously unseen deep web sources into their appropriate category domains. In the case of document classification, a serious observation is that an appropriate set of well-categorized (typically by humans) training documents is often not available [25]. Even if one is available, the set may be too small, or a significant portion of the documents in the training set may not have been classified properly. This creates a serious limitation for the usefulness of traditional document classification methods. The same is true for classifying deep web sources through their visible form features. Looking at the structure of deep web sources, training based techniques may work well for structured deep web content, but they are surely not an effective approach for unstructured content. Furthermore, fine-grained subject-specific classification is not possible through a training based approach.
3. PROPOSED CLASSIFICATION APPROACH
Keeping in view the above limitations, our research work concentrates on two key problems: 1) the extraction of the right content representative from a deep web source and 2) a classification method that does not require a pre-classified training set. For the first problem, we propose to employ the meta-information present in the HTML code of the deep web site as the content representative: it comprehensively defines the aim of a deep web search source, it is very inexpensive to extract, and it best casts the deep web classification problem into the document/text classification domain.
For the second problem, we propose a method for training-less deep web classification in which an ontology is employed not only for ontological reasoning but also for making the approach training-less. According to the proposed approach, a BOW model of the deep web source's content representative is built. Based on the BOW model, the domain of the deep web source is identified using the Wikipedia category hierarchy. The features of the BOW model are then extended through the domain-independent ontology UMBEL, and the extended features are also mapped to Wikipedia to retrieve their category hierarchies. Further, we use a measure of semantic relatedness based on the link structure of Wikipedia categories to find how semantically similar a BOC category is to the categories derived from the BOW model.
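As a concrete illustration of this choice of content representative, the following minimal sketch pulls the Description and Keywords meta tags out of a deep web site's HTML page. The use of BeautifulSoup and the helper name are implementation assumptions, not part of the proposed method itself.

```python
import urllib.request
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

def extract_meta_information(url):
    """Return the content of the <meta name="keywords"> and
    <meta name="description"> tags of a deep web site's entry page."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        soup = BeautifulSoup(resp.read(), "html.parser")
    meta = {"keywords": "", "description": ""}
    for name in meta:
        tag = soup.find("meta", attrs={"name": name})
        if tag and tag.get("content"):
            meta[name] = tag["content"].strip()
    return meta

# Example (any deep web search source with keyword/description meta tags):
# meta = extract_meta_information("http://medlineplus.nlm.nih.gov/medlineplus/healthstatistics.html")
# print(meta["keywords"])
```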
3.1 BOW Model of Deep Web Source
To classify documents into their respective domains, the documents must be represented in a particular form. Traditionally, the bag-of-words (BOW) approach is employed. BOW is a classical method of document representation used in natural language processing, information retrieval and document classification. In this approach the textual content of the document is represented as an unordered pool of words, disregarding syntactic and semantic rules. For example, "a big cat" is treated the same as "big a cat" in the BOW model.
In the context of training based classification, all the words in the training documents are treated by BOW as elements of a classification vector. As the classifier needs a document feature vector, the plain words of the document become its features. These words are either Boolean (yes/no) or weighted (based on some measure, usually term frequency-inverse document frequency). A new document posed to the classifier is judged to be a member of a class based on how similar it is to the training documents.
The BOW approach can also be employed in training-less classification methods. The meta-information, i.e. the Description and Keywords found in the HTML code of a deep web site, serves as the content representative in our approach. Building the BOW model includes all the data pruning steps. We define our BOW model over two heuristics (a small sketch follows below):
- Single-term BOW model
- Comma-separated-terms BOW model
Keywords consist of comma separated terms which carry significant meaning for domain classification. We therefore also use the comma-separated BOW model, because it maps correctly to category domains in Wikipedia.
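The two heuristics can be sketched as follows; the stop-word list, lower-casing and punctuation stripping are assumptions about the data pruning steps, which are not spelled out here.

```python
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "were", "in", "on", "to"}

def single_term_bow(description, keywords):
    """Single-term BOW model: an unordered set of pruned words taken from the
    Description and Keywords meta-information."""
    text = f"{description} {keywords}".lower()
    words = [w.strip(".,;:()") for w in text.replace(",", " ").split()]
    return {w for w in words if w and w not in STOP_WORDS}

def comma_separated_bow(keywords):
    """Comma-separated-terms BOW model: each comma-delimited keyword phrase is
    kept intact, because multi-word phrases map more precisely onto Wikipedia
    categories (e.g. "medical research")."""
    return {phrase.strip().lower() for phrase in keywords.split(",") if phrase.strip()}

# Example with meta-information similar to the health example used in this paper:
desc = "Health statistics and medical research information for patients."
keys = "Health Information, Medical Research, Drugs and Supplements"
print(single_term_bow(desc, keys))
print(comma_separated_bow(keys))
```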
3.2 Domain Identification Using Wikipedia Category Hierarchy
Wikipedia, a Web 2.0 service currently comprising about 14 million multilingual articles, is a large repository of world knowledge that is developed and updated by a large community of users around the globe. Wikipedia was started in 2001 and has since become one of the 10 most popular websites in the world, with accuracy that nearly equals the Encyclopaedia Britannica. Wikipedia contains an extensive network of human-understandable categories that can be leveraged as class labels in the classification process. Articles in Wikipedia are organized in a hierarchy of categories, which enables similar articles to be grouped together. A category is usually associated with a category page in the category namespace. A category page contains text that can be edited like any other page, but when the page is displayed, the last part of what is shown is an automatically generated list of links to all the pages in that category. Every page should belong to at least one category, and a page may often be in several categories. The categories to which a page belongs can be seen at the bottom of the page, and each category can be explicitly accessed to determine its subcategories, the pages it contains, and so on. All the elements of the BOW model are mapped to the Wikipedia article structure, and for each element the corresponding category network is extracted, as shown in Table 1 (a small sketch of this lookup follows the table).
Table 1. Wikipedia categories obtained for the terms "health" and "medical research"
BOW element | Wikipedia categories
Single term: health | Health; Personal life
Comma separated term: medical research | Medical research; Health research; Health sciences
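The category lookups in Table 1 can be reproduced, for example, through the public MediaWiki API; the following minimal sketch is an implementation assumption, since the access path to the category network is not prescribed here.

```python
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"

def wiki_categories(term):
    """Return the (non-hidden) categories of the Wikipedia article whose title
    matches `term`, mirroring the lookups shown in Table 1."""
    params = {
        "action": "query",
        "titles": term,
        "prop": "categories",
        "clshow": "!hidden",
        "cllimit": "50",
        "redirects": "1",
        "format": "json",
    }
    url = f"{API}?{urllib.parse.urlencode(params)}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        data = json.load(resp)
    cats = []
    for page in data.get("query", {}).get("pages", {}).values():
        for cat in page.get("categories", []):
            cats.append(cat["title"].removeprefix("Category:"))
    return cats

# print(wiki_categories("Medical research"))
# expected to include categories such as "Medical research" and "Health research"
```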
3.3 Limitations of BOW Approach
The traditional BOW approach has several limitations, as pointed out by [26]:
- The BOW approach considers only those words that appear in the training corpus and ignores any new words that appear in the testing documents. Because of this, important words in a document may be overlooked, resulting in inefficient document classification.
- Infrequently appearing words are mostly ignored even if they are essential to the classification process.
- The BOW approach inherently lacks semantics; it cannot find a more general or more specific concept for a word. Thus a document containing instance words cannot be classified into its domain class concepts.
- Connections between words in the form of synonyms and polysemes are not addressed by BOW.
- Word sense disambiguation cannot be performed using BOW.
3.4 UMBEL Ontology as Background Knowledge Source
In order to achieve accuracy in the classification process, some sort of intelligence should be introduced into the overall task. To achieve this goal, the traditional BOW approach has been enhanced by adding concepts to it. The enhanced approach is termed the bag-of-concepts (BOC) approach, in which an external lexical knowledge source is employed to obtain concepts for the words.
Several external knowledge sources are available; Yahoo, Dmoz, MeSH and Wikipedia are a few of them. In our technique we employ the domain-independent ontology UMBEL (Upper Mapping and Binding Exchange Layer) to extract semantics, i.e. concepts for the meta-information of the deep web site, and the structure of Wikipedia categories to perform the classification task. We represent the meta-information by a bag of concepts instead of a bag of words, and we use the UMBEL ontology to identify the concepts appearing within the meta-information.
The UMBEL ontology is an extracted subset of OpenCyc that provides the Cyc data in the form of an RDF ontology based on SKOS, OWL and RDFS. The major purpose of UMBEL is to relate web content and data to a standard set of subject concepts and to provide a fixed set of reference points in a global knowledge space. These subject concepts have defined relationships among them and can be employed as binding or attachment points for any web content or data.
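The feature extension step can be sketched as follows under stated assumptions: a local copy of the UMBEL RDF distribution is available in a file umbel.n3 (the file name is an assumption), and, since UMBEL is described in SKOS, subject concepts are located by matching BOW elements against SKOS labels and expanding along SKOS relations using rdflib. The exact UMBEL vocabulary layout may differ.

```python
from rdflib import Graph
from rdflib.namespace import SKOS

def load_umbel(path="umbel.n3"):
    """Parse a local copy of the UMBEL ontology (file name is an assumption)."""
    g = Graph()
    g.parse(path, format="n3")
    return g

def identify_concepts(graph, word):
    """Return labels of UMBEL subject concepts related to a BOW element:
    concepts whose SKOS label matches the word, plus their broader, narrower
    and related concepts (the feature-extension step)."""
    word = word.lower()
    matched = {s for p in (SKOS.prefLabel, SKOS.altLabel)
               for s, _, lbl in graph.triples((None, p, None))
               if str(lbl).lower() == word}
    expanded = set(matched)
    for s in matched:
        for p in (SKOS.broader, SKOS.narrower, SKOS.related):
            expanded.update(o for _, _, o in graph.triples((s, p, None)))
            expanded.update(o for o, _, _ in graph.triples((None, p, s)))
    labels = set()
    for s in expanded:
        for lbl in graph.objects(s, SKOS.prefLabel):
            labels.add(str(lbl))
    return labels

# g = load_umbel()
# print(identify_concepts(g, "health"))
```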
3.5 Calculating Semantic Relatedness among BOW and BOC
The BOC model derives categories that are conceptually related to the BOW model, but not all of them may be relevant. As can be seen in Figure 1, the subject concepts derived for a BOW element contain several concepts that are irrelevant to the domain classification of the deep web source. To filter out these irrelevant concepts, we analyze the link graph of the Wikipedia categories obtained for the BOW and BOC models and define a measure of semantic relatedness among them.
A single Wikipedia category page contains several pieces of information: its subcategories, all the pages in that category and its parent categories. The semantic relatedness among Wikipedia categories can be observed by taking into account the links between them. A subject concept's category is considered relevant to its corresponding BOW element category if it matches the attributes of the category page of the BOW element, or if any link exists between the attributes of the category pages of the BOW element category and the BOC subject concept category.
We observe that this approach yields several conceptually related categories, which results in efficient classification.
Figure 1. UMBEL subject concepts for the BOW element "health"
4. PROBLEM FORMULATION
In this section we formally formulate our research problem. First, the related conceptions of our proposed technique are described; then we present the pseudo-code of our technique.
4.1 Related Conception
Definition 1 (List of Keywords): K_i denotes a set of comma separated keywords obtained from the <meta> keywords tag of the HTML code of a deep web site.

Definition 2 (List of Words): W_i denotes a set of single words obtained from the <meta> keywords and <meta> description tags of the HTML code of a deep web site.

Definition 3 (UMBEL Subject Concepts): S_i denotes the set of subject concepts retrieved for each entry in the set W_i.

Definition 4 (Wikipedia Categories): C_k denotes the set of categories obtained for the comma separated keywords, C_w denotes the set of categories obtained for each single word, and C_s denotes the set of categories obtained for each single subject concept entry in the set S_i for an entry of W_i. That is, each word W_i yields subject concepts S_{i=1..p}, and each subject concept S_i yields categories C_{s=1..l}:

W_i -> S_{i=1..p},   S_i -> C_{s=1..l}
Definition 5 (Semantic relatedness among C_wi and C_si): C_wi{M_a, S_a, P_c, L_c} is a set of attributes where M_a denotes the main article of the category, S_a denotes the subcategories of C_wi, P_c denotes the pages in C_wi, and L_c denotes the parent category hierarchy of C_wi. Similarly, C_si{M_a, S_a, P_c, L_c} is a set of attributes where M_a denotes the main article of the category, S_a denotes the subcategories of C_si, P_c denotes the pages in C_si, and L_c denotes the parent category hierarchy of C_si.
C_sr is the semantic relatedness among C_wi and C_si, which is calculated as:

C_sr = C_si <String match> C_wi{M_a, S_a, P_c, L_c} AND C_si{M_a, S_a, P_c, L_c} <String match> C_wi{M_a, S_a, P_c, L_c}
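A minimal sketch of the relatedness test of Definition 5, assuming the attribute sets M_a, S_a, P_c and L_c have already been fetched for both categories and flattened into lists of strings; interpreting <String match> as a case-insensitive comparison is an assumption.

```python
def category_attributes_match(c_si, c_wi_attrs):
    """First condition of C_sr: the BOC category name itself string-matches one
    of the attribute values of the BOW element's category page."""
    name = c_si.lower()
    return any(name == value.lower() for value in c_wi_attrs)

def attribute_sets_linked(c_si_attrs, c_wi_attrs):
    """Second condition of C_sr: some attribute of the BOC category page
    string-matches some attribute of the BOW category page, i.e. a link exists
    between the two category pages."""
    wi = {value.lower() for value in c_wi_attrs}
    return any(value.lower() in wi for value in c_si_attrs)

def is_semantically_related(c_si, c_si_attrs, c_wi_attrs):
    """Keep a UMBEL-derived category C_si only if both conditions hold."""
    return category_attributes_match(c_si, c_wi_attrs) and \
           attribute_sets_linked(c_si_attrs, c_wi_attrs)

# Example with hand-made attribute sets (M_a, S_a, P_c, L_c flattened into one list):
cwi_attrs = ["Health", "Health sciences", "Personal life", "Public health"]
csi_attrs = ["Health sciences", "Medicine", "Health care"]
print(is_semantically_related("Health", csi_attrs, cwi_attrs))  # True
```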
4.1.1 Tokenizer (Input)
The Tokenizer takes a string as input and performs the following three functions (a small sketch follows the algorithm below):
1) It removes the stop words from the input string, e.g. is, are, were, of, the.
2) It finds the tokens of the input, e.g. "Health Information System" returns three tokens: Health, Information and System.
3) As observed in most search sources, the keywords provided in the keywords <meta> tag are comma separated. The Tokenizer module therefore also returns the comma separated keywords, e.g. "Health Information, Drugs and Supplements, Dictionaries" returns three keywords: Health Information, Drugs and Supplements, and Dictionaries.

Algorithm
Input: <meta> content extracted from the HTML code of a deep web site.
Output: A deep web site mapped to the Wikipedia category hierarchy.
1. W := Tokenizer(Input);
2. K := Tokenizer(Input);
3. S := IdentifyConcepts(W_{i=1..m});
4. Cw := WikiCategory(W_{i=1..m});
5. Ck := WikiCategory(K_{i=1..n});
6. Cs := WikiCategory(S_{i=1..p});
7. Csr := Csi <String match> Cwi{Ma, Sa, Pc, Lc} AND Csi{Ma, Sa, Pc, Lc} <String match> Cwi{Ma, Sa, Pc, Lc};
8. List_of_Wiki_Categories := Cw + Ck + Csr;
9. MapURL(URL, List_of_Wiki_Categories);
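The Tokenizer step might look like the following minimal sketch; the stop-word list is an illustrative assumption.

```python
import re

STOP_WORDS = {"is", "are", "were", "of", "the", "a", "an", "and", "in", "on", "to"}

def tokenizer(meta_content):
    """Tokenizer(Input) as described above: returns (W, K), the single-word
    tokens with stop words removed and the comma separated keyword phrases."""
    # Comma separated keyword phrases, e.g. "Health Information, Dictionaries".
    keywords = [phrase.strip() for phrase in meta_content.split(",") if phrase.strip()]
    # Single-word tokens with stop words removed.
    words = [w for w in re.findall(r"[A-Za-z]+", meta_content)
             if w.lower() not in STOP_WORDS]
    return words, keywords

W, K = tokenizer("Health Information, Drugs and Supplements, Dictionaries")
print(W)  # ['Health', 'Information', 'Drugs', 'Supplements', 'Dictionaries']
print(K)  # ['Health Information', 'Drugs and Supplements', 'Dictionaries']
```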
4.1.2 IdentifyConcepts (W_{i=1..m})
IdentifyConcepts() identifies the subject concepts, as discussed in Figure 1. Subject concepts are a special kind of concept: namely, ones that are concrete, subject-related and non-abstract. Note that in other systems or ontologies, similar constructs may alternatively be called topics, subjects, concepts or perhaps interests [27].
4.1.3 WikiCategory ()
This function maps the corresponding words, keywords and subject concepts to the Wikipedia category hierarchy, as in Table 1.
4.1.4 MapURL ()
The URL of the search source is mapped to the identified list of categories and then stored in a knowledge base. The knowledge base contains the list of URLs of deep web sites mapped to Wikipedia categories.
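Pulling the pieces together, the following is a minimal end-to-end sketch of the algorithm, reusing the helper functions sketched earlier (tokenizer, identify_concepts from the UMBEL sketch, and wiki_categories from the MediaWiki sketch). Representing the knowledge base as a plain dictionary and simplifying step 7 to a parent-category overlap test are implementation assumptions, and the example meta strings in the usage comment are illustrative.

```python
def classify_deep_web_source(url, description, keywords, umbel_graph, knowledge_base):
    """Map the URL of a deep web source to a list of Wikipedia categories,
    following steps 1-9 of the algorithm above."""
    W_desc, _ = tokenizer(description)          # Definition 2: words from both tags
    W_keys, K = tokenizer(keywords)             # Definition 1: comma separated keywords
    W = set(W_desc + W_keys)                    # steps 1-2
    S = {c for w in W for c in identify_concepts(umbel_graph, w)}   # step 3
    Cw = {c for w in W for c in wiki_categories(w)}                 # step 4
    Ck = {c for k in K for c in wiki_categories(k)}                 # step 5
    Cs = {c for s in S for c in wiki_categories(s)}                 # step 6
    # Step 7, simplified: keep a UMBEL-derived category only if it already
    # appears among the BOW categories or shares a parent-category link with
    # them (an approximation of the <String match> test in Definition 5).
    Csr = {c for c in Cs
           if c in Cw or set(wiki_categories("Category:" + c)) & Cw}
    categories = Cw | Ck | Csr                                      # step 8
    knowledge_base[url] = sorted(categories)                        # step 9 (MapURL)
    return knowledge_base[url]

# kb = {}
# classify_deep_web_source(
#     "http://medlineplus.nlm.nih.gov/medlineplus/healthstatistics.html",
#     "Health statistics and health information.",
#     "Health Information, Health Statistics, Medical Research",
#     load_umbel(), kb)
```

Note that each Wikipedia lookup is a network call, so in practice the results would be cached per term; that detail is omitted from the sketch.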
5. EXPERIMENTAL EVALUATION
In this section, we present the results of the experiments using our training-less ontology-based deep web classification method. The experiments were designed as comparisons of the proposed method with selected traditional deep web classification methods. The observations were based on the granularity of the classification process, defined as macro level classification and micro level classification. The experiments show that our technique works well at both granularity levels. In the experimental evaluation described below, our goal is to verify whether our training-less technique of feature extension based on an ontology is effective against the traditional training based BOW approach in both structured and unstructured deep web classification scenarios. We also show how much the Wikipedia category structure and the subject concepts of the UMBEL ontology help in fine-grained classification, and we examine the net effect of our proposed technique against other classification techniques in the literature. Finally, we provide a detailed analysis of the results and comment on the performance of the proposed method.
5.1 Dataset
We tested the algorithm described above on a set of 100 deep web sources depicted in Table 2. We gathered these sites from the currently largest deep web directory, CompletePlanet [28]. As all the sites in the collection are manually classified into categories, the instances in the dataset serve as a gold standard against which to verify our technique. Our dataset falls into four distinct domains and comprises both structured and unstructured sources: Health, Music, Movies and Books, as depicted in Table 2. The description column describes the subject specificity of each domain.
Table 2. Dataset domains and subject-specific subdomains used in the experiment
Domain | Description | Sites
Health | Patient information, health information, health statistics, physician practice web, healthy eating/lifestyle, weight loss, medical research centers, national library of medicine, pharmaceutical companies, internal medicine, health philanthropy, environmental health sciences, health institutes/hospitals/councils, government health departments | 35
Music | Music departments at institutions, folkloric music, music search engines/directories, music business/career, classical music downloads, TV commercial music, Bulgarian music, music research, music reviews, music teachers | 25
Movies | Movie review ratings, celebrity profiles, online movie guides, movie trailer databases, entertainment news/photos | 20
Books | Book stores, immigration books, book reviews, book author profiles, author interviews, books on India, book bargains, books by type | 20
5.2 Category Coverage for Dataset Domains by Wikipedia and UMBEL
Wikipedia offers an extensive network of human-understandable categories that can be leveraged as class labels in the classification process. The Wikipedia category network also ensures granularity in the classification process by classifying domains into subcategories of the root categories. Table 3 lists the root categories in Wikipedia for each of the four domains, the number of their subcategories and the percentage of dataset domain coverage.
Table 3. Wikipedia category coverage for dataset domains
Domain | Wikipedia Root Categories | No. of Subcategories | Dataset Coverage
Health | Health, Personal life | 62 | 99%
Music | Greek loanwords, Performing arts, Entertainment, Music | 113 | 98%
Movies | Film, Art media, Media formats | 85 | 97%
Books | Books, Documents, Paper products | 58 | 100%
Like Wikipedia, UMBEL also provides good coverage of the subject specificity of the dataset domains, as depicted in Table 4.
Table 4. UMBEL category coverage for dataset domains
Domain | UMBEL Subject Concepts | Dataset Coverage
Health | 35 | 90%
Music | 172 | 100%
Movies | 72 | 98%
Books | 26 | 75%
5.3 Performance Evaluation
To evaluate the performance of the classification technique, we used the confusion matrix shown in Table 5. The matrix models the association between the actual classification and the predicted classification and yields three important performance measures, i.e. precision, recall and F-measure, depicted in equations (1), (2) and (3).
Table 5. Confusion matrix for the evaluation metric
Actual \ Predicted | Predicted Positive | Predicted Negative
Actual Positive | True Positive (TP) | False Negative (FN)
Actual Negative | False Positive (FP) | True Negative (TN)
Precision = TP / (TP + FP)   (1)
Recall = TP / (TP + FN)   (2)
F-measure = (2 * Precision * Recall) / (Precision + Recall)   (3)
Precision measures the exactness of a classifier, while recall measures its completeness or sensitivity. F-measure is the harmonic mean of precision and recall; a high F-measure means that both recall and precision are high, and a perfect classification would result in an F-measure equal to 1. At the macro level, TP is the number of deep web sources classified correctly, FP is the number of deep web sources which belong to some other category but were falsely classified into the domain, and FN is the number of deep web sources falsely excluded from the domain.
To evaluate the performance of our classifier at the micro level, we calculate the precision, recall and F-measure over all the self-descriptive categories obtained for a single deep web source. The terms in the above ratios then have a different definition: TP is the number of correct categories obtained for a deep web source, FP is the number of categories falsely obtained for that deep web source, and FN is the number of categories falsely excluded during the classification process. The net effect of micro level classification is obtained by taking the mean of all precisions, recalls and F-measures, as depicted in (4), (5) and (6).
Precision = Σ_{i=1..n} Precision(i) / n   (4)
Recall = Σ_{i=1..n} Recall(i) / n   (5)
F-measure = Σ_{i=1..n} F-measure(i) / n   (6)
We evaluate our approach by taking into consideration the following impact factors:
- the employment of Wikipedia categories;
- feature extension using the UMBEL ontology;
- the calculation of semantic relatedness among the Wikipedia categories retrieved for the BOW and BOC models.
We evaluated the effectiveness of our approach over the four distinct domains. For each domain we show its precision, recall and F-measure in Table 6. At the macro level, we see a broad view of the overall classification process: a deep web source belongs to a particular domain if the subject-specific, self-descriptive categories retrieved for it reflect the theme of that domain.
Table 6. Macro level effectiveness of the classification method
Domain | Precision (%) | Recall (%) | F-measure (%)
Health | 96.5 | 100 | 98.2
Music | 91.3 | 81.9 | 86.3
Movies | 99.7 | 85.7 | 92.2
Books | 96.9 | 80.8 | 88.1
At the micro level, precision, recall and F-measure are determined for each individual deep web source. This depicts how relevant the retrieved self-descriptive categories are to the deep web source. We show the performance of our classification process for both the BOW and the BOC approach in Table 7. We see that the BOC approach significantly improves the classification process.
Table 7. Micro level effectiveness of the classification method
Domain | Precision (%) BOW / BOC | Recall (%) BOW / BOC | F-measure (%) BOW / BOC
Health | 93.8 / 97.8 | 100 / 100 | 96.8 / 98.9
Music | 94.3 / 92.5 | 62.5 / 82.5 | 75.2 / 87.2
Movies | 79.5 / 89.7 | 82.5 / 87.1 | 81 / 88.4
Books | 77.8 / 86.7 | 96.9 / 100 | 86.3 / 92.9
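As a concrete reading of equations (1)-(6), the following minimal sketch computes the macro level measures for one domain and the micro level averages over individual sources; the example counts are illustrative and not taken from the experiments.

```python
def precision_recall_f(tp, fp, fn):
    """Equations (1)-(3)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f

def micro_level_average(per_source_counts):
    """Equations (4)-(6): mean precision, recall and F-measure over the
    per-source (TP, FP, FN) counts of the retrieved categories."""
    scores = [precision_recall_f(tp, fp, fn) for tp, fp, fn in per_source_counts]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))

# Macro level example for one domain: 34 correct, 1 false positive, 0 false negatives.
print(precision_recall_f(34, 1, 0))
# Micro level example: three deep web sources with (TP, FP, FN) category counts.
print(micro_level_average([(8, 1, 2), (5, 0, 1), (10, 2, 0)]))
```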
When comparing our approach with other traditional training based classification approaches, we observe that our proposed method outperforms the other techniques. Our technique covers both structured and unstructured deep web sources, considers both simple and advanced query interfaces, and performs classification over descriptive category domains. Of the two compared techniques, the one proposed by [10] classifies only structured deep web sources, and the one proposed by [8] handles only simple query interfaces. Figure 2 reflects the performance over the Music, Movies and Books domains. The label A depicts the technique proposed in this paper, B the technique proposed by [8], and C the technique proposed by [10]. Figure 3 depicts the trend of the F-measure for the deep web classification techniques, from which we conclude that our technique performs better than the other techniques.
Table 8 shows the categories identified by our prototype TODWEB for a deep web source providing updated information about health statistics. We observe that our prototype derives several conceptually related categories compared to other classification methods, which results in efficient classification.
6. RELATED WORK
In the field of document/text classification, the classification process has been made training-less by employing an ontology of categories [19], [20], [25], [29], [30]. Wikipedia, one of the most visited encyclopedias on the web, offers an extensible hierarchy of categories with which its editors classify articles. The authors in [30] used Wikipedia articles, categories and the hyperlink graph among the articles to discover concepts common to a document. In [25], Wikipedia's content was first converted into an ontology in RDF format; named entities in the input document are then recognized from the ontology to build a thematic graph of entities in the document, and a dominant thematic graph is extracted, based on the largest number of nodes and the highest total of entity weights, to determine the topic of the document. In [19], a simple method of document topic identification was introduced in which the relevance between a document and a Wikipedia category is computed through the relevance between the document and the article titles that belong to that category. The work was improved by [20] by taking into account the articles' content and the hyperlink structure of Wikipedia.
Table 8. List of categories identified for a deep web source about health statistics
Deep web source: http://medlineplus.nlm.nih.gov/medlineplus/healthstatistics.html
Wikipedia categories: Health, Healthcare, Policy, Health economics, Food and drink, Insurance, Health insurance, Therapy, Health policy, Medicine, Health sciences, Personal life, Statistics, Mathematical sciences, Data, Information, Research methods, Scientific method, Evaluation methods, Mathematical and quantitative methods (economics), Biostatistics, Medical specialties, Fields of application of statistics, Medical statistics, Medicine, Statistics.
Figure 2. Comparative analysis over Music, Movies and Books
domains
Figure 3. F-measure trend for deep web classification
techniques
7. CONCLUSIONS
Online databases lie deep in the ocean of the WWW, and their structure restricts crawlers from indexing their contents. These databases contain highly relevant content that satisfies users' information needs. In order to generate knowledge for making accurate and timely decisions, we need to integrate data from these heterogeneous deep web sources. In this work, we proposed a training-less, ontology based method for classifying deep web sources into subject-specific, self-descriptive domains. We employed the structure of Wikipedia to identify the domain of a deep web source using the BOW approach, and then enhanced the BOW model with concepts extracted from a domain-independent ontology, which improved the classification process. The experimental results show that our technique outperforms the other techniques. In future work we will experiment with semantic deep web database selection methods based on a semantic understanding of user queries, built on the prototype of our deep web classification method.
8. REFERENCES
[1] The Deep Web: Surfacing hidden value. Accessible at http://brightplanet.com, 2000.
[2] Madhavan, J., Cohen, S., Dong, X. L., Halevy, A. Y., Jeffery, S. R., Ko, D. and Yu, C. Web-scale data integration: You can afford to pay as you go. In Proceedings of the Conference on Innovative Data Systems Research (CIDR), 2007, 342-350.
[3] Chang, K. C. C., He, B., Li, C., Patel, M. and Zhang, Z. Structured databases on the web: Observations and implications. In Proceedings of the International Conference on Management of Data (ACM SIGMOD), 2004, 61-70.
[4] Madhavan, J., Afanasiev, L., Antova, L. and Halevy, A. Y. Harnessing the Deep Web: Present and future. In Proceedings of the Conference on Innovative Data Systems Research (CIDR), 2009.
[5] Barbosa, L. and Freire, J. Searching for hidden-web databases. In Proceedings of the International Workshop on the Web and Databases (WebDB), 2005, 1-6.
[6] Madhavan, J., Ko, D., Kot, L., Ganapathy, V., Rasmussen, A. and Halevy, A. Y. Google's Deep Web crawl. In Proceedings of the VLDB Endowment (PVLDB), 2008, 1241-1252.
[7] Raghavan, S. and Garcia-Molina, H. Crawling the hidden web. In Proceedings of the International Conference on Very Large Databases (VLDB), 2001, 129-138.
[8] Xian, X., Zhao, P., Fang, W., Xin, J. and Cui, Z. Automatic classification of Deep Web databases with simple query interfaces. In Proceedings of the International Conference on Industrial Mechatronics and Automation (ICIMA), 2009, 85-88.
[9] Ipeirotis, P. G., Gravano, L. and Sahami, M. Automatic classification of text databases through query probing. In ACM SIGMOD Workshop on the Web and Databases (WebDB), 2000, 245-255.
[10] Xu, H., Hau, X., Wang, S. and Hu, Y. A method of Deep Web classification. In Proceedings of the International Conference on Machine Learning and Cybernetics (ICMLC), 2007, 4009-4014.
[11] Nie, T., Shen, D., Yu, G. and Kou, Y. Subject-oriented classification based on scale probing in the Deep Web. In Proceedings of the International Conference on Web-Age Information Management (WAIM), 2008, 224-229.
[12] Lin, P., Du, Y., Tan, X. and Lv, C. Research on automatic classification for Deep Web query interfaces. In International Symposium on Information Processing (ISIP), 2008, 313-317.
[13] Le, H. and Conrad, S. Classifying structured web sources using support vector machine and aggressive feature selection. In Lecture Notes in Business Information Processing, Vol. 45, 2010, 270-282.
[14] Zhao, P., Huang, L., Fang, W. and Cui, Z. Organizing structured Deep Web by clustering query interfaces link graph. In Lecture Notes in Computer Science, Vol. 5139, 2008, 683-690.
[15] Barbosa, L., Freire, J. and Silva, A. Organizing hidden-web databases by clustering visible web documents. In Proceedings of the International Conference on Data Engineering (ICDE), 2007, 326-335.
[16] He, B., Tao, T. and Chang, K. C. C. Organizing structured web sources by query schemas: a clustering approach. In Proceedings of the Conference on Information and Knowledge Management (CIKM), 2004, 22-31.
[17] Su, W., Wang, J. and Lochovsky, F. Automatic hierarchical classification of structured Deep Web databases. In Proceedings of the International Conference on Web Information Systems Engineering (WISE), 2006, 210-221.
[18] Medelyan, O., Milne, D., Legg, C. and Witten, I. Mining meaning from Wikipedia. International Journal of Human-Computer Studies (IJHCS), Vol. 67(9), 2009, 716-754.
[19] Schonhofen, P. Identifying document topics using the Wikipedia category network. In Proceedings of the International Conference on Web Intelligence (IEEE/WIC/ACM), 2006, 456-462.
[20] Huynh, D., Cao, T., Pham, P. and Hoang, T. Using hyperlink texts to improve quality of identifying document topics based on Wikipedia. In Proceedings of the International Conference on Knowledge and Systems Engineering (ICKSE), 2009, 249-254.
[21] Feilmayr, C., Barta, R., Grün, C., Pröll, B. and Werthner, H. Covering the semantic space of tourism: an approach based on modularized ontologies. In Workshop on Context, Information and Ontologies (CIAO, ESWC), 2009.
[22] Nummiaho, A. and Vainikainen, S. Utilizing linked open data sources for automatic generation of semantic metadata and semantic research. In Communications in Computer and Information Science (CCIS), 2010, 78-83.
[23] Halevy, A. Y. Why your data won't mix. ACM Queue, Vol. 3(8), 2005, 50-58.
[24] Noor, U., Rashid, Z. and Rauf, A. A survey of automatic deep web classification techniques. International Journal of Computer Applications (IJCA), Vol. 19(6), 2011, 43-50.
[25] Janik, M. and Kochut, K. Training-less ontology-based text categorization. In Workshop on Exploiting Semantic Annotations in Information Retrieval (ESAIR), 2008, 3-17.
[26] Gabrilovich, E. and Markovitch, S. Feature generation for text categorization using world knowledge. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2005, 1048-1053.
[27] UMBEL: http://www.umbel.org/.
[28] CompletePlanet: http://www.completeplanet.com.
[29] Tiun, S., Abdullah, R. and Kong, T. E. Automatic topic identification using ontology hierarchy. In Proceedings of the International Conference on Computational Linguistics and Intelligent Text Processing (CICLing), 2001, 444-453.
[30] Syed, Z., Finin, T. and Joshi, A. Wikipedia as an ontology for describing documents. In Proceedings of the International Conference on Weblogs and Social Media (AAAI), 2008, 136-144.