Deep Web Content Mining: Shohreh Ajoudanian, and Mohammad Davarpanah Jazi
=
18. if J = 1 then
19.    return c_i and c_(i+1) are grouping attributes
20. if J = 0 then
21.    return c_i and c_(i+1) are synonym attributes
22. End
World Academy of Science, Engineering and Technology 49 2009
QI5: www.bookplace.com = {title search, author search,
keyword search, ISBN search, publisher search}
As a second input we send the extracted attributes of the
query interfaces that must be matched. These query interfaces
are www.amazon.com and www.bookery.com. The attributes
extracted from them are as follows.
T1: www.amazon.com = {author, title, subject, ISBN,
publisher, reader age, language}
T2: www.bookery.com = {Last name, First name, title, other
keywords, category}
As stated in Section IV.A, to reduce the cost of the
correlation mining algorithm, attributes that are identical
across query interfaces are recognized and reported as
correlated attributes before the algorithm is applied to the
extracted information. These attributes are removed from the
list of attributes sent to the algorithm. In this example the
title attribute is used in both query forms, so we do not place
it in the array's columns. The resulting array is shown in
Table II. The other attributes of these two query interfaces
form the array's columns.
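The pre-filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name split_shared and the case-insensitive comparison are assumptions, and only the attribute lists T1 and T2 come from the text.

```python
from typing import List, Set, Tuple


def split_shared(t1: List[str], t2: List[str]) -> Tuple[Set[str], List[str], List[str]]:
    """Separate attributes that appear in both interfaces from the rest.

    Names are compared case-insensitively (an assumption made for this sketch).
    """
    norm1 = {a.strip().lower() for a in t1}
    norm2 = {a.strip().lower() for a in t2}
    shared = norm1 & norm2
    rest1 = [a for a in t1 if a.strip().lower() not in shared]
    rest2 = [a for a in t2 if a.strip().lower() not in shared]
    return shared, rest1, rest2


# Attribute lists T1 and T2 as given in the text.
T1 = ["author", "title", "subject", "ISBN", "publisher", "reader age", "language"]
T2 = ["Last name", "First name", "title", "other keywords", "category"]

shared, rest1, rest2 = split_shared(T1, T2)
print(shared)  # {'title'}
print(rest1)   # attributes of T1 that still go to the correlation mining step
print(rest2)   # attributes of T2 that still go to the correlation mining step
```

Only the attributes left in rest1 and rest2 would then become columns of the array, which is consistent with title being absent from Table II.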
After that, each column is compared with the other columns
and, according to the results, the status of each attribute pair
is declared. For example, for the first name and last name
attributes the Jaccard measure returns 1, which shows that they
are grouping attributes and commonly occur together in a query
interface. For the author and first name attributes it returns
0, which means that they are synonym attributes and rarely occur
together in a query interface.
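The column comparison can be checked on the data of Table II with a short sketch. It assumes the usual set-based definition of the Jaccard measure (interfaces containing both attributes divided by interfaces containing at least one of them). The names TABLE_II, jaccard and classify are illustrative, and the "undecided" label for intermediate values is an assumption, since the text only describes the cases J = 1 and J = 0.

```python
from typing import Dict, List

# Binary occurrence columns copied from Table II (rows QI1..QI5).
TABLE_II: Dict[str, List[int]] = {
    "author":         [1, 1, 1, 0, 1],
    "subject":        [1, 0, 1, 0, 0],
    "ISBN":           [1, 1, 1, 1, 1],
    "publisher":      [1, 0, 1, 0, 1],
    "reader age":     [1, 0, 0, 0, 0],
    "language":       [1, 0, 0, 0, 0],
    "last name":      [0, 0, 0, 1, 0],
    "first name":     [0, 0, 0, 1, 0],
    "other keywords": [0, 0, 1, 1, 1],
    "category":       [0, 1, 0, 1, 0],
}


def jaccard(a: List[int], b: List[int]) -> float:
    """Interfaces containing both attributes / interfaces containing either."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


def classify(a: List[int], b: List[int]) -> str:
    """Apply the J = 1 / J = 0 rule; other values are left undecided here."""
    j = jaccard(a, b)
    if j == 1:
        return "grouping"
    if j == 0:
        return "synonym"
    return "undecided"


print(classify(TABLE_II["last name"], TABLE_II["first name"]))  # grouping
print(classify(TABLE_II["author"], TABLE_II["first name"]))     # synonym
print(classify(TABLE_II["subject"], TABLE_II["first name"]))    # synonym
```

The last call shows that subject is classified as a synonym of first name on this data as well, just as author is.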
This approach returns good semantic matchings between
attributes in online databases, but it also produces some false
results. For instance, in the above example from the book
domain, in addition to the true result author = {First name,
Last name}, we also encounter the false result subject =
{First name, Last name}. We call this situation a conflict. In
previous work, the distribution relation has been used to
resolve this problem.
For example, both the author and subject attributes are
matched with {first name, last name}, so subject and author
would have to be equivalent. But the algorithm does not show
this. Thus one of these matchings is wrong and must be removed.
We do this by means of a function that selects, among the
conflicting matchings, the one with the higher Jaccard measure
result.
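A selection function of this kind could look like the following sketch. The text does not specify how the score of each matching is obtained, so the function takes scores as input, and the numeric values below are hypothetical placeholders used only to exercise the code. The names Matching and resolve_conflicts are also assumptions.

```python
from typing import Dict, List, Tuple

# (attribute, matched group, score); how the score is computed is left open.
Matching = Tuple[str, Tuple[str, ...], float]


def resolve_conflicts(matchings: List[Matching]) -> List[Matching]:
    """Keep, for each matched group, only the matching with the highest score."""
    best: Dict[Tuple[str, ...], Matching] = {}
    for m in matchings:
        _, group, score = m
        if group not in best or score > best[group][2]:
            best[group] = m
    return list(best.values())


candidates: List[Matching] = [
    ("author", ("first name", "last name"), 0.9),   # hypothetical score
    ("subject", ("first name", "last name"), 0.4),  # hypothetical score
]

kept = resolve_conflicts(candidates)
print(kept)  # only the author matching remains for {first name, last name}
```

When two conflicting matchings have the same score, this sketch keeps the first one it sees; a real system would need an explicit tie-breaking rule.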
TABLE II
A SAMPLE ARRAY WITH INPUT INFORMATION
QI   author  Subject  ISBN  publisher  Reader age  Language  Last name  First name  Other keywords  category
QI1  1       1        1     1          1           1         0          0           0               0
QI2  1       0        1     0          0           0         0          0           0               1
QI3  1       1        1     1          0           0         0          0           1               0
QI4  0       0        1     0          0           0         1          1           1               1
QI5  1       0        1     1          0           0         0          0           1               0
V. RELATED WORK
Although the correlation mining technique used in this paper
finds complex matchings between attributes faster and more
accurately than previous work based on grammatical methods, the
algorithm can still be improved. The heart of the algorithm is
its measure: with a more suitable measure, the results of the
algorithm tend to be better.
In this paper we use the Jaccard measure in the algorithm.
This measure is able to find correlations between attributes
accurately. In [1] a new measure is introduced that finds
correlated attributes well, but it needs a threshold to find
matching attributes, and defining such a threshold accurately
is usually very difficult. The Jaccard measure, as used here,
does not require a threshold. In future work, a more accurate
measure could lead to a better algorithm.
To resolve conflicts between 1:1 attribute matchings, we
can use test samples [7]. For example, when the algorithm
reports subject = author for two query interfaces, we can check
what the interfaces return when a user submits the same query
through each attribute. If the results are equal, the attributes
are matched; otherwise they are not equivalent and the matching
must be removed.
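The test-sample check can be sketched as below. Real query interfaces would be probed over the web; here two in-memory stand-ins are used instead, and the records, probe values and the names make_source and same_answers are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Set

# A source maps (attribute, value) to the set of matching record ids.
Source = Callable[[str, str], Set[str]]


def make_source(records: List[Dict[str, str]]) -> Source:
    """Build an in-memory stand-in for a query interface (illustration only)."""
    def query(attr: str, value: str) -> Set[str]:
        return {r["id"] for r in records if r.get(attr) == value}
    return query


def same_answers(src_a: Source, attr_a: str,
                 src_b: Source, attr_b: str,
                 probes: Iterable[str]) -> bool:
    """True if both attributes return equal results for every probe value."""
    return all(src_a(attr_a, v) == src_b(attr_b, v) for v in probes)


# Hypothetical records shared by both stand-in sources.
BOOKS = [
    {"id": "b1", "author": "Smith", "subject": "databases"},
    {"id": "b2", "author": "Jones", "subject": "networks"},
]
source_a = make_source(BOOKS)
source_b = make_source(BOOKS)
probes = ["Smith", "networks"]

print(same_answers(source_a, "author", source_b, "author", probes))   # True
print(same_answers(source_a, "subject", source_b, "author", probes))  # False
```

In this sketch, a matching whose attributes give different answers for any probe is rejected. Against real sources, a looser comparison than strict equality might be needed.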
VI. CONCLUSION
In this paper a system that extracts and matches information
in the deep web is presented. The system matches attributes in
online databases with a new correlation mining approach and
works automatically in two essential steps: in the first step it
extracts information from query interfaces, and in the second
step it matches them. In general it performs deep web content
mining and complex matching between attributes extracted from
query interfaces. It uses a web content mining approach to
extract information from web-based databases. Then, to cluster
web pages into subject domains, it uses a clustering technique
with a heuristic function for extracting attributes. Finally, a
correlation mining algorithm matches the extracted attributes
within a specific domain. We use the Jaccard measure in this
algorithm to find grouping and synonym attributes faster and
more accurately.
REFERENCES
[1] Bin He, Kevin Chen-Chuan Chang; Automatic complex schema
matching across web query interfaces: A correlation mining
approach; ACM Transactions on Database Systems; Vol. 31; No. 1;
Pages 1-45; March 2006.
[2] Michael K. Bergman; The Deep Web: Surfacing Hidden Value;
www.BrightPlanet.com; Pages 1-5; 2001.
[3] Kevin Chen-Chuan Chang; Toward Large Scale Integration: Building a
Metaquerier over databases on the web; VLDB Journal; 2005.
[4] Zhen Zhang; Light-weight Domain-based Form Assistant: Querying
web databases on the fly; 31st VLDB Conference; Trondheim,
Norway; 2005.
[5] M. A. Hearst and J. O. Pedersen; Reexamining the cluster hypothesis:
Scatter/gather on retrieval results; In Proceedings of SIGIR;
Pages 76-84; 1996.
[6] O. Zamir and O. Etzioni; Web document clustering: a feasibility
demonstration; In Proceedings of SIGIR; 1998.
[7] Sh. Ajoudanian, M. Davarpanah Jazi, and M. Saraee; Discovering
Knowledge from Deep Web Databases using Correlation Mining
Approach; IDMC Conference; Iran; 2007.
[8] Bin He, Kevin Chen-Chuan Chang; Statistical schema matching across
web query interfaces; In SIGMOD Conference; 2003.
[9] E. Rahm, P. A. Bernstein; A survey of approaches to automatic schema
matching; VLDB Journal; No. 10; Pages 334-350; 2001.
[10] Agrawal R., Imielinski T., Swami A. N.; Mining association rules
between sets of items in large databases; In SIGMOD Conference;
1993.
[11] Y-K Lee, W-Y Kim, Y. D. Cai; Efficient mining of correlated
patterns; In SIGMOD Conference; 2003.
[12] S. Brin, R. Motwani, C. Silverstein; Beyond market baskets:
generalizing association rules to correlations; In SIGMOD
Conference; 1997.