Mining Data Records Based On Ontology Evolution For Deep Web

Wen Zang
College of Information Engineering,
Zhongzhou University, Zhengzhou 450044, China
[email protected]
I. INTRODUCTION
The Web contains two parts: the deep Web and the surface Web. The deep Web [1] is qualitatively different from the surface Web: deep Web sources store their content in searchable databases that produce results only dynamically, in response to a direct request, so standard search engines never find it. BrightPlanet quantified the size and relevancy of the deep Web in a study based on data collected between March 13 and 30, 2000: public information on the deep Web was 400 to 550 times larger than the commonly defined World Wide Web. As data sources grow rapidly, it is necessary to integrate them, and in the integration process the first important task is to mine the useful information.

Figure 1. A sample query interface (a) and its query results (b) [images not reproduced]

From Figure 1 we can see that (a) is a query interface. After the keyword "java program" is entered, the server returns the query results (b). The main task of this paper is to extract each data item in (b) and to integrate the results returned by various data sources into one table.
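As a concrete illustration of the integration task (not part of the original paper), records returned by two sources with different field names can be merged into one uniform table; the two sources, their field names, and the synonym map below are hypothetical:

```python
# Sketch: merging query results from two hypothetical deep-web sources
# into one uniform table. Field names and the synonym map are illustrative.

SYNONYMS = {"editor": "author", "name": "title"}  # source field -> canonical field

def normalize(record):
    """Rename each field of a raw record to its canonical attribute name."""
    return {SYNONYMS.get(k.lower(), k.lower()): v for k, v in record.items()}

def integrate(*sources):
    """Normalize every record and lay them out under one shared schema."""
    rows = [normalize(r) for src in sources for r in src]
    columns = sorted({k for row in rows for k in row})
    return [[row.get(c, "") for c in columns] for row in rows], columns

source_a = [{"Title": "Java Programming", "Author": "A. Smith"}]
source_b = [{"Name": "Thinking in Java", "Editor": "B. Eckel"}]
table, cols = integrate(source_a, source_b)
print(cols)   # ['author', 'title']
print(table)  # one row per record, aligned to the shared columns
```

The synonym map plays the role that the associate attributes (FA) play in the ontology described later in the paper.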
978-1-4244-6349-7/10/$26.00 © 2010 IEEE
There are three kinds of extraction approaches: the manual approach, wrapper induction, and automatic extraction.
In the manual approach, a programmer observes a web page and its source code, finds patterns in the page, and writes a program to identify and extract the data items. This approach is not suitable for a large number of pages.
In wrapper induction [2], a set of extraction rules is learned from a set of manually labeled pages or data records. These rules are then used to extract data from similar pages.
In automatic extraction [3], data items play different roles in web pages, and the problem is addressed at various levels (semantic blocks, sections, and data items); several approaches have been proposed to identify the mapping between data items having the same role. RoadRunner [4] works by comparing the HTML structure of two sample pages belonging to the same page class, generating as a result a schema for the data contained in the pages. With this schema, similar pages can be extracted; but since most pages are heterogeneous, this method is time-consuming.
However, there is still room for improvement: (1) these methods cannot process result pages with zero or few query results, and (2) they assign no labels to the extracted data. This paper proposes ontology evolution to solve these problems.

III. ONTOLOGY DESCRIPTION

The logical structure of an ontology can be considered a two-tuple O = (C, A), where each ontology describes a domain. C represents a concept, drawn as a rectangle, such as book or airplane. A represents the attributes of C; a concept can have multiple attributes.
Generally, an attribute model can be described as a seven-tuple A = {N, V, FA, DT, R, Cons, Count}.
N: the name of the attribute, obtained from sample query interfaces and query-result examples, such as Title, Author, or ISBN.
V: the instances of A, which represent the values of A and appear in the query result pages used for training.
FA: the associate attributes of A, i.e., its synonyms. For example, if attribute "Author" is a synonym of attribute "editor", then "editor" becomes the associate attribute when "Author" is the attribute of A.
DT: the data type of A, which can be acquired by analyzing the V values of A. The most common data types are int, real, price, datetime, and string.
R: the relationships between concepts and individual attributes, such as inheritance and part-whole relationships.
Cons: the restrictions on A; for example, the value of attribute "age" should never be less than zero.
Count: the frequency of A's appearance in the training text, typically used to determine the importance of that attribute.
The construction of the ontology in this paper adopts the technology in [10]. In order to produce a relatively complete ontology, the information about query interfaces and query result pages has also been taken into account.

IV. DATA RECORDS IDENTIFICATION

To identify the data regions, the system exploits the following observations. First, the mining area contains much information about the ontology, such as labels and instances. In addition, from a visual perspective, some layout rules hold: for example, the distance between records is much bigger than the distance inside a record.
The label of a concept and its instances are often located in close proximity and spatially aligned on the data page. This placement regularity can be exploited to associate the label of a concept with its instances.

A. Maximum Entropy Model

The maximum entropy model has been successfully applied to many problems, such as part-of-speech tagging [5], named entity recognition [6], and so on. Intuitively, the principle of maximum entropy [7] is simple: we model all that is known and assume nothing about that which is unknown. In other words, given a set of constraints, we choose a model that is consistent with all the constraints but is otherwise as uniform as possible.
The goal of a maximum entropy model is to construct a stochastic model that accurately represents the behavior of a random process [8]. Such a model estimates the conditional probability that, given a context x, the process will output y. We denote a single observation by y, a random variable that takes values in an alphabet Y. To study the random process, we observe its behavior for a while, collecting training samples (x1, y1), (x2, y2), ..., (xn, yn). As a particular pair (x, y) may occur in some or all of the samples, the training samples are summarized in terms of their empirical probability

\tilde{p}(x, y) \equiv c(x, y) / \sum_{x, y} c(x, y),    (1)

in which c(x, y) is the number of times (x, y) appears in the collection.
Given a set of features, which determines the statistics that we feel are important in modeling the process, we would like to build our model in accordance with these statistics. Generally, a feature fi can be defined as

f_i(x, y) = \begin{cases} 1 & \text{if the feature is expressed in case } (x, y) \\ 0 & \text{otherwise} \end{cases}    (2)

The expected value of feature fi with respect to the empirical distribution \tilde{p}(x, y) is

\tilde{p}(f_i) = \sum_{x, y} \tilde{p}(x, y) f_i(x, y),    (3)

while the expected value of feature fi with respect to the model p(y | x) is
p(f_i) = \sum_{x, y} \tilde{p}(x) p(y \mid x) f_i(x, y)    (4)

The context information in our experiments includes the following: whether the raw data string is the name, alias, or value of an attribute in the ontology; the possible attributes of the previous two and subsequent two raw data strings; the possible data-type candidates of the raw data string; and so on. Each of these can be encoded as a feature.

B. Building the HTML Tag Tree

Using DOM (Document Object Model) technology, we can describe the response page as a tag tree. The system corrects the documents before creating the tree; after a series of transformations, the converted tree strictly meets the W3C standards. The tree has three parts: HTML tags, attributes, and text.

C. Instance-Ontology Correlation

We assume that an instance is a data record if its correlation with the ontology O is larger than a threshold Tc. To allow the threshold Tc to adapt to the domain, it is empirically set to half of the smallest correlation with the ontology O over all data records in the training Web sites.
The instance-ontology correlation is mainly used to evaluate an instance set {d1, d2, ..., dn} and figure out which instances are query result records. For each di in D, its weight si is defined as

s_i = \begin{cases} p_{nj} & \text{if } d_i \text{ is } N_j \text{ or } FA_j \text{ of } A_j \\ p_j & \text{otherwise,} \end{cases}    (5)

that is, N_j is the name of attribute A_j and FA_j is the associate attribute of A_j.
The procedure in Figure 2 is used to calculate the maximum correlation of a child tree. If the correlation is greater than Tc, the system concludes that there is only one data record contained in the area, and that subtree is the exact data record. Otherwise, the system continues searching other subtrees following Figure 2.

DataRecordsIdentification(T, O)
1.  while (T is not a leaf node)
2.      Ti ← the child of T that has the largest correlation with O
3.      if (Corr(Ti, O) > Corr(T, O))
4.          T ← Ti
5.      else break
6.  endwhile
7.  while (true)
8.      Tl ← the immediate left sibling of T
9.      if (Corr(Tl + T, O) > Corr(T, O))
10.         T ← Tl + T
11.     else break
12. endwhile
13. while (true)
14.     Tr ← the immediate right sibling of T
15.     if (Corr(Tr + T, O) > Corr(T, O))
16.         T ← Tr + T
17.     else break
18. endwhile
19. return T

Figure 2. Data records identification
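The procedure of Figure 2 can be sketched in Python as follows. The Node class and the correlation function are illustrative stand-ins: the paper derives Corr(T, O) from the trained ontology, while here a toy density-style score is supplied by the caller.

```python
# Structural sketch of the DataRecordsIdentification procedure (Figure 2).
class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.parent = None
        for c in self.children:
            c.parent = self

def identify_data_record(root, corr):
    """corr(nodes) -> correlation of the region covered by `nodes` with O."""
    # Lines 1-6: descend toward the child subtree that correlates best.
    t = root
    while t.children:
        best = max(t.children, key=lambda c: corr([c]))
        if corr([best]) > corr([t]):
            t = best
        else:
            break
    # Lines 7-18: grow the region over the left and right siblings of T.
    region = [t]
    siblings = t.parent.children if t.parent else [t]
    lo = hi = siblings.index(t)
    while lo > 0 and corr([siblings[lo - 1]] + region) > corr(region):
        lo -= 1
        region.insert(0, siblings[lo])
    while hi < len(siblings) - 1 and corr(region + [siblings[hi + 1]]) > corr(region):
        hi += 1
        region.append(siblings[hi])
    return region  # line 19: the node(s) forming the data record

# Toy usage: the score rewards covering ontology labels while penalizing
# unrelated content (a precision-times-recall product, an assumption).
ontology = {"title", "author", "price"}

def labels(nodes):
    out, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        out.add(n.label)
        stack.extend(n.children)
    return out

def corr(nodes):
    ls = labels(nodes)
    m = len(ls & ontology)
    return (m / len(ontology)) * (m / len(ls))

record = Node("tr", [Node("title"), Node("author"), Node("price")])
root = Node("html", [Node("header"), Node("table", [record])])
found = identify_data_record(root, corr)  # selects the "tr" subtree
```

With a score that penalizes unrelated content, the descent stops exactly at the subtree that covers the ontology labels and nothing more, which mirrors the intent of lines 1-6.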
2010 2nd International Conference on Computer Engineering and Technology
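A minimal sketch of the instance-ontology correlation test of Eq. (5) follows. The weights p_nj and p_j, the matching rule, and the averaging used to aggregate per-item weights into a correlation are assumptions for illustration; the paper leaves these to the trained ontology.

```python
# Sketch of the instance-ontology correlation test (Section IV-C, Eq. 5).
# The weights P_NJ / P_J and the aggregation into a correlation are assumed.

P_NJ, P_J = 1.0, 0.4   # assumed: name/synonym matches count more than value matches

def item_weight(item, attr):
    """Eq. (5): s_i = p_nj if the item is the attribute's name N_j or an
    associate name FA_j, and p_j otherwise (e.g. a value match)."""
    names = {attr["name"].lower()} | {a.lower() for a in attr["associates"]}
    return P_NJ if item.lower() in names else P_J

def matches(item, attr):
    pool = ({attr["name"].lower()}
            | {a.lower() for a in attr["associates"]}
            | {v.lower() for v in attr["values"]})
    return item.lower() in pool

def correlation(instance, ontology):
    """Average weight of the instance's strings that match the ontology."""
    scores = []
    for item in instance:
        for attr in ontology:
            if matches(item, attr):
                scores.append(item_weight(item, attr))
                break
    return sum(scores) / len(instance) if instance else 0.0

book_ontology = [
    {"name": "Author", "associates": ["Editor"], "values": ["B. Eckel"]},
    {"name": "Title", "associates": [], "values": ["Thinking in Java"]},
]
instance = ["Editor", "Thinking in Java", "out of stock"]
Tc = 0.4   # assumed threshold; the paper sets Tc from training records
score = correlation(instance, book_ontology)
print(score > Tc)  # the instance is accepted as a data record
```

An instance with many label/synonym matches scores above Tc and is accepted, while an instance of unrelated strings scores near zero and is rejected.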
VI. EMPIRICAL EVALUATIONS

The UIUC Web integration repository [9] was chosen for the query interfaces of the deep Web. This dataset collects the original query interfaces of 447 deep Web sources from 8 representative domains: in the Travel group, Airfares, Hotels, and Car Rentals; in the Entertainment group, Books, Movies, and Music Records; in the Living group, Jobs and Automobiles. We pick three domains among them: Airfares, Books, and Automobiles.
Precision and recall are the standard measures for examining data extraction. Precision is the ratio of the number of data chunks extracted correctly by this approach to the number of all data chunks obtained from the query; recall is the ratio of the number of data chunks extracted correctly by this approach to the number of data chunks that would be extracted correctly if the manual method were used.

TABLE 1: DATA EXTRACTION

Domains      Precision   Recall
Automobile   96.7%       95.2%
Book         95.4%       92.3%
Airfare      100%        98.7%
Average      97.4%       95.4%

Experimental results are shown in Table 1. All three domains obtain good results, especially the airfare domain, whose precision is as high as 100% because of its small number of attributes and the simple relationships between them. The data extraction step attempts to locate the proper attributes for extraction from the union of attribute values of new result pages. From the values in the table, it is clear that the quality of extraction in the automobile, book, and airfare domains is satisfactory, as precision and recall are above 92%. With increasing training text, the base ontology will gradually evolve into a more sufficient ontology, and the extraction from result pages will also reach higher precision and recall.

VII. CONCLUSION AND FUTURE WORK

This paper proposed an ontology-evolution-based method to mine data areas. The method can handle the case where only one data record is contained in a website, and it can also identify the meaning of data without labels. With the evolution of the ontology, the extraction of data records becomes more accurate.
In future work, there are still many things to focus on. Because of the heterogeneity of web pages, the data extracted from multiple data sources differ in format and order. For further data mining, we plan to integrate these data in a uniform format.

REFERENCES

[1] Michael K. Bergman, "The Deep Web: Surfacing Hidden Value," The Journal of Electronic Publishing, 7(1), August 2001.
[2] Yanhong Zhai and Bing Liu, "Extracting Web Data Using Instance-Based Learning," WISE Conference, 2005.
[3] D. Hu and X. Meng, "Automatic Data Extraction from Data-Rich Web Pages," The 10th Database Systems for Advanced Applications (DASFAA), Beijing, 2005.
[4] V. Crescenzi, G. Mecca, and P. Merialdo, "RoadRunner: Towards automatic data extraction from large Web sites," Proceedings of the 26th International Conference on Very Large Data Bases, Italy, pp. 109-118, 2001.
[5] A. Ratnaparkhi, "A maximum entropy model for part-of-speech tagging," Proceedings of the 1st Conference on Empirical Methods in Natural Language Processing, pp. 133-141, 1996.
[6] A. Borthwick, "A maximum entropy approach to named entity recognition," Ph.D. thesis, Computer Science Department, New York University, 1999.
[7] A. L. Berger, V. J. Della Pietra, and S. A. Della Pietra, "A Maximum Entropy Approach to Natural Language Processing," Computational Linguistics, 22(1), pp. 39-71, 1996.
[8] A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra, "A maximum entropy approach to natural language processing," Computational Linguistics, 22(1), pp. 39-71, 1996.
[9] N. Kushmerick, "Wrapper induction: Efficiency and expressiveness," Artificial Intelligence, 118, 2000.
[10] Kerui Chen, Wanli Zuo, Fan Zhang, Fengling He, and Tao Peng, "Automatic Generation of Domain-specific Ontology from Deep Web," Journal of Information and Computational Science, unpublished.