Web Search Engine Based On DNS: Wang Liang, Guo Yi-Ping, Fang Ming
Corresponding author: Wang Liang, E-mail: [email protected], Phone: +86-27-87542739, Fax: +86-27-87554415
and again. The coverage and update interval cannot be guaranteed in such a system. A centralized architecture is clearly not appropriate for managing the distributed WWW, and this unsuitable architecture is the root of most bottleneck problems in current search engines. A better solution must therefore adopt a completely different architecture.
Adopting the experience of DNS may be a good choice. The distributed hierarchical architecture of DNS could also provide an efficient framework for a new web search engine. Furthermore, since DNS already indexes the name of each site, could a similar hierarchy index all the web pages within those sites? From this question, the idea of "search engine + DNS" emerged.
2.2. Basic Architecture
Following the basic idea of the new search engine based on DNS, its architecture is the same as that of DNS, as shown in Fig. 1.
There are three layers in this system. The third layer consists of third-level domains, each typically corresponding to an organization such as a university. The second layer consists of second-level domains, the sub-networks of a country. The top layer corresponds to individual countries. The three layers correspond strictly to the levels of DNS.
[Fig. 1 shows a tree: a root node; a first layer of country domains (uk (England), ru (Russia), cn (China), fr (France), ...); a second layer of second-level domains (gov.cn, edu.cn (CERNET), com.cn, ...); and a third layer of organization domains (hust.edu.cn, pku.edu.cn, tsinghua.edu.cn, ...).]
Fig. 1. Architecture of the web page search engine based on DNS
In this architecture, we could simply download all the web pages at the bottom layer and send them to the servers in the higher layers. Because all page extraction is done by the many servers in the bottom layer, each of which typically serves a local network, the update interval of the whole system can be reduced to a reasonable level, solving the recency problem. With this method, however, the databases in the top layer would become extremely large, and we might have to adopt distributed computing or other complex technologies to build such a system. In short, building a database system that mirrors the whole Internet is an almost impossible mission, so we must find a practical way to realize the basic idea of this system.
3. Technologies related to search engines
In this section, we give a high-level overview of technologies related to search engines. First, we briefly introduce basic search engine technology. Then, as the key technology of search engines, two kinds of ranking algorithms are presented. A web search engine is just one kind of information retrieval system, so three different information retrieval architectures are introduced last. All these technologies are employed in the design of the new system.
3.1. Basic search engine technologies
Most practical and commercially operated Internet search engines are based on a centralized architecture that relies on a set of key components, namely crawler, indexer, and searcher, whose definitions are introduced as follows [5].
- Crawler. A crawler is a module that aggregates data from the World Wide Web in order to make it searchable. Several heuristics and algorithms exist for crawling; most of them are based on following links.
- Indexer. A module that takes a collection of documents or data and builds a searchable index from them. Common practices are inverted files, vector spaces, suffix structures, and hybrids of these.
- Searcher. The searcher works on the output files from the indexer. It accepts user queries, runs them over the index, and returns the computed search results to the issuer.
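The indexer/searcher pair above can be illustrated with a minimal sketch (all names here are our own illustration, not part of any real engine): an inverted file maps each term to the set of documents containing it, and a conjunctive query is answered by intersecting those sets.

```python
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: text}. Returns an inverted index {term: set(doc_ids)}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents containing every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

A real searcher would also rank the intersected results; ranking is the subject of the next subsection.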
3.2. Two kinds of ranking technologies
Now there are two main page ranking technologies.
- Text information retrieval technology [6]. Web search technology is derived from text information retrieval, which employs the so-called TF*IDF search and ranking algorithm to assign relevance scores to documents (the term "documents" here refers to text files in general, which includes WWW pages). The big difference is that a web search engine can exploit the rich tags in web documents, which give hints about their content. According to a specific strategy, different tags are given different weights so that search quality can be improved. This is a basic technology, used by many early search engines such as Excite and Infoseek.
- Hyperlink analysis technology [7]. Its basic idea is that the ranking score of a page is determined by the number of links on other sites that point to that page. There are two representative algorithms. Based on a simulation of users' behavior as they surf the web, Larry Page proposed the PageRank model [8], an algorithm that computes a weight for every web page. In IBM's "Clever" system, a research group demonstrated the HITS model [9], which automatically locates two types of pages: authorities and hubs. The first type provides the best sources of information on a given topic, and the second points to many valuable authorities. This technology gives higher weight to pages that many other pages link to; newly published pages on the web will be ignored.
If we design a search engine for a small scope of the Internet, such as a university, text information retrieval technology is sufficient. Hyperlink analysis technology is more effective at large scale, such as a whole country.
3.3. Three kinds of information retrieval systems
According to their architecture, there are three kinds of information retrieval systems, introduced as follows.
- Centralized search system. This system has its own data collection mechanism, and all the data are stored and indexed in a conventional database system. Although many web search engines download web pages and provide service through thousands of servers, they all belong to this kind by basic architecture.
- Metadata harvest search system. Metadata is normally much smaller than the data itself. So when we cannot store all the data in one database system, or need to integrate different kinds of information resources (video, PDF, web pages) in one system, we can harvest the metadata from the different sub-databases and build a union metadata database. This database then provides search service just like a conventional database system. Some library database systems based on OAI [10] adopt this method.
- Distributed search system. When the data source is so large that even the metadata cannot be efficiently managed in one database system, we can choose a distributed system. A distributed information retrieval system has no record database of its own; it only indexes the search interfaces of the sub-database systems. When it receives a user query, the main system obtains the records from the sub-databases on the fly through their search interfaces. The limitation of this design is that the number of sub-databases must stay small, otherwise search speed cannot be guaranteed. A famous example is the InfoBus system in the Stanford digital library project [11].
Two main factors determine the architecture of an information retrieval system: the size and the diversity of the data source. Normally, as size and diversity increase, we select the centralized system, the metadata harvest system, and the distributed search system in turn.
4. Realization of web search engine based on DNS
According to the different characteristics of the three layers, we adopt different architectures and ranking algorithms to build the new system. The three layers' search systems are introduced as follows.
4.1 Third layer: centralized search system
A node of this layer always corresponds to an organization. Normally, a centralized search system can efficiently manage all of its web pages, so a node in this layer is just a conventional web search engine; the only difference is that its search scope is limited to one third-level domain, such as a university. This search system comprises three parts, crawler, indexer, and searcher, which are introduced in turn.
4.1.1. Crawler
The crawler of this layer downloads all the pages in one third-level domain. For example, "hust.edu.cn" is the domain name of our university, so servers under the domain "hust.edu.cn", such as the Computer Science department's server "cs.hust.edu.cn", can easily be found in the DNS server of our university. By this means, the crawler can download the pages of every web site in this domain.
The crawler works site by site. When a spider visits a web server, it downloads all the pages on that site and stops when it encounters a URL linking to another site. This kind of URL is called a "stop URL"; the content behind a stop URL is treated as valuable material of the current site and is downloaded too. This differs from a conventional search engine, whose crawler surfs the WWW freely with no stop URLs and therefore needs complex algorithms to direct it. In the new system, the crawler simply downloads all the pages of each site and need not consider the intricate relations between different sites.
4.1.2. Indexer
Normally, the key issue in the indexer is the appropriate selection of metadata. In text information retrieval, every word of a document is indexed with its ranking score, and we use the same method to index web pages. The ranking score of a keyword is thus determined by its position and frequency. Other information, such as tags, encoding, and an abstract, is also used to describe a web page. Moreover, we could use W3C's ontology model [12] or other advanced technologies to index the web pages.
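As one concrete possibility for such frequency-based scoring, a plain TF*IDF weighting could look like the following. This is illustrative only; the paper leaves the exact scoring to its standard protocols, and position or tag weights are omitted here.

```python
import math
from collections import Counter

def tfidf_index(docs):
    """docs: {doc_id: [terms]}. Returns {term: {doc_id: tf*idf weight}}."""
    n = len(docs)
    df = Counter()                       # document frequency per term
    for terms in docs.values():
        df.update(set(terms))
    index = {}
    for doc_id, terms in docs.items():
        counts = Counter(terms)
        for term, tf in counts.items():
            idf = math.log(n / df[term])
            index.setdefault(term, {})[doc_id] = tf * idf
    return index
```

A term that occurs in every document gets idf = log(1) = 0, so it contributes nothing to ranking, while rare terms are weighted up.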
4.1.3. Searcher
The main functions of the searcher are providing the user interface and processing the search results; how to rank the pages is the key issue. In this layer, we use text information retrieval technology to rank the results, because the web pages are limited to a small area such as a university, whereas hyperlink analysis is more useful at large scale. This paper is mainly concerned with how to build the search system; the detailed ranking algorithm will be specified in its standard protocols.
4.2. Second layer: metadata harvest search system
This layer provides search service covering all the web sites under one second-level domain, and adopts the metadata harvest system. A third-layer node such as "hust.edu.cn" corresponds to one university; most universities have no more than 100 thousand pages, so a centralized search system works efficiently there. But a second-layer node such as "edu.cn" covers the web pages of all the universities in China, and a centralized search system could not ensure the recency and coverage of its database. So we use the metadata harvest system in this layer. The search engine of this layer has only two parts, indexer and searcher.
4.2.1. Indexer
Its data are downloaded from the databases of the corresponding third-layer nodes. For example, the search engine for the domain "edu.cn" obtains its data from the search engine databases of thousands of universities in China, rather than extracting millions of web pages itself. By this means the update interval can be much shorter than with conventional methods, and storing only the metadata also keeps the database fairly small. The web metadata harvest protocol follows the OAI protocol [10], a mature metadata harvest scheme that has been adopted in many library systems.
4.2.2. Searcher
A notable problem when harvesting the metadata is the overlap of web pages. In the third layer, when the crawler extracts pages from a site, it also fetches some pages that do not belong to that site (stop URLs). So when the metadata are harvested, some pages appear many times. Given the download principle of the third layer, the overlap number of a page (not counting the page itself) is exactly the number of pages on other sites that point to it. Following the theory of hyperlink analysis, the ranking score of the page can be computed from this number; the new system thus uses a simple method to realize the basic idea of hyperlink analysis, which is clearly more efficient for ranking the search results in this layer. Details such as how the two layers cooperate to transfer the metadata and the final ranking algorithm will be defined in its protocols.
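The overlap counting above can be sketched in a few lines. The record format is our own assumption: each site's harvested database is reduced to the set of URLs it contains, and every extra appearance of a URL across sites counts as one in-link.

```python
from collections import Counter

def overlap_rank(harvested):
    """harvested: {site: [urls in that site's database]}.
    Returns {url: overlap number}, i.e. appearances minus one."""
    counts = Counter()
    for urls in harvested.values():
        counts.update(set(urls))         # each site counts at most once
    return {url: c - 1 for url, c in counts.items()}
```

A page harvested only from its home site gets overlap 0; each additional site whose crawl fetched it (as a stop URL) raises the count, and thus its ranking score, by one.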
4.3. Top layer: distributed search system
Because the number of second-layer nodes (the number of second-level domain names) is no more than ten, while storing the metadata of billions of web pages in a single system is still a big challenge, the search engine in this layer is a distributed information retrieval system. It has only one part, the searcher: no crawler and no indexer.
There are three issues in designing a distributed search system [13]: (1) the underlying transport protocol for sending messages (e.g., TCP/IP); (2) a search protocol that specifies the syntax and semantics of the messages transmitted between clients and servers; (3) a mechanism that combines the results from different servers. These are addressed as follows.
(1) Communication protocol. In this system, SOAP is adopted as the basic communication protocol.
(2) Search protocol. The search protocol is based on Web services, which use SOAP as their fundamental protocol and provide an efficient distributed architecture. We refer to SDLIP [14] and Google's search Web service to define the format of queries and results. All the search engines in the second layer should provide a standard search Web service according to a standard protocol; the search engine in the top layer merely needs to index all the Web services in the lower layer.
(3) Combining the results. In this system, the key problem in combining results is page ranking. In the second layer, the ranking score of a page is calculated from its overlap number; in this layer we use the same theory. The ranking scores of the same page from different databases are added up, and a final ranked list of pages is produced.
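The combination step can be sketched as follows; the result format (a score dict per sub-engine) is our own assumption, not part of the paper's protocol. Scores for the same page are summed across sub-engines and the union is sorted into one ranked list.

```python
def merge_results(result_sets):
    """result_sets: list of {url: score} dicts, one per second-layer engine.
    Returns [(url, total_score)] sorted best-first."""
    totals = {}
    for results in result_sets:
        for url, score in results.items():
            # same page from different databases: add the scores up
            totals[url] = totals.get(url, 0.0) + score
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

Because every sub-engine computes scores under the same standard protocol, the summed values are comparable, which is what distinguishes this design from an ordinary meta search engine merging heterogeneous scores.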
This search engine works like a meta search engine [15]: it has no page database of its own, only an index of the search interfaces of other engines. Experience from meta search engine research, such as how to combine search results, can be reused in this system. Because the sub search engines of this system are strictly organized and comply with the same protocol, its performance will be much better than that of any current meta search engine. The search engine in this layer provides search service within one country, which covers most search requests on the Internet; to search across several countries, one can use the top-layer search Web services to build a special search engine.
5. Application of web search engine based on DNS
The next question is how to use this system. Because all the nodes of this system are integrated search engines, helping the user find the appropriate search engine is the key to its application. As an application software system, we describe it with the class tree of the OO (Object Oriented) model. We chose the namespace DRIS (Domain Resource Integration System) for this system, meaning the integration of the information resources in different domains. The class tree of the search engine based on DNS is shown in Fig. 2.
[Fig. 2: class tree rooted at the DRIS namespace; the diagram is not reproduced here.]
Acknowledgments. The research described here was conducted as part of the HUST (Huazhong University of Science and Technology) Digital Library project, supported by CALIS (China Academic Library & Information System) and the national "211" project. The IETF also gave many suggestions and much help for this research.
REFERENCE
[1] Steve Lawrence, C. Lee Giles, Searching the World Wide Web, Science. 280(1998) 98-100
[2] C.M. Bowman, P.B. Danzig, D.R. Hardy, The Harvest information discovery and access system, Computer
Networks and ISDN Systems.28(1995) 119-125.
[3] N. Sato, M. Uehara, Y. Sakai, H. Mori, A distributed search engine for fresh information retrieval, in:
Proceedings of Database and Expert Systems Applications, 12th International Workshop, 2001
[4] Mark A.C.J. Overmeer, My personal search engine, Computer Networks. 31(1999) 2271–2279
[5] Risvik, Knut Magne, Michelsen Rolf, Search engines and Web dynamics, Computer Networks. 39(2002)
289-302
[6] Budi Yuwono, Savio L. Y. Lam, Jerry H. Ying, Dik L. Lee , A World Wide Web Resource Discovery System,
In: Proceedings of the 4th International WWW Conference (WWW95) December 11-14, 1995, Boston,
Massachusetts, USA .http://www.w3.org/Conferences/WWW4/Papers/66/
[7] Henzinger M.R, Hyperlink analysis for the Web, IEEE Internet Computing. 5(2001) 45-50
[8] S. Brin, L. Page, The anatomy of a large-scale hypertextual web search engine, Computer networks and ISDN
system. 30(1998)107-117
[9] Soumen Chakrabati, Dom Byron E. Mining the Web's Link Structure, Computer. 32(1999) 60-67
[10] OAI (The Open Archives Initiative Protocol for Metadata Harvesting) protocol
http://www.openarchives.org/OAI/openarchivesprotocol.html
[11] Infobus, http://www-diglib.stanford.edu/diglib/pub/userinfo.html
[12] Web ontology model, http://www.w3.org/2001/sw/WebOnt/
[13] Liang sun, Implementation of large-scale distributed information retrieval system, in: Proceedings of
Info-tech and Info-net, 2001
[14] SDLIP, http://www-diglib.stanford.edu/~testbed/doc2/SDLIP/
[15] Huang Lieming, Hemmje Matthiasa, Neuhold Erich J, ADMIRE: an adaptive data model for Meta search
engines, Computer Networks. 33(2000)431-447
[16] G.T. Wang, F. Xie, F. Tsunoda, H. Maezawa, A.K. Onoma, Web Search with Personalization and Knowledge,
in: Proceedings of Multimedia Software Engineering, Fourth International Symposium, 2002,
[17] W. Zhang , B. Xu, H. Yang , Development of A Self-Adaptive Web Search Engine, in: Proceedings of Web
Site Evolution, 3rd International Workshop,2001
[18] Sato. N, Udagawa. M, Uehara. M, Sakai.Y, Searching Restricted Documents in a Cooperative Search Engine.
In: Proceedings of 24th International Conference Distributed Computing Systems Workshops, 23-24 Mar. 2004
[19] Semantic Web, http://www.w3.org/2001/sw/
[20] IIRI BOF for DRIS, http://www.ietf.org/ietf/04mar/iiri.txt