Architectural Design and Evaluation of An Efficient Web-Crawling System
1. Introduction
During the short history of the World Wide Web (Web), Internet resources have
grown day by day and the number of home pages has increased rapidly. How can
users quickly and accurately find what they need on the Web? Search engines are
a useful tool for this task and are becoming more and more important. The
number of indexed pages plays a vital role in a search engine: by indexing more
of the pages it visits, a search engine can satisfy user requests better.
Because the Web changes every day, indexing more pages means collecting more
pages within a limited time frame. Thus collecting pages efficiently is
essential for a quality search engine.
It is natural to think of distributed systems and parallel processing when
discussing the efficient execution of tasks over large data sets. Previously,
WebGather 1.0 [1], which answers more than 30,000 queries every day, adopted a
centralized method to collect Web pages: a main process manages many crawlers
working in parallel, and an index of one million pages is maintained after the
pages are crawled and analyzed. With a crawling capability of 100,000 pages a
day, WebGather 1.0 takes about ten days to refresh the whole index.
2. Related work
2.1. Harvest: a typical distributed architecture
Harvest [4] is a typical system that makes use of distributed methods to
collect and index Web pages. Harvest consists of several subsystems. The
Gatherer subsystem collects indexing information (such as keywords, author
names, and titles) from the resources available at Provider sites (such as FTP
and HTTP servers). The Broker subsystem retrieves indexing information from one
or more Gatherers, eliminates duplicate information, incrementally indexes the
collected information, and provides a Web query interface to it. The Replicator
subsystem efficiently replicates Brokers around the Internet. Users can
efficiently retrieve located information through the Cache subsystem. The
Harvest Server Registry is a distinguished Broker that holds information about
each Harvest Gatherer, Broker, Cache, and Replicator on the Internet. Harvest
thus provides a distributed framework for gathering and indexing Web
information.
The distributed crawling architecture described in this paper is designed to
meet the following goals:
1. Load balance.
2. A low amount of communication between main controllers.
3. High scalability, that is, the more main controllers, the higher the
   performance.
4. Dynamic reconfigurability, that is, main controllers can be added or
   removed while the system is running.
Two communication topologies are considered for transmitting cross URLs
between main controllers (see the sketch after this list):
1. Circular communication: there is an interconnection between two adjacent
   main controllers, forming a ring graph. Cross URLs can be transmitted
   clockwise or anti-clockwise.
2. Mesh communication: there is an interconnection between any two main
   controllers, forming a fully connected graph. Cross URLs can be transmitted
   directly from one main controller to another.
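To make the difference concrete, the following Python sketch assigns every URL
to a home main controller by hashing its host name and then routes a cross URL
either hop by hop along the ring or in a single hop over the mesh. Here a
"cross URL" is taken to be a URL discovered by one main controller that hashes
to another controller's partition; this reading, the hash function, the
controller count N, and the helper names are illustrative assumptions rather
than the exact WebGather mechanism.

import hashlib

N = 4  # assumed number of main controllers, numbered 0 .. N-1

def home_controller(url):
    # Assign every URL to one main controller by hashing its host name.
    host = url.split("/")[2] if "://" in url else url
    digest = hashlib.md5(host.encode()).hexdigest()
    return int(digest, 16) % N

def ring_route(src, dst):
    # Circular communication: forward a cross URL hop by hop along the ring,
    # taking the shorter of the clockwise and anti-clockwise directions.
    clockwise = (dst - src) % N
    anticlockwise = (src - dst) % N
    step = 1 if clockwise <= anticlockwise else -1
    path, cur = [src], src
    while cur != dst:
        cur = (cur + step) % N
        path.append(cur)
    return path

def mesh_route(src, dst):
    # Mesh communication: any two main controllers are directly connected,
    # so a cross URL reaches its home controller in a single hop.
    return [src] if src == dst else [src, dst]

url = "http://e.pku.edu.cn/index.html"
print(home_controller(url), ring_route(0, home_controller(url)))

Under the ring, a cross URL may traverse up to about N/2 intermediate main
controllers, while the mesh always delivers it in one hop at the cost of
maintaining a link between every pair of controllers.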
[Figure: system architecture with main controllers MainCtrl1, MainCtrl2,
MainCtrl3, ..., MainCtrlN, their gatherers Gather1, Gather2, Gather3, ...,
GatherN, and the WSR]
[Figure: the URL hash space 1-50000 divided evenly among main controllers
n1-n10 (1-5000, 5001-10000, ..., 45001-50000), and the shifted ranges after a
new controller n11 is added, where each ni keeps the head of its range (e.g.
n1 keeps 1-4545, n2 keeps 5001-9545) and shifts the remainder (4546-5000,
9546-10000, ..., 49546-50000) to n11]
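Read as a partitioning of a hash space of 50,000 values, the figure suggests
that adding a main controller only shifts the tail of each existing range to
the newcomer. Below is a minimal sketch of that scheme in Python, assuming
contiguous hash ranges and the shift rule implied by the figure; the function
names and data layout are ours, not WebGather's.

HASH_SPACE = 50000  # size of the URL hash space shown in the figure

def initial_ranges(n):
    # Split 1..HASH_SPACE into n contiguous, equal ranges
    # (5000 hash values each for n = 10).
    size = HASH_SPACE // n
    return [(i * size + 1, (i + 1) * size) for i in range(n)]

def add_controller(ranges):
    # Each existing main controller keeps the first n/(n+1) share of its range
    # and shifts the tail to the newcomer, so all n+1 controllers end up with
    # nearly equal shares while only a small fraction of URLs changes owner.
    n = len(ranges)
    kept, shifted = [], []
    for lo, hi in ranges:
        size = hi - lo + 1
        cut = lo + size * n // (n + 1) - 1  # last value the old owner keeps
        kept.append((lo, cut))
        shifted.append((cut + 1, hi))
    kept.append(shifted)  # the new controller owns all shifted sub-ranges
    return kept

ranges = initial_ranges(10)
print(ranges[0], ranges[1])          # (1, 5000) (5001, 10000)
new = add_controller(ranges)
print(new[0], new[1], new[-1][:2])   # (1, 4545) (5001, 9545)
                                     # [(4546, 5000), (9546, 10000)]

Under this scheme only about 1/(n+1) of the hash space changes owner when the
(n+1)-th main controller joins, which is what makes the reconfiguration cheap.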
Table 1. Reference data
t      1    2    3    4    5    6    7    8    9    10
ref1   2    4    6    8    10   12   14   16   18   20
ref2   3    6    9    12   15   18   21   24   27   30
E(X) = Σ_k x_k p_k ,   k = 1, 2, ...   (1)
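For instance, for a quantity that takes the two regulated reference values 0.4
and 0.6 with equal probability, E(X) = 0.5 × 0.4 + 0.5 × 0.6 = 0.5.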
Table 3. Variances for systems with 2, 4, 8, and 16 main controllers and the
corresponding reference variance
t     n = 2      n = 4      n = 8      n = 16     Reference
1     0.000110   0.000326   0.000124   1.06E-05   0.01
2     0.001454   0.00059    7.04E-05   1.57E-05   0.01
3     0.000501   0.000564   6.11E-05   1.43E-05   0.01
4     0.000309   0.000375   4.98E-05   1.11E-05   0.01
5     8.18E-05   0.000315   5.32E-05   1.34E-05   0.01
6     6.18E-05   0.000465   4.18E-05   1.42E-05   0.01
7     2.14E-07   0.000702   4.25E-05   1.48E-05   0.01
8     1.25E-05   0.000672   7.44E-05   1.51E-05   0.01
9     2.74E-05   0.000662   5.91E-05   1.58E-05   0.01
10    8.24E-06   0.000568   5.79E-05   1.82E-05   0.01
Table 3 shows that, with two, four, eight, or sixteen main controllers, the
variances are always smaller than the corresponding reference values. In other
words, each main controller is responsible for a Web page set of roughly the
same size, so the load-balance goal of the distributed system is achieved.
x'_k = x_k / Σ_{j=1}^{n} x_j ,   k = 1, 2, ...   (3)

Applying formula (3) to each column of Table 1 yields the regulated reference
data shown in Table 2.
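As a quick check (an illustrative computation, assuming the variance reported
in Table 3 is the ordinary population variance of the normalized shares),
applying (3) to any column of Table 1 gives the shares 0.4 and 0.6, whose
variance is exactly the reference value 0.01 that appears in Table 3:

ref1, ref2 = 8, 12                           # the t = 4 column of Table 1

total = ref1 + ref2
shares = [ref1 / total, ref2 / total]        # formula (3): 0.4 and 0.6
mean = sum(shares) / len(shares)             # formula (1) with equal weights
variance = sum((x - mean) ** 2 for x in shares) / len(shares)
print(shares, variance)                      # [0.4, 0.6] 0.01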
Table 2. Regulated reference data
t      1    2    3    4    5    6    7    8    9    10
ref1   0.4  0.4  0.4  0.4  0.4  0.4  0.4  0.4  0.4  0.4
ref2   0.6  0.6  0.6  0.6  0.6  0.6  0.6  0.6  0.6  0.6
Table 4. Comparison between the centralized and the distributed system
Main controllers in the distributed system    2          4          8          16
Main controllers in the centralized system    1          1          1          1
Web page number (centralized)                 56199      52712      51055      24763
Web page number (distributed)                 96130      177131     304854     290344
Ratio (distributed / centralized)             1.710529   3.360354   5.97109    11.72491
Line type in Figure 3                         Square     Diamond    Plus       Star
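For example, with two main controllers the distributed system collects 96130
pages against 56199 for the centralized one, a ratio of 96130 / 56199 ≈ 1.71;
with sixteen main controllers the ratio reaches 290344 / 24763 ≈ 11.72.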
[Figure 3: Web page number collected over time for 2, 4, 8, and 16 main
controllers (square, diamond, plus, and star lines); a further plot shows the
acceleration against the number of main controllers]
5. Conclusion
The parallel and distributed architecture described in this paper provides a
method for efficiently crawling a massive number of Web pages. The simulation
results demonstrate that the system meets our design goals. At present, we are
applying this architecture and method to implement WebGather 2.0. The real
system (visit http://e.pku.edu.cn for a look) temporarily runs two main
controllers and shows the expected outcome, collecting about 300,000 pages a
day. At the same time, we realize that the success of the distributed crawling
system raises many new research and development issues, such as parallel
indexing and retrieval. In addition, we believe that the architecture proposed
in this paper can be used to build the information-system infrastructure in a
digital library context.
6. Acknowledgement
7. References
[1] J. Liu, M. Lei, J. Wang, and B. Chen. Digging for gold on the Web:
Experience with the WebGather. In Proceedings of the 4th International
Conference on High Performance Computing in the Asia-Pacific Region, Beijing,
P.R. China, May 14-17, 2000. IEEE Computer Society Press, pp. 751-755.
[2] Google Search Engine. http://www.google.com
[3] http://searchenginewatch.com/reports/sizes.html