Fuzzy Based Approach To URL Assignment in Dynamic Web Crawler
Abstract— WWW is a huge collection of unorganized documents. A web crawler is the process used by search engines to build their database from this unorganized web. The crawler, which interacts with millions of web pages, has to be made efficient in order to make a search engine powerful. This necessitates the parallelization of web crawlers to enhance the download rate in the face of the fast-increasing size of the web. The paper reviews different parallel web crawling techniques in the literature and proposes an approach for URL assignment in a dynamic parallel web crawler using fuzzy logic, which addresses two important aspects of a crawler: the first is to create a crawling framework with load balancing among the parallel crawlers; the second is to make the crawling process fast by using parallel crawlers with efficient network access.

Keywords— Static Parallel Crawler, Dynamic Parallel Crawler, Fuzzy Logic.

I. INTRODUCTION

A crawler is a program that downloads and stores web pages, often for a web search engine. A crawler plays a vital role in data mining algorithms in many fields of research, e.g. mining of Twitter data for opinion mining or finding the success ratio in project funding sites like Kickstart [1, 2]. Generally, a web crawler starts its work from a single seed URL, placing it in a queue Q0 that holds all the URLs to be processed. From there, it extracts a URL based on some ordering, downloads that page, extracts any URLs in the downloaded page, and puts them in the same queue. It repeats this cycle until it is stopped.

The major difficulty for a single-process web crawler is that crawling the web may take months, and in the meantime a number of web pages may have changed and are thus no longer useful to the end users. So, to minimize the download time, search engines execute multiple crawlers simultaneously, known as parallel web crawlers.

An appropriate architecture for a parallel crawler demands that the overlap among the web pages downloaded by different parallel agents be negligible. Further, the coverage rate of the web should not be compromised within each parallel agent's range. Next, the quality of the web crawled should not be less than that of a single centralized crawler [4]. To achieve all this, communication overhead should be taken into account so as to achieve a tradeoff among the objectives and build an optimized crawler.

Besides these challenges, the advantages of a parallel crawler over a single-process crawler are [4]:

Scalability: With millions of pages being added to the web daily, it is almost impossible to crawl the web with a single-process crawler.

Network Load Dispersion: With parallel crawlers, we can disperse the load to multiple regions rather than overloading one local network.

Network Load Reduction: By allowing parallel agents to crawl specific local data (of the same country or region as the crawler), the pages only have to travel through the local network, thereby reducing the network load.

Further, to reduce the overlap among the pages downloaded by the parallel crawlers, the parallel agents need to coordinate. On that basis, a parallel crawler can be implemented in three ways [4]:

Static Parallel Crawler: The web is partitioned by some logic and each crawler knows its own partition to crawl, so there is no need for a central coordinator.

Dynamic Parallel Crawler: A central coordinator assigns URLs to the different parallel agents based on some logic, i.e. the web is partitioned by the central coordinator at run time.

Independent Parallel Crawler: There is no coordination among the parallel agents; each parallel agent continues crawling from its own seed URL. So, the overlap can be significant in this case unless the domains of the crawl agents are limited and entirely different for each crawl agent.

A. Static Parallel Crawler

As discussed, there is no need for a central coordinator in a static parallel crawler. Instead, we need a good partitioning method to partition the web before crawling. A number of partitioning schemes have been proposed, as follows [5]:

URL Hash Based: In this scheme, a page is sent to a parallel agent based on the hash value of its URL. So,
in between a crawl, a parallel agent may not be able to crawl URLs of the same site, because different hash values lead to interpartition links.

Site Hash Based: In this scheme, the hash value is calculated only on the site name of the URL. So, the URLs of the same site will be crawled by the same parallel agent, resulting in fewer interpartition links and, further, less communication bandwidth.

Hierarchical: Here, partitioning is done on the basis of attributes like country, language, or the type of URL extension.

One concern in the literature on the static parallel crawler is the mode of job division among the parallel agents. There are different modes of job division, such as firewall, crossover, and exchange [5]. Under the first mode, a parallel agent crawls pages in its own partition only, neglecting the interpartition links. Under the second mode, a parallel agent primarily crawls same-partition links, and only when there are no more links left to crawl in its own partition does it move on to the interpartition links. Under the third mode, the parallel agents communicate through message exchanges whenever they encounter an interpartition link, to increase coverage and decrease overlap.

The drawbacks of the static parallel crawler are as follows:

Scalability: In order to reduce the overlap and increase the coverage, N! connections are needed for transferring URLs to the appropriate parallel agents when the number of parallel crawlers is N.

Quality of web pages crawled: Each parallel agent is unaware of the web crawled by the other agents. So, the agents do not have a global image of the crawled web, and the decision on URL selection is based entirely on a subset of the crawled web, namely the part crawled by that parallel agent itself.

B. Dynamic Parallel Crawler

As discussed, in a dynamic parallel crawler a central coordinator manages the assignment of URLs to the different crawl agents. The architecture of the dynamic web crawler is as follows: each parallel agent behaves as a separate single-process crawler, receiving the seed URL for its domain from the central coordinator. It then downloads pages from the web, extracts the URL links from each downloaded page, and sends a link to the central coordinator for assignment in case it lies outside the domain of the crawl agent. The domain of each parallel agent is implementation specific. Further, the dynamic parallel crawler has a number of advantages, which are explained as follows:

Crawling Decision: The static parallel crawler suffers from poor crawling decisions, i.e. which web page to crawl next, because no crawl agent has a complete view of the crawled web. In the dynamic parallel crawler, the central coordinator has the global image of the web, and the decisions about URL selection and assignment are taken by the central coordinator, not by the crawl agents [6].

Scalability: In the case of the dynamic parallel crawler, only N connections with the central coordinator are needed for URL assignment when the number of crawl agents is N. If a new crawl agent is added to the system, only one socket connection is required between that crawl agent and the central coordinator.

Minimizing Web Server Load: One important property of a web crawler is that it should not overload a server with its requests. It has been observed that a web page contains a number of links to pages of the same web server. The crawl agents send the URL links to the central coordinator, which sends back only the most important unvisited links. Since the pages of one server cannot all be equally important at all times, this decreases the load on any single web server.

The design of a dynamic web crawler also poses a number of challenges, which need to be addressed here:

Which distribution algorithm should be used for URL assignment?

How to distribute jobs to the different crawlers based on their health, i.e. how to select the crawler for a URL so as to optimize load balancing?

How to manage the already crawled pages so as to avoid replication of pages in the database?
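As a concrete baseline for the first and third of these challenges, a coordinator can combine site-hash assignment with a visited set. The sketch below is purely illustrative and is not the technique proposed in this paper; the function names and the choice of MD5 are assumptions made for the example.

```python
import hashlib
from urllib.parse import urlparse

def assign_agent(url: str, n_agents: int) -> int:
    """Site-hash assignment: hash only the host part of the URL,
    so every page of one site goes to the same crawl agent."""
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_agents

# The coordinator remembers every URL it has already handed out,
# so a page is never dispatched (and hence stored) twice.
visited = set()

def dispatch(url: str, n_agents: int):
    """Return the agent index for a new URL, or None if already crawled."""
    if url in visited:
        return None
    visited.add(url)
    return assign_agent(url, n_agents)
```

Because the hash is computed over the site name rather than the full URL, pages of one server stay with one agent, the same property that makes the site-hash partitioning scheme above produce fewer interpartition links.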
The main objective of this paper is the URL assignment strategy, which is one of the important functionalities of the dynamic parallel crawler.
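Although the proposed technique is described later, the general flavor of fuzzy, health-based crawler selection can be sketched as follows. The membership functions and the two rules below are invented purely for illustration; they are not the rule base proposed in this paper.

```python
def tri(x, a, b, c):
    """Triangular membership function on [a, c], peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fitness(load, speed):
    """Fuzzy 'health' of a crawl agent from its normalized queue load
    and download speed (both in [0, 1]). Illustrative rules:
      IF load is LOW  AND speed is HIGH -> fitness HIGH
      IF load is HIGH OR  speed is LOW  -> fitness LOW"""
    low_load   = tri(load, -0.5, 0.0, 0.6)
    high_load  = tri(load,  0.4, 1.0, 1.5)
    low_speed  = tri(speed, -0.5, 0.0, 0.6)
    high_speed = tri(speed,  0.4, 1.0, 1.5)
    fire_high = min(low_load, high_speed)   # fuzzy AND -> min
    fire_low  = max(high_load, low_speed)   # fuzzy OR  -> max
    # weighted-average defuzzification: HIGH pulls toward 1, LOW toward 0
    return fire_high / (fire_high + fire_low + 1e-9)

def pick_agent(agents):
    """agents: {name: (load, speed)}; return the healthiest agent."""
    return max(agents, key=lambda name: fitness(*agents[name]))
```

The coordinator would recompute each agent's fitness as loads change and assign the next URL to the agent with the highest score, which is one simple way to frame the load-balancing challenge above.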
The paper is organized as follows. Related work on dynamic URL assignment is explained in Section II. Section III describes the proposed fuzzy technique for URL assignment. Section IV describes the fuzzy phase of the technique, including the benefits of the proposed architecture. Section V concludes the paper.