Web Crawler: A Review
ABSTRACT
Information Retrieval deals with searching and retrieving information within documents, and it also searches online databases and the Internet. A web crawler is defined as a program or software which traverses the Web and downloads web documents in a methodical, automated manner. Based on the type of knowledge, web crawling is usually divided into three techniques: General Purpose Crawling, Focused Crawling and Distributed Crawling. In this paper, the applicability of web crawlers in the field of web search and a review of web crawlers applied to different problem domains in web search are discussed.

Keywords

1. INTRODUCTION
Nowadays it has become an important part of human life to use the Internet to access information from the WWW. The current population of the world is about 7.049 billion, out of which 2.40 billion people (34.3%) use the Internet [3] (see Figure 1). From 0.36 billion in 2000, the number of Internet users has increased to 2.40 billion in 2012, an increase of 566.4% from 2000 to 2012. In Asia, out of 3.92 billion people, 1.076 billion (27.5%) use the Internet, whereas in India, out of 1.2 billion people, 0.137 billion (11.4%) use the Internet. The same growth rate is expected in the future too, and the day is not far away when one will start thinking that life is incomplete without the Internet. Figure 1 illustrates Internet users in the world by geographic region.

Figure 1: Internet Users in the World by Geographic Regions (millions of users): Asia 1016.8, Europe 500.7, North America 273.1, Latin America 235.8, Africa 139.9, Middle East 77, Oceania/Australia 23.9.

b) Meta search engines
c) Search engines

2. WEB CRAWLER
A web crawler is a program, software, or programmed script that browses the World Wide Web in a systematic, automated manner. The structure of the WWW is a graph structure, i.e., the links presented in a web page may be used to open other web pages. The Internet can be viewed as a directed graph, with each web page as a node and each hyperlink as an edge, so the search operation may be summarized as a process of traversing a directed graph.
By following the link structure of the Web, a web crawler may traverse several new web pages starting from a single web page. A web crawler moves from page to page by using the graph structure of the web pages. Such programs are also known as robots, spiders, and worms. Web crawlers are designed to retrieve web pages and insert them into a local repository. Crawlers are basically used to create a replica of all the visited pages, which are later processed by a search engine that indexes the downloaded pages to support quick searches. A search engine's job is to store information about several web pages, which it retrieves from the WWW. These pages are retrieved by a web crawler, which is an automated web browser that follows each link it sees.

2.1 The History of Web Crawler
The first Internet "search engine", a tool called "Archie" (shortened from "Archives"), was developed in 1990; it downloaded the directory listings from specified public anonymous FTP (File Transfer Protocol) sites into local files, around once a month [5], [6]. In 1991, "Gopher" was created, which indexed plain text documents; the "Jughead" and "Veronica" programs help to explore these Gopher indexes [7], [8], [9], [10]. With the introduction of the World Wide Web in 1991 [11], [12], numerous Gopher sites changed into web sites that were properly linked by HTML links. In 1993, the "World Wide Web Wanderer", the first crawler, was created [13]. Although this crawler was initially used to measure the size of the Web, it was later used to retrieve URLs that were then stored in a database called Wandex, the first web search engine [14]. Another early search engine, "Aliweb" (Archie-Like Indexing for the Web) [15], allowed users to submit the URL of a manually constructed index of their site.

The index contained a list of URLs and a list of user-written keywords and descriptions. The network overhead of crawlers initially caused much controversy, but this issue was resolved in 1994 with the introduction of the Robots Exclusion Standard [16], which allowed web site administrators to block crawlers from retrieving part or all of their sites. Also in 1994, "WebCrawler" was launched [17], the first "full text" crawler and search engine. WebCrawler permitted users to search the web content of documents rather than the keywords and descriptors written by web administrators, reducing the possibility of confusing results and allowing better search capabilities. Around this time, commercial search engines began to appear, with [18], [19], [20], [21], [22], [23], [24] and [25] being launched from 1994 to 1997 [26]. Also introduced in 1994 was Yahoo!, a manually maintained directory of web sites that later incorporated a search engine. During these early years Yahoo! and AltaVista maintained the largest market share [26]. In 1998 Google was launched, quickly capturing the market [26]. Unlike many of the search engines at the time, Google had a simple uncluttered interface, unbiased search results that were reasonably relevant, and a lower number of spam results [27]. These last two qualities were due to Google's use of the PageRank [28] algorithm and the use of anchor term weighting [29].

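The Robots Exclusion Standard mentioned above is still the mechanism by which site administrators tell crawlers which parts of a site they may fetch. As a brief side illustration (not part of the surveyed history), Python's standard urllib.robotparser module can consult a site's robots.txt before a page is downloaded; the site and user-agent names below are placeholders.

    from urllib import robotparser

    # Parse the site's robots.txt once, then consult it before each fetch.
    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")   # placeholder site
    rp.read()

    # A polite crawler checks every candidate URL against the rules
    # published for its user agent before downloading the page.
    if rp.can_fetch("MyCrawler", "https://example.com/private/page.html"):
        print("allowed to fetch")
    else:
        print("disallowed by robots.txt")
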
While early crawlers dealt with relatively small amounts of data, modern crawlers, such as the one used by Google, need to handle a substantially larger volume of data due to the dramatic increase in the size of the Web.

2.2 Working of Web Crawler
The working of a web crawler begins with an initial set of URLs known as seed URLs. The crawler downloads the web pages for the seed URLs and extracts the new links present in the downloaded pages. The retrieved web pages are stored and well indexed in the storage area so that, with the help of these indexes, they can later be retrieved as and when required. The URLs extracted from a downloaded page are checked to determine whether their related documents have already been downloaded or not. If they have not been downloaded, the URLs are again assigned to web crawlers for further downloading. This process is repeated until no more URLs remain for downloading. Millions of pages are downloaded per day by a crawler to complete the target. Figure 2 illustrates the crawling process.

Figure 2: Flow of a crawling process (initialize; get a URL; download the page from the WWW; extract URLs; store the pages in the web repository).

The working of a web crawler may be described as follows:
1. Select a starting seed URL or URLs
2. Add it to the frontier
3. Pick a URL from the frontier
4. Fetch the web page corresponding to that URL
5. Parse that web page to find new URL links
6. Add all the newly found URLs to the frontier
7. Go back to step 3 and reiterate until the frontier is empty

Thus a web crawler will recursively keep on inserting newer URLs into the database repository of the search engine. So we can see that the major function of a web crawler is to insert new links into the frontier and to choose a fresh URL from the frontier for further processing after every recursive step.

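The frontier loop described above can be sketched in a few lines of Python. This is a minimal illustration of the listed steps, not a production crawler: it assumes the third-party requests and beautifulsoup4 packages are available, the seed URL is a placeholder, and politeness, robots.txt handling and error recovery are omitted.

    from collections import deque
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    def crawl(seed_urls, max_pages=100):
        """Frontier-based crawl: pick a URL, fetch it, extract links, repeat."""
        frontier = deque(seed_urls)      # URLs waiting to be crawled
        seen = set(seed_urls)            # URLs already added (avoids re-downloading)
        repository = {}                  # local store: URL -> page content

        while frontier and len(repository) < max_pages:
            url = frontier.popleft()                     # pick a URL from the frontier
            try:
                response = requests.get(url, timeout=5)  # fetch the web page
            except requests.RequestException:
                continue
            repository[url] = response.text              # store the page locally

            # Parse the page and add newly found URLs to the frontier.
            soup = BeautifulSoup(response.text, "html.parser")
            for anchor in soup.find_all("a", href=True):
                link = urljoin(url, anchor["href"])
                if link.startswith("http") and link not in seen:
                    seen.add(link)
                    frontier.append(link)
        return repository

    # Example (placeholder seed): pages = crawl(["https://example.com/"])
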
3. CRAWLING TECHNIQUES
There are a few crawling techniques used by web crawlers; the main ones are:

A. General Purpose Crawling
A general purpose web crawler collects as many pages as it can from a particular set of URLs and their links. In this approach, the crawler is able to fetch a large number of pages from different locations. General purpose crawling can reduce speed and consume network bandwidth because it fetches all the pages.

B. Focused Crawling
A focused crawler is designed to collect documents only on a specific topic, which can reduce the amount of network traffic and downloads. The purpose of the focused crawler is to selectively look for pages that are relevant to a pre-defined set of topics. It crawls only the relevant regions of the web and therefore leads to significant savings in hardware and network resources (a small relevance-scoring sketch is given after this section).

C. Distributed Crawling
In distributed crawling, multiple processes are used to crawl and download pages from the Web.

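To make the contrast between general purpose and focused crawling concrete, the sketch below scores a fetched page against a pre-defined topic vocabulary and only follows its links if the score passes a threshold. The keyword-overlap measure, the topic terms and the threshold value are illustrative assumptions, not a method prescribed by the sources cited here.

    import re

    def relevance(text, topic_terms):
        """Fraction of the page's words that belong to the topic vocabulary."""
        words = re.findall(r"[a-z]+", text.lower())
        if not words:
            return 0.0
        hits = sum(1 for word in words if word in topic_terms)
        return hits / len(words)

    def should_follow_links(page_text, topic_terms, threshold=0.02):
        """A focused crawler expands a page only if it looks on-topic."""
        return relevance(page_text, topic_terms) >= threshold

    # Example: a crawler focused on the topic of web crawling itself.
    topic = {"crawler", "crawling", "spider", "index", "search", "frontier"}
    sample = "A web crawler maintains a frontier of URLs and builds a search index."
    print(should_follow_links(sample, topic))   # True for this on-topic snippet
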
4. PARALLEL CRAWLER
Search engines now depend not on a single web crawler but on multiple web crawlers that run in parallel to complete the target. While functioning in parallel, crawlers still face many challenging problems such as overlap, quality and network load. Figure 3 illustrates the flow of multiple crawling processes.

Figure 3: Flow of multiple crawling processes (a URL distributor feeds several crawling processes, each of which downloads pages from the WWW).

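One common way to keep parallel crawling processes from overlapping, in the spirit of the URL distributor shown in Figure 3, is to assign each host to exactly one crawling process. The hash-based partitioning below is a generic sketch of that idea and not the scheme of any particular paper surveyed here; the URLs are placeholders.

    import hashlib
    from urllib.parse import urlparse

    def assign_process(url, num_processes):
        """Map a URL to one crawling process by hashing its host name.

        Hashing the host (rather than the full URL) keeps every page of a
        site in the same process, so no two processes fetch the same page.
        """
        host = urlparse(url).netloc.lower()
        digest = hashlib.md5(host.encode("utf-8")).hexdigest()
        return int(digest, 16) % num_processes

    # The URL distributor routes each discovered URL to its owning process.
    urls = [
        "https://example.com/a.html",
        "https://example.com/b.html",
        "https://example.org/index.html",
    ]
    for u in urls:
        print(u, "-> crawling process", assign_process(u, num_processes=3))
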
5. LITERATURE SURVEY
Possibly the largest-scale study of web page change was performed by Fetterly et al. [46]. They crawled 151 million pages once a week for 11 weeks and compared the modifications across pages. Like Ntoulas et al. [50], they found a relatively small amount of change, with 65% of all page pairs remaining exactly the same. The study furthermore found that past change was a good predictor of future change, that page length was correlated with change, and that the top-level domain of a page was correlated with change. Characterizing the amount of change on the Web has been of significant interest to researchers [44], [46], [47], [48], [49], [50], [52]. Cho and Garcia-Molina [44] crawled around 720,000 pages once a day for a period of four months and looked at how the pages changed. Ntoulas et al. [50] studied page change through weekly downloads of 154 websites collected over a year. They found that a large number of pages did not change according to a bag-of-words measure of similarity, and even for pages that did change, the changes were small. Frequency of change was not a good predictor of the degree of change, but the degree of change was a good predictor of the future degree of change.

More recently, Olston and Pandey [51] crawled 10,000 random samples of URLs and 10,000 pages sampled from the Open Directory every second day for several months. Their analysis measured both change frequency and information longevity, the average lifetime of a shingle, and found only a moderate correlation between the two. They introduced new crawl policies that are aware of information longevity. In a study of changes examined via a proxy, Douglis et al. [45] identified an association between revisitation rates and change. However, that study was limited to web content visited by a restricted population, and web pages were not aggressively crawled for changes between different visits. Researchers have also looked at how search results change over time [53], [54]; the main focus of this work was on understanding the dynamics of result change and the consequences it has for searchers who want to return to previously visited pages.

Junghoo Cho and Hector Garcia-Molina [30] proposed the design of an effective parallel crawler. As the size of the Web grows at a very fast pace, it becomes essential to parallelize the crawling process in order to finish downloading pages in a reasonable amount of time. The authors first propose multiple architectures for a parallel crawler and then identify basic issues related to parallel crawling. Based on this understanding, they propose metrics to evaluate a parallel web crawler and compare the proposed architectures using millions of pages collected from the Web. Rajashree Shettar and Dr. Shobha G [31] presented a new model and architecture of a web crawler using multiple HTTP connections to the WWW. The multiple HTTP connections are implemented using multiple threads and an asynchronous downloader component so that the overall downloading process is optimal. The user supplies the initial URL through the provided GUI, and crawling begins from that URL. As the crawler visits a URL, it identifies all the hyperlinks available in the web page and appends them to the list of URLs to visit, known as the crawl frontier. URLs from the frontier are visited iteratively, and crawling stops when it reaches more than five levels from the home page of each website visited, on the assumption that it is not necessary to go deeper than five levels from the home page to capture most of the pages people visit while trying to retrieve information from the Internet. Eytan Adar et al. [32] described algorithms, analyses, and models for characterizing the evolution of web content. The proposed analysis gives insight into how web content changes at a finer grain than previous studies, both in terms of the time intervals studied and the detail of change analyzed.

A. K. Sharma et al. [33] note that parallelization of the crawling system is necessary for downloading documents in a reasonable amount of time. The work reported there focuses on providing parallelization at three levels: the document, the mapper, and the crawl worker level. The bottleneck at the document level has been removed, the efficacy of the DF (Document Fingerprint) algorithm and the handling of volatile information have been tested and verified, and the paper specifies the major components of the crawler and their algorithmic details. Ashutosh Dixit et al. [34] developed a mathematical model for crawler revisit frequency. This model ensures that the frequency of revisit increases with the change frequency of a page up to a middle threshold value; from there up to an upper threshold value it remains the same, i.e., unaffected by the change frequency of the page; beyond the upper threshold value it starts reducing automatically and settles at the lower threshold.

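The description of the revisit-frequency model above is qualitative; one possible reading of its thresholding behaviour is sketched below. The update rule, threshold names and numeric values are assumptions made for illustration and are not taken from [34].

    def revisit_frequency(change_freq, lower=1.0, middle=4.0, upper=8.0):
        """Piecewise revisit policy: track change frequency up to the middle
        threshold, stay flat up to the upper threshold, then decay back
        towards the lower threshold."""
        if change_freq <= middle:
            return change_freq
        if change_freq <= upper:
            return middle
        # Beyond the upper threshold the frequency reduces and settles at `lower`.
        return max(lower, middle - (change_freq - upper))

    for f in (0.5, 3.0, 6.0, 12.0):
        print(f"change frequency {f:>4} -> revisit frequency {revisit_frequency(f)}")
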
Shruti Sharma et al. [35] present an architecture for a parallel crawler that comprises multiple crawling processes, called C-procs. Each C-proc performs the vital tasks that a single-process crawler performs: it downloads pages from the WWW, stores the pages locally, extracts URLs from them and follows their links. The C-procs executing these tasks may be spread either over the same local network or across geographically remote locations. Alex Goh Kwang Leng et al. [36] developed an algorithm which uses the standard Breadth-First Search strategy to design and develop a web crawler called PyBot. It initially takes a URL and from that URL gets all the hyperlinks; from the hyperlinks it crawls again, until no new hyperlinks are found, and it downloads all the web pages while it is crawling. PyBot outputs the web structure of the site it crawls in Excel CSV format. Both the downloaded pages and the web structure in Excel CSV format are stored and used for ranking: the ranking system takes the web structure in Excel CSV format, applies the PageRank algorithm, and produces a ranking order of the pages, displaying the page list with the most popular pages at the top.

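The ranking step described for PyBot applies PageRank to the crawled link structure. For reference, a generic power-iteration implementation of PageRank over an adjacency-list link graph is sketched below; it follows the textbook formulation rather than the PyBot code itself, and the three-page link graph is made up for the example.

    def pagerank(links, damping=0.85, iterations=50):
        """Standard power-iteration PageRank over an adjacency-list link graph."""
        pages = list(links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}

        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / n for p in pages}
            for page, outlinks in links.items():
                if not outlinks:                     # dangling page: share evenly
                    for p in pages:
                        new_rank[p] += damping * rank[page] / n
                else:
                    for target in outlinks:
                        new_rank[target] += damping * rank[page] / len(outlinks)
            rank = new_rank
        return rank

    # Made-up three-page link graph for illustration.
    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    for page, score in sorted(pagerank(graph).items(), key=lambda kv: -kv[1]):
        print(page, round(score, 3))
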
Song Zheng [37] proposed a new focused crawler analysis model based on genetic and ant algorithms. The combination of the genetic algorithm and the ant algorithm is called the Genetic Algorithm-Ant Algorithm; its basic idea is to take advantage of both algorithms to overcome their individual shortcomings, and the improved algorithm achieves a higher recall rate. Lili Yan et al. [38] proposed genetic PageRank algorithms. A genetic algorithm (GA) is a search and optimization technique used in computing to find optimum solutions; genetic algorithms are categorized as global search heuristics. Andoena Balla et al. [39] present a method for detecting web crawlers in real time. The authors use decision trees to classify requests in real time, as originating from a crawler or a human, while the session is still ongoing. For this purpose they used machine learning techniques to identify the most important features that distinguish humans from crawlers. The technique was tested in real time with the help of an emulator, using only a small number of requests, and the results show the effectiveness and applicability of the proposed approach.

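The crawler-detection work above trains decision trees on features of live sessions; the exact features belong to that study and are not reproduced here. The sketch below only illustrates the general recipe with a scikit-learn decision tree and two invented features (request rate and whether robots.txt was requested), trained on made-up labelled sessions.

    from sklearn.tree import DecisionTreeClassifier

    # Invented training data: [requests_per_second, fetched_robots_txt (0/1)].
    # Labels: 1 = crawler, 0 = human. A real system would use the features
    # identified in the study and sessions observed on a live site.
    X = [
        [0.2, 0], [0.1, 0], [0.3, 0], [0.5, 0],   # human-like sessions
        [5.0, 1], [8.0, 1], [3.5, 0], [6.0, 1],   # crawler-like sessions
    ]
    y = [0, 0, 0, 0, 1, 1, 1, 1]

    clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

    # Classify two ongoing sessions from their partial statistics.
    print(clf.predict([[0.25, 0], [4.2, 1]]))   # expected: [0 1]
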
Bahador Saket and Farnaz Behrang [40] presented a technique to correctly estimate the quality of links that have not yet been retrieved but to which a link is already accessible. For this purpose the authors apply an algorithm similar to the AntNet routing algorithm, and to avoid the local search problem they recommend a method based on genetic algorithms (GA). In this technique, the addresses of some pages are given to the crawler, their associated pages are retrieved, and the first generation is created. In the selection task, the degree of relationship between the pages and the specific topic is evaluated and each page is given a score; pages whose scores exceed a defined threshold are selected and saved, and the other pages are discarded. In the crossover task, the links of the current-generation pages are extracted and each link is given a score depending on the pages in which the link is placed. After that, a previously determined number of links is selected randomly, the related pages are retrieved, and a new generation is created. Anbukodi S. and Muthu Manickam K. [41] proposed an approach which employs mobile agents to crawl the pages. A mobile agent is created, sent, and finally received back and evaluated in its owner's home context. These mobile crawlers are transferred to the site of the source, where the data reside, to filter out any unnecessary data locally before transporting it back to the search engine, and they can decrease the network load by reducing the quantity of data transmitted over the network. Using this approach, web pages that have not been modified are filtered out by the mobile crawlers without being downloaded, and only those web pages that have actually been modified are retrieved from the remote servers. The migrating crawlers move to the web servers, carry out the downloading of web documents, processing, and extraction of keywords, and, after compression, transfer the results back to the central search engine. K. S. Kim et al. [42] proposed dynamic web-data crawling techniques, which include sensitive inspection of web site changes and dynamic retrieval of pages from target sites. The authors develop an optimal collection cycle model according to the update characteristics of the web contents; the model dynamically predicts the collection cycle of the web contents by calculating a web collection cycle score.

6. CONCLUSION
The Internet and intranets have brought a lot of information. People usually turn to search engines to find the necessary information. The web crawler is thus a vital information retrieval component which traverses the Web and downloads web documents that suit the user's need. Web crawlers are designed to retrieve web pages and insert them into a local repository. Crawlers are basically used to create a replica of all the visited pages, which are later processed by a search engine that indexes the downloaded pages to support quick searches. The major objective of this review paper is to throw some light on previous work on web crawling, and the article also discussed the various researches related to web crawlers.

7. REFERENCES