
(IJCSIS) International Journal of Computer Science and Information Security,

Vol. 2, No. 1, June 2009

Effective Focused Crawling Based on Content and Link Structure Analysis
Anshika Pal, Deepak Singh Tomar, S.C. Shrivastava
Department of Computer Science & Engineering
Maulana Azad National Institute of Technology
Bhopal, India
Emails: [email protected], [email protected], [email protected]

Abstract — A focused crawler traverses the web, selecting pages relevant to a predefined topic and neglecting those out of concern. While crawling the internet it is difficult to deal with irrelevant pages and to predict which links lead to quality pages. In this paper, a technique for effective focused crawling is implemented to improve the quality of web navigation. To check the similarity of web pages with respect to the topic keywords, a similarity function is used, and the priorities of the extracted out-links are calculated based on metadata and on the pages already generated by the focused crawler. The proposed work also uses a method for traversing the irrelevant pages met during crawling, to improve the coverage of a specific topic.

Keywords: focused crawler, metadata, weight table, World-Wide Web, search engine, link ranking.

I. INTRODUCTION

With the exponential growth of information on the World Wide Web, there is a great demand for efficient and effective methods to organize and retrieve the available information. A crawler is the program that retrieves web pages for a search engine, and search engines are widely used today. Because of limited computing resources and limited time, focused crawlers have been developed. A focused crawler carefully decides which URLs to scan and in what order to pursue them, based on information from previously downloaded pages. An early search engine deployed a focused crawling strategy based on the intuition that relevant pages often contain relevant links: it searches deeper when relevant pages are found, and stops searching at pages that are not relevant to the topic. Unfortunately, this traditional method of focused crawling shows an important drawback when the pages about a topic are not directly connected. In this paper, an approach for overcoming this limitation in dealing with irrelevant pages is proposed.

II. RELATED WORKS

Focused crawling was first introduced by Chakrabarti in 1999 [1]. The fish-search algorithm for collecting topic-specific pages was initially proposed by P. De Bra et al. [2]. Based on improvements to the fish-search algorithm, M. Hersovici et al. proposed the shark-search algorithm [3]. An association metric was introduced by S. Ganesh et al. in [4]; this metric estimates the semantic content of a URL based on a domain-dependent ontology, which in turn strengthens the metric used for prioritizing the URL queue. Link-structure-based methods analyze the reference information among pages to evaluate page value; famous algorithms of this kind include the PageRank algorithm [5] and the HITS algorithm [6]. There are other experiments which measure the similarity of page contents to a specific subject using special metrics and reorder the downloaded URLs for the next crawl [7].

A major problem faced by the above focused crawlers is that it is frequently difficult to learn that some sets of off-topic documents lead reliably to highly relevant documents. This deficiency causes problems in traversing the hierarchical page layouts that commonly occur on the web.

To solve this problem, Rennie and McCallum [16] used reinforcement learning to train a crawler on specified example web sites containing target documents. However, this approach puts the burden on the user to specify representative web sites. The work in [17] uses tunneling to get past some off-topic web pages. The main purpose of all these algorithms is to gather as many relevant web pages as possible.

III. PROPOSED ARCHITECTURE

Fig. 1 shows the architecture of the focused crawling system [8][9].

Figure 1. The architecture of the focused crawler (components: Seed URLs, URL Queue, Web Page Downloader, Internet, Parser & Extractor, Relevance Calculator, Topic-Specific Weight Table, Topic Filter, Relevant Page DB, and Irrelevant Table).
The URL queue contains the list of unvisited URLs maintained by the crawler and is initialized with the seed URLs. The web page downloader fetches URLs from the URL queue and downloads the corresponding pages from the internet. The parser and extractor extracts information such as terms and hyperlink URLs from a downloaded page. The relevance calculator computes the relevance of a page with respect to the topic and assigns scores to the URLs extracted from the page. The topic filter analyzes whether the content of a parsed page is related to the topic or not. If the page is relevant, the URLs extracted from it are added to the URL queue; otherwise they are added to the irrelevant table.
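As an illustration of how these components might fit together, the following minimal Java sketch wires a URL queue, a downloader, a parser, a relevance calculator, and a topic filter into one crawl loop. It is not the authors' implementation: the class and method names and the handling of the relevancy limit are assumptions made for the example.

```java
import java.util.*;

// Minimal sketch of the crawl loop of Fig. 1; names are illustrative, not the authors' code.
public class FocusedCrawlerSketch {
    interface PageDownloader      { String download(String url); }           // fetches raw HTML
    interface ParserExtractor     { List<String> extractUrls(String html); } // pulls out-links from a page
    interface RelevanceCalculator { double relevance(String html); }         // equation (3) against the weight table

    public static void crawl(Queue<String> urlQueue,            // initialized with the seed URLs
                             PageDownloader downloader,
                             ParserExtractor parser,
                             RelevanceCalculator relCalc,
                             double relevancyLimit,
                             List<String> relevantPageDB,
                             List<String> irrelevantTable) {     // out-links of off-topic pages, handled as in Sec. III-C
        Set<String> visited = new HashSet<>();
        while (!urlQueue.isEmpty()) {
            String url = urlQueue.poll();
            if (!visited.add(url)) continue;                     // skip URLs that were already fetched
            String html = downloader.download(url);
            if (html == null) continue;
            List<String> outLinks = parser.extractUrls(html);
            if (relCalc.relevance(html) >= relevancyLimit) {     // topic filter
                relevantPageDB.add(url);
                urlQueue.addAll(outLinks);                       // relevant page: follow its links
            } else {
                irrelevantTable.addAll(outLinks);                // irrelevant page: keep its links for Sec. III-C
            }
        }
    }
}
```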
A. Topic-Specific Weight Table Construction

The weight table defines the crawling target. The topic name is sent as a query to the Google web search engine and the first K results are retrieved [12]. The retrieved pages are parsed, stop words such as "the" and "is" are eliminated [10], words are stemmed using the Porter stemming algorithm [11], and the term frequency (tf) and document frequency (df) of each word are calculated. The term weight is computed as w = tf * df. The words are ordered by weight and a certain number of high-weight words are extracted as the topic keywords. After that, the weights are normalized as

W = W_i / W_max    (1)

where W_i is the weight of keyword i and W_max is the weight of the keyword with the highest weight. Table I shows a sample weight table for the topic "E-Business".

TABLE I. WEIGHT TABLE

Keywords       Weight
Business       1.0
Management     0.58
Solution       0.45
Corporation    0.34
Customer       0.27

Now the crawler tries to expand this initial set of keywords by adding relevant terms that it detects during the crawling process [14]. Among the pages downloaded by the focused crawler, those assigned a relevance score greater than or equal to 0.9 by equation (3) are very likely to be relevant to the search topic, because the relevance score lies between 0 and 1 and relevancy increases with this value, so web pages whose relevance score exceeds 0.9 are clearly highly relevant pages. The keyword with the highest frequency in each of these pages is extracted and added to the table with a weight equal to the relevance score of the corresponding page.
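The initial table construction described above can be prototyped in a few lines of Java. The sketch below takes the already-downloaded, already-parsed text of the first K result pages as plain strings, applies a small illustrative stop-word list (a full list [10] and Porter stemming [11] would be used in practice), computes w = tf * df, keeps the top-weighted terms, and normalizes by the maximum weight as in equation (1). All names here are illustrative assumptions, not the authors' code.

```java
import java.util.*;
import java.util.stream.*;

// Sketch: build a topic-specific weight table from K seed result pages (equation (1)).
public class WeightTableBuilder {
    private static final Set<String> STOP_WORDS = Set.of("the", "is", "a", "an", "of", "and", "to");

    public static Map<String, Double> build(List<String> seedPages, int tableSize) {
        Map<String, Integer> tf = new HashMap<>();   // total term frequency over all pages
        Map<String, Integer> df = new HashMap<>();   // number of pages containing the term

        for (String page : seedPages) {
            List<String> terms = tokenize(page);
            terms.forEach(t -> tf.merge(t, 1, Integer::sum));
            new HashSet<>(terms).forEach(t -> df.merge(t, 1, Integer::sum));
        }

        // w = tf * df, then keep the tableSize highest-weighted terms.
        Map<String, Double> raw = tf.keySet().stream()
                .collect(Collectors.toMap(t -> t, t -> (double) tf.get(t) * df.get(t)));
        double wMax = raw.values().stream().mapToDouble(Double::doubleValue).max().orElse(1.0);

        return raw.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(tableSize)
                .collect(Collectors.toMap(Map.Entry::getKey,
                                          e -> e.getValue() / wMax,      // normalization, equation (1)
                                          (a, b) -> a, LinkedHashMap::new));
    }

    private static List<String> tokenize(String text) {
        // Lower-case, split on non-letters, drop stop words; stemming is omitted in this sketch.
        return Arrays.stream(text.toLowerCase().split("[^a-z]+"))
                .filter(t -> t.length() > 1 && !STOP_WORDS.contains(t))
                .collect(Collectors.toList());
    }
}
```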
B. Relevance Calculation

1) Page Relevance

The weight of the words in a page that correspond to keywords in the table is calculated. The same word in different locations of a page carries different information; for example, the title text expresses the topic of a page better than the common text. For this reason, weights are adjusted as follows [8]:

f_kp = 2 if keyword k occurs in the title text of page p, and f_kp = 1 if it occurs in the common (body) text    (2)

f_kp is the weight of keyword k in a given location of page p, so the overall weight w_kp of keyword k in page p is obtained by adding the weights of k over its different locations. An example is shown in Fig. 2.

Figure 2. Weighting method. Page (p): <html><title>java documentation</title><body>java technologies</body></html>. The weight of the term "java" in page p = 2 (weight of "java" in the title) + 1 (weight of "java" in the body) = 3.

Now a cosine similarity measure is used to calculate the relevance of the page to a particular topic:

Relevance(t, p) = ( Σ_{k ∈ t∩p} w_kt · w_kp ) / ( sqrt( Σ_{k ∈ t} (w_kt)² ) · sqrt( Σ_{k ∈ p} (w_kp)² ) )    (3)

Here t is the topic-specific weight table, p is the web page under investigation, and w_kt and w_kp are the weights of keyword k in the weight table and in the web page respectively. Relevance(t, p) lies between 0 and 1, and the relevancy increases as this value increases [15]. If the relevance score of a page is greater than the relevancy limit specified by the user, the page is added to the database as a topic-specific page.
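A compact Java version of equations (2) and (3), assuming the page has already been split into title text and body text; the helper names and the tokenization are illustrative assumptions rather than the paper's code.

```java
import java.util.*;

// Sketch: page relevance via location-weighted term counts (eq. 2) and cosine similarity (eq. 3).
public class RelevanceCalculatorSketch {

    // w_kp: 2 per occurrence in the title, 1 per occurrence in the body.
    public static Map<String, Double> pageWeights(String title, String body) {
        Map<String, Double> w = new HashMap<>();
        for (String t : tokens(title)) w.merge(t, 2.0, Double::sum);
        for (String t : tokens(body))  w.merge(t, 1.0, Double::sum);
        return w;
    }

    // Relevance(t, p) as in equation (3): cosine of the weight-table vector and the page weight vector.
    public static double relevance(Map<String, Double> weightTable, Map<String, Double> pageW) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : weightTable.entrySet()) {
            Double wkp = pageW.get(e.getKey());
            if (wkp != null) dot += e.getValue() * wkp;     // only keywords present in both t and p
        }
        double normT = Math.sqrt(weightTable.values().stream().mapToDouble(v -> v * v).sum());
        double normP = Math.sqrt(pageW.values().stream().mapToDouble(v -> v * v).sum());
        return (normT == 0 || normP == 0) ? 0.0 : dot / (normT * normP);
    }

    private static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+"))
            if (!t.isEmpty()) out.add(t);
        return out;
    }
}
```

For the page of Fig. 2 and a weight table containing only "java" with weight 1.0, pageWeights gives java -> 3.0 and relevance returns 1.0, since the two one-dimensional vectors point in the same direction.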
2) Links Ranking

Links ranking assigns scores to the unvisited links extracted from a downloaded page, using information from the pages that have already been crawled and the metadata of each hyperlink. The metadata is composed of the anchor text and the HREF information [15].

LinkScore(j) = URLScore(j) + AnchorScore(j) + LinksFromRelevantPageDB(j) + [ Relevance(p_1) + Relevance(p_2) + ... + Relevance(p_n) ]    (4)

LinkScore(j) is the score of link j; URLScore(j) is the relevance between the HREF information of j and the topic keywords; AnchorScore(j) is the relevance between the anchor text of j and the topic keywords (both computed with equation (3)); LinksFromRelevantPageDB(j) is the number of links from relevant crawled pages to j [8]; and p_i is the i-th parent page of URL j [4], a parent page being a page from which the link was extracted.
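The following Java sketch assembles equation (4) from its four parts. It reuses the cosine relevance of equation (3) through a functional parameter and treats the URL string and anchor text as small bags of words; all names, and the toy relevance function in main, are illustrative assumptions.

```java
import java.util.*;
import java.util.function.ToDoubleFunction;

// Sketch of equation (4): LinkScore = URLScore + AnchorScore + LinksFromRelevantPageDB + sum of parent relevances.
public class LinkRankerSketch {

    public static double linkScore(String href,
                                   String anchorText,
                                   int linksFromRelevantPageDB,        // count of relevant crawled pages linking to j
                                   List<Double> parentRelevances,      // Relevance(p_i) of each parent page of j
                                   ToDoubleFunction<String> relevance) // equation (3) applied to a piece of text
    {
        double urlScore = relevance.applyAsDouble(href.replace('/', ' ').replace('.', ' '));
        double anchorScore = relevance.applyAsDouble(anchorText);
        double parentSum = parentRelevances.stream().mapToDouble(Double::doubleValue).sum();
        return urlScore + anchorScore + linksFromRelevantPageDB + parentSum;
    }

    public static void main(String[] args) {
        // Toy relevance function: fraction of topic keywords appearing in the text (a stand-in for eq. 3).
        Set<String> topic = Set.of("business", "commerce", "enterprise");
        ToDoubleFunction<String> rel = text -> {
            String lower = text.toLowerCase();
            return topic.stream().filter(lower::contains).count() / (double) topic.size();
        };
        double score = linkScore("http://example.org/business/news", "business commerce news",
                                 3, List.of(0.8, 0.6), rel);
        System.out.println("LinkScore = " + score);
    }
}
```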
C. Dealing with Irrelevant Pages

Though focused crawling is quite efficient, it has some drawbacks. In Fig. 3 we can see that at level 2 there are some irrelevant pages (P, Q) which cut the relevant pages (c, e) at level 3 and (f, g) at level 4 off from the crawling path [13].

Figure 3. An example structure of web pages (relevant and irrelevant pages arranged in levels: irrelevant pages P and Q, the children a, b, c, d, e of P, and the pages f, g, h below them).

As a solution to this problem, we design an algorithm that allows the crawler to follow several bad pages in order to reach a good page. The working principle of this algorithm is to keep crawling up to a given maxLevel from the irrelevant page.

Pseudo code of the algorithm:

1.  if (page is irrelevant)
2.  {
3.      initialize level = 0;
4.      url_list = extract_urls(page);                       // extract all the URLs from the page
5.      for each u in url_list {
6.          compute LinkScore(u) using equation (4);
7.          IrrelevantTable.insert(u, LinkScore(u), level);  // insert u with its link score and level value
        }
8.      reorder IrrelevantTable according to LinkScore;
9.      while (IrrelevantTable.size > 0)
10.     {
11.         get the URL with the highest score and call it Umax;
12.         if (Umax.level <= maxLevel)
13.         {
14.             page = downloadPage(Umax);                   // download the URL Umax
15.             calculate the relevance of the page using equation (3);
16.             if (page is relevant) {
17.                 RelevantPageDB.add(page);                // put the page into the relevant page database
18.                 if (Umax.level < maxLevel) {
19.                     level++;
20.                     url_list = extract_urls(page);
21.                     for each u in url_list {
22.                         compute LinkScore(u) using equation (4);
23.                         IrrelevantTable.insert(u, LinkScore(u), level); }
24.                     reorder IrrelevantTable according to LinkScore; }
25.             } else {
26.                 for each u in IrrelevantTable {
27.                     if (LinkScore(u) <= LinkScore(Umax) && u.level == Umax.level)
28.                         u.level++; }
29.             }
30.     }   }   }

The main component of the architecture in Fig. 1 that helps to execute the above algorithm is the irrelevant table, which allows irrelevant pages to be included in the search path.

Table II shows an example of the above algorithm for page P of Fig. 3.

TABLE II. IRRELEVANT TABLE

Link    LinkScore    Level
c       5            0
e       4            0
b       3            0 -> 1
a       2            0 -> 1
d       1            0 -> 1
f       -            1
g       -            1
h       -            1

Fig. 3 shows that page P is irrelevant, so according to the process of lines 1-8 of the algorithm, the URLs it contains (a, b, c, d, e) are all extracted and inserted into the table with level value 0 and their calculated link scores, which are assumed to be 1, 2, 3, 4 and 5 in this example; the table is then sorted (first five entries of Table II). Next, pages c and e are downloaded and their extracted URLs (f, g and h) are added to the table with level value 1 and their corresponding link scores (process of lines 9-24; last three entries of Table II).

The process of lines 25-28 is based on the assumption that if a page is irrelevant and any of its children, say v, is also irrelevant, then all children of this page whose link score is not greater than that of v and whose level value equals the level of v are less important. For these URLs it would be unnecessary to continue the crawling process up to the given maxLevel, so we can directly increase their level value without downloading them; in other words, we reduce the crawling depth of these less meaningful paths.

In Table II we see that the levels of pages b, a and d are updated from 0 to 1, because when b is downloaded it turns out to be irrelevant, and according to line 27 the levels of b, a and d are increased.

Clearly, this algorithm can improve the effectiveness of focused crawling by expanding its reach, and its efficiency by reducing the crawling depth along less relevant URLs, so that unnecessary downloading of too many off-topic pages is avoided and better coverage may be achieved.
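A minimal Java sketch of the irrelevant table and the level-bumping rule of lines 25-28: it stores each extracted URL with its LinkScore and level, always hands back the highest-scoring entry, and bumps the levels of same-level, lower-scoring siblings when a downloaded entry turns out to be irrelevant. Marking entries as visited so that the selection loop terminates is an assumption of this sketch, as are all class and method names.

```java
import java.util.*;

// Sketch of the irrelevant table used in Sec. III-C (illustrative, not the authors' code).
public class IrrelevantTableSketch {

    static final class Entry {
        final String url;
        final double linkScore;   // from equation (4)
        int level;                // crawling depth below the irrelevant page
        boolean visited;          // assumption: processed entries are skipped so the loop terminates

        Entry(String url, double linkScore, int level) {
            this.url = url;
            this.linkScore = linkScore;
            this.level = level;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    // Lines 5-7: insert an extracted URL with its link score and level.
    public void insert(String url, double linkScore, int level) {
        entries.add(new Entry(url, linkScore, level));
    }

    // Line 11: pick the unvisited entry with the highest LinkScore.
    public Entry nextBest() {
        return entries.stream()
                .filter(e -> !e.visited)
                .max(Comparator.comparingDouble(e -> e.linkScore))
                .orElse(null);
    }

    // Lines 25-28: when the page for 'irrelevant' turns out to be off-topic, bump the level
    // of every entry at the same level whose link score is not higher.
    public void bumpLevels(Entry irrelevant) {
        for (Entry e : entries) {
            if (e.linkScore <= irrelevant.linkScore && e.level == irrelevant.level) {
                e.level++;
            }
        }
    }
}
```

With the scores of Table II, calling bumpLevels on the entry for b raises the levels of b, a and d from 0 to 1, matching the table.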
IV. EXPERIMENTAL RESULTS

The experiments are conducted in a Java environment. A Breadth-First Search (BFS) crawler is also implemented for performance comparison. The target topics were E-Business, Nanotechnology, Politics, and Sports. For each topic the crawler started with 10 seed sample URLs and crawled about one thousand web pages. Google is used to obtain the seed URLs. The parameters used in this experiment are: weight table size = 50, maxLevel = 2, and a relevancy limit equal to half of the average LinkScore of the seed URLs, which lies between 0.3 and 0.5 in our experiments.
The precision metric is used to evaluate crawler performance:

precision_rate = relevant_pages / pages_downloaded    (5)

The precision ratios varied among the different topics and seed sets, possibly because of the linkage density of pages under a particular topic or the quality of the seed sets.
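The running precision curves of Figs. 4-7 can be produced by recording, after every downloaded page, the ratio of relevant pages to the total pages downloaded so far, as in equation (5). A tiny illustrative helper (names are assumptions):

```java
import java.util.ArrayList;
import java.util.List;

// Records precision_rate = relevant_pages / pages_downloaded after each download (equation (5)).
public class PrecisionTracker {
    private int downloaded = 0;
    private int relevant = 0;
    private final List<Double> curve = new ArrayList<>();

    public void record(boolean pageWasRelevant) {
        downloaded++;
        if (pageWasRelevant) relevant++;
        curve.add((double) relevant / downloaded);
    }

    public List<Double> precisionCurve() { return curve; }   // one point per downloaded page
}
```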
Table III shows the final precision rates of the four topics after crawling one thousand pages.

TABLE III. THE FINAL PRECISION RATES OF THE TOPICS

Target Topic      Focused Crawler    BFS Crawler
E-Business        0.83               0.15
Nanotechnology    0.56               0.25
Politics          0.85               0.25
Sports            0.77               0.36

We illustrate the crawled results on two-dimensional graphs.

Figures 4, 5, 6 and 7. Precision of the two crawlers for the topics E-Business, Nanotechnology, Politics and Sports, respectively.

Figs. 4, 5, 6 and 7 show the performance of the two crawlers for the topics E-Business, Nanotechnology, Politics, and Sports respectively. These graphs clearly show that in the early stages of the crawl the precision rate of our focused crawler is not much higher than that of BFS, but the improvement in performance appears after crawling the first few hundred pages. This is expected, since in the primary steps of the crawling process there are not many pages in the URL queue and so there is some noise; as the number of downloaded pages increases, the curves become smoother and the precision rate increases.

V. CONCLUSION & FUTURE WORK

This paper presented a method for focused crawling that allows the crawler to go through several irrelevant pages to get to the next relevant one when the current page is irrelevant. From the experimental results, we can conclude that our approach performs better than the BFS crawler.

Although the initial results are encouraging, there is still a lot of work to do to improve the crawling efficiency. A major open issue for future work is more extensive testing with a large volume of web pages. Future work also includes code optimization and URL queue optimization, because crawler efficiency depends not only on retrieving the maximum number of relevant pages but also on finishing the operation as soon as possible.
ACKNOWLEDGMENT

The research presented in this paper would not have been possible without our college, MANIT, Bhopal. We wish to express our gratitude to all the people who helped turn the World-Wide Web into the useful and popular distributed hypertext it is. We also wish to thank the anonymous reviewers for their valuable suggestions.

REFERENCES

[1] S. Chakrabarti, M. van den Berg, and B. Dom, "Focused crawling: a new approach to topic-specific Web resource discovery," in Proc. 8th International WWW Conference, May 1999.
[2] P. M. E. De Bra and R. D. J. Post, "Information Retrieval in the World-Wide Web: Making client-based searching feasible," Computer Networks and ISDN Systems, 27(2), 1994, pp. 183-192.
[3] M. Hersovici, A. Heydon, M. Mitzenmacher, and D. Pelleg, "The Shark-Search Algorithm. An Application: Tailored Web Site Mapping," in Proc. World Wide Web Conference, Brisbane, Australia, 1998, pp. 317-326.
[4] S. Ganesh, M. Jayaraj, V. Kalyan, S. Murthy, and G. Aghila, "Ontology-based Web Crawler," IEEE Computer Society, Las Vegas, Nevada, USA, 2004, pp. 337-341.
[5] S. Brin and L. Page, "The anatomy of a large-scale hypertextual Web search engine," in Proc. World-Wide Web Conference, Brisbane, Australia, 1998, pp. 107-117.
[6] J. M. Kleinberg, "Authoritative Sources in a Hyperlinked Environment," Journal of the ACM, 46(5), 1999, pp. 604-632.
[7] J. Cho, H. Garcia-Molina, and L. Page, "Efficient crawling through URL ordering," in Proc. Seventh World-Wide Web Conference, 1998.
[8] Qu Cheng, Wang Beizhan, and Wei Pianpian, "Efficient focused crawling strategy using combination of link structure and content similarity," IEEE, 2008.
[9] P. Tao, H. Fengling, and Z. Wanli, "A new framework for focused Web crawling," Wuhan University Journal of Natural Sciences (WUJNS), 2006.
[10] http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words
[11] M. F. Porter, "An algorithm for suffix stripping," in Readings in Information Retrieval, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1997, pp. 313-316.
[12] A. Ghozia, H. Sorour, and A. Aboshosha, "Improved focused crawling using Bayesian object based approach," in Proc. 25th National Radio Science Conference (NRSC 2008).
[13] D. Mukhopadhyay, A. Biswas, and S. Sinha, "A new approach to design domain specific ontology based web crawler," IEEE, 2007.
[14] M. Shokouhi, P. Chubak, and Z. Raeesy, "Enhancing Focused Crawling with Genetic Algorithms," IEEE, 2005.
[15] F. Yuan, C. Yin, and Y. Zhang, "An application of Improved PageRank in focused Crawler," IEEE, 2007.
[16] J. Rennie and A. McCallum, "Using reinforcement learning to spider the web efficiently," in Proc. International Conference on Machine Learning (ICML), 1999.
[17] D. Bergmark, C. Lagoze, and A. Sbityakov, "Focused Crawls, Tunneling, and Digital Libraries," in Proc. 6th European Conference on Research and Advanced Technology for Digital Libraries, 2002, pp. 91-106.

AUTHORS PROFILE

Anshika Pal completed her B.E. in Computer Science & Engineering and is now pursuing an M.Tech degree in Computer Science and Engineering.

Deepak Singh Tomar completed his M.Tech and B.E. in Computer Science & Engineering, is pursuing a PhD in Computer Science & Engineering, and works as senior faculty in the Computer Science & Engineering Department, with a total of 14 years of teaching experience (PG & UG).

Dr. S. C. Shrivastava is Professor & Head of the Computer Science & Engineering Department.
