Effective Focused Crawling Based on Content and Link Structure Analysis
Abstract — A focused crawler traverses the web, selecting pages relevant to a predefined topic and neglecting those out of concern. While surfing the internet, it is difficult to deal with irrelevant pages and to predict which links lead to quality pages. In this paper, a technique for effective focused crawling is implemented to improve the quality of web navigation. To check the similarity of web pages with respect to topic keywords, a similarity function is used, and the priorities of extracted out-links are calculated based on metadata and the resultant pages generated by the focused crawler. The proposed work also uses a method for traversing the irrelevant pages met during crawling, to improve the coverage of a specific topic.

Keywords — focused crawler, metadata, weight table, World-Wide Web, search engine, link ranking.
I. INTRODUCTION

With the exponential growth of information on the World Wide Web, there is a great demand for developing efficient and effective methods to organize and retrieve the information available. A crawler is the program that retrieves Web pages for a search engine, and search engines are widely used today. Because of limited computing resources and limited time, the focused crawler has been developed. A focused crawler carefully decides which URLs to scan, and in what order to pursue them, based on information from previously downloaded pages. An early search engine deployed a focused crawling strategy based on the intuition that relevant pages often contain relevant links: it searches deeper when relevant pages are found, and stops searching at pages that are not relevant to the topic. Unfortunately, this traditional method of focused crawling shows an important drawback when the pages about a topic are not directly connected. In this paper, an approach for overcoming the limitations of dealing with irrelevant pages is proposed.
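Deciding whether a fetched page is "relevant" requires a similarity measure between the page and the topic. As a minimal sketch, assuming a cosine similarity over term-frequency vectors of the page text and the topic keywords (the function names and threshold here are illustrative, not the paper's exact metric):

```python
import math
import re
from collections import Counter

def term_freq(text):
    """Build a term-frequency vector from raw text (lowercased word tokens)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(page_text, topic_keywords):
    """Cosine similarity between a page and the topic keyword list."""
    page_vec = term_freq(page_text)
    topic_vec = term_freq(" ".join(topic_keywords))
    dot = sum(page_vec[t] * topic_vec[t] for t in topic_vec)
    norm = (math.sqrt(sum(v * v for v in page_vec.values()))
            * math.sqrt(sum(v * v for v in topic_vec.values())))
    return dot / norm if norm else 0.0

# A page is treated as relevant if its score exceeds a chosen threshold.
score = cosine_similarity("sports news about football matches", ["football", "sports"])
print(score > 0.1)
```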
II. RELATED WORK

The ontology-based approach [4] relates the content of a crawled page to a domain-dependent ontology, which in turn strengthens the metric used for prioritizing the URL queue. The link-structure-based method analyzes the reference information among pages to evaluate page value; famous algorithms of this kind include the PageRank algorithm [5] and the HITS algorithm [6]. There are also other experiments that measure the similarity of page contents to a specific subject using special metrics and reorder the downloaded URLs for the next crawl [7].
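To ground the link-structure family, here is a minimal sketch of the PageRank power iteration on a toy link graph (the damping factor and iteration count are conventional defaults, not values taken from [5]):

```python
def pagerank(graph, damping=0.85, iterations=50):
    """graph: dict mapping each page to the list of pages it links to."""
    pages = list(graph)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, out_links in graph.items():
            if not out_links:            # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
            else:
                for target in out_links:
                    new_rank[target] += damping * rank[page] / len(out_links)
        rank = new_rank
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(links))
```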
A major problem faced by the above focused crawlers is that it is frequently difficult to learn that some sets of off-topic documents lead reliably to highly relevant documents. This deficiency causes problems in traversing the hierarchical page layouts that commonly occur on the web. To solve this problem, Rennie and McCallum [16] used reinforcement learning to train a crawler on specified example web sites containing target documents; however, this approach puts the burden on the user to specify representative web sites. Paper [17] uses tunneling to overcome some off-topic web pages. The main purpose of these algorithms is to gather as many relevant web pages as possible.
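The tunneling idea can be made concrete with a small sketch (an illustration of the general technique, not the exact algorithm of [17]): the crawler tolerates a bounded run of consecutive off-topic pages on a path before abandoning it:

```python
def should_continue(consecutive_irrelevant, max_tunnel_depth=3):
    """Tunneling rule: tolerate a bounded run of off-topic pages."""
    return consecutive_irrelevant <= max_tunnel_depth

def crawl_path(pages_on_path, is_relevant, max_tunnel_depth=3):
    """Walk a path of pages, counting consecutive off-topic hits."""
    run = 0
    visited = []
    for page in pages_on_path:
        if is_relevant(page):
            run = 0                      # a relevant page resets the tunnel
        else:
            run += 1
        if not should_continue(run, max_tunnel_depth):
            break                        # give up on this path
        visited.append(page)
    return visited
```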
III. PROPOSED ARCHITECTURE

Fig. 1 shows the architecture of the focused crawling system [8][9].

Fig. 1. Architecture of the focused crawling system. Seed URLs initialize the URL queue; the Web page downloader fetches pages from the Internet; the parser & extractor processes each downloaded page; off-topic pages are recorded in the irrelevant table.
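As a reading aid, the sketch below wires the Fig. 1 components together in simplified form; the class layout, method names, and the score-inheritance rule for queue priorities are illustrative assumptions, with the actual modules described in the paper's later sections:

```python
import heapq

class FocusedCrawler:
    """Simplified wiring of the Fig. 1 components: URL queue,
    Web page downloader, parser & extractor, and irrelevant table."""

    def __init__(self, seed_urls, fetch, parse, relevance, threshold=0.1):
        self.queue = [(-1.0, url) for url in seed_urls]  # max-priority via negated scores
        heapq.heapify(self.queue)
        self.fetch = fetch            # downloader: url -> html
        self.parse = parse            # parser & extractor: html -> (text, out_links)
        self.relevance = relevance    # similarity function: text -> score
        self.threshold = threshold
        self.irrelevant_table = {}    # off-topic pages kept rather than discarded
        self.seen = set(seed_urls)

    def crawl(self, max_pages=100):
        relevant = []
        while self.queue and len(relevant) < max_pages:
            _priority, url = heapq.heappop(self.queue)
            text, out_links = self.parse(self.fetch(url))
            score = self.relevance(text)
            if score >= self.threshold:
                relevant.append(url)
            else:
                self.irrelevant_table[url] = score   # remember for later traversal
            for link in out_links:
                if link not in self.seen:
                    self.seen.add(link)
                    heapq.heappush(self.queue, (-score, link))  # inherit parent score
        return relevant
```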
REFERENCES
[1] S. Chakrabarti, M. van den Berg, and B. Dom, "Focused crawling: a new approach to topic-specific Web resource discovery," in Proc. 8th International WWW Conference, May 1999.
[2] P. M. E. De Bra and R. D. J. Post, "Information retrieval in the World Wide Web: making client-based searching feasible," Computer Networks and ISDN Systems, 27(2), 1994, pp. 183-192.
[3] M. Hersovici, A. Heydon, M. Mitzenmacher, and D. Pelleg, "The shark-search algorithm. An application: tailored Web site mapping," in Proc. World Wide Web Conference, Brisbane, Australia, 1998, pp. 317-326.
[4] S. Ganesh, M. Jayaraj, V. Kalyan, S. Murthy, and G. Aghila, "Ontology-based Web crawler," IEEE Computer Society, Las Vegas, Nevada, USA, pp. 337-341, 2004.
[5] S. Brin and L. Page, "The anatomy of a large-scale hypertextual Web search engine," in Proc. World-Wide Web Conference, Brisbane, Australia, 1998, pp. 107-117.
[6] J. M. Kleinberg, "Authoritative sources in a hyperlinked environment," Journal of the ACM, 46(5), 1999, pp. 604-632.
[7] J. Cho, H. Garcia-Molina, and L. Page, "Efficient crawling through URL ordering," in Proc. Seventh World-Wide Web Conference, 1998.
[8] Qu Cheng, Wang Beizhan, and Wei Pianpian, "Efficient focused crawling strategy using combination of link structure and content similarity," IEEE, 2008.
[9] P. Tao, H. Fengling, and Z. Wanli, "A new framework for focused Web crawling," Wuhan University Journal of Natural Science (WUJNS), 2006.
[10] http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words
[11] M. F. Porter, "An algorithm for suffix stripping," in Readings in Information Retrieval, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp. 313-316, 1997.
[12] A. Ghozia, H. Sorour, and A. Aboshosha, "Improved focused crawling using Bayesian object based approach," in Proc. 25th National Radio Science Conference (NRSC 2008), 2008.
[13] D. Mukhopadhyay, A. Biswas, and S. Sinha, "A new approach to design domain specific ontology based Web crawler," IEEE, 2007.
[14] M. Shokouhi, P. Chubak, and Z. Raeesy, "Enhancing focused crawling with genetic algorithms," IEEE, 2005.
[15] F. Yuan, C. Yin, and Y. Zhang, "An application of improved PageRank in focused crawler," IEEE, 2007.
[16] J. Rennie and A. McCallum, "Using reinforcement learning to spider the Web efficiently," in Proc. International Conference on Machine Learning (ICML), 1999.
[17] D. Bergmark, C. Lagoze, and A. Sbityakov, "Focused crawls, tunneling, and digital libraries," in Proc. 6th European Conf. on Research and Advanced Technology for Digital Libraries, pp. 91-106, 2002.
AUTHORS PROFILE
Anshika Pal completed her B.E. in Computer Science & Engineering and is now pursuing an M.Tech. degree in Computer Science and Engineering.