Seminar Report on
Search Engine and Web Crawler
Mehta Ishani
130040701003
Abstract
The World Wide Web is a rapidly growing and changing information
source. Due to the dynamic nature of the Web, it becomes harder to find
relevant and recent information.
Search engines are the primary gateways of information access on the Web.
Today search engines have become a daily necessity for most people, whether
for navigating the Internet or for finding anything. Search engines answer
millions of queries every day. Whatever comes to mind, we just enter a
keyword or combination of keywords to trigger the search and get relevant
results in seconds, without knowing the technology behind it. I searched for
"search engine" and it returned 68,900 results. In addition, the engine
returned some sponsored results along the side of the page, as well as a
spelling suggestion. All in 0.36 seconds. And for popular queries the engine
is even faster. For example, searches for World Cup or dance shows (both
recent events) took less than 0.2 seconds each.
To engineer a search engine is a challenging task. Web crawler is an
indispensable part of search engine. A web crawler is a program that, given
one or more seed URLs, downloads the web pages associated with these
URLs, extracts any hyperlinks contained in them, and recursively continues
to download the web pages identified by these hyperlinks. Web crawlers are
an important component of web search engines, where they are used to
collect the corpus of web pages indexed by the search engine. Moreover,
they are used in many other applications that process large numbers of web
pages, such as web data mining, comparison shopping engines, and so on.
Introduction to Search Engine
A search engine is a tool that allows people to find information on the World
Wide Web. It is a website that you can use to look up web pages, like the
yellow pages for the Internet. More formally, a web search engine is a software
system designed to search for information on the World Wide Web.
Assume you are reading a book and want to find references to a specific
word in the book. What do you do? You turn the pages to the end and look
in the index! You will then locate the word in the index, find the page
numbers mentioned there and flip to the corresponding pages.
Search Engines also work in a similar way.
Figure 1: Telephone directory
Search engines are constantly building and updating their index to the World
Wide Web. They do this by using "spiders" that "crawl" the web and fetch
web pages. The words used in these web pages are then added to the index,
along with a record of where the words came from. [1]
How It Works
A search engine operates in the following order:
1. Web crawling
2. Indexing
3. Searching
Web search engines work by storing information about many web pages.
These pages are retrieved by a Web crawler (sometimes also known as a
spider), an automated program that follows every link on a site.
The search engine then analyzes the contents of each page to determine how
it should be indexed (for example, words can be extracted from the titles,
page content, headings, or special fields called meta tags).
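The indexing step described above can be sketched as an inverted index: a map from each word to the set of pages containing it. This is a minimal illustration in Python with hypothetical page contents; real engines also store term positions, weights, and fields such as titles and meta tags.

```python
from collections import defaultdict

def build_index(pages):
    """Map each word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

# Hypothetical pages standing in for crawled documents.
pages = {
    "a.html": "web crawler fetches pages",
    "b.html": "search engine indexes pages",
}
index = build_index(pages)
print(sorted(index["pages"]))  # ['a.html', 'b.html']
```

At query time, the engine looks a keyword up in this index instead of scanning pages directly, which is what makes sub-second answers possible.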
Figure 2: Working flow of a search engine
When a user enters a query into a search engine (typically by using
keywords), the engine examines its index and provides a listing of best-
matching web pages according to its criteria, usually with a short summary
containing the document's title and sometimes parts of the text. The index is
built from the information stored with the data.
Since 2007 the Google.com search engine has allowed one to search by date
by clicking 'Show search tools' in the leftmost column of the initial search
results page and then selecting the desired date range.
Most search engines support the use of the boolean operators AND, OR and
NOT to further specify the search query. Boolean operators are for literal
searches that allow the user to refine and extend the terms of the search.
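Over an inverted index, the boolean operators map directly onto set operations. A self-contained sketch with a tiny hand-built index (the words and page names are hypothetical):

```python
# A tiny hand-built inverted index: word -> set of pages containing it.
index = {
    "web":     {"a.html", "b.html"},
    "crawler": {"a.html"},
    "engine":  {"b.html"},
}

web_and_crawler = index["web"] & index["crawler"]  # AND: intersection
web_or_engine   = index["web"] | index["engine"]   # OR: union
web_not_engine  = index["web"] - index["engine"]   # NOT: set difference
print(web_and_crawler)  # {'a.html'}
```

Each operator narrows or widens the candidate set before any ranking is applied.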
As well, natural language queries allow the user to type a question in the
same form one would ask it of a human; Ask.com is one such site.
The usefulness of a search engine depends on the relevance of the result set
it gives back. While there may be millions of web pages that include a
particular word or phrase, some pages may be more relevant, popular, or
authoritative than others. Most search engines employ methods to rank the
results to provide the "best" results first.
Search engines that do not accept money for their search results make
money by running search related ads alongside the regular search engine
results. The search engines make money every time someone clicks on one
of these ads. [2]
Major Search Engines - A Comparison
Today there are many search engines available to web searchers. What
makes one search engine different from another? Following are some
important measures [3].
 The contents of the database are a crucial factor in determining whether
or not we will succeed in finding the information we need, because when
we search, we are not actually searching the Web directly. Rather, we are
searching a cache of the Web: a database that contains information about
all the websites visited by that search engine's spider or crawler.
 Size is another important measure. How many web pages has the spider
visited, scanned, and stored in the database? Some of the larger search
engines have databases covering over three billion web pages, while the
databases of smaller search engines cover half a billion or less.
 Another important measure is how up to date the database is. As we
know, the Web is continuously changing and growing: new websites
appear, old sites vanish, and existing sites modify their content. So the
information stored in the database becomes out of date unless the search
engine's spider keeps up with these changes.
 In addition, the ranking algorithm used by the search engine determines
whether the most relevant search results appear toward the top of the
results list.
Figure 3. Google logo.
Google has been in the search game a long time, and it has the highest
market share among search engines (about 81%) [3].
1) Its web-crawler-based service provides both comprehensive coverage of
the Web and great relevancy.
2) Google is much better than the other engines at determining whether a
link is an artificial link or a true editorial link.
3) Google gives much importance to sites that add fresh content on a
regular basis. This is why Google likes blogs, especially popular ones.
4) Google prefers informational pages to commercial sites.
5) A page on a site, or on a subdomain of a site, with significant age or links
can rank much better than it otherwise would, even with no external
citations.
6) It has aggressive duplicate-content filters that filter out many pages with
similar content.
7) Crawl depth is determined not only by link quantity, but also by link
quality. Excessive low-quality links may make your site less likely to be
crawled deeply, or even included in the index.
8) In addition, we can search for twelve different file formats, cached pages,
images, news and Usenet group postings.
Figure 4 Yahoo logo.
Yahoo has been in the search game for many years [3].
1) It has the second-largest market share among search engines
(about 12%).
2) When it comes to counting backlinks, Yahoo is the most accurate search
engine.
3) Yahoo is better than MSN, but not as good as Google, at determining
whether a link is artificial or natural.
4) The crawl rate of Yahoo's spiders is at least 3 times faster than that of
Google's spiders.
5) Yahoo tends to prefer commercial pages to informational pages,
compared with Google.
6) The Yahoo search engine gives more importance to "exact matching" than
to "concept matching", which makes it slightly more susceptible to
spamming.
7) Yahoo gives more importance to meta keywords and description tags.
Figure 5: MSN logo.
1) MSN has a 3% share of the total search engine market [3].
2) MSN Search uses its own web database and also has separate News,
Images, and Local databases.
3) Its strengths include this large unique database, its "Search Builder"
query builder and Boolean searching, cached copies of web pages
(including the date cached), and automatic local search options.
4) The spider crawls only the beginning of each page (as opposed to the
other two search engines, which crawl the entire content), and the
number of pages found in its index or database is extremely low.
5) It is bad at determining whether a link is natural or artificial in nature.
6) Because it is weak at link analysis, it places too much weight on the
page content.
7) New sites that are generally untrusted in other systems can rank quickly
in MSN Search. But this also makes it more susceptible to spam.
8) Another downside of this search engine is its habit of supplying
results based on geo-targeting, which makes it extremely hard to
determine whether the results we see are the same ones everybody sees.
Figure 6: Ask Jeeves logo.
1) The Ask search engine has the lowest share (about 1%) of the total
search engine market [3].
2) Ask is a topical search site. It gives more importance to sites that are
linked to topical communities.
3) Ask is more susceptible to spamming.
4) Since Ask is smaller and more specialized than other search engines, it is
wise to approach this engine more from a networking or marketing
perspective.
Figure 7: Live Search logo.
1) Launched in September 2006.
2) Live Search (formerly Windows Live Search) is the name of Microsoft's
web search engine, successor to MSN Search, designed to compete with
the industry leaders Google and Yahoo.
3) It also allows the user to save searches and see them updated
automatically on Live.com.
Figure 8: Bing logo.
1) Launched in July 2009 by Microsoft as the successor to Live Search.
2) Features like 'wiki' suggestions, 'visual search', and 'related searches' can
be very useful.
Introduction to Web Crawler
A crawler is a program that visits websites and reads their pages and other
information in order to create entries for a search engine index. It is also
known as a "spider" or a "bot" (short for "robot").
Spider – a program that, like a browser, downloads web pages.
Crawler – a program that automatically follows the links on web pages.
Robot – an automated program that visits websites, guided by search
engine algorithms; it combines the tasks of crawler and spider to help
index web pages for search engines. [4]
 Why Crawlers?
Figure 9: Result of searching the term "web crawler" in Google.
Crawling means gathering pages from the Internet in order to index them.
It has two main objectives:
• fast gathering
• efficient gathering [5]
The Internet holds a wide expanse of information, and finding relevant
information requires an efficient mechanism. Web crawlers provide that
capability to the search engine.
 Features
Features a crawler must provide
Robustness: The Web contains servers that create spider traps, which are
generators of web pages that mislead crawlers into getting stuck fetching
an infinite number of pages in a particular domain. Crawlers must be
designed to be resilient to such traps. Not all such traps are malicious;
some are the inadvertent side-effect of faulty website development.
Politeness: Web servers have both implicit and explicit policies
regulating the rate at which a crawler can visit them. These politeness
policies must be respected.
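The explicit half of the politeness policy is normally published in a site's robots.txt file, which Python's standard urllib.robotparser can evaluate. In this sketch the robots.txt content is a hypothetical inline string rather than a live fetch:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, parsed inline instead of fetched over HTTP.
rules = """\
User-agent: *
Disallow: /private/
"""
rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "http://example.com/private/page.html"))  # False
print(rp.can_fetch("*", "http://example.com/public/page.html"))   # True
```

A polite crawler checks can_fetch before every request and skips URLs the site has disallowed.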
Features a crawler should provide
Distributed: The crawler should have the ability to execute in a
distributed fashion across multiple machines.
Scalable: The crawler architecture should permit scaling up the crawl
rate by adding extra machines and bandwidth.
Performance and efficiency: The crawl system should make efficient
use of various system resources including processor, storage and network
bandwidth.
Quality: Given that a significant fraction of all web pages are of poor
utility for serving user query needs, the crawler should be biased towards
fetching "useful" pages first.
Freshness: In many applications, the crawler should operate in
continuous mode: it should obtain fresh copies of previously fetched
pages.
Extensible: Crawlers should be designed to be extensible in many ways
– to cope with new data formats, new fetch protocols, and so on.
This demands that the crawler architecture be modular. [5]
Architecture of Crawler
 Flow of basic sequential crawler
Web crawlers are mainly used to index the links of all the visited
pages for later processing by a search engine. Such search engines
rely on massive collections of web pages that are acquired with the
help of web crawlers, which traverse the web by following hyperlinks
and storing downloaded pages in a large database that is later indexed
for efficient execution of user queries. Despite the numerous
applications for Web crawlers, at the core they are all fundamentally
the same. Following is the process by which Web crawlers work [6]:
1) Download the Web page.
2) Parse through the downloaded page and retrieve all the links.
3) For each link retrieved, repeat the process.
Figure 10 shows the flow of a basic sequential crawler. The crawler
maintains a list of unvisited URLs called the frontier.
The list is initialized with seed URLs which may be provided by a
user or another program. Each crawling loop involves picking the next
URL to crawl from the frontier, fetching the page corresponding to the
URL through HTTP, parsing the retrieved page to extract the URLs
and application specific information, and finally adding the unvisited
URLs to the frontier.
Before the URLs are added to the frontier they may be assigned a
score that represents the estimated benefit of visiting the page
corresponding to the URL. The crawling process may be terminated
when a certain number of pages have been crawled. If the crawler is
ready to crawl another page and the frontier is empty, the situation
signals a dead-end for the crawler. The crawler has no new page to
fetch and hence it stops. [6]
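The crawling loop just described can be sketched as follows. The fetch function is injected so the example runs against a hypothetical in-memory link graph instead of live HTTP; a real crawler would fetch and parse pages over the network.

```python
from collections import deque

def crawl(seeds, fetch, max_pages=10):
    """Basic sequential crawler: keep a frontier of unvisited URLs,
    fetch the next one, extract its links, and repeat."""
    frontier = deque(seeds)          # the frontier, initialized with seed URLs
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()     # pick the next URL to crawl
        if url in visited:
            continue
        links = fetch(url)           # fetch + parse: returns extracted links
        visited.append(url)
        frontier.extend(l for l in links if l not in visited)
    return visited                   # an empty frontier signals a dead end

# Hypothetical link graph standing in for the Web.
web = {"a": ["b", "c"], "b": ["a"], "c": []}
print(crawl(["a"], lambda u: web.get(u, [])))  # ['a', 'b', 'c']
```

The max_pages cap corresponds to terminating the crawl after a certain number of pages; scoring URLs before adding them to the frontier would slot in just before the extend call.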
Figure 10 Flow of a basic sequential crawler
The multi-threaded crawler model needs to deal with an empty
frontier just like a sequential crawler [6].
Figure 11 A multi-threaded crawler model
 High level architecture
Here, the multi-threaded downloader downloads the web pages from
the WWW, and parsers decompose the web pages into URLs, contents,
titles, etc. The URLs are queued and sent to the downloader using a
scheduling algorithm. The downloaded data are stored in a database [7].
Figure 12: High-level architecture of a web crawler
The design of the downloader-scheduling algorithm is crucial: too many
downloader objects will exhaust resources and make the system slow,
while too few will degrade performance. The scheduler algorithm is as
follows [7]:
1) The system allocates a pre-defined number of downloader objects.
2) The user inputs a new URL to start the crawler.
3) If there are new URLs to be processed, a check is made to see whether
any downloader object is free. If so, assign a new URL to it and set its
status as busy; else go to step 6.
4) After a downloader object downloads the contents of a web page, set
its status as free.
5) If any downloader object runs longer than an upper time limit, abort
it and set its status as free.
6) If there are more than a predefined number of downloaders, or if all
the downloader objects are busy, then allocate new threads and
distribute the downloaders among them.
7) Continue allocating new threads and free threads to the downloaders
until the number of downloaders falls below the threshold value,
provided the number of threads in use is kept under a limit.
8) Go to step 3.
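A fixed pool of downloader objects with a per-task time limit can be approximated with Python's ThreadPoolExecutor. This is a sketch, not the report's Java implementation: the download function is a stub rather than real HTTP, the pool size and timeout are illustrative, and the timeout on result() stands in for the "abort long-running downloader" step.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def download(url):
    # Stub downloader; a real one would fetch the page over HTTP.
    return f"<html>contents of {url}</html>"

urls = ["http://example.com/1", "http://example.com/2", "http://example.com/3"]
results = {}
# A fixed number of downloader workers (step 1 of the scheduler).
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = {pool.submit(download, u): u for u in urls}
    for fut, url in futures.items():
        try:
            # Give up on a downloader that exceeds the time limit (step 5).
            results[url] = fut.result(timeout=5)
        except FutureTimeout:
            results[url] = None
print(len(results))  # 3
```

The executor handles the thread allocation and reuse that steps 6 and 7 describe manually.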
Crawling Strategies
There are mainly four types of crawling strategies, described below [8]:
1) Breadth-First Crawling
Figure 13: Breadth-first crawling
This algorithm starts at the root URL and visits all the neighbouring
URLs at the same level. If the goal is reached, it reports success and the
search terminates. If not, the search proceeds down to the next level,
sweeping across the neighbouring URLs at that level, and so on until the
goal is reached. If all the URLs have been searched but the objective is
not met, failure is reported.
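The level-by-level order comes from using a FIFO queue for the frontier. A minimal sketch over a hypothetical two-level link structure:

```python
from collections import deque

def bfs_crawl(root, links):
    """Visit URLs level by level using a FIFO queue."""
    queue, order, seen = deque([root]), [], {root}
    while queue:
        url = queue.popleft()
        order.append(url)
        for child in links.get(url, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order

# Hypothetical link structure: root links to a and b, which link one level deeper.
links = {"root": ["a", "b"], "a": ["a1"], "b": ["b1"]}
print(bfs_crawl("root", links))  # ['root', 'a', 'b', 'a1', 'b1']
```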
2) Depth-First Crawling
Figure 14: Depth-first crawling
It starts at the root URL and traverses deeper through the child URLs. If
there is more than one child, priority is given to the leftmost child, and
the crawler traverses deep until no more children are available. It then
backtracks to the next unvisited node and continues in a similar
manner.
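The same traversal with an explicit stack gives the depth-first, leftmost-child-first order; children are pushed in reverse so the leftmost is popped first. The link structure is the same hypothetical one as before:

```python
def dfs_crawl(root, links):
    """Traverse deep, leftmost child first, using an explicit stack."""
    stack, order, seen = [root], [], set()
    while stack:
        url = stack.pop()
        if url in seen:
            continue
        seen.add(url)
        order.append(url)
        # Push children reversed so the leftmost child is visited first.
        stack.extend(reversed(links.get(url, [])))
    return order

links = {"root": ["a", "b"], "a": ["a1"], "b": ["b1"]}
print(dfs_crawl("root", links))  # ['root', 'a', 'a1', 'b', 'b1']
```

Compared with the breadth-first order, the crawler finishes the whole subtree under "a" before backtracking to "b".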
3) Repetitive Crawling
Once pages have been crawled, some systems require the process to be
repeated periodically so that indexes are kept up to date. This may be
achieved by launching a second crawl in parallel while constantly
updating the "Index List".
4) Targeted Crawling
Here the main objective is to retrieve the greatest number of pages
relating to a particular subject while using the minimum bandwidth.
Most search engines use crawling heuristics in order to target certain
types of pages on specific topics.
 Crawling Policies
The characteristics of the Web that make crawling difficult are:
1) its large volume, and
2) its fast rate of change.
To deal with these difficulties, a web crawler follows these policies [5]:
A Selection Policy that states which pages to download.
A Re-Visit Policy that states when to check for changes in pages.
A Politeness Policy that states how to avoid overloading websites.
A Parallelization Policy that states how to coordinate distributed
web crawlers.
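The politeness policy is commonly implemented as a minimum delay between successive requests to the same host. A small sketch; the 2-second default is an illustrative choice, not a standard, and the timestamps are hypothetical:

```python
from urllib.parse import urlparse

def wait_needed(url, now, last_hit, min_delay=2.0):
    """Seconds to wait before politely requesting `url`,
    given the last time each host was contacted."""
    host = urlparse(url).netloc
    if host not in last_hit:
        return 0.0                       # never seen this host: no wait
    elapsed = now - last_hit[host]
    return max(0.0, min_delay - elapsed)  # remaining part of the delay

last_hit = {"example.com": 100.0}        # host -> timestamp of last request
print(wait_needed("http://example.com/a", 101.0, last_hit))  # 1.0
print(wait_needed("http://other.com/b", 101.0, last_hit))    # 0.0
```

Keying the delay on the host rather than the URL lets the crawler stay busy on other sites while one site's delay elapses, which is where the parallelization policy comes in.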
Implementation
I have developed Web crawler application java works on Windows
operating system. It makes use net bins or any java compactable IDE to run.
For database connectivity it uses my sql - wamp server interface. The
currently proposed web crawler uses breadth first search crawling to search
the links. The proposedweb crawler is deployed on a client machine.
Once the start the IDE and run the program, an automated browsing process
is initiated. The HTML page contents of rediffmail.com homepage are given
to the parser. The parser puts it in a suitable format as described above and
the list of URLs in the HTML page are listed and stored in the frontier. The
URLs are picked up from the frontier and each URL is assigned to a
downloader. The status of downloader whether busy or free can be known.
After the page is downloaded it is added to the database and then the
particular downloader is set as free (i.e. released). The implementation
details are given in table 1.
Figure 15 main program of web crawler application
Figure 16 output in IDE
Table 1: Functionality of the web crawler application on the client machine.

Feature                                          Support
Search for a search string                       Yes
Help manual                                      No
Integration with other applications              Yes
Specifying case sensitivity for a search string  No
Specifying start URL                             Yes
Support for breadth-first crawling               Yes
Check for validity of the URL specified          Yes
Figure 17: Web page contents in the database
Conclusion
Web crawlers form the backbone of applications that facilitate Web
information retrieval. In this report I have presented the architecture and
implementation details of my crawling system, which can be deployed on a
client machine to browse the web concurrently and autonomously. It
combines the simplicity of an asynchronous downloader with the advantage
of using multiple threads. It reduces the consumption of resources, as it is
not implemented on mainframe servers like other crawlers, which also
reduces server management. The proposed architecture uses the available
resources efficiently to take up the task done by high-cost mainframe servers.
A major open issue for future work is a detailed study of how the system
could become even more distributed while retaining the quality of the
content of the crawled pages. Due to the dynamic nature of the Web, the
average freshness or quality of the pages downloaded needs to be checked;
the crawler can be enhanced to check this, to detect links written in
JavaScript or VBScript, and to support file formats such as XML, RTF,
PDF, Microsoft Word and Microsoft PowerPoint.
References
[1] "Basic Search Handout", www.digitallearn.org.
[2] "Web search engine", www.wikipedia.org.
[3] Krishan Kant Lavania, Sapna Jain, Madhur Kumar Gupta, and Nicy Sharma,
"Google: A Case Study (Web Searching and Crawling)", International
Journal of Computer Theory and Engineering, Vol. 5, No. 2, April 2013.
[4] "Web crawler", www.wikipedia.org.
[5] "Web crawling and indexes", online edition, Cambridge University Press,
April 1, 2009.
[6] G. Pant, P. Srinivasan, and F. Menczer, "Crawling the Web".
[7] Rajashree Shettar and Shobha G., "Web Crawler on Client Machine",
IMECS 2008, Vol. II, 19-21 March 2008, Hong Kong.
[8] Rashmi Janbandhu, Prashant Dahiwale, and M. M. Raghuwanshi, "Analysis
of Web Crawling Algorithms", International Journal on Recent and
Innovation Trends in Computing and Communication, ISSN 2321-8169,
Vol. 2, Issue 3.
Mobile database for your company telemarketing or sms marketing campaigns. Fr...
DataProvider1
 
Best web hosting Vancouver 2025 for you business
Best web hosting Vancouver 2025 for you businessBest web hosting Vancouver 2025 for you business
Best web hosting Vancouver 2025 for you business
steve198109
 
APNIC -Policy Development Process, presented at Local APIGA Taiwan 2025
APNIC -Policy Development Process, presented at Local APIGA Taiwan 2025APNIC -Policy Development Process, presented at Local APIGA Taiwan 2025
APNIC -Policy Development Process, presented at Local APIGA Taiwan 2025
APNIC
 
Reliable Vancouver Web Hosting with Local Servers & 24/7 Support
Reliable Vancouver Web Hosting with Local Servers & 24/7 SupportReliable Vancouver Web Hosting with Local Servers & 24/7 Support
Reliable Vancouver Web Hosting with Local Servers & 24/7 Support
steve198109
 
Determining Glass is mechanical textile
Determining  Glass is mechanical textileDetermining  Glass is mechanical textile
Determining Glass is mechanical textile
Azizul Hakim
 
White and Red Clean Car Business Pitch Presentation.pptx
White and Red Clean Car Business Pitch Presentation.pptxWhite and Red Clean Car Business Pitch Presentation.pptx
White and Red Clean Car Business Pitch Presentation.pptx
canumatown
 
project_based_laaaaaaaaaaearning,kelompok 10.pptx
project_based_laaaaaaaaaaearning,kelompok 10.pptxproject_based_laaaaaaaaaaearning,kelompok 10.pptx
project_based_laaaaaaaaaaearning,kelompok 10.pptx
redzuriel13
 
DNS Resolvers and Nameservers (in New Zealand)
DNS Resolvers and Nameservers (in New Zealand)DNS Resolvers and Nameservers (in New Zealand)
DNS Resolvers and Nameservers (in New Zealand)
APNIC
 
highend-srxseries-services-gateways-customer-presentation.pptx
highend-srxseries-services-gateways-customer-presentation.pptxhighend-srxseries-services-gateways-customer-presentation.pptx
highend-srxseries-services-gateways-customer-presentation.pptx
elhadjcheikhdiop
 
Computers Networks Computers Networks Computers Networks
Computers Networks Computers Networks Computers NetworksComputers Networks Computers Networks Computers Networks
Computers Networks Computers Networks Computers Networks
Tito208863
 
OSI TCP IP Protocol Layers description f
OSI TCP IP Protocol Layers description fOSI TCP IP Protocol Layers description f
OSI TCP IP Protocol Layers description f
cbr49917
 
Perguntas dos animais - Slides ilustrados de múltipla escolha
Perguntas dos animais - Slides ilustrados de múltipla escolhaPerguntas dos animais - Slides ilustrados de múltipla escolha
Perguntas dos animais - Slides ilustrados de múltipla escolha
socaslev
 
Understanding the Tor Network and Exploring the Deep Web
Understanding the Tor Network and Exploring the Deep WebUnderstanding the Tor Network and Exploring the Deep Web
Understanding the Tor Network and Exploring the Deep Web
nabilajabin35
 
IT Services Workflow From Request to Resolution
IT Services Workflow From Request to ResolutionIT Services Workflow From Request to Resolution
IT Services Workflow From Request to Resolution
mzmziiskd
 
5-Proses-proses Akuisisi Citra Digital.pptx
5-Proses-proses Akuisisi Citra Digital.pptx5-Proses-proses Akuisisi Citra Digital.pptx
5-Proses-proses Akuisisi Citra Digital.pptx
andani26
 
APNIC Update, presented at NZNOG 2025 by Terry Sweetser
APNIC Update, presented at NZNOG 2025 by Terry SweetserAPNIC Update, presented at NZNOG 2025 by Terry Sweetser
APNIC Update, presented at NZNOG 2025 by Terry Sweetser
APNIC
 
Top Vancouver Green Business Ideas for 2025 Powered by 4GoodHosting
Top Vancouver Green Business Ideas for 2025 Powered by 4GoodHostingTop Vancouver Green Business Ideas for 2025 Powered by 4GoodHosting
Top Vancouver Green Business Ideas for 2025 Powered by 4GoodHosting
steve198109
 
Smart Mobile App Pitch Deck丨AI Travel App Presentation Template
Smart Mobile App Pitch Deck丨AI Travel App Presentation TemplateSmart Mobile App Pitch Deck丨AI Travel App Presentation Template
Smart Mobile App Pitch Deck丨AI Travel App Presentation Template
yojeari421237
 
(Hosting PHising Sites) for Cryptography and network security
(Hosting PHising Sites) for Cryptography and network security(Hosting PHising Sites) for Cryptography and network security
(Hosting PHising Sites) for Cryptography and network security
aluacharya169
 
Mobile database for your company telemarketing or sms marketing campaigns. Fr...
Mobile database for your company telemarketing or sms marketing campaigns. Fr...Mobile database for your company telemarketing or sms marketing campaigns. Fr...
Mobile database for your company telemarketing or sms marketing campaigns. Fr...
DataProvider1
 
Best web hosting Vancouver 2025 for you business
Best web hosting Vancouver 2025 for you businessBest web hosting Vancouver 2025 for you business
Best web hosting Vancouver 2025 for you business
steve198109
 
APNIC -Policy Development Process, presented at Local APIGA Taiwan 2025
APNIC -Policy Development Process, presented at Local APIGA Taiwan 2025APNIC -Policy Development Process, presented at Local APIGA Taiwan 2025
APNIC -Policy Development Process, presented at Local APIGA Taiwan 2025
APNIC
 
Reliable Vancouver Web Hosting with Local Servers & 24/7 Support
Reliable Vancouver Web Hosting with Local Servers & 24/7 SupportReliable Vancouver Web Hosting with Local Servers & 24/7 Support
Reliable Vancouver Web Hosting with Local Servers & 24/7 Support
steve198109
 
Determining Glass is mechanical textile
Determining  Glass is mechanical textileDetermining  Glass is mechanical textile
Determining Glass is mechanical textile
Azizul Hakim
 
White and Red Clean Car Business Pitch Presentation.pptx
White and Red Clean Car Business Pitch Presentation.pptxWhite and Red Clean Car Business Pitch Presentation.pptx
White and Red Clean Car Business Pitch Presentation.pptx
canumatown
 
project_based_laaaaaaaaaaearning,kelompok 10.pptx
project_based_laaaaaaaaaaearning,kelompok 10.pptxproject_based_laaaaaaaaaaearning,kelompok 10.pptx
project_based_laaaaaaaaaaearning,kelompok 10.pptx
redzuriel13
 
DNS Resolvers and Nameservers (in New Zealand)
DNS Resolvers and Nameservers (in New Zealand)DNS Resolvers and Nameservers (in New Zealand)
DNS Resolvers and Nameservers (in New Zealand)
APNIC
 
highend-srxseries-services-gateways-customer-presentation.pptx
highend-srxseries-services-gateways-customer-presentation.pptxhighend-srxseries-services-gateways-customer-presentation.pptx
highend-srxseries-services-gateways-customer-presentation.pptx
elhadjcheikhdiop
 
Computers Networks Computers Networks Computers Networks
Computers Networks Computers Networks Computers NetworksComputers Networks Computers Networks Computers Networks
Computers Networks Computers Networks Computers Networks
Tito208863
 
OSI TCP IP Protocol Layers description f
OSI TCP IP Protocol Layers description fOSI TCP IP Protocol Layers description f
OSI TCP IP Protocol Layers description f
cbr49917
 
Perguntas dos animais - Slides ilustrados de múltipla escolha
Perguntas dos animais - Slides ilustrados de múltipla escolhaPerguntas dos animais - Slides ilustrados de múltipla escolha
Perguntas dos animais - Slides ilustrados de múltipla escolha
socaslev
 
Understanding the Tor Network and Exploring the Deep Web
Understanding the Tor Network and Exploring the Deep WebUnderstanding the Tor Network and Exploring the Deep Web
Understanding the Tor Network and Exploring the Deep Web
nabilajabin35
 
IT Services Workflow From Request to Resolution
IT Services Workflow From Request to ResolutionIT Services Workflow From Request to Resolution
IT Services Workflow From Request to Resolution
mzmziiskd
 
5-Proses-proses Akuisisi Citra Digital.pptx
5-Proses-proses Akuisisi Citra Digital.pptx5-Proses-proses Akuisisi Citra Digital.pptx
5-Proses-proses Akuisisi Citra Digital.pptx
andani26
 
APNIC Update, presented at NZNOG 2025 by Terry Sweetser
APNIC Update, presented at NZNOG 2025 by Terry SweetserAPNIC Update, presented at NZNOG 2025 by Terry Sweetser
APNIC Update, presented at NZNOG 2025 by Terry Sweetser
APNIC
 
Top Vancouver Green Business Ideas for 2025 Powered by 4GoodHosting
Top Vancouver Green Business Ideas for 2025 Powered by 4GoodHostingTop Vancouver Green Business Ideas for 2025 Powered by 4GoodHosting
Top Vancouver Green Business Ideas for 2025 Powered by 4GoodHosting
steve198109
 
Smart Mobile App Pitch Deck丨AI Travel App Presentation Template
Smart Mobile App Pitch Deck丨AI Travel App Presentation TemplateSmart Mobile App Pitch Deck丨AI Travel App Presentation Template
Smart Mobile App Pitch Deck丨AI Travel App Presentation Template
yojeari421237
 
(Hosting PHising Sites) for Cryptography and network security
(Hosting PHising Sites) for Cryptography and network security(Hosting PHising Sites) for Cryptography and network security
(Hosting PHising Sites) for Cryptography and network security
aluacharya169
 

system that is designed to search for information on the World Wide Web.

Assume you are reading a book and want to find references to a specific word in it. What do you do? You turn to the end of the book and look in the index. You locate the word in the index, find the page numbers mentioned there, and flip to the corresponding pages. Search engines work in a similar way.

Figure 1 telephone directory

Search engines are constantly building and updating their index of the World Wide Web. They do this by using "spiders" that "crawl" the web and fetch web pages. The words used in these web pages are then added to the index, along with a record of where each word came from. [1]
How Stuff Works

A search engine operates in the following order:
1. Web crawling
2. Indexing
3. Searching

Web search engines work by storing information about many web pages. These pages are retrieved by a Web crawler (sometimes also known as a spider), an automated program which follows every link on a site. The search engine then analyzes the contents of each page to determine how it should be indexed (for example, words can be extracted from the titles, the page content, headings, or special fields called meta tags).

Figure 2 working flow of search engine

When a user enters a query into a search engine (typically by using keywords), the engine examines its index and provides a listing of best-matching web pages according to its criteria, usually with a short summary
containing the document's title and sometimes parts of the text. The index is built from the information stored with the data.

Since 2007 the Google search engine has allowed users to search by date by clicking 'Show search tools' in the leftmost column of the initial search results page and then selecting the desired date range. Most search engines support the Boolean operators AND, OR and NOT to further specify the search query. Boolean operators are for literal searches that allow the user to refine and extend the terms of the search. Natural language queries, in contrast, allow the user to type a question in the same form one would ask it of a human; Ask.com is an example of such a site.

The usefulness of a search engine depends on the relevance of the result set it gives back. While there may be millions of web pages that include a particular word or phrase, some pages may be more relevant, popular, or authoritative than others. Most search engines therefore employ methods to rank the results so that the "best" results appear first. Search engines that do not accept money for their search results earn revenue by running search-related ads alongside the regular results; the engine earns money every time someone clicks on one of these ads. [2]
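The index-then-search pipeline described above can be sketched with a tiny inverted index: each word maps to the set of pages containing it, and a keyword query becomes a simple lookup. This is an illustrative sketch only; the page URLs and contents are invented, and a real engine would also store term positions, rankings, and summaries.

```java
import java.util.*;

// Minimal inverted index: word -> set of pages containing that word.
public class InvertedIndex {
    private final Map<String, Set<String>> index = new HashMap<>();

    // Add every word of a fetched page to the index.
    public void addPage(String url, String content) {
        for (String word : content.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                index.computeIfAbsent(word, k -> new HashSet<>()).add(url);
            }
        }
    }

    // Answer a one-word query: the pages containing that word.
    public Set<String> search(String word) {
        return index.getOrDefault(word.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.addPage("http://example.com/a", "Search engines index web pages");
        idx.addPage("http://example.com/b", "A crawler fetches web pages");
        System.out.println(idx.search("web"));     // both pages
        System.out.println(idx.search("crawler")); // only page b
    }
}
```

A multi-word query would intersect the per-word sets; ranking (as discussed above) would then order that result set.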
Major Search Engines: A Comparison

Today there are many search engines available to web searchers. What makes one search engine different from another? The following are some important measures. [3]

- The contents of the database are a crucial factor in whether you will succeed in finding the information you need, because when searching you are not actually searching the Web directly. Rather, you are searching a cache of the Web: a database containing information about all the sites visited by that search engine's spider or crawler.
- Size is another important measure: how many Web pages has the spider visited, scanned, and stored in the database? Some of the larger search engines have databases covering over three billion Web pages, while the databases of smaller search engines cover half a billion or fewer.
- Another important measure is how up to date the database is. The Web is continuously changing and growing: new websites appear, old sites vanish, and existing sites modify their content, so the information stored in the database becomes out of date unless the search engine's spider keeps up with these changes.
- Finally, the ranking algorithm used by the search engine determines whether the most relevant results appear towards the top of the results list.

Figure 3 Google logo

Google has been in the search game a long time and has the highest market share among search engines (about 81%) [3].
1) Its Web crawler-based service provides both comprehensive coverage of the Web and great relevancy.
2) Google is much better than the other engines at determining whether a link is an artificial link or a true editorial link.
3) Google gives much importance to sites that add fresh content on a regular basis. This is why Google likes blogs, especially popular ones.
4) Google prefers informational pages to commercial sites.
5) A page on a site, or on a subdomain of a site, with significant age or links can rank much better than it otherwise should, even with no external citations.
6) Google has aggressive duplicate-content filters that filter out many pages with similar content.
7) Crawl depth is determined not only by link quantity but also by link quality. Excessive low-quality links may make a site less likely to be crawled deeply, or even to be included in the index.
8) In addition, users can search twelve different file formats, cached pages, images, news, and Usenet group postings.

Figure 4 Yahoo logo

Yahoo has been in the search game for many years [3].
1) It has the second-largest share of the search engine market (about 12%).
2) When it comes to counting backlinks, Yahoo is the most accurate search engine.
3) Yahoo is better than MSN, though not as good as Google, at determining whether a link is artificial or natural.
4) The crawl rate of Yahoo's spiders is at least three times faster than that of Google's spiders.
5) Yahoo tends to prefer commercial pages to informational pages, compared with Google.
6) Yahoo gives more importance to "exact matching" than to "concept matching", which makes it slightly more susceptible to spamming.
7) Yahoo gives more importance to meta keywords and description tags.
Figure 5 MSN logo

1) MSN has a share of 3% of the total search engine market [3].
2) MSN Search uses its own Web database and also has separate News, Images, and Local databases.
3) Its strengths include this large unique database, its query-building "Search Builder" and Boolean searching, cached copies of Web pages including the date cached, and automatic local search options.
4) Its spider crawls only the beginning of pages (as opposed to the other two search engines, which crawl the entire content), and the number of pages found in its index or database is extremely low.
5) It is bad at determining whether a link is natural or artificial in nature.
6) Because it is weak at link analysis, it places too much weight on page content.
7) New sites that are generally untrusted in other systems can rank quickly in MSN Search, but this also makes it more susceptible to spam.
8) Another downside of this search engine is its habit of supplying results based on geo-targeting, which makes it extremely hard to determine whether the results we see are the same ones everybody sees.

Figure 6 ASK Jeeves logo

1) The Ask search engine has the lowest share (about 1%) of the total search engine market [3].
2) Ask is a topical search site. It gives more importance to sites that are linked to topical communities.
3) Ask is more susceptible to spamming.
4) Because Ask is smaller and more specialized than other search engines, it is wise to approach this engine more from a networking or marketing perspective.
Figure 7 Live Search logo

1) Launched in September 2006.
2) Live Search (formerly Windows Live Search) was Microsoft's web search engine, the successor to MSN Search, designed to compete with the industry leaders Google and Yahoo.
3) It also allowed the user to save searches and see them updated automatically on Live.com.

Figure 8 Bing logo

1) Launched in July 2009 by Microsoft as the successor to MSN/Live Search.
2) Features such as 'wiki' suggestions, 'visual search', and 'related searches' can be very useful.
Introduction to Web Crawler

A crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index. It is also known as a "spider" or a "bot" (short for "robot").

Spider - a program that, like a browser, downloads web pages.
Crawler - a program that automatically follows the links of web pages.
Robot - an automated computer program that can visit websites, guided by search engine algorithms. It can combine the tasks of the crawler and the spider, helping search engines index web pages. [4]

Why Crawlers?

Figure 9 result of searching the term "web crawler" in Google

Crawling is gathering pages from the internet in order to index them. It has two main objectives:
- fast gathering
- efficient gathering [5]

The internet is a wide expanse of information, and finding relevant information requires an efficient mechanism. Web crawlers provide that capability to the search engine.
Features

Features a crawler must provide:

Robustness: The Web contains servers that create spider traps, which are generators of web pages that mislead crawlers into getting stuck fetching an infinite number of pages in a particular domain. Crawlers must be designed to be resilient to such traps. Not all such traps are malicious; some are the inadvertent side-effect of faulty website development.

Politeness: Web servers have both implicit and explicit policies regulating the rate at which a crawler can visit them. These politeness policies must be respected.

Features a crawler should provide:

Distributed: The crawler should have the ability to execute in a distributed fashion across multiple machines.

Scalable: The crawler architecture should permit scaling up the crawl rate by adding extra machines and bandwidth.

Performance and efficiency: The crawl system should make efficient use of various system resources, including processor, storage, and network bandwidth.

Quality: Given that a significant fraction of all web pages are of poor utility for serving user query needs, the crawler should be biased towards fetching "useful" pages first.

Freshness: In many applications, the crawler should operate in continuous mode: it should obtain fresh copies of previously fetched pages.

Extensible: Crawlers should be designed to be extensible in many ways, to cope with new data formats, new fetch protocols, and so on. This demands that the crawler architecture be modular. [5]
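The politeness requirement above can be sketched as a small per-host rate limiter: a request to a host is allowed only if a minimum delay has elapsed since the last request to that host. The host names and the one-second delay below are illustrative assumptions, not values from the report.

```java
import java.util.*;

// Per-host politeness gate: allow a fetch only if at least delayMs
// milliseconds have passed since the last fetch from the same host.
public class PolitenessPolicy {
    private final Map<String, Long> lastAccess = new HashMap<>();
    private final long delayMs;

    public PolitenessPolicy(long delayMs) { this.delayMs = delayMs; }

    // Returns true if the host may be fetched now, and records the access.
    public synchronized boolean tryAcquire(String host, long nowMs) {
        Long last = lastAccess.get(host);
        if (last != null && nowMs - last < delayMs) {
            return false; // too soon: be polite and retry later
        }
        lastAccess.put(host, nowMs);
        return true;
    }

    public static void main(String[] args) {
        PolitenessPolicy p = new PolitenessPolicy(1000);
        System.out.println(p.tryAcquire("example.com", 0));    // true
        System.out.println(p.tryAcquire("example.com", 500));  // false
        System.out.println(p.tryAcquire("other.org", 500));    // true
        System.out.println(p.tryAcquire("example.com", 1500)); // true
    }
}
```

A crawler would call `tryAcquire` with the URL's host before each fetch, re-queueing URLs that are refused; a real implementation would also honour robots.txt.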
Architecture of Crawler

Flow of a basic sequential crawler

Web crawlers are mainly used to index the links of all visited pages for later processing by a search engine. Such search engines rely on massive collections of web pages that are acquired with the help of web crawlers, which traverse the web by following hyperlinks and store downloaded pages in a large database that is later indexed for efficient execution of user queries. Despite the numerous applications for Web crawlers, at the core they are all fundamentally the same. The following is the process by which Web crawlers work [6]:

1) Download the Web page.
2) Parse through the downloaded page and retrieve all the links.
3) For each link retrieved, repeat the process.

Figure 10 shows the flow of a basic sequential crawler. The crawler maintains a list of unvisited URLs called the frontier. The list is initialized with seed URLs, which may be provided by a user or another program. Each crawling loop involves picking the next URL to crawl from the frontier, fetching the corresponding page through HTTP, parsing the retrieved page to extract URLs and application-specific information, and finally adding the unvisited URLs to the frontier. Before URLs are added to the frontier they may be assigned a score that represents the estimated benefit of visiting the corresponding page. The crawling process may be terminated when a certain number of pages have been crawled. If the crawler is ready to crawl another page and the frontier is empty, the situation signals a dead-end: the crawler has no new page to fetch and hence it stops. [6]
Figure 10 Flow of a basic sequential crawler

The multi-threaded crawler model needs to deal with an empty frontier just like a sequential crawler [6].
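The sequential crawling loop described above (frontier of unvisited URLs, fetch, parse, enqueue) can be sketched as follows. To keep the sketch runnable without network access, the fetch step is a pluggable function over an invented in-memory link graph; a real crawler would fetch pages over HTTP and parse out the hyperlinks.

```java
import java.util.*;
import java.util.function.Function;

// Basic sequential crawler loop: pop the next URL from the frontier,
// "fetch" the page, extract its links, and enqueue the unvisited ones.
public class SequentialCrawler {

    // Crawl starting from the seeds, stopping after maxPages pages.
    // `fetch` maps a URL to the list of links found on that page.
    public static List<String> crawl(List<String> seeds,
                                     Function<String, List<String>> fetch,
                                     int maxPages) {
        Deque<String> frontier = new ArrayDeque<>(seeds); // unvisited URLs
        Set<String> visited = new HashSet<>();
        List<String> crawled = new ArrayList<>();
        while (!frontier.isEmpty() && crawled.size() < maxPages) {
            String url = frontier.poll();            // pick the next URL
            if (!visited.add(url)) continue;         // skip already-seen URLs
            crawled.add(url);                        // "download" the page
            for (String link : fetch.apply(url)) {   // parse out its links
                if (!visited.contains(link)) frontier.add(link);
            }
        }
        return crawled; // frontier empty or page budget reached
    }

    public static void main(String[] args) {
        // A tiny in-memory "web" standing in for real pages.
        Map<String, List<String>> web = Map.of(
            "A", List.of("B", "C"),
            "B", List.of("C"),
            "C", List.of());
        System.out.println(crawl(List.of("A"),
            u -> web.getOrDefault(u, List.of()), 10)); // [A, B, C]
    }
}
```

Because the frontier is consumed first-in-first-out here, this loop crawls breadth-first; scoring URLs before enqueueing, as mentioned above, would turn the frontier into a priority queue.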
Figure 11 A multi-threaded crawler model

High level architecture

Here, the multi-threaded downloader downloads web pages from the WWW, and parsers decompose the web pages into URLs, contents, titles, etc. The URLs are queued and sent to the downloader using a scheduling algorithm. The downloaded data are stored in a database [7].
Figure 12 High level architecture of web crawler

The design of the downloader scheduling algorithm is crucial: too many downloader objects will exhaust resources and make the system slow, while too few will degrade system performance. The scheduling algorithm is as follows [7]:

1) The system allocates a pre-defined number of downloader objects.
2) The user inputs a new URL to start the crawler.
3) If any downloader is busy and there are new URLs to be processed, a check is made to see whether any downloader object is free. If so, assign the new URL to it and set its status to busy; otherwise go to step 6.
4) After a downloader object downloads the contents of a web page, set its status to free.
5) If any downloader object runs longer than an upper time limit, abort it and set its status to free.
6) If there are more than the predefined number of downloaders, or all downloader objects are busy, allocate new threads and distribute the downloaders among them.
7) Continue allocating new threads and free threads to the downloaders until the number of downloaders falls below the threshold value, provided the number of threads in use is kept under a limit.
8) Go to step 3.
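The bounded-downloader idea behind the scheduler above can be sketched with a fixed thread pool: a predefined number of downloader workers process URLs concurrently, and any download exceeding the time limit is abandoned, mirroring steps 1 and 5. The download itself is simulated here; this is a sketch of the idea, not the report's actual implementation.

```java
import java.util.*;
import java.util.concurrent.*;

// Bounded pool of downloader workers with a per-download time limit.
public class DownloaderPool {

    // Download all URLs with `workers` concurrent downloaders,
    // giving up on any single download after timeoutMs milliseconds.
    public static List<String> downloadAll(List<String> urls, int workers,
                                           long timeoutMs) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        List<Future<String>> futures = new ArrayList<>();
        for (String url : urls) {
            // Simulated fetch; a real downloader would open an HTTP connection.
            futures.add(pool.submit(() -> "contents of " + url));
        }
        List<String> results = new ArrayList<>();
        for (Future<String> f : futures) {
            try {
                results.add(f.get(timeoutMs, TimeUnit.MILLISECONDS));
            } catch (TimeoutException e) {
                f.cancel(true); // abort downloads that run past the limit
            } catch (InterruptedException | ExecutionException e) {
                f.cancel(true); // treat failed downloads the same way
            }
        }
        pool.shutdown();
        return results;
    }

    public static void main(String[] args) {
        List<String> pages = downloadAll(List.of("u1", "u2", "u3"), 2, 1000);
        System.out.println(pages.size()); // 3
    }
}
```

The thread pool plays the role of the downloader-object allocation in steps 1, 6 and 7: the pool size bounds resource use, and idle workers are reused instead of being re-created.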
Crawling Strategies

There are mainly four types of crawling strategies, as below [8]:

1) Breadth-First Crawling

Figure 13 breadth-first crawling

This algorithm starts at the root URL and searches all the neighbouring URLs at the same level. If the goal is reached, it reports success and the search terminates. If not, the search proceeds down to the next level, sweeping across the neighbouring URLs at that level, and so on until the goal is reached. If all the URLs are searched but the objective is not met, failure is reported.

2) Depth-First Crawling

Figure 14 depth-first crawling

This algorithm starts at the root URL and traverses deeper through the child URLs. If there is more than one child, priority is given to the left-most child, and traversal continues deep until no more children are available. The algorithm then backtracks to the next unvisited node and continues in a similar manner.

3) Repetitive Crawling

Once pages have been crawled, some systems require the process to be repeated periodically so that indexes are kept up to date. This may be achieved by launching a second crawl in parallel; to manage this, the "index list" should be constantly updated.

4) Targeted Crawling

Here the main objectiveive is to retrieve the greatest number of pages relating to a particular subject using minimum bandwidth. Most search engines use heuristics in the crawling process in order to target a certain type of page on a specific topic.

Crawling Policies

The characteristics of the web that make crawling difficult are:
1) its large volume, and
2) its fast rate of change.

To cope with these difficulties, a web crawler follows these policies [5]:

A selection policy that states which pages to download.
A re-visit policy that states when to check for changes in pages.
A politeness policy that states how to avoid overloading web sites.
A parallelization policy that states how to coordinate distributed Web crawlers.
Implementation

I have developed a Web crawler application in Java that works on the Windows operating system. It uses NetBeans, or any Java-compatible IDE, to run. For database connectivity it uses a MySQL-WAMP server interface. The proposed web crawler uses breadth-first crawling to search the links.

The proposed web crawler is deployed on a client machine. Once the IDE is started and the program is run, an automated browsing process is initiated. The HTML contents of the rediffmail.com homepage are given to the parser. The parser puts them in a suitable format, as described above, and the URLs in the HTML page are listed and stored in the frontier. The URLs are picked up from the frontier and each URL is assigned to a downloader; the status of each downloader, whether busy or free, can be queried. After a page is downloaded it is added to the database and the downloader is set as free (i.e. released). The implementation details are given in Table 1.

Figure 15 main program of web crawler application
Figure 16 output in IDE

Table 1: Functionality of the web crawler application on the client machine.

Feature                                          Support
Search for a search string                       Yes
Help manual                                      No
Integration with other applications              Yes
Specifying case sensitivity for a search string  No
Specifying start URL                             Yes
Support for breadth-first crawling               Yes
Check for validity of URL specified              Yes
Figure 17 web page contents stored in the database
Conclusion

Web crawlers form the backbone of applications that facilitate Web information retrieval. In this report I have presented the architecture and implementation details of my crawling system, which can be deployed on a client machine to browse the web concurrently and autonomously. It combines the simplicity of an asynchronous downloader with the advantage of using multiple threads. It reduces the consumption of resources, as it is not implemented on mainframe servers as other crawlers are, which also reduces server management. The proposed architecture uses the available resources efficiently to accomplish the task otherwise done by high-cost mainframe servers.

A major open issue for future work is a detailed study of how the system could become even more distributed while retaining the quality of the content of the crawled pages. Due to the dynamic nature of the Web, the average freshness and quality of downloaded pages need to be checked; the crawler can be enhanced to check this, to detect links written in JavaScript or VBScript, and to support file formats such as XML, RTF, PDF, Microsoft Word, and Microsoft PowerPoint.

References

[1] "Basic search handout", www.digitallearn.org
[2] "Web search engine", www.wikipedia.org
[3] Krishan Kant Lavania, Sapna Jain, Madhur Kumar Gupta, and Nicy Sharma, "Google: A Case Study (Web Searching and Crawling)", International Journal of Computer Theory and Engineering, Vol. 5, No. 2, April 2013.
[4] "Web crawler", www.wikipedia.org
[5] "Web crawling and indexes", online edition, Cambridge University Press, April 1, 2009.
[6] G. Pant, P. Srinivasan, and F. Menczer, "Crawling the Web".
[7] Rajashree Shettar and Dr. Shobha G, "Web Crawler on Client Machine", IMECS 2008, Vol. II, 19-21 March 2008, Hong Kong.
[8] Rashmi Janbandhu, Prashant Dahiwale, and M. M. Raghuwanshi, "Analysis of Web Crawling Algorithms", International Journal on Recent and Innovation Trends in Computing and Communication, ISSN 2321-8169, Vol. 2, Issue 3.