0% found this document useful (0 votes)
2 views

unit8

The document provides an overview of search engines, detailing their characteristics, functionality, and the ranking of web pages. It explains the differences between web search and traditional information retrieval, the components of search engines, and the processes of web crawling, indexing, and searching. Additionally, it discusses the goals of web search, the quality of search results, and the page rank algorithm.

Uploaded by

Srizan
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
2 views

unit8

The document provides an overview of search engines, detailing their characteristics, functionality, and the ranking of web pages. It explains the differences between web search and traditional information retrieval, the components of search engines, and the processes of web crawling, indexing, and searching. Additionally, it discusses the goals of web search, the quality of search results, and the page rank algorithm.

Uploaded by

Srizan
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 32

Unit 8: Search Engines LH 3

Presented By : Tekendra Nath Yogi


[email protected]
College Of Applied Business And Technology
Contd…
• Outline:
– 8.1 Characteristics of search engine

– 8.2 Search Engine functionality

– 8.3 Ranking of Web pages

7/18/2019 By: Tekendra Nath Yogi 2


Introduction
• Most libraries have a relatively small collection of documents and a
catalogue to search documents.

• The web on the other hand is a very large collection of documents.

• The search engines allow a user to carry out the task of searching the
web for information.

• The Google dominates the search engine market(about 75%).

7/18/2019 By:Tekendra Nath Yogi 3


Differences between web search and information retrieval

• A web search is very different than a normal information


retrieval search of a document because of the following factors:

– Bulk:

• web is much larger than any set of documents used by IR.

– Diversity:

• web pages may contains text, tables, image, video, audio


instead of only text.

– Growth:

• exponential growth of web.

7/18/2019 By:Tekendra Nath Yogi 4


Contd..
– Dynamic:
• web changes significantly with time but text does not.

– Demanding Users:
• users demand immediate results.

– Quality of document:
• Text documents are usually of high quality but web documents may not.

– Hyperlinks:
• very important components of web documents

– Queries:
• web queries are short and ambiguous.

7/18/2019 By:Tekendra Nath Yogi 5


Search engine
• A search engine is defined as program that searches for
documents for specified keyword and returns a list of the
documents where the keywords are found.

• A search engine consists of the following main components:


– Crawler(spider)

– Indexer

– Search engine user interface

7/18/2019 By:Tekendra Nath Yogi 6


Contd…
• A typical search engine architecture is as shown in figure below

7/18/2019 By:Tekendra Nath Yogi 7


Contd..

• How search engine works?

A search engine operates in the following order:

1. Web crawling

2. Indexing

3. Searching

7/18/2019 By:Tekendra Nath Yogi 8


Contd..
• Web Crawling:

– Search engine has a huge databases of web pages . Such databases


are built and updated automatically by the web crawler.

– The web crawler performs web crawling as follows:


• The crawler begins with one or more URLs that constitute a
URL set.
• It picks a URL from this URL set, and then fetches the web
page at that URL.
• The fetched page is then parsed to extract both the text and the
links from the page.
• The extracted links (URLs) are then added to a URL set.
• The extracted text is fed to a text indexer.
7/18/2019 By:Tekendra Nath Yogi 9
Contd..
• Indexing:
– The indexer module of the search engine is responsible for
indexing the extracted text supplied by the web crawler.

– Most commonly used indexing is the inverted indexing

7/18/2019 By:Tekendra Nath Yogi 10


Contd..

7/18/2019 By:Tekendra Nath Yogi 11


Contd..
• Searching:

– When a user enters a query to the search engine, user is not


searching the entire web. Instead user is only searching the
database that has been compiled by the search engine.

– The user’s query is parsed into the words by the query parser.

– Such parsed words are matched with the words in the inverted
list of indexed documents.

– The matched list of documents are returned to the user with


ranking.
7/18/2019 By:Tekendra Nath Yogi 12
Characteristics of search engines
• Features a search engine must provide:

– Robustness:

• search engine must be distributed over large number of


machine to deal search engine failure due to the machine
failure.
– Politeness:
• Web servers have policies regulating the rate at which a search
engine can visit them. These politeness policies must be
respected.

7/18/2019 By:Tekendra Nath Yogi 13


Contd..
• Features a search engine should provide

– Distributed:

• The search should have the ability to execute in a


distributed fashion across multiple machines.

– Scalable:

• The search engine architecture should permit scaling up


the search rate by adding extra machines.

7/18/2019 By:Tekendra Nath Yogi 14


Contd..
• Performance and efficiency:

– The search system should make efficient use of various


system resources including processor, storage and network.

• Quality:

– Given that a significant fraction of all web pages are of poor utility
for serving user query needs, the search engine should be biased
towards fetching “useful” pages first.

7/18/2019 By:Tekendra Nath Yogi 15


Contd..
• Freshness:
– it should obtain fresh copies of previously fetched pages.

• Extensible:

– Crawlers should be designed to be extensible in many ways – to


cope with new data formats, new fetch protocols, and so on. This
demands that the crawler architecture be modular.

7/18/2019 By:Tekendra Nath Yogi 16


Problems with search using search engines
• Specifying query keywords can be challenging:

– Search result get affected by structure of the query phrase.

– Due to the nature of English language search result may get


affected e.g., current.

– Scarcity problem

7/18/2019 By:Tekendra Nath Yogi 17


Contd..
• Difficult for the search engine to be certain about what users
want.

– Some may be seeking destination While others may want only


a small number of highly relevant result.

• Diversity of search engine and web users

– Young to old

– A search engine is therefore attempting to meet the needs of a


diverse group of users.

7/18/2019 By:Tekendra Nath Yogi 18


The goals of web search
• Depending on the nature of search engine queries, the
information needs of user may be divided into three classes:

– Navigational

– Informational

– Transactional

7/18/2019 By:Tekendra Nath Yogi 19


Contd..
– Navigational:

• To reach a website that the user has in mind. The user may
know the site exists but or may have visited the site earlier but
does not know the site URL.

– Informational:

• To find a website that provides useful information about a topic


of interest. The user may not have a particular website in mind.

– Transactional:

• To go to a site to perform some kind of transaction. E.g., buy a


book
7/18/2019 By:Tekendra Nath Yogi 20
Quality of search result
• The quality of search results from a search engine ideally should
satisfy the following requirements:
– Precision:

• precision indicates what percentage of documents retrieved are


relevant?

• So , only relevant documents should be returned.

– Recall:

• means what percentage of relevant documents is retrieved from


total relevant documents in the web

• So, all relevant document should be returned


7/18/2019 By:Tekendra Nath Yogi 21
Contd..
• Ranking:

– A ranking of the documents providing some indication of the


relative relevance of the results should be returned.

• First screen:

– The first page of results should include the most relevant


results.

• Speed:

– Results should be provided quickly.

7/18/2019 By:Tekendra Nath Yogi 22


Search engine functionality
• A search engine is a complex collection of software modules. A
search engines carries out a variety of tasks:
– Collecting information

– Evaluating and categorizing information

– Creating a database and creating indexes

– Computing ranks of the web documents

– Checking queries and executing them

– Presenting results

– Profiling the users

7/18/2019 By:Tekendra Nath Yogi 23


Contd..
• Collecting information:

– A search engine would normally collect web pages or information


about them by web crawling.

• Evaluating and categorizing information:

– search engine evaluates the pages before submission and categorize


the information.

• Creating a database and creating indexes:

– The information collected needs to be stored either in a database


or some kind of file system. Indexes must be created so that the
information may be searched efficiently.
7/18/2019 By:Tekendra Nath Yogi 24
Contd..
• Computing ranks of the web documents: rank the web pages before
returning as a response to the user queries.

• Checking queries and executing them: queries posed by users need


to be checked , for example, for spelling errors. Once checked, a query
is executed by searching the search engine database.

• Presenting results: search engine determine what results to present


and how to display them

• Profiling the user: To improve the search performance the search


engines carry out user profiling that deals with the way users use
search engines.
7/18/2019 By:Tekendra Nath Yogi 25
Page Ranking
• The web consists of a huge number of documents that have been
published without any quality control.

• The page ranking is a method for determining the relative


importance and quality of the page for a given query.

• The most well known ranking algorithm is the page rank


algorithm.

7/18/2019 By:Tekendra Nath Yogi 26


Contd..
• Page rank algorithm:

– Was developed by Larry page at Stanford university.

– A hyperlink to a page counts as a vote of support .


• A page that is linked to by many pages receives a high rank and if

there is no links to a web page there is no support for that page so, get

low rank.

7/18/2019 By:Tekendra Nath Yogi 27


Contd..
– Assigns to every node in the web graph a numerical score

between 0 and 1 to each element of hyperlinked set of

documents.

– The rank value indicates the importance of a particular page.


• A page rank of 0.5 means there is a 50% chance that a person clicking

on a random link will be directed to the document with 0.5 page rank.

7/18/2019 By:Tekendra Nath Yogi 28


Contd..
• Algorithm with illustrative example:
– Assume a small universe of four web pages A, B, C and D.

– The initial approximation of Page Rank would be evenly divided between


the four documents.

– Hence each document would begin with an estimated Page Rank of 0.25.

– If pages B, C and D each only link to A, they would each confer 0.25 page
rank to A.

– i.e. PR(A) = PR(B) + PR(C) + PR(D) = 0.75

7/18/2019 By:Tekendra Nath Yogi 29


Contd..
• Suppose that page B has link to page C as well as to page A, while pages D has
links to all three pages and page C has link to A.

• The value of link votes is divided among all the outbound links on the page.
• Thus B gives vote worth 0.125 to page A and a vote 0.125 to page C.
• Similarly, D’s page rank is 0.083 (approximately)
• i.e. PR(A) = PR(B)/2 + PR(C)/1 + PR(D)/3

• Where L(page) = number of outbound links


• Bm = set of all pages link to page m

7/18/2019 By:Tekendra Nath Yogi 30


Home work
• What is search engine? Explain the various components of search engine
architecture.

• What is the role of crawler and indexer?

• Explain the different search engine functionality.

• What are the primary goals of web search?

• Describe the page rank algorithm. Using an example, show how it works.

• How is web search different than text retrieval?

7/18/2019 By:Tekendra Nath Yogi 31


Thank You !

7/18/2019 By: Tekendra Nath Yogi 32

You might also like