
The Overview of Web Search Engines

Presented by Sunny Lam


Outline
Introduction
Information Retrieval
Searching Problems
Types of Search Engines
The Largest Search Engines
Architectures
User Interfaces
Web Directories
Ranking
Web Crawlers
Indices
Metasearchers
Add-on Tools
Future Work
Conclusion
Questions about the Web
Q: How many computers are in the world?
A: Over 40 million.

Q: How many of them are Web servers?


A: Over 3 million.

Q: How many Web pages in the world?


A: Over 350 million.

Q: What are the most popular formats of Web documents?


A: HTML, GIF, JPG, ASCII files, Postscript and ASP.

Q: What is the average size of a Web document?


A: Mean: 5 KB; Median: 2 KB.

Q: How many queries does a search engine answer every day?


A: Tens of millions.
Characteristics of the Web
Huge (1.75 terabytes of text)
Allows people to share information globally and freely
Hides the details of communication protocols, machine
locations, and operating systems
Data are unstructured
Exponential growth
Increasingly commercial over time (1.5% .com in
1993 to 60% in 1997)
Difficulties of Building a Search Engine
Built by companies that hide the technical details
Distributed data
High percentage of volatile data
Large volume
Unstructured and redundant data
Quality of data
Heterogeneous data
Dynamic data
How to specify a query from the user
How to interpret the answer provided by the system
Information Retrieval
Search engines belong to the field of information retrieval (IR)
IR covers searching authors, titles, and subjects in library card
catalogs or computers
It also covers document classification and categorization, user
interfaces, data visualization, and filtering
Should make it easy to retrieve the information a user is interested in
IR can be inaccurate as long as the error is insignificant
Data is usually natural-language text, which is not always well
structured and can be semantically ambiguous
Goal: To retrieve all the documents which are relevant to a
query while retrieving as few non-relevant documents as
possible
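
This goal is conventionally quantified by the standard IR measures of
precision and recall:

precision = |relevant ∩ retrieved| / |retrieved|   (fraction of retrieved documents that are relevant)
recall    = |relevant ∩ retrieved| / |relevant|    (fraction of relevant documents that are retrieved)

A perfect engine would score 1.0 on both; in practice the two trade off
against each other.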
User Problems
Users do not know exactly how to phrase a query as a
sequence of words
Not aware of the input requirements of the search
engine
Problems understanding Boolean logic, so users
cannot use advanced search
Novice users do not know how to start using a search
engine
Users ignore advertisements → no funding
Around 85% of users only look at the first page of
the result, so relevant answers might be skipped
Searching Guidelines
Specify the words clearly (+, -)
Use Advanced Search when necessary
Provide as many particular terms as possible
If looking for a company, institution, or organization, try:
www.name [.com | .edu | .org | .gov | country code]
Some search engines specialize in particular areas
For broad queries, try using Web directories as
starting points
Note that anyone can publish data on the Web, so
information obtained from search engines might not
be accurate
Types of Search Engines
Search by Keywords (e.g. AltaVista,
Excite, Google, and Northern Light)
Search by categories (e.g. Yahoo!)
Specialize in other languages (e.g.
Chinese Yahoo! and Yahoo! Japan)
Interview simulation (e.g. Ask Jeeves!)
The Largest Search Engines (1998)

Search engine    URL                      Web pages indexed (millions)
AltaVista        www.altavista.com        140
AOL Search       search.aol.com           N/A
Excite           www.excite.com           55
Google           google.stanford.edu      25
GoTo             goto.com                 N/A
HotBot           www.hotbot.com           110
Go               www.go.com               30
Lycos            www.lycos.com            30
Magellan         magellan.excite.com      55
Microsoft        search.msn.com           N/A
Northern Light   www.northernlight.com    67
Open Text        www.opentext.com         N/A
WebCrawler       www.webcrawler.com       2
Search Engine Architectures
AltaVista
Harvest
Google
AltaVista Architecture
[Diagram: the Crawler fetches pages from the Web and feeds them to the
Indexer, which builds the Index; the Query Engine answers queries from
the Index, and the User interacts with it through the Interface.]
Harvest Architecture

[Diagram: Gatherers collect objects from Web sites into an Object Cache;
Brokers index the gathered data and answer user queries; a Replication
Manager replicates content between Brokers.]
Google Architecture
User Interfaces
Query Interface
A box where the user enters a sequence of words (AltaVista uses the
union of the words, HotBot the intersection)
Complex query interfaces (e.g. Boolean logic, phrase search,
title search, URL search, date-range search, data-type search)

Answer Interface
The most relevant pages appear at the top of the list
Each entry in the list includes the page title, a URL, a brief
summary, a size, a date, and the written language
Web Directories
Also called: catalogs, yellow pages, subject
directories
Hierarchical taxonomies that classify human
knowledge
The first level of these taxonomies ranges from 12 to 26 categories
Popular ones: Yahoo!, eBLAST, LookSmart, Magellan,
and Nacho
Most allow keyword searches
Category services: AltaVista Categories, AOL Netfind,
Excite Channels, HotBot, Infoseek, Lycos Subjects,
and WebCrawler Select.
The Most Popular Web Directories in 1998

Web directory    URL                    Web sites (thousands)   Categories
eBLAST           www.eblast.com         125                     N/A
LookSmart        www.looksmart.com      300                     24
Lycos Subjects   www.lycos.com          50                      N/A
Magellan         magellan.excite.com    60                      N/A
NewHoo           www.newhoo.com         100                     23
Netscape         search.netscape.com    N/A                     N/A
Search.com       www.search.com         N/A                     N/A
Snap             www.snap.com           N/A                     N/A
Yahoo!           www.yahoo.com          750                     N/A


Ranking
Ranking algorithms are not publicly available
Search engines do not allow access to the text,
only to their indices
Sometimes there are too many relevant pages for
a simple query
It is hard to compare the quality of ranking
between two search engines
PageRank, Anchor Text
PageRank
Used by WebQuery and Google
The equation:
PR(a) = q + (1 − q) · Σ_{i=1..N} PR(p_i) / C(p_i)
where q is the probability of a random jump (0.15 in the original
paper), p_1 … p_N are the pages that link to a, and C(p_i) is the
number of outgoing links of p_i (a minimal implementation is
sketched below)
Google ranks documents by simulating a user who browses
the Web at random
Google uses the Web's citation (link) graph: 518 million links
Google computes PageRank for 26 million pages in a few hours
Many pages point to the result page → high ranking
Some high-ranking pages point to the result page →
high ranking
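
A minimal Python sketch of iterating the equation above to a fixed
point. The three-page link graph is invented for illustration, and
q = 0.15 follows the original paper; a production system would also
handle dangling pages and test for convergence.

def pagerank(links, q=0.15, iterations=50):
    # links: page -> list of pages it links to. Every page in this
    # toy graph has at least one outgoing link, so C(p) is never 0.
    pages = list(links)
    pr = {p: 1.0 for p in pages}                      # initial rank
    out_degree = {p: len(t) for p, t in links.items()}  # C(p)
    for _ in range(iterations):
        new_pr = {}
        for a in pages:
            # Sum PR(p)/C(p) over every page p that links to a
            incoming = sum(pr[p] / out_degree[p]
                           for p in pages if a in links[p])
            new_pr[a] = q + (1 - q) * incoming
        pr = new_pr
    return pr

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}     # hypothetical graph
print(pagerank(links))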
Anchor Text
Most search engines associate the text of a
link with the page the link appears on
Google does the opposite: it associates the anchor text
with the page the link points to (illustrated below)
Advantages: anchors often give more accurate descriptions of
Web pages, and documents that cannot themselves be crawled
can still be indexed
259 million anchors
The idea originated with WWWW (the World Wide
Web Worm)
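
A tiny sketch of the distinction, using an invented link: a
conventional engine files the anchor text under the source page,
while the Google approach files it under the target page.

# Hypothetical link: home.html contains <a href="bio.html">Sunny Lam</a>
link = {"source": "home.html", "target": "bio.html", "anchor": "Sunny Lam"}

conventional_index = {}   # anchor text indexed under the page it appears on
google_style_index = {}   # anchor text indexed under the page it points to

conventional_index.setdefault(link["source"], []).append(link["anchor"])
google_style_index.setdefault(link["target"], []).append(link["anchor"])

print(conventional_index)  # {'home.html': ['Sunny Lam']}
print(google_style_index)  # {'bio.html': ['Sunny Lam']}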
Other Features
Keep track of location information for all
hits
Keep track of visual presentation (e.g.
font size of words)
Web Crawlers
Software agents that traverse the Web, sending new or updated
pages to a main server where they are indexed
Also called robots, spiders, worms, wanderers, walkers, and
knowbots
The first crawler, the Wanderer, was developed in 1993
Most crawlers have not been publicly described
A crawler runs on a local machine and sends requests to remote Web
servers
The most fragile part of a search engine
Pages are traversed in breadth-first or depth-first manner
(sketched after this list)
Must avoid crawling the same pages again
Web pages change dynamically
Invalid links: 2% to 9%
The fastest crawlers are able to traverse up to 10 million pages per day
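
A minimal breadth-first sketch of the traversal and the
duplicate-avoidance check just listed; fetch_links is a hypothetical
stand-in for real HTTP fetching and link extraction.

from collections import deque

def crawl(seed_urls, fetch_links, max_pages=1000):
    # fetch_links(url) is assumed to download `url` and return the
    # URLs it links to; real crawlers add politeness and retries.
    seen = set(seed_urls)        # avoid crawling the same page twice
    queue = deque(seed_urls)     # FIFO queue gives breadth-first order
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

# Toy usage with a hypothetical in-memory "web":
fake_web = {"/a": ["/b", "/c"], "/b": ["/c"], "/c": []}
print(crawl(["/a"], lambda url: fake_web.get(url, [])))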
Google Crawler
A fast, distributed crawling system
How does it work?
Peak speed: over 100 pages/sec, about 600 KB per second, with 4
crawlers
Uses a DNS cache to avoid repeated DNS lookups
Each connection can be in one of four states:
Looking up DNS
Connecting to host
Sending request
Receiving response
Crawling problems
Internet Archive
Uses multiple machines
Each crawler is a single thread
Each crawler is assigned 64 sites (one possible assignment
scheme is sketched after this list)
No site is assigned to more than one crawler
Each crawler reads a list of URLs into per-site queues
Each crawler uses asynchronous I/O to fetch pages
from these queues in parallel
Each crawler extracts the links inside the downloaded
pages
The crawler assigns links to the appropriate site queues
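
One simple way to satisfy "no site is assigned to more than one
crawler" is to hash host names to crawler IDs deterministically. This
is a sketch under that assumption, not the Archive's published scheme;
Python's built-in hash() is avoided because it is salted per process.

def crawler_for_site(host, num_crawlers):
    # Deterministic hash: every URL from the same host always maps
    # to the same crawler, so no site is shared between crawlers.
    return sum(ord(c) for c in host) % num_crawlers

# Hypothetical hosts distributed over 3 crawlers:
for host in ["example.com", "archive.org", "stanford.edu"]:
    print(host, "->", crawler_for_site(host, 3))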
Mercator
Named after the Flemish cartographer
Mercator
Developed by Compaq
Written in Java
Scalable: can scale up to the entire Web (has
fetched tens of millions of Web documents)
Extensible: designed in a modular way; third
parties can add new functionality
Indices
Search engines use inverted files
An inverted file is a sorted list of words (the vocabulary)
Each word points to the related pages
A short description is associated with each pointer
About 500 bytes per description and pointer
Answers are kept in memory to speed up queries
Compression reduces the inverted files to about 30% of their size
A single-keyword search uses binary search over the sorted word list
Multiple-keyword searches perform one binary search per keyword
independently, then combine the results (sketched after this list)
How phrase search is implemented has not been made public
Phrase search looks for words that appear near each other
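
A toy sketch of the scheme just described, using invented documents.
A real engine binary-searches a sorted vocabulary on disk; this sketch
uses an in-memory dictionary for brevity.

docs = {                                     # hypothetical page texts
    "p1.html": "web search engines index the web",
    "p2.html": "inverted files speed up web search",
}

inverted = {}                                # word -> set of pages containing it
for page, text in docs.items():
    for word in set(text.split()):
        inverted.setdefault(word, set()).add(page)

def lookup(*words):
    # One lookup per keyword, then intersect the per-word page sets,
    # mirroring "one search per keyword ... then combine the results".
    sets = [inverted.get(w, set()) for w in words]
    return set.intersection(*sets) if sets else set()

print(lookup("web", "search"))               # both pages match
print(lookup("inverted", "index"))           # empty set: no page has both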
Metasearchers
A Web server that takes a query from
the user and sends it to several sources
Collects the answers from these sources
Returns a unified result to the user (sketched below)
Able to sort by host, keyword, date, and
popularity
Can run on the client machine as well
The number of sources is adjustable
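
A sketch of that fan-out-and-merge loop; the two stub "engines" below
are invented stand-ins for real sources.

def metasearch(query, sources):
    # Send `query` to every source, collect the answers,
    # and return one de-duplicated, unified list.
    merged, seen = [], set()
    for source in sources:               # fan the query out
        for url in source(query):        # collect each source's answer
            if url not in seen:          # unify: drop duplicates
                seen.add(url)
                merged.append(url)
    return merged

# Hypothetical stub sources standing in for real engines:
engine_a = lambda q: ["http://a.example/1", "http://shared.example/x"]
engine_b = lambda q: ["http://shared.example/x", "http://b.example/2"]
print(metasearch("web search", [engine_a, engine_b]))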
Metasearchers in 1998

Metasearcher   URL                     Sources used
C4             www.c4.com              14
Dogpile        www.dogpile.com         25
Highway61      www.highway61.com       5
InFind         www.infind.com          6
Mamma          www.mamma.com           7
MetaCrawler    www.metacrawler.com     7
MetaMiner      www.miner.uol.com.br    13
Local Find     local.find.com          N/A

Inquirus
Developed by NEC Research Institute
Downloads and analyzes Web pages
Displays each page with the query terms
highlighted, in a progressive manner
Discards pages that no longer exist
Not publicly available
Savvy Search
Available in 1997, but no longer running
Goal #1: maximize the likelihood of returning
good links
Goal #2: minimize computational and Web
resource consumption
Determines which search engines to contact
and in what order
Ranks search engines based on the query terms
and on each engine's past performance
STARTS
Stanford Protocol Proposal for Internet
Retrieval and Search
Supported by 11 companies
Facilitates the task of querying multiple
document sources
1. Choose the best sources to evaluate a query
2. Submit the query at these sources
3. Merge the query results from these sources
STARTS Protocol
The query-language problem
The rank-merging problem (one strategy is sketched below)
The source-metadata problem
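
STARTS leaves the merging policy to the implementer; below is one
hedged sketch of a rank-merging strategy (normalize each source's
scores to its own maximum, then order by best normalized score).
All names and data are illustrative assumptions.

def merge_ranked(result_lists):
    # Each input list holds (url, score) pairs from one source and is
    # assumed to be non-empty; scores from different sources are not
    # directly comparable, so normalize per source first.
    best = {}
    for results in result_lists:
        top = max(score for _, score in results) or 1.0
        for url, score in results:
            norm = score / top                 # make scores comparable
            best[url] = max(best.get(url, 0.0), norm)
    return sorted(best, key=best.get, reverse=True)

# Hypothetical scored results from two sources:
src1 = [("http://x.example", 10.0), ("http://y.example", 4.0)]
src2 = [("http://y.example", 0.9), ("http://z.example", 0.5)]
print(merge_ranked([src1, src2]))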
Add-on Tools: Alexa
Free: www.alexa.com
Appears as a toolbar in IE 5.x
Provides useful information about the sites being visited
Allows users to browse related sites
Performs searches within the current Web site, related sites, or the
whole Web
Supports online shopping
Reports site popularity
Reports speed of access
Reports freshness
Reports overall quality as rated by Alexa users
Future Work
1. Provide better information filtering
2. Pose queries more visually
3. New techniques to traverse the Web due to Web’s growth
4. New techniques to increase efficiency
5. Better ranking algorithms
6. Algorithms that choose which pages to index
7. Techniques to find dynamic pages which are created on demand
8. Techniques to avoid searching for duplicated data
9. Techniques to search multimedia documents on the Web
10. Friendly user interfaces
11. Standard protocol to query search engines
12. Web mining
13. Development of reliable and secure intranets
Conclusion
