Lect 02 - Crawling (Part A)

The document discusses the challenges faced by web search engines, including the vast size of the web, the complexity of user queries, and the evolution of search engine technology from simple metadata usage to advanced algorithms that understand user intent. It outlines the structure of search engines, the process of crawling, and the importance of prioritizing URLs based on various metrics to ensure efficient and effective data retrieval. Additionally, it highlights the dynamic nature of the web and the necessity for continuous updates to maintain relevance in search results.

Web search engines

Paolo Ferragina
Dipartimento di Informatica
Università di Pisa
Two main difficulties

The Web: extracting "significant data" is difficult!
- Size: more than 1 trillion pages
- Language and encodings: hundreds…
- Distributed authorship: SPAM, format-less pages, …
- Dynamic: in one year 35% of pages survive, 20% stay untouched

The User: matching "user needs" is difficult!
- Query composition: short (2.5 terms on average) and imprecise
- Query results: 85% of users look at just one result page
- Several kinds of needs: informational, navigational, transactional
Evolution of Search Engines

- Zero generation (1991-..: Wanderer) -- use metadata added by users
- First generation (1995-1997: AltaVista, Excite, Lycos, etc.) -- use only on-page, web-text data
  - Word frequency and language
- Second generation (1998: Google) -- use off-page, web-graph data
  - Link (or connectivity) analysis
  - Anchor text (how people refer to a page)
- Third generation (Google, Yahoo, MSN, ASK, …) -- answer "the need behind the query"
  - Focus on the "user need", rather than on the query
  - Integrate multiple data sources
  - Click-through data
- Fourth and current generation -- Information Supply
The structure of a search engine

[Diagram: the Web is fetched by a Crawler (which must decide "Which pages to visit next?") and stored in a Page archive; a Page analyzer and an Indexer build the text index and auxiliary structures; at query time, a Query resolver and a Ranker produce the results.]
The web graph: properties

Paolo Ferragina
Dipartimento di Informatica
Università di Pisa
The Web's Characteristics

It's a graph whose size is huge:
- More than 1 trillion pages are available
- About 50 billion pages crawled (as of 09/15)
- 5-40 KB per page => terabytes and terabytes of data
- Size grows every day!!
- Actually the web is infinite… (e.g., calendars generate unbounded pages)

It's a dynamic graph:
- 8% new pages and 25% new links change weekly
- Life time of a page is about 10 days
The Bow Tie
Crawling

Paolo Ferragina
Dipartimento di Informatica
Università di Pisa
Spidering

- 24h, 7 days "walking" over the web graph
- Recall that:
  - Directed graph G = (N, E)
  - N changes (insert, delete): >> 50 billion nodes
  - E changes (insert, delete): > 10 links per node
- 10 × 50 × 10^9 = 500 billion entries in the posting lists
- Many more if we also consider the word positions in every document where a word occurs.
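As a quick sanity check of that estimate, using only the figures quoted above (a trivial sketch):

# Back-of-the-envelope check of the slide's figures
nodes = 50 * 10**9              # >> 50 billion nodes (pages)
links_per_node = 10             # > 10 links per node
print(nodes * links_per_node)   # 500000000000, i.e. 500 billion posting-list entries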
Crawling Issues

How to crawl?
- Quality: "best" pages first
- Efficiency: avoid duplication (or near-duplication)
- Etiquette: robots.txt, server-load concerns (minimize load; see the sketch below)
- Malicious pages: spam pages, spider traps (including dynamically generated ones)

How much to crawl, and thus index?
- Coverage: how big is the Web? How much do we cover?
- Relative coverage: how much do competitors have?

How often to crawl?
- Freshness: how much has changed?
- Frequency: commonly insert a time gap between requests to the same host
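One way to honor the etiquette point is sketched below, using Python's standard urllib.robotparser; the user-agent name and the fixed delay are assumptions, not part of the slides:

import time
import urllib.robotparser
from urllib.parse import urlparse

USER_AGENT = "TeachingCrawler"   # hypothetical user-agent name
DEFAULT_DELAY = 2.0              # assumed politeness gap in seconds

robots_cache = {}                # host -> RobotFileParser
last_hit = {}                    # host -> time of last request

def allowed(url):
    """Check the robots.txt of the URL's host (cached per host)."""
    host = urlparse(url).netloc
    rp = robots_cache.get(host)
    if rp is None:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"http://{host}/robots.txt")
        try:
            rp.read()
        except OSError:
            pass                 # robots.txt could not be fetched; a real crawler would retry
        robots_cache[host] = rp
    return rp.can_fetch(USER_AGENT, url)

def wait_politely(url):
    """Sleep so that successive requests to the same host are spaced out."""
    host = urlparse(url).netloc
    elapsed = time.time() - last_hit.get(host, 0.0)
    if elapsed < DEFAULT_DELAY:
        time.sleep(DEFAULT_DELAY - elapsed)
    last_hit[host] = time.time()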
Sec. 20.2

Crawling picture

[Diagram: starting from seed URLs, the crawler maintains a URL frontier; URLs are crawled and parsed, their pages fetched from the Web, while the rest remains the unseen Web.]
Sec. 20.1.1

Updated crawling picture

[Diagram: seed pages feed the URL frontier; multiple crawling threads take URLs from the frontier, crawl and parse them, and push newly discovered URLs back, progressively exploring the unseen Web.]
Crawler "life cycle": a small example

[Diagram: Link Extractors read pages from the Page Repository (PR) and insert the extracted links into the Priority Queue (PQ); the Crawler Manager pops the highest-priority URLs from PQ and places them in the Assigned Repository (AR); Downloaders fetch the URLs in AR and store the downloaded pages back into PR.]
One Link Extractor per page:

while(<Page Repository is not empty>){
  <take a page p (check if it is new)>
  <extract links contained in p within href>
  <extract links contained in javascript>
  <extract …>
  <insert these links into the Priority Queue>
}

One Downloader per page:

while(<Assigned Repository is not empty>){
  <extract url u>
  <download page(u)>
  <send page(u) to the Page Repository>
  <store page(u) in a proper archive, possibly compressed>
}

One single Crawler Manager:

while(<Priority Queue is not empty>){
  <extract some URLs u having the highest priority>
  foreach u extracted {
    if ( (u ∉ "Already Seen Pages") ||
         (u ∈ "Already Seen Pages" && <u's version on the Web is more recent>) ){
      <resolve u wrt DNS>
      <send u to the Assigned Repository>
    }
  }
}
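For concreteness, the pseudocode above could be fleshed out as the following minimal, single-process Python sketch. The queue names mirror the slide (PQ, AR, PR), while the regular-expression link extractor, the depth-based priority function and the in-memory "already seen" set are simplifying assumptions:

import re
import socket
import urllib.request
from queue import PriorityQueue, Queue
from urllib.parse import urljoin, urlparse

PQ = PriorityQueue()        # Priority Queue of (priority, url)
AR = Queue()                # Assigned Repository: URLs ready to download
PR = Queue()                # Page Repository: (url, html) pairs to analyze
already_seen = set()        # "Already Seen Pages" (fingerprints would be used at scale)

HREF_RE = re.compile(r'href="([^"#]+)"')   # crude href extractor (assumption)

def priority(url):
    """Hypothetical priority: shallower URLs first (lower value = higher priority)."""
    return urlparse(url).path.count("/")

def link_extractor():
    """Take a page from PR, extract its links, push them into PQ."""
    url, html = PR.get()
    for link in HREF_RE.findall(html):
        absolute = urljoin(url, link)
        PQ.put((priority(absolute), absolute))

def crawler_manager():
    """Pop the best URL from PQ, resolve its host, assign it for download."""
    _, url = PQ.get()
    if url not in already_seen:          # (re-crawling of stale copies omitted here)
        already_seen.add(url)
        socket.gethostbyname(urlparse(url).netloc)   # resolve u wrt DNS
        AR.put(url)

def downloader():
    """Download an assigned URL and hand the page back to PR (and to an archive)."""
    url = AR.get()
    with urllib.request.urlopen(url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    PR.put((url, html))                  # a real crawler would also archive it, compressed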
URL frontier visiting

Given a page P, define how "good" P is.

Several metrics (via priority assignment):
- BFS, DFS, Random
- Popularity driven (PageRank, full vs partial)
- Topic driven or focused crawling
- Combined

How to quickly check whether a URL is new? (See the sketch below.)
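One common way to make that check fast (the slide leaves the technique open, so treat this as an assumption) is to store a compact fingerprint of every URL met so far and test membership before inserting into the frontier:

import hashlib

seen_fingerprints = set()   # at web scale this would be a disk-backed / distributed structure

def normalize(url):
    """Very rough normalization (assumption): drop the fragment and a trailing slash."""
    return url.split("#", 1)[0].rstrip("/")

def is_new(url):
    """Return True (and remember the URL) only the first time a URL is seen."""
    fp = hashlib.sha1(normalize(url).encode("utf-8")).digest()[:8]  # 64-bit fingerprint
    if fp in seen_fingerprints:
        return False
    seen_fingerprints.add(fp)
    return True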
Sec. 20.2.3

Mercator

1. Only one connection is open at a time to any host;
2. a waiting time of a few seconds occurs between successive requests to the same host;
3. high-priority pages are crawled preferentially.

[Diagram: incoming URLs pass through a Prioritizer into K front queues (K priorities); a biased front-queue selector routes them to B back queues (politeness), with a single host on each back queue; a back queue selector, based on a min-heap whose priority is the waiting time, serves the crawl thread requesting a URL.]
Sec. 20.2.3

Front queues

Front queues manage prioritization:
- The Prioritizer assigns to a URL an integer priority between 1 and K (based on refresh rate, quality, application-specific criteria)
- It then appends the URL to the corresponding queue, according to its priority (see the sketch below)

[Diagram: the Prioritizer (assigns to a queue) feeds front queues 1 … K; from there, URLs are routed to the back queues.]
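A minimal sketch of this front-queue side, assuming K = 8 priority levels and a made-up scoring function (the actual criteria are not specified on the slide):

from collections import deque

K = 8                                         # number of priority levels (assumption)
front_queues = [deque() for _ in range(K)]    # front_queues[0] holds priority 1 (highest)

def prioritize(url, refresh_rate, quality):
    """Hypothetical prioritizer: map refresh rate and quality to an integer 1..K."""
    score = 0.7 * refresh_rate + 0.3 * quality     # weights are arbitrary assumptions
    return max(1, min(K, K - int(score * K)))      # higher score -> smaller (better) priority

def add_to_front(url, refresh_rate=0.5, quality=0.5):
    p = prioritize(url, refresh_rate, quality)
    front_queues[p - 1].append(url)                # append URL to the queue of its priority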


Sec. 20.2.3

Back queues

Back queues enforce politeness:
- Each back queue is kept non-empty
- Each back queue contains only URLs from a single host
- On a back-queue request, select a front queue randomly, biasing towards higher-priority queues

[Diagram: back queues 1 … B feed the URL selector (pick min, parse and push) through the min-heap.]
Sec. 20.2.3

The min-heap

- It contains one entry per back queue
- The entry is the earliest time t_e at which the host corresponding to the back queue can be "hit again"
- This earliest time is determined from:
  - the last access to that host
  - any time-buffer heuristic we choose
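For example, the heap entry for a back queue could be computed as follows (a sketch: the fixed time gap and Python's heapq module are assumptions; only the "last access plus a buffer" rule comes from the slide):

import heapq
import time

TIME_GAP = 2.0        # politeness buffer between hits to the same host (assumption)
heap = []             # min-heap of (earliest_hit_time, back_queue_id)

def reschedule(back_queue_id, last_access):
    """Push the earliest time t_e at which this back queue's host may be hit again."""
    t_e = last_access + TIME_GAP
    heapq.heappush(heap, (t_e, back_queue_id))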
Sec. 20.2.3

The crawl thread

A crawler thread seeks a URL to crawl:
- It extracts the root of the heap: this identifies a back queue q, and the thread removes the URL at q's head
- It waits the indicated time t_url
- It parses the URL and adds its out-links to the front queues
- If back queue q becomes empty, it pulls a URL v from some front queue (with higher probability for higher-priority queues):
  - if there is already a back queue for v's host, it appends v to it and repeats until q is no longer empty;
  - else, it makes q the back queue for v's host
- If back queue q is non-empty, it picks its head URL and adds it to the min-heap with priority = waiting time t_url

Keep crawl threads busy (B = 3 × K). A sketch of this loop follows below.
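Putting the last few slides together, here is a minimal single-threaded sketch of this loop; it reuses the hypothetical front_queues, add_to_front, heap and reschedule pieces sketched above, and the host-to-back-queue bookkeeping is an assumption:

import heapq
import random
import time
from collections import deque
from urllib.parse import urlparse

B = 3 * K                                  # number of back queues, sizing suggested on the slide
back_queues = [deque() for _ in range(B)]  # each back queue holds URLs of a single host
host_of = {}                               # back_queue_id -> host currently assigned to it

def refill(q):
    """Refill empty back queue q from the front queues (bias towards higher priorities)."""
    while not back_queues[q]:
        candidates = [i for i, fq in enumerate(front_queues) if fq]
        if not candidates:
            return
        weights = [K - i for i in candidates]          # smaller index = higher priority
        v = front_queues[random.choices(candidates, weights)[0]].popleft()
        host = urlparse(v).netloc
        owner = next((b for b, h in host_of.items() if h == host), None)
        if owner is not None:
            back_queues[owner].append(v)   # host already has a back queue: append and retry
        else:
            host_of[q] = host              # q becomes the back queue for v's host
            back_queues[q].append(v)

def crawl_step(fetch_and_parse):
    """One iteration of the crawl thread: pop the heap, wait, fetch, refill, reschedule."""
    t_e, q = heapq.heappop(heap)           # root of the heap: back queue q is due at t_e
    time.sleep(max(0.0, t_e - time.time()))
    url = back_queues[q].popleft()
    for link in fetch_and_parse(url):      # caller downloads the page and returns out-links
        add_to_front(link)
    if not back_queues[q]:
        refill(q)
    if back_queues[q]:
        reschedule(q, last_access=time.time())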
