Lect 02-Crawling Part a
Paolo Ferragina
Dipartimento di Informatica
Università di Pisa
Two main difficulties
The Web: extracting "significant data" is difficult!!
[Figure: search-engine architecture. The Web feeds the Crawler, whose key question is "which pages to visit next?"; a Page Analyzer and an Indexer build the text, structure, and auxiliary indexes; queries go through the Query Resolver and the Ranker.]
The web graph: properties
The Web’s Characteristics
It’s a graph whose size is huge:
more than 1 trillion pages are available
50 billion pages crawled (as of 09/2015)
5-40 KB per page => terabytes & terabytes of data
Size grows every day!!
Actually the web is infinite… dynamically generated pages (e.g., calendars) go on forever
Spidering
24h, 7 days “walking” over the web graph
Recall that:
Directed graph G = (N, E)
N changes (insert, delete) >> 50 billion nodes
E changes (insert, delete) > 10 links per node
10 links/node × 50×10⁹ nodes = 500 billion entries in the posting lists
Many more if we also consider the word positions in every document where each term occurs.
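As a quick sanity check of these numbers, a minimal back-of-the-envelope sketch; the 20 KB average page size is an assumed midpoint of the 5-40 KB range above, not a figure from the slide:

from the slide's estimates (09/2015):

    # Back-of-the-envelope figures from the slide (09/2015 estimates).
    pages = 50 * 10**9            # pages crawled
    avg_page_size = 20 * 1024     # bytes; assumed midpoint of 5-40 KB
    out_degree = 10               # > 10 links per node

    raw_text = pages * avg_page_size      # raw text to store
    postings = pages * out_degree         # entries in the link posting lists

    print(f"raw text: ~{raw_text / 10**12:.0f} TB")       # terabytes & terabytes
    print(f"postings: {postings / 10**9:.0f} billion entries")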
Crawling Issues
How to crawl?
Quality: “Best” pages first
Efficiency: Avoid duplication (or near duplication)
Etiquette: robots.txt, server-load concerns (minimize load); see the sketch after this list
Malicious pages: spam pages, spider traps, including dynamically generated ones
How much to crawl, and thus index?
Coverage: How big is the Web? How much do we cover?
Relative Coverage: How much do competitors have?
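On the etiquette point, Python's standard library already parses robots.txt; a minimal sketch, where the crawler name and target URL are made up for illustration:

    from urllib.robotparser import RobotFileParser

    # Hypothetical crawler identity and target; replace with real values.
    USER_AGENT = "MyCrawler/0.1"
    url = "https://example.com/some/page.html"

    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()                                  # fetch and parse robots.txt

    if rp.can_fetch(USER_AGENT, url):
        print("allowed to fetch", url)
    else:
        print("robots.txt forbids", url)

    # Some servers also declare a Crawl-delay; honor it if present.
    delay = rp.crawl_delay(USER_AGENT)         # None if not specified
    print("crawl-delay:", delay)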
Crawling picture
[Figure: seed pages feed the URL frontier; a crawling thread picks URLs from the frontier; fetched URLs are crawled and parsed; beyond them lies the unseen Web.]
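The picture maps directly onto the basic crawl loop; a minimal single-threaded sketch, where fetch_and_parse is a hypothetical helper that downloads a page and returns its out-links:

    from collections import deque

    def crawl(seeds, fetch_and_parse, max_pages=1000):
        """Basic crawl loop: seeds feed the URL frontier; each fetched
        page contributes new, unseen URLs back to the frontier."""
        frontier = deque(seeds)     # the URL frontier (plain FIFO here)
        seen = set(seeds)           # URLs already crawled or enqueued
        while frontier and len(seen) <= max_pages:
            url = frontier.popleft()
            for link in fetch_and_parse(url):   # hypothetical: fetch page, extract links
                if link not in seen:            # avoid (exact) duplication
                    seen.add(link)
                    frontier.append(link)

A real crawler replaces the FIFO frontier with the prioritized, polite structure described next.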
A small example: Mercator
[Figure: the Crawler Manager (with its PQ and AR structures) hands URLs to a Prioritizer, which appends each one to one of K front queues (K priorities); a biased front-queue selector moves URLs to B back queues, a single host on each; a back-queue selector, based on a min-heap whose priority is the waiting time, serves the crawl thread requesting a URL.]
The scheme guarantees politeness:
1. only one connection is open at a time to any host;
2. a waiting time of a few seconds occurs between successive requests to the same host;
3. high-priority pages are crawled preferentially.
Sec. 20.2.3
Front queues
Front queues manage prioritization:
The prioritizer assigns to each URL an integer priority between 1 and K, based on refresh rate, quality, or application-specific criteria
It then appends the URL to the corresponding queue, according to priority
[Figure: the Prioritizer assigns each incoming URL to one of the front queues 1 … K.]
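A minimal sketch of this front-queue logic, assuming (as a convention, not stated on the slide) that priority 1 is the highest and that priority() encapsulates the refresh/quality/application-specific criteria:

    from collections import deque

    K = 3                                   # number of front queues; illustrative
    front_queues = [deque() for _ in range(K)]

    def priority(url):
        # Hypothetical prioritizer: refresh rate, quality, application-specific.
        return 1                            # stub: every URL gets top priority

    def enqueue_front(url):
        p = priority(url)                   # integer priority in 1..K
        front_queues[p - 1].append(url)     # append to the corresponding queue

    def biased_front_selector():
        # Scan from highest to lowest priority, so high-priority pages are
        # picked preferentially; a truly "biased" selector would sometimes
        # pick lower queues to avoid starvation, which this sketch omits.
        for q in front_queues:
            if q:
                return q.popleft()
        return None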
Back queues
Back queues enforce politeness:
Each back queue is kept non-empty
Each back queue contains only URLs from a single host
[Figure: back queues 1 … B feed a min-heap; the URL selector picks the min, parses, and pushes.]
Sec. 20.2.3
The min-heap
It contains one entry per back queue
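Putting the back-queue side together: a minimal sketch with one heap entry per back queue, keyed by the earliest time its host may be contacted again; the 2-second gap is an illustrative value for the slide's "few seconds":

    import heapq
    import time

    B = 2            # number of back queues (one host each); illustrative
    DELAY = 2.0      # seconds between requests to the same host; illustrative

    back_queues = {b: [] for b in range(B)}    # back queue b -> URLs of its host
    heap = [(0.0, b) for b in range(B)]        # one entry per back queue:
    heapq.heapify(heap)                        # (earliest contact time, queue id)

    def next_url():
        # Pop the back queue whose host becomes available first (the heap min).
        ready_at, b = heapq.heappop(heap)
        wait = ready_at - time.monotonic()
        if wait > 0:
            time.sleep(wait)                   # enforce the per-host waiting time
        url = back_queues[b].pop(0) if back_queues[b] else None
        # Re-insert the queue with its updated earliest contact time.
        heapq.heappush(heap, (time.monotonic() + DELAY, b))
        return url

In the full Mercator scheme, an emptied back queue is refilled from the front queues with the URLs of a new single host, keeping every back queue non-empty; the sketch omits that refill step.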