Lect 02 - Crawling (Part A)

The document discusses the challenges faced by web search engines, including the vast size of the web, the complexity of user queries, and the evolution of search engine technology from simple metadata usage to advanced algorithms that understand user intent. It outlines the structure of search engines, the process of crawling, and the importance of prioritizing URLs based on various metrics to ensure efficient and effective data retrieval. Additionally, it highlights the dynamic nature of the web and the necessity for continuous updates to maintain relevance in search results.

Web search engines

Paolo Ferragina
Dipartimento di Informatica
Università di Pisa
Two main difficulties

The Web: extracting "significant data" is difficult!
- Size: more than 1 trillion pages
- Language and encodings: hundreds…
- Distributed authorship: SPAM, format-less pages, …
- Dynamic: in one year 35% of pages survive, 20% stay untouched

The User: matching "user needs" is difficult!
- Query composition: short (2.5 terms on average) and imprecise
- Query results: 85% of users look at just one result page
- Several kinds of needs: informational, navigational, transactional
Evolution of Search Engines

- Zero generation (1991-..: Wanderer) -- use metadata added by users
- First generation (1995-1997: AltaVista, Excite, Lycos, etc.) -- use only on-page, web-text data
  - Word frequency and language
- Second generation (1998: Google) -- use off-page, web-graph data
  - Link (or connectivity) analysis
  - Anchor text (how people refer to a page)
- Third generation (Google, Yahoo, MSN, ASK, …) -- answer "the need behind the query"
  - Focus on the "user need", rather than on the query
  - Integrate multiple data sources
  - Click-through data
- Fourth and current generation -- Information Supply
The structure of a search engine

[Diagram: the Web is fetched by a Crawler (which must decide "Which pages to visit next?") and stored in a Page archive; a Page analyzer and an Indexer build the text index and auxiliary structures; at query time, a Query resolver and a Ranker produce the results.]
The web graph: properties

Paolo Ferragina
Dipartimento di Informatica
Università di Pisa
The Web's Characteristics

It's a graph whose size is huge:
- More than 1 trillion pages are available
- About 50 billion pages crawled (as of 09/15)
- 5-40 KB per page => terabytes and terabytes of data
- Size grows every day!!
- Actually the web is infinite… (e.g., calendars generate unbounded pages)

It's a dynamic graph:
- 8% new pages and 25% new links change weekly
- Life time of a page is about 10 days
The Bow Tie
Crawling

Paolo Ferragina
Dipartimento di Informatica
Università di Pisa
Spidering

- 24h, 7 days "walking" over the web graph
- Recall that:
  - Directed graph G = (N, E)
  - N changes (insert, delete): >> 50 billion nodes
  - E changes (insert, delete): > 10 links per node
- 10 × 50 × 10^9 = 500 billion entries in the posting lists
- Many more if we also consider the word positions in every document where a word occurs.
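As a quick sanity check of that estimate, using only the figures quoted above (a trivial sketch):

# Back-of-the-envelope check of the slide's figures
nodes = 50 * 10**9              # >> 50 billion nodes (pages)
links_per_node = 10             # > 10 links per node
print(nodes * links_per_node)   # 500000000000, i.e. 500 billion posting-list entries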
Crawling Issues

How to crawl?
- Quality: "best" pages first
- Efficiency: avoid duplication (or near-duplication)
- Etiquette: robots.txt, server-load concerns (minimize load; see the sketch below)
- Malicious pages: spam pages, spider traps (including dynamically generated ones)

How much to crawl, and thus index?
- Coverage: how big is the Web? How much do we cover?
- Relative coverage: how much do competitors have?

How often to crawl?
- Freshness: how much has changed?
- Frequency: commonly insert a time gap between requests to the same host
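One way to honor the etiquette point is sketched below, using Python's standard urllib.robotparser; the user-agent name and the fixed delay are assumptions, not part of the slides:

import time
import urllib.robotparser
from urllib.parse import urlparse

USER_AGENT = "TeachingCrawler"   # hypothetical user-agent name
DEFAULT_DELAY = 2.0              # assumed politeness gap in seconds

robots_cache = {}                # host -> RobotFileParser
last_hit = {}                    # host -> time of last request

def allowed(url):
    """Check the robots.txt of the URL's host (cached per host)."""
    host = urlparse(url).netloc
    rp = robots_cache.get(host)
    if rp is None:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"http://{host}/robots.txt")
        try:
            rp.read()
        except OSError:
            pass                 # robots.txt could not be fetched; a real crawler would retry
        robots_cache[host] = rp
    return rp.can_fetch(USER_AGENT, url)

def wait_politely(url):
    """Sleep so that successive requests to the same host are spaced out."""
    host = urlparse(url).netloc
    elapsed = time.time() - last_hit.get(host, 0.0)
    if elapsed < DEFAULT_DELAY:
        time.sleep(DEFAULT_DELAY - elapsed)
    last_hit[host] = time.time()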
Sec. 20.2

Crawling picture

[Diagram: starting from seed URLs, the crawler maintains a URL frontier; URLs are crawled and parsed, their pages fetched from the Web, while the rest remains the unseen Web.]
Sec. 20.1.1

Updated crawling picture

[Diagram: seed pages feed the URL frontier; multiple crawling threads take URLs from the frontier, crawl and parse them, and push newly discovered URLs back, progressively exploring the unseen Web.]
Crawler "life cycle": a small example

[Diagram: Link Extractors read pages from the Page Repository (PR) and insert the extracted links into the Priority Queue (PQ); the Crawler Manager pops the highest-priority URLs from PQ and places them in the Assigned Repository (AR); Downloaders fetch the URLs in AR and store the downloaded pages back into PR.]
One Link Extractor per page:

while(<Page Repository is not empty>){
  <take a page p (check if it is new)>
  <extract links contained in p within href>
  <extract links contained in javascript>
  <extract …>
  <insert these links into the Priority Queue>
}

One Downloader per page:

while(<Assigned Repository is not empty>){
  <extract url u>
  <download page(u)>
  <send page(u) to the Page Repository>
  <store page(u) in a proper archive, possibly compressed>
}

One single Crawler Manager:

while(<Priority Queue is not empty>){
  <extract some URLs u having the highest priority>
  foreach u extracted {
    if ( (u ∉ "Already Seen Pages") ||
         (u ∈ "Already Seen Pages" && <u's version on the Web is more recent>) ){
      <resolve u wrt DNS>
      <send u to the Assigned Repository>
    }
  }
}
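For concreteness, the pseudocode above could be fleshed out as the following minimal, single-process Python sketch. The queue names mirror the slide (PQ, AR, PR), while the regular-expression link extractor, the depth-based priority function and the in-memory "already seen" set are simplifying assumptions:

import re
import socket
import urllib.request
from queue import PriorityQueue, Queue
from urllib.parse import urljoin, urlparse

PQ = PriorityQueue()        # Priority Queue of (priority, url)
AR = Queue()                # Assigned Repository: URLs ready to download
PR = Queue()                # Page Repository: (url, html) pairs to analyze
already_seen = set()        # "Already Seen Pages" (fingerprints would be used at scale)

HREF_RE = re.compile(r'href="([^"#]+)"')   # crude href extractor (assumption)

def priority(url):
    """Hypothetical priority: shallower URLs first (lower value = higher priority)."""
    return urlparse(url).path.count("/")

def link_extractor():
    """Take a page from PR, extract its links, push them into PQ."""
    url, html = PR.get()
    for link in HREF_RE.findall(html):
        absolute = urljoin(url, link)
        PQ.put((priority(absolute), absolute))

def crawler_manager():
    """Pop the best URL from PQ, resolve its host, assign it for download."""
    _, url = PQ.get()
    if url not in already_seen:          # (re-crawling of stale copies omitted here)
        already_seen.add(url)
        socket.gethostbyname(urlparse(url).netloc)   # resolve u wrt DNS
        AR.put(url)

def downloader():
    """Download an assigned URL and hand the page back to PR (and to an archive)."""
    url = AR.get()
    with urllib.request.urlopen(url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    PR.put((url, html))                  # a real crawler would also archive it, compressed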
URL frontier visiting

Given a page P, define how "good" P is.

Several metrics (via priority assignment):
- BFS, DFS, Random
- Popularity driven (PageRank, full vs partial)
- Topic driven or focused crawling
- Combined

How to quickly check whether a URL is new? (See the sketch below.)
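One common way to make that check fast (the slide leaves the technique open, so treat this as an assumption) is to store a compact fingerprint of every URL met so far and test membership before inserting into the frontier:

import hashlib

seen_fingerprints = set()   # at web scale this would be a disk-backed / distributed structure

def normalize(url):
    """Very rough normalization (assumption): drop the fragment and a trailing slash."""
    return url.split("#", 1)[0].rstrip("/")

def is_new(url):
    """Return True (and remember the URL) only the first time a URL is seen."""
    fp = hashlib.sha1(normalize(url).encode("utf-8")).digest()[:8]  # 64-bit fingerprint
    if fp in seen_fingerprints:
        return False
    seen_fingerprints.add(fp)
    return True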
Sec. 20.2.3

Mercator

1. Only one connection is open at a time to any host;
2. a waiting time of a few seconds occurs between successive requests to the same host;
3. high-priority pages are crawled preferentially.

[Diagram: incoming URLs pass through a Prioritizer into K front queues (K priorities); a biased front-queue selector routes them to B back queues (politeness), with a single host on each back queue; a back queue selector, based on a min-heap whose priority is the waiting time, serves the crawl thread requesting a URL.]
Sec. 20.2.3

Front queues

Front queues manage prioritization:
- The Prioritizer assigns to a URL an integer priority between 1 and K (based on refresh rate, quality, application-specific criteria)
- It then appends the URL to the corresponding queue, according to its priority (see the sketch below)

[Diagram: the Prioritizer (assigns to a queue) feeds front queues 1 … K; from there, URLs are routed to the back queues.]
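A minimal sketch of this front-queue side, assuming K = 8 priority levels and a made-up scoring function (the actual criteria are not specified on the slide):

from collections import deque

K = 8                                         # number of priority levels (assumption)
front_queues = [deque() for _ in range(K)]    # front_queues[0] holds priority 1 (highest)

def prioritize(url, refresh_rate, quality):
    """Hypothetical prioritizer: map refresh rate and quality to an integer 1..K."""
    score = 0.7 * refresh_rate + 0.3 * quality     # weights are arbitrary assumptions
    return max(1, min(K, K - int(score * K)))      # higher score -> smaller (better) priority

def add_to_front(url, refresh_rate=0.5, quality=0.5):
    p = prioritize(url, refresh_rate, quality)
    front_queues[p - 1].append(url)                # append URL to the queue of its priority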


Sec. 20.2.3

Back queues

Back queues enforce politeness:
- Each back queue is kept non-empty
- Each back queue contains only URLs from a single host
- On a back-queue request, select a front queue randomly, biasing towards higher-priority queues

[Diagram: back queues 1 … B feed the URL selector (pick min, parse and push) through the min-heap.]
Sec. 20.2.3

The min-heap

- It contains one entry per back queue
- The entry is the earliest time t_e at which the host corresponding to the back queue can be "hit again"
- This earliest time is determined from:
  - the last access to that host
  - any time-buffer heuristic we choose
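For example, the heap entry for a back queue could be computed as follows (a sketch: the fixed time gap and Python's heapq module are assumptions; only the "last access plus a buffer" rule comes from the slide):

import heapq
import time

TIME_GAP = 2.0        # politeness buffer between hits to the same host (assumption)
heap = []             # min-heap of (earliest_hit_time, back_queue_id)

def reschedule(back_queue_id, last_access):
    """Push the earliest time t_e at which this back queue's host may be hit again."""
    t_e = last_access + TIME_GAP
    heapq.heappush(heap, (t_e, back_queue_id))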
Sec. 20.2.3

The crawl thread

A crawler thread seeks a URL to crawl:
- It extracts the root of the heap: this identifies a back queue q, and the thread removes the URL at q's head
- It waits the indicated time t_url
- It parses the URL and adds its out-links to the front queues
- If back queue q becomes empty, it pulls a URL v from some front queue (with higher probability for higher-priority queues):
  - if there is already a back queue for v's host, it appends v to it and repeats until q is no longer empty;
  - else, it makes q the back queue for v's host
- If back queue q is non-empty, it picks its head URL and adds it to the min-heap with priority = waiting time t_url

Keep crawl threads busy (B = 3 × K). A sketch of this loop follows below.
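Putting the last few slides together, here is a minimal single-threaded sketch of this loop; it reuses the hypothetical front_queues, add_to_front, heap and reschedule pieces sketched above, and the host-to-back-queue bookkeeping is an assumption:

import heapq
import random
import time
from collections import deque
from urllib.parse import urlparse

B = 3 * K                                  # number of back queues, sizing suggested on the slide
back_queues = [deque() for _ in range(B)]  # each back queue holds URLs of a single host
host_of = {}                               # back_queue_id -> host currently assigned to it

def refill(q):
    """Refill empty back queue q from the front queues (bias towards higher priorities)."""
    while not back_queues[q]:
        candidates = [i for i, fq in enumerate(front_queues) if fq]
        if not candidates:
            return
        weights = [K - i for i in candidates]          # smaller index = higher priority
        v = front_queues[random.choices(candidates, weights)[0]].popleft()
        host = urlparse(v).netloc
        owner = next((b for b, h in host_of.items() if h == host), None)
        if owner is not None:
            back_queues[owner].append(v)   # host already has a back queue: append and retry
        else:
            host_of[q] = host              # q becomes the back queue for v's host
            back_queues[q].append(v)

def crawl_step(fetch_and_parse):
    """One iteration of the crawl thread: pop the heap, wait, fetch, refill, reschedule."""
    t_e, q = heapq.heappop(heap)           # root of the heap: back queue q is due at t_e
    time.sleep(max(0.0, t_e - time.time()))
    url = back_queues[q].popleft()
    for link in fetch_and_parse(url):      # caller downloads the page and returns out-links
        add_to_front(link)
    if not back_queues[q]:
        refill(q)
    if back_queues[q]:
        reschedule(q, last_access=time.time())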
