Introduction to Information Retrieval
Ch 19 Web Search Basics
Tasks:
Individual lab / 3 tasks
Search-engine-related hands-on lab
URL frontier exercise question
Turn-in: solution document to ANGEL
Sec. 19.4.1
[Figure: anatomy of a web search engine. A web spider crawls the Web, the indexer builds the document indexes and ad indexes, and user searches run against both. The results page shows paid search ads alongside the algorithmic results, e.g. an ad such as "CG Appliance Express, Discount Appliances (650) 756-3931, Same Day Certified Installation, www.cgappliance.com, San Francisco-Oakland-San Jose, CA" next to algorithmic hits such as "Miele: Welcome to Miele, the home of the very best appliances and kitchens in the world. www.miele.com / www.miele.co.uk".]
Sec. 19.4.1
User Needs
Need [Brod02, RL04]
Informational – want to learn about something (~40% / 65%), e.g., low hemoglobin
Navigational – want to go to that page (~25% / 15%), e.g., United Airlines
Transactional – want to do something, web-mediated (~35% / 20%)
Access a service: Seattle weather
Downloads: Mars surface images
Shop: Canon S410
Gray areas
Car rental Brasil
Find a good hub
Exploratory search: “see what’s there”
Sec. 19.2
http://www.ariadne.ac.uk/issue10/search-engines/
Sec. 19.2.2
Cloaking
Serve fake content to the search engine spider
DNS cloaking: switch IP address; impersonate
[Diagram: the cloaking server asks “Is this a search engine spider?”; if yes, it serves the SPAM document; if no, it serves the real document.]
Sec. 19.5
(1/2) * Size A = (1/6) * Size B
⇒ Size A / Size B = (1/6) / (1/2) = 1/3
Sampling URLs
Ideal strategy: generate a random URL and check for containment in each index.
Problem: random URLs are hard to find!
4 approaches are discussed in Ch 19
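The overlap arithmetic above can be packaged as a tiny estimator. This is a minimal sketch; the function name and use of `Fraction` are illustrative choices, not from the chapter:

```python
from fractions import Fraction

def size_ratio(frac_a_in_b, frac_b_in_a):
    """Both fractions estimate the same overlap:
    frac_a_in_b * |A| = |A ∩ B| = frac_b_in_a * |B|,
    hence |A| / |B| = frac_b_in_a / frac_a_in_b."""
    return Fraction(frac_b_in_a) / Fraction(frac_a_in_b)

# The slide's numbers: 1/2 of A's sample is in B, 1/6 of B's sample is in A.
print(size_ratio(Fraction(1, 2), Fraction(1, 6)))  # → 1/3
```

Note that this only yields the *ratio* of index sizes; neither absolute size is recovered.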
Sec. 19.5
1. Random searches
Choose random searches extracted from a local log
Use only queries with small result sets.
Count normalized URLs in result sets.
Use ratio statistics
Sec. 19.5
1. Random searches
575 & 1050 queries from the NEC RI employee logs
6 engines in 1998, 11 in 1999
Implementation:
Restricted to queries with < 600 results in total
Counted URLs from each engine after verifying query match
Computed size ratio & overlap for individual queries
Estimated index size ratio & overlap by averaging over all queries
Sec. 19.5
2. Random IP addresses
Generate random IP addresses
Find a web server at the given address
If there is one:
Collect all pages from that server
From these, choose a page at random
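The first step of this procedure can be sketched with the standard library. The helper name is mine, and the actual probing (an HTTP request to each drawn address with a short timeout) is deliberately omitted:

```python
import ipaddress
import random

def random_public_ipv4(rng=random):
    """Draw uniform 32-bit values until one is a globally routable
    unicast address (skips private, loopback, multicast, reserved)."""
    while True:
        addr = ipaddress.IPv4Address(rng.getrandbits(32))
        if addr.is_global and not addr.is_multicast:
            return addr

# A crawler would then issue an HTTP request to str(random_public_ipv4())
# and keep only addresses whose server answers with crawlable content.
```

Rejection sampling is fine here because the vast majority of the 2^32 address space is acceptable, so few draws are wasted.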
Sec. 19.5
2. Random IP addresses
HTTP requests to random IP addresses
Ignored: empty, authorization required, or excluded servers
[Lawr99] Estimated 2.8 million IP addresses running crawlable web servers (16 million total) from observing 2500 servers
OCLC, using IP sampling, found 8.7 M hosts in 2001
Netcraft [Netc02] accessed 37.2 million hosts in July 2002
[Lawr99] exhaustively crawled 2500 servers and extrapolated
Estimated size of the web to be 800 million pages
Sec. 19.5
3. Random walks
View the Web as a directed graph
Build a random walk on this graph
Includes various “jump” rules back to visited sites
Does not get stuck in spider traps!
Can follow all links!
Converges to a stationary distribution
Must assume graph is finite and independent of the walk
Conditions are not satisfied (cookie crumbs, flooding)
Time to convergence not really known
Sample from the stationary distribution of the walk
Use the “strong query” method to check coverage by a search engine
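A toy version of such a walk can be simulated on a small graph; the three-node graph, the jump probability, and the function name below are illustrative assumptions, but the mechanics (follow a random out-link, occasionally jump to escape traps, tally visit frequencies) are the ones described above:

```python
import random
from collections import Counter

# A tiny directed graph as adjacency lists (hypothetical example).
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

def random_walk(graph, steps, jump_prob=0.15, seed=0):
    """Walk the graph; with probability jump_prob (or at a dead end)
    jump to a uniformly random node so the walk cannot get stuck.
    Normalized visit counts approximate the stationary distribution."""
    rng = random.Random(seed)
    nodes = list(graph)
    node = rng.choice(nodes)
    visits = Counter()
    for _ in range(steps):
        if rng.random() < jump_prob or not graph[node]:
            node = rng.choice(nodes)        # teleport / dead-end escape
        else:
            node = rng.choice(graph[node])  # follow a random out-link
        visits[node] += 1
    return {n: visits[n] / steps for n in nodes}

dist = random_walk(graph, steps=100_000)
```

Here node C, which both A and B link to, ends up with the largest stationary weight; this is exactly the popularity bias one must correct for when sampling pages this way.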
Sec. 19.5
4. Random queries
Generate random query: how?
Not from an English dictionary
Lexicon: 400,000+ words from a web crawl
Conjunctive queries: w1 AND w2
e.g., vocalists AND rsi
Get top-100 result URLs from engine A
Choose a random URL as the candidate to check for presence in engine B
Use 6-8 low-frequency terms as the conjunctive query
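Generating such a query can be sketched as below. The miniature lexicon and frequency cutoff are made-up stand-ins for the 400,000+-word crawl lexicon:

```python
import random

# Hypothetical lexicon with crawl frequencies; a real one would hold 400k+ words.
lexicon = {"vocalists": 12, "rsi": 9, "the": 1_000_000, "of": 900_000,
           "metasearch": 7, "crawler": 40, "indexer": 25}

def random_conjunctive_query(lexicon, k=2, max_freq=50, seed=1):
    """Pick k distinct low-frequency terms and join them with AND."""
    rng = random.Random(seed)
    rare = [w for w, f in lexicon.items() if f <= max_freq]
    return " AND ".join(rng.sample(rare, k))

print(random_conjunctive_query(lexicon))
```

Low-frequency conjunctions are used precisely so the result set is small enough to enumerate fully when checking a candidate URL against engine B.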
Sec. 19.6
Duplicate documents
The web is full of duplicated content
Strict duplicate detection = exact match
Not as common
But many, many cases of near duplicates
E.g., last-modified date the only difference between two copies of a page
[Figure: near-duplicate variants of a color TV image: color enhancement, size change, elongation, copied video frame]
Sec. 19.6
Duplicate/Near-Duplicate Detection
Computing Similarity
Features:
Segments of a document (natural or artificial breakpoints)
Shingles (word N-grams)
E.g., the 4-shingles of “a rose is a rose is a rose”:
a_rose_is_a
rose_is_a_rose
is_a_rose_is
a_rose_is_a
(a second document for comparison: “my rose is a rose is yours”)
Similarity measure between two docs (= sets of shingles)
Set intersection; specifically, Size_of_Intersection / Size_of_Union (Jaccard)
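Shingling and the intersection-over-union measure are direct to implement; the helper names here are mine:

```python
def shingles(text, k=4):
    """Return the set of word k-grams (shingles) of a text."""
    words = text.split()
    return {"_".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(s1, s2):
    """Size_of_Intersection / Size_of_Union for two shingle sets."""
    return len(s1 & s2) / len(s1 | s2)

d1 = shingles("a rose is a rose is a rose")
d2 = shingles("my rose is a rose is yours")
print(sorted(d1))       # the three distinct shingles of the first sentence
print(jaccard(d1, d2))  # → 0.4
```

Note the first sentence yields only three *distinct* shingles because a_rose_is_a and rose_is_a_rose each occur twice; shingle multiplicity is discarded in the set representation.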
Sec. 19.6
Sketch of a document
Create a “sketch vector” (of size ~200) for each document
Documents that share ≥ t (say 80%) corresponding vector elements are near duplicates
For doc D, sketch_D[i] is computed as follows:
Let f map all shingles in the universe to 0..2^m (e.g., f = fingerprinting)
Let π_i be a random permutation on 0..2^m
sketch_D[i] = MIN { π_i(f(s)) } over all shingles s in D
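A compact sketch of this construction, under one stated simplification: a salted SHA-1 fingerprint stands in for the composition π_i ∘ f (with m = 32), rather than true random permutations:

```python
import hashlib

def h(shingle, i):
    """Salted fingerprint standing in for pi_i(f(s)): maps a shingle to
    an integer in 0..2^32 - 1, differently for each sketch index i."""
    digest = hashlib.sha1(f"{i}:{shingle}".encode()).digest()
    return int.from_bytes(digest[:4], "big")

def sketch(shingle_set, size=200):
    """sketch[i] = MIN over shingles s of pi_i(f(s))."""
    return [min(h(s, i) for s in shingle_set) for i in range(size)]

def resemblance(sk1, sk2):
    """Fraction of agreeing positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sk1, sk2)) / len(sk1)

# The two "rose" sentences from the shingling example:
d1 = {"a_rose_is_a", "rose_is_a_rose", "is_a_rose_is"}
d2 = {"my_rose_is_a", "rose_is_a_rose", "is_a_rose_is", "a_rose_is_yours"}
est = resemblance(sketch(d1), sketch(d2))  # near the true Jaccard of 2/5
```

With ~200 positions the estimate has standard error around sqrt(0.4 * 0.6 / 200) ≈ 0.035, which is why a few hundred sketch elements suffice.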
Sec. 19.6
[Diagram: the Min-Hash technique. The shingles of Document 1 and Document 2 are each fingerprinted and permuted into 0..2^64; each document keeps the minimum value (labeled A and B) under each permutation.]
Sec. 19.6
Boolean shingle-presence matrix (rows = shingles, columns = documents):
C1 C2
 0  1
 1  0
 1  1
 0  0
 1  1
 0  1
Jaccard(C1,C2) = 2/5 = 0.4  (2 rows with both 1s, out of 5 rows with any 1)
Sec. 19.6
Key Observation
For columns Ci, Cj, there are four types of rows:
   Ci Cj
A:  1  1
B:  1  0
C:  0  1
D:  0  0
Claim (with A, B, C denoting the number of rows of each type):
Jaccard(Ci, Cj) = A / (A + B + C)
Sec. 19.6
“Min” Hashing
Randomly permute rows
Hash h(Ci) = index of the first row with a 1 in column Ci
Surprising Property:
P[ h(Ci) = h(Cj) ] = Jaccard(Ci, Cj)
Why?
Both equal |A| / (|A| + |B| + |C|)
Look down columns Ci, Cj until the first non-Type-D row:
h(Ci) = h(Cj) exactly when that row is Type A
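The property can be checked empirically by permuting rows many times and counting how often the two min-hashes agree; the columns below are the C1, C2 example from the boolean matrix above (true Jaccard 2/5):

```python
import random

def minhash(column, perm):
    """Permuted index of the first row holding a 1 in this column."""
    return min(perm[r] for r, bit in enumerate(column) if bit)

C1 = [0, 1, 1, 0, 1, 0]
C2 = [1, 0, 1, 0, 1, 1]  # Jaccard(C1, C2) = 2/5

rng = random.Random(0)
rows = list(range(len(C1)))
trials = 10_000
matches = 0
for _ in range(trials):
    order = rows[:]
    rng.shuffle(order)
    perm = {r: i for i, r in enumerate(order)}  # row -> permuted position
    if minhash(C1, perm) == minhash(C2, perm):
        matches += 1
print(matches / trials)  # ≈ 0.4
```

The agreement rate converges to the Jaccard coefficient because the first non-Type-D row under a random permutation is Type A with probability A / (A + B + C).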
Sec. 19.6
Example
Task: find near-duplicate pairs among D1, D2, and D3 using the similarity threshold >= 0.5

Index   C1 (D1)  C2 (D2)  C3 (D3)
1  R1      1        0        1
2  R2      0        1        1
3  R3      1        0        0
4  R4      1        0        1
5  R5      0        1        0
Sec. 19.6
Example
Signatures S1, S2, S3 (for D1, D2, D3); each entry is the index of the first row holding a 1 when rows are scanned in the permuted order:
Perm 1 = (12345): 1 2 1
Perm 2 = (54321): 4 5 4
Perm 3 = (34512): 3 5 4

Index   C1  C2  C3
1  R1    1   0   1
2  R2    0   1   1
3  R3    1   0   0
4  R4    1   0   1
5  R5    0   1   0

Similarities:
          1-2   1-3   2-3
Col-Col  0.00  0.50  0.25
Sig-Sig  0.00  0.67  0.00
(S2 = (2, 5, 5) and S3 = (1, 4, 4) agree in no position; only the pair 1-3 meets the 0.5 threshold.)
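The signature table can be reproduced mechanically. Each permutation is read as a row-scan order, and the signature entry is the (1-based) index of the first row holding a 1:

```python
# Columns C1..C3 (rows R1..R5) from the example matrix.
cols = {
    "C1": [1, 0, 1, 1, 0],
    "C2": [0, 1, 0, 0, 1],
    "C3": [1, 1, 0, 1, 0],
}
perms = [
    [1, 2, 3, 4, 5],  # Perm 1 = (12345)
    [5, 4, 3, 2, 1],  # Perm 2 = (54321)
    [3, 4, 5, 1, 2],  # Perm 3 = (34512)
]

def signature(col, perm):
    """Index of the first row (in perm's scan order) with a 1."""
    for r in perm:
        if col[r - 1]:
            return r

sigs = {c: [signature(v, p) for p in perms] for c, v in cols.items()}
print(sigs["C1"], sigs["C2"], sigs["C3"])
# [1, 4, 3] [2, 5, 5] [1, 4, 4]

def sig_sim(a, b):
    """Fraction of positions where two signatures agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

for pair in (("C1", "C2"), ("C1", "C3"), ("C2", "C3")):
    print(pair, sig_sim(sigs[pair[0]], sigs[pair[1]]))
```

With only three permutations the signature estimate is coarse (0, 1/3, or 2/3), which is why real sketches use a couple of hundred permutations.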
More resources
IIR Chapter 19
http://en.wikipedia.org/wiki/Locality-sensitive_hashing
http://people.csail.mit.edu/indyk/vldb99.ps