Lec. 6
Inverted Index
 The most common data structure used in both database
management and Information Retrieval Systems is the inverted file
structure.
 Inverted file structures are composed of three basic files: the
document file, the inversion lists (sometimes called posting files),
and the dictionary.
 The name “inverted file” comes from its underlying methodology of
storing an inversion of the documents: for each word, the list of
documents in which that word occurs is stored (the inversion list for
that word).
 Each document in the system is given a unique numerical identifier;
it is this identifier that is stored in the inversion list.
 The inversion list for a particular word is located via the
dictionary.
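As a minimal illustration (the two tiny documents below are hypothetical, not from the lecture), an inversion list maps each word to the identifiers of the documents containing it:

```python
# Minimal sketch of inversion lists: each word maps to the list of
# document identifiers in which it occurs.
docs = {1: "brutus killed caesar", 2: "noble brutus hath told caesar"}

inversion_lists = {}
for doc_id, text in docs.items():
    for word in text.split():
        inversion_lists.setdefault(word, [])
        if doc_id not in inversion_lists[word]:
            inversion_lists[word].append(doc_id)
```

Here `inversion_lists["brutus"]` is `[1, 2]`: the inversion list for "brutus".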
Inverted Index
• The Dictionary is typically a sorted list of all
unique words (processing tokens) in the
system, together with a pointer to the location
of each word’s inversion list.
• Dictionaries can also store other information
used in query optimization, such as the lengths
of the inversion lists.
Inverted Index Construction
• Building Steps
– Collect documents
– Text Preprocessing
– Construct an inverted index with dictionary and
postings
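The three building steps can be sketched in Python; the tiny document set and the lowercase-and-split preprocessing below are simplifying assumptions, not the lecture’s full pipeline:

```python
# Sketch of the building steps: collect, preprocess, construct.
def build_index(docs):
    # Steps 1-2: collect documents and preprocess text into tokens
    pairs = []
    for doc_id, text in docs.items():
        for token in text.lower().split():
            pairs.append((token, doc_id))
    # Step 3: construct dictionary and postings from sorted pairs
    index = {}
    for token, doc_id in sorted(pairs):
        postings = index.setdefault(token, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    return index

index = build_index({1: "I did enact Julius Caesar", 2: "Caesar was ambitious"})
```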
•The indexer first produces a sequence of (modified
token, document ID) pairs.
Indexer Steps
I did enact Julius
Caesar I was killed
i' the Capitol;
Brutus killed me.
Doc 1
So let it be with
Caesar. The noble
Brutus hath told you
Caesar was ambitious
Doc 2
Term Doc #
I 1
did 1
enact 1
julius 1
caesar 1
I 1
was 1
killed 1
i' 1
the 1
capitol 1
brutus 1
killed 1
me 1
so 2
let 2
it 2
be 2
with 2
caesar 2
the 2
noble 2
brutus 2
hath 2
told 2
you 2
caesar 2
was 2
ambitious 2
•Sort by terms.
Term Doc #
ambitious 2
be 2
brutus 1
brutus 2
capitol 1
caesar 1
caesar 2
caesar 2
did 1
enact 1
hath 2
I 1
I 1
i' 1
it 2
julius 1
killed 1
killed 1
let 2
me 1
noble 2
so 2
the 1
the 2
told 2
you 2
was 1
was 2
with 2
Core indexing step.
•Multiple term entries
in a single document
are merged.
•Frequency information
is added.
Term Doc # Term freq
ambitious 2 1
be 2 1
brutus 1 1
brutus 2 1
capitol 1 1
caesar 1 1
caesar 2 2
did 1 1
enact 1 1
hath 2 1
I 1 2
i' 1 1
it 2 1
julius 1 1
killed 1 2
let 2 1
me 1 1
noble 2 1
so 2 1
the 1 1
the 2 1
told 2 1
you 2 1
was 1 1
was 2 1
with 2 1
•The result is split into a Dictionary file and a Postings file.
Term N docs Coll freq
ambitious 1 1
be 1 1
brutus 2 2
capitol 1 1
caesar 2 3
did 1 1
enact 1 1
hath 1 1
I 1 2
i' 1 1
it 1 1
julius 1 1
killed 1 2
let 1 1
me 1 1
noble 1 1
so 1 1
the 2 2
told 1 1
you 1 1
was 2 2
with 1 1
Term Doc # Freq
ambitious 2 1
be 2 1
brutus 1 1
brutus 2 1
capitol 1 1
caesar 1 1
caesar 2 2
did 1 1
enact 1 1
hath 2 1
I 1 2
i' 1 1
it 2 1
julius 1 1
killed 1 2
let 2 1
me 1 1
noble 2 1
so 2 1
the 1 1
the 2 1
told 2 1
you 2 1
was 1 1
was 2 1
with 2 1
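The sort, merge, and split steps traced in the tables above can be sketched in Python; the punctuation stripping and lowercasing here are simplifying assumptions standing in for real text preprocessing:

```python
from collections import Counter

def sort_merge_split(docs):
    # (token, doc id) pairs, lowercased and stripped of punctuation,
    # as in the example tables
    pairs = [(tok.strip(".,;").lower(), d)
             for d, text in docs.items() for tok in text.split()]
    pairs.sort()                         # core indexing step: sort by term
    postings = Counter(pairs)            # merge duplicates: (term, doc) -> freq
    dictionary = {}                      # term -> (n docs, collection freq)
    for (term, doc), freq in sorted(postings.items()):
        n_docs, coll = dictionary.get(term, (0, 0))
        dictionary[term] = (n_docs + 1, coll + freq)
    return dictionary, dict(postings)

docs = {1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
        2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious"}
dictionary, postings = sort_merge_split(docs)
```

This reproduces the tables: e.g. `dictionary["caesar"]` is `(2, 3)` (2 docs, collection frequency 3) and `postings[("killed", 1)]` is `2`.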
Inverted File: An Example
• Documents are parsed to extract
tokens.
• These are saved with the Document ID.
Now is the time
for all good men
to come to the aid
of their country
Doc 1
It was a dark and
stormy night in
the country
manor. The time
was past midnight
Doc 2
Term Doc #
now 1
is 1
the 1
time 1
for 1
all 1
good 1
men 1
to 1
come 1
to 1
the 1
aid 1
of 1
their 1
country 1
it 2
was 2
a 2
dark 2
and 2
stormy 2
night 2
in 2
the 2
country 2
manor 2
the 2
time 2
was 2
past 2
midnight 2
• After all documents
have been parsed
the inverted file is
sorted
alphabetically.
Term Doc #
a 2
aid 1
all 1
and 2
come 1
country 1
country 2
dark 2
for 1
good 1
in 2
is 1
it 2
manor 2
men 1
midnight 2
night 2
now 1
of 1
past 2
stormy 2
the 1
the 1
the 2
the 2
their 1
time 1
time 2
to 1
to 1
was 2
was 2
• Multiple term
entries for a
single document
are merged.
• Within-document
term frequency
information is
compiled.
Term Doc # Freq
a 2 1
aid 1 1
all 1 1
and 2 1
come 1 1
country 1 1
country 2 1
dark 2 1
for 1 1
good 1 1
in 2 1
is 1 1
it 2 1
manor 2 1
men 1 1
midnight 2 1
night 2 1
now 1 1
of 1 1
past 2 1
stormy 2 1
the 1 2
the 2 2
their 1 1
time 1 1
time 2 1
to 1 2
was 2 2
• Finally, the file can be split into
– A Dictionary or Lexicon file
and
– A Postings file
Dictionary/Lexicon Postings
Term Doc # Freq
a 2 1
aid 1 1
all 1 1
and 2 1
come 1 1
country 1 1
country 2 1
dark 2 1
for 1 1
good 1 1
in 2 1
is 1 1
it 2 1
manor 2 1
men 1 1
midnight 2 1
night 2 1
now 1 1
of 1 1
past 2 1
stormy 2 1
the 1 2
the 2 2
their 1 1
time 1 1
time 2 1
to 1 2
was 2 2
Doc # Freq
2 1
1 1
1 1
2 1
1 1
1 1
2 1
2 1
1 1
1 1
2 1
1 1
2 1
2 1
1 1
2 1
2 1
1 1
1 1
2 1
2 1
1 2
2 2
1 1
1 1
2 1
1 2
2 2
Term N docs Tot Freq
a 1 1
aid 1 1
all 1 1
and 1 1
come 1 1
country 2 2
dark 1 1
for 1 1
good 1 1
in 1 1
is 1 1
it 1 1
manor 1 1
men 1 1
midnight 1 1
night 1 1
now 1 1
of 1 1
past 1 1
stormy 1 1
the 2 4
their 1 1
time 2 2
to 1 2
was 1 2
Final Results
Inverted indexes
• Permit fast search for individual terms
• For each term, you get a list consisting of:
– document ID
– frequency of term in doc (optional)
– position of term in doc (optional)
• These lists can be used to solve Boolean queries:
• country -> d1, d2
• manor -> d2
• country AND manor -> d2
• Also used for statistical ranking algorithms
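A Boolean AND over two sorted posting lists is typically answered by a merge-style intersection; the sketch below (illustrative, using the country/manor postings from this slide) walks both lists in step:

```python
def intersect(p1, p2):
    # Merge-style intersection of two sorted posting lists (doc id lists)
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

index = {"country": [1, 2], "manor": [2]}   # postings from the example
result = intersect(index["country"], index["manor"])   # -> [2]
```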
Web search
Web Searching: Architecture
Documents stored on many Web servers are indexed in a single central index.
(This is similar to a union catalog.)
• The central index is implemented as a single system running on a very large
number of computers.
• Examples: Google, Yahoo!
Web Challenges for IR
• Distributed Data: Documents spread over millions
of different web servers.
• Volatile Data: Many documents change or
disappear rapidly (e.g. dead links).
• Large Volume: Billions of separate documents.
• Unstructured and Redundant Data: No uniform
structure, HTML errors, up to 30% (near) duplicate
documents.
• Quality of Data: No editorial control, false
information, poor quality writing, typos, etc.
• Heterogeneous Data: Multiple media types (images,
video), languages, character sets, etc.
What is a Web Crawler?
Web Crawler
• A program for downloading web pages.
• Given an initial set of seed URLs, it recursively
downloads every page that is linked from pages
in the set.
• A focused web crawler downloads only those
pages whose content satisfies some criterion.
Also known as a web spider
What’s wrong with the simple crawler?
Scale: we need to distribute the crawl.
Coverage: we can’t index everything, so we need to subselect. How?
Duplicates: we need to integrate duplicate detection.
Spam and spider traps: we need to integrate spam detection.
Politeness: we need to be “nice” and space out requests to a site
over a longer period (hours, days).
Freshness: we need to recrawl periodically.
Because of the size of the web, we can do frequent recrawls only for
a small subset; again, this is a subselection or prioritization problem.
What a crawler must do
Be robust
 Be immune to spider traps, duplicates, very large pages, very
large websites, dynamic pages, etc.
Be polite
• Don’t hit a site too often
• Only crawl pages you are allowed to crawl: robots.txt
Robots.txt
A protocol for giving crawlers (“robots”) limited access to a website, originally
from 1994.
Examples:
User-agent: *
Disallow: /yoursite/temp/
User-agent: searchengine
Disallow: /
Important: cache the robots.txt file of each site we are crawling
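Python’s standard library can parse rules like these; the sketch below feeds the example file to `urllib.robotparser` directly (the user-agent name `mybot` and the `example.com` URLs are hypothetical):

```python
from urllib.robotparser import RobotFileParser

# Parse the example rules directly; normally you would fetch and cache
# each site's /robots.txt once via rp.set_url(...) and rp.read().
rules = """\
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

allowed = rp.can_fetch("mybot", "https://example.com/page.html")
blocked = rp.can_fetch("mybot", "https://example.com/yoursite/temp/x")
```

Here `mybot` falls under the `*` group, so only `/yoursite/temp/` is off limits; the agent `searchengine` is excluded from the whole site.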
Robot Exclusion
• Web sites and pages can specify that robots
should not crawl/index certain areas.
• Two components:
– Robots Exclusion Protocol: Site wide specification
of excluded directories.
– Robots META Tag: Individual document tag to
exclude indexing or following links.
Robots Exclusion Protocol
• Site administrator puts a “robots.txt” file at
the root of the host’s web directory.
– http://www.ebay.com/robots.txt
– http://www.cnn.com/robots.txt
• File is a list of excluded directories for a given
robot (user-agent).
– Exclude all robots from the entire site:
– User-agent: *
Disallow: /
Robot Exclusion Protocol Examples
Exclude specific directories:
User-agent: *
Disallow: /tmp/
Disallow: /cgi-bin/
Disallow: /users/paranoid/
Exclude a specific robot:
User-agent: GoogleBot
Disallow: /
Spiders (Robots/Bots/Crawlers)
• Start with a comprehensive set of root URLs
from which to begin the search.
• Recursively follow all links on these pages to
find additional pages.
• Index all newly found pages in an inverted
index as they are encountered.
• May allow users to submit pages directly to be
indexed (and crawled from).
Search Strategies
Breadth-first Search
In graph theory, breadth-first search (BFS) is a graph search algorithm that begins
at the root node and explores all the neighboring nodes. Then for each of those
nearest nodes, it explores their unexplored neighbor nodes, and so on, until it
finds the goal.
Search Strategies (cont)
Depth-first Search
Depth-first search (DFS) on a binary tree is a specialized case of DFS on a
general graph. The search proceeds down paths, from left to right: the root is
examined first, then the left child of the root, then the left child of that node,
and so on until a leaf is reached. At a leaf, the search backtracks to the deepest
node with an unexplored right child and repeats.
Search Strategy Trade-Offs
• Breadth-first explores uniformly outward
from the root page but requires memory of all
nodes on the previous level (exponential in
depth). This is the standard spidering method.
• Depth-first requires memory of only depth
times branching factor (linear in depth) but
can get “lost” pursuing a single thread.
• Both strategies can be implemented with a frontier
list of links (URLs): a FIFO queue for breadth-first,
a LIFO stack for depth-first.
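The queue-versus-stack point can be made concrete: the sketch below crawls a small hypothetical link graph, and the only difference between the two strategies is which end of the frontier list is popped.

```python
from collections import deque

def crawl_order(graph, root, breadth_first=True):
    # The frontier is a FIFO queue for BFS (pop from the front) or a
    # LIFO stack for DFS (pop from the back); everything else is identical.
    frontier, visited, order = deque([root]), {root}, []
    while frontier:
        node = frontier.popleft() if breadth_first else frontier.pop()
        order.append(node)
        for nbr in graph[node]:
            if nbr not in visited:
                visited.add(nbr)
                frontier.append(nbr)
    return order

graph = {"A": ["B", "C"], "B": ["D"], "C": [], "D": []}
bfs = crawl_order(graph, "A", True)    # ['A', 'B', 'C', 'D']
dfs = crawl_order(graph, "A", False)   # ['A', 'C', 'B', 'D']
```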
Spidering Algorithm
Initialize queue Q with the initial set of known URLs.
Until Q is empty or a page or time limit is exhausted:
  Pop URL L from the front of Q.
  If L is not an HTML page (.gif, .jpeg, .ps, .pdf, .ppt, …), continue the loop.
  If L has already been visited, continue the loop.
  Download page P for L.
  If P cannot be downloaded (e.g. 404 error, robot excluded), continue the loop.
  Index P (e.g. add to the inverted index or store a cached copy).
  Parse P to obtain a list of new links N.
  Append N to the end of Q.
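A minimal Python rendering of this loop, with `fetch` and `parse_links` supplied as stand-in functions (the toy link graph below is hypothetical):

```python
from collections import deque

SKIP_EXTENSIONS = (".gif", ".jpeg", ".ps", ".pdf", ".ppt")

def spider(seed_urls, fetch, parse_links, page_limit=100):
    """Sketch of the spidering loop: fetch(url) returns page text or None
    (e.g. on a 404 or robot exclusion); parse_links(page) returns new URLs."""
    queue = deque(seed_urls)
    visited, indexed = set(), {}
    while queue and len(indexed) < page_limit:
        url = queue.popleft()                 # pop URL L from front of Q
        if url.endswith(SKIP_EXTENSIONS):     # skip non-HTML resources
            continue
        if url in visited:                    # skip already-visited URLs
            continue
        visited.add(url)
        page = fetch(url)                     # download page P for L
        if page is None:                      # 404, robot excluded, ...
            continue
        indexed[url] = page                   # index P (store a cached copy)
        queue.extend(parse_links(page))       # append new links N to end of Q
    return indexed

# Toy run on a hypothetical link graph ("d" simulates a dead link)
links = {"a": ["b", "c.pdf", "d"], "b": ["a"], "d": []}
fetch = lambda u: None if u == "d" else "page-" + u
parse_links = lambda page: links.get(page[5:], [])
indexed = spider(["a"], fetch, parse_links)
```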
Restricting Spidering
• Restrict spider to a particular site.
– Remove links to other sites from Q.
• Restrict spider to a particular directory.
– Remove links not in the specified directory.
• Obey page-owner restrictions (robot
exclusion).
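Restricting the spider to a particular site amounts to filtering links by host before they enter Q; a sketch using the standard `urlparse` (the `example.com` links are hypothetical):

```python
from urllib.parse import urlparse

def same_site(url, site_host):
    # Keep only links whose host matches the target site; links to
    # other sites are removed from the queue.
    return urlparse(url).netloc == site_host

links = ["https://example.com/a", "https://other.org/b", "https://example.com/c"]
queue = [u for u in links if same_site(u, "example.com")]
```

Restricting to a directory works the same way, comparing `urlparse(url).path` against the allowed prefix.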