Lec. 6
Inverted Index
 The most common data structure used in both database
management and Information Retrieval Systems is the inverted file
structure.
 Inverted file structures are composed of three basic files: the
document file, the inversion lists (sometimes called posting files),
and the dictionary.
 The name “inverted file” comes from its underlying methodology of
storing an inversion of the documents: for each word, the list of
documents in which that word occurs is stored (the inversion list for
that word).
 Each document in the system is given a unique numerical identifier;
it is this identifier that is stored in the inversion list.
 The inversion list for a particular word is located via the
dictionary.
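As a minimal illustration (the two tiny documents below are hypothetical, not from the lecture), an inversion list maps each word to the identifiers of the documents containing it:

```python
# Minimal sketch of inversion lists: each word maps to the list of
# document identifiers in which it occurs.
docs = {1: "brutus killed caesar", 2: "noble brutus hath told caesar"}

inversion_lists = {}
for doc_id, text in docs.items():
    for word in text.split():
        inversion_lists.setdefault(word, [])
        if doc_id not in inversion_lists[word]:
            inversion_lists[word].append(doc_id)
```

Here `inversion_lists["brutus"]` is `[1, 2]`: the inversion list for "brutus".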
Inverted Index
• The Dictionary is typically a sorted list of all
unique words (processing tokens) in the
system, together with a pointer to the location
of each word’s inversion list.
• Dictionaries can also store other information
used in query optimization, such as the lengths
of the inversion lists.
Inverted Index Construction
• Building Steps
– Collect documents
– Text Preprocessing
– Construct an inverted index with dictionary and
postings
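The three building steps can be sketched in Python; the tiny document set and the lowercase-and-split preprocessing below are simplifying assumptions, not the lecture’s full pipeline:

```python
# Sketch of the building steps: collect, preprocess, construct.
def build_index(docs):
    # Steps 1-2: collect documents and preprocess text into tokens
    pairs = []
    for doc_id, text in docs.items():
        for token in text.lower().split():
            pairs.append((token, doc_id))
    # Step 3: construct dictionary and postings from sorted pairs
    index = {}
    for token, doc_id in sorted(pairs):
        postings = index.setdefault(token, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    return index

index = build_index({1: "I did enact Julius Caesar", 2: "Caesar was ambitious"})
```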
•The indexer first produces a sequence of (modified
token, document ID) pairs.
Indexer Steps
I did enact Julius
Caesar I was killed
i' the Capitol;
Brutus killed me.
Doc 1
So let it be with
Caesar. The noble
Brutus hath told you
Caesar was ambitious
Doc 2
Term Doc #
I 1
did 1
enact 1
julius 1
caesar 1
I 1
was 1
killed 1
i' 1
the 1
capitol 1
brutus 1
killed 1
me 1
so 2
let 2
it 2
be 2
with 2
caesar 2
the 2
noble 2
brutus 2
hath 2
told 2
you 2
caesar 2
was 2
ambitious 2
•Sort by terms.
Term Doc #
ambitious 2
be 2
brutus 1
brutus 2
capitol 1
caesar 1
caesar 2
caesar 2
did 1
enact 1
hath 2
I 1
I 1
i' 1
it 2
julius 1
killed 1
killed 1
let 2
me 1
noble 2
so 2
the 1
the 2
told 2
you 2
was 1
was 2
with 2
Core indexing step.
•Multiple term entries
in a single document
are merged.
•Frequency information
is added.
Term Doc # Term freq
ambitious 2 1
be 2 1
brutus 1 1
brutus 2 1
capitol 1 1
caesar 1 1
caesar 2 2
did 1 1
enact 1 1
hath 2 1
I 1 2
i' 1 1
it 2 1
julius 1 1
killed 1 2
let 2 1
me 1 1
noble 2 1
so 2 1
the 1 1
the 2 1
told 2 1
you 2 1
was 1 1
was 2 1
with 2 1
•The result is split into a Dictionary file and a Postings file.
Term N docs Coll freq
ambitious 1 1
be 1 1
brutus 2 2
capitol 1 1
caesar 2 3
did 1 1
enact 1 1
hath 1 1
I 1 2
i' 1 1
it 1 1
julius 1 1
killed 1 2
let 1 1
me 1 1
noble 1 1
so 1 1
the 2 2
told 1 1
you 1 1
was 2 2
with 1 1
Term Doc # Freq
ambitious 2 1
be 2 1
brutus 1 1
brutus 2 1
capitol 1 1
caesar 1 1
caesar 2 2
did 1 1
enact 1 1
hath 2 1
I 1 2
i' 1 1
it 2 1
julius 1 1
killed 1 2
let 2 1
me 1 1
noble 2 1
so 2 1
the 1 1
the 2 1
told 2 1
you 2 1
was 1 1
was 2 1
with 2 1
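The sort, merge, and split steps traced in the tables above can be sketched in Python; the punctuation stripping and lowercasing here are simplifying assumptions standing in for real text preprocessing:

```python
from collections import Counter

def sort_merge_split(docs):
    # (token, doc id) pairs, lowercased and stripped of punctuation,
    # as in the example tables
    pairs = [(tok.strip(".,;").lower(), d)
             for d, text in docs.items() for tok in text.split()]
    pairs.sort()                         # core indexing step: sort by term
    postings = Counter(pairs)            # merge duplicates: (term, doc) -> freq
    dictionary = {}                      # term -> (n docs, collection freq)
    for (term, doc), freq in sorted(postings.items()):
        n_docs, coll = dictionary.get(term, (0, 0))
        dictionary[term] = (n_docs + 1, coll + freq)
    return dictionary, dict(postings)

docs = {1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
        2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious"}
dictionary, postings = sort_merge_split(docs)
```

This reproduces the tables: e.g. `dictionary["caesar"]` is `(2, 3)` (2 docs, collection frequency 3) and `postings[("killed", 1)]` is `2`.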
Inverted File: An Example
• Documents are parsed to extract
tokens.
• These are saved with the Document ID.
Now is the time
for all good men
to come to the aid
of their country
Doc 1
It was a dark and
stormy night in
the country
manor. The time
was past midnight
Doc 2
Term Doc #
now 1
is 1
the 1
time 1
for 1
all 1
good 1
men 1
to 1
come 1
to 1
the 1
aid 1
of 1
their 1
country 1
it 2
was 2
a 2
dark 2
and 2
stormy 2
night 2
in 2
the 2
country 2
manor 2
the 2
time 2
was 2
past 2
midnight 2
• After all documents
have been parsed
the inverted file is
sorted
alphabetically.
Term Doc #
a 2
aid 1
all 1
and 2
come 1
country 1
country 2
dark 2
for 1
good 1
in 2
is 1
it 2
manor 2
men 1
midnight 2
night 2
now 1
of 1
past 2
stormy 2
the 1
the 1
the 2
the 2
their 1
time 1
time 2
to 1
to 1
was 2
was 2
• Multiple term
entries for a
single document
are merged.
• Within-document
term frequency
information is
compiled.
Term Doc # Freq
a 2 1
aid 1 1
all 1 1
and 2 1
come 1 1
country 1 1
country 2 1
dark 2 1
for 1 1
good 1 1
in 2 1
is 1 1
it 2 1
manor 2 1
men 1 1
midnight 2 1
night 2 1
now 1 1
of 1 1
past 2 1
stormy 2 1
the 1 2
the 2 2
their 1 1
time 1 1
time 2 1
to 1 2
was 2 2
• Finally, the file can be split into
– A Dictionary or Lexicon file
and
– A Postings file
Dictionary/Lexicon Postings
Term Doc # Freq
a 2 1
aid 1 1
all 1 1
and 2 1
come 1 1
country 1 1
country 2 1
dark 2 1
for 1 1
good 1 1
in 2 1
is 1 1
it 2 1
manor 2 1
men 1 1
midnight 2 1
night 2 1
now 1 1
of 1 1
past 2 1
stormy 2 1
the 1 2
the 2 2
their 1 1
time 1 1
time 2 1
to 1 2
was 2 2
Doc # Freq
2 1
1 1
1 1
2 1
1 1
1 1
2 1
2 1
1 1
1 1
2 1
1 1
2 1
2 1
1 1
2 1
2 1
1 1
1 1
2 1
2 1
1 2
2 2
1 1
1 1
2 1
1 2
2 2
Term N docs Tot Freq
a 1 1
aid 1 1
all 1 1
and 1 1
come 1 1
country 2 2
dark 1 1
for 1 1
good 1 1
in 1 1
is 1 1
it 1 1
manor 1 1
men 1 1
midnight 1 1
night 1 1
now 1 1
of 1 1
past 1 1
stormy 1 1
the 2 4
their 1 1
time 2 2
to 1 2
was 1 2
Final Results
Inverted indexes
• Permit fast search for individual terms
• For each term, you get a list consisting of:
– document ID
– frequency of term in doc (optional)
– position of term in doc (optional)
• These lists can be used to solve Boolean queries:
• country -> d1, d2
• manor -> d2
• country AND manor -> d2
• Also used for statistical ranking algorithms
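A Boolean AND over two sorted posting lists is typically answered by a merge-style intersection; the sketch below (illustrative, using the country/manor postings from this slide) walks both lists in step:

```python
def intersect(p1, p2):
    # Merge-style intersection of two sorted posting lists (doc id lists)
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

index = {"country": [1, 2], "manor": [2]}   # postings from the example
result = intersect(index["country"], index["manor"])   # -> [2]
```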
Web search
Web Searching: Architecture
Documents stored on many Web servers are indexed in a single central index.
(This is similar to a union catalog.)
• The central index is implemented as a single system running on a very large
number of computers.
• Examples: Google, Yahoo!
Web Challenges for IR
• Distributed Data: Documents spread over millions
of different web servers.
• Volatile Data: Many documents change or
disappear rapidly (e.g. dead links).
• Large Volume: Billions of separate documents.
• Unstructured and Redundant Data: No uniform
structure, HTML errors, up to 30% (near) duplicate
documents.
• Quality of Data: No editorial control, false
information, poor quality writing, typos, etc.
• Heterogeneous Data: Multiple media types (images,
video), languages, character sets, etc.
What is a Web Crawler?
Web Crawler
• A program for downloading web pages.
• Given an initial set of seed URLs, it recursively
downloads every page that is linked from pages
in the set.
• A focused web crawler downloads only those
pages whose content satisfies some criterion.
Also known as a web spider
What’s wrong with the simple crawler?
Scale: we need to distribute the crawl.
Coverage: we can’t index everything, so we need to subselect. How?
Duplicates: we need to integrate duplicate detection.
Spam and spider traps: we need to integrate spam detection.
Politeness: we need to be “nice” and space out requests to a site
over a longer period (hours, days).
Freshness: we need to recrawl periodically.
Because of the size of the web, we can do frequent recrawls only for
a small subset; again, this is a subselection or prioritization problem.
What a crawler must do
Be robust
 Be immune to spider traps, duplicates, very large pages, very
large websites, dynamic pages, etc.
Be polite
• Don’t hit a site too often
• Only crawl pages you are allowed to crawl: robots.txt
Robots.txt
A protocol for giving crawlers (“robots”) limited access to a website, originally
from 1994.
Examples:
User-agent: *
Disallow: /yoursite/temp/
User-agent: searchengine
Disallow: /
Important: cache the robots.txt file of each site we are crawling
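Python’s standard library can parse rules like these; the sketch below feeds the example file to `urllib.robotparser` directly (the user-agent name `mybot` and the `example.com` URLs are hypothetical):

```python
from urllib.robotparser import RobotFileParser

# Parse the example rules directly; normally you would fetch and cache
# each site's /robots.txt once via rp.set_url(...) and rp.read().
rules = """\
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

allowed = rp.can_fetch("mybot", "https://example.com/page.html")
blocked = rp.can_fetch("mybot", "https://example.com/yoursite/temp/x")
```

Here `mybot` falls under the `*` group, so only `/yoursite/temp/` is off limits; the agent `searchengine` is excluded from the whole site.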
Robot Exclusion
• Web sites and pages can specify that robots
should not crawl/index certain areas.
• Two components:
– Robots Exclusion Protocol: Site wide specification
of excluded directories.
– Robots META Tag: Individual document tag to
exclude indexing or following links.
Robots Exclusion Protocol
• Site administrator puts a “robots.txt” file at
the root of the host’s web directory.
– http://www.ebay.com/robots.txt
– http://www.cnn.com/robots.txt
• File is a list of excluded directories for a given
robot (user-agent).
– Exclude all robots from the entire site:
– User-agent: *
Disallow: /
Robot Exclusion Protocol Examples
Exclude specific directories:
User-agent: *
Disallow: /tmp/
Disallow: /cgi-bin/
Disallow: /users/paranoid/
Exclude a specific robot:
User-agent: GoogleBot
Disallow: /
Spiders (Robots/Bots/Crawlers)
• Start with a comprehensive set of root URLs
from which to begin the search.
• Recursively follow all links on these pages to
find additional pages.
• Index all newly found pages in an inverted
index as they are encountered.
• May allow users to submit pages directly to be
indexed (and crawled from).
Search Strategies
Breadth-first Search
In graph theory, breadth-first search (BFS) is a graph search algorithm that begins
at the root node and explores all the neighboring nodes. Then for each of those
nearest nodes, it explores their unexplored neighbor nodes, and so on, until it
finds the goal.
Search Strategies (cont)
Depth-first Search
Depth-first search (DFS) on a binary tree is a specialized case of DFS on a
general graph. The search proceeds down paths, from left to right: the root is
examined first, then the left child of the root, then the left child of that node,
and so on until a leaf is reached. At a leaf, the search backtracks to the deepest
node with an unexplored right child and repeats.
Search Strategy Trade-Offs
• Breadth-first explores uniformly outward
from the root page but requires memory of all
nodes on the previous level (exponential in
depth). This is the standard spidering method.
• Depth-first requires memory of only depth
times branching factor (linear in depth) but
can get “lost” pursuing a single thread.
• Both strategies can be implemented with a frontier
list of links (URLs): a FIFO queue for breadth-first,
a LIFO stack for depth-first.
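The queue-versus-stack point can be made concrete: the sketch below crawls a small hypothetical link graph, and the only difference between the two strategies is which end of the frontier list is popped.

```python
from collections import deque

def crawl_order(graph, root, breadth_first=True):
    # The frontier is a FIFO queue for BFS (pop from the front) or a
    # LIFO stack for DFS (pop from the back); everything else is identical.
    frontier, visited, order = deque([root]), {root}, []
    while frontier:
        node = frontier.popleft() if breadth_first else frontier.pop()
        order.append(node)
        for nbr in graph[node]:
            if nbr not in visited:
                visited.add(nbr)
                frontier.append(nbr)
    return order

graph = {"A": ["B", "C"], "B": ["D"], "C": [], "D": []}
bfs = crawl_order(graph, "A", True)    # ['A', 'B', 'C', 'D']
dfs = crawl_order(graph, "A", False)   # ['A', 'C', 'B', 'D']
```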
Spidering Algorithm
Initialize queue Q with the initial set of known URLs.
Until Q is empty or a page or time limit is exhausted:
  Pop URL L from the front of Q.
  If L is not an HTML page (.gif, .jpeg, .ps, .pdf, .ppt, …), continue the loop.
  If L has already been visited, continue the loop.
  Download page P for L.
  If P cannot be downloaded (e.g. 404 error, robot excluded), continue the loop.
  Index P (e.g. add to the inverted index or store a cached copy).
  Parse P to obtain a list of new links N.
  Append N to the end of Q.
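A minimal Python rendering of this loop, with `fetch` and `parse_links` supplied as stand-in functions (the toy link graph below is hypothetical):

```python
from collections import deque

SKIP_EXTENSIONS = (".gif", ".jpeg", ".ps", ".pdf", ".ppt")

def spider(seed_urls, fetch, parse_links, page_limit=100):
    """Sketch of the spidering loop: fetch(url) returns page text or None
    (e.g. on a 404 or robot exclusion); parse_links(page) returns new URLs."""
    queue = deque(seed_urls)
    visited, indexed = set(), {}
    while queue and len(indexed) < page_limit:
        url = queue.popleft()                 # pop URL L from front of Q
        if url.endswith(SKIP_EXTENSIONS):     # skip non-HTML resources
            continue
        if url in visited:                    # skip already-visited URLs
            continue
        visited.add(url)
        page = fetch(url)                     # download page P for L
        if page is None:                      # 404, robot excluded, ...
            continue
        indexed[url] = page                   # index P (store a cached copy)
        queue.extend(parse_links(page))       # append new links N to end of Q
    return indexed

# Toy run on a hypothetical link graph ("d" simulates a dead link)
links = {"a": ["b", "c.pdf", "d"], "b": ["a"], "d": []}
fetch = lambda u: None if u == "d" else "page-" + u
parse_links = lambda page: links.get(page[5:], [])
indexed = spider(["a"], fetch, parse_links)
```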
Restricting Spidering
• Restrict spider to a particular site.
– Remove links to other sites from Q.
• Restrict spider to a particular directory.
– Remove links not in the specified directory.
• Obey page-owner restrictions (robot
exclusion).
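Restricting the spider to a particular site amounts to filtering links by host before they enter Q; a sketch using the standard `urlparse` (the `example.com` links are hypothetical):

```python
from urllib.parse import urlparse

def same_site(url, site_host):
    # Keep only links whose host matches the target site; links to
    # other sites are removed from the queue.
    return urlparse(url).netloc == site_host

links = ["https://example.com/a", "https://other.org/b", "https://example.com/c"]
queue = [u for u in links if same_site(u, "example.com")]
```

Restricting to a directory works the same way, comparing `urlparse(url).path` against the allowed prefix.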