Introduction to Information Retrieval 
June 2013, Roi Blanco
Acknowledgements 
• Many of these slides were taken from other presentations
  – P. Raghavan, C. Manning, H. Schütze IR lectures
  – Mounia Lalmas's personal stash
  – Other random slide decks
• Textbooks
  – Ricardo Baeza-Yates, Berthier Ribeiro-Neto
  – Raghavan, Manning, Schütze
  – … among other good books
• Many online tutorials, many online tools available (full toolkits)
Big Plan 
• What is Information Retrieval?
  – Search engine history
  – Examples of IR systems (you might not have known!)
• Is IR hard?
  – Users and human cognition
  – What is it like to be a search engine?
• Web Search
  – Architecture
  – Differences between Web search and IR
  – Crawling
• Representation
  – Document view
  – Document processing
  – Indexing
• Modeling
  – Vector space
  – Probabilistic
  – Language models
  – Extensions
• Others
  – Distributed
  – Efficiency
  – Caching
  – Temporal issues
  – Relevance feedback
  – …
Information Retrieval 
Information Retrieval (IR) is finding material 
(usually documents) of an unstructured nature 
(usually text) that satisfies an information need 
from within large collections (usually stored on 
computers). 
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze,
Introduction to Information Retrieval
Information Retrieval (II) 
• What do we understand by documents? How do we decide what is a document and what is not?
• What is an information need? What types of information needs can we satisfy automatically?
• What is a large collection? Which environments are suitable for IR?
Basic assumptions of Information Retrieval 
• Collection: a set of documents
  – Assume it is a static collection
• Goal: retrieve documents with information that is relevant to the user's information need and helps the user complete a task
Key issues 
• How to describe information resources or information-bearing objects in ways that they can be effectively used by those who need them?
  – Organizing / Indexing / Storing
• How to find the appropriate information resources or information-bearing objects for someone's (or your own) needs?
  – Retrieving / Accessing / Filtering
Unstructured data 
Unstructured data?

Structured query (database):
  SELECT * from HOTELS
  where city = 'Bangalore' and $$$ < 2

Unstructured query (free text):
  cheap hotels in Bangalore

HOTELS:
  CITY       $$$   name
  Bangalore  1.5   Cheapo one
  Barcelona  1     EvenCheapoer
Unstructured (text) vs. structured (database) data in the mid-nineties
Unstructured (text) vs. structured (database) data today
Search Engine Index Square Pants!
Timeline
(Slides showing search engines by year: 1990, 1991, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2005, 2008, 2009, 2010, …)
Your ads here! 
Usability 
We also fail at using the technology… sometimes
Applications 
• Text search
• Ad search
• Image/video search
• Email search
• Question answering systems
• Recommender systems
• Desktop search
• Expert finding
• ...
Also: jobs, prizes, products, news, source code, videogames, maps, partners, mashups, ...
Types of search engines 
• Q&A engines
• Collaborative
• Enterprise
• Web
• Metasearch
• Semantic
• NLP
• ...
IR issues 
• Find out what the user needs
  … and do it quickly
• Challenges: user intention, accessibility, volatility, redundancy, lack of structure, low quality, different data sources, volume, scale
• The main bottleneck is human cognition, not computation
IR is mostly about relevance 
• Relevance is the core concept in IR, but nobody has a good definition
• Relevance = useful
• Relevance = topically related
• Relevance = new
• Relevance = interesting
• Relevance = ???
• However, we still want relevant information
• Information needs must be expressed as a query
  – But users often don't know what they want
• Problems
  – Verbalizing information needs
  – Understanding query syntax
  – Understanding search engines
Understanding(?) the user 
Real need: I am a hungry tourist in Barcelona, and I want to find a place to eat; however, I don't want to spend a lot of money
  ↓ misconception
Perceived need: I want information on places with cheap food in Barcelona
  ↓ mistranslation
Request: info about bars in Barcelona
  ↓ misformulation
Query: [bar celona]
Why is this hard?
• Documents/images/video/speech/etc. are complex; we need some representation
• Semantics
  – What do words mean?
• Natural language
  – How do we say things?
• Computers cannot deal with these easily
… and even harder
• Context
• Opinion
  – Funny? Talented? Honest?
Semantics 
Bank: bank note? river bank? blood bank?
What is it like to be a search engine? 
• How can we figure out what you're trying to do?
• The signal can be somewhat weak, sometimes!
[ jaguar ]
[ iraq ]
[ latest release Thinkpad drivers touchpad ]
[ ebay ]
[ first ]
[ google ]
[ brittttteny spirs ]
Search is a multi-step process 
• Session search
  – Verbalize your query
  – Look for a document
  – Find your information there
  – Refine
• Teleporting
  – Go directly to the site you like
  – Formulating the query is too hard, you trust the final site more, etc.
• Someone told me that in the mid-1800s, people often would carry around a special kind of notebook. They would use the notebook to write down quotations that they heard, or copy passages from books they'd read. The notebook was an important part of their education, and it had a particular name.
  – What was the name of the notebook?
Examples from Dan Russell
Naming the un-nameable 
• What's this thing called?
More tasks …
• Going beyond a search engine
  – Using images / multimedia content
  – Using maps
  – Using other sources
• Think of how to express things differently (synonyms)
  – A friend told me that there is an abandoned city in the waters of San Francisco Bay. Is that true? If it IS true, what was the name of the supposed city?
• Exploring a topic further in depth
• Refining a question
  – Suppose you want to buy a unicycle for your Mom or Dad. How would you find it?
• Looking for lists of information
  – Can you find a list of all the groups that inhabited California at the time of the missions?
IR tasks 
• Known-item finding
  – You want to retrieve some data that you know exists
  – What year was Peter Mika born?
• Exploratory seeking
  – You want to find some information through an iterative process
  – There is not a single answer to your query
• Exhaustive search
  – You want to find all the information possible about a particular issue
  – Issuing several queries to cover the user information need
• Re-finding
  – You want to find an item you have found already
Scale 
• >300 TB of print data produced per year
  – Plus video, speech, domain-specific information (>600 PB per year)
• IR has to be fast and scalable
• Information is dynamic
  – News, web pages, maps, …
  – Queries are dynamic (you might even change your information needs while searching)
• Cope with data and searcher change
  – This introduces tensions in every component of a search engine
Methodology 
• Experimentation in IR
• Three fundamental types of IR research:
  – Systems (efficiency)
  – Methods (effectiveness)
  – Applications (user utility)
• Empirical evaluation plays a critical role across all three types of research
Methodology (II) 
• Information retrieval (IR) is a highly applied scientific discipline
• Experimentation is a critical component of the scientific method
• Poor experimental methodologies are not scientifically sound and should be avoided
The search process:
Task → Info need → Verbal form → Query → Search engine (over a corpus) → Results → Query refinement (and back to the query)
Search engine architecture (diagram components): User Interface, Query interpretation, Document Collection, Crawling, Text Processing, Indexing, General Voodoo, Matching, Ranking, Metadata, Index, Document Interpretation
Indexing/query pipeline: Crawler → Documents → NLP pipeline → Tokens → Indexer → Index ← Query system
Serving architecture (diagram components): DNS, Broker, Clusters, cache servers; the index is split into partitions, each with replication
<a href= 
• Web pages are linked
  – AKA the Web Graph
• We can walk through the graph to crawl
• We can rank using the graph
Web pages are connected 
Web Search 
• Basic search technology shared with IR systems
  – Representation
  – Indexing
  – Ranking
• Scale (in terms of data and users) changes the game
  – Efficiency/architectural design decisions
• Link structure
  – For data acquisition (crawling)
  – For ranking (PageRank, HITS)
  – For spam detection
  – For extending document representations (anchor text)
• Adversarial IR
• Monetization
User Needs 
• Need
  – Informational: want to learn about something (~40% / 65%), e.g. [low hemoglobin]
  – Navigational: want to go to that page (~25% / 15%), e.g. [united airlines]
  – Transactional: want to do something web-mediated (~35% / 20%)
    • Access a service, e.g. [seattle weather]
    • Downloads, e.g. [mars surface images]
    • Shop, e.g. [canon s410]
  – Gray areas
    • Find a good hub, e.g. [car rental brasil]
    • Exploratory search: "see what's there"
How far do people look for results? 
(Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf) 
Users' empirical evaluation of results
• Quality of pages varies widely
  – Relevance is not enough
  – Other desirable qualities (non-IR!)
    • Content: trustworthy, diverse, non-duplicated, well maintained
    • Web readability: displays correctly and fast
    • No annoyances: pop-ups, etc.
• Precision vs. recall
  – On the web, recall seldom matters
• What matters
  – Precision at 1? Precision above the fold?
  – Comprehensiveness: must be able to deal with obscure queries
    • Recall matters when the number of matches is very small
• User perceptions may be unscientific, but are significant over a large aggregate
Users' empirical evaluation of engines
• Relevance and validity of results
• UI: simple, no clutter, error tolerant
• Trust: results are objective
• Coverage of topics for ambiguous queries
• Pre/post-process tools provided
  – Mitigate user errors (auto spell check, search assist, …)
  – Explicit: search within results, more like this, refine ...
  – Anticipative: related searches
• Deal with idiosyncrasies
  – Web-specific vocabulary
    • Impact on stemming, spell-check, etc.
  – Web addresses typed in the search box
• "The first, the last, the best and the worst …"
The Web document collection 
• No design/coordination
• Distributed content creation, linking, democratization of publishing
• Content includes truth, lies, obsolete information, contradictions, …
• Unstructured (text, HTML, …), semi-structured (XML, annotated photos), structured (databases), …
• Scale much larger than previous text collections … but corporate records are catching up
• Growth: slowed from the initial "volume doubling every few months", but still expanding
• Content can be dynamically generated
Basic crawler operation 
• Begin with known "seed" URLs
• Fetch and parse them
  – Extract URLs they point to
  – Place the extracted URLs on a queue
• Fetch each URL on the queue and repeat
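The crawl loop above can be sketched in a few lines of Python. This is a toy, single-threaded version over a hypothetical in-memory "web"; `fetch_and_extract` is a stand-in for real HTTP fetching and HTML link extraction, and real crawlers distribute this loop and add politeness and robustness (see the following slides).

```python
from collections import deque

# Toy sketch of the basic crawl loop: a breadth-first walk from the seeds.
def crawl(seeds, fetch_and_extract):
    frontier = deque(seeds)        # queue of URLs still to visit
    seen = set(seeds)              # avoid re-queuing URLs
    crawled = []
    while frontier:
        url = frontier.popleft()
        crawled.append(url)        # "fetch and parse" the page
        for out in fetch_and_extract(url):   # extract URLs it points to
            if out not in seen:    # place unseen URLs on the queue
                seen.add(out)
                frontier.append(out)
    return crawled
```

For example, with `fake_web = {"a": ["b", "c"], "b": ["c"], "c": []}`, calling `crawl(["a"], lambda u: fake_web.get(u, []))` visits each page exactly once.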
Crawling picture: seed pages are fetched first; URLs crawled and parsed expand the URL frontier, beyond which lies the unseen Web
Simple picture: complications
• Web crawling isn't feasible with one machine
  – All of the above steps must be distributed
• Malicious pages
  – Spam pages
  – Spider traps, including dynamically generated ones
• Even non-malicious pages pose challenges
  – Latency/bandwidth to remote servers vary
  – Webmasters' stipulations
    • How "deep" should you crawl a site's URL hierarchy?
  – Site mirrors and duplicate pages
• Politeness: don't hit a server too often
What any crawler must do 
• Be polite: respect implicit and explicit politeness considerations
  – Only crawl allowed pages
  – Respect robots.txt
• Be robust: be immune to spider traps and other malicious behavior from web servers
• Be efficient
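Respecting robots.txt can be sketched with Python's standard `urllib.robotparser`; the robots.txt body and URLs below are made-up examples (a real crawler would fetch and cache one robots.txt per host):

```python
import urllib.robotparser

# Parse a robots.txt body and check whether a URL may be crawled.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])
allowed = rp.can_fetch("*", "http://example.com/index.html")      # True
blocked = rp.can_fetch("*", "http://example.com/private/a.html")  # False
```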
What any crawler should do 
• Be capable of distributed operation: designed to run on multiple distributed machines
• Be scalable: designed to increase the crawl rate by adding more machines
• Performance/efficiency: permit full use of available processing and network resources
What any crawler should do 
• Fetch pages of "higher quality" first
• Continuous operation: continue fetching fresh copies of previously fetched pages
• Extensible: adapt to new data formats and protocols
Updated crawling picture: several crawling threads consume the URL frontier; seed pages, URLs crawled and parsed, and the unseen Web as before
Document views
Example document: "Sailing in Greece" by B. Smith
• Content view: index terms such as sailing, greece, mediterranean, fish, sunset
• Data view: Author = "B. Smith", Crdate = "14.12.96", Ladate = "11.07.02"
• Structure view: head, title, author, chapter, section, section, …
• Layout view: how the document is presented
What is a document: document views 
• Content view: concerned with representing the content of the document; that is, what the document is about.
• Data view: concerned with factual data associated with the document (e.g. author names, publishing date).
• Layout view: concerned with how documents are displayed to the users; this view is related to user interface and visualization issues.
• Structure view: concerned with the logical structure of the document (e.g. a book being composed of chapters, themselves composed of sections, etc.)
Indexing language 
• An indexing language:
  – Is the language used to describe the content of documents (and queries)
  – Usually consists of index terms that are derived from the text (automatic indexing), or arrived at independently (manual indexing), using a controlled or uncontrolled vocabulary
  – Basic operation: is this query term present in this document?
Generating document representations 
• Building the indexing language, that is, generating the document representation, is done in several steps:
  – Character encoding
  – Language recognition
  – Page segmentation (boilerplate detection)
  – Tokenization (identification of words)
  – Term normalization
  – Stopword removal
  – Stemming
  – Others (document expansion, etc.)
Generating document representations: overview
documents → (tokenization) → tokens → (remove noisy words) → stop-words removed → (reduce to stems) → stems → (+ others, e.g. thesaurus, more complex processing) → terms (index terms)
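The chain above can be sketched end to end. The stop list and the suffix-chopping "stemmer" here are toy stand-ins chosen for illustration, not real components:

```python
import re

# Toy document-representation pipeline: tokens -> stop removal -> stems.
STOP = {"the", "a", "and", "to", "be", "in"}

def tokenize(text):
    # lowercase and split on non-alphanumeric characters
    return re.findall(r"[a-z0-9]+", text.lower())

def stem(token):
    # crude suffix chopping, only for illustration (not a real stemmer)
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def index_terms(text):
    return [stem(t) for t in tokenize(text) if t not in STOP]
```

For example, `index_terms("Sailing in the Mediterranean")` yields `["sail", "mediterranean"]`.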
Parsing a document 
• What format is it in?
  – pdf/word/excel/html?
• What language is it in?
• What character set is in use?
  – (ISO-8859, UTF-8, …)
But these tasks are often done heuristically …
Complications: Format/language 
• Documents being indexed can include docs from many different languages
  – A single index may contain terms from many languages.
• Sometimes a document or its components can contain multiple languages/formats
  – A French email with a German pdf attachment
  – A French email quoting clauses from an English-language contract
• There are commercial and open source libraries that can handle a lot of this stuff
Complications: What is a document? 
We return "documents" from our query, but there are often interesting questions of grain size:
What is a unit document?
  – A file?
  – An email? (Perhaps one of many in a single mbox file)
    • What about an email with 5 attachments?
  – A group of files (e.g., a PPT or LaTeX document split over HTML pages)
Tokenization 
• Input: "Friends, Romans and Countrymen"
• Output: tokens
  – Friends
  – Romans
  – Countrymen
• A token is an instance of a sequence of characters
• Each such token is now a candidate for an index entry, after further processing
• But what are valid tokens to emit?
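One simple answer, as a sketch: treat maximal runs of word characters as tokens. Note that this naive rule also emits "and", which the slide's output drops; deciding which tokens to keep is exactly the question being raised.

```python
import re

# A minimal regex tokenizer: one of many possible token definitions.
tokens = re.findall(r"\w+", "Friends, Romans and Countrymen")
# tokens == ['Friends', 'Romans', 'and', 'Countrymen']
```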
Tokenization
• Issues in tokenization:
  – Finland's capital → Finland AND s? Finlands? Finland's?
  – Hewlett-Packard → Hewlett and Packard as two tokens?
    • state-of-the-art: break up the hyphenated sequence?
    • co-education
    • lowercase, lower-case, lower case?
    • It can be effective to get the user to put in possible hyphens
  – San Francisco: one token or two?
    • How do you decide it is one token?
Numbers 
• 3/20/91  Mar. 12, 1991  20/3/91
• 55 B.C.
• B-52
• My PGP key is 324a3df234cb23e
• (800) 234-2333
• Numbers often have embedded spaces
• Older IR systems may not index numbers
  – But numbers are often very useful: think about looking up error codes/stack traces on the web
• "Metadata" (creation date, format, etc.) will often be indexed separately
Tokenization: language issues 
• French
  – L'ensemble → one token or two?
    • L? L'? Le?
    • Want l'ensemble to match with un ensemble
      – Until at least 2003, it didn't on Google
        » Internationalization!
• German noun compounds are not segmented
  – Lebensversicherungsgesellschaftsangestellter
  – 'life insurance company employee'
  – German retrieval systems benefit greatly from a compound-splitter module
  – Can give a 15% performance boost for German
Tokenization: language issues 
• Chinese and Japanese have no spaces between words:
  – 莎拉波娃现在居住在美国东南部的佛罗里达。
  – Not always guaranteed a unique tokenization
• Further complicated in Japanese, with multiple alphabets intermingled
  – Dates/amounts in multiple formats
  フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
  (Katakana, Hiragana, Kanji, and Romaji intermingled)
• End-users can express a query entirely in hiragana!
Tokenization: language issues 
• Arabic (or Hebrew) is basically written right to left, but with certain items like numbers written left to right
• Words are separated, but letter forms within a word form complex ligatures
• (Example Arabic sentence omitted:) 'Algeria achieved its independence in 1962 after 132 years of French occupation.'
• With Unicode, the surface presentation is complex, but the stored form is straightforward
Stop words 
• With a stop list, you exclude from the dictionary entirely the commonest words. Intuition:
  – They have little semantic content: the, a, and, to, be
  – There are a lot of them: ~30% of postings for the top 30 words
• But the trend is away from doing this:
  – Good compression techniques mean the space for including stop words in a system can be small
  – Good query optimization techniques mean you pay little at query time for including stop words
  – You need them for:
    • Phrase queries: "King of Denmark"
    • Various song titles, etc.: "Let it be", "To be or not to be"
    • "Relational" queries: "flights to London"
Normalization to terms 
• Want: matches to occur despite superficial differences in the character sequences of the tokens
• We may need to "normalize" words in indexed text as well as query words into the same form
  – We want to match U.S.A. and USA
• The result is terms: a term is a (normalized) word type, which is an entry in our IR system dictionary
• We most commonly implicitly define equivalence classes of terms by, e.g.,
  – deleting periods to form a term
    • U.S.A., USA → USA
  – deleting hyphens to form a term
    • anti-discriminatory, antidiscriminatory → antidiscriminatory
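The two equivalence-classing rules above can be sketched as a single normalization function (a minimal illustration, not a complete normalizer):

```python
# Implicit equivalence classing: delete periods and hyphens from tokens,
# so U.S.A. and USA (or anti-discriminatory and antidiscriminatory)
# map to the same term.
def normalize(token):
    return token.replace(".", "").replace("-", "")
```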
Normalization: other languages 
• Accents: e.g., French résumé vs. resume
• Umlauts: e.g., German Tuebingen vs. Tübingen
  – Should be equivalent
• Most important criterion:
  – How are your users likely to write their queries for these words?
• Even in languages that standardly have accents, users often may not type them
  – Often best to normalize to a de-accented term
    • Tuebingen, Tübingen, Tubingen → Tubingen
Case folding 
• Reduce all letters to lower case
  – Exception: upper case in mid-sentence?
    • e.g., General Motors
    • Fed vs. fed
    • SAIL vs. sail
  – Often best to lowercase everything, since users will use lowercase regardless of 'correct' capitalization…
• Longstanding Google example [fixed in 2011…]:
  – Query: C.A.T.
  – #1 result was for "cats" (well, Lolcats), not Caterpillar Inc.
Normalization to terms 
• An alternative to equivalence classing is to do asymmetric expansion
• An example of where this may be useful:
  – Enter: window   Search: window, windows
  – Enter: windows  Search: Windows, windows, window
  – Enter: Windows  Search: Windows
• Potentially more powerful, but less efficient
Thesauri and soundex 
• Do we handle synonyms and homonyms?
  – E.g., by hand-constructed equivalence classes
    • car = automobile   color = colour
  – We can rewrite to form equivalence-class terms
    • When the document contains automobile, index it under car-automobile (and vice-versa)
  – Or we can expand a query
    • When the query contains automobile, look under car as well
• What about spelling mistakes?
  – One approach is Soundex, which forms equivalence classes of words based on phonetic heuristics
Lemmatization 
• Reduce inflectional/variant forms to base form
• E.g.,
  – am, are, is → be
  – car, cars, car's, cars' → car
• the boy's cars are different colors → the boy car be different color
• Lemmatization implies doing "proper" reduction to dictionary headword form
Stemming 
• Reduce terms to their "roots" before indexing
• "Stemming" suggests crude affix chopping
  – language dependent
  – e.g., automate(s), automatic, automation all reduced to automat
• Example: "for example compressed and compression are both accepted as equivalent to compress" stems to "for exampl compress and compress ar both accept as equival to compress"
• Affix removal
  – remove the longest affix: {sailing, sailor} => sail
  – simple and effective stemming
  – a widely used such stemmer is Porter's algorithm
• Dictionary-based, using a look-up table
  – look up the stem of a word in a table: play + ing => play
  – space is required to store the (large) table, so often not practical
Stemming: some issues 
• Detecting equivalent stems:
  – {organize, organise}: e as the longest affix leads to {organiz, organis}, which should lead to one stem: organis
  – Heuristics are therefore used to deal with such cases.
• Over-stemming:
  – {organisation, organ} reduced to org, which is incorrect
  – Again, heuristics are used to deal with such cases.
Porter's algorithm
• Commonest algorithm for stemming English
  – Results suggest it's at least as good as other stemming options
• Conventions + 5 phases of reductions
  – phases applied sequentially
  – each phase consists of a set of commands
  – sample convention: of the rules in a compound command, select the one that applies to the longest suffix
Typical rules in Porter 
• sses → ss
• ies → i
• ational → ate
• tional → tion
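These four rules, together with the longest-suffix convention from the previous slide, can be sketched as follows (a toy fragment of one Porter phase, not the full algorithm):

```python
# Toy sketch of Porter-style suffix rules with the longest-suffix convention.
RULES = [("sses", "ss"), ("ies", "i"), ("ational", "ate"), ("tional", "tion")]

def apply_rules(word):
    # Of the rules that match, select the one with the longest suffix.
    best = None
    for suffix, repl in RULES:
        if word.endswith(suffix) and (best is None or len(suffix) > len(best[0])):
            best = (suffix, repl)
    if best:
        suffix, repl = best
        return word[: -len(suffix)] + repl
    return word
```

For example, relational ends in both tional and ational; the longer suffix wins, giving relate rather than relation.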
Language-specificity 
• The above methods embody transformations that are
  – Language-specific, and often
  – Application-specific
• These are "plug-in" addenda to the indexing process
• Both open source and commercial plug-ins are available for handling these
Does stemming help? 
• English: very mixed results. Helps recall for some queries but harms precision on others
  – E.g., operative (dentistry) ⇒ oper
• Definitely useful for Spanish, German, Finnish, …
  – 30% performance gains for Finnish!
Others: Using a thesaurus 
• A thesaurus provides a standard vocabulary for indexing (and searching)
• More precisely, a thesaurus provides a classified hierarchy for broadening and narrowing terms
  bank: 1. Finance institute
        2. River edge
  – if a document is indexed with bank, then also index it with "finance institute" or "river edge"
  – need to disambiguate the sense of bank in the text: e.g. if money appears in the document, then choose "finance institute"
• A widely used online thesaurus: WordNet
Information storage 
• A whole topic on its own
• How do we keep fresh copies of the web manageable by a cluster of computers, and remain able to answer millions of queries in milliseconds?
  – Inverted indexes
  – Compression
  – Caching
  – Distributed architectures
  – … and a lot of tricks
• Inverted indexes: the cornerstone data structure of IR systems
  – For each term t, we must store a list of all documents that contain t
  – Identify each doc by a docID, a document serial number
  – Index construction is tricky (we can't hold all the information needed in memory)
Term-document incidence matrix (documents × terms) and its transpose (terms × documents):

docs  t1  t2  t3
D1    1   0   1
D2    1   0   0
D3    0   1   1
D4    1   0   0
D5    1   1   1
D6    1   1   0
D7    0   1   0
D8    0   1   0
D9    0   1   1
D10   0   1   1

Terms  D1  D2  D3  D4
t1     1   1   0   1
t2     0   0   1   0
t3     1   0   1   0
• Most basic form:
  – Document frequency
  – Term frequency
  – Document identifiers

term  term id  df  postings (docid, tf)
a     1        4   (1,2), (2,5), (10,1), (11,1)
as    2        3   (1,3), (3,4), (20,1)
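Building this basic structure can be sketched in a few lines: each term maps to a postings list of (docID, tf) pairs, and the document frequency df is just the length of that list. (An in-memory toy; as noted above, real index construction cannot hold everything in memory.)

```python
from collections import Counter, defaultdict

# Build a basic inverted index: term -> list of (docID, term frequency).
def build_index(docs):                 # docs: {docID: [tokens]}
    index = defaultdict(list)
    for doc_id in sorted(docs):        # postings stay sorted by docID
        for term, tf in sorted(Counter(docs[doc_id]).items()):
            index[term].append((doc_id, tf))
    return index

index = build_index({1: ["a", "as", "a"], 2: ["a"], 3: ["as"]})
# index["a"] == [(1, 2), (2, 1)], so df of "a" is 2
```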
• Indexes contain more information
  – Position in the document
    • Useful for "phrase queries" or "proximity queries"
  – Fields in which the term appears in the document
  – Metadata …
  – All of that can be used for ranking

Example posting with positions: (1, 2, [1, 1], [2, 10]), …
  i.e. docid 1, tf 2, an occurrence in field 1 (title) at position 1, and one in field 2 at position 10
Queries 
• How do we process a query?
• Several kinds of queries
  – Boolean
    • Chicken AND salt
    • Gnome OR KDE
    • Salt AND NOT pepper
  – Phrase queries
  – Ranked
List Merging 
• "Exact match" queries
  – Chicken AND curry
  – Locate Chicken in the dictionary
  – Fetch its postings
  – Locate curry in the dictionary
  – Fetch its postings
  – Merge both postings lists
Intersecting two postings lists 
List Merging
Walk through the two postings lists in O(x+y) time:
  salt:   3 → 22 → 23 → 25
  pepper: 3 → 5 → 22 → 25 → 36
  result: 3 → 22 → 25
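The merge above, as code: two pointers walk the sorted postings lists, advancing the one that lags, in O(x+y) time.

```python
# Intersect two sorted postings lists of docIDs in O(x + y) time.
def intersect(p1, p2):
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the lagging pointer
        else:
            j += 1
    return answer

salt = [3, 22, 23, 25]
pepper = [3, 5, 22, 25, 36]
# intersect(salt, pepper) == [3, 22, 25]
```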
Models of information retrieval 
• A model:
  – abstracts away from the real world
  – uses a branch of mathematics
  – possibly: uses a metaphor for searching
Short history of IR modelling 
• Boolean model (±1950)
• Document similarity (±1957)
• Vector space model (±1970)
• Probabilistic retrieval (±1976)
• Language models (±1998)
• Linkage-based models (±1998)
• Positional models (±2004)
• Fielded models (±2005)
The Boolean model (±1950)
• Exact matching: data retrieval (instead of information retrieval)
  – A term specifies a set of documents
  – Boolean logic to combine terms / document sets
  – AND, OR and NOT: intersection, union, and difference
Statistical similarity between documents (±1957)
• The principle of similarity:
"The more two representations agree in given elements and their distribution, the higher would be the probability of their representing similar information" (Luhn 1957)
"It is here proposed that the frequency of word [term] occurrence in an article [document] furnishes a useful measurement of word [term] significance"
Zipf's law
(Plot: frequency of terms f against terms by rank order r)
Zipf's law
• Relative frequencies of terms
• In natural language, there are a few very frequent terms and very many very rare terms
• Zipf's law: the ith most frequent term has frequency proportional to 1/i
• cf_i ∝ 1/i, i.e. cf_i = K/i where K is a normalizing constant
• cf_i is the collection frequency: the number of occurrences of the term t_i in the collection
• Zipf's law holds for different languages
Zipf consequences 
• If the most frequent term (the) occurs cf_1 times
  – then the second most frequent term (of) occurs cf_1/2 times
  – the third most frequent term (and) occurs cf_1/3 times …
• Equivalently: cf_i = K/i where K is a normalizing factor, so
  – log cf_i = log K - log i
  – Linear relationship between log cf_i and log i
• Another power-law relationship
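A quick numerical check of the consequence above (K = 1000 is an arbitrary illustrative constant): with cf_i = K/i, the points (log i, log cf_i) lie on a straight line with slope -1.

```python
import math

# Ideal Zipf frequencies cf_i = K / i for ranks 1..5.
K = 1000.0
ranks = range(1, 6)
cf = [K / i for i in ranks]          # cf_1 = 1000, cf_2 = 500, ...
logs = [(math.log(i), math.log(c)) for i, c in zip(ranks, cf)]

# Slopes between consecutive points in log-log space are all -1.
slopes = [(logs[j + 1][1] - logs[j][1]) / (logs[j + 1][0] - logs[j][0])
          for j in range(len(logs) - 1)]
```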
Zipf's law in action
Luhn's analysis: observation
(Plot: frequency of terms f vs. terms by rank order r, with an upper and a lower cut-off; common terms lie above the upper cut-off, rare terms below the lower cut-off, and significant terms in between)
Resolving power of significant terms: the ability of terms to discriminate document content; it peaks at the rank-order position halfway between the two cut-offs
Luhn's analysis: implications
• Common terms are not good at representing document content
  – partly implemented through the removal of stop words
• Rare words are also not good at representing document content
  – usually nothing is done
  – Not true for every "document"
• Need a means to quantify the resolving power of a term:
  – associate weights to index terms
  – the tf×idf approach
Ranked retrieval 
• Boolean queries are good for expert users with a precise understanding of their needs and the collection
  – Also good for applications: applications can easily consume 1000s of results
• Not good for the majority of users
  – Most users are incapable of writing Boolean queries (or they are, but they think it's too much work)
  – Most users don't want to wade through 1000s of results
• This is particularly true of web search
Feast or Famine 
• Boolean queries often result in either too few (=0) or too many (1000s) results
• Query 1: "standard user dlink 650" → 200,000 hits
• Query 2: "standard user dlink 650 no card found" → 0 hits
• It takes a lot of skill to come up with a query that produces a manageable number of hits
  – AND gives too few; OR gives too many
Ranked retrieval models 
• Rather than a set of documents satisfying a query expression, in ranked retrieval the system returns an ordering over the (top) documents in the collection for a query
• Free text queries: rather than a query language of operators and expressions, the user's query is just one or more words in a human language
• In principle, these are two separate choices, but in practice ranked retrieval has normally been associated with free text queries, and vice versa
Feast or famine: not a problem in ranked retrieval 
• When a system produces a ranked result set, large result sets are not an issue
  – Indeed, the size of the result set is not an issue
  – We just show the top k (≈ 10) results
  – We do not overwhelm the user
  – Premise: the ranking algorithm works
Scoring as the basis of ranked retrieval 
• We wish to return in order the documents most likely to be useful to the searcher
• How can we rank-order the documents in the collection with respect to a query?
• Assign a score, say in [0, 1], to each document
• This score measures how well the document and query "match"
Query-document matching scores 
• We need a way of assigning a score to a query/document pair
• Let's start with a one-term query
• If the query term does not occur in the document: the score should be 0
• The more frequent the query term in the document, the higher the score (should be)
• We will look at a number of alternatives for this
Bag of words model 
• The vector representation does not consider the ordering of words in a document
• "John is quicker than Mary" and "Mary is quicker than John" have the same vectors
• This is called the bag of words model
Term frequency tf 
• The term frequency tf(t,d) of term t in document d is defined as the number of times that t occurs in d
• We want to use tf when computing query-document match scores. But how?
• Raw term frequency is not what we want:
  – A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term
  – But not 10 times more relevant
• Relevance does not increase proportionally with term frequency
Log-frequency weighting 
• The log frequency weight of term t in d is 
  w(t,d) = 1 + log10(tf(t,d))  if tf(t,d) > 0 
  w(t,d) = 0                   otherwise 
• 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc. 
• Score for a document-query pair: sum over terms t in both q and d: 
  score(q,d) = Σ_{t ∈ q∩d} (1 + log10 tf(t,d)) 
• The score is 0 if none of the query terms is present in the document.
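The log-frequency score above can be sketched directly (illustrative names; `doc_tf` maps a term to its raw count in the document):

```python
import math

def log_tf_weight(tf):
    """w = 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def score(query_terms, doc_tf):
    """Sum log-tf weights over terms occurring in both query and document."""
    return sum(log_tf_weight(doc_tf.get(t, 0)) for t in query_terms)

doc = {"car": 10, "insurance": 1}
print(score(["car", "insurance", "best"], doc))  # 2.0 + 1.0 + 0 = 3.0
```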
Document frequency 
โ€ข Rare terms are more informative than frequent terms 
โ€“ Recall stop words 
โ€ข Consider a term in the query that is rare in the collection (e.g., 
arachnocentric) 
โ€ข A document containing this term is very likely to be relevant to 
the query arachnocentric 
โ€ข โ†’ We want a high weight for rare terms like arachnocentric.
Document frequency, continued 
โ€ข Frequent terms are less informative than rare terms 
โ€ข Consider a query term that is frequent in the collection (e.g., high, 
increase, line) 
• A document containing such a term is more likely to be relevant than a 
document that does not contain it 
โ€ข But itโ€™s not a sure indicator of relevance. 
• → For frequent terms, we still want positive weights for words like high, 
increase, and line 
โ€ข But lower weights than for rare terms. 
โ€ข We will use document frequency (df) to capture this.
idf weight 
• df(t) is the document frequency of t: the number of documents that contain t 
– df(t) is an inverse measure of the informativeness of t 
– df(t) ≤ N 
• We define the idf (inverse document frequency) of t by 
  idf(t) = log10(N / df(t)) 
– We use log(N/df(t)) instead of N/df(t) to "dampen" the effect of idf.
Effect of idf on ranking 
โ€ข Does idf have an effect on ranking for one-term queries, like 
โ€“ iPhone 
โ€ข idf has no effect on ranking one term queries 
โ€“ idf affects the ranking of documents for queries with at least 
two terms 
โ€“ For the query capricious person, idf weighting makes 
occurrences of capricious count for much more in the final 
document ranking than occurrences of person. 
138
tf-idf weighting 
• The tf-idf weight of a term is the product of its tf weight and its 
idf weight: 
  w(t,d) = log(1 + tf(t,d)) × log10(N / df(t)) 
• Best known weighting scheme in information retrieval 
– Note: the "-" in tf-idf is a hyphen, not a minus sign! 
– Alternative names: tf.idf, tf x idf 
• Increases with the number of occurrences within a document 
• Increases with the rarity of the term in the collection
Score for a document given a query 
  Score(q,d) = Σ_{t ∈ q∩d} tf.idf(t,d) 
• There are many variants 
– How "tf" is computed (with/without logs) 
– Whether the terms in the query are also weighted 
– … 
140
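Putting tf and idf together, a sketch of this score (illustrative names; this is the log(1+tf) × log10(N/df) variant from the previous slide):

```python
import math

def tf_idf(tf, df, N):
    """tf-idf weight of one term: log(1 + tf) * log10(N / df)."""
    if tf == 0 or df == 0:
        return 0.0
    return math.log(1 + tf) * math.log10(N / df)

def score(query_terms, doc_tf, df, N):
    """Sum tf-idf weights over the query terms present in the document."""
    return sum(tf_idf(doc_tf.get(t, 0), df.get(t, 0), N) for t in query_terms)
```

A rare term such as arachnocentric (df = 1 in a million-document collection) contributes an idf of 6, while a frequent term with df = 100,000 contributes only 1.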
Documents as vectors 
โ€ข So we have a |V|-dimensional vector space 
โ€ข Terms are axes of the space 
โ€ข Documents are points or vectors in this space 
โ€ข Very high-dimensional: tens of millions of dimensions when 
you apply this to a web search engine 
โ€ข These are very sparse vectors - most entries are zero.
Statistical similarity between documents (ยฑ1957) 
โ€ข Vector product 
โ€“ If the vector has binary components, then the product 
measures the number of shared terms 
โ€“ Vector components might be "weights" 
  score(q,d) = Σ_{k ∈ matching terms} q_k · d_k
Why distance is a bad idea 
The Euclidean distance between q and d2 is large even though the 
distribution of terms in the query q and the distribution of terms in the 
document d2 are very similar.
Vector space model (ยฑ1970) 
โ€ข Documents and 
queries are vectors in 
a high-dimensional 
space 
โ€ข Geometric measures 
(distances, angles)
Vector space model (±1970) 
• Cosine of an angle: 
– close to 1 if angle is small 
– 0 if vectors are orthogonal 
  cos(d,q) = (Σ_{k=1..m} d_k · q_k) / ( sqrt(Σ_{k=1..m} d_k²) · sqrt(Σ_{k=1..m} q_k²) ) 
• Equivalently, cos(d,q) = n(d) · n(q), where n(v) = v / sqrt(Σ_{k=1..m} v_k²) 
is the length-normalized vector.
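A sketch of the cosine measure over sparse vectors represented as term → weight dicts (illustrative names):

```python
import math

def cosine(d, q):
    """Cosine similarity between two sparse vectors (dicts term -> weight)."""
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)
```

Note that cosine depends only on direction, not length: a document repeated twice gets the same similarity, which is exactly why it behaves better than Euclidean distance here.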
Vector space model (ยฑ1970) 
โ€ข PRO: Nice metaphor, easily explained; 
Mathematically sound: geometry; 
Great for relevance feedback 
โ€ข CON: Need term weighting (tf-idf); 
Hard to model structured queries
Probabilistic IR 
โ€ข An IR system has an uncertain understanding of userโ€™s queries and 
makes uncertain guesses on whether a document satisfies a query 
or not. 
โ€ข Probability theory provides a principled foundation for reasoning 
under uncertainty. 
โ€ข Probabilistic models build upon this foundation to estimate how 
likely it is that a document is relevant for a query. 
147
Event Space 
โ€ข Query representation 
โ€ข Document representation 
โ€ข Relevance 
โ€ข Event space 
โ€ข Conceptually there might be pairs with same q and d, 
but different r 
• Sometimes we also include user u, context c, etc. 
148
Probability Ranking Principle 
โ€ข Robertson (1977) 
โ€“ โ€œIf a reference retrieval systemโ€™s response to each 
request is a ranking of the documents in the collection 
in order of decreasing probability of relevance to the 
user who submitted the request, where the 
probabilities are estimated as accurately as possible 
on the basis of whatever data have been made 
available to the system for this purpose, the overall 
effectiveness of the system to its user will be the best 
that is obtainable on the basis of those data.โ€ 
โ€ข Basis for probabilistic approaches for IR 
149
Dissecting PRP 
โ€ข Probability of relevance 
โ€ข Estimated accurately 
โ€ข Based on whatever data available 
โ€ข Best possible accuracy 
โ€“ The perfect IR system! 
– Assumes relevance is independent of other 
documents in the collection 
150
Relevance? 
โ€ข What is ? 
โ€“ Isnโ€™t it decided by the user? her opinion? 
โ€ข User doesnโ€™t mean a human being! 
โ€“ We are working with representations 
โ€“ ... or parts of the reality available to us 
โ€ข 2/3 keywords, no profile, no context ... 
โ€“ relevance is uncertain 
โ€ข depends on what the system sees 
โ€ข may be marginalized over all the 
unseen context/profiles 
151
Retrieval as binary classification 
โ€ข For every (q,d), r takes two values 
โ€“ Relevant and non-relevant documents 
โ€“ can be extended to multiple values 
โ€ข Retrieve using Bayesโ€™ decision 
โ€“ PRP is related to the Bayes error rate (lowest 
possible error rate for a class) 
โ€“ How do we estimate this probability? 
152
PRP ranking 
โ€ข How to represent the random variables? 
โ€ข How to estimate the modelโ€™s parameters? 
153
โ€ข d is a binary vector 
โ€ข Multiple Bernoulli variables 
โ€ข Under MB, we can decompose into a 
product of probabilities, with likelihoods: 
154
If the terms are not in the query: 
Otherwise we need estimates for them! 
155
Estimates 
• Assign new weights for query terms based on relevant/non-relevant documents 
• Give higher weights to important terms: 
                        Relevant    Non-relevant    Total 
  Documents with t      r           n-r             n 
  Documents without t   R-r         N-n-R+r         N-n 
  Total                 R           N-R             N 
156
Robertson–Spärck Jones weight 
157 
Relevant docs with t 
Relevant docs without t 
Non-relevant docs with t 
Non-relevant docs without t
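The contingency-table counts above define the RSJ weight; a sketch using the usual +0.5 smoothing (function name and argument order are illustrative):

```python
import math

def rsj_weight(N, n, R, r):
    """Robertson-Spaerck Jones term weight with +0.5 smoothing.
    N: collection size, n: docs containing t,
    R: known relevant docs, r: relevant docs containing t."""
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))
```

With no relevance information (R = r = 0) this collapses to roughly log((N − n + 0.5)/(n + 0.5)), an idf-like weight, matching the "estimates without relevance info" slide that follows.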
Estimates without relevance info 
• If we pick a relevant document, words are equally likely to be 
present or absent 
โ€ข Non-relevant can be approximated with the collection as a 
whole 
158
Modeling term frequencies 
159
Modeling TF 
โ€ข Naรฏve estimation: separate probability for every 
outcome 
โ€ข BIR had only two parameters, now we have plenty 
(~many outcomes) 
โ€ข We can plug in a parametric estimate for the term 
frequencies 
โ€ข For instance, a Poisson mixture 
160
Okapi BM25 
โ€ข Same ranking function as before but with new 
estimates. Models term frequencies and 
document length. 
โ€ข Words are generated by a mixture of two 
Poissons 
โ€ข Assumes an eliteness variable (elite ~ word 
occurs unusually frequently, non-elite ~ word 
occurs as expected by chance). 
161
BM25 
โ€ข As a graphical model 
162
BM25 
โ€ข In order to approximate the formula, Robertson and Walker came up 
with: 
โ€ข Two model parameters 
โ€ข Very effective 
โ€ข The more words in common with the query the better 
โ€ข Repetitions less important than different query words 
โ€“ But more important if the document is relatively long 
163
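A sketch of the standard BM25 formulation with its two tunable parameters k1 and b (names are illustrative; the smoothed idf shown is one common choice, not necessarily the exact formula on the slide):

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_len, df, N, k1=1.2, b=0.75):
    """BM25: saturating tf, length normalization via b, idf per term.
    doc_tf: term -> count in this document; df: term -> document frequency."""
    score = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf == 0:
            continue
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
        score += idf * norm
    return score
```

The tf component saturates: doubling a term's count adds less than doubling the score, so matching a new query word helps more than repeating one, exactly as the slide states.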
Generative Probabilistic Language Models 
โ€ข The generative approach โ€“ A generator which produces 
events/tokens with some probability 
โ€“ Probability distribution over strings of text 
– Urn metaphor – a bucket of different colour balls (10 red, 5 
blue, 3 yellow, 2 white) 
• What is the probability of drawing a yellow ball? 3/20 
• What is the probability of drawing (with replacement) a red ball and a 
white ball? 1/2 × 1/10 
– IR metaphor: documents are urns, full of tokens (balls) of 
different terms (colors)
What is a language model? 
• How likely is a string of words in a "language"? 
– P1("the cat sat on the mat") 
– P2("the mat sat on the cat") 
– P3("the cat sat en la alfombra") 
– P4("el gato se sentó en la alfombra") 
• Given a model M and an observation s we want 
– Probability of getting s through random sampling from M 
– A mechanism to produce observations (strings) legal in M 
• User thinks of a relevant document and then picks some keywords 
to use as a query 
165
Generative Probabilistic Models 
โ€ข What is the probability of producing the query from a document? p(q|d) 
โ€ข Referred to as query-likelihood 
โ€ข Assumptions: 
โ€ข The probability of a document being relevant is strongly correlated with 
the probability of a query given a document, i.e. p(d|r) is correlated 
with p(q|d) 
• User has a reasonable idea of the terms that are likely to appear in the 
"ideal" document 
โ€ข Userโ€™s query terms can distinguish the โ€œidealโ€ document from the rest 
of the corpus 
โ€ข The query is generated as a representative of the โ€œidealโ€ document 
โ€ข Systemโ€™s task is to estimate for each of the documents in the collection, 
which is most likely to be the โ€œidealโ€ document
Language Models (1998/2001) 
โ€ข Letโ€™s assume we point blindly, one at a time, at 3 words 
in a document 
โ€“ What is the probability that I, by accident, pointed at the words 
โ€œMasterโ€, โ€œcomputerโ€ and โ€œScienceโ€? 
โ€“ Compute the probability, and use it to rank the documents. 
โ€ข Words are โ€œsampledโ€ independently of each other 
โ€“ Joint probability decomposed into a product of marginals 
โ€“ Estimation of probabilities just by counting 
• Higher-order models or unigrams? 
โ€“ Parameter estimation can be very expensive
Standard LM Approach 
โ€ข Assume that query terms are drawn identically and 
independently from a document
Estimating language models 
โ€ข Usually we donโ€™t know M 
โ€ข Maximum Likelihood Estimate of 
โ€“ Simply use the number of times the query term occurs in 
the document divided by the total number of term 
occurrences. 
โ€ข Zero Probability (frequency) problem 
169
Document Models 
โ€ข Solution: Infer a language model for each document, 
where 
โ€ข Then we can estimate 
โ€ข Standard approach is to use the probability of a term to 
smooth the document model. 
โ€ข Interpolate the ML estimator with general language 
expectations
Estimating Document Models 
โ€ข Basic Components 
โ€“ Probability of a term given a document (maximum likelihood estimate) 
โ€“ Probability of a term given the collection 
โ€“ tf(t,d) is the number of times term t occurs in document d (term frequency)
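These components combine into a smoothed query-likelihood score; a sketch using Jelinek-Mercer interpolation, with `lam` the weight placed on the collection model (names are illustrative):

```python
import math

def lm_score(query_terms, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
    """Query likelihood with Jelinek-Mercer smoothing:
    p(t|d) = (1 - lam) * tf(t,d)/|d| + lam * cf(t)/|C|.
    Interpolating with the collection model avoids zero probabilities."""
    logp = 0.0
    for t in query_terms:
        p = ((1 - lam) * doc_tf.get(t, 0) / doc_len
             + lam * coll_tf.get(t, 0) / coll_len)
        if p == 0:
            return float("-inf")  # term unseen even in the collection
        logp += math.log(p)
    return logp
```

A query term missing from the document no longer zeroes out the whole product; it just contributes the (small) collection probability.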
Language Models 
โ€ข Implementation
Implementation as vector product 
• Recall: score(q,d) = Σ_k q_k · d_k, with q_k = tf(k,q) 
• Maximum-likelihood and background estimates: 
  p(t|D) = tf(t,D) / Σ_t' tf(t',D)      p(t) = df(t) / Σ_t' df(t') 
• The smoothed language model score fits the same template with 
  d_k = log( 1 + (λ / (1−λ)) · (tf(k,d) / df(k)) · Σ_t df(t) / Σ_t tf(t,d) ) 
  where tf(k,d)/df(k) plays the role of the tf.idf of term k in document d, 
  1/Σ_t tf(t,d) is the inverse length of d, Σ_t df(t) reflects term 
  importance, and λ/(1−λ) is the odds of the probability of matching text.
Document length normalization 
โ€ข Probabilistic models assume causes for documents differing in 
length 
โ€“ Scope 
โ€“ Verbosity 
โ€ข In practice, document length softens the term frequency 
contribution to the final score 
โ€“ Weโ€™ve seen it in BM25 and LMs 
โ€“ Usually with a tunable parameter that regulates the 
amount of softening 
– Can be a function of the deviation from the average 
document length 
โ€“ Can be incorporated into vanilla tf-idf 
174
Other models 
โ€ข Modeling term dependencies (positions) in the language 
modeling framework 
โ€“ Markov Random Fields 
โ€ข Modeling matches (occurrences of words) in different 
parts of a document -> fielded models 
โ€“ BM25F 
โ€“ Markov Random Fields can account for this as well 
175
More involved signals for ranking 
โ€ข From document understanding to query 
understanding 
โ€ข Query rewrites (gazetteers, spell correction), 
named entity recognition, query suggestions, 
query categories, query segmentation ... 
โ€ข Detecting query intent, triggering verticals 
โ€“ direct target towards answers 
โ€“ richer interfaces 
176
Signals for Ranking 
โ€ข Signals for ranking: matches of query terms in 
documents, query-independent quality measures, 
CTR, among others 
โ€ข Probabilistic IR models are all about counting 
โ€“ occurrences of terms in documents, in sets of 
documents, etc. 
โ€ข How to aggregate efficiently a large number of 
โ€œdifferentโ€ counts 
โ€“ coming from the same terms 
โ€“ no double counts! 
177
Searching for food 
• New York's greatest pizza 
‣ New OR York's OR greatest OR pizza 
‣ New AND York's AND greatest AND pizza 
‣ New OR York OR great OR pizza 
‣ "New York" OR "great pizza" 
‣ "New York" AND "great pizza" 
‣ York < New AND great OR pizza 
โ€ข among many more. 
178
โ€œRefinedโ€matching 
โ€ข Extract a number of virtual regions in the document 
that match some version of the query (operators) 
โ€“ Each region provides a different evidence of 
relevance (i.e. signal) 
โ€ข Aggregate the scores over the different regions 
• Ex.: "at least any two words in the query appear 
either consecutively or with an extra word between 
them" 
179
Probability of Relevance 
180
Remember BM25 
โ€ข Term (tf) independence 
โ€ข Vague Prior over terms not 
appearing in the query 
โ€ข Eliteness - topical model that 
perturbs the word distribution 
• 2-Poisson distribution of term 
frequencies over relevant and 
non-relevant documents 
181
Feature dependencies 
โ€ข Class-linearly dependent (or affine) features 
โ€“ add no extra evidence/signal 
โ€“ model overfitting (vs capacity) 
โ€ข Still, it is desirable to enrich the model with more 
involved features 
โ€ข Some features are surprisingly correlated 
โ€ข Positional information requires a large number of 
parameters to estimate 
โ€ข Potentially up to 
182
Query concept segmentation 
โ€ข Queries are made up of basic conceptual units, 
comprising many words 
โ€“ โ€œIndian summer victor herbertโ€ 
โ€ข Spurious matches: โ€œsan jose airportโ€ -> โ€œsan jose 
city airportโ€ 
โ€ข Model to detect segments based on generative 
language models and Wikipedia 
โ€ข Relax matches using factors of the max ratio 
between span length and segment length 
183
Virtual regions 
โ€ข Different parts of the document 
provide different evidence of 
relevance 
โ€ข Create a (finite) set of (latent) 
artificial regions and re-weight 
184
Implementation 
โ€ข An operator maps a query to a set of queries, 
which could match a document 
โ€ข Each operator has a weight 
โ€ข The average term frequency in a document is 
185
Remarks 
โ€ข Different saturation (eliteness) function? 
โ€“ learn the real functional shape! 
โ€“ log-logistic is good if the class-conditional 
distributions are drawn from an exp. family 
โ€ข Positions as variables? 
โ€“ kernel-like method or exp. #parameters 
โ€ข Apply operators on a per query or per query class 
basis? 
186
Operator examples 
โ€ข BOW: maps a raw query to the set of queries 
whose elements are the single terms 
โ€ข p-grams: set of all p-gram of consecutive terms 
โ€ข p-and: all conjunctions of p arbitrary terms 
โ€ข segments: match only the โ€œconceptsโ€ 
โ€ข Enlargement: some words might sneak in 
between the phrases/segments 
187
How does it work in practice? 
188
... not that far away 
term frequency 
link information 
query intent information 
editorial information 
click-through information 
geographical information 
language information 
user preferences 
document length 
document fields 
other gazillion sources of information 
189
Dictionaries 
โ€ข Fast look-up 
โ€“ Might need specific structures to scale up 
โ€ข Hash tables 
โ€ข Trees 
โ€“ Tolerant retrieval (prefixes) 
โ€“ Spell checking 
โ€ข Document correction (OCR) 
โ€ข Query misspellings (did you mean โ€ฆ ?) 
โ€ข (Weighted) edit distance โ€“ dynamic programming 
โ€ข Jaccard overlap (index character k-grams) 
โ€ข Context sensitive 
โ€ข http://norvig.com/spell-correct.html 
โ€“ Wild-card queries 
โ€ข Permuterm index 
โ€ข K-gram indexes 
190
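The dynamic-programming edit distance mentioned above, as a sketch (unit costs; a weighted variant would replace the +1 terms with per-operation or per-character costs):

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b via dynamic programming,
    keeping only the previous row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # 3
```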
Hardware basics 
โ€ข Access to data in memory is much faster than access to data on disk. 
โ€ข Disk seeks: No data is transferred from disk while the disk head is being 
positioned. 
โ€ข Therefore: Transferring one large chunk of data from disk to memory is 
faster than transferring many small chunks. 
โ€ข Disk I/O is block-based: Reading and writing of entire blocks (as opposed 
to smaller chunks). 
โ€ข Block sizes: 8KB to 256 KB. 
191
Hardware basics 
โ€ข Many design decisions in information retrieval are based on the 
characteristics of hardware 
โ€ข Servers used in IR systems now typically have several GB of main memory, 
sometimes tens of GB. 
โ€ข Available disk space is several (2-3) orders of magnitude larger. 
โ€ข Fault tolerance is very expensive: It is much cheaper to use many regular 
machines rather than one fault tolerant machine. 
192
Data flow 
[Diagram: MapReduce index construction. A master assigns input splits to 
parser machines (map phase); each parser writes term-partitioned segment 
files (a-f, g-p, q-z). In the reduce phase, each inverter collects one term 
partition from all segment files and produces the postings for that range.] 
193
MapReduce 
โ€ข The index construction algorithm we just described is an instance of 
MapReduce. 
โ€ข MapReduce (Dean and Ghemawat 2004) is a robust and conceptually 
simple framework for distributed computing โ€ฆ 
โ€ข โ€ฆ without having to write code for the distribution part. 
โ€ข They describe the Google indexing system (ca. 2002) as consisting of a 
number of phases, each implemented in MapReduce. 
โ€ข Open source implementation Hadoop 
โ€“ Widely used throughout industry 
194
MapReduce 
โ€ข Index construction was just one phase. 
โ€ข Another phase: transforming a term-partitioned index 
into a document-partitioned index. 
โ€“ Term-partitioned: one machine handles a subrange of 
terms 
โ€“ Document-partitioned: one machine handles a 
subrange of documents 
• Most search engines use a document-partitioned index for 
better load balancing, etc. 
195
Distributed IR 
โ€ข Basic process 
โ€“ All queries sent to a director machine 
โ€“ Director then sends messages to many index servers 
โ€ข Each index server does some portion of the query processing 
โ€“ Director organizes the results and returns them to the user 
โ€ข Two main approaches 
โ€“ Document distribution 
โ€ข by far the most popular 
โ€“ Term distribution 
196
Distributed IR (II) 
โ€ข Document distribution 
โ€“ each index server acts as a search engine for a small fraction of 
the total collection 
โ€“ director sends a copy of the query to each of the index servers, 
each of which returns the top k results 
โ€“ results are merged into a single ranked list by the director 
โ€ข Collection statistics should be shared for effective ranking 
197
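A sketch of document distribution (illustrative names; the per-shard scorer is a stub that just counts matched query terms, standing in for a real ranking function):

```python
import heapq

def search_shard(shard, query, k):
    """Each index server scores its own fraction of the collection.
    shard: doc_id -> set of terms in that document."""
    scored = [(sum(t in terms for t in query), doc_id)
              for doc_id, terms in shard.items()]
    return heapq.nlargest(k, scored)

def director(shards, query, k=10):
    """Send the query to every index server, then merge the per-shard
    top-k lists into a single ranked list."""
    merged = []
    for shard in shards:
        merged.extend(search_shard(shard, query, k))
    return heapq.nlargest(k, merged)

shards = [{1: {"new", "york"}}, {2: {"pizza", "york"}}]
print(director(shards, ["york", "pizza"], k=1))  # [(2, 2)]: doc 2 matches both
```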
Caching 
โ€ข Query distributions similar to Zipf 
• About half of the queries each day are unique, but some are very popular 
– Caching can significantly improve efficiency 
โ€ข Cache popular query results 
โ€ข Cache common inverted lists 
โ€“ Inverted list caching can help with unique queries 
โ€“ Cache must be refreshed to prevent stale data 
198
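A sketch of a query-result cache with LRU eviction (illustrative; real systems also cache inverted lists and refresh entries to avoid serving stale results):

```python
from collections import OrderedDict

class QueryCache:
    """Small LRU cache for query results; popular (Zipf-distributed)
    queries hit the cache, unique ones fall through to the index."""
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, query):
        if query not in self.data:
            return None
        self.data.move_to_end(query)  # mark as recently used
        return self.data[query]

    def put(self, query, results):
        self.data[query] = results
        self.data.move_to_end(query)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used
```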
Others 
โ€ข Efficiency (compression, storage, caching, 
distribution) 
โ€ข Novelty and diversity 
โ€ข Evaluation 
โ€ข Relevance feedback 
โ€ข Learning to rank 
โ€ข User models 
โ€“ Context, personalization 
โ€ข Sponsored Search 
โ€ข Temporal aspects 
โ€ข Social aspects 
199
200
Beyond document retrieval using semantic annotations
Roi Blanco
ย 
Keyword Search over RDF Graphs
Keyword Search over RDF GraphsKeyword Search over RDF Graphs
Keyword Search over RDF Graphs
Roi Blanco
ย 
Extending BM25 with multiple query operators
Extending BM25 with multiple query operatorsExtending BM25 with multiple query operators
Extending BM25 with multiple query operators
Roi Blanco
ย 
Energy-Price-Driven Query Processing in Multi-center Web Search Engines
Energy-Price-Driven Query Processing in Multi-center WebSearch EnginesEnergy-Price-Driven Query Processing in Multi-center WebSearch Engines
Energy-Price-Driven Query Processing in Multi-center Web Search Engines
Roi Blanco
ย 
Effective and Efficient Entity Search in RDF data
Effective and Efficient Entity Search in RDF dataEffective and Efficient Entity Search in RDF data
Effective and Efficient Entity Search in RDF data
Roi Blanco
ย 
Caching Search Engine Results over Incremental Indices
Caching Search Engine Results over Incremental IndicesCaching Search Engine Results over Incremental Indices
Caching Search Engine Results over Incremental Indices
Roi Blanco
ย 
Finding support sentences for entities
Finding support sentences for entitiesFinding support sentences for entities
Finding support sentences for entities
Roi Blanco
ย 
From Queries to Answers in the Web
From Queries to Answers in the WebFrom Queries to Answers in the Web
From Queries to Answers in the Web
Roi Blanco
ย 
๏ฟผEntity Linking via Graph-Distance Minimization
๏ฟผEntity Linking via Graph-Distance Minimization๏ฟผEntity Linking via Graph-Distance Minimization
๏ฟผEntity Linking via Graph-Distance Minimization
Roi Blanco
ย 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
Roi Blanco
ย 
Influence of Timeline and Named-entity Components on User Engagement
Influence of Timeline and Named-entity Components on User Engagement Influence of Timeline and Named-entity Components on User Engagement
Influence of Timeline and Named-entity Components on User Engagement
Roi Blanco
ย 
Searching over the past, present and future
Searching over the past, present and futureSearching over the past, present and future
Searching over the past, present and future
Roi Blanco
ย 
Beyond document retrieval using semantic annotations
Beyond document retrieval using semantic annotations Beyond document retrieval using semantic annotations
Beyond document retrieval using semantic annotations
Roi Blanco
ย 
Keyword Search over RDF Graphs
Keyword Search over RDF GraphsKeyword Search over RDF Graphs
Keyword Search over RDF Graphs
Roi Blanco
ย 
Extending BM25 with multiple query operators
Extending BM25 with multiple query operatorsExtending BM25 with multiple query operators
Extending BM25 with multiple query operators
Roi Blanco
ย 
Energy-Price-Driven Query Processing in Multi-center Web Search Engines
Energy-Price-Driven Query Processing in Multi-center WebSearch EnginesEnergy-Price-Driven Query Processing in Multi-center WebSearch Engines
Energy-Price-Driven Query Processing in Multi-center Web Search Engines
Roi Blanco
ย 
Effective and Efficient Entity Search in RDF data
Effective and Efficient Entity Search in RDF dataEffective and Efficient Entity Search in RDF data
Effective and Efficient Entity Search in RDF data
Roi Blanco
ย 
Caching Search Engine Results over Incremental Indices
Caching Search Engine Results over Incremental IndicesCaching Search Engine Results over Incremental Indices
Caching Search Engine Results over Incremental Indices
Roi Blanco
ย 
Finding support sentences for entities
Finding support sentences for entitiesFinding support sentences for entities
Finding support sentences for entities
Roi Blanco
ย 
Ad

Recently uploaded (20)

Mobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi ArabiaMobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi Arabia
Steve Jonas
ย 
Cyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of securityCyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of security
riccardosl1
ย 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
ย 
"Client Partnership โ€” the Path to Exponential Growth for Companies Sized 50-5...
"Client Partnership โ€” the Path to Exponential Growth for Companies Sized 50-5..."Client Partnership โ€” the Path to Exponential Growth for Companies Sized 50-5...
"Client Partnership โ€” the Path to Exponential Growth for Companies Sized 50-5...
Fwdays
ย 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
ย 
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
ย 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
ย 
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfComplete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Software Company
ย 
Buckeye Dreamin 2024: Assessing and Resolving Technical Debt
Buckeye Dreamin 2024: Assessing and Resolving Technical DebtBuckeye Dreamin 2024: Assessing and Resolving Technical Debt
Buckeye Dreamin 2024: Assessing and Resolving Technical Debt
Lynda Kane
ย 
Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025
Splunk
ย 
#AdminHour presents: Hour of Code2018 slide deck from 12/6/2018
#AdminHour presents: Hour of Code2018 slide deck from 12/6/2018#AdminHour presents: Hour of Code2018 slide deck from 12/6/2018
#AdminHour presents: Hour of Code2018 slide deck from 12/6/2018
Lynda Kane
ย 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
ย 
Build Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For DevsBuild Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For Devs
Brian McKeiver
ย 
Image processinglab image processing image processing
Image processinglab image processing  image processingImage processinglab image processing  image processing
Image processinglab image processing image processing
RaghadHany
ย 
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes
ย 
Network Security. Different aspects of Network Security.
Network Security. Different aspects of Network Security.Network Security. Different aspects of Network Security.
Network Security. Different aspects of Network Security.
gregtap1
ย 
Technology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data AnalyticsTechnology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data Analytics
InData Labs
ย 
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfThe Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
Abi john
ย 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
ย 
Buckeye Dreamin' 2023: De-fogging Debug Logs
Buckeye Dreamin' 2023: De-fogging Debug LogsBuckeye Dreamin' 2023: De-fogging Debug Logs
Buckeye Dreamin' 2023: De-fogging Debug Logs
Lynda Kane
ย 
Mobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi ArabiaMobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi Arabia
Steve Jonas
ย 
Cyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of securityCyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of security
riccardosl1
ย 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
ย 
"Client Partnership โ€” the Path to Exponential Growth for Companies Sized 50-5...
"Client Partnership โ€” the Path to Exponential Growth for Companies Sized 50-5..."Client Partnership โ€” the Path to Exponential Growth for Companies Sized 50-5...
"Client Partnership โ€” the Path to Exponential Growth for Companies Sized 50-5...
Fwdays
ย 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
ย 
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
ย 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
ย 
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfComplete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Software Company
ย 
Buckeye Dreamin 2024: Assessing and Resolving Technical Debt
Buckeye Dreamin 2024: Assessing and Resolving Technical DebtBuckeye Dreamin 2024: Assessing and Resolving Technical Debt
Buckeye Dreamin 2024: Assessing and Resolving Technical Debt
Lynda Kane
ย 
Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025
Splunk
ย 
#AdminHour presents: Hour of Code2018 slide deck from 12/6/2018
#AdminHour presents: Hour of Code2018 slide deck from 12/6/2018#AdminHour presents: Hour of Code2018 slide deck from 12/6/2018
#AdminHour presents: Hour of Code2018 slide deck from 12/6/2018
Lynda Kane
ย 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
ย 
Build Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For DevsBuild Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For Devs
Brian McKeiver
ย 
Image processinglab image processing image processing
Image processinglab image processing  image processingImage processinglab image processing  image processing
Image processinglab image processing image processing
RaghadHany
ย 
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes
ย 
Network Security. Different aspects of Network Security.
Network Security. Different aspects of Network Security.Network Security. Different aspects of Network Security.
Network Security. Different aspects of Network Security.
gregtap1
ย 
Technology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data AnalyticsTechnology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data Analytics
InData Labs
ย 
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfThe Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
Abi john
ย 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
ย 
Buckeye Dreamin' 2023: De-fogging Debug Logs
Buckeye Dreamin' 2023: De-fogging Debug LogsBuckeye Dreamin' 2023: De-fogging Debug Logs
Buckeye Dreamin' 2023: De-fogging Debug Logs
Lynda Kane
ย 

Introduction to Information Retrieval

  • 1. Introduction to Information Retrieval June, 2013 Roi Blanco
  • 2. Acknowledgements • Many of these slides were taken from other presentations – P. Raghavan, C. Manning, H. Schutze IR lectures – Mounia Lalmas's personal stash – Other random slide decks • Textbooks – Ricardo Baeza-Yates, Berthier Ribeiro Neto – Raghavan, Manning, Schutze – … among other good books • Many online tutorials, many online tools available (full toolkits) 2
  • 3. Big Plan • What is Information Retrieval? – Search engine history – Examples of IR systems (you might not have known!) • Is IR hard? – Users and human cognition – What is it like to be a search engine? • Web Search – Architecture – Differences between Web search and IR – Crawling 3
  • 4. • Representation – Document view – Document processing – Indexing • Modeling – Vector space – Probabilistic – Language Models – Extensions • Others – Distributed – Efficiency – Caching – Temporal issues – Relevance feedback – … 4
  • 5. 5
  • 6. Information Retrieval Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval 6 6
  • 7. Information Retrieval (II) • What do we understand by documents? How do we decide what is a document and what is not? • What is an information need? What types of information needs can we satisfy automatically? • What is a large collection? Which environments are suitable for IR? 7 7
  • 8. Basic assumptions of Information Retrieval • Collection: A set of documents – Assume it is a static collection • Goal: Retrieve documents with information that is relevant to the user's information need and helps the user complete a task 8
  • 9. Key issues • How to describe information resources or information-bearing objects in ways that they can be effectively used by those who need to use them? – Organizing/Indexing/Storing • How to find the appropriate information resources or information-bearing objects for someone's (or your own) needs – Retrieving / Accessing / Filtering 9
  • 10. Unstructured data Unstructured data? SELECT * from HOTELS where city = Bangalore and $$$ < 2 vs. the free-text query "Cheap hotels in Bangalore" [Table HOTELS: CITY, $$$, name; Bangalore, 1.5, Cheapo one; Barcelona, 1, EvenCheapoer] 10
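The contrast on this slide can be made concrete. The sketch below (a hypothetical table and documents mirroring the slide's toy example) answers the same need once as a structured SQL query and once as naive keyword matching over free text:

```python
# Structured vs. unstructured access to the same information need.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hotels (city TEXT, price REAL, name TEXT)")
conn.executemany("INSERT INTO hotels VALUES (?, ?, ?)",
                 [("Bangalore", 1.5, "Cheapo one"),
                  ("Barcelona", 1.0, "EvenCheapoer")])

# Structured: the schema says exactly where to look and how to compare.
structured = [row[0] for row in conn.execute(
    "SELECT name FROM hotels WHERE city = ? AND price < ?", ("Bangalore", 2))]

# Unstructured: free text, so all we can do is match query words against
# document words -- note how loose the match is.
docs = ["Cheapo one is a cheap hotel in Bangalore",
        "EvenCheapoer is a budget hotel in Barcelona"]
query = "cheap hotels in bangalore"
unstructured = [d for d in docs
                if any(w in d.lower() for w in query.split())]

print(structured)    # only the Bangalore row under 2
print(unstructured)  # both docs match ("in" occurs everywhere)
```

The SQL query is precise because the schema fixes the semantics; the keyword match is fuzzy, which is exactly the gap IR models try to close.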
  • 11. Unstructured (text) vs. structured (database) data in the mid-nineties 11
  • 12. Unstructured (text) vs. structured (database) data today
  • 13. 13
  • 14. Search Engine Index Square Pants! 14
  • 15. 15
  • 16. Timeline 1990 1991 1993 1994 1998 ... 16
  • 17. ... 1995 1996 1997 1998 1999 2000 17
  • 18. 2009 2005 ... 2008 18
  • 19. 2001 2003 2002 2003 2003 2003 2003 2010 2010 2003 19
  • 20. 20
  • 22. 22
  • 23. 23
  • 24. 24
  • 25. 25
  • 26. 26
  • 27. 27
  • 28. 28
  • 29. 29
  • 30. 30
  • 31. 31
  • 32. 32
  • 33. 33
  • 34. 34
  • 35. Usability We also fail at using the technology. Sometimes.
  • 36. 36
  • 37. Applications • Text Search • Ad search • Image/Video search • Email Search • Question Answering systems • Recommender systems • Desktop Search • Expert Finding • .... Jobs Prizes Products News Source code Videogames Maps Partners Mashups ... 37
  • 38. Types of search engines • Q&A engines • Collaborative • Enterprise • Web • Metasearch • Semantic • NLP • ... 38
  • 40. 40
  • 41. IR issues • Find out what the user needs … and do it quickly • Challenges: user intention, accessibility, volatility, redundancy, lack of structure, low quality, different data sources, volume, scale • The main bottleneck is human cognition, not computation 41
  • 42. IR is mostly about relevance • Relevance is the core concept in IR, but nobody has a good definition • Relevance = useful • Relevance = topically related • Relevance = new • Relevance = interesting • Relevance = ??? • However we still want relevant information 42
  • 43. • Information needs must be expressed as a query – But users don't often know what they want • Problems – Verbalizing information needs – Understanding query syntax – Understanding search engines 43
  • 44. Understanding(?) the user I am a hungry tourist in Barcelona, and I want to find a place to eat; however I don't want to spend a lot of money → I want information on places with cheap food in Barcelona → Info about bars in Barcelona → Bar celona. Misconception, Mistranslation, Misformulation 44
  • 45. Why is this hard? • Documents/images/video/speech/etc. are complex. We need some representation • Semantics – What do words mean? • Natural language – How do we say things? • Computers cannot deal with these easily 45
  • 46. … and even harder • Context • Opinion Funny? Talented? Honest? 46
  • 47. Semantics [Figure: ambiguous senses of "bank": bank note, river bank, blood bank] 47
  • 48. What is it like to be a search engine? • How can we figure out what you're trying to do? • The signal can be somewhat weak, sometimes! [ jaguar ] [ iraq ] [ latest release Thinkpad drivers touchpad ] [ ebay ] [ first ] [ google ] [ brittttteny spirs ] 48
  • 49. Search is a multi-step process • Session search – Verbalize your query – Look for a document – Find your information there – Refine • Teleporting – Go directly to the site you like – Formulating the query is too hard, you trust the final site more, etc. 49
  • 50. • Someone told me that in the mid-1800s, people often would carry around a special kind of notebook. They would use the notebook to write down quotations that they heard, or copy passages from books they'd read. The notebook was an important part of their education, and it had a particular name. – What was the name of the notebook? 50 Examples from Dan Russel
  • 51. Naming the un-nameable • What's this thing called? 51
  • 52. More tasks … • Going beyond a search engine – Using images / multimedia content – Using maps – Using other sources • Think of how to express things differently (synonyms) – A friend told me that there is an abandoned city in the waters of San Francisco Bay. Is that true? If it IS true, what was the name of the supposed city? • Exploring a topic further in depth • Refining a question – Suppose you want to buy a unicycle for your Mom or Dad. How would you find it? • Looking for lists of information – Can you find a list of all the groups that inhabited California at the time of the missions? 52
  • 53. IR tasks • Known-item finding – You want to retrieve some data that you know exists – What year was Peter Mika born? • Exploratory seeking – You want to find some information through an iterative process – Not a single answer to your query • Exhaustive search – You want to find all the information possible about a particular issue – Issuing several queries to cover the user information need • Re-finding – You want to find an item you have found already 53
  • 54. Scale • >300TB of print data produced per year – +Video, speech, domain-specific information (>600PB per year) • IR has to be fast + scalable • Information is dynamic – News, web pages, maps, … – Queries are dynamic (you might even change your information needs while searching) • Cope with data and searcher change – This introduces tensions in every component of a search engine 54
  • 55. Methodology • Experimentation in IR • Three fundamental types of IR research: – Systems (efficiency) – Methods (effectiveness) – Applications (user utility) • Empirical evaluation plays a critical role across all three types of research 55
  • 56. Methodology (II) • Information retrieval (IR) is a highly applied scientific discipline • Experimentation is a critical component of the scientific method • Poor experimental methodologies are not scientifically sound and should be avoided 56
  • 57. 57
  • 58. [Diagram: task → information need → verbal form → query → search engine over a corpus → results, with a query refinement loop back to the query] 58
  • 59. [Architecture diagram: user interface and query interpretation on the query side; crawling, text processing, indexing, and document interpretation over the document collection; matching, ranking, and "general voodoo" against the index and metadata] 59
  • 60. [Pipeline diagram: documents → crawler → NLP pipeline → tokens → indexer → index, queried by the query system] 60
  • 61. [Distributed architecture diagram: DNS, broker, cache server, and clusters, with the index split by partition and copied by replication] 61
  • 62. <a href= • Web pages are linked – AKA the Web Graph • We can walk through the graph to crawl • We can rank using the graph 62
  • 63. Web pages are connected 63
  • 64. Web Search • Basic search technology shared with IR systems – Representation – Indexing – Ranking • Scale (in terms of data and users) changes the game – Efficiency/architectural design decisions • Link structure – For data acquisition (crawling) – For ranking (PageRank, HITS) – For spam detection – For extending document representations (anchor text) • Adversarial IR • Monetization 64
  • 65. User Needs • Need – Informational – want to learn about something (~40% / 65%) – Navigational – want to go to that page (~25% / 15%) – Transactional – want to do something (web-mediated) (~35% / 20%) • Access a service • Downloads • Shop – Gray areas • Find a good hub • Exploratory search "see what's there" Example queries: Low hemoglobin, United Airlines, Seattle weather, Mars surface images, Canon S410, Car rental Brasil 65
  • 66. How far do people look for results? (Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf) 66
  • 67. Users' empirical evaluation of results • Quality of pages varies widely – Relevance is not enough – Other desirable qualities (non IR!!) • Content: Trustworthy, diverse, non-duplicated, well maintained • Web readability: display correctly & fast • No annoyances: pop-ups, etc. • Precision vs. recall – On the web, recall seldom matters • What matters – Precision at 1? Precision above the fold? – Comprehensiveness – must be able to deal with obscure queries • Recall matters when the number of matches is very small • User perceptions may be unscientific, but are significant over a large aggregate 67
  • 68. Users' empirical evaluation of engines • Relevance and validity of results • UI – Simple, no clutter, error tolerant • Trust – Results are objective • Coverage of topics for ambiguous queries • Pre/Post process tools provided – Mitigate user errors (auto spell check, search assist, …) – Explicit: Search within results, more like this, refine ... – Anticipative: related searches • Deal with idiosyncrasies – Web specific vocabulary • Impact on stemming, spell-check, etc. – Web addresses typed in the search box • "The first, the last, the best and the worst …" 68
  • 69. The Web document collection • No design/co-ordination • Distributed content creation, linking, democratization of publishing • Content includes truth, lies, obsolete information, contradictions … • Unstructured (text, html, …), semi-structured (XML, annotated photos), structured (Databases) … • Scale much larger than previous text collections … but corporate records are catching up • Growth – slowed down from initial "volume doubling every few months" but still expanding • Content can be dynamically generated The Web 69
  • 70. Basic crawler operation • Begin with known "seed" URLs • Fetch and parse them – Extract URLs they point to – Place the extracted URLs on a queue • Fetch each URL on the queue and repeat 70
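The loop above can be sketched in a few lines. This is a single-machine toy: the fetch step is faked with an in-memory link graph over made-up URLs so it runs offline, whereas a real crawler would add HTTP fetching, HTML parsing, politeness delays, and distribution:

```python
# Minimal crawl loop: seeds -> fetch & parse -> queue extracted URLs -> repeat.
from collections import deque

LINK_GRAPH = {  # fake "web": page -> links it contains (hypothetical URLs)
    "http://seed.example/": ["http://a.example/", "http://b.example/"],
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": [],
    "http://c.example/": ["http://seed.example/"],
}

def fetch_and_parse(url):
    """Stand-in for an HTTP GET plus HTML link extraction."""
    return LINK_GRAPH.get(url, [])

def crawl(seeds):
    frontier = deque(seeds)   # the URL frontier
    seen = set(seeds)         # avoid re-crawling the same URL
    crawled = []
    while frontier:
        url = frontier.popleft()
        crawled.append(url)
        for link in fetch_and_parse(url):
            if link not in seen:      # place only unseen URLs on the queue
                seen.add(link)
                frontier.append(link)
    return crawled

print(crawl(["http://seed.example/"]))
```

The `seen` set is what keeps the cycle back to the seed from looping forever; breadth-first order falls out of using a FIFO frontier.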
  • 71. [Crawling picture: seed pages feed a frontier of URLs; URLs crawled and parsed on one side, the unseen Web on the other] 71
  • 72. Simple picture – complications • Web crawling isn't feasible with one machine – All of the above steps distributed • Malicious pages – Spam pages – Spider traps – including dynamically generated • Even non-malicious pages pose challenges – Latency/bandwidth to remote servers vary – Webmasters' stipulations • How "deep" should you crawl a site's URL hierarchy? – Site mirrors and duplicate pages • Politeness – don't hit a server too often 72
  • 73. What any crawler must do • Be Polite: Respect implicit and explicit politeness considerations – Only crawl allowed pages – Respect robots.txt • Be Robust: Be immune to spider traps and other malicious behavior from web servers – Be efficient 73
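Respecting robots.txt is directly supported by Python's standard library. A minimal sketch, with the rules supplied inline for a hypothetical site rather than fetched over the network:

```python
# Politeness sketch: honoring robots.txt with urllib.robotparser.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([                      # rules for a hypothetical site
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 10",          # wait 10s between requests to this host
])

print(rp.can_fetch("MyCrawler", "http://example.com/public/page.html"))
print(rp.can_fetch("MyCrawler", "http://example.com/private/secret.html"))
print(rp.crawl_delay("MyCrawler"))
```

A polite crawler checks `can_fetch` before every request and sleeps at least `crawl_delay` between hits to the same host.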
  • 74. What any crawler should do • Be capable of distributed operation: designed to run on multiple distributed machines • Be scalable: designed to increase the crawl rate by adding more machines • Performance/efficiency: permit full use of available processing and network resources 74
  • 75. What any crawler should do • Fetch pages of "higher quality" first • Continuous operation: Continue fetching fresh copies of a previously fetched page • Extensible: Adapt to new data formats, protocols 75
  • 76. [Updated crawling picture: seed pages feed the URL frontier, which crawling threads consume; URLs crawled and parsed vs. the unseen Web] 76
  • 77. 77
  • 78. Document views [Figure: the document "Sailing in Greece" by B. Smith under four views: content view (sailing, greece, mediterranean, fish, sunset), data view (Author = "B. Smith", Crdate = "14.12.96", Ladate = "11.07.02"), structure view (head, title, author, chapter, sections), layout view] 78
  • 79. What is a document: document views • Content view is concerned with representing the content of the document; that is, what the document is about. • Data view is concerned with factual data associated with the document (e.g. author names, publishing date) • Layout view is concerned with how documents are displayed to the users; this view is related to user interface and visualization issues. • Structure view is concerned with the logical structure of the document (e.g. a book being composed of chapters, themselves composed of sections, etc.) 79
  • 80. Indexing language • An indexing language: – Is the language used to describe the content of documents (and queries) – And it usually consists of index terms that are derived from the text (automatic indexing), or arrived at independently (manual indexing), using a controlled or uncontrolled vocabulary – Basic operation: is this query term present in this document? 80
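The basic operation named above, "is this query term present in this document?", is exactly what an inverted index answers. A minimal sketch over made-up documents:

```python
# Inverted index sketch: term -> set of document ids (the postings).
docs = {
    1: "sailing in greece",
    2: "cheap sailing holidays",
    3: "greece travel guide",
}

index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def boolean_and(*terms):
    """Documents containing every query term (boolean AND)."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

print(boolean_and("sailing", "greece"))  # only doc 1 has both terms
```

Intersecting postings lists is the workhorse of boolean retrieval; ranked models reuse the same structure with weights attached to each posting.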
  • 81. Generating document representations • The building of the indexing language, that is generating the document representation, is done in several steps: – Character encoding – Language recognition – Page segmentation (boilerplate detection) – Tokenization (identification of words) – Term normalization – Stopword removal – Stemming – Others (document expansion, etc.) 81
  • 82. Generating document representations: overview documents → tokenization → tokens → remove noisy words (stop-words) → reduce to stems → stems → terms (index terms) + others: e.g. thesaurus, more complex processing 82
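The steps in the overview can be sketched end to end. The stop list and the suffix-stripping "stemmer" below are deliberately naive stand-ins (a real system would use something like a Porter stemmer and a tuned stop list):

```python
# Document representation pipeline: tokenize -> drop stop words -> stem.
import re

STOP_WORDS = {"the", "a", "and", "to", "be", "in", "of", "were"}

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Toy suffix-stripper -- NOT a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def index_terms(text):
    return [stem(t) for t in remove_stop_words(tokenize(text))]

print(index_terms("The sailors were sailing to Greece"))
```

Note how "sailors" and "sailing" collapse to the same index term "sail"/"sailor" family only partially; real stemmers are tuned precisely around such conflation trade-offs.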
  • 83. Parsing a document • What format is it in? – pdf/word/excel/html? • What language is it in? • What character set is in use? – (ISO-8859, UTF-8, …) But these tasks are often done heuristically … 83
  • 84. Complications: Format/language • Documents being indexed can include docs from many different languages – A single index may contain terms from many languages. • Sometimes a document or its components can contain multiple languages/formats – French email with a German pdf attachment. – French email quoting clauses from an English-language contract • There are commercial and open source libraries that can handle a lot of this stuff 84
  • 85. Complications: What is a document? We return from our query "documents" but there are often interesting questions of grain size: What is a unit document? – A file? – An email? (Perhaps one of many in a single mbox file) • What about an email with 5 attachments? – A group of files (e.g., PPT or LaTeX split over HTML pages) 85
  • 86. Tokenization • Input: "Friends, Romans and Countrymen" • Output: Tokens – Friends – Romans – Countrymen • A token is an instance of a sequence of characters • Each such token is now a candidate for an index entry, after further processing • But what are valid tokens to emit? 86
  • 87. Tokenization • Issues in tokenization: – Finland's capital → Finland AND s? Finlands? Finland's? – Hewlett-Packard → Hewlett and Packard as two tokens? • state-of-the-art: break up hyphenated sequence. • co-education • lowercase, lower-case, lower case? • It can be effective to get the user to put in possible hyphens – San Francisco: one token or two? • How do you decide it is one token? 87
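These choices are visible even in a tiny tokenizer: the two regex-based sketches below disagree on exactly the slide's examples, depending on whether internal apostrophes and hyphens split a word or stay inside it:

```python
# Two toy tokenizers with different hyphen/apostrophe policies.
import re

def split_tokens(text):
    """Break on any non-letter: apostrophes and hyphens split words."""
    return re.findall(r"[A-Za-z]+", text)

def keep_hyphens(text):
    """Keep internal hyphens/apostrophes as part of the token."""
    return re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*", text)

print(split_tokens("Finland's capital"))  # splits off the trailing "s"
print(keep_hyphens("Finland's capital"))  # keeps "Finland's" whole
print(split_tokens("Hewlett-Packard"))    # two tokens
print(keep_hyphens("Hewlett-Packard"))    # one token
```

Neither policy is "right"; what matters is that documents and queries are tokenized the same way, or the index terms will never match.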
  • 88. Numbers • 3/20/91 Mar. 12, 1991 20/3/91 • 55 B.C. • B-52 • My PGP key is 324a3df234cb23e • (800) 234-2333 • Often have embedded spaces • Older IR systems may not index numbers. But they are often very useful: think about things like looking up error codes/stacktraces on the web • Will often index “meta-data” separately: creation date, format, etc. 88
  • 89. Tokenization: language issues • French – L'ensemble → one token or two? • L ? L’ ? Le ? • Want l’ensemble to match with un ensemble – Until at least 2003, it didn’t on Google » Internationalization! • German noun compounds are not segmented – Lebensversicherungsgesellschaftsangestellter – ‘life insurance company employee’ – German retrieval systems benefit greatly from a compound splitter module – Can give a 15% performance boost for German 89
  • 90. Tokenization: language issues โ€ข Chinese and Japanese have no spaces between words: โ€“ ่ŽŽๆ‹‰ๆณขๅจƒ็Žฐๅœจๅฑ…ไฝๅœจ็พŽๅ›ฝไธœๅ—้ƒจ็š„ไฝ›็ฝ—้‡Œ่พพใ€‚ โ€“ Not always guaranteed a unique tokenization โ€ข Further complicated in Japanese, with multiple alphabets intermingled โ€“ Dates/amounts in multiple formats ใƒ•ใ‚ฉใƒผใƒใƒฅใƒณ500็คพใฏๆƒ…ๅ ฑไธ่ถณใฎใŸใ‚ๆ™‚้–“ใ‚ใŸ$500K(็ด„6,000ไธ‡ๅ††) Katakana Hiragana Kanji Romaji End-user can express query entirely in hiragana! 90
  • 91. Tokenization: language issues • Arabic (or Hebrew) is basically written right to left, but with certain items like numbers written left to right • Words are separated, but letter forms within a word form complex ligatures [figure: bidirectional text example] ‘Algeria achieved its independence in 1962 after 132 years of French occupation.’ • With Unicode, the surface presentation is complex, but the stored form is straightforward 91
  • 92. Stop words • With a stop list, you exclude from the dictionary entirely the commonest words. Intuition: – They have little semantic content: the, a, and, to, be – There are a lot of them: ~30% of postings for top 30 words • But the trend is away from doing this: – Good compression techniques mean the space for including stop words in a system can be small – Good query optimization techniques mean you pay little at query time for including stop words. – You need them for: • Phrase queries: “King of Denmark” • Various song titles, etc.: “Let it be”, “To be or not to be” • “Relational” queries: “flights to London” 92
  • 93. Normalization to terms • Want: matches to occur despite superficial differences in the character sequences of the tokens • We may need to “normalize” words in indexed text as well as query words into the same form – We want to match U.S.A. and USA • Result is terms: a term is a (normalized) word type, which is an entry in our IR system dictionary • We most commonly implicitly define equivalence classes of terms by, e.g., – deleting periods to form a term • U.S.A., USA → USA – deleting hyphens to form a term • anti-discriminatory, antidiscriminatory → antidiscriminatory 93
  • 94. Normalization: other languages • Accents: e.g., French résumé vs. resume. • Umlauts: e.g., German: Tuebingen vs. Tübingen – Should be equivalent • Most important criterion: – How are your users likely to write their queries for these words? • Even in languages that standardly have accents, users often may not type them – Often best to normalize to a de-accented term • Tuebingen, Tübingen, Tubingen → Tubingen 94
  • 95. Case folding โ€ข Reduce all letters to lower case โ€“ exception: upper case in mid-sentence? โ€ข e.g., General Motors โ€ข Fed vs. fed โ€ข SAIL vs. sail โ€“ Often best to lower case everything, since users will use lowercase regardless of โ€˜correctโ€™ capitalizationโ€ฆ โ€ข Longstanding Google example: [fixed in 2011โ€ฆ] โ€“ Query C.A.T. โ€“ #1 result is for โ€œcatsโ€ (well, Lolcats) not Caterpillar Inc. 95
  • 96. Normalization to terms โ€ข An alternative to equivalence classing is to do asymmetric expansion โ€ข An example of where this may be useful โ€“ Enter: window Search: window, windows โ€“ Enter: windows Search: Windows, windows, window โ€“ Enter: Windows Search: Windows โ€ข Potentially more powerful, but less efficient 96
  • 97. Thesauri and soundex โ€ข Do we handle synonyms and homonyms? โ€“ E.g., by hand-constructed equivalence classes โ€ข car = automobile color = colour โ€“ We can rewrite to form equivalence-class terms โ€ข When the document contains automobile, index it under car-automobile (and vice-versa) โ€“ Or we can expand a query โ€ข When the query contains automobile, look under car as well โ€ข What about spelling mistakes? โ€“ One approach is Soundex, which forms equivalence classes of words based on phonetic heuristics 97
  • 98. Lemmatization • Reduce inflectional/variant forms to base form • E.g., – am, are, is → be – car, cars, car's, cars' → car • the boy's cars are different colors → the boy car be different color • Lemmatization implies doing “proper” reduction to dictionary headword form 98
  • 99. Stemming โ€ข Reduce terms to their โ€œrootsโ€ before indexing โ€ข โ€œStemmingโ€ suggests crude affix chopping โ€“ language dependent โ€“ e.g., automate(s), automatic, automation all reduced to automat. for example compressed and compression are both accepted as equivalent to compress. for exampl compress and compress ar both accept as equival to compress 99
  • 100. โ€“ Affix removal โ€ข remove the longest affix: {sailing, sailor} => sail โ€ข simple and effective stemming โ€ข a widely used such stemmer is Porterโ€™s algorithm โ€“ Dictionary-based using a look-up table โ€ข look for stem of a word in table: play + ing => play โ€ข space is required to store the (large) table, so often not practical 100
  • 101. Stemming: some issues โ€ข Detect equivalent stems: โ€“ {organize, organise}: e as the longest affix leads to {organiz, organis}, which should lead to one stem: organis โ€“ Heuristics are therefore used to deal with such cases. โ€ข Over-stemming: โ€“ {organisation, organ} reduced into org, which is incorrect โ€“ Again heuristics are used to deal with such cases. 101
  • 102. Porterโ€™s algorithm โ€ข Commonest algorithm for stemming English โ€“ Results suggest itโ€™s at least as good as other stemming options โ€ข Conventions + 5 phases of reductions โ€“ phases applied sequentially โ€“ each phase consists of a set of commands โ€“ sample convention: Of the rules in a compound command, select the one that applies to the longest suffix. 102
  • 103. Typical rules in Porter • sses → ss • ies → i • ational → ate • tional → tion 103
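A toy sketch of how such a rule table can be applied with the longest-suffix convention from the previous slide (the rule list is only the subset shown above; the real Porter algorithm adds stem-measure conditions and five sequential phases):

```python
# Illustrative subset of Porter rules: (suffix, replacement).
RULES = [("sses", "ss"), ("ies", "i"), ("ational", "ate"), ("tional", "tion")]

def apply_rules(word):
    # Of the rules that apply, select the one matching the longest suffix.
    matches = [(suf, rep) for suf, rep in RULES if word.endswith(suf)]
    if not matches:
        return word
    suf, rep = max(matches, key=lambda m: len(m[0]))
    return word[:-len(suf)] + rep

apply_rules("caresses")   # -> 'caress'
apply_rules("relational") # -> 'relate' ("ational" wins over "tional")
```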
  • 104. Language-specificity โ€ข The above methods embody transformations that are โ€“ Language-specific, and often โ€“ Application-specific โ€ข These are โ€œplug-inโ€ addenda to the indexing process โ€ข Both open source and commercial plug-ins are available for handling these 104
  • 105. Does stemming help? โ€ข English: very mixed results. Helps recall for some queries but harms precision on others โ€“ E.g., operative (dentistry) โ‡’ oper โ€ข Definitely useful for Spanish, German, Finnish, โ€ฆ โ€“ 30% performance gains for Finnish! 105
  • 106. Others: Using a thesaurus • A thesaurus provides a standard vocabulary for indexing (and searching) • More precisely, a thesaurus provides a classified hierarchy for broadening and narrowing terms bank: 1. Finance institute 2. River edge – if a document is indexed with bank, then index it with “finance institute” or “river edge” – need to disambiguate the sense of bank in the text: e.g. if money appears in the document, then choose “finance institute” • A widely used online thesaurus: WordNet 106
  • 107. Information storage • Whole topic on its own • How do we keep fresh copies of the web manageable by a cluster of computers, and answer millions of queries in milliseconds? – Inverted indexes – Compression – Caching – Distributed architectures – … and a lot of tricks • Inverted indexes: cornerstone data structure of IR systems – For each term t, we must store a list of all documents that contain t. – Identify each doc by a docID, a document serial number – Index construction is tricky (can’t hold all the information needed in memory) 107
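An in-memory toy version of the cornerstone structure just described (the document collection and helper name are made up for illustration; real index construction is disk-based and far more involved, as the slide notes):

```python
from collections import defaultdict

def build_index(docs):
    # docs: {docID: text}. For each term t, store the sorted list of
    # docIDs of the documents that contain t (the postings list).
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {t: sorted(ids) for t, ids in postings.items()}

index = build_index({1: "new home sales", 2: "home prices rise", 3: "new prices"})
# index["home"] -> [1, 2]; index["new"] -> [1, 3]
```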
  • 108. Term-document incidence matrix (docs × terms, and its transpose terms × docs):
  docs  t1 t2 t3
  D1     1  0  1
  D2     1  0  0
  D3     0  1  1
  D4     1  0  0
  D5     1  1  1
  D6     1  1  0
  D7     0  1  0
  D8     0  1  0
  D9     0  1  1
  D10    0  1  1
  Terms  D1 D2 D3 D4
  t1      1  1  0  1
  t2      0  0  1  0
  t3      1  0  1  0
  108
  • 109. โ€ข Most basic form: โ€“ Document frequency โ€“ Term frequency โ€“ Document identifiers 109 term Term id df a 1 4 as 2 3 (1,2), (2,5), (10,1), (11,1) (1,3), (3,4), (20,1)
  • 110. โ€ข Indexes contain more information โ€“ Position in the document โ€ข Useful for โ€œphrase queriesโ€ or โ€œproximity queriesโ€ โ€“ Fields in which the term appears in the document โ€“ Metadata โ€ฆ โ€“ All that can be used for ranking 110 (1,2, [1, 1], [2,10]), โ€ฆ Field 1 (title), position 1
  • 111. Queries โ€ข How do we process a query? โ€ข Several kinds of queries โ€“ Boolean โ€ขChicken AND salt โ€ข Gnome OR KDE โ€ข Salt AND NOT pepper โ€“ Phrase queries โ€“ Ranked 111
  • 112. List Merging • “Exact match” queries – Chicken AND curry – Locate Chicken in the dictionary – Fetch its postings – Locate curry in the dictionary – Fetch its postings – Merge both postings 112
  • 114. List Merging • Walk through the postings in O(x+y) time: salt: 3 → 22 → 23 → 25 pepper: 3 → 5 → 22 → 25 → 36 merged result: 3 → 22 → 25 114
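The O(x+y) walk on the slide above, as code (an illustrative sketch; postings are plain sorted Python lists here):

```python
def intersect(p1, p2):
    # Merge two sorted postings lists by always advancing the pointer
    # that currently sits on the smaller docID.
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

intersect([3, 22, 23, 25], [3, 5, 22, 25, 36])  # -> [3, 22, 25]
```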
  • 116. Models of information retrieval โ€ข A model: โ€“ abstracts away from the real world โ€“ uses a branch of mathematics โ€“ possibly: uses a metaphor for searching 116
  • 117. Short history of IR modelling โ€ข Boolean model (ยฑ1950) โ€ข Document similarity (ยฑ1957) โ€ข Vector space model (ยฑ1970) โ€ข Probabilistic retrieval (ยฑ1976) โ€ข Language models (ยฑ1998) โ€ข Linkage-based models (ยฑ1998) โ€ข Positional models (ยฑ2004) โ€ข Fielded models (ยฑ2005) 117
  • 118. The Boolean model (ยฑ1950) โ€ข Exact matching: data retrieval (instead of information retrieval) โ€“ A term specifies a set of documents โ€“ Boolean logic to combine terms / document sets โ€“ AND, OR and NOT: intersection, union, and difference 118
  • 119. Statistical similarity between documents (±1957) • The principle of similarity: “The more two representations agree in given elements and their distribution, the higher would be the probability of their representing similar information” (Luhn 1957) “It is here proposed that the frequency of word [term] occurrence in an article [document] furnishes a useful measurement of word [term] significance” 119
  • 120. Zipf’s law [figure: frequency of terms f plotted against terms by rank order r] 120
  • 121. Zipfโ€™s law โ€ข Relative frequencies of terms. โ€ข In natural language, there are a few very frequent terms and very many very rare terms. โ€ข Zipfโ€™s law: The ith most frequent term has frequency proportional to 1/i . โ€ข cfi โˆ 1/i = K/i where K is a normalizing constant โ€ข cfi is collection frequency: the number of occurrences of the term ti in the collection. โ€ข Zipfโ€™s law holds for different languages 121
  • 122. Zipf consequences โ€ข If the most frequent term (the) occurs cf1 times โ€“ then the second most frequent term (of) occurs cf1/2 times โ€“ the third most frequent term (and) occurs cf1/3 times โ€ฆ โ€ข Equivalent: cfi = K/i where K is a normalizing factor, so โ€“ log cfi = log K - log i โ€“ Linear relationship between log cfi and log i โ€ข Another power law relationship 122
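A quick numeric check of the log-log linearity claimed above (K is an arbitrary constant chosen for illustration):

```python
import math

K = 1000.0
ranks = [1, 2, 3, 10, 100]
freqs = [K / i for i in ranks]  # cf_i = K / i

# log cf_i = log K - log i, so consecutive points have slope -1
# in log-log space:
slopes = [
    (math.log(freqs[k + 1]) - math.log(freqs[k]))
    / (math.log(ranks[k + 1]) - math.log(ranks[k]))
    for k in range(len(ranks) - 1)
]
# every entry of slopes is -1.0 (up to floating point)
```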
  • 123. Zipfโ€™s law in action 123
  • 124. Luhn’s analysis - Observation [figure: frequency of terms f vs. terms by rank order r, with an upper and a lower cut-off separating common terms, significant terms, and rare terms] Resolving power of significant terms: ability of terms to discriminate document content; the peak is at the rank order position half way between the two cut-offs 124
  • 125. Luhnโ€™s analysis - Implications โ€ข Common terms are not good at representing document content โ€“ partly implemented through the removal of stop words โ€ข Rare words are also not good at representing document content โ€“ usually nothing is done โ€“ Not true for every โ€œdocumentโ€ โ€ข Need a means to quantify the resolving power of a term: โ€“ associate weights to index terms โ€“ tfร—idf approach 125
  • 126. Ranked retrieval โ€ข Boolean queries are good for expert users with precise understanding of their needs and the collection. โ€“ Also good for applications: Applications can easily consume 1000s of results. โ€ข Not good for the majority of users. โ€“ Most users incapable of writing Boolean queries (or they are, but they think itโ€™s too much work). โ€“ Most users donโ€™t want to wade through 1000s of results. โ€ข This is particularly true of web search.
  • 127. Feast or Famine โ€ข Boolean queries often result in either too few (=0) or too many (1000s) results. โ€ข Query 1: โ€œstandard user dlink 650โ€ โ†’ 200,000 hits โ€ข Query 2: โ€œstandard user dlink 650 no card foundโ€: 0 hits โ€ข It takes a lot of skill to come up with a query that produces a manageable number of hits. โ€“ AND gives too few; OR gives too many
  • 128. Ranked retrieval models โ€ข Rather than a set of documents satisfying a query expression, in ranked retrieval, the system returns an ordering over the (top) documents in the collection for a query โ€ข Free text queries: Rather than a query language of operators and expressions, the userโ€™s query is just one or more words in a human language โ€ข In principle, there are two separate choices here, but in practice, ranked retrieval has normally been associated with free text queries and vice versa 128
  • 129. Feast or famine: not a problem in ranked retrieval โ€ข When a system produces a ranked result set, large result sets are not an issue โ€“ Indeed, the size of the result set is not an issue โ€“ We just show the top k ( โ‰ˆ 10) results โ€“ We do not overwhelm the user โ€“ Premise: the ranking algorithm works
  • 130. Scoring as the basis of ranked retrieval โ€ข We wish to return in order the documents most likely to be useful to the searcher โ€ข How can we rank-order the documents in the collection with respect to a query? โ€ข Assign a score โ€“ say in [0, 1] โ€“ to each document โ€ข This score measures how well document and query โ€œmatchโ€.
  • 131. Query-document matching scores โ€ข We need a way of assigning a score to a query/document pair โ€ข Letโ€™s start with a one-term query โ€ข If the query term does not occur in the document: score should be 0 โ€ข The more frequent the query term in the document, the higher the score (should be) โ€ข We will look at a number of alternatives for this.
  • 132. Bag of words model โ€ข Vector representation does not consider the ordering of words in a document โ€ข John is quicker than Mary and Mary is quicker than John have the same vectors โ€ข This is called the bag of words model.
  • 133. Term frequency tf โ€ข The term frequency tf(t,d) of term t in document d is defined as the number of times that t occurs in d. โ€ข We want to use tf when computing query-document match scores. But how? โ€ข Raw term frequency is not what we want: โ€“ A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term. โ€“ But not 10 times more relevant. โ€ข Relevance does not increase proportionally with term frequency.
  • 134. Log-frequency weighting • The log frequency weight of term t in d is: w_{t,d} = 1 + log10(tf_{t,d}) if tf_{t,d} > 0, and 0 otherwise • 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc. • Score for a document-query pair: sum over terms t in both q and d: score = Σ_{t ∈ q∩d} (1 + log tf_{t,d}) • The score is 0 if none of the query terms is present in the document.
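A minimal sketch of log-frequency scoring (function names and the example counts are illustrative assumptions):

```python
import math

def log_tf(tf):
    # w_{t,d} = 1 + log10(tf) if tf > 0, else 0
    return 1.0 + math.log10(tf) if tf > 0 else 0.0

def score(query_terms, doc_tf):
    # Sum the weights over terms appearing in both query and document.
    return sum(log_tf(doc_tf.get(t, 0)) for t in query_terms)

score(["best", "car", "insurance"], {"car": 10, "insurance": 1})
# -> 2.0 + 1.0 = 3.0 ("best" is absent and contributes 0)
```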
  • 135. Document frequency โ€ข Rare terms are more informative than frequent terms โ€“ Recall stop words โ€ข Consider a term in the query that is rare in the collection (e.g., arachnocentric) โ€ข A document containing this term is very likely to be relevant to the query arachnocentric โ€ข โ†’ We want a high weight for rare terms like arachnocentric.
  • 136. Document frequency, continued โ€ข Frequent terms are less informative than rare terms โ€ข Consider a query term that is frequent in the collection (e.g., high, increase, line) โ€ข A document containing such a term is more likely to be relevant than a document that does not โ€ข But itโ€™s not a sure indicator of relevance. โ€ข โ†’ For frequent terms, we want high positive weights for words like high, increase, and line โ€ข But lower weights than for rare terms. โ€ข We will use document frequency (df) to capture this.
  • 137. idf weight • df_t is the document frequency of t: the number of documents that contain t – df_t is an inverse measure of the informativeness of t – df_t ≤ N • We define the idf (inverse document frequency) of t by idf_t = log10(N/df_t) – We use log (N/df_t) instead of N/df_t to “dampen” the effect of idf.
  • 138. Effect of idf on ranking โ€ข Does idf have an effect on ranking for one-term queries, like โ€“ iPhone โ€ข idf has no effect on ranking one term queries โ€“ idf affects the ranking of documents for queries with at least two terms โ€“ For the query capricious person, idf weighting makes occurrences of capricious count for much more in the final document ranking than occurrences of person. 138
  • 139. tf-idf weighting • The tf-idf weight of a term is the product of its tf weight and its idf weight: w_{t,d} = log(1 + tf_{t,d}) × log10(N/df_t) • Best known weighting scheme in information retrieval – Note: the “-” in tf-idf is a hyphen, not a minus sign! – Alternative names: tf.idf, tf x idf • Increases with the number of occurrences within a document • Increases with the rarity of the term in the collection
  • 140. Score for a document given a query Score(q,d) = Σ_{t ∈ q∩d} tf.idf_{t,d} • There are many variants – How “tf” is computed (with/without logs) – Whether the terms in the query are also weighted – … 140
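One concrete variant of the scheme (log tf, base-10 idf, unweighted query terms); all the collection numbers below are made up for illustration:

```python
import math

def tf_idf(tf, df, N):
    # w_{t,d} = log(1 + tf_{t,d}) * log10(N / df_t)
    return math.log(1 + tf) * math.log10(N / df)

def score(query, doc_tf, df, N):
    return sum(tf_idf(doc_tf[t], df[t], N) for t in query if t in doc_tf)

N = 1_000_000
df = {"arachnocentric": 10, "line": 100_000}   # rare vs. frequent term
doc = {"arachnocentric": 1, "line": 3}
# The rare term dominates the score even with a single occurrence:
# tf_idf(1, 10, N) ~ 3.47  vs.  tf_idf(3, 100_000, N) ~ 1.39
```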
  • 141. Documents as vectors โ€ข So we have a |V|-dimensional vector space โ€ข Terms are axes of the space โ€ข Documents are points or vectors in this space โ€ข Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine โ€ข These are very sparse vectors - most entries are zero.
  • 142. Statistical similarity between documents (±1957) • Vector product – If the vector has binary components, then the product measures the number of shared terms – Vector components might be "weights" score(q, d) = Σ_{k ∈ matching terms} q_k · d_k
  • 143. Why distance is a bad idea The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.
  • 144. Vector space model (ยฑ1970) โ€ข Documents and queries are vectors in a high-dimensional space โ€ข Geometric measures (distances, angles)
  • 145. Vector space model (±1970) • Cosine of an angle: – close to 1 if angle is small – 0 if vectors are orthogonal cos(d, q) = (Σ_{k=1..m} d_k · q_k) / (√(Σ_{k=1..m} d_k²) · √(Σ_{k=1..m} q_k²)) = n(d) · n(q), where n(v) = v / √(Σ_{k=1..m} v_k²) is the length-normalized vector
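The cosine as code (an illustrative sketch; vectors are dense lists here, while real systems use sparse term-weight representations):

```python
import math

def cosine(d, q):
    # cos(d, q) = (d . q) / (|d| * |q|)
    dot = sum(dk * qk for dk, qk in zip(d, q))
    norm = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(x * x for x in q))
    return dot / norm

cosine([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])  # parallel vectors   -> 1.0
cosine([1.0, 0.0], [0.0, 1.0])            # orthogonal vectors -> 0.0
```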
  • 146. Vector space model (ยฑ1970) โ€ข PRO: Nice metaphor, easily explained; Mathematically sound: geometry; Great for relevance feedback โ€ข CON: Need term weighting (tf-idf); Hard to model structured queries
  • 147. Probabilistic IR โ€ข An IR system has an uncertain understanding of userโ€™s queries and makes uncertain guesses on whether a document satisfies a query or not. โ€ข Probability theory provides a principled foundation for reasoning under uncertainty. โ€ข Probabilistic models build upon this foundation to estimate how likely it is that a document is relevant for a query. 147
  • 148. Event Space โ€ข Query representation โ€ข Document representation โ€ข Relevance โ€ข Event space โ€ข Conceptually there might be pairs with same q and d, but different r โ€ข Some times include include user u, context c, etc. 148
  • 149. Probability Ranking Principle โ€ข Robertson (1977) โ€“ โ€œIf a reference retrieval systemโ€™s response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data.โ€ โ€ข Basis for probabilistic approaches for IR 149
  • 150. Dissecting PRP • Probability of relevance • Estimated accurately • Based on whatever data available • Best possible accuracy – The perfect IR system! – Assumes relevance is independent of other documents in the collection 150
  • 151. Relevance? โ€ข What is ? โ€“ Isnโ€™t it decided by the user? her opinion? โ€ข User doesnโ€™t mean a human being! โ€“ We are working with representations โ€“ ... or parts of the reality available to us โ€ข 2/3 keywords, no profile, no context ... โ€“ relevance is uncertain โ€ข depends on what the system sees โ€ข may be marginalized over all the unseen context/profiles 151
  • 152. Retrieval as binary classification โ€ข For every (q,d), r takes two values โ€“ Relevant and non-relevant documents โ€“ can be extended to multiple values โ€ข Retrieve using Bayesโ€™ decision โ€“ PRP is related to the Bayes error rate (lowest possible error rate for a class) โ€“ How do we estimate this probability? 152
  • 153. PRP ranking โ€ข How to represent the random variables? โ€ข How to estimate the modelโ€™s parameters? 153
  • 154. โ€ข d is a binary vector โ€ข Multiple Bernoulli variables โ€ข Under MB, we can decompose into a product of probabilities, with likelihoods: 154
  • 155. If the terms are not in the query: Otherwise we need estimates for them! 155
  • 156. Estimates • Assign new weights for query terms based on relevant/non-relevant documents • Give higher weights to important terms:
                       Relevant   Non-relevant    Total
  Document with t      r          n − r           n
  Document without t   R − r      N − n − R + r   N − n
  Total                R          N − R           N
  156
  • 157. Robertson-Spärck Jones weight: w_t = log [ ((r + 0.5) / (R − r + 0.5)) / ((n − r + 0.5) / (N − n − R + r + 0.5)) ] – r: relevant docs with t; R − r: relevant docs without t; n − r: non-relevant docs with t; N − n − R + r: non-relevant docs without t 157
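With the contingency counts from the previous slide and the standard 0.5 smoothing of the RSJ formulation, the weight can be computed as (a sketch; the example numbers are made up):

```python
import math

def rsj_weight(r, n, R, N):
    # r: relevant docs containing t, n: docs containing t,
    # R: relevant docs, N: collection size; 0.5 is the usual smoothing.
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))

# With no relevance information (r = R = 0) this reduces to an
# idf-like weight: rarer terms get larger weights.
rare = rsj_weight(0, 100, 0, 10_000)
common = rsj_weight(0, 5_000, 0, 10_000)
# rare > common
```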
  • 158. Estimates without relevance info โ€ข If we pick a relevant document, words are equally like to be present or absent โ€ข Non-relevant can be approximated with the collection as a whole 158
  • 160. Modeling TF โ€ข Naรฏve estimation: separate probability for every outcome โ€ข BIR had only two parameters, now we have plenty (~many outcomes) โ€ข We can plug in a parametric estimate for the term frequencies โ€ข For instance, a Poisson mixture 160
  • 161. Okapi BM25 โ€ข Same ranking function as before but with new estimates. Models term frequencies and document length. โ€ข Words are generated by a mixture of two Poissons โ€ข Assumes an eliteness variable (elite ~ word occurs unusually frequently, non-elite ~ word occurs as expected by chance). 161
  • 162. BM25 โ€ข As a graphical model 162
  • 163. BM25 โ€ข In order to approximate the formula, Robertson and Walker came up with: โ€ข Two model parameters โ€ข Very effective โ€ข The more words in common with the query the better โ€ข Repetitions less important than different query words โ€“ But more important if the document is relatively long 163
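The slide omits the formula itself; a sketch of the standard BM25 scoring function with its two model parameters, k1 (term-frequency saturation) and b (length normalization), is below. All names and example numbers are illustrative assumptions:

```python
import math

def bm25(query, doc_tf, doc_len, avg_len, df, N, k1=1.2, b=0.75):
    # Standard BM25: idf-weighted, saturating tf, length-normalized.
    s = 0.0
    for t in query:
        if t not in doc_tf:
            continue
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5))
        tf = doc_tf[t]
        s += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
    return s

df = {"salt": 10, "pepper": 10}
# Two distinct query words beat one word repeated twice:
two_words = bm25(["salt", "pepper"], {"salt": 1, "pepper": 1}, 100, 100, df, 1000)
repeated = bm25(["salt", "pepper"], {"salt": 2}, 100, 100, df, 1000)
# two_words > repeated
```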
  • 164. Generative Probabilistic Language Models โ€ข The generative approach โ€“ A generator which produces events/tokens with some probability โ€“ Probability distribution over strings of text โ€“ URN Metaphor โ€“ a bucket of different colour balls (10 red, 5 blue, 3 yellow, 2 white) โ€ข What is the probability of drawing a yellow ball? 3/20 โ€ข what is the probability of drawing (with replacement) a red ball and a white ball? ยฝ*1/10 โ€“ IR Metaphor: Documents are urns, full of tokens (balls) of (in) different terms (colors)
  • 165. What is a language model? • How likely is a string of words in a “language”? – P1(“the cat sat on the mat”) – P2(“the mat sat on the cat”) – P3(“the cat sat en la alfombra”) – P4(“el gato se sentó en la alfombra”) • Given a model M and an observation s we want – Probability of getting s through random sampling from M – A mechanism to produce observations (strings) legal in M • User thinks of a relevant document and then picks some keywords to use as a query 165
  • 166. Generative Probabilistic Models • What is the probability of producing the query from a document? p(q|d) • Referred to as query-likelihood • Assumptions: • The probability of a document being relevant is strongly correlated with the probability of a query given a document, i.e. p(d|r) is correlated with p(q|d) • User has a reasonable idea of the terms that are likely to appear in the “ideal” document • User’s query terms can distinguish the “ideal” document from the rest of the corpus • The query is generated as a representative of the “ideal” document • System’s task is to estimate, for each of the documents in the collection, which is most likely to be the “ideal” document
  • 167. Language Models (1998/2001) โ€ข Letโ€™s assume we point blindly, one at a time, at 3 words in a document โ€“ What is the probability that I, by accident, pointed at the words โ€œMasterโ€, โ€œcomputerโ€ and โ€œScienceโ€? โ€“ Compute the probability, and use it to rank the documents. โ€ข Words are โ€œsampledโ€ independently of each other โ€“ Joint probability decomposed into a product of marginals โ€“ Estimation of probabilities just by counting โ€ข Higher models or unigrams? โ€“ Parameter estimation can be very expensive
  • 168. Standard LM Approach โ€ข Assume that query terms are drawn identically and independently from a document
  • 169. Estimating language models โ€ข Usually we donโ€™t know M โ€ข Maximum Likelihood Estimate of โ€“ Simply use the number of times the query term occurs in the document divided by the total number of term occurrences. โ€ข Zero Probability (frequency) problem 169
  • 170. Document Models โ€ข Solution: Infer a language model for each document, where โ€ข Then we can estimate โ€ข Standard approach is to use the probability of a term to smooth the document model. โ€ข Interpolate the ML estimator with general language expectations
  • 171. Estimating Document Models โ€ข Basic Components โ€“ Probability of a term given a document (maximum likelihood estimate) โ€“ Probability of a term given the collection โ€“ tf(t,d) is the number of times term t occurs in document d (term frequency)
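A sketch of query likelihood with linear (Jelinek-Mercer) interpolation of the two components just listed; lam is the interpolation weight, and the collection statistics below are made up (other smoothing schemes, such as Dirichlet, are equally common):

```python
import math

def lm_score(query, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
    # log p(q|d) with the document model smoothed by the collection model:
    # p(t|d) = lam * tf(t,d)/|d| + (1 - lam) * cf(t)/|C|
    logp = 0.0
    for t in query:
        p = lam * doc_tf.get(t, 0) / doc_len + (1 - lam) * coll_tf[t] / coll_len
        logp += math.log(p)
    return logp

coll_tf, coll_len = {"master": 10, "computer": 50}, 1000
# A document containing both query terms outranks one missing a term,
# but the missing term no longer zeroes out the whole score:
both = lm_score(["master", "computer"], {"master": 2, "computer": 3}, 100, coll_tf, coll_len)
one = lm_score(["master", "computer"], {"computer": 1}, 100, coll_tf, coll_len)
# both > one
```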
  • 172. Language Models โ€ข Implementation
  • 173. Implementation as vector product p(t|D) = tf(t,D) / Σ_{t'} tf(t',D)    p(t) = df(t) / Σ_{t'} df(t') Recall: score(q,d) = Σ_{k ∈ matching terms} tf(k,q) · log(1 + (λ_k · tf(k,d) · Σ_t df(t)) / ((1 − λ_k) · df(k) · |d|)) – tf(k,d)/df(k): tf.idf of term k in document d; λ_k/(1 − λ_k): odds of the probability of term importance; 1/|d|: inverse length of d
  • 174. Document length normalization โ€ข Probabilistic models assume causes for documents differing in length โ€“ Scope โ€“ Verbosity โ€ข In practice, document length softens the term frequency contribution to the final score โ€“ Weโ€™ve seen it in BM25 and LMs โ€“ Usually with a tunable parameter that regulates the amount of softening โ€“ Can be a function of the deviation of the average document length โ€“ Can be incorporated into vanilla tf-idf 174
  • 175. Other models โ€ข Modeling term dependencies (positions) in the language modeling framework โ€“ Markov Random Fields โ€ข Modeling matches (occurrences of words) in different parts of a document -> fielded models โ€“ BM25F โ€“ Markov Random Fields can account for this as well 175
  • 176. More involved signals for ranking โ€ข From document understanding to query understanding โ€ข Query rewrites (gazetteers, spell correction), named entity recognition, query suggestions, query categories, query segmentation ... โ€ข Detecting query intent, triggering verticals โ€“ direct target towards answers โ€“ richer interfaces 176
  • 177. Signals for Ranking โ€ข Signals for ranking: matches of query terms in documents, query-independent quality measures, CTR, among others โ€ข Probabilistic IR models are all about counting โ€“ occurrences of terms in documents, in sets of documents, etc. โ€ข How to aggregate efficiently a large number of โ€œdifferentโ€ counts โ€“ coming from the same terms โ€“ no double counts! 177
  • 178. Searching for food โ€ข New Yorkโ€™s greatest pizza โ€ฃ New OR Yorkโ€™s OR greatest OR pizza โ€ฃ New AND Yorkโ€™s AND greatest AND pizza โ€ฃ New OR York OR great OR pizza โ€ฃ โ€œNew Yorkโ€ OR โ€œgreat pizzaโ€ โ€ฃ โ€œNew Yorkโ€ AND โ€œgreat pizzaโ€ โ€ฃ York < New AND great OR pizza โ€ข among many more. 178
  • 179. โ€œRefinedโ€matching โ€ข Extract a number of virtual regions in the document that match some version of the query (operators) โ€“ Each region provides a different evidence of relevance (i.e. signal) โ€ข Aggregate the scores over the different regions โ€ข Ex. :โ€œat least any two words in the query appear either consecutively or with an extra word between themโ€ 179
  • 181. Remember BM25 โ€ข Term (tf) independence โ€ข Vague Prior over terms not appearing in the query โ€ข Eliteness - topical model that perturbs the word distribution โ€ข 2-poisson distribution of term frequencies over relevant and non-relevant documents 181
  • 182. Feature dependencies โ€ข Class-linearly dependent (or affine) features โ€“ add no extra evidence/signal โ€“ model overfitting (vs capacity) โ€ข Still, it is desirable to enrich the model with more involved features โ€ข Some features are surprisingly correlated โ€ข Positional information requires a large number of parameters to estimate โ€ข Potentially up to 182
  • 183. Query concept segmentation โ€ข Queries are made up of basic conceptual units, comprising many words โ€“ โ€œIndian summer victor herbertโ€ โ€ข Spurious matches: โ€œsan jose airportโ€ -> โ€œsan jose city airportโ€ โ€ข Model to detect segments based on generative language models and Wikipedia โ€ข Relax matches using factors of the max ratio between span length and segment length 183
  • 184. Virtual regions โ€ข Different parts of the document provide different evidence of relevance โ€ข Create a (finite) set of (latent) artificial regions and re-weight 184
  • 185. Implementation โ€ข An operator maps a query to a set of queries, which could match a document โ€ข Each operator has a weight โ€ข The average term frequency in a document is 185
  • 186. Remarks โ€ข Different saturation (eliteness) function? โ€“ learn the real functional shape! โ€“ log-logistic is good if the class-conditional distributions are drawn from an exp. family โ€ข Positions as variables? โ€“ kernel-like method or exp. #parameters โ€ข Apply operators on a per query or per query class basis? 186
  • 187. Operator examples โ€ข BOW: maps a raw query to the set of queries whose elements are the single terms โ€ข p-grams: set of all p-gram of consecutive terms โ€ข p-and: all conjunctions of p arbitrary terms โ€ข segments: match only the โ€œconceptsโ€ โ€ข Enlargement: some words might sneak in between the phrases/segments 187
  • 188. How does it work in practice? 188
  • 189. ... not that far away term frequency link information query intent information editorial information click-through information geographical information language information user preferences document length document fields other gazillion sources of information 189
  • 190. Dictionaries • Fast look-up – Might need specific structures to scale up • Hash tables • Trees – Tolerant retrieval (prefixes) – Spell checking • Document correction (OCR) • Query misspellings (did you mean … ?) • (Weighted) edit distance – dynamic programming • Jaccard overlap (index character k-grams) • Context sensitive • http://norvig.com/spell-correct.html – Wild-card queries • Permuterm index • K-gram indexes 190
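The dynamic-programming edit distance mentioned above, in its unweighted (unit-cost Levenshtein) form; a weighted variant would simply replace the three constant costs:

```python
def edit_distance(a, b):
    # Classic DP over an (m+1) x (n+1) table, kept as two rows.
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                           # deletion
                         cur[j - 1] + 1,                        # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitution
        prev = cur
    return prev[n]

edit_distance("informaton", "information")  # -> 1 (one insertion)
```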
  • 191. Hardware basics • Access to data in memory is much faster than access to data on disk. • Disk seeks: No data is transferred from disk while the disk head is being positioned. • Therefore: Transferring one large chunk of data from disk to memory is faster than transferring many small chunks. • Disk I/O is block-based: reading and writing of entire blocks (as opposed to smaller chunks). • Block sizes: 8 KB to 256 KB. 191
  • 192. Hardware basics • Many design decisions in information retrieval are based on the characteristics of hardware • Servers used in IR systems now typically have several GB of main memory, sometimes tens of GB. • Available disk space is several (2-3) orders of magnitude larger. • Fault tolerance is very expensive: it is much cheaper to use many regular machines than one fault-tolerant machine. 192
  • 193. Data flow (diagram): in the map phase, the master assigns input splits to parsers, which write segment files partitioned by term range (a-f, g-p, q-z); in the reduce phase, each inverter is assigned one term-range partition and writes its postings. 193
  • 194. MapReduce • The index construction algorithm we just described is an instance of MapReduce. • MapReduce (Dean and Ghemawat 2004) is a robust and conceptually simple framework for distributed computing … • … without having to write code for the distribution part. • They describe the Google indexing system (ca. 2002) as consisting of a number of phases, each implemented in MapReduce. • Open source implementation: Hadoop – Widely used throughout industry 194
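A single-process sketch of that indexing job (the grouping that a real MapReduce/Hadoop runtime performs over the network is faked here with a dict): map emits (term, docID) pairs, and reduce turns each term's group into a sorted postings list.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit one (term, docID) pair per token."""
    return [(term, doc_id) for term in text.lower().split()]

def reduce_phase(term, doc_ids):
    """Reduce: deduplicate and sort a term's docIDs into a postings list."""
    return term, sorted(set(doc_ids))

def build_index(docs):
    grouped = defaultdict(list)        # stand-in for the shuffle/sort step
    for doc_id, text in docs.items():
        for term, d in map_phase(doc_id, text):
            grouped[term].append(d)
    return dict(reduce_phase(t, ds) for t, ds in grouped.items())

docs = {1: "caesar came", 2: "caesar conquered"}
print(build_index(docs))
# -> {'caesar': [1, 2], 'came': [1], 'conquered': [2]}
```

The point of the framework is exactly that only `map_phase` and `reduce_phase` are user code; partitioning, shuffling, and fault tolerance come for free.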
  • 195. MapReduce • Index construction was just one phase. • Another phase: transforming a term-partitioned index into a document-partitioned index. – Term-partitioned: one machine handles a subrange of terms – Document-partitioned: one machine handles a subrange of documents • Most search engines use a document-partitioned index for better load balancing, etc. 195
  • 196. Distributed IR • Basic process – All queries are sent to a director machine – The director then sends messages to many index servers • Each index server does some portion of the query processing – The director organizes the results and returns them to the user • Two main approaches – Document distribution • by far the most popular – Term distribution 196
  • 197. Distributed IR (II) • Document distribution – each index server acts as a search engine for a small fraction of the total collection – the director sends a copy of the query to each of the index servers, each of which returns the top k results – results are merged into a single ranked list by the director • Collection statistics should be shared for effective ranking 197
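The director's merge step can be sketched as follows; it assumes the per-server scores are comparable, which is exactly why the slide notes that collection statistics should be shared across servers.

```python
import heapq

def merge_results(per_server_results, k):
    """Merge each server's local top-k (score, docID) lists into a global top-k."""
    all_hits = (hit for server in per_server_results for hit in server)
    return heapq.nlargest(k, all_hits)

server_a = [(0.9, "d1"), (0.4, "d7")]   # local top-k from index server A
server_b = [(0.8, "d3"), (0.7, "d5")]   # local top-k from index server B
print(merge_results([server_a, server_b], 3))
# -> [(0.9, 'd1'), (0.8, 'd3'), (0.7, 'd5')]
```

With n servers each returning k results, the director only ever touches n*k candidates rather than the whole collection.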
  • 198. Caching • Query distributions are similar to Zipf • About ½ of the queries each day are unique, but some are very popular – Caching can significantly improve efficiency • Cache popular query results • Cache common inverted lists – Inverted list caching can help with unique queries – Caches must be refreshed to prevent stale data 198
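A minimal sketch of a query-result cache with least-recently-used eviction; the capacity, eviction policy, and refresh strategy are all deployment choices the slide leaves open.

```python
from collections import OrderedDict

class QueryCache:
    """Tiny LRU cache mapping query strings to result lists."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, query):
        if query not in self.data:
            return None
        self.data.move_to_end(query)        # mark as most recently used
        return self.data[query]

    def put(self, query, results):
        self.data[query] = results
        self.data.move_to_end(query)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)   # evict least recently used

cache = QueryCache(2)
cache.put("obama", ["d1", "d2"])
cache.put("ipl", ["d9"])
cache.get("obama")                          # touch: "obama" is now most recent
cache.put("weather", ["d4"])                # evicts "ipl"
print(list(cache.data))                     # -> ['obama', 'weather']
```

Because the query distribution is Zipf-like, even a small cache like this absorbs a large share of the traffic; the long tail of unique queries is what inverted-list caching targets instead.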
  • 199. Others • Efficiency (compression, storage, caching, distribution) • Novelty and diversity • Evaluation • Relevance feedback • Learning to rank • User models – Context, personalization • Sponsored search • Temporal aspects • Social aspects 199
  • 200. 200

Editor's Notes

  • #11: Not only the data is different; so are the queries, and the results we get from them!
  • #13: To the surprise of many, the search box has become the preferred method of information access. Customers ask: Why can't I search my database in the same way?
  • #17: Archie is a tool for indexing FTP archives, allowing people to find specific files. It is considered to be the first Internet search engine. In the summer of 1993, no search engine existed for the web, just catalogs. One of the first "all text" crawler-based search engines was WebCrawler, which came out in 1994. Unlike its predecessors, it allowed users to search for any word in any webpage, which has become the standard for all major search engines since. It was also the first one widely known by the public. Also in 1994, Lycos (which started at Carnegie Mellon University) was launched and became a major commercial endeavor.
  • #18: In 1996, Netscape was looking to give a single search engine an exclusive deal as the featured search engine on Netscape's web browser. There was so much interest that instead Netscape struck deals with five of the major search engines: for $5 million a year, each search engine would be in rotation on the Netscape search engine page. The five engines were Yahoo!, Magellan, Lycos, Infoseek, and Excite.[7][8] Google adopted the idea of selling search terms in 1998 from a small search engine company named goto.com. This move had a significant effect on the SE business, which went from struggling to one of the most profitable businesses on the internet.[6]
  • #20: Aardvark was a social search service that connected users live with friends or friends-of-friends who were able to answer their questions, also known as a knowledge market. Bought by Google in 2010. Kaltix Corp., commonly known as Kaltix, is a personalized search engine company founded at Stanford University in June 2003 by Sep Kamvar, Taher Haveliwala and Glen Jeh.[1][2] It was acquired by Google in September 2003.
  • #44: How do we communicate with search engines?
  • #45: Information needs must be expressed as a query – But users often don't know what they want. ASK hypothesis: Belkin et al. (1982) proposed a model called Anomalous State of Knowledge. The ASK hypothesis: it is difficult for people to define exactly what their information need is, because that information is a gap in their knowledge. Search engines should therefore look for information that fills those gaps. Interesting ideas, little practical impact (yet)
  • #49: Under-specified, ambiguous, context-sensitive queries represent different types of search – e.g. decision making – background search – fact search
  • #50: Need to have fairly deep knowledge... – What sites are possible – What's in a given site (what's likely to be there) – Authority of source / site – Index structure (time, place, person, ...): what kinds of searches? – How to read a SERP critically
  • #51: Commonplace book
  • #52: Start with the simplest search you can think of: [ upper lip indentation ]. If it's not right, you can always modify it. • When I did this, I clicked on the first result, which took me to Yahoo Answers. There's a nice article there about something called the philtrum.
  • #53: Ghost town vs abandoned 1750 Search for images with creative commons attributions
  • #59: The need is verbalized mentally
  • #60: Queries and documents must share a representation (at least comparable, if not the same)
  • #64: SCC – strongly connected component; IN – pages not discovered yet; OUT – sites that contain only in-host links; Tendrils – can't reach or be reached from the SCC
  • #74: creation of indefinitely deep directory structures like http://foo.com/bar/foo/bar/foo/bar/foo/bar/..... dynamic pages like calendars that produce an infinite number of pages for a web crawler to follow. pages filled with a large number of characters, crashing the lexical analyzer parsing the page. pages with session-ids based on required cookies.
  • #80: Data: ; this type of data is conventionally dealt with by a database management system. Structure: with this view, documents are not treated as flat entities, so a document and its components (e.g. sections) can be retrieved
  • #83: How do we arrive at the content representation of a document?
  • #85: Nontrivial issues. Requires some design decisions.
  • #86: Nontrivial issues. Requires some design decisions. Matches are then more likely to be relevant, and since the documents are smaller it will be much easier for the user to find the relevant passages in the document. But why stop there? We could treat individual sentences as mini-documents. It becomes clear that there is a precision/recall tradeoff here. If the units get too small, we are likely to miss important passages because terms were distributed over several mini-documents, while if units are too large we tend to get spurious matches and the relevant information is hard for the user to find. The problems with large document units can be alleviated by use of explicit or implicit proximity search
  • #88: A simple strategy is to just split on all non-alphanumeric characters – bad: you always want to do the exact same tokenization of document and query words, generally by processing queries with the same tokenizer. Conceptually, splitting on white space can also split what should be regarded as a single token. This occurs most commonly with names (San Francisco, Los Angeles) but also with borrowed foreign phrases (au fait)
  • #89: Index numbers -> (One answer is using n-grams: IIR ch. 3)
  • #90: Methods of word segmentation vary from having a large vocabulary and taking the longest vocabulary match with some heuristics for unknown words to the use of machine learning sequence models, such as hidden Markov models or conditional random fields, trained over hand-segmented words
  • #91: No unique tokenization + completely different interpretation of a sequence depending on where you split
  • #93: Nevertheless: "Google ignores common words and characters such as where, the, how, and other digits and letters which slow down your search without improving the results." (Though you can explicitly ask for them to remain.)
  • #94: Token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens. The most standard way to normalize is to implicitly create equivalence classes, which are normally named after one member of the set. For instance, if the tokens anti-discriminatory and antidiscriminatory are both mapped onto the term antidiscriminatory, in both the document text and queries, then searches for one term will retrieve documents that contain either. The advantage of just using mapping rules that remove characters like hyphens is that the equivalence classing to be done is implicit, rather than being fully calculated in advance: the terms that happen to become identical as the result of these rules are the equivalence classes. It is only easy to write rules of this sort that remove characters. Since the equivalence classes are implicit, it is not obvious when you might want to add characters. For instance, it would be hard to know to turn antidiscriminatory into anti-discriminatory.
  • #95: An alternative to creating equivalence classes is to maintain relations between unnormalized tokens. This method can be extended to hand-constructed lists of synonyms such as car and automobile, a topic we discuss further in
  • #96: Too much equivalence classing
  • #97: Why not the reverse?
  • #101: There are also stemmers based on n-grams. For example trigrams: information => {inf, nfo, for, etc}
  • #104: caresses, parties; separational -> separate; factional -> faction
  • #111: Compression Cache pressure
  • #121: The distribution of term frequencies is similar for different texts of significant large size.
  • #122: Heaps' law gives the vocabulary size in collections.
  • #133: Positional indexes are helpful, but we'll ignore them for now
  • #147: (Salton & McGill 1983)
  • #153: The classifier that assigns a vector x to the class with the highest posterior is called the Bayes classifier. The error associated with this classifier is called the Bayes error. This is the lowest possible error rate for any classifier over the distribution of all examples and for a chosen hypothesis space
  • #154: A complete probability distribution over documents − defines likelihood for any possible document d (observation) − P(relevant) via P(document): P(R|d) ∝ P(d|R) P(R) − can "generate" synthetic documents that will share some properties of the original collection. Not all IR models do this – it is possible to estimate P(R|d) directly – logistic regression. Assumptions: one relevance value for every word w. Words are conditionally independent given R – false, but it lowers the number of parameters. All words absent are equally likely to be observed in relevant and non-relevant classes
  • #156: One relevance status value per word empty document (all words absent) is equally likely to be observed in relevant and non-relevant classes (provides a natural zero) - practical reason, only score terms that appear in the query (TAT)
  • #159: Doesn't model word dependence. Doesn't account for document length. Doesn't model word frequencies
  • #160: Now D_t = d_t account for the number of times we observe the term in the document (we have a vector of frequencies)
  • #165: Can be seen as probabilistic automata. They originate from probabilistic models of language generation developed for automatic speech recognition systems in the early 1980's (see e.g. Rabiner 1990). Automatic speech recognition systems combine probabilities of two distinct models: the acoustic model, and the language model. The acoustic model might for instance produce the following candidate texts in decreasing order of probability: "food born thing", "good corn sing", "mood morning", and "good morning". Now, the language model would determine that the phrase "good morning" is much more probable, i.e., it occurs more frequently in English than the other phrases. When combined with the acoustic model, the system is able to decide that "good morning" was the most likely utterance, thereby increasing the system's performance. For information retrieval, language models are built for each document. By following this approach, the language model of the book you are reading now would assign an exceptionally high probability to the word "retrieval", indicating that this book would be a good candidate for retrieval if the query contains this word.
  • #166: For some applications we want all this highly probable P3 In IR P1=P2
  • #169: Veto terms. Originally multiple Bernoulli; the multinomial is widely used now – accounts for multiple word occurrences in the query (primitive) – well understood: lots of research in related fields (and now in IR) – possibility for integration with ASR/MT/NLP (same event space)
  • #171: Discounting methods. Problem with all discounting methods: – discounting treats unseen words equally (add or subtract ε) – some words are more frequent than others. Essentially, the data model and retrieval function are one and the same
  • #173: Different ways of smoothing; Dirichlet prior smoothing is particularly popular