Information Retrieval Basics
Kira Radinsky
Many of the following slides are courtesy of Ronny Lempel (Yahoo!),
Aya Soffer and David Carmel (IBM)
19 October 2010 CS236620 Search Engine Technology 2
Indexing/Retrieval – Basics
• 2 Main Stages
– Indexing process - involves pre-processing and storing of
information into a repository – an index
– Retrieval/runtime process - involves issuing a query, accessing the
index to find documents relevant to the query
• Basic Concepts:
– Document – any piece of information (book, article, database
record, Web page, image, video, song)
• usually textual data
– Query – some text representing the user’s information need
– Relevance – a binary relation (a predicate) between documents and
queries R(d,q)
• Obviously a simplification of subjective quality with many shades of gray
Defining Document Relevance to a Query
• Although Relevance is the basic concept of IR, it lacks a precise
definition
• Difficulties:
– User/intent dependent
– Time/place dependent
– In practice, depends on what other documents are deemed
relevant
• And even on what other documents were retrieved for the
same query
• Simplified assumption:
– D - the set of all documents in the collection (corpus),
– Q - the set of all possible queries,
– R: D × Q → {0,1} is well defined
– d is “relevant” to q iff R(d,q) = 1
• Some models try to measure the probability of relevance: Pr(R|d,q)
Index building: Text Profiling
• Documents/Queries are profiled to generate a canonical representation
• The profile is usually based on the set of indexing units (terms) in the text
• Indexing units (terms) are generally representative words in the text
– How to select representative units?
– For the moment, let’s take all the words in the given document/query
Example text: "In the beginning God created the heaven and the earth. And the earth was without form and void."
Profile:
term:   and  beginning  created  earth  form  god  heaven  in  the  void  was  without
count:  3    1          1        2      1     1    1       1   4    1     1    1
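The profile above can be reproduced with a few lines of Python. This is only a minimal sketch: the `profile` helper and its lowercase, letters-only tokenization are assumptions made for illustration, not something the slides prescribe.

```python
import re
from collections import Counter


def profile(text: str) -> Counter:
    """Return a term -> occurrence-count profile, taking every word as a term."""
    # Assumption: lowercase the text and treat each run of letters as one term.
    terms = re.findall(r"[a-z]+", text.lower())
    return Counter(terms)


text = ("In the beginning God created the heaven and the earth. "
        "And the earth was without form and void.")
counts = profile(text)

# Print the profile in alphabetical term order, as on the slide.
for term in sorted(counts):
    print(term, counts[term])
```

Queries can be profiled with the same function, which keeps documents and queries in one canonical representation.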
More Formally
Given a collection of documents (a corpus)
• All the terms in the collection can be labeled t1, t2, …,tN
• The profile of document dj is an N-dimensional vector,
dj → (w1j, w2j, …, wNj)
– where
• wij = 0 if ti does not appear in dj
• wij > 0 otherwise
• The N-dimensional vector space is conceptual – implementations will
not actually manipulate such large vectors
Index Representation as a (Sparse) Matrix
A(t,d)   d1    d2    …    dM
t1       w11   w12   …    w1M
t2       …
…
tN       wN1   …          wNM
Most entries are zero – certainly for large corpora
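Because most entries are zero, one common way to hold such a matrix is to store only the non-zero entries of each row. The sketch below assumes a hypothetical three-document corpus, whitespace tokenization, and raw occurrence counts as weights; `build_index` is an illustrative name, not an API from the slides.

```python
from collections import Counter, defaultdict


def build_index(docs: dict[str, str]) -> dict[str, dict[str, int]]:
    """Map each term to {doc_id: weight}, keeping only non-zero entries of A(t,d)."""
    index: dict[str, dict[str, int]] = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(text.lower().split()).items():
            index[term][doc_id] = count  # assumption: raw count used as the weight
    return dict(index)


docs = {  # hypothetical toy corpus
    "d1": "star star galaxy",
    "d2": "movie star",
    "d3": "diet",
}
index = build_index(docs)
print(index["star"])  # {'d1': 2, 'd2': 1}

# Only 5 of the 4 x 3 = 12 matrix cells are actually stored.
stored = sum(len(row) for row in index.values())
print(stored)  # 5
```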
Using the Index for Relevance Retrieval
• Assumption: a document not containing any query term is
not relevant
• Given a simple query of one term q={ti}
• Use the index for retrieval:
– 1. Retrieve all documents dj with wij > 0
– 2. Sort them in decreasing order of wij (in some models)
– 3. Return the (ranked) list of “relevant” documents to the user
• In general: given a user’s query q={t1…tk}:
– Disjunctive queries: return a (ranked) list of documents containing
at least one of the query terms
– Conjunctive queries: return a (ranked) list of documents containing
all of the query terms
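The two query semantics map directly onto set union and set intersection over per-term document sets. The following is a rough sketch without ranking; the `retrieve` function and the posting data are hypothetical and only illustrate the idea.

```python
def retrieve(postings: dict[str, set[str]], query: list[str], conjunctive: bool) -> set[str]:
    """Return ids of documents containing all (conjunctive) or any (disjunctive) query terms."""
    # A term missing from the index contributes an empty document set.
    lists = [postings.get(term, set()) for term in query]
    if not lists:
        return set()
    if conjunctive:
        return set.intersection(*lists)
    return set.union(*lists)


postings = {  # hypothetical data: term -> documents with a non-zero weight
    "java": {"d1", "d2", "d4"},
    "compilers": {"d2", "d3", "d4"},
}
any_docs = retrieve(postings, ["java", "compilers"], conjunctive=False)
all_docs = retrieve(postings, ["java", "compilers"], conjunctive=True)
print(sorted(any_docs))  # ['d1', 'd2', 'd3', 'd4']
print(sorted(all_docs))  # ['d2', 'd4']
```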
The Boolean Model
• Simple model based on Set Theory
• Queries are specified as Boolean expressions
– indexing units are words
– Boolean operators: OR, AND, NOT
– Example:
q =“java” AND “compilers” AND (“unix” OR “linux”)
• Relevance: A document is relevant to the query if it
satisfies the query Boolean expression
Boolean Model – Example
A(t,d)  d1  d2  d3  d4  d5
a       1   1   1   0   1
b       0   1   0   1   1
c       0   0   1   0   1
q = a AND (b OR (NOT c))
Search Using an Inverted Index
Posting lists:
• a → d1, d2, d3, d5
• b → d2, d4, d5
• c → d3, d5
Bit vectors over (d1, …, d5):
• a → 1,1,1,0,1
• b → 0,1,0,1,1
• NOT c → 1,1,0,1,0
• b OR (NOT c) → 1,1,0,1,1
• a AND (b OR (NOT c)) → 1,1,0,0,1
q = a AND (b OR (NOT c))
Results: d1, d2, d5
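The same evaluation can be traced in code. This sketch uses plain Python lists as bit vectors, with the incidence rows for a, b and c taken from the example; the helper names `b_not`, `b_or` and `b_and` are made up for illustration, and a real engine would more likely use packed bitmaps.

```python
DOCS = ["d1", "d2", "d3", "d4", "d5"]

# Incidence rows from the example, one bit per document (d1 first).
a = [1, 1, 1, 0, 1]
b = [0, 1, 0, 1, 1]
c = [0, 0, 1, 0, 1]


def b_not(x):
    """Flip every bit."""
    return [1 - v for v in x]


def b_or(x, y):
    """Bitwise OR of two equal-length vectors."""
    return [u | v for u, v in zip(x, y)]


def b_and(x, y):
    """Bitwise AND of two equal-length vectors."""
    return [u & v for u, v in zip(x, y)]


# q = a AND (b OR (NOT c))
result = b_and(a, b_or(b, b_not(c)))
matches = [doc for doc, bit in zip(DOCS, result) if bit]
print(result)   # [1, 1, 0, 0, 1]
print(matches)  # ['d1', 'd2', 'd5']
```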
Boolean Model – Pros & Cons
• Pros:
– Fast (bitmap vector operations)
– Binary decision (doc is “relevant” or not)
– Some extensions are easy (e.g. synonym support)
• Cons:
– Binary decision - what about ranking?
– Who speaks Boolean?
Vector Space Model
• Documents are represented as vectors in a (huge) N-dimensional space
– N is the number of terms in the corpus, i.e. size of the
lexicon/dictionary
• Query is a document like any other document
• Relevance – measured by similarity:
– A document is relevant to the query if its representative vector
is similar to the query’s representative vector
Documents as Vectors
[Figure: three documents plotted as vectors against the axes "Star" and "Diet": a doc about astronomy, a doc about movie stars, and a doc about mammal behavior]
Vector-space Model
• “Relevance” is measured by similarity - the cosine of the
angle between doc-vectors and the query vector
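• Need to represent the query as a vector in the same vector space as the documents
\[
\mathrm{Sim}(q,d) = \frac{q \cdot d}{\|q\|\,\|d\|} = \frac{\sum_i w_{iq}\, w_{id}}{\sqrt{\sum_i w_{iq}^2}\,\sqrt{\sum_i w_{id}^2}}
\]
[Figure: query vector q and document vectors d1 and d2 drawn against term axes i and j; α1 marks the angle between q and a document vector]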
Example
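[Figure: Q, D1 and D2 plotted against the axes Term A and Term B; α1 and α2 mark the angles between Q and the document vectors]
Q = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)
\[
\mathrm{sim}(q, d_2) = \frac{(0.4 \times 0.2) + (0.8 \times 0.7)}{\sqrt{[(0.4)^2 + (0.8)^2]\,[(0.2)^2 + (0.7)^2]}} = \frac{0.64}{\sqrt{0.424}} \approx 0.98
\]
\[
\mathrm{sim}(q, d_1) = \frac{(0.4 \times 0.8) + (0.8 \times 0.3)}{\sqrt{[(0.4)^2 + (0.8)^2]\,[(0.8)^2 + (0.3)^2]}} = \frac{0.56}{\sqrt{0.584}} \approx 0.73
\]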
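The example can be checked numerically with a short script. This is a sketch only: the `cosine` helper is an illustrative name, and returning 0.0 for an all-zero vector is an assumption added here, since the slides do not discuss that case.

```python
import math


def cosine(q, d):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(q, d))
    norm_q = math.sqrt(sum(x * x for x in q))
    norm_d = math.sqrt(sum(y * y for y in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_q * norm_d)


Q = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)

sim_d1 = cosine(Q, D1)
sim_d2 = cosine(Q, D2)
print(round(sim_d1, 3))
print(round(sim_d2, 3))
```

Without rounding intermediate values, sim(Q, D1) comes out at about 0.733 and sim(Q, D2) at about 0.983, so D2 ranks above D1 for this query.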
How to Determine the w(t,d) Weights?
• Binary weights:
– wi,j= 1 iff document dj contains term ti, otherwise 0.
– (e.g. the Boolean model)
• Term frequency (tf):
– wi,j= (number of occurrences of ti in dj)
• What about term importance?
– E.g. q=“galaxy in space”.
– Should an occurrence of the query term “in” in a document
contribute the same as an occurrence of the query term “galaxy”?
Determining the w(t,d) Weights (cont)
tf x idf weighting scheme (Salton 73)
• tf – a monotonic function of the term frequency in the
document,
– e.g. tf(t,d)= log(freq(t,d) + 1)
• idf – the inverse document frequency of a term – a
decreasing function of Nt, the number of documents containing the term
– e.g. idf(t) = log(N / Nt), defined for terms appearing at least once,
where N = #documents and Nt = #documents containing t
– Intuition: query terms that are rare in the corpus better
distinguish the relevant documents from the irrelevant ones
– wi,j = tf(ti,dj) * idf(ti)
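A small sketch of this weighting scheme follows, using the example functions tf = log(freq + 1) and idf = log(N / Nt). The three-document corpus is hypothetical, tokenization is plain whitespace splitting, and the natural logarithm is an assumption, since the slides do not fix the base.

```python
import math
from collections import Counter


def tf(freq: int) -> float:
    """Dampened term frequency: log(freq + 1)."""
    return math.log(freq + 1)


def idf(n_docs: int, n_docs_with_term: int) -> float:
    """Inverse document frequency: log(N / Nt), for Nt >= 1."""
    return math.log(n_docs / n_docs_with_term)


docs = {  # hypothetical toy corpus
    "d1": "galaxy in space",
    "d2": "lost in space",
    "d3": "in the kitchen",
}
counts = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}
# Nt: in how many documents each term occurs at least once.
doc_freq = Counter(term for c in counts.values() for term in c)

weights = {
    doc_id: {t: tf(f) * idf(len(docs), doc_freq[t]) for t, f in c.items()}
    for doc_id, c in counts.items()
}
for term, w in sorted(weights["d1"].items()):
    print(term, round(w, 3))
```

In this toy corpus "in" occurs in every document, so its idf and therefore its weight are zero, while the rarer "galaxy" receives a larger weight than "space". This matches the intuition behind the "galaxy in space" query.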
Vector Space Pros & Cons
• Pros
– Term weighting scheme improves retrieval effectiveness
– Allows for approximate query matching
– Cosine similarity is a good ranking measure
– Simple and elegant, with a solid mathematical foundation
• Cons
– Terms are not really orthogonal dimensions due to strong term
relationships and dependencies
– Ranking does not guarantee multiple term containment
• Default semantics of search engines is “AND” for multi-term queries
(conjunction queries)
– Term weighting schemes are sometimes difficult to maintain in
incremental settings, e.g. idf values and document norms
change frequently
Practical Considerations
• Document length approximations
• Incorporating the proximity of query term occurrences
in result documents into the formulae
• Stop-word elimination
– Stop-word examples: and, the, or, of, in, a, an, to, …
• Linguistic processing of terms (stemming, lemmatization,
synonym expansion, compounds) and their effects on
recall/precision
Tutorial 1 (information retrieval basics)

  • 1. Information Retrieval Basics Kira Radinsky Many of the following slides are courtesy of Ronny Lempel (Yahoo!), Aya Soffer and David Carmel (IBM)
  • 2. 19 October 2010, CS236620 Search Engine Technology. Indexing/Retrieval – Basics
      • 2 Main Stages
          – Indexing process: involves pre-processing and storing of information into a repository (an index)
          – Retrieval/runtime process: involves issuing a query and accessing the index to find documents relevant to the query
      • Basic Concepts:
          – Document: any piece of information (book, article, database record, Web page, image, video, song)
              • usually textual data
          – Query: some text representing the user’s information need
          – Relevance: a binary relation (a predicate) between documents and queries, R(d,q)
              • Obviously a simplification of subjective quality with many shades of gray
  • 3. Defining Document Relevance to a Query
      • Although Relevance is the basic concept of IR, it lacks a precise definition
      • Difficulties:
          – User/intent dependent
          – Time/place dependent
          – In practice, depends on what other documents are deemed relevant
              • And even on what other documents were retrieved for the same query
      • Simplified assumption:
          – D: the set of all documents in the collection (corpus)
          – Q: the set of all possible queries
          – R: D × Q → {0,1} is well defined
          – d is “relevant” to q iff R(d,q) = 1
      • Some models try to measure the probability of relevance: Pr(R|d,q)
  • 4. Index building: Text Profiling
      • Documents/Queries are profiled to generate a canonical representation
      • The profile is usually based on the set of indexing units (terms) in the text
      • Indexing units (terms) are generally representative words in the text
          – How to select representative units?
          – For the moment, let’s take all the words in the given document/query
      • Example text: “In the beginning God created the heaven and the earth. And the earth was without form and void.”
      • Resulting profile (term: number of occurrences):
          …, and: 3, beginning: 1, created: 1, earth: 2, form: 1, god: 1, heaven: 1, in: 1, the: 4, void: 1, was: 1, without: 1
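The profiling step on slide 4 can be sketched in a few lines. This is only an illustration under stated assumptions: the slides do not say how text is tokenized, so the sketch assumes case folding and treats every maximal run of letters as a term, and `profile` is a hypothetical helper name.

```python
import re
from collections import Counter

def profile(text):
    """Return a term -> occurrence-count profile of the text."""
    # Assumption: terms are maximal runs of letters, compared case-insensitively.
    terms = re.findall(r"[a-z]+", text.lower())
    return Counter(terms)

text = ("In the beginning God created the heaven and the earth. "
        "And the earth was without form and void.")
doc_profile = profile(text)

# Print the profile in alphabetical order, as on the slide.
for term in sorted(doc_profile):
    print(term, doc_profile[term])
```

Taking every word as a term keeps the example close to the slide; a real indexer would likely apply further normalization.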
  • 5. More Formally
      • Given a collection of documents (a corpus), all the terms in the collection can be labeled t1, t2, …, tN
      • The profile of document dj is an N-dimensional vector, dj → (w1j, w2j, …, wNj), where
          – wij = 0 if ti does not appear in dj
          – wij > 0 otherwise
      • The N-dimensional vector space is conceptual: implementations will not actually manipulate such large vectors
  • 6. Index Representation as a (Sparse) Matrix

      A(t,d)   d1    d2    …    dM
      t1       w11   w12   …    w1M
      t2       …
      …
      tN       wN1   …          wNM

      • Most entries are zero, certainly for large corpora
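One way to avoid materializing the full N x M matrix is to store only its nonzero entries. The sketch below is an assumption-laden illustration, not the course's implementation: it uses whitespace tokenization, raw occurrence counts as the weights wij, and hypothetical names (`build_sparse_matrix`, `weight`) and toy documents.

```python
from collections import Counter, defaultdict

def build_sparse_matrix(docs):
    """Map term -> {doc_id: weight}, storing only the nonzero entries of A(t,d)."""
    matrix = defaultdict(dict)
    for doc_id, text in docs.items():
        # Assumption: whitespace tokenization, raw counts as the weights w_ij.
        for term, count in Counter(text.lower().split()).items():
            matrix[term][doc_id] = count
    return dict(matrix)

docs = {
    "d1": "the earth and the heaven",
    "d2": "the earth was void",
    "d3": "form and void",
}
A = build_sparse_matrix(docs)

def weight(term, doc_id):
    """Look up w_ij; entries that were never stored are implicit zeros."""
    return A.get(term, {}).get(doc_id, 0)

stored = sum(len(row) for row in A.values())
dense = len(A) * len(docs)
print(f"stored {stored} of {dense} entries")
```

Each row of this structure is the list of documents containing one term, which is the inverted-index view of the same matrix.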
  • 7. Using the Index for Relevance Retrieval
      • Assumption: a document not containing any query term is not relevant
      • Given a simple query of one term, q = {ti}, use the index for retrieval:
          – 1. Retrieve all documents dj with wij > 0
          – 2. Sort them in decreasing order of weight wij (in some models)
          – 3. Return the (ranked) list of “relevant” documents to the user
      • In general, given a user’s query q = {t1…tk}:
          – Disjunctive queries: return a (ranked) list of documents containing at least one of the query terms
          – Conjunctive queries: return a (ranked) list of documents containing all of the query terms
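The retrieval steps on slide 7 can be sketched as follows. The scoring rule (sum of the matching query terms' weights) is an assumption made for illustration, since the slide leaves ranking to the model; `retrieve`, the toy index, and its weights are likewise hypothetical.

```python
def retrieve(index, query_terms, mode="or"):
    """Return doc ids ranked by the summed weights of the matching query terms.

    index: term -> {doc_id: weight}
    mode:  "or" for disjunctive queries, "and" for conjunctive queries
    """
    postings = [index.get(t, {}) for t in query_terms]
    if not postings:
        return []
    doc_sets = [set(p) for p in postings]
    if mode == "and":
        candidates = set.intersection(*doc_sets)
    else:
        candidates = set.union(*doc_sets)
    # Assumption: score = sum of the weights of the query terms in the document.
    scores = {d: sum(p.get(d, 0) for p in postings) for d in candidates}
    # Decreasing score, with doc id as a deterministic tie-breaker.
    return sorted(scores, key=lambda d: (-scores[d], d))

index = {
    "java": {"d1": 3, "d2": 1},
    "compilers": {"d2": 2, "d3": 1},
}
print(retrieve(index, ["java"]))
print(retrieve(index, ["java", "compilers"], mode="or"))
print(retrieve(index, ["java", "compilers"], mode="and"))
```

Documents that contain none of the query terms never enter the candidate set, which matches the slide's assumption that such documents are not relevant.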
  • 8. The Boolean Model
      • Simple model based on Set Theory
      • Queries are specified as Boolean expressions
          – Indexing units are words
          – Boolean operators: OR, AND, NOT
          – Example: q = “java” AND “compilers” AND (“unix” OR “linux”)
      • Relevance: a document is relevant to the query if it satisfies the query’s Boolean expression
  • 9. Boolean Model – Example

      A(t,d)   d1   d2   d3   d4   d5
      a        1    1    1    0    1
      b        0    1    0    1    1
      c        0    0    1    0    1

      q = a AND (b OR (NOT c))
  • 10. Search Using an Inverted Index
      • Posting lists:
          – a → d1, d2, d3, d5
          – b → d2, d4, d5
          – c → d3, d5
      • As bit vectors over (d1, …, d5):
          – a → 1,1,1,0,1
          – b → 0,1,0,1,1
          – NOT c → 1,1,0,1,0
      • Evaluating q = a AND (b OR (NOT c)):
          – b OR (NOT c) → 1,1,0,1,1
          – a AND (b OR (NOT c)) → 1,1,0,0,1
      • Results: d1, d2, d5
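The evaluation on slide 10 can be reproduced with integer bitmaps. This is a minimal sketch using the slide's own posting lists; packing the bit vectors into Python integers, and the names `bitmap` and `ALL`, are choices made here for illustration.

```python
DOCS = ["d1", "d2", "d3", "d4", "d5"]
POSTINGS = {
    "a": ["d1", "d2", "d3", "d5"],
    "b": ["d2", "d4", "d5"],
    "c": ["d3", "d5"],
}

def bitmap(term):
    """Encode a posting list as an int; bit i is set iff DOCS[i] contains the term."""
    bits = 0
    for doc in POSTINGS[term]:
        bits |= 1 << DOCS.index(doc)
    return bits

# Mask with one bit per document, so that NOT stays within the collection.
ALL = (1 << len(DOCS)) - 1

a, b, c = bitmap("a"), bitmap("b"), bitmap("c")
result_bits = a & (b | (~c & ALL))  # q = a AND (b OR (NOT c))
results = [doc for i, doc in enumerate(DOCS) if (result_bits >> i) & 1]
print(results)
```

The mask matters because Python integers are unbounded: without it, `~c` would be a negative number rather than a five-bit complement.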
  • 11. Boolean Model – Pros & Cons
      • Pros:
          – Fast (bitmap vector operations)
          – Binary decision (doc is “relevant” or not)
          – Some extensions are easy (e.g. synonym support)
      • Cons:
          – Binary decision: what about ranking?
          – Who speaks Boolean?
  • 12. Vector Space Model
      • Documents are represented as vectors in a (huge) N-dimensional space
          – N is the number of terms in the corpus, i.e. the size of the lexicon/dictionary
      • Query is a document like any other document
      • Relevance is measured by similarity:
          – A document is relevant to the query if its representative vector is similar to the query’s representative vector
  • 13. Documents as Vectors
      [Figure: documents drawn as vectors against the term axes “Star” and “Diet”: a doc about astronomy, a doc about movie stars, and a doc about mammal behavior]
  • 14. Vector-space Model
      • “Relevance” is measured by similarity: the cosine of the angle between doc-vectors and the query vector
      • Need to represent the query as a vector in the same vector-space as the documents

      $$Sim(q,d) = \frac{q \cdot d}{|q|\,|d|} = \frac{\sum_i w_{iq}\, w_{id}}{\sqrt{\sum_i w_{iq}^2}\;\sqrt{\sum_i w_{id}^2}}$$

      [Figure: vectors d1, q and d2 drawn on axes i and j, with the angle α1 marked]
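  • 15. Example
      • Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7), with Term A and Term B as the two axes

      $$Sim(q,d) = \frac{q \cdot d}{|q|\,|d|} = \frac{\sum_i w_{iq}\, w_{id}}{\sqrt{\sum_i w_{iq}^2}\;\sqrt{\sum_i w_{id}^2}}$$

      $$sim(q,d_2) = \frac{(0.4 \times 0.2) + (0.8 \times 0.7)}{\sqrt{[(0.4)^2 + (0.8)^2]\,[(0.2)^2 + (0.7)^2]}} = \frac{0.64}{\sqrt{0.42}} \approx 0.98$$

      $$sim(q,d_1) = \frac{(0.4 \times 0.8) + (0.8 \times 0.3)}{\sqrt{[(0.4)^2 + (0.8)^2]\,[(0.8)^2 + (0.3)^2]}} = \frac{0.56}{\sqrt{0.58}} \approx 0.73$$

      [Figure: Q, D1 and D2 plotted against Term A and Term B, with the angles α1 and α2 marked]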
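The example on slide 15 can be checked numerically. The sketch below uses the slide's three vectors; the function name `cosine_similarity` and the handling of all-zero vectors are assumptions made here, not part of the slides.

```python
import math

def cosine_similarity(q, d):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(wq * wd for wq, wd in zip(q, d))
    norm_q = math.sqrt(sum(w * w for w in q))
    norm_d = math.sqrt(sum(w * w for w in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_q * norm_d)

Q = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)
print(round(cosine_similarity(Q, D1), 2))
print(round(cosine_similarity(Q, D2), 2))
```

D2 scores higher than D1 because its direction is closer to the query's, even though D1 is the longer vector.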
  • 16. How to Determine the w(t,d) Weights?
      • Binary weights:
          – wi,j = 1 iff document dj contains term ti, otherwise 0
          – (e.g. the Boolean model)
      • Term frequency (tf):
          – wi,j = (number of occurrences of ti in dj)
      • What about term importance?
          – E.g. q = “galaxy in space”
          – Should an occurrence of the query term “in” in a document contribute the same as an occurrence of the query term “galaxy”?
  • 17. Determining the w(t,d) Weights (cont.)
      • tf x idf weighting scheme (Salton 73)
      • tf: a monotonically increasing function of the term frequency in the document
          – e.g. tf(t,d) = log(freq(t,d) + 1)
      • idf: the inverse document frequency of a term, a decreasing function of the term’s document frequency Nt
          – e.g. idf(t) = log(N / Nt), for terms appearing at least once, where N is the number of documents and Nt is the number of documents containing t
          – Intuition: query terms that are rare in the corpus better distinguish the relevant documents from the irrelevant ones
      • wi,j = tf(ti,dj) * idf(ti)
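The tf x idf scheme can be sketched with the two example functions from the slide, tf(t,d) = log(freq(t,d) + 1) and idf(t) = log(N / Nt). The natural logarithm, whitespace tokenization, the toy corpus, and the name `tf_idf_weights` are assumptions made for this illustration.

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """Return doc_id -> {term: weight} with tf = log(freq + 1), idf = log(N / Nt)."""
    counts = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    n_docs = len(docs)
    doc_freq = Counter()  # Nt: number of documents containing term t
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())
    weights = {}
    for doc_id, term_counts in counts.items():
        weights[doc_id] = {
            term: math.log(freq + 1) * math.log(n_docs / doc_freq[term])
            for term, freq in term_counts.items()
        }
    return weights

docs = {
    "d1": "galaxy in space",
    "d2": "life in the city",
    "d3": "stars in a distant galaxy",
}
w = tf_idf_weights(docs)
print(w["d1"])
```

In this toy corpus “in” occurs in every document, so its idf is log(1) = 0 and it contributes nothing, while the rarer “space” outweighs “galaxy”.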
  • 18. Vector Space Pros & Cons
      • Pros
          – Term weighting scheme improves retrieval effectiveness
          – Allows for approximate query matching
          – Cosine similarity is a good ranking measure
          – Simple and elegant, with a solid mathematical foundation
      • Cons
          – Terms are not really orthogonal dimensions, due to strong term relationships and dependencies
          – Ranking does not guarantee multiple term containment
              • Default semantics of search engines is “AND” for multi-term queries (conjunctive queries)
          – Term weighting schemes are sometimes difficult to maintain in incremental settings, e.g. idf values and document norms frequently change
  • 19. Practical Considerations
      • Document length approximations
      • Incorporating proximity considerations of query term occurrences in result documents into the formulae
      • Stop-word elimination
          – Stop-word examples: and, the, or, of, in, a, an, to, …
      • Linguistic processing of terms (stemming, lemmatization, synonym expansion, compounds) and their effects on recall/precision
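Stop-word elimination, the simplest of these steps, can be sketched as a filter over the term list. The stop-word set below contains only the examples named on the slide, whose list is open-ended; `remove_stop_words` is a hypothetical helper name.

```python
# Only the stop words listed on the slide; the slide's list is open-ended.
STOP_WORDS = {"and", "the", "or", "of", "in", "a", "an", "to"}

def remove_stop_words(terms):
    """Drop stop words, keeping the remaining terms in their original order."""
    return [t for t in terms if t.lower() not in STOP_WORDS]

query = "galaxy in space".split()
print(remove_stop_words(query))
```

Applied at both indexing and query time, this keeps very frequent words such as “in” out of the index and out of the ranking.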