Chapter 4
IR Models

Outline
• Introduction to IR Models

Word evidence: Bag of Words
• IR systems usually adopt index terms to index and retrieve documents
• Each document is represented by a set of representative keywords or index terms (called a Bag of Words)
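To make the bag-of-words representation concrete, here is a minimal Python sketch (the function name and whitespace tokenization are illustrative assumptions, not part of the chapter):

from collections import Counter

def bag_of_words(text):
    """Reduce a document to its index terms and their raw counts,
    discarding word order (an illustrative sketch)."""
    return Counter(text.lower().split())

print(bag_of_words("Shipment of gold damaged in a fire"))
# Counter({'shipment': 1, 'of': 1, 'gold': 1, 'damaged': 1, 'in': 1, 'a': 1, 'fire': 1})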
Mapping Documents & Queries
• Represent both documents and queries as N-dimensional vectors in a term-document matrix, which shows the occurrence of terms in the document collection or query
• An entry in the matrix corresponds to the "weight" of a term in the document:
  d_j = (t_{1,j}, t_{2,j}, ..., t_{N,j});  q_k = (t_{1,k}, t_{2,k}, ..., t_{N,k})
  – The document collection is mapped to a term-by-document matrix
  – The documents are viewed as vectors in multidimensional space
• "Nearby" vectors are related
  – Normalize the weights for vector length, as usual, to avoid the effect of document length

       T1   T2   ...  TN
  D1   w11  w12  ...  w1N
  D2   w21  w22  ...  w2N
   :    :    :         :
  DM   wM1  wM2  ...  wMN

Weighting Terms in Vector Space
• The importance of the index terms is represented by weights associated with them
• Problem: what weight can we assign to show the importance of an index term for describing the document/query contents?
• Solution 1: Binary weights: t = 1 if the term is present, 0 otherwise
  – Similarity: number of terms in common
• Problem: not all terms are equally interesting
  – E.g. "the" vs. "dog" vs. "cat"
• Solution: replace binary weights with non-binary weights:
  d_j = (w_{1,j}, w_{2,j}, ..., w_{N,j});  q_k = (w_{1,k}, w_{2,k}, ..., w_{N,k})
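As a rough sketch of the mapping above, the following Python builds a binary term-by-document matrix for a small collection (function and variable names are assumptions for illustration):

def term_document_matrix(docs):
    """Map a document collection to a binary term-by-document matrix."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    matrix = [[1 if term in tokens else 0 for term in vocab] for tokens in tokenized]
    return vocab, matrix

docs = ["Shipment of gold damaged in a fire",
        "Delivery of silver arrived in a silver truck",
        "Shipment of gold arrived in a truck"]
vocab, M = term_document_matrix(docs)
print(vocab)          # term order of the columns
for row in M:         # one row vector per document
    print(row)

Replacing the 1s with the non-binary weights introduced later in this chapter turns the same structure into the weighted matrix used by the vector-space model.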
The Boolean Model: Example
• Given the following, determine the documents retrieved by a Boolean model based IR system
• Index terms: K1, ..., K8
• Documents:
  1. D1 = {K1, K2, K3, K4, K5}
  2. D2 = {K1, K2, K3, K4}
  3. D3 = {K2, K4, K6, K8}
  4. D4 = {K1, K3, K5, K7}
  5. D5 = {K4, K5, K6, K7, K8}
  6. D6 = {K1, K2, K3, K4}
• Query: K1 ∧ (K2 ∨ ¬K3)
• Answer: {D1, D2, D4, D6} ∩ ({D1, D2, D3, D6} ∪ {D3, D5}) = {D1, D2, D6}

The Boolean Model: Further Example
• Given the following three documents, construct the term-document matrix and find the relevant documents retrieved by the Boolean model for the given query
• D1: "Shipment of gold damaged in a fire"
• D2: "Delivery of silver arrived in a silver truck"
• D3: "Shipment of gold arrived in a truck"
• Query: gold ∧ (silver ∨ truck)

The table below shows the document-term (t_i) incidence matrix:

         arrive  damage  deliver  fire  gold  silver  ship  truck
  D1       0       1       0       1     1      0      1     0
  D2       1       0       1       0     0      1      0     1
  D3       1       0       0       0     1      0      1     1
  query    0       0       0       0     1      1      0     1

• Also find the relevant documents for the queries (see the sketch after this slide):
  (a) gold ∧ delivery
  (b) ship ∧ ¬gold
  (c) silver ∧ truck
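A minimal Python sketch of Boolean retrieval over the first example, treating documents as sets of index terms (the helper name `having` is an assumption for illustration):

docs = {
    "D1": {"K1", "K2", "K3", "K4", "K5"},
    "D2": {"K1", "K2", "K3", "K4"},
    "D3": {"K2", "K4", "K6", "K8"},
    "D4": {"K1", "K3", "K5", "K7"},
    "D5": {"K4", "K5", "K6", "K7", "K8"},
    "D6": {"K1", "K2", "K3", "K4"},
}

def having(term):
    """Set of document ids whose index terms include the given term."""
    return {d for d, terms in docs.items() if term in terms}

all_docs = set(docs)
# Query: K1 AND (K2 OR NOT K3) -> AND is intersection, OR is union, NOT is complement
answer = having("K1") & (having("K2") | (all_docs - having("K3")))
print(sorted(answer))  # ['D1', 'D2', 'D6'], matching the answer above

The same three set operations answer queries (a)-(c) of the further example.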
Exercise
Given the following three documents, each listed with four terms (Doc No, Term 1, Term 2, Term 3, Term 4), what are the relevant documents retrieved for the query:
((chaucer OR milton) AND (swift OR shakespeare))
Drawbacks of the Boolean Model
• Retrieval is based on a binary decision criterion, with no notion of partial matching
• No ranking of the documents is provided (absence of a grading scale)
• The information need has to be translated into a Boolean expression, which most users find awkward
• The Boolean queries formulated by users are most often too simplistic
• As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query
  – Just changing a Boolean operator from "AND" to "OR" changes the result from an intersection to a union

Vector-Space Model (VSM)
• This is the most commonly used strategy for measuring the relevance of documents to a given query. This is because:
  – Use of binary weights is too limiting, while non-binary weights provide consideration for partial matches
  – These term weights are used to compute a degree of similarity between a query and each document
  – A ranked set of documents provides better matching
• The idea behind VSM is that the meaning of a document is conveyed by the words used in that document
Computing Weights
• How do we compute the weights w_ij and w_iq for term i in document j and in query q?
• A good weight must take into account two effects:
  – Quantification of intra-document content (similarity)
    • The tf factor: the term frequency within a document
  – Quantification of inter-document separation (dissimilarity)
    • The idf factor: the inverse document frequency
• As a result, most IR systems use the tf*idf weighting technique:
  w_ij = tf(i,j) * idf(i)

Computing Weights (cont.)
Let
  N be the total number of documents in the collection,
  n_i be the number of documents which contain k_i,
  freq(i,j) be the raw frequency of k_i within d_j.
A normalized tf factor is given by
  f(i,j) = freq(i,j) / max_l freq(l,j)
where the maximum is computed over all terms which occur within the document d_j.
The idf factor is computed as
  idf(i) = log(N / n_i)
The log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term k_i.
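A short Python sketch of these two factors (the base-10 logarithm is an assumption here; it matches the worked example later in this chapter):

import math

def tf_idf(freq_ij, max_freq_j, N, n_i):
    """w_ij = f(i,j) * idf(i): normalized tf times inverse document frequency."""
    f = freq_ij / max_freq_j        # normalized tf factor
    idf = math.log10(N / n_i)       # idf factor; 0 for a term in every document
    return f * idf

# A term occurring twice in a document whose most frequent term occurs 3 times,
# and appearing in 1 of 3 documents:
print(tf_idf(freq_ij=2, max_freq_j=3, N=3, n_i=1))  # ≈ 0.3181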
Similarity Measure
[Figure: a document vector d_j and a query vector q in term space, separated by angle θ]
• Sim(q, d_j) = cos(θ):

  sim(d_j, q) = \frac{d_j \cdot q}{|d_j| \, |q|} = \frac{\sum_{i=1}^{n} w_{i,j} \, q_{i,k}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2} \, \sqrt{\sum_{i=1}^{n} q_{i,k}^2}}

• Since w_ij ≥ 0 and w_iq ≥ 0, 0 ≤ sim(q, d_j) ≤ 1
• A document is retrieved even if it matches the query terms only partially

Vector-Space Model: Example
Suppose we issue the query Q: "gold silver truck". The collection consists of the following three documents:
• D1: "Shipment of gold damaged in a fire"
• D2: "Delivery of silver arrived in a silver truck"
• D3: "Shipment of gold arrived in a truck"
Assume that all terms are used, including common terms and stop words, and that no terms are reduced to root terms (no stemming).
Show the retrieval results in ranked order.
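A direct Python transcription of the cosine formula above (a sketch; both vectors are assumed to share the same term order):

import math

def cosine_sim(d, q):
    """cos(theta) between a document weight vector and a query weight vector."""
    dot = sum(w_d * w_q for w_d, w_q in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    return dot / (norm_d * norm_q)

print(cosine_sim([1, 1, 0], [1, 0, 1]))  # 0.5: a partial match still scores > 0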
Vector-Space Model: Example (cont.)
• Compute similarity using cosine Sim(q, d)
• First, for each document and the query, compute all vector lengths (zero terms ignored). The weights are tf * idf with idf = log10(N/n_i); "silver" occurs twice in D2, hence its weight 0.954 = 2 * 0.477:

  |d1| = √(0.477² + 0.477² + 0.176² + 0.176²) = √0.5170 = 0.719
  |d2| = √(0.176² + 0.477² + 0.954² + 0.176²) = √1.2001 = 1.095
  |d3| = √(0.176² + 0.176² + 0.176² + 0.176²) = √0.1240 = 0.352
  |q|  = √(0.176² + 0.477² + 0.176²) = √0.2896 = 0.538

• Next, compute the dot products (zero products ignored):
  Q · d1 = 0.176*0.176 = 0.0310
  Q · d2 = 0.954*0.477 + 0.176*0.176 = 0.4862
  Q · d3 = 0.176*0.176 + 0.176*0.176 = 0.0620

• Now, compute the similarity scores:
  Sim(q,d1) = 0.0310 / (0.538*0.719) = 0.0801
  Sim(q,d2) = 0.4862 / (0.538*1.095) = 0.8246
  Sim(q,d3) = 0.0620 / (0.538*0.352) = 0.3271

• Finally, we sort and rank the documents in descending order of similarity score:
  Rank 1: Doc 2 = 0.8246
  Rank 2: Doc 3 = 0.3271
  Rank 3: Doc 1 = 0.0801

• Exercise: using normalized tf, rank the documents with the cosine similarity measure. Hint: normalize the tf of term i in document j by the maximum frequency of any term k in document j.
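The whole computation can be reproduced with a short Python script (a sketch under the same assumptions as above: raw tf times base-10 idf, no stop-word removal, no stemming; all names are illustrative):

import math
from collections import Counter

docs = ["Shipment of gold damaged in a fire",
        "Delivery of silver arrived in a silver truck",
        "Shipment of gold arrived in a truck"]
query = "gold silver truck"

tokenized = [d.lower().split() for d in docs]
vocab = sorted({t for tokens in tokenized for t in tokens})
N = len(docs)

def idf(term):
    n_i = sum(1 for tokens in tokenized if term in tokens)  # document frequency
    return math.log10(N / n_i)  # 0 for terms like "a", "in", "of" (in every doc)

def vectorize(tokens):
    tf = Counter(tokens)
    return [tf[t] * idf(t) for t in vocab]  # raw tf * idf, as in the slides

def cosine(d, q):
    dot = sum(x * y for x, y in zip(d, q))
    return dot / (math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in q)))

q_vec = vectorize(query.lower().split())
for name, tokens in zip(["D1", "D2", "D3"], tokenized):
    print(name, round(cosine(vectorize(tokens), q_vec), 4))
# D1 0.0801, D2 0.8247, D3 0.3272 -- matching the hand computation up to rounding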
Exercises
• Given the following documents, rank them according to their relevance to the query using the cosine similarity, Euclidean distance, and inner product measures (a sketch of the three measures follows the table):

  docID   words in document
  1       Taipei Taiwan
  2       Macao Taiwan Shanghai
  3       Japan Sapporo
  4       Sapporo Osaka Taiwan

  Query: Taiwan Sapporo
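A minimal sketch of the three measures, using raw term frequencies as weights (the exercise does not fix a weighting scheme, so that choice is an assumption):

import math

docs = {1: "Taipei Taiwan", 2: "Macao Taiwan Shanghai",
        3: "Japan Sapporo", 4: "Sapporo Osaka Taiwan"}
query = "Taiwan Sapporo"

vocab = sorted({t for text in [*docs.values(), query] for t in text.lower().split()})

def tf_vector(text):
    tokens = text.lower().split()
    return [tokens.count(t) for t in vocab]  # raw term frequency as weight

def cosine(d, q):
    dot = sum(x * y for x, y in zip(d, q))
    return dot / (math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in q)))

def euclidean(d, q):  # a distance: smaller means more relevant
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(d, q)))

def inner_product(d, q):
    return sum(x * y for x, y in zip(d, q))

q_vec = tf_vector(query)
for doc_id, text in docs.items():
    d = tf_vector(text)
    print(doc_id, round(cosine(d, q_vec), 3),
          round(euclidean(d, q_vec), 3), inner_product(d, q_vec))

Under this raw-tf weighting, document 4 scores best on all three measures, since it is the only document containing both query terms.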
Thank you