Text Data Mining
PART I - IR
Text Mining Applications
Information Retrieval
 Query-based search of large text archives, e.g., the Web
Text Classification
 Automated assignment of topics to Web pages, e.g.,
Yahoo, Google
 Automated classification of email into spam and non-
spam
Text Clustering
 Automated organization of search results in real-time into
categories
 Discovering clusters and trends in technical literature (e.g.,
CiteSeer)
Information Extraction
 Extracting standard fields from free-text
 Extracting names and places from reports, newspapers
Information Retrieval - Definition
 Information retrieval (IR) is finding material (usually
documents) of an unstructured nature (usually text) that
satisfies an information need from within large collections
(usually stored on computers).
 Information Retrieval
 Deals with the representation, storage, organization of, and
access to information items
– Modern Information Retrieval
 General Objective: Minimize the overhead of a user
locating needed information
Information Retrieval Is Not
Database
Information Retrieval
 Process stored
documents
 Search documents
relevant to user queries
 No standard for how queries should be
expressed
 Query results may tolerate errors or
inaccurate items
Database
 Normally no processing
of data
 Search records
matching queries
 Standard: SQL
language
 Query results should
have 100% accuracy.
Zero tolerance for errors
Information Retrieval Is Not
Data Mining
Information Retrieval
 User target: Existing
relevant data entries
Data Mining
 User target: Knowledge
(rules, etc.) implied by
data (not the individual
data entries themselves)
• Many techniques and models
are shared and related
• E.g. classification of
documents
Is Information Retrieval a Form of
Text Mining?
What is the principal computer specialty for processing
documents and text??
 Information Retrieval (IR)
 The task of IR is to retrieve relevant documents in
response to a query.
 The fundamental technique of IR is measuring
similarity
 A query is examined and transformed into a vector of
values to be compared with stored documents
Is Information Retrieval a Form of
Text Mining?
 In the prediction problem, similar documents are
retrieved and their properties measured, i.e., the class
labels are counted to see which label should be assigned to a
new document
 The objectives of prediction can be posed in the form
of an IR model where documents are retrieved that are
relevant to a query; here the query is a new document
Key Steps in Information Retrieval
1. Specify query
2. Search the document collection
3. Return a subset of relevant documents

Key Steps in Predictive Text Mining
1. Examine the document collection
2. Learn classification criteria
3. Apply the criteria to new documents
Predicting from Retrieved Documents
1. Specify query vector
2. Match against the document collection (the key steps in IR)
3. Get the subset of relevant documents
4. Examine document properties (simple criteria such as the
documents' labels)
Information Retrieval (IR)
 Conceptually, IR is the study of finding needed
information. I.e., IR helps users find information that
matches their information needs.
 Expressed as queries
 Historically, IR is about document retrieval, emphasizing
document as the basic unit.
 Finding documents relevant to user queries
 Technically, IR studies the acquisition, organization,
storage, retrieval, and distribution of information.
Information Retrieval Cycle
[Diagram: resource/source selection → query formulation → search → selection of results → examination of documents → delivery of information, with feedback loops for source reselection and for system, vocabulary, concept, and document discovery.]
Abstract IR Architecture
[Diagram: documents and the query each pass through a representation function, yielding a document representation (stored in an index, built offline) and a query representation (built online); a comparison function matches the two and returns hits.]
IR Architecture
IR Queries
 Keyword queries
 Boolean queries (using AND, OR, NOT)
 Phrase queries
 Proximity queries
 Full document queries
 Natural language questions
Information retrieval models
 An IR model governs how a document and a query are
represented and how the relevance of a document to a
user query is defined.
 Main models:
 Boolean model
 Vector space model
 Statistical language model
 etc
Elements in Information Retrieval
 Processing of documents
 Acceptance and processing of queries from
users
 Modelling, searching and ranking of
documents
 Presenting the search result
Process of Retrieving Information
Document Processing
 Removing stopwords (words that appear frequently but
carry little meaning, e.g., “the”, “of”)
 Stemming: recognize different words with the
same grammatical root
 Noun groups: common combination of words
 Indexing: for fast locating documents
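A minimal sketch of these preprocessing steps, assuming a toy stopword list and a naive suffix-stripping stemmer (a real system would use a proper stemmer such as Porter's and a fuller stopword list):

```python
# Minimal document-processing sketch: tokenization, stopword removal,
# and a naive suffix-stripping "stemmer" (illustrative only).
import re

STOPWORDS = {"the", "of", "a", "an", "and", "or", "in", "to", "is", "for"}

def stem(word):
    # Crude stemming: strip a few common suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())       # tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]     # remove stopwords
    return [stem(t) for t in tokens]                       # crude stemming

print(preprocess("Indexing documents for fast retrieval of information"))
# -> ['index', 'document', 'fast', 'retrieval', 'information']
```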
Processing Queries
 Define a “language” for queries
 Syntax, operators, etc.
 Modify the queries for better search
 Ignore meaningless parts: punctuations,
conjunctives, etc.
 Append synonyms
e.g., “e-business” ↔ “e-commerce”
 Emerging technology
 Natural language queries
Modelling/Ranking of Documents
 Model the relevance (usefulness) of documents
against the user query Q
 The model represents a function Rel(Q,D)
 D is a document, Q is a user query
 Rel(Q,D) is the relevance of document D to query
Q
 There are many models available
 Algebraic models
 Probabilistic models
 Set-theoretic models
Basic Vector Space Model
 Define a set of words
and phases as terms
 Text is represented by
a vector of terms
 User query is
converted to a vector,
too
 Measure the vector
“distance” between a
document vector and
the query vector
Term set: business, computer, PowerPoint, presentation, user, web

Document: “We are doing an e-business presentation in PowerPoint.”
→ (1, 0, 1, 1, 0, 0)

Query: “computer presentation”
→ (0, 1, 0, 1, 0, 0)

Distance = sqrt( (1-0)² + (0-1)² + (1-0)² + (1-1)² + (0-0)² + (0-0)² ) = sqrt(3)
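A minimal sketch reproducing this example, assuming simple whitespace tokenization:

```python
# Vector-space sketch of the slide's example: binary term vectors over a
# fixed term set, compared with Euclidean distance.
import math

TERMS = ["business", "computer", "powerpoint", "presentation", "user", "web"]

def to_vector(text):
    words = set(text.lower().replace("-", " ").replace(".", "").split())
    return [1 if term in words else 0 for term in TERMS]

doc   = to_vector("We are doing an e-business presentation in PowerPoint.")
query = to_vector("computer presentation")

dist = math.sqrt(sum((d - q) ** 2 for d, q in zip(doc, query)))
print(doc, query, round(dist, 3))   # [1,0,1,1,0,0] [0,1,0,1,0,0] 1.732
```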
Probabilistic Models Overview
Probabilistic Models
 Ranking: the probability that a document is
relevant to a query
 Often denoted as Pr(R|D,Q)
 In actual measure, a log-odds transformation is
used:
log [ Pr(R | D, Q) / Pr(¬R | D, Q) ]
 Probability values are estimated in applications
Information Retrieval
Given
 A source of textual documents
 A well defined limited query (text based)
Find
 Sentences with relevant information
 Extract the relevant information and ignore non-
relevant information (important!)
 Link related information and output in a predetermined
format
 Example: news stories, e-mails, web pages, photograph,
music, statistical data, biomedical data, etc.
 Information items can be in the form of text, image, video,
audio, numbers, etc.
Information Retrieval
Two basic information retrieval (IR) processes:
Browsing or navigation system
 The user skims the document collection by jumping from one
document to another via hypertext or hypermedia
links until a relevant document is found
Classical IR system: question answering system
 Query: question in natural language
 Answer: directly extracted from text of document
collection
Text Based Information Retrieval:
Information item (document)
 Text format (written/spoken) or has textual
description
Information need (query)
Classical IR System Process
General concepts in IR
Representation language
 Typically a vector of d attribute values, e.g.,
 set of color, intensity, texture, features characterizing
images
 word counts for text documents
Data set D of N objects
 Typically represented as an N x d matrix
Query Q
 User poses a query to search D
 Query is typically expressed in the same representation
language as the data, e.g.,
 each text document is a set of words that occur in the
document
Query by Content
Traditional DB query: exact matches
 E.g. query Q = [level = MANAGER] AND [age < 30] or,
 Boolean match on text
 Query = “Irvine” AND “fun”: return all docs with “Irvine” and “fun”
 Not useful when there are many matches
 E.g., “data mining” in Google returns 60 million documents
Query-by-content query: more general / less precise
 E.g. what record is most similar to a query Q?
 For text data, often called “information retrieval (IR)”
 Can also be used for images, sequences, video, etc
 Q can itself be an object (e.g., a document) or a shorter version
(e.g., 1 word)
Goal
 Match query Q to the N objects in the database
 Return a ranked list (typically) of the most similar/relevant objects
in the data set D given Q
Issues in Query by Content
 What representation language to use
 How to measure similarity between Q and each object
in D
 How to compute the results in real-time (for interactive
querying)
 How to rank the results for the user
 Allowing user feedback (query modification)
 How to evaluate and compare different IR
algorithms/systems
The Standard Approach
 Fixed-length (d dimensional) vector representation
 For query (1-by-d Q) and database (n-by-d X) objects
 Use domain-specific higher-level features (vs raw)
 Image
“bag of features”: color (e.g. RGB), texture (e.g.
Gabor, Fourier coeffs), …
 Text
“bag of words”: freq count for each word in each
document, …
Also known as the “vector-space” model
 Compute distances between vectorized representation
 Use k-NN to find k vectors in X closest to Q
Text Retrieval
 Document: book, paper, WWW page, ...
 Term: word, word-pair, phrase, … (often: 50,000+)
 query Q = set of terms, e.g., “data” + “mining”
 NLP (natural language processing) too hard, so …
 Want (vector) representation for text which
 Retains maximum useful semantics
 Supports efficient distance computes between docs
and Q
 Term weights
 Boolean (e.g. term in document or not); “bag of
words”
 Real-valued (e.g. freq term in doc; relative to all
docs) ...
Practical Issues
Tokenization
 Convert document to word counts
 word token = “any nonempty sequence of characters”
 for HTML (etc) need to remove formatting
Canonical forms, Stopwords, Stemming
 Remove capitalization
 Stopwords
 Remove very frequent words (a, the, and…) – can use standard
list
 Can also remove very rare words
 Stemming (next slide)
Data representation
 E.g., 3 column: <docid termid position>
 Inverted index (faster)
 List of sorted <termid docid> pairs: useful for finding docs
containing certain terms
 Equivalent to a sparse representation of term x doc matrix
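A minimal sketch of the inverted-index representation described above, with hypothetical toy documents:

```python
# Inverted-index sketch: map each term to the postings (docid, position)
# of the documents containing it. This is a sparse representation of the
# term x document matrix.
from collections import defaultdict

docs = {
    1: "data mining finds patterns in data",
    2: "text mining applies data mining to text",
}

inverted = defaultdict(list)            # term -> [(docid, position), ...]
for docid, text in docs.items():
    for pos, term in enumerate(text.split()):
        inverted[term].append((docid, pos))

# Documents containing a given term are read directly from its postings list.
print(sorted({docid for docid, _ in inverted["mining"]}))   # [1, 2]
```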
Intelligent Information Retrieval
 Meaning of words
 Synonyms “buy” / “purchase”
 Ambiguity “bat” (baseball vs. mammal)
 Order of words in the query
 Hot dog stand in the amusement park
 Hot amusement stand in the dog park
Key Word Search
 The technical goal for prediction is to classify new,
unseen documents
 Prediction and IR are unified by the computation of
document similarity
 IR is based on traditional keyword search through a
search engine
 So using a search engine can be recognized as a
special instance of the prediction concept
Key Word Search
 We enter key words into a search engine and expect
relevant documents to be returned
 These key words come from a dictionary created
from the document collection and can themselves be viewed as a
small document
 So, we want to measure how similar the new
document (the query) is to the documents in the collection
Key Word Search
 So, the notion of similarity is reduced to finding
documents with the same keywords as posed to the
search engine
 But, the objective of the search engine is to rank the
documents, not to assign a label
 So we need additional techniques to break the expected
ties (all retrieved documents match the search criteria)
Key Word Search
 In full text retrieval, all the words in each document are
considered to be keywords.
 We use the word term to refer to the words in a document
 Information-retrieval systems typically allow query expressions
formed using keywords and the logical connectives and, or,
and not
 Ands are implicit, even if not explicitly specified
 Ranking of documents on the basis of estimated relevance to a
query is critical
 Relevance ranking is based on factors such as
o Term frequency
Frequency of occurrence of query keyword in
document
o Inverse document frequency
How many documents the query keyword occurs in
Fewer documents → give more importance to the keyword
o Hyperlinks to documents
Relevance Ranking Using Terms
TF-IDF (Term frequency/Inverse Document frequency)
ranking:
 Let n(d) = number of terms in the document d
 n(d, t) = number of occurrences of term t in the
document d.
 Relevance of a document d to a term t
 The log factor is to avoid excessive weight to
frequent terms
 Relevance of document to query Q
TF(d, t) = log( 1 + n(d, t) / n(d) )

r(d, Q) = Σt∈Q TF(d, t) / n(t),
where n(t) is the number of documents that contain term t
Relevance Ranking Using Terms
 Most systems add to the above model
 Words that occur in title, author list, section headings,
etc. are given greater importance
 Words whose first occurrence is late in the document
are given lower importance
 Very common words such as “a”, “an”, “the”, “it” etc
are eliminated
Called stop words
 Proximity: if keywords in query occur close together
in the document, the document has higher importance
than if they occur far apart
 Documents are returned in decreasing order of relevance
score
 Usually only top few documents are returned, not all
Similarity Based Retrieval
 Similarity based retrieval - retrieve documents similar to a
given document
 Similarity may be defined on the basis of common words
E.g., find the k terms in the given document with the highest
TF(d, t) / n(t) and use these terms to find the relevance of other
documents.
 Relevance feedback: Similarity can be used to refine answer
set to keyword query
 User selects a few relevant documents from those
retrieved by keyword query, and system finds other
documents similar to these
 Vector space model: define an n-dimensional space, where n
is the number of words in the document set.
 Vector for document d goes from the origin to a point whose
i-th coordinate is TF (d, t) / n (t)
 The cosine of the angle between the vectors of two
documents is used as a measure of their similarity
Relevance Using Hyperlinks
 Number of documents relevant to a query can be enormous
if only term frequencies are taken into account
 Using term frequencies makes “spamming” easy
 E.g. a travel agency can add many occurrences of the
words “travel” to its page to make its rank very high
 Most of the time people are looking for pages from popular
sites
 Idea: use popularity of Web site (e.g. how many people visit
it) to rank site pages that match given keywords
Relevance Using Hyperlinks
 Solution: use number of hyperlinks to a site as a measure of the
popularity or prestige of the site
 Count only one hyperlink from each site
 Popularity measure is for site, not for individual page
But, most hyperlinks are to root of site
Also, concept of “site” difficult to define since a URL
prefix like cs.yale.edu contains many unrelated pages
of varying popularity
 Refinements
 When computing prestige based on links to a site, give more
weight to links from sites that themselves have higher
prestige
Definition is circular
Set up and solve system of simultaneous linear
equations
Relevance Using Hyperlinks
 Connections to social networking theories that ranked prestige
of people
 E.g. the president of the U.S.A has a high prestige since
many people know him
 Someone known by multiple prestigious people has high
prestige
 Hub and authority based ranking
 A hub is a page that stores links to many pages (on a topic)
 An authority is a page that contains actual information on a
topic
 Each page gets a hub prestige based on prestige of
authorities that it points to
 Each page gets an authority prestige based on prestige of
hubs that point to it
 Again, prestige definitions are cyclic, and can be obtained by
solving a system of simultaneous linear equations
Nearest-Neighbor Methods
 A method that compares vectors and measures
similarity
 In Prediction: the NNMs will collect the K most similar
documents and then look at their labels
 In IR: the NNMs will determine whether a satisfactory
response to the search query has been found
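A minimal nearest-neighbor sketch, assuming a simple shared-word-count similarity (one of the measures described in the following slides) and hypothetical labeled documents:

```python
# Nearest-neighbor sketch: rank documents by similarity to a new document;
# for prediction, take a majority vote over the labels of the k most similar.
from collections import Counter

labeled_docs = [
    ("spam",     {"win", "money", "prize", "click"}),
    ("spam",     {"free", "money", "offer"}),
    ("not_spam", {"meeting", "project", "report"}),
    ("not_spam", {"project", "budget", "report", "deadline"}),
]

def shared_words(a, b):
    return len(a & b)                      # number of words in common

def knn_predict(new_doc_words, k=3):
    ranked = sorted(labeled_docs,
                    key=lambda item: shared_words(item[1], new_doc_words),
                    reverse=True)          # in IR, this ranked list is the answer
    votes = Counter(label for label, _ in ranked[:k])
    return votes.most_common(1)[0][0]      # in prediction, vote over labels

print(knn_predict({"free", "prize", "money"}))   # -> 'spam'
```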
Measuring Similarity
 These measures used to examine how documents are
similar and the output is a numerical measure of
similarity
 Three increasingly complex measures:
 Shared Word Count
 Word Count and Bonus
 Cosine Similarity
Shared Word Count
 Counts the shared words between documents
 The words:
 In IR we have a global dictionary where all
potential words will be included, with the
exception of stopwords.
 In Prediction it is better to preselect the dictionary
relative to the label
Computing similarity by Shared
words
 Look at all words in the new document
 For each document in the collection count how many
of these words appear
 No weighting is used, just a simple count
 The dictionary contains true key words (weakly
predictive words removed)
 The results of this measure are clearly intuitive
 No one will question why a document was
retrieved
Computing similarity by Shared
words
 Each document represented as a vector of key words
(zeros and ones)
 The similarity of two documents is the inner product of the two
vectors
 If both documents contain the same key word, that word
is counted (1 × 1)
 The performance of this measure depends mainly on
the dictionary used
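A minimal sketch of the shared-word-count measure as a dot product of binary key-word vectors; the small dictionary is hypothetical:

```python
# Shared-word-count sketch: represent each document as a binary vector over
# a key-word dictionary; similarity is the dot product (number of shared words).
DICTIONARY = ["data", "mining", "retrieval", "index", "query"]

def binary_vector(text):
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in DICTIONARY]

def shared_word_count(vec_a, vec_b):
    return sum(a * b for a, b in zip(vec_a, vec_b))   # 1*1 for each shared word

d1 = binary_vector("data mining builds an index")
d2 = binary_vector("query processing and data retrieval")
print(shared_word_count(d1, d2))   # 1 (only "data" is shared)
```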
Computing similarity by Shared
words
 Shared word count is an exact search
 A document is either retrieved or not retrieved
 No weighting can be applied to terms
 In a query containing A and B, you cannot specify that A is more
important than B
 Every retrieved document is treated equally
Word Count and Bonus
 TF – term frequency
 Number of times a term occurs in a document
 DF –Document frequency
 Number of documents that contain the term.
 IDF – inverse document frequency
 IDF = log (N/df)
 N: the total number of documents
 Vector is a numerical representation for a point in a multi-
dimensional space.
 (x1, x2, … … xn)
Dimensions of the space need to be defined
A measure of the space needs to be defined.
Word Count and Bonus
 Each indexing term is a dimension
 Each document is a vector
Di = (ti1, ti2, ti3, ti4, ... tik)
 Document similarity is defined as

Sim(Di) = Σj=1..K w(j)

where
w(j) = 1 + 1/df(j)   if word j occurs in both documents
w(j) = 0             otherwise
K = number of words (in the dictionary)
Word Count and Bonus
 The bonus 1/df(j) is a variant of idf. Thus, if the word
occurs in many documents, the bonus is small.
 This measure is better than the shared word count because it
discriminates between weakly and strongly
predictive words.
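A minimal sketch of the word-count-and-bonus measure, with a hypothetical three-word dictionary:

```python
# Word-count-and-bonus sketch: each shared word contributes 1 + 1/df(j),
# so words that occur in few documents (low df) earn a larger bonus.
def doc_frequencies(doc_vectors):
    k = len(doc_vectors[0])
    return [sum(vec[j] for vec in doc_vectors) for j in range(k)]

def similarity_with_bonus(doc_vec, query_vec, df):
    score = 0.0
    for j, (d, q) in enumerate(zip(doc_vec, query_vec)):
        if d == 1 and q == 1 and df[j] > 0:      # word j occurs in both documents
            score += 1 + 1 / df[j]
    return score

collection = [[1, 0, 1], [1, 1, 0], [0, 1, 1]]   # 3 docs over a 3-word dictionary
df = doc_frequencies(collection)                  # [2, 2, 2]
query = [1, 0, 1]
for vec in collection:
    print(similarity_with_bonus(vec, query, df))  # 3.0, 1.5, 1.5
```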
Word Count and Bonus
Computing Similarity Scores with Bonus

• A document space is defined by five terms:
hardware, software, user, information, index.
• The query (new document) is “hardware, user, information”,
with vector 01101 in the term ordering used below.

Labeled spreadsheet vector    Similarity score with bonus
10101                         2.83
11000                         1.33
00010                         0
10001                         1.33
00100                         1.5
01010                         1.33
11001                         2.67

For example, 2.83 = (1 + 1/2) + (1 + 1/3): two shared words whose
document frequencies in the collection are 2 and 3.
Cosine Similarity
The Vector Space
A document is represented as a vector:
(W1, W2, … … , Wn)
Binary:
 Wi= 1 if the corresponding term is in the
document
 Wi= 0 if the term is not in the document
TF: (Term Frequency)
 Wi= tfi where tfi is the number of times the term
occurred in the document
TF*IDF: (Inverse Document Frequency)
 Wi =tfi*idfi=tfi*(1+log(N/dfi)) where dfi is the
number of documents contains the term i, and N
the total number of documents in the collection.
Cosine Similarity
The Vector Space
vec(D) = (w1, w2, ..., wt)
Sim(d1, d2) = cos(θ)
= [ vec(d1) · vec(d2) ] / ( |d1| × |d2| )
= Σj [ wd1(j) × wd2(j) ] / ( |d1| × |d2| )
w(j) > 0 whenever term j ∈ di
So, 0 <= sim(d1,d2) <=1
A document is retrieved
even if it matches the
query terms only partially
Cosine Similarity
 How to compute the weight wj?
 A good weight must take into account two effects:
 quantification of intra-document contents (similarity)
tf factor, the term frequency within a
document
 quantification of inter-document separation (dissimilarity)
idf factor, the inverse document frequency
 wj = tf(j) * idf(j)
Cosine Similarity
 TF in the given document shows how important the term is
in this document (makes the frequent words for the
document more important)
 IDF makes rare words across all documents more
important.
 A high weight in a tf-idf ranking scheme is therefore
reached by a high term frequency in the given document
and a low term frequency in all other documents.
 Term weights in a document affects the position of the
document vectors
Cosine Similarity
TF-IDF definitions:
fik: number of occurrences of term ti in document Dk
tfik = fik / maxj(fjk): normalized term frequency
dfi: number of documents which contain term ti
idfi = log(N / dfi), where N is the total number of documents
wik = tfik × idfi: term weight
Intuition: rare words get more weight, common words less
weight
Example TF-IDF
 Given a document containing terms with given frequencies:
Kent = 3; Ohio = 2; University = 1
and assume a collection of 10,000 documents and document
frequencies of these terms are:
Kent = 50; Ohio = 1300; University = 250.
THEN
Kent: tf = 3/3; idf = log(10000/50) = 5.3; tf-idf = 5.3
Ohio: tf = 2/3; idf = log(10000/1300) = 2.0; tf-idf = 1.3
University: tf = 1/3; idf = log(10000/250) = 3.7; tf-idf = 1.2
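A minimal sketch reproducing these numbers; note the logarithm is the natural log (that is what makes idf(Kent) = log(10000/50) ≈ 5.3):

```python
# TF-IDF sketch reproducing the example above, using the natural log.
import math

N = 10_000                                   # documents in the collection
term_freqs = {"Kent": 3, "Ohio": 2, "University": 1}
doc_freqs  = {"Kent": 50, "Ohio": 1300, "University": 250}

max_tf = max(term_freqs.values())
for term, f in term_freqs.items():
    tf = f / max_tf                          # normalized term frequency
    idf = math.log(N / doc_freqs[term])
    print(f"{term}: tf={tf:.2f} idf={idf:.2f} tf-idf={tf * idf:.2f}")
# Kent: tf=1.00 idf=5.30 tf-idf=5.30
# Ohio: tf=0.67 idf=2.04 tf-idf=1.36   (the slide rounds idf to 2.0, giving 1.3)
# University: tf=0.33 idf=3.69 tf-idf=1.23
```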
Cosine Similarity
 Cosine
 W(j) = tf(j) * idf(j)
 Idf(j) = log(N / df(j))

sim(d1, d2) = Σj [ wd1(j) × wd2(j) ] / ( √(Σj wd1(j)²) × √(Σj wd2(j)²) )
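A minimal sketch of cosine similarity over tf-idf weighted vectors, following w(j) = tf(j) × idf(j) above; the toy documents are hypothetical:

```python
# Cosine similarity over tf-idf weighted vectors, with
# w(j) = tf(j) * idf(j) and idf(j) = log(N / df(j)).
import math
from collections import Counter

docs = [
    "data mining and text mining".split(),
    "information retrieval of text documents".split(),
    "databases and query processing".split(),
]
N = len(docs)
vocab = sorted({w for d in docs for w in d})
df = {w: sum(1 for d in docs if w in d) for w in vocab}

def tfidf_vector(words):
    counts = Counter(words)
    return [counts[w] * math.log(N / df[w]) for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vecs = [tfidf_vector(d) for d in docs]
print(round(cosine(vecs[0], vecs[1]), 3))   # doc 0 vs doc 1: share "text"
print(round(cosine(vecs[0], vecs[2]), 3))   # doc 0 vs doc 2: share only "and"
```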
Why Mine the Web?
Enormous wealth of textual information on the Web.
 Book/CD/Video stores (e.g., Amazon)
 Restaurant information (e.g., Zagats)
 Car prices (e.g., Carpoint)
Lots of data on user access patterns
 Web logs contain sequence of URLs accessed by users
Possible to retrieve “previously unknown” information
 People who ski also frequently break their leg.
 Restaurants that serve sea food in California are likely
to be outside San-Francisco
Mining the Web
[Diagram: a Web spider crawls the document source; the IR / IE system takes a query and returns a ranked list of documents (1. Doc1, 2. Doc2, 3. Doc3, ...).]
Web-based Retrieval
Additional information in Web documents
 Link structure (e.g., Page Rank)
 HTML structure
 Link/anchor text
 Title text, Etc.,
 Can be leveraged for better retrieval
Additional issues in Web retrieval
 Scalability: size of “corpus” is huge (10 to 100 billion docs)
 Constantly changing:
 Crawlers to update document-term information
 Need schemes for efficient updating indices
 Evaluation is more difficult:
 How is relevance measured?
 How many documents in total are relevant?
Probabilistic Approaches to
Retrieval
Compute P(q | d) for each document d
 Intuition: relevance of d to q is related to how likely it is that q
was generated by d, or “how likely is q under a model for d?”
Simple model for P(q|d)
 Pe(q|d) = empirical frequency of words in document d
 “tuned” to d, but likely to be sparse (will contain many zeros)
2-stage probabilistic model (or linear interpolation model)
 P(q|d) = λ Pe(q | d) + (1 − λ) Pe(q | corpus)
 λ can be fixed, e.g., tuned to a particular data set
 Or it can depend on d, e.g., λ = nd / (nd + m),
where nd = number of words in doc d, and m = a constant (e.g.,
1000)
Can also use more sophisticated models for P(q|d), e.g., topic-
based models
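A minimal sketch of the linear-interpolation model, assuming λ = nd / (nd + m) and hypothetical toy documents:

```python
# Two-stage / linear-interpolation sketch:
#   P(q|d) = lam * Pe(q|d) + (1 - lam) * Pe(q|corpus),  lam = n_d / (n_d + m)
from collections import Counter

docs = [
    "data mining finds patterns in data".split(),
    "information retrieval ranks documents".split(),
]
corpus = [w for d in docs for w in d]
corpus_counts, corpus_len = Counter(corpus), len(corpus)
M = 10                                     # smoothing constant (e.g., 1000 in practice)

def p_query_given_doc(query, doc):
    doc_counts, n_d = Counter(doc), len(doc)
    lam = n_d / (n_d + M)
    p = 1.0
    for w in query:
        p_doc = doc_counts[w] / n_d                    # empirical, sparse
        p_bg  = corpus_counts[w] / corpus_len          # corpus background
        p *= lam * p_doc + (1 - lam) * p_bg
    return p

query = "data patterns".split()
for i, d in enumerate(docs):
    print(i, p_query_given_doc(query, d))
```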
Information Retrieval
 Web-Based Document Search
 Page Rank
 Anchor Text
 Document Matching
 Inverted Lists
Page Rank
 PR(A): the page rank of page A.
 C(T) : the number of outgoing links from
page T.
 d : minimum value assigned to any page.
 Tj : a page pointing to A.

PR(A) = d + (1 − d) × Σj PR(Tj) / C(Tj)
Algorithm of Page Rank
1. Use the PageRank Equation to compute
PageRank for each page in the collection
using latest PageRanks of pages.
2. Repeat step 1 until no significant change to
any PageRank.
Example
Initial values: PR(A) = PR(B) = PR(C) = 1, d = 0.1
(A links to B and C; B and C each link back to A)
In the first iteration:
 PR(A) = 0.1 + 0.9 × (PR(B) + PR(C))
= 0.1 + 0.9 × (1 + 1)
= 1.9
 PR(B) = 0.1 + 0.9 × (PR(A) / 2)
= 0.1 + 0.9 × (1.9 / 2)
= 0.95
 PR(C) = 0.1 + 0.9 × (PR(A) / 2)
= 0.1 + 0.9 × (1.9 / 2)
= 0.95
After convergence: PR(A) = 1.48, PR(B) = 0.76, PR(C) = 0.76
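A minimal sketch of the iterative computation for this example, assuming the link structure implied by the formulas (A links to B and C; B and C link back to A):

```python
# Iterative PageRank sketch for the example, using
# PR(p) = d + (1 - d) * sum over in-links T of PR(T) / C(T), with d = 0.1.
links = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}    # outgoing links
d = 0.1
pr = {page: 1.0 for page in links}                    # initial values

for _ in range(200):                                  # repeat until (near) convergence
    new_pr = {page: d + (1 - d) * sum(pr[q] / len(links[q])
                                      for q in links if page in links[q])
              for page in links}
    converged = all(abs(new_pr[p] - pr[p]) < 1e-6 for p in pr)
    pr = new_pr
    if converged:
        break

print({p: round(v, 2) for p, v in pr.items()})
# -> {'A': 1.47, 'B': 0.76, 'C': 0.76}  (≈ the slide's converged values)
```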
Anchor Text
 The anchor text is the visible, clickable text in a
hyperlink.
 For example:
<a href="http://www.wikipedia.org">Wikipedia</a>
 The anchor text is Wikipedia; the complex URL
http://www.wikipedia.org/ displays on the web
page as Wikipedia, contributing to clean, easy-to-read
text.
Anchor Text
 Anchor text usually gives the user relevant descriptive
or contextual information about the content of the link’s
destination.
 The anchor text may or may not be related to the actual
text of the URL of the link.
 The words contained in the Anchor Text can determine
the ranking that the page will receive by search engines.
Common Misunderstanding
 Webmasters sometimes tend to misunderstand anchor
text.
 Instead of turning appropriate words inside of a
sentence into a clickable link, webmasters frequently
insert extra text.
Anchor Text
 This proper method of linking is beneficial not only to
users, but also to the webmasters as anchor text holds
significant weight in search engine ranking.
 Most search engine optimization experts recommend
against using “click here” to designate a link.
Document Matching
 An arbitrarily long document is the query, not just a few
key words.
 But the goal is still to rank and output an ordered list of
relevant documents.
 The most similar documents are found using the
measures described earlier.
 Search engines and document matchers are not
focused on classification of new documents.
 Their primary goal is to retrieve the most relevant documents.
Generalization of searching
• Matching a document to a collection of documents
looks like a tedious and expensive operation.
• Even for a short query, comparing it to every document in a
large collection implies a relatively intensive
computation task.
Example of document matching
 Consider an online help desk, where a complete
description of a problem is submitted.
 That document could be matched to stored documents,
hopefully finding descriptions of similar problems and
solutions without having the user experiment with
numerous key word searches.
Inverted Lists
 Instead of documents pointing to words, a list
of words pointing to documents is the primary
internal representation for processing queries
and matching documents.
Inverted Lists
 The inverted list is the key to the efficiency of
information retrieval systems.
 The inverted list has helped make nearest-
neighbor methods a pragmatic possibility for prediction.
Example
If the query contains the words 100 and 200:
1) First process the postings list W(100), updating the
similarity score S(i) of each document i it lists:
S(1) = 0 + 1
S(2) = 0 + 1
…
2) Then process W(200) in the same way:
S(2) = 1 + 1
…
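A minimal sketch of scoring documents through inverted lists as in this example; the postings are hypothetical:

```python
# Inverted-list query scoring sketch: walk the postings list of each query
# term and accumulate a similarity score per document.
from collections import defaultdict

# postings: term -> list of document ids containing that term
postings = {
    100: [1, 2, 5],
    200: [2, 3],
    300: [4],
}

def score(query_terms):
    scores = defaultdict(int)
    for term in query_terms:
        for docid in postings.get(term, []):
            scores[docid] += 1          # shared-word-count contribution
    return dict(scores)

print(score([100, 200]))   # {1: 1, 2: 2, 5: 1, 3: 1}
```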
Evaluating IE Accuracy
 Always evaluate performance on independent, manually-
annotated test data not used during system development.
 Measure for each test document:
 Total number of correct extractions in the solution
template: N
 Total number of slot/value pairs extracted by the
system: E
 Number of extracted slot/value pairs that are correct
(i.e. in the solution template): C
 Compute average value of metrics adapted from IR:
 Recall = C/N
 Precision = C/E
 F-Measure = Harmonic mean of recall and precision
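A minimal sketch of these metrics:

```python
# Recall = C/N, Precision = C/E, F = harmonic mean of recall and precision,
# where N = correct extractions in the solution template, E = extractions
# made by the system, C = extracted slot/value pairs that are correct.
def ie_metrics(n_correct_in_solution, n_extracted, n_extracted_correct):
    recall = n_extracted_correct / n_correct_in_solution
    precision = n_extracted_correct / n_extracted
    f_measure = (2 * recall * precision / (recall + precision)
                 if recall + precision else 0.0)
    return recall, precision, f_measure

print(ie_metrics(10, 8, 6))   # (0.6, 0.75, 0.666...)
```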
Related Types of Data
Sparse high-dimensional data sets with counts, like
document-term matrices, are common in data mining, e.g.,
 “transaction data”
 Rows = customers; Columns = products
 Web log data (ignoring sequence)
 Rows = Web surfers; Columns = Web pages
Recommender systems
 Given some products from user i, suggest other
products to the user
 e.g., Amazon.com’s book recommender
 Collaborative filtering:
 use k-nearest-individuals as the basis for
predictions
 Many similarities with querying and information retrieval
What is a Good IR System?
 Minimize the overhead of a user locating needed
information
 Fast, accurate, comprehensive, easy to use, …
 Objective measures
 Precision
 Recall
P = (No. of relevant documents retrieved) / (No. of all documents retrieved)

R = (No. of relevant documents retrieved) / (No. of all relevant documents in the data)
Measuring Retrieval Effectiveness
 Information-retrieval systems save space by using
index structures that support only approximate retrieval.
May result in:
 false negative (false drop) - some relevant
documents may not be retrieved.
 false positive - some irrelevant documents may be
retrieved.
For many applications a good index should not permit
any false drops, but may permit a few false positives.
 Relevant performance metrics:
 precision - what percentage of the retrieved
documents are relevant to the query.
 recall - what percentage of the documents relevant
to the query were actually retrieved.
Measuring Retrieval
Effectiveness
Recall vs. precision tradeoff:
 Can increase recall by retrieving many documents
(down to a low level of relevance ranking), but many
irrelevant documents would be fetched, reducing
precision
Measures of retrieval effectiveness:
 Recall as a function of number of documents fetched,
or
 Precision as a function of recall
Equivalently, as a function of number of
documents fetched
 E.g. “precision of 75% at recall of 50%, and 60% at a
recall of 75%”
Applications of Information
Retrieval
 Classic application
 Library catalogue
e.g. The UofC library catalogue
 Current applications
 Digital library
e.g. http://www.acm.org/dl
 WWW search engines
e.g. http://www.google.com
Other applications of IE Systems
 Job resumes
 Seminar announcements
 Molecular biology information from MEDLINE, e.g., extracting
gene–drug interactions from biomedical texts
 Summarizing medical patient records by extracting
diagnoses, symptoms, physical findings, test results.
 Gathering earnings, profits, board members, etc. [corporate
information] from web, company reports
 Verification of construction industry specifications documents
(are the quantities correct/reasonable?)
 Extraction of political/economic/business changes from
newspaper articles
Conclusion
1. Information retrieval methods are specialized
nearest-neighbor methods, which are well-known
prediction methods.
2. IR methods typically process unlabeled data and
order and display the retrieved documents.
3. The IR methods have no training and induce no new
rules for classification.
Ad

More Related Content

What's hot (20)

WEB BASED INFORMATION RETRIEVAL SYSTEM
WEB BASED INFORMATION RETRIEVAL SYSTEMWEB BASED INFORMATION RETRIEVAL SYSTEM
WEB BASED INFORMATION RETRIEVAL SYSTEM
Sai Kumar Ale
 
Multimedia Information Retrieval
Multimedia Information RetrievalMultimedia Information Retrieval
Multimedia Information Retrieval
Stephane Marchand-Maillet
 
Information retrieval introduction
Information retrieval introductionInformation retrieval introduction
Information retrieval introduction
nimmyjans4
 
Lec1,2
Lec1,2Lec1,2
Lec1,2
alaa223
 
The impact of web on ir
The impact of web on irThe impact of web on ir
The impact of web on ir
Primya Tamil
 
Information retrieval-systems notes
Information retrieval-systems notesInformation retrieval-systems notes
Information retrieval-systems notes
BAIRAVI T
 
Evaluation in Information Retrieval
Evaluation in Information RetrievalEvaluation in Information Retrieval
Evaluation in Information Retrieval
Dishant Ailawadi
 
NLP
NLPNLP
NLP
guestff64339
 
Information retrieval 3 query search interfaces
Information retrieval 3 query search interfacesInformation retrieval 3 query search interfaces
Information retrieval 3 query search interfaces
Vaibhav Khanna
 
Lectures 1,2,3
Lectures 1,2,3Lectures 1,2,3
Lectures 1,2,3
alaa223
 
Information retrieval 10 tf idf and bag of words
Information retrieval 10 tf idf and bag of wordsInformation retrieval 10 tf idf and bag of words
Information retrieval 10 tf idf and bag of words
Vaibhav Khanna
 
Inverted index
Inverted indexInverted index
Inverted index
Krishna Gehlot
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
Joud Khattab
 
Text clustering
Text clusteringText clustering
Text clustering
KU Leuven
 
IRS-Cataloging and Indexing-2.1.pptx
IRS-Cataloging and Indexing-2.1.pptxIRS-Cataloging and Indexing-2.1.pptx
IRS-Cataloging and Indexing-2.1.pptx
ShivaVemula2
 
Chapter 1 semantic web
Chapter 1 semantic webChapter 1 semantic web
Chapter 1 semantic web
R A Akerkar
 
Metadata ppt
Metadata pptMetadata ppt
Metadata ppt
Shashikant Kumar
 
The semantic web
The semantic web The semantic web
The semantic web
ap
 
Text mining
Text miningText mining
Text mining
Koshy Geoji
 
Ontologies and semantic web
Ontologies and semantic webOntologies and semantic web
Ontologies and semantic web
Stanley Wang
 
WEB BASED INFORMATION RETRIEVAL SYSTEM
WEB BASED INFORMATION RETRIEVAL SYSTEMWEB BASED INFORMATION RETRIEVAL SYSTEM
WEB BASED INFORMATION RETRIEVAL SYSTEM
Sai Kumar Ale
 
Information retrieval introduction
Information retrieval introductionInformation retrieval introduction
Information retrieval introduction
nimmyjans4
 
The impact of web on ir
The impact of web on irThe impact of web on ir
The impact of web on ir
Primya Tamil
 
Information retrieval-systems notes
Information retrieval-systems notesInformation retrieval-systems notes
Information retrieval-systems notes
BAIRAVI T
 
Evaluation in Information Retrieval
Evaluation in Information RetrievalEvaluation in Information Retrieval
Evaluation in Information Retrieval
Dishant Ailawadi
 
Information retrieval 3 query search interfaces
Information retrieval 3 query search interfacesInformation retrieval 3 query search interfaces
Information retrieval 3 query search interfaces
Vaibhav Khanna
 
Lectures 1,2,3
Lectures 1,2,3Lectures 1,2,3
Lectures 1,2,3
alaa223
 
Information retrieval 10 tf idf and bag of words
Information retrieval 10 tf idf and bag of wordsInformation retrieval 10 tf idf and bag of words
Information retrieval 10 tf idf and bag of words
Vaibhav Khanna
 
Text clustering
Text clusteringText clustering
Text clustering
KU Leuven
 
IRS-Cataloging and Indexing-2.1.pptx
IRS-Cataloging and Indexing-2.1.pptxIRS-Cataloging and Indexing-2.1.pptx
IRS-Cataloging and Indexing-2.1.pptx
ShivaVemula2
 
Chapter 1 semantic web
Chapter 1 semantic webChapter 1 semantic web
Chapter 1 semantic web
R A Akerkar
 
The semantic web
The semantic web The semantic web
The semantic web
ap
 
Ontologies and semantic web
Ontologies and semantic webOntologies and semantic web
Ontologies and semantic web
Stanley Wang
 

Viewers also liked (6)

Text categorization
Text categorizationText categorization
Text categorization
KU Leuven
 
Probabilistic models (part 1)
Probabilistic models (part 1)Probabilistic models (part 1)
Probabilistic models (part 1)
KU Leuven
 
Tdm probabilistic models (part 2)
Tdm probabilistic  models (part  2)Tdm probabilistic  models (part  2)
Tdm probabilistic models (part 2)
KU Leuven
 
Tdm recent trends
Tdm recent trendsTdm recent trends
Tdm recent trends
KU Leuven
 
Text data mining1
Text data mining1Text data mining1
Text data mining1
KU Leuven
 
Text Data Mining
Text Data MiningText Data Mining
Text Data Mining
KU Leuven
 
Text categorization
Text categorizationText categorization
Text categorization
KU Leuven
 
Probabilistic models (part 1)
Probabilistic models (part 1)Probabilistic models (part 1)
Probabilistic models (part 1)
KU Leuven
 
Tdm probabilistic models (part 2)
Tdm probabilistic  models (part  2)Tdm probabilistic  models (part  2)
Tdm probabilistic models (part 2)
KU Leuven
 
Tdm recent trends
Tdm recent trendsTdm recent trends
Tdm recent trends
KU Leuven
 
Text data mining1
Text data mining1Text data mining1
Text data mining1
KU Leuven
 
Text Data Mining
Text Data MiningText Data Mining
Text Data Mining
KU Leuven
 
Ad

Similar to Tdm information retrieval (20)

Chapter 1: Introduction to Information Storage and Retrieval
Chapter 1: Introduction to Information Storage and RetrievalChapter 1: Introduction to Information Storage and Retrieval
Chapter 1: Introduction to Information Storage and Retrieval
captainmactavish1996
 
Text mining and analytics v6 - p1
Text mining and analytics   v6 - p1Text mining and analytics   v6 - p1
Text mining and analytics v6 - p1
Dave King
 
Text Mining.pptx
Text Mining.pptxText Mining.pptx
Text Mining.pptx
vrundadevani
 
Semantic Search Tutorial at SemTech 2012
Semantic Search Tutorial at SemTech 2012 Semantic Search Tutorial at SemTech 2012
Semantic Search Tutorial at SemTech 2012
Thanh Tran
 
Lecture 9 - Machine Learning and Support Vector Machines (SVM)
Lecture 9 - Machine Learning and Support Vector Machines (SVM)Lecture 9 - Machine Learning and Support Vector Machines (SVM)
Lecture 9 - Machine Learning and Support Vector Machines (SVM)
Sean Golliher
 
Information_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_HabibInformation_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_Habib
El Habib NFAOUI
 
intro.ppt
intro.pptintro.ppt
intro.ppt
UbaidURRahman78
 
Week14-Multimedia Information Retrieval.pptx
Week14-Multimedia Information Retrieval.pptxWeek14-Multimedia Information Retrieval.pptx
Week14-Multimedia Information Retrieval.pptx
HasanulFahmi2
 
Tovek Presentation by Livio Costantini
Tovek Presentation by Livio CostantiniTovek Presentation by Livio Costantini
Tovek Presentation by Livio Costantini
maxfalc
 
Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012
Peter Mika
 
Literature Based Framework for Semantic Descriptions of e-Science resources
Literature Based Framework for Semantic Descriptions of e-Science resourcesLiterature Based Framework for Semantic Descriptions of e-Science resources
Literature Based Framework for Semantic Descriptions of e-Science resources
Hammad Afzal
 
Information Retrieval and Map-Reduce Implementations
Information Retrieval and Map-Reduce ImplementationsInformation Retrieval and Map-Reduce Implementations
Information Retrieval and Map-Reduce Implementations
Jason J Pulikkottil
 
Fundamentals Concepts on Text Analytics.pptx
Fundamentals Concepts on Text Analytics.pptxFundamentals Concepts on Text Analytics.pptx
Fundamentals Concepts on Text Analytics.pptx
aini658222
 
Concept Based Search
Concept Based SearchConcept Based Search
Concept Based Search
freewi11
 
Social recommender system
Social recommender systemSocial recommender system
Social recommender system
Kapil Kumar
 
Technical Whitepaper: A Knowledge Correlation Search Engine
Technical Whitepaper: A Knowledge Correlation Search EngineTechnical Whitepaper: A Knowledge Correlation Search Engine
Technical Whitepaper: A Knowledge Correlation Search Engine
s0P5a41b
 
Chapter 1 Intro Information Rerieval.pptx
Chapter 1 Intro Information Rerieval.pptxChapter 1 Intro Information Rerieval.pptx
Chapter 1 Intro Information Rerieval.pptx
bekidea
 
Reflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data systemReflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data system
Trey Grainger
 
Automatic Metadata Generation Charles Duncan
Automatic Metadata Generation Charles DuncanAutomatic Metadata Generation Charles Duncan
Automatic Metadata Generation Charles Duncan
JISC CETIS
 
Text mining introduction-1
Text mining   introduction-1Text mining   introduction-1
Text mining introduction-1
Sumit Sony
 
Chapter 1: Introduction to Information Storage and Retrieval
Chapter 1: Introduction to Information Storage and RetrievalChapter 1: Introduction to Information Storage and Retrieval
Chapter 1: Introduction to Information Storage and Retrieval
captainmactavish1996
 
Text mining and analytics v6 - p1
Text mining and analytics   v6 - p1Text mining and analytics   v6 - p1
Text mining and analytics v6 - p1
Dave King
 
Semantic Search Tutorial at SemTech 2012
Semantic Search Tutorial at SemTech 2012 Semantic Search Tutorial at SemTech 2012
Semantic Search Tutorial at SemTech 2012
Thanh Tran
 
Lecture 9 - Machine Learning and Support Vector Machines (SVM)
Lecture 9 - Machine Learning and Support Vector Machines (SVM)Lecture 9 - Machine Learning and Support Vector Machines (SVM)
Lecture 9 - Machine Learning and Support Vector Machines (SVM)
Sean Golliher
 
Information_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_HabibInformation_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_Habib
El Habib NFAOUI
 
Week14-Multimedia Information Retrieval.pptx
Week14-Multimedia Information Retrieval.pptxWeek14-Multimedia Information Retrieval.pptx
Week14-Multimedia Information Retrieval.pptx
HasanulFahmi2
 
Tovek Presentation by Livio Costantini
Tovek Presentation by Livio CostantiniTovek Presentation by Livio Costantini
Tovek Presentation by Livio Costantini
maxfalc
 
Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012
Peter Mika
 
Literature Based Framework for Semantic Descriptions of e-Science resources
Literature Based Framework for Semantic Descriptions of e-Science resourcesLiterature Based Framework for Semantic Descriptions of e-Science resources
Literature Based Framework for Semantic Descriptions of e-Science resources
Hammad Afzal
 
Information Retrieval and Map-Reduce Implementations
Information Retrieval and Map-Reduce ImplementationsInformation Retrieval and Map-Reduce Implementations
Information Retrieval and Map-Reduce Implementations
Jason J Pulikkottil
 
Fundamentals Concepts on Text Analytics.pptx
Fundamentals Concepts on Text Analytics.pptxFundamentals Concepts on Text Analytics.pptx
Fundamentals Concepts on Text Analytics.pptx
aini658222
 
Concept Based Search
Concept Based SearchConcept Based Search
Concept Based Search
freewi11
 
Social recommender system
Social recommender systemSocial recommender system
Social recommender system
Kapil Kumar
 
Technical Whitepaper: A Knowledge Correlation Search Engine
Technical Whitepaper: A Knowledge Correlation Search EngineTechnical Whitepaper: A Knowledge Correlation Search Engine
Technical Whitepaper: A Knowledge Correlation Search Engine
s0P5a41b
 
Chapter 1 Intro Information Rerieval.pptx
Chapter 1 Intro Information Rerieval.pptxChapter 1 Intro Information Rerieval.pptx
Chapter 1 Intro Information Rerieval.pptx
bekidea
 
Reflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data systemReflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data system
Trey Grainger
 
Automatic Metadata Generation Charles Duncan
Automatic Metadata Generation Charles DuncanAutomatic Metadata Generation Charles Duncan
Automatic Metadata Generation Charles Duncan
JISC CETIS
 
Text mining introduction-1
Text mining   introduction-1Text mining   introduction-1
Text mining introduction-1
Sumit Sony
 
Ad

Tdm information retrieval

  • 2. Text Mining Applications Information Retrieval  Query-based search of large text archives, e.g., the Web Text Classification  Automated assignment of topics to Web pages, e.g., Yahoo, Google  Automated classification of email into spam and non- spam Text Clustering  Automated organization of search results in real-time into categories  Discovery clusters and trends in technical literature (e.g. CiteSeer) Information Extraction  Extracting standard fields from free-text  Extracting names and places from reports, newspapers
  • 3. Information Retrieval - Definition  Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).  Information Retrieval  Deals with the representation, storage, organization of, and access to information items – Modern Information Retrieval  General Objective: Minimize the overhead of a user locating needed information
  • 4. Information Retrieval Is Not Database Information Retrieval  Process stored documents  Search documents relevant to user queries  No standard of how queries should be  Query results are permissive to errors or inaccurate items Database  Normally no processing of data  Search records matching queries  Standard: SQL language  Query results should have 100% accuracy. Zero tolerant to errors
  • 5. Information Retrieval Is Not Data Mining Information Retrieval  User target: Existing relevant data entries Data Mining  User target: Knowledge (rules, etc.) implied by data (not the individual data entries themselves) • Many techniques and models are shared and related • E.g. classification of documents
  • 6. Is Information Retrieval a Form of Text Mining? What is the principal computer specialty for processing documents and text??  Information Retrieval (IR)  The task of IR is to retrieve relevant documents in response to a query.  The fundamental technique of IR is measuring similarity  A query is examined and transformed into a vector of values to be compared with stored documents
  • 7. Is Information Retrieval a Form of Text Mining?  In the predication problem similar documents are retrieved, then measure their properties, i.e. count the # of class labels to see which label should be assigned to a new document  The objectives of the prediction can be posed in the form of an IR model where documents are retrieved that are relevant to a query, the query will be a new document
  • 8. Specify Query Search Document Collection Return Subset of Relevant Documents Key Steps in Information Retrieval Examine Document Collection Learn Classification Criteria Apply Criteria to New Documents Key Steps in Predictive Text Mining
  • 9. Specify Query Vector Match Document Collection Get Subset of Relevant Documents Examine Document Properties Predicting from Retrieved Documents Key steps in IR simple criteria such as document’s labels
  • 10. Information Retrieval (IR)  Conceptually, IR is the study of finding needed information. I.e., IR helps users find information that matches their information needs.  Expressed as queries  Historically, IR is about document retrieval, emphasizing document as the basic unit.  Finding documents relevant to user queries  Technically, IR studies the acquisition, organization, storage, retrieval, and distribution of information.
  • 12. Abstract IR Architecture DocumentsQuery Hits Representation Function Representation Function Query Representation Document Representation Comparison Function Index offlineonline
  • 14. IR Queries  Keyword queries  Boolean queries (using AND, OR, NOT)  Phrase queries  Proximity queries  Full document queries  Natural language questions
  • 15. Information retrieval models  An IR model governs how a document and a query are represented and how the relevance of a document to a user query is defined.  Main models:  Boolean model  Vector space model  Statistical language model  etc
  • 16. Elements in Information Retrieval  Processing of documents  Acceptance and processing of queries from users  Modelling, searching and ranking of documents  Presenting the search result
  • 17. Process of Retrieving Information
  • 18. Document Processing  Removing stopwords (appear frequently but no much meaning, e.g. “the”, “of”)  Stemming: recognize different words with the same grammar root  Noun groups: common combination of words  Indexing: for fast locating documents
  • 19. Processing Queries  Define a “language” for queries  Syntax, operators, etc.  Modify the queries for better search  Ignore meaningless parts: punctuations, conjunctives, etc.  Append synonyms e.g. e-business e-commerce  Emerging technology  Natural language queries
  • 20. Modelling/Ranking of Documents  Model the relevance (usefulness) of documents against the user query Q  The model represents a function Rel(Q,D)  D is a document, Q is a user query  Rel(Q,D) is the relevance of document D to query Q  There are many models available  Algebraic models  Probabilistic models  Set-theoretic models
  • 21. Basic Vector Space Model  Define a set of words and phases as terms  Text is represented by a vector of terms  User query is converted to a vector, too  Measure the vector “distance” between a document vector and the query vector business computer PowerPoint presentation user web Term Set We are doing an e-business presentation in PowerPoint. Document (1,0,1,1,0,0) computer presentation Query (0,1,0,1,0,0) 222222 )00()00()11()01()10()01(  Distance
  • 22. Probabilistic Models Overview Probabilistic Models  Ranking: the probability that a document is relevant to a query  Often denoted as Pr(R|D,Q)  In actual measure, log-odds transformation is used:  Probability values are estimated in applications ),|Pr( ),|Pr( log QDR QDR
  • 23. Information Retrieval Given  A source of textual documents  A well defined limited query (text based) Find  Sentences with relevant information  Extract the relevant information and ignore non- relevant information (important!)  Link related information and output in a predetermined format  Example: news stories, e-mails, web pages, photograph, music, statistical data, biomedical data, etc.  Information items can be in the form of text, image, video, audio, numbers, etc.
  • 24. Information Retrieval 2 basic information retrieval (IR) process: Browsing or navigation system  User skims document collection by jumping from one document to the other via hypertext or hypermedia links until relevant document found Classical IR system: question answering system  Query: question in natural language  Answer: directly extracted from text of document collection Text Based Information Retrieval: Information item (document)  Text format (written/spoken) or has textual description Information need (query)
  • 27. General concepts in IR Representation language  Typically a vector of d attribute values, e.g.,  set of color, intensity, texture, features characterizing images  word counts for text documents Data set D of N objects  Typically represented as an N x d matrix Query Q  User poses a query to search D  Query is typically expressed in the same representation language as the data, e.g.,  each text document is a set of words that occur in the document
  • 28. Query by Content Traditional DB query: exact matches  E.g. query Q = [level = MANAGER] AND [age < 30] or,  Boolean match on text  Query = “Irvine” AND “fun”: return all docs with “Irvine” and “fun”  Not useful when there are many matches  E.g., “data mining” in Google returns 60 million documents Query-by-content query: more general / less precise  E.g. what record is most similar to a query Q?  For text data, often called “information retrieval (IR)”  Can also be used for images, sequences, video, etc  Q can itself be an object (e.g., a document) or a shorter version (e.g., 1 word) Goal  Match query Q to the N objects in the database  Return a ranked list (typically) of the most similar/relevant objects in the data set D given Q
  • 29. Issues in Query by Content  What representation language to use  How to measure similarity between Q and each object in D  How to compute the results in real-time (for interactive querying)  How to rank the results for the user  Allowing user feedback (query modification)  How to evaluate and compare different IR algorithms/systems
  • 30. The Standard Approach  Fixed-length (d dimensional) vector representation  For query (1-by-d Q) and database (n-by-d X) objects  Use domain-specific higher-level features (vs raw)  Image “bag of features”: color (e.g. RGB), texture (e.g. Gabor, Fourier coeffs), …  Text “bag of words”: freq count for each word in each document, … Also known as the “vector-space” model  Compute distances between vectorized representation  Use k-NN to find k vectors in X closest to Q
  • 31. Text Retrieval  Document: book, paper, WWW page, ...  Term: word, word-pair, phrase, … (often: 50,000+)  query Q = set of terms, e.g., “data” + “mining”  NLP (natural language processing) too hard, so …  Want (vector) representation for text which  Retains maximum useful semantics  Supports efficient distance computes between docs and Q  Term weights  Boolean (e.g. term in document or not); “bag of words”  Real-valued (e.g. freq term in doc; relative to all docs) ...
  • 32. Practical Issues Tokenization  Convert document to word counts  word token = “any nonempty sequence of characters”  for HTML (etc) need to remove formatting Canonical forms, Stopwords, Stemming  Remove capitalization  Stopwords  Remove very frequent words (a, the, and…) – can use standard list  Can also remove very rare words  Stemming (next slide) Data representation  E.g., 3 column: <docid termid position>  Inverted index (faster)  List of sorted <termid docid> pairs: useful for finding docs containing certain terms  Equivalent to a sparse representation of term x doc matrix
  • 33. Intelligent Information Retrieval  Meaning of words  Synonyms “buy” / “purchase”  Ambiguity “bat” (baseball vs. mammal)  Order of words in the query  Hot dog stand in the amusement park  Hot amusement stand in the dog park
  • 34. Key Word Search  The technical goal for prediction is to classify new, unseen documents  The Prediction and IR are unified by the computation of similarity of documents  IR based on traditional keyword search through a search engine  So we should recognize that using a search engine is a special instance of prediction concept
  • 35. Key Word Search  We enter a key words to a search engine and expect relevant documents to be returned  These key words are words in a dictionary created from the document collection and can be viewed as a small document  So, we want to measuring how similar the new document (query) is to the documents in the collection
  • 36. Key Word Search  So, the notion of similarity is reduced to finding documents with the same keywords as posed to the search engine  But, the objective of the search engine is to rank the documents, not to assign a label  So we need additional techniques to break the expected ties (all retrieved documents match the search criteria)
  • 37. Key Word Search  In full text retrieval, all the words in each document are considered to be keywords.  We use the word term to refer to the words in a document  Information-retrieval systems typically allow query expressions formed using keywords and the logical connectives and, or, and not  Ands are implicit, even if not explicitly specified  Ranking of documents on the basis of estimated relevance to a query is critical  Relevance ranking is based on factors such as o Term frequency Frequency of occurrence of query keyword in document o Inverse document frequency How many documents the query keyword occurs in Fewer  give more importance to keyword o Hyperlinks to documents
  • 38. Relevance Ranking Using Terms TF-IDF (Term frequency/Inverse Document frequency) ranking:  Let n(d) = number of terms in the document d  n(d, t) = number of occurrences of term t in the document d.  Relevance of a document d to a term t  The log factor is to avoid excessive weight to frequent terms  Relevance of document to query Q n(d) n(d, t) 1 +TF (d, t) = log r (d, Q) =  TF (d, t) n(t)tQ
  • 39. Relevance Ranking Using Terms  Most systems add to the above model  Words that occur in title, author list, section headings, etc. are given greater importance  Words whose first occurrence is late in the document are given lower importance  Very common words such as “a”, “an”, “the”, “it” etc are eliminated Called stop words  Proximity: if keywords in query occur close together in the document, the document has higher importance than if they occur far apart  Documents are returned in decreasing order of relevance score  Usually only top few documents are returned, not all
  • 40. Similarity Based Retrieval  Similarity based retrieval - retrieve documents similar to a given document  Similarity may be defined on the basis of common words E.g. find k terms in A with highest TF (d, t ) / n (t ) and use these terms to find relevance of other documents.  Relevance feedback: Similarity can be used to refine answer set to keyword query  User selects a few relevant documents from those retrieved by keyword query, and system finds other documents similar to these  Vector space model: define an n-dimensional space, where n is the number of words in the document set.  Vector for document d goes from origin to a point whose i th coordinate is TF (d,t ) / n (t )  The cosine of the angle between the vectors of two
  • 41. Relevance Using Hyperlinks  Number of documents relevant to a query can be enormous if only term frequencies are taken into account  Using term frequencies makes “spamming” easy  E.g. a travel agency can add many occurrences of the words “travel” to its page to make its rank very high  Most of the time people are looking for pages from popular sites  Idea: use popularity of Web site (e.g. how many people visit it) to rank site pages that match given keywords
  • 42. Relevance Using Hyperlinks  Solution: use the number of hyperlinks to a site as a measure of the popularity or prestige of the site  Count only one hyperlink from each site  The popularity measure is for the site, not for an individual page; but most hyperlinks are to the root of a site. Also, the concept of “site” is difficult to define, since a URL prefix like cs.yale.edu contains many unrelated pages of varying popularity  Refinements  When computing prestige based on links to a site, give more weight to links from sites that themselves have higher prestige; the definition is circular, so set up and solve a system of simultaneous linear equations
  • 43. Relevance Using Hyperlinks  Connections to social-networking theories that rank the prestige of people  E.g. the president of the U.S.A. has high prestige since many people know him  Someone known by multiple prestigious people has high prestige  Hub and authority based ranking  A hub is a page that stores links to many pages (on a topic)  An authority is a page that contains actual information on a topic  Each page gets a hub prestige based on the prestige of the authorities that it points to  Each page gets an authority prestige based on the prestige of the hubs that point to it  Again, the prestige definitions are cyclic, and can be obtained by solving a system of simultaneous linear equations
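As a rough illustration of the hub/authority idea, here is a small iterative sketch in Python; the link graph (p1..p4) is invented, and the normalization step is one common way to keep the cyclic definitions from diverging:

```python
import math

# Made-up link graph: page -> pages it points to.
links = {"p1": ["p3", "p4"], "p2": ["p3"], "p3": [], "p4": ["p3"]}
hub = {p: 1.0 for p in links}
auth = {p: 1.0 for p in links}

for _ in range(20):                       # iterate until scores stabilize
    # authority score: sum of hub scores of pages that point to the page
    auth = {p: sum(hub[q] for q in links if p in links[q]) for p in links}
    # hub score: sum of authority scores of the pages the page points to
    hub = {p: sum(auth[t] for t in links[p]) for p in links}
    # normalize so the scores do not grow without bound
    for scores in (auth, hub):
        norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
        for p in scores:
            scores[p] /= norm

print("authorities:", {p: round(v, 2) for p, v in auth.items()})
print("hubs:       ", {p: round(v, 2) for p, v in hub.items()})
```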
  • 44. Nearest-Neighbor Methods  A method that compares vectors and measures similarity  In Prediction: the NNMs will collect the K most similar documents and then look at their labels  In IR: the NNMs will determine whether a satisfactory response to the search query has been found
  • 45. Measuring Similarity  These measures are used to examine how similar documents are, and the output is a numerical measure of similarity  Three increasingly complex measures:  Shared Word Count  Word Count and Bonus  Cosine Similarity
  • 46. Shared Word Count  Counts the shared words between documents  The words:  In IR we have a global dictionary where all potential words are included, with the exception of stopwords  In Prediction it is better to preselect the dictionary relative to the label
  • 47. Computing similarity by Shared words  Look at all the words in the new document  For each document in the collection, count how many of these words appear  No weighting is used, just a simple count  The dictionary has true key words (weak words removed)  The results of this measure are clearly intuitive  No one will question why a document was retrieved
  • 48. Computing similarity by Shared words  Each document is represented as a vector of key words (zeros and ones)  The similarity of 2 documents is the inner (dot) product of the 2 vectors  If 2 documents share a key word then this word is counted (1*1)  The performance of this measure depends mainly on the dictionary used
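A tiny sketch of this dot-product count, using a hypothetical five-word dictionary:

```python
# Shared Word Count: documents as 0/1 keyword vectors; similarity = dot product.
def shared_word_count(vec_a, vec_b):
    """Number of dictionary words occurring in both documents."""
    return sum(a * b for a, b in zip(vec_a, vec_b))

# Hypothetical 5-word dictionary; 1 means the word appears in the document.
doc   = [1, 0, 1, 1, 0]
query = [1, 1, 1, 0, 0]
print(shared_word_count(doc, query))   # -> 2 shared keywords
```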
  • 49. Computing similarity by Shared words  Shared words is an exact search  Either a document is retrieved or it is not  No weighting can be done on terms  In the query “A and B”, you can’t specify that A is more important than B  Every retrieved document is treated equally
  • 50. Word Count and Bonus  TF – term frequency  Number of times a term occurs in a document  DF – document frequency  Number of documents that contain the term  IDF – inverse document frequency  = log(N / df)  N: the total number of documents  A vector is a numerical representation for a point in a multi-dimensional space  (x1, x2, … , xn)  The dimensions of the space need to be defined  A measure on the space needs to be defined
  • 51. Word Count and Bonus  Each indexing term is a dimension  Each document is a vector Di = (ti1, ti2, ti3, ..., tik)  Similarity of document Di to the new document is defined as Sim(Di) = Σ j=1..K w(j) * (1 + 1/df(j)), where w(j) = 1 if word j occurs in both documents and 0 otherwise, and K = number of words in the dictionary
  • 52. Word Count and Bonus  The bonus 1/df(j) is a variant of idf; thus, if the word occurs in many documents, the bonus is small  This measure is better than the Shared Word Count because it discriminates between weakly and strongly predictive words
  • 53. Word Count and Bonus  Computing Similarity Scores with Bonus  A document space is defined by five terms: hardware, software, user, information, index  The query is “hardware, user, information”  Labeled spreadsheet (document vectors): 10101, 11000, 00010, 10001, 00100, 01010, 11001  New document (query) vector: 1101  Similarity scores with bonus for the seven documents: 2.83, 1.33, 0, 1.33, 1.5, 1.33, 2.67
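A small sketch of the bonus measure over a hypothetical collection of 0/1 keyword vectors; each shared word j contributes 1 + 1/df(j), as defined above:

```python
# Word Count and Bonus: each shared word j contributes 1 + 1/df(j),
# so rare (more predictive) words count for more than common ones.
def bonus_similarity(doc_vec, query_vec, df):
    return sum(1 + 1 / df[j]
               for j, (d, q) in enumerate(zip(doc_vec, query_vec)) if d and q)

# Hypothetical collection of 0/1 keyword vectors over a 5-word dictionary.
collection = [
    [1, 0, 1, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 0],
]
# df(j): number of documents in the collection containing word j.
df = [sum(doc[j] for doc in collection) for j in range(5)]
query = [1, 0, 1, 1, 0]
for i, doc in enumerate(collection):
    print(f"doc {i}: {bonus_similarity(doc, query, df):.2f}")
```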
  • 54. Cosine Similarity The Vector Space A document is represented as a vector: (W1, W2, … , Wn)  Binary:  Wi = 1 if the corresponding term is in the document  Wi = 0 if the term is not in the document  TF (Term Frequency):  Wi = tfi, where tfi is the number of times the term occurred in the document  TF*IDF (with Inverse Document Frequency):  Wi = tfi * idfi = tfi * (1 + log(N / dfi)), where dfi is the number of documents containing term i and N is the total number of documents in the collection
  • 55. Cosine Similarity The Vector Space  vec(D) = (w1, w2, ..., wt)  Sim(d1, d2) = cos(θ) = (vec(d1) · vec(d2)) / (|d1| * |d2|) = Σ j wd1(j) * wd2(j) / (|d1| * |d2|)  Since w(j) > 0 whenever j ∈ di, we have 0 <= Sim(d1, d2) <= 1  A document is retrieved even if it matches the query terms only partially
  • 56. Cosine Similarity  How to compute the weight wj?  A good weight must take into account two effects:  quantification of intra-document contents (similarity): the tf factor, the term frequency within a document  quantification of inter-document separation (dissimilarity): the idf factor, the inverse document frequency  wj = tf(j) * idf(j)
  • 57. Cosine Similarity  TF in the given document shows how important the term is in this document (it makes words that are frequent in the document more important)  IDF makes words that are rare across all documents more important  A high weight in a tf-idf ranking scheme is therefore reached by a high term frequency in the given document and a low document frequency of the term across the rest of the collection  Term weights in a document affect the position of the document vector
  • 58. Cosine Similarity TF-IDF definitions:  fik: number of occurrences of term ti in document Dk  tfik = fik / max(fik): normalized term frequency  dfk: number of documents which contain tk  idfk = log(N / dfk), where N is the total number of documents  wik = tfik * idfk: term weight  Intuition: rare words get more weight, common words less weight
  • 59. Example TF-IDF  Given a document containing terms with the given frequencies: Kent = 3; Ohio = 2; University = 1, and assume a collection of 10,000 documents in which the document frequencies of these terms are: Kent = 50; Ohio = 1300; University = 250. THEN (using the natural log) Kent: tf = 3/3; idf = log(10000/50) = 5.3; tf-idf = 5.3 Ohio: tf = 2/3; idf = log(10000/1300) = 2.0; tf-idf = 1.3 University: tf = 1/3; idf = log(10000/250) = 3.7; tf-idf = 1.2
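The slide's arithmetic can be checked with a few lines of Python (assuming tf is normalized by the largest raw frequency in the document and idf uses the natural log); the results agree up to rounding (Ohio comes out as roughly 1.3–1.4):

```python
import math

N = 10_000
freq = {"Kent": 3, "Ohio": 2, "University": 1}   # raw term frequencies
df   = {"Kent": 50, "Ohio": 1300, "University": 250}  # document frequencies

for term, f in freq.items():
    tf = f / max(freq.values())          # normalized term frequency
    idf = math.log(N / df[term])         # inverse document frequency
    print(f"{term:10s} tf={tf:.2f} idf={idf:.1f} tf-idf={tf * idf:.1f}")
```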
  • 60. Cosine Similarity  Cosine  W(j) = tf(j) * idf(j)  idf(j) = log(N / df(j))  cos(d1, d2) = Σ j wd1(j) * wd2(j) / ( sqrt(Σ j wd1(j)²) * sqrt(Σ j wd2(j)²) )
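A compact sketch of cosine similarity over tf*idf weights, following the idf definition on this slide; the toy documents and query are invented for illustration:

```python
import math
from collections import Counter

def tfidf_vector(tokens, docs):
    """w(j) = tf(j) * idf(j), with idf(j) = log(N / df(j))."""
    N = len(docs)
    return {t: tf * math.log(N / sum(1 for d in docs if t in d))
            for t, tf in Counter(tokens).items()}

def cosine(v1, v2):
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

docs = ["data mining of text data".split(),
        "text retrieval and text mining".split(),
        "databases store structured records".split()]
query = "text mining".split()          # query terms assumed to occur in the collection
qvec = tfidf_vector(query, docs)
for d in docs:
    print(round(cosine(qvec, tfidf_vector(d, docs)), 3), " ".join(d))
```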
  • 61. Why Mine the Web? Enormous wealth of textual information on the Web.  Book/CD/Video stores (e.g., Amazon)  Restaurant information (e.g., Zagats)  Car prices (e.g., Carpoint) Lots of data on user access patterns  Web logs contain sequences of URLs accessed by users Possible to retrieve “previously unknown” information  People who ski also frequently break their legs.  Restaurants that serve seafood in California are likely to be outside San Francisco
  • 62. Mining the Web  Diagram: a Web spider crawls the documents source; the IR / IE system takes a query and returns ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...)
  • 63. Web-based Retrieval Additional information in Web documents  Link structure (e.g., PageRank)  HTML structure  Link/anchor text  Title text, etc.  Can be leveraged for better retrieval Additional issues in Web retrieval  Scalability: the size of the “corpus” is huge (10 to 100 billion docs)  Constantly changing:  Crawlers to update document-term information  Need schemes for efficiently updating indices  Evaluation is more difficult:  How is relevance measured?  How many documents in total are relevant?
  • 64. Probabilistic Approaches to Retrieval Compute P(q | d) for each document d  Intuition: the relevance of d to q is related to how likely it is that q was generated by d, or “how likely is q under a model for d?” Simple model for P(q | d)  Pe(q | d) = empirical frequency of words in document d  “tuned” to d, but likely to be sparse (will contain many zeros) 2-stage probabilistic model (or linear interpolation model)  P(q | d) = λ Pe(q | d) + (1 - λ) Pe(q | corpus)  λ can be fixed, e.g., tuned to a particular data set  Or it can depend on d, e.g., λ = nd / (nd + m), where nd = number of words in doc d and m = a constant (e.g., 1000) Can also use more sophisticated models for P(q | d), e.g., topic-based models
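A rough sketch of the interpolated query-likelihood model under these definitions; the documents, query, and the small value of m are invented to suit the toy collection:

```python
from collections import Counter

def query_likelihood(query_tokens, doc_tokens, corpus_tokens, m=1000):
    """P(q|d) = product over query words of [lam*Pe(w|d) + (1-lam)*Pe(w|corpus)],
    with lam = n_d / (n_d + m)."""
    n_d = len(doc_tokens)
    lam = n_d / (n_d + m)
    doc_counts, corpus_counts = Counter(doc_tokens), Counter(corpus_tokens)
    p = 1.0
    for w in query_tokens:
        p_doc = doc_counts[w] / n_d                      # Pe(w | d)
        p_corpus = corpus_counts[w] / len(corpus_tokens) # Pe(w | corpus)
        p *= lam * p_doc + (1 - lam) * p_corpus
    return p

docs = ["text mining finds patterns in text".split(),
        "databases answer structured queries".split()]
corpus = [w for d in docs for w in d]
query = "text patterns".split()
for d in docs:
    print(round(query_likelihood(query, d, corpus, m=10), 6), " ".join(d))
```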
  • 65. Information Retrieval  Web-Based Document Search  Page Rank  Anchor Text  Document Matching  Inverted Lists
  • 66. Page Rank  PR(A): the page rank of page A  C(T): the number of outgoing links from page T  d: minimum value assigned to any page  Tj: a page pointing to A  PR(A) = d + (1 - d) * Σ j ( PR(Tj) / C(Tj) )
  • 67. Algorithm of Page Rank 1. Use the PageRank Equation to compute PageRank for each page in the collection using latest PageRanks of pages. 2. Repeat step 1 until no significant change to any PageRank.
  • 68. Example Initial values: PR(A)=PR(B)=PR(C)=1, d=0.1 In the first iteration:  PR(A)=0.1+0.9*(PR(B)+PR(C)) =0.1+0.9*(1+1) =1.9  PR(B)=0.1+0.9*(PR(A)/2) =0.1+0.9*(1.9/2) =0.95  PR(C)=0.1+0.9*(PR(A)/2) =0.1+0.9*(1.9/2) =0.95 After convergence: PR(A)=1.48, PR(B)=0.76, PR(C)=0.76
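The same example can be reproduced with a short Python loop (link structure assumed from the example: A links to B and C, and B and C each link back to A); the scores settle near the converged values on the slide:

```python
# PageRank iteration for the slide's 3-page example with d = 0.1.
links_to = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}   # outgoing links per page
d = 0.1
pr = {p: 1.0 for p in links_to}                        # initial PageRanks

for _ in range(50):                                    # repeat until values stabilize
    for page in links_to:
        incoming = [q for q, outs in links_to.items() if page in outs]
        # PR(page) = d + (1 - d) * sum of PR(T)/C(T) over pages T linking here,
        # using the latest PageRanks already computed in this sweep.
        pr[page] = d + (1 - d) * sum(pr[q] / len(links_to[q]) for q in incoming)

print({p: round(v, 2) for p, v in pr.items()})         # roughly A: 1.47, B: 0.76, C: 0.76
```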
  • 69. Anchor Text  The anchor text is the visible, clickable text in a hyperlink.  For example: <a href=“http://www.wikipedia.org”>Wikipedia</a>  The anchor text is Wikipedia; the complex URL http://www.wikipedia.org/ displays on the web page simply as Wikipedia, contributing to a clean, easy-to-read text or document.
  • 70. Anchor Text  Anchor text usually gives the user relevant descriptive or contextual information about the content of the link’s destination.  The anchor text may or may not be related to the actual text of the URL of the link.  The words contained in the Anchor Text can determine the ranking that the page will receive by search engines.
  • 71. Common Misunderstanding  Webmasters sometimes tend to misunderstand anchor text.  Instead of turning appropriate words inside a sentence into a clickable link, webmasters frequently insert extra text such as “click here”.
  • 72. Anchor Text  This proper method of linking is beneficial not only to users, but also to the webmasters as anchor text holds significant weight in search engine ranking.  Most search engine optimization experts recommend against using “click here” to designate a link.
  • 73. Document Matching  An arbitrarily long document is the query, not just a few key words.  But the goal is still to rank and output an ordered list of relevant documents.  The most similar documents are found using the measures described earlier.  Search engines and document matchers are not focused on classification of new documents.  Their primary goal is to retrieve the most relevant documents.
  • 74. Generalization of searching • Matching a document to a collection of documents looks like a tedious and expensive operation. • Even for a short query, comparison to every document in a large collection implies a relatively intensive computation.
  • 75. Example of document matching  Consider an online help desk, where a complete description of a problem is submitted.  That document could be matched to stored documents, hopefully finding descriptions of similar problems and solutions without having the user experiment with numerous key word searches.
  • 76. Inverted Lists  Instead of documents pointing to words, a list of words pointing to documents is the primary internal representation for processing queries and matching documents.
  • 77. Inverted Lists  The inverted list is the key to the efficiency of information retrieval systems.  The inverted list has also helped make nearest-neighbor methods a practical possibility for prediction.
  • 78. Example If the query contains words 100 and 200: 1) First process the inverted list W(100), updating the similarity score S(i) of each document i on the list: S(1)=0+1, S(2)=0+1, … 2) Then process W(200) in the same way: S(2)=1+1, …
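A minimal sketch of an inverted list and the query-scoring loop above, with a made-up three-document collection:

```python
from collections import defaultdict

# Build an inverted list: word -> set of document ids containing it.
docs = {1: "text mining and retrieval".split(),
        2: "mining massive data sets".split(),
        3: "relational database systems".split()}

inverted = defaultdict(set)
for doc_id, tokens in docs.items():
    for w in tokens:
        inverted[w].add(doc_id)

def score(query_tokens):
    """Walk only the inverted lists of the query words; documents sharing
    no word with the query are never touched."""
    s = defaultdict(int)
    for w in query_tokens:
        for doc_id in inverted.get(w, ()):
            s[doc_id] += 1
    return dict(s)

print(score("data mining".split()))   # doc 2 scores 2, doc 1 scores 1
```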
  • 79. Evaluating IE Accuracy  Always evaluate performance on independent, manually-annotated test data not used during system development.  Measure for each test document:  Total number of correct extractions in the solution template: N  Total number of slot/value pairs extracted by the system: E  Number of extracted slot/value pairs that are correct (i.e. in the solution template): C  Compute the average value of metrics adapted from IR:  Recall = C/N  Precision = C/E  F-Measure = harmonic mean of recall and precision
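A small sketch of these metrics on hypothetical slot/value pairs:

```python
def evaluate_extractions(system_pairs, gold_pairs):
    """Recall = C/N, Precision = C/E, F = harmonic mean of the two."""
    correct = len(set(system_pairs) & set(gold_pairs))                     # C
    precision = correct / len(system_pairs) if system_pairs else 0.0       # C/E
    recall = correct / len(gold_pairs) if gold_pairs else 0.0              # C/N
    f = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f

# Hypothetical slot/value pairs for one test document.
gold   = [("speaker", "J. Smith"), ("time", "3pm"), ("room", "EA170")]
system = [("speaker", "J. Smith"), ("time", "4pm")]
print(evaluate_extractions(system, gold))   # (0.5, 0.333..., 0.4)
```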
  • 80. Related Types of Data Sparse high-dimensional data sets with counts, like document-term matrices, are common in data mining, e.g.,  “transaction data”  Rows = customers; Columns = products  Web log data (ignoring sequence)  Rows = Web surfers; Columns = Web pages Recommender systems  Given some products from user i, suggest other products to the user  e.g., Amazon.com’s book recommender  Collaborative filtering:  use k-nearest-individuals as the basis for predictions  Many similarities with querying and information retrieval
  • 81. What is a Good IR System?  Minimize the overhead of a user locating needed information  Fast, accurate, comprehensive, easy to use, …  Objective measures  Precision: P = (No. of relevant documents retrieved) / (No. of all documents retrieved)  Recall: R = (No. of relevant documents retrieved) / (No. of all relevant documents in the data)
  • 82. Measuring Retrieval Effectiveness  Information-retrieval systems save space by using index structures that support only approximate retrieval. This may result in:  false negative (false drop) - some relevant documents may not be retrieved  false positive - some irrelevant documents may be retrieved For many applications a good index should not permit any false drops, but may permit a few false positives.  Relevant performance metrics:  precision - what percentage of the retrieved documents are relevant to the query  recall - what percentage of the documents relevant to the query were retrieved
  • 83. Measuring Retrieval Effectiveness Recall vs. precision tradeoff:  Can increase recall by retrieving many documents (down to a low level of relevance ranking), but many irrelevant documents would be fetched, reducing precision Measures of retrieval effectiveness:  Recall as a function of number of documents fetched, or  Precision as a function of recall Equivalently, as a function of number of documents fetched  E.g. “precision of 75% at recall of 50%, and 60% at a recall of 75%”
  • 84. Applications of Information Retrieval  Classic application  Library catalogue, e.g. the UofC library catalogue  Current applications  Digital libraries, e.g. http://www.acm.org/dl  WWW search engines, e.g. http://www.google.com
  • 85. Other applications of IE Systems  Job resumes  Seminar announcements  Molecular biology information from MEDLINE, e.g., extracting gene–drug interactions from biomedical texts  Summarizing medical patient records by extracting diagnoses, symptoms, physical findings, test results  Gathering earnings, profits, board members, etc. [corporate information] from the web and company reports  Verification of construction industry specifications documents (are the quantities correct/reasonable?)  Extraction of political/economic/business changes from newspaper articles
  • 86. Conclusion 1. Information retrieval methods are specialized nearest-neighbor methods, which are well-known prediction methods. 2. IR methods typically process unlabeled data and order and display the retrieved documents. 3. The IR methods have no training and induce no new rules for classification.