Text Data Mining
PART I - IR
Text Mining Applications
Information Retrieval
 Query-based search of large text archives, e.g., the Web
Text Classification
 Automated assignment of topics to Web pages, e.g.,
Yahoo, Google
 Automated classification of email into spam and non-
spam
Text Clustering
 Automated organization of search results in real-time into
categories
 Discovering clusters and trends in technical literature (e.g.,
CiteSeer)
Information Extraction
 Extracting standard fields from free-text
 Extracting names and places from reports, newspapers
Information Retrieval - Definition
 Information retrieval (IR) is finding material (usually
documents) of an unstructured nature (usually text) that
satisfies an information need from within large collections
(usually stored on computers).
 Information Retrieval
 Deals with the representation, storage, organization of, and
access to information items
– Modern Information Retrieval
 General Objective: Minimize the overhead of a user
locating needed information
Information Retrieval Is Not
Database
Information Retrieval
 Process stored
documents
 Search documents
relevant to user queries
 No standard for how queries should be
expressed
 Query results may tolerate errors or
inaccurate items
Database
 Normally no processing
of data
 Search records
matching queries
 Standard: SQL
language
 Query results should
have 100% accuracy.
Zero tolerance for errors
Information Retrieval Is Not
Data Mining
Information Retrieval
 User target: Existing
relevant data entries
Data Mining
 User target: Knowledge
(rules, etc.) implied by
data (not the individual
data entries themselves)
• Many techniques and models
are shared and related
• E.g. classification of
documents
Is Information Retrieval a Form of
Text Mining?
What is the principal computer specialty for processing
documents and text??
 Information Retrieval (IR)
 The task of IR is to retrieve relevant documents in
response to a query.
 The fundamental technique of IR is measuring
similarity
 A query is examined and transformed into a vector of
values to be compared with stored documents
Is Information Retrieval a Form of
Text Mining?
 In the prediction problem, similar documents are
retrieved and their properties measured, i.e., the class
labels are counted to see which label should be assigned to a
new document
 The objectives of prediction can be posed in the form
of an IR model where documents are retrieved that are
relevant to a query; here the query is a new document
Key Steps in Information Retrieval
1. Specify query
2. Search the document collection
3. Return a subset of relevant documents

Key Steps in Predictive Text Mining
1. Examine the document collection
2. Learn classification criteria
3. Apply the criteria to new documents
Predicting from Retrieved Documents
1. Specify query vector
2. Match against the document collection (the key steps in IR)
3. Get the subset of relevant documents
4. Examine document properties (simple criteria such as the
documents' labels)
Information Retrieval (IR)
 Conceptually, IR is the study of finding needed
information. I.e., IR helps users find information that
matches their information needs.
 Expressed as queries
 Historically, IR is about document retrieval, emphasizing
document as the basic unit.
 Finding documents relevant to user queries
 Technically, IR studies the acquisition, organization,
storage, retrieval, and distribution of information.
Information Retrieval Cycle
[Diagram: resource/source selection → query formulation → search → selection of results → examination of documents → delivery of information, with feedback loops for source reselection and for system, vocabulary, concept, and document discovery.]
Abstract IR Architecture
[Diagram: documents and the query each pass through a representation function, yielding a document representation (stored in an index, built offline) and a query representation (built online); a comparison function matches the two and returns hits.]
IR Architecture
IR Queries
 Keyword queries
 Boolean queries (using AND, OR, NOT)
 Phrase queries
 Proximity queries
 Full document queries
 Natural language questions
Information retrieval models
 An IR model governs how a document and a query are
represented and how the relevance of a document to a
user query is defined.
 Main models:
 Boolean model
 Vector space model
 Statistical language model
 etc
Elements in Information Retrieval
 Processing of documents
 Acceptance and processing of queries from
users
 Modelling, searching and ranking of
documents
 Presenting the search result
Process of Retrieving Information
Document Processing
 Removing stopwords (words that appear frequently but
carry little meaning, e.g., “the”, “of”)
 Stemming: recognize different words with the
same grammatical root
 Noun groups: common combination of words
 Indexing: for fast locating documents
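A minimal sketch of these preprocessing steps, assuming a toy stopword list and a naive suffix-stripping stemmer (a real system would use a proper stemmer such as Porter's and a fuller stopword list):

```python
# Minimal document-processing sketch: tokenization, stopword removal,
# and a naive suffix-stripping "stemmer" (illustrative only).
import re

STOPWORDS = {"the", "of", "a", "an", "and", "or", "in", "to", "is", "for"}

def stem(word):
    # Crude stemming: strip a few common suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())       # tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]     # remove stopwords
    return [stem(t) for t in tokens]                       # crude stemming

print(preprocess("Indexing documents for fast retrieval of information"))
# -> ['index', 'document', 'fast', 'retrieval', 'information']
```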
Processing Queries
 Define a “language” for queries
 Syntax, operators, etc.
 Modify the queries for better search
 Ignore meaningless parts: punctuations,
conjunctives, etc.
 Append synonyms
e.g., “e-business” ↔ “e-commerce”
 Emerging technology
 Natural language queries
Modelling/Ranking of Documents
 Model the relevance (usefulness) of documents
against the user query Q
 The model represents a function Rel(Q,D)
 D is a document, Q is a user query
 Rel(Q,D) is the relevance of document D to query
Q
 There are many models available
 Algebraic models
 Probabilistic models
 Set-theoretic models
Basic Vector Space Model
 Define a set of words
and phases as terms
 Text is represented by
a vector of terms
 User query is
converted to a vector,
too
 Measure the vector
“distance” between a
document vector and
the query vector
Term set: business, computer, PowerPoint, presentation, user, web

Document: “We are doing an e-business presentation in PowerPoint.”
→ (1, 0, 1, 1, 0, 0)

Query: “computer presentation”
→ (0, 1, 0, 1, 0, 0)

Distance = sqrt( (1-0)² + (0-1)² + (1-0)² + (1-1)² + (0-0)² + (0-0)² ) = sqrt(3)
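A minimal sketch reproducing this example, assuming simple whitespace tokenization:

```python
# Vector-space sketch of the slide's example: binary term vectors over a
# fixed term set, compared with Euclidean distance.
import math

TERMS = ["business", "computer", "powerpoint", "presentation", "user", "web"]

def to_vector(text):
    words = set(text.lower().replace("-", " ").replace(".", "").split())
    return [1 if term in words else 0 for term in TERMS]

doc   = to_vector("We are doing an e-business presentation in PowerPoint.")
query = to_vector("computer presentation")

dist = math.sqrt(sum((d - q) ** 2 for d, q in zip(doc, query)))
print(doc, query, round(dist, 3))   # [1,0,1,1,0,0] [0,1,0,1,0,0] 1.732
```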
Probabilistic Models Overview
Probabilistic Models
 Ranking: the probability that a document is
relevant to a query
 Often denoted as Pr(R|D,Q)
 In actual measure, a log-odds transformation is
used:
log [ Pr(R | D, Q) / Pr(¬R | D, Q) ]
 Probability values are estimated in applications
Information Retrieval
Given
 A source of textual documents
 A well defined limited query (text based)
Find
 Sentences with relevant information
 Extract the relevant information and ignore non-
relevant information (important!)
 Link related information and output in a predetermined
format
 Example: news stories, e-mails, web pages, photograph,
music, statistical data, biomedical data, etc.
 Information items can be in the form of text, image, video,
audio, numbers, etc.
Information Retrieval
Two basic information retrieval (IR) processes:
Browsing or navigation system
 The user skims the document collection by jumping from one
document to another via hypertext or hypermedia
links until a relevant document is found
Classical IR system: question answering system
 Query: question in natural language
 Answer: directly extracted from text of document
collection
Text Based Information Retrieval:
Information item (document)
 Text format (written/spoken) or has textual
description
Information need (query)
Classical IR System Process
General concepts in IR
Representation language
 Typically a vector of d attribute values, e.g.,
 set of color, intensity, texture, features characterizing
images
 word counts for text documents
Data set D of N objects
 Typically represented as an N x d matrix
Query Q
 User poses a query to search D
 Query is typically expressed in the same representation
language as the data, e.g.,
 each text document is a set of words that occur in the
document
Query by Content
Traditional DB query: exact matches
 E.g. query Q = [level = MANAGER] AND [age < 30] or,
 Boolean match on text
 Query = “Irvine” AND “fun”: return all docs with “Irvine” and “fun”
 Not useful when there are many matches
 E.g., “data mining” in Google returns 60 million documents
Query-by-content query: more general / less precise
 E.g. what record is most similar to a query Q?
 For text data, often called “information retrieval (IR)”
 Can also be used for images, sequences, video, etc
 Q can itself be an object (e.g., a document) or a shorter version
(e.g., 1 word)
Goal
 Match query Q to the N objects in the database
 Return a ranked list (typically) of the most similar/relevant objects
in the data set D given Q
Issues in Query by Content
 What representation language to use
 How to measure similarity between Q and each object
in D
 How to compute the results in real-time (for interactive
querying)
 How to rank the results for the user
 Allowing user feedback (query modification)
 How to evaluate and compare different IR
algorithms/systems
The Standard Approach
 Fixed-length (d dimensional) vector representation
 For query (1-by-d Q) and database (n-by-d X) objects
 Use domain-specific higher-level features (vs raw)
 Image
“bag of features”: color (e.g. RGB), texture (e.g.
Gabor, Fourier coeffs), …
 Text
“bag of words”: freq count for each word in each
document, …
Also known as the “vector-space” model
 Compute distances between vectorized representation
 Use k-NN to find k vectors in X closest to Q
Text Retrieval
 Document: book, paper, WWW page, ...
 Term: word, word-pair, phrase, … (often: 50,000+)
 query Q = set of terms, e.g., “data” + “mining”
 NLP (natural language processing) too hard, so …
 Want (vector) representation for text which
 Retains maximum useful semantics
 Supports efficient distance computes between docs
and Q
 Term weights
 Boolean (e.g. term in document or not); “bag of
words”
 Real-valued (e.g. freq term in doc; relative to all
docs) ...
Practical Issues
Tokenization
 Convert document to word counts
 word token = “any nonempty sequence of characters”
 for HTML (etc) need to remove formatting
Canonical forms, Stopwords, Stemming
 Remove capitalization
 Stopwords
 Remove very frequent words (a, the, and…) – can use standard
list
 Can also remove very rare words
 Stemming (next slide)
Data representation
 E.g., 3 column: <docid termid position>
 Inverted index (faster)
 List of sorted <termid docid> pairs: useful for finding docs
containing certain terms
 Equivalent to a sparse representation of term x doc matrix
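A minimal sketch of the inverted-index representation described above, with hypothetical toy documents:

```python
# Inverted-index sketch: map each term to the postings (docid, position)
# of the documents containing it. This is a sparse representation of the
# term x document matrix.
from collections import defaultdict

docs = {
    1: "data mining finds patterns in data",
    2: "text mining applies data mining to text",
}

inverted = defaultdict(list)            # term -> [(docid, position), ...]
for docid, text in docs.items():
    for pos, term in enumerate(text.split()):
        inverted[term].append((docid, pos))

# Documents containing a given term are read directly from its postings list.
print(sorted({docid for docid, _ in inverted["mining"]}))   # [1, 2]
```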
Intelligent Information Retrieval
 Meaning of words
 Synonyms “buy” / “purchase”
 Ambiguity “bat” (baseball vs. mammal)
 Order of words in the query
 Hot dog stand in the amusement park
 Hot amusement stand in the dog park
Key Word Search
 The technical goal for prediction is to classify new,
unseen documents
 Prediction and IR are unified by the computation of
document similarity
 IR is based on traditional keyword search through a
search engine
 So using a search engine can be recognized as a
special instance of the prediction concept
Key Word Search
 We enter key words into a search engine and expect
relevant documents to be returned
 These key words come from a dictionary created
from the document collection and can themselves be viewed as a
small document
 So, we want to measure how similar the new
document (the query) is to the documents in the collection
Key Word Search
 So, the notion of similarity is reduced to finding
documents with the same keywords as posed to the
search engine
 But, the objective of the search engine is to rank the
documents, not to assign a label
 So we need additional techniques to break the expected
ties (all retrieved documents match the search criteria)
Key Word Search
 In full text retrieval, all the words in each document are
considered to be keywords.
 We use the word term to refer to the words in a document
 Information-retrieval systems typically allow query expressions
formed using keywords and the logical connectives and, or,
and not
 Ands are implicit, even if not explicitly specified
 Ranking of documents on the basis of estimated relevance to a
query is critical
 Relevance ranking is based on factors such as
o Term frequency
Frequency of occurrence of query keyword in
document
o Inverse document frequency
How many documents the query keyword occurs in
Fewer documents → give more importance to the keyword
o Hyperlinks to documents
Relevance Ranking Using Terms
TF-IDF (Term frequency/Inverse Document frequency)
ranking:
 Let n(d) = number of terms in the document d
 n(d, t) = number of occurrences of term t in the
document d.
 Relevance of a document d to a term t
 The log factor is to avoid excessive weight to
frequent terms
 Relevance of document to query Q
TF(d, t) = log( 1 + n(d, t) / n(d) )

r(d, Q) = Σt∈Q TF(d, t) / n(t),
where n(t) is the number of documents that contain term t
Relevance Ranking Using Terms
 Most systems add to the above model
 Words that occur in title, author list, section headings,
etc. are given greater importance
 Words whose first occurrence is late in the document
are given lower importance
 Very common words such as “a”, “an”, “the”, “it” etc
are eliminated
Called stop words
 Proximity: if keywords in query occur close together
in the document, the document has higher importance
than if they occur far apart
 Documents are returned in decreasing order of relevance
score
 Usually only top few documents are returned, not all
Similarity Based Retrieval
 Similarity based retrieval - retrieve documents similar to a
given document
 Similarity may be defined on the basis of common words
E.g., find the k terms in the given document with the highest
TF(d, t) / n(t) and use these terms to find the relevance of other
documents.
 Relevance feedback: Similarity can be used to refine answer
set to keyword query
 User selects a few relevant documents from those
retrieved by keyword query, and system finds other
documents similar to these
 Vector space model: define an n-dimensional space, where n
is the number of words in the document set.
 Vector for document d goes from the origin to a point whose
i-th coordinate is TF (d, t) / n (t)
 The cosine of the angle between the vectors of two
documents is used as a measure of their similarity
Relevance Using Hyperlinks
 Number of documents relevant to a query can be enormous
if only term frequencies are taken into account
 Using term frequencies makes “spamming” easy
 E.g. a travel agency can add many occurrences of the
words “travel” to its page to make its rank very high
 Most of the time people are looking for pages from popular
sites
 Idea: use popularity of Web site (e.g. how many people visit
it) to rank site pages that match given keywords
Relevance Using Hyperlinks
 Solution: use number of hyperlinks to a site as a measure of the
popularity or prestige of the site
 Count only one hyperlink from each site
 Popularity measure is for site, not for individual page
But, most hyperlinks are to root of site
Also, concept of “site” difficult to define since a URL
prefix like cs.yale.edu contains many unrelated pages
of varying popularity
 Refinements
 When computing prestige based on links to a site, give more
weight to links from sites that themselves have higher
prestige
Definition is circular
Set up and solve system of simultaneous linear
equations
Relevance Using Hyperlinks
 Connections to social networking theories that ranked prestige
of people
 E.g. the president of the U.S.A has a high prestige since
many people know him
 Someone known by multiple prestigious people has high
prestige
 Hub and authority based ranking
 A hub is a page that stores links to many pages (on a topic)
 An authority is a page that contains actual information on a
topic
 Each page gets a hub prestige based on prestige of
authorities that it points to
 Each page gets an authority prestige based on prestige of
hubs that point to it
 Again, prestige definitions are cyclic, and can be obtained by
solving a system of simultaneous linear equations
Nearest-Neighbor Methods
 A method that compares vectors and measures
similarity
 In Prediction: the NNMs will collect the K most similar
documents and then look at their labels
 In IR: the NNMs will determine whether a satisfactory
response to the search query has been found
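A minimal nearest-neighbor sketch, assuming a simple shared-word-count similarity (one of the measures described in the following slides) and hypothetical labeled documents:

```python
# Nearest-neighbor sketch: rank documents by similarity to a new document;
# for prediction, take a majority vote over the labels of the k most similar.
from collections import Counter

labeled_docs = [
    ("spam",     {"win", "money", "prize", "click"}),
    ("spam",     {"free", "money", "offer"}),
    ("not_spam", {"meeting", "project", "report"}),
    ("not_spam", {"project", "budget", "report", "deadline"}),
]

def shared_words(a, b):
    return len(a & b)                      # number of words in common

def knn_predict(new_doc_words, k=3):
    ranked = sorted(labeled_docs,
                    key=lambda item: shared_words(item[1], new_doc_words),
                    reverse=True)          # in IR, this ranked list is the answer
    votes = Counter(label for label, _ in ranked[:k])
    return votes.most_common(1)[0][0]      # in prediction, vote over labels

print(knn_predict({"free", "prize", "money"}))   # -> 'spam'
```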
Measuring Similarity
 These measures used to examine how documents are
similar and the output is a numerical measure of
similarity
 Three increasingly complex measures:
 Shared Word Count
 Word Count and Bonus
 Cosine Similarity
Shared Word Count
 Counts the shared words between documents
 The words:
 In IR we have a global dictionary where all
potential words will be included, with the
exception of stopwords.
 In Prediction it is better to preselect the dictionary
relative to the label
Computing similarity by Shared
words
 Look at all words in the new document
 For each document in the collection count how many
of these words appear
 No weighting is used, just a simple count
 The dictionary contains true key words (weakly
predictive words removed)
 The results of this measure are clearly intuitive
 No one will question why a document was
retrieved
Computing similarity by Shared
words
 Each document represented as a vector of key words
(zeros and ones)
 The similarity of two documents is the inner product of the two
vectors
 If both documents contain the same key word, that word
is counted (1 × 1)
 The performance of this measure depends mainly on
the dictionary used
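A minimal sketch of the shared-word-count measure as a dot product of binary key-word vectors; the small dictionary is hypothetical:

```python
# Shared-word-count sketch: represent each document as a binary vector over
# a key-word dictionary; similarity is the dot product (number of shared words).
DICTIONARY = ["data", "mining", "retrieval", "index", "query"]

def binary_vector(text):
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in DICTIONARY]

def shared_word_count(vec_a, vec_b):
    return sum(a * b for a, b in zip(vec_a, vec_b))   # 1*1 for each shared word

d1 = binary_vector("data mining builds an index")
d2 = binary_vector("query processing and data retrieval")
print(shared_word_count(d1, d2))   # 1 (only "data" is shared)
```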
Computing similarity by Shared
words
 Shared word count is an exact search
 A document is either retrieved or not retrieved
 No weighting can be applied to terms
 In a query containing A and B, you cannot specify that A is more
important than B
 Every retrieved document is treated equally
Word Count and Bonus
 TF – term frequency
 Number of times a term occurs in a document
 DF –Document frequency
 Number of documents that contain the term.
 IDF – inverse document frequency
 IDF = log (N/df)
 N: the total number of documents
 Vector is a numerical representation for a point in a multi-
dimensional space.
 (x1, x2, … … xn)
Dimensions of the space need to be defined
A measure of the space needs to be defined.
Word Count and Bonus
 Each indexing term is a dimension
 Each document is a vector
Di = (ti1, ti2, ti3, ti4, ... tik)
 Document similarity is defined as

Sim(Di) = Σj=1..K w(j)

where
w(j) = 1 + 1/df(j)   if word j occurs in both documents
w(j) = 0             otherwise
K = number of words (in the dictionary)
Word Count and Bonus
 The bonus 1/df(j) is a variant of idf. Thus, if the word
occurs in many documents, the bonus is small.
 This measure is better than the shared word count because it
discriminates between weakly and strongly
predictive words.
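A minimal sketch of the word-count-and-bonus measure, with a hypothetical three-word dictionary:

```python
# Word-count-and-bonus sketch: each shared word contributes 1 + 1/df(j),
# so words that occur in few documents (low df) earn a larger bonus.
def doc_frequencies(doc_vectors):
    k = len(doc_vectors[0])
    return [sum(vec[j] for vec in doc_vectors) for j in range(k)]

def similarity_with_bonus(doc_vec, query_vec, df):
    score = 0.0
    for j, (d, q) in enumerate(zip(doc_vec, query_vec)):
        if d == 1 and q == 1 and df[j] > 0:      # word j occurs in both documents
            score += 1 + 1 / df[j]
    return score

collection = [[1, 0, 1], [1, 1, 0], [0, 1, 1]]   # 3 docs over a 3-word dictionary
df = doc_frequencies(collection)                  # [2, 2, 2]
query = [1, 0, 1]
for vec in collection:
    print(similarity_with_bonus(vec, query, df))  # 3.0, 1.5, 1.5
```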
Word Count and Bonus
Computing Similarity Scores with Bonus

• A document space is defined by five terms:
hardware, software, user, information, index.
• The query (new document) is “hardware, user, information”,
with vector 01101 in the term ordering used below.

Labeled spreadsheet vector    Similarity score with bonus
10101                         2.83
11000                         1.33
00010                         0
10001                         1.33
00100                         1.5
01010                         1.33
11001                         2.67

For example, 2.83 = (1 + 1/2) + (1 + 1/3): two shared words whose
document frequencies in the collection are 2 and 3.
Cosine Similarity
The Vector Space
A document is represented as a vector:
(W1, W2, … … , Wn)
Binary:
 Wi= 1 if the corresponding term is in the
document
 Wi= 0 if the term is not in the document
TF: (Term Frequency)
 Wi= tfi where tfi is the number of times the term
occurred in the document
TF*IDF: (Inverse Document Frequency)
 Wi =tfi*idfi=tfi*(1+log(N/dfi)) where dfi is the
number of documents contains the term i, and N
the total number of documents in the collection.
Cosine Similarity
The Vector Space
vec(D) = (w1, w2, ..., wt)
Sim(d1, d2) = cos(θ)
= [ vec(d1) · vec(d2) ] / ( |d1| × |d2| )
= Σj [ wd1(j) × wd2(j) ] / ( |d1| × |d2| )
w(j) > 0 whenever term j ∈ di
So, 0 <= sim(d1,d2) <=1
A document is retrieved
even if it matches the
query terms only partially
Cosine Similarity
 How to compute the weight wj?
 A good weight must take into account two effects:
 quantification of intra-document contents (similarity)
tf factor, the term frequency within a
document
 quantification of inter-document separation (dissimilarity)
idf factor, the inverse document frequency
 wj = tf(j) * idf(j)
Cosine Similarity
 TF in the given document shows how important the term is
in this document (makes the frequent words for the
document more important)
 IDF makes rare words across all documents more
important.
 A high weight in a tf-idf ranking scheme is therefore
reached by a high term frequency in the given document
and a low term frequency in all other documents.
 Term weights in a document affects the position of the
document vectors
Cosine Similarity
TF-IDF definitions:
fik: number of occurrences of term ti in document Dk
tfik = fik / maxj(fjk): normalized term frequency
dfi: number of documents which contain term ti
idfi = log(N / dfi), where N is the total number of documents
wik = tfik × idfi: term weight
Intuition: rare words get more weight, common words less
weight
Example TF-IDF
 Given a document containing terms with given frequencies:
Kent = 3; Ohio = 2; University = 1
and assume a collection of 10,000 documents and document
frequencies of these terms are:
Kent = 50; Ohio = 1300; University = 250.
THEN
Kent: tf = 3/3; idf = log(10000/50) = 5.3; tf-idf = 5.3
Ohio: tf = 2/3; idf = log(10000/1300) = 2.0; tf-idf = 1.3
University: tf = 1/3; idf = log(10000/250) = 3.7; tf-idf = 1.2
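A minimal sketch reproducing these numbers; note the logarithm is the natural log (that is what makes idf(Kent) = log(10000/50) ≈ 5.3):

```python
# TF-IDF sketch reproducing the example above, using the natural log.
import math

N = 10_000                                   # documents in the collection
term_freqs = {"Kent": 3, "Ohio": 2, "University": 1}
doc_freqs  = {"Kent": 50, "Ohio": 1300, "University": 250}

max_tf = max(term_freqs.values())
for term, f in term_freqs.items():
    tf = f / max_tf                          # normalized term frequency
    idf = math.log(N / doc_freqs[term])
    print(f"{term}: tf={tf:.2f} idf={idf:.2f} tf-idf={tf * idf:.2f}")
# Kent: tf=1.00 idf=5.30 tf-idf=5.30
# Ohio: tf=0.67 idf=2.04 tf-idf=1.36   (the slide rounds idf to 2.0, giving 1.3)
# University: tf=0.33 idf=3.69 tf-idf=1.23
```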
Cosine Similarity
 Cosine
 W(j) = tf(j) * idf(j)
 Idf(j) = log(N / df(j))

sim(d1, d2) = Σj [ wd1(j) × wd2(j) ] / ( √(Σj wd1(j)²) × √(Σj wd2(j)²) )
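A minimal sketch of cosine similarity over tf-idf weighted vectors, following w(j) = tf(j) × idf(j) above; the toy documents are hypothetical:

```python
# Cosine similarity over tf-idf weighted vectors, with
# w(j) = tf(j) * idf(j) and idf(j) = log(N / df(j)).
import math
from collections import Counter

docs = [
    "data mining and text mining".split(),
    "information retrieval of text documents".split(),
    "databases and query processing".split(),
]
N = len(docs)
vocab = sorted({w for d in docs for w in d})
df = {w: sum(1 for d in docs if w in d) for w in vocab}

def tfidf_vector(words):
    counts = Counter(words)
    return [counts[w] * math.log(N / df[w]) for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vecs = [tfidf_vector(d) for d in docs]
print(round(cosine(vecs[0], vecs[1]), 3))   # doc 0 vs doc 1: share "text"
print(round(cosine(vecs[0], vecs[2]), 3))   # doc 0 vs doc 2: share only "and"
```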
Why Mine the Web?
Enormous wealth of textual information on the Web.
 Book/CD/Video stores (e.g., Amazon)
 Restaurant information (e.g., Zagats)
 Car prices (e.g., Carpoint)
Lots of data on user access patterns
 Web logs contain sequence of URLs accessed by users
Possible to retrieve “previously unknown” information
 People who ski also frequently break their leg.
 Restaurants that serve sea food in California are likely
to be outside San-Francisco
Mining the Web
[Diagram: a Web spider crawls the document source; the IR / IE system takes a query and returns a ranked list of documents (1. Doc1, 2. Doc2, 3. Doc3, ...).]
Web-based Retrieval
Additional information in Web documents
 Link structure (e.g., Page Rank)
 HTML structure
 Link/anchor text
 Title text, Etc.,
 Can be leveraged for better retrieval
Additional issues in Web retrieval
 Scalability: size of “corpus” is huge (10 to 100 billion docs)
 Constantly changing:
 Crawlers to update document-term information
 Need schemes for efficient updating indices
 Evaluation is more difficult:
 How is relevance measured?
 How many documents in total are relevant?
Probabilistic Approaches to
Retrieval
Compute P(q | d) for each document d
 Intuition: relevance of d to q is related to how likely it is that q
was generated by d, or “how likely is q under a model for d?”
Simple model for P(q|d)
 Pe(q|d) = empirical frequency of words in document d
 “tuned” to d, but likely to be sparse (will contain many zeros)
2-stage probabilistic model (or linear interpolation model)
 P(q|d) = λ Pe(q | d) + (1 − λ) Pe(q | corpus)
 λ can be fixed, e.g., tuned to a particular data set
 Or it can depend on d, e.g., λ = nd / (nd + m),
where nd = number of words in doc d, and m = a constant (e.g.,
1000)
Can also use more sophisticated models for P(q|d), e.g., topic-
based models
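A minimal sketch of the linear-interpolation model, assuming λ = nd / (nd + m) and hypothetical toy documents:

```python
# Two-stage / linear-interpolation sketch:
#   P(q|d) = lam * Pe(q|d) + (1 - lam) * Pe(q|corpus),  lam = n_d / (n_d + m)
from collections import Counter

docs = [
    "data mining finds patterns in data".split(),
    "information retrieval ranks documents".split(),
]
corpus = [w for d in docs for w in d]
corpus_counts, corpus_len = Counter(corpus), len(corpus)
M = 10                                     # smoothing constant (e.g., 1000 in practice)

def p_query_given_doc(query, doc):
    doc_counts, n_d = Counter(doc), len(doc)
    lam = n_d / (n_d + M)
    p = 1.0
    for w in query:
        p_doc = doc_counts[w] / n_d                    # empirical, sparse
        p_bg  = corpus_counts[w] / corpus_len          # corpus background
        p *= lam * p_doc + (1 - lam) * p_bg
    return p

query = "data patterns".split()
for i, d in enumerate(docs):
    print(i, p_query_given_doc(query, d))
```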
Information Retrieval
 Web-Based Document Search
 Page Rank
 Anchor Text
 Document Matching
 Inverted Lists
Page Rank
 PR(A): the page rank of page A.
 C(T) : the number of outgoing links from
page T.
 d : minimum value assigned to any page.
 Tj : a page pointing to A.

PR(A) = d + (1 − d) × Σj PR(Tj) / C(Tj)
Algorithm of Page Rank
1. Use the PageRank Equation to compute
PageRank for each page in the collection
using latest PageRanks of pages.
2. Repeat step 1 until no significant change to
any PageRank.
Example
Initial values: PR(A) = PR(B) = PR(C) = 1, d = 0.1
(A links to B and C; B and C each link back to A)
In the first iteration:
 PR(A) = 0.1 + 0.9 × (PR(B) + PR(C))
= 0.1 + 0.9 × (1 + 1)
= 1.9
 PR(B) = 0.1 + 0.9 × (PR(A) / 2)
= 0.1 + 0.9 × (1.9 / 2)
= 0.95
 PR(C) = 0.1 + 0.9 × (PR(A) / 2)
= 0.1 + 0.9 × (1.9 / 2)
= 0.95
After convergence: PR(A) = 1.48, PR(B) = 0.76, PR(C) = 0.76
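A minimal sketch of the iterative computation for this example, assuming the link structure implied by the formulas (A links to B and C; B and C link back to A):

```python
# Iterative PageRank sketch for the example, using
# PR(p) = d + (1 - d) * sum over in-links T of PR(T) / C(T), with d = 0.1.
links = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}    # outgoing links
d = 0.1
pr = {page: 1.0 for page in links}                    # initial values

for _ in range(200):                                  # repeat until (near) convergence
    new_pr = {page: d + (1 - d) * sum(pr[q] / len(links[q])
                                      for q in links if page in links[q])
              for page in links}
    converged = all(abs(new_pr[p] - pr[p]) < 1e-6 for p in pr)
    pr = new_pr
    if converged:
        break

print({p: round(v, 2) for p, v in pr.items()})
# -> {'A': 1.47, 'B': 0.76, 'C': 0.76}  (≈ the slide's converged values)
```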
Anchor Text
 The anchor text is the visible, clickable text in a
hyperlink.
 For example:
<a href="http://www.wikipedia.org">Wikipedia</a>
 The anchor text is Wikipedia; the complex URL
http://www.wikipedia.org/ displays on the web
page as Wikipedia, contributing to clean, easy-to-read
text.
Anchor Text
 Anchor text usually gives the user relevant descriptive
or contextual information about the content of the link’s
destination.
 The anchor text may or may not be related to the actual
text of the URL of the link.
 The words contained in the Anchor Text can determine
the ranking that the page will receive by search engines.
Common Misunderstanding
 Webmasters sometimes tend to misunderstand anchor
text.
 Instead of turning appropriate words inside of a
sentence into a clickable link, webmasters frequently
insert extra text.
Anchor Text
 This proper method of linking is beneficial not only to
users, but also to the webmasters as anchor text holds
significant weight in search engine ranking.
 Most search engine optimization experts recommend
against using “click here” to designate a link.
Document Matching
 An arbitrarily long document is the query, not just a few
key words.
 But the goal is still to rank and output an ordered list of
relevant documents.
 The most similar documents are found using the
measures described earlier.
 Search engines and document matchers are not
focused on classification of new documents.
 Their primary goal is to retrieve the most relevant documents.
Generalization of searching
• Matching a document to a collection of documents
looks like a tedious and expensive operation.
• Even for a short query, comparing it to every document in a
large collection implies a relatively intensive
computation task.
Example of document matching
 Consider an online help desk, where a complete
description of a problem is submitted.
 That document could be matched to stored documents,
hopefully finding descriptions of similar problems and
solutions without having the user experiment with
numerous key word searches.
Inverted Lists
 Instead of documents pointing to words, a list
of words pointing to documents is the primary
internal representation for processing queries
and matching documents.
Inverted Lists
 The inverted list is the key to the efficiency of
information retrieval systems.
 The inverted list has helped make nearest-
neighbor methods a pragmatic possibility for prediction.
Example
If the query contains the words 100 and 200:
1) First process the postings list W(100), updating the
similarity score S(i) of each document i it lists:
S(1) = 0 + 1
S(2) = 0 + 1
…
2) Then process W(200) in the same way:
S(2) = 1 + 1
…
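A minimal sketch of scoring documents through inverted lists as in this example; the postings are hypothetical:

```python
# Inverted-list query scoring sketch: walk the postings list of each query
# term and accumulate a similarity score per document.
from collections import defaultdict

# postings: term -> list of document ids containing that term
postings = {
    100: [1, 2, 5],
    200: [2, 3],
    300: [4],
}

def score(query_terms):
    scores = defaultdict(int)
    for term in query_terms:
        for docid in postings.get(term, []):
            scores[docid] += 1          # shared-word-count contribution
    return dict(scores)

print(score([100, 200]))   # {1: 1, 2: 2, 5: 1, 3: 1}
```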
Evaluating IE Accuracy
 Always evaluate performance on independent, manually-
annotated test data not used during system development.
 Measure for each test document:
 Total number of correct extractions in the solution
template: N
 Total number of slot/value pairs extracted by the
system: E
 Number of extracted slot/value pairs that are correct
(i.e. in the solution template): C
 Compute average value of metrics adapted from IR:
 Recall = C/N
 Precision = C/E
 F-Measure = Harmonic mean of recall and precision
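A minimal sketch of these metrics:

```python
# Recall = C/N, Precision = C/E, F = harmonic mean of recall and precision,
# where N = correct extractions in the solution template, E = extractions
# made by the system, C = extracted slot/value pairs that are correct.
def ie_metrics(n_correct_in_solution, n_extracted, n_extracted_correct):
    recall = n_extracted_correct / n_correct_in_solution
    precision = n_extracted_correct / n_extracted
    f_measure = (2 * recall * precision / (recall + precision)
                 if recall + precision else 0.0)
    return recall, precision, f_measure

print(ie_metrics(10, 8, 6))   # (0.6, 0.75, 0.666...)
```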
Related Types of Data
Sparse high-dimensional data sets with counts, like
document-term matrices, are common in data mining, e.g.,
 “transaction data”
 Rows = customers; Columns = products
 Web log data (ignoring sequence)
 Rows = Web surfers; Columns = Web pages
Recommender systems
 Given some products from user i, suggest other
products to the user
 e.g., Amazon.com’s book recommender
 Collaborative filtering:
 use k-nearest-individuals as the basis for
predictions
 Many similarities with querying and information retrieval
What is a Good IR System?
 Minimize the overhead of a user locating needed
information
 Fast, accurate, comprehensive, easy to use, …
 Objective measures
 Precision
 Recall
P = (No. of relevant documents retrieved) / (No. of all documents retrieved)

R = (No. of relevant documents retrieved) / (No. of all relevant documents in the data)
Measuring Retrieval Effectiveness
 Information-retrieval systems save space by using
index structures that support only approximate retrieval.
May result in:
 false negative (false drop) - some relevant
documents may not be retrieved.
 false positive - some irrelevant documents may be
retrieved.
For many applications a good index should not permit
any false drops, but may permit a few false positives.
 Relevant performance metrics:
 precision - what percentage of the retrieved
documents are relevant to the query.
 recall - what percentage of the documents relevant
to the query were actually retrieved.
Measuring Retrieval
Effectiveness
Recall vs. precision tradeoff:
 Can increase recall by retrieving many documents
(down to a low level of relevance ranking), but many
irrelevant documents would be fetched, reducing
precision
Measures of retrieval effectiveness:
 Recall as a function of number of documents fetched,
or
 Precision as a function of recall
Equivalently, as a function of number of
documents fetched
 E.g. “precision of 75% at recall of 50%, and 60% at a
recall of 75%”
Applications of Information
Retrieval
 Classic application
 Library catalogue
e.g. The UofC library catalogue
 Current applications
 Digital library
e.g. http://www.acm.org/dl
 WWW search engines
e.g. http://www.google.com
Other applications of IE Systems
 Job resumes
 Seminar announcements
 Molecular biology information from MEDLINE, e.g., extracting
gene–drug interactions from biomedical texts
 Summarizing medical patient records by extracting
diagnoses, symptoms, physical findings, test results.
 Gathering earnings, profits, board members, etc. [corporate
information] from web, company reports
 Verification of construction industry specifications documents
(are the quantities correct/reasonable?)
 Extraction of political/economic/business changes from
newspaper articles
Conclusion
1. Information retrieval methods are specialized
nearest-neighbor methods, which are well-known
prediction methods.
2. IR methods typically process unlabeled data and
order and display the retrieved documents.
3. The IR methods have no training and induce no new
rules for classification.
Ad

More Related Content

What's hot (20)

WEB BASED INFORMATION RETRIEVAL SYSTEM
WEB BASED INFORMATION RETRIEVAL SYSTEMWEB BASED INFORMATION RETRIEVAL SYSTEM
WEB BASED INFORMATION RETRIEVAL SYSTEM
Sai Kumar Ale
 
Multimedia Information Retrieval
Multimedia Information RetrievalMultimedia Information Retrieval
Multimedia Information Retrieval
Stephane Marchand-Maillet
 
Information retrieval introduction
Information retrieval introductionInformation retrieval introduction
Information retrieval introduction
nimmyjans4
 
Lec1,2
Lec1,2Lec1,2
Lec1,2
alaa223
 
The impact of web on ir
The impact of web on irThe impact of web on ir
The impact of web on ir
Primya Tamil
 
Information retrieval-systems notes
Information retrieval-systems notesInformation retrieval-systems notes
Information retrieval-systems notes
BAIRAVI T
 
Evaluation in Information Retrieval
Evaluation in Information RetrievalEvaluation in Information Retrieval
Evaluation in Information Retrieval
Dishant Ailawadi
 
NLP
NLPNLP
NLP
guestff64339
 
Information retrieval 3 query search interfaces
Information retrieval 3 query search interfacesInformation retrieval 3 query search interfaces
Information retrieval 3 query search interfaces
Vaibhav Khanna
 
Lectures 1,2,3
Lectures 1,2,3Lectures 1,2,3
Lectures 1,2,3
alaa223
 
Information retrieval 10 tf idf and bag of words
Information retrieval 10 tf idf and bag of wordsInformation retrieval 10 tf idf and bag of words
Information retrieval 10 tf idf and bag of words
Vaibhav Khanna
 
Inverted index
Inverted indexInverted index
Inverted index
Krishna Gehlot
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
Joud Khattab
 
Text clustering
Text clusteringText clustering
Text clustering
KU Leuven
 
IRS-Cataloging and Indexing-2.1.pptx
IRS-Cataloging and Indexing-2.1.pptxIRS-Cataloging and Indexing-2.1.pptx
IRS-Cataloging and Indexing-2.1.pptx
ShivaVemula2
 
Chapter 1 semantic web
Chapter 1 semantic webChapter 1 semantic web
Chapter 1 semantic web
R A Akerkar
 
Metadata ppt
Metadata pptMetadata ppt
Metadata ppt
Shashikant Kumar
 
The semantic web
The semantic web The semantic web
The semantic web
ap
 
Text mining
Text miningText mining
Text mining
Koshy Geoji
 
Ontologies and semantic web
Ontologies and semantic webOntologies and semantic web
Ontologies and semantic web
Stanley Wang
 
WEB BASED INFORMATION RETRIEVAL SYSTEM
WEB BASED INFORMATION RETRIEVAL SYSTEMWEB BASED INFORMATION RETRIEVAL SYSTEM
WEB BASED INFORMATION RETRIEVAL SYSTEM
Sai Kumar Ale
 
Information retrieval introduction
Information retrieval introductionInformation retrieval introduction
Information retrieval introduction
nimmyjans4
 
The impact of web on ir
The impact of web on irThe impact of web on ir
The impact of web on ir
Primya Tamil
 
Information retrieval-systems notes
Information retrieval-systems notesInformation retrieval-systems notes
Information retrieval-systems notes
BAIRAVI T
 
Evaluation in Information Retrieval
Evaluation in Information RetrievalEvaluation in Information Retrieval
Evaluation in Information Retrieval
Dishant Ailawadi
 
Information retrieval 3 query search interfaces
Information retrieval 3 query search interfacesInformation retrieval 3 query search interfaces
Information retrieval 3 query search interfaces
Vaibhav Khanna
 
Lectures 1,2,3
Lectures 1,2,3Lectures 1,2,3
Lectures 1,2,3
alaa223
 
Information retrieval 10 tf idf and bag of words
Information retrieval 10 tf idf and bag of wordsInformation retrieval 10 tf idf and bag of words
Information retrieval 10 tf idf and bag of words
Vaibhav Khanna
 
Text clustering
Text clusteringText clustering
Text clustering
KU Leuven
 
IRS-Cataloging and Indexing-2.1.pptx
IRS-Cataloging and Indexing-2.1.pptxIRS-Cataloging and Indexing-2.1.pptx
IRS-Cataloging and Indexing-2.1.pptx
ShivaVemula2
 
Chapter 1 semantic web
Chapter 1 semantic webChapter 1 semantic web
Chapter 1 semantic web
R A Akerkar
 
The semantic web
The semantic web The semantic web
The semantic web
ap
 
Ontologies and semantic web
Ontologies and semantic webOntologies and semantic web
Ontologies and semantic web
Stanley Wang
 

Viewers also liked (6)

Text categorization
Text categorizationText categorization
Text categorization
KU Leuven
 
Probabilistic models (part 1)
Probabilistic models (part 1)Probabilistic models (part 1)
Probabilistic models (part 1)
KU Leuven
 
Tdm probabilistic models (part 2)
Tdm probabilistic  models (part  2)Tdm probabilistic  models (part  2)
Tdm probabilistic models (part 2)
KU Leuven
 
Tdm recent trends
Tdm recent trendsTdm recent trends
Tdm recent trends
KU Leuven
 
Text data mining1
Text data mining1Text data mining1
Text data mining1
KU Leuven
 
Text Data Mining
Text Data MiningText Data Mining
Text Data Mining
KU Leuven
 
Text categorization
Text categorizationText categorization
Text categorization
KU Leuven
 
Probabilistic models (part 1)
Probabilistic models (part 1)Probabilistic models (part 1)
Probabilistic models (part 1)
KU Leuven
 
Tdm probabilistic models (part 2)
Tdm probabilistic  models (part  2)Tdm probabilistic  models (part  2)
Tdm probabilistic models (part 2)
KU Leuven
 
Tdm recent trends
Tdm recent trendsTdm recent trends
Tdm recent trends
KU Leuven
 
Text data mining1
Text data mining1Text data mining1
Text data mining1
KU Leuven
 
Text Data Mining
Text Data MiningText Data Mining
Text Data Mining
KU Leuven
 
Ad

Similar to Tdm information retrieval (20)

Chapter 1: Introduction to Information Storage and Retrieval
Chapter 1: Introduction to Information Storage and RetrievalChapter 1: Introduction to Information Storage and Retrieval
Chapter 1: Introduction to Information Storage and Retrieval
captainmactavish1996
 
Text mining and analytics v6 - p1
Text mining and analytics   v6 - p1Text mining and analytics   v6 - p1
Text mining and analytics v6 - p1
Dave King
 
Text Mining.pptx
Text Mining.pptxText Mining.pptx
Text Mining.pptx
vrundadevani
 
Semantic Search Tutorial at SemTech 2012
Semantic Search Tutorial at SemTech 2012 Semantic Search Tutorial at SemTech 2012
Semantic Search Tutorial at SemTech 2012
Thanh Tran
 
Lecture 9 - Machine Learning and Support Vector Machines (SVM)
Lecture 9 - Machine Learning and Support Vector Machines (SVM)Lecture 9 - Machine Learning and Support Vector Machines (SVM)
Lecture 9 - Machine Learning and Support Vector Machines (SVM)
Sean Golliher
 
Information_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_HabibInformation_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_Habib
El Habib NFAOUI
 
intro.ppt
intro.pptintro.ppt
intro.ppt
UbaidURRahman78
 
Week14-Multimedia Information Retrieval.pptx
Week14-Multimedia Information Retrieval.pptxWeek14-Multimedia Information Retrieval.pptx
Week14-Multimedia Information Retrieval.pptx
HasanulFahmi2
 
Tovek Presentation by Livio Costantini
Tovek Presentation by Livio CostantiniTovek Presentation by Livio Costantini
Tovek Presentation by Livio Costantini
maxfalc
 
Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012
Peter Mika
 
Literature Based Framework for Semantic Descriptions of e-Science resources
Literature Based Framework for Semantic Descriptions of e-Science resourcesLiterature Based Framework for Semantic Descriptions of e-Science resources
Literature Based Framework for Semantic Descriptions of e-Science resources
Hammad Afzal
 
Information Retrieval and Map-Reduce Implementations
Information Retrieval and Map-Reduce ImplementationsInformation Retrieval and Map-Reduce Implementations
Information Retrieval and Map-Reduce Implementations
Jason J Pulikkottil
 
Fundamentals Concepts on Text Analytics.pptx
Fundamentals Concepts on Text Analytics.pptxFundamentals Concepts on Text Analytics.pptx
Fundamentals Concepts on Text Analytics.pptx
aini658222
 
Concept Based Search
Concept Based SearchConcept Based Search
Concept Based Search
freewi11
 
Social recommender system
Social recommender systemSocial recommender system
Social recommender system
Kapil Kumar
 
Technical Whitepaper: A Knowledge Correlation Search Engine
Technical Whitepaper: A Knowledge Correlation Search EngineTechnical Whitepaper: A Knowledge Correlation Search Engine
Technical Whitepaper: A Knowledge Correlation Search Engine
s0P5a41b
 
Chapter 1 Intro Information Rerieval.pptx
Chapter 1 Intro Information Rerieval.pptxChapter 1 Intro Information Rerieval.pptx
Chapter 1 Intro Information Rerieval.pptx
bekidea
 
Reflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data systemReflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data system
Trey Grainger
 
Automatic Metadata Generation Charles Duncan
Automatic Metadata Generation Charles DuncanAutomatic Metadata Generation Charles Duncan
Automatic Metadata Generation Charles Duncan
JISC CETIS
 
Text mining introduction-1
Text mining   introduction-1Text mining   introduction-1
Text mining introduction-1
Sumit Sony
 
Chapter 1: Introduction to Information Storage and Retrieval
Chapter 1: Introduction to Information Storage and RetrievalChapter 1: Introduction to Information Storage and Retrieval
Chapter 1: Introduction to Information Storage and Retrieval
captainmactavish1996
 
Text mining and analytics v6 - p1
Text mining and analytics   v6 - p1Text mining and analytics   v6 - p1
Text mining and analytics v6 - p1
Dave King
 
Semantic Search Tutorial at SemTech 2012
Semantic Search Tutorial at SemTech 2012 Semantic Search Tutorial at SemTech 2012
Semantic Search Tutorial at SemTech 2012
Thanh Tran
 
Lecture 9 - Machine Learning and Support Vector Machines (SVM)
Lecture 9 - Machine Learning and Support Vector Machines (SVM)Lecture 9 - Machine Learning and Support Vector Machines (SVM)
Lecture 9 - Machine Learning and Support Vector Machines (SVM)
Sean Golliher
 
Information_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_HabibInformation_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_Habib
El Habib NFAOUI
 
Week14-Multimedia Information Retrieval.pptx
Week14-Multimedia Information Retrieval.pptxWeek14-Multimedia Information Retrieval.pptx
Week14-Multimedia Information Retrieval.pptx
HasanulFahmi2
 
Tovek Presentation by Livio Costantini
Tovek Presentation by Livio CostantiniTovek Presentation by Livio Costantini
Tovek Presentation by Livio Costantini
maxfalc
 
Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012
Peter Mika
 
Literature Based Framework for Semantic Descriptions of e-Science resources
Literature Based Framework for Semantic Descriptions of e-Science resourcesLiterature Based Framework for Semantic Descriptions of e-Science resources
Literature Based Framework for Semantic Descriptions of e-Science resources
Hammad Afzal
 
Information Retrieval and Map-Reduce Implementations
Information Retrieval and Map-Reduce ImplementationsInformation Retrieval and Map-Reduce Implementations
Information Retrieval and Map-Reduce Implementations
Jason J Pulikkottil
 
Fundamentals Concepts on Text Analytics.pptx
Fundamentals Concepts on Text Analytics.pptxFundamentals Concepts on Text Analytics.pptx
Fundamentals Concepts on Text Analytics.pptx
aini658222
 
Concept Based Search
Concept Based SearchConcept Based Search
Concept Based Search
freewi11
 
Social recommender system
Social recommender systemSocial recommender system
Social recommender system
Kapil Kumar
 
Technical Whitepaper: A Knowledge Correlation Search Engine
Technical Whitepaper: A Knowledge Correlation Search EngineTechnical Whitepaper: A Knowledge Correlation Search Engine
Technical Whitepaper: A Knowledge Correlation Search Engine
s0P5a41b
 
Chapter 1 Intro Information Rerieval.pptx
Chapter 1 Intro Information Rerieval.pptxChapter 1 Intro Information Rerieval.pptx
Chapter 1 Intro Information Rerieval.pptx
bekidea
 
Reflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data systemReflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data system
Trey Grainger
 
Automatic Metadata Generation Charles Duncan
Automatic Metadata Generation Charles DuncanAutomatic Metadata Generation Charles Duncan
Automatic Metadata Generation Charles Duncan
JISC CETIS
 
Text mining introduction-1
Text mining   introduction-1Text mining   introduction-1
Text mining introduction-1
Sumit Sony
 
Ad

Tdm information retrieval

  • 2. Text Mining Applications Information Retrieval  Query-based search of large text archives, e.g., the Web Text Classification  Automated assignment of topics to Web pages, e.g., Yahoo, Google  Automated classification of email into spam and non- spam Text Clustering  Automated organization of search results in real-time into categories  Discovery clusters and trends in technical literature (e.g. CiteSeer) Information Extraction  Extracting standard fields from free-text  Extracting names and places from reports, newspapers
  • 3. Information Retrieval - Definition  Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).  Information Retrieval  Deals with the representation, storage, organization of, and access to information items – Modern Information Retrieval  General Objective: Minimize the overhead of a user locating needed information
  • 4. Information Retrieval Is Not Database Information Retrieval  Process stored documents  Search documents relevant to user queries  No standard of how queries should be  Query results are permissive to errors or inaccurate items Database  Normally no processing of data  Search records matching queries  Standard: SQL language  Query results should have 100% accuracy. Zero tolerant to errors
  • 5. Information Retrieval Is Not Data Mining Information Retrieval  User target: Existing relevant data entries Data Mining  User target: Knowledge (rules, etc.) implied by data (not the individual data entries themselves) • Many techniques and models are shared and related • E.g. classification of documents
  • 6. Is Information Retrieval a Form of Text Mining? What is the principal computer specialty for processing documents and text??  Information Retrieval (IR)  The task of IR is to retrieve relevant documents in response to a query.  The fundamental technique of IR is measuring similarity  A query is examined and transformed into a vector of values to be compared with stored documents
  • 7. Is Information Retrieval a Form of Text Mining?  In the predication problem similar documents are retrieved, then measure their properties, i.e. count the # of class labels to see which label should be assigned to a new document  The objectives of the prediction can be posed in the form of an IR model where documents are retrieved that are relevant to a query, the query will be a new document
  • 8. Specify Query Search Document Collection Return Subset of Relevant Documents Key Steps in Information Retrieval Examine Document Collection Learn Classification Criteria Apply Criteria to New Documents Key Steps in Predictive Text Mining
  • 9. Specify Query Vector Match Document Collection Get Subset of Relevant Documents Examine Document Properties Predicting from Retrieved Documents Key steps in IR simple criteria such as document’s labels
  • 10. Information Retrieval (IR)  Conceptually, IR is the study of finding needed information. I.e., IR helps users find information that matches their information needs.  Expressed as queries  Historically, IR is about document retrieval, emphasizing document as the basic unit.  Finding documents relevant to user queries  Technically, IR studies the acquisition, organization, storage, retrieval, and distribution of information.
  • 12. Abstract IR Architecture DocumentsQuery Hits Representation Function Representation Function Query Representation Document Representation Comparison Function Index offlineonline
  • 14. IR Queries  Keyword queries  Boolean queries (using AND, OR, NOT)  Phrase queries  Proximity queries  Full document queries  Natural language questions
  • 15. Information retrieval models  An IR model governs how a document and a query are represented and how the relevance of a document to a user query is defined.  Main models:  Boolean model  Vector space model  Statistical language model  etc
  • 16. Elements in Information Retrieval  Processing of documents  Acceptance and processing of queries from users  Modelling, searching and ranking of documents  Presenting the search result
  • 17. Process of Retrieving Information
  • 18. Document Processing  Removing stopwords (appear frequently but no much meaning, e.g. “the”, “of”)  Stemming: recognize different words with the same grammar root  Noun groups: common combination of words  Indexing: for fast locating documents
  • 19. Processing Queries  Define a “language” for queries  Syntax, operators, etc.  Modify the queries for better search  Ignore meaningless parts: punctuations, conjunctives, etc.  Append synonyms e.g. e-business e-commerce  Emerging technology  Natural language queries
  • 20. Modelling/Ranking of Documents  Model the relevance (usefulness) of documents against the user query Q  The model represents a function Rel(Q,D)  D is a document, Q is a user query  Rel(Q,D) is the relevance of document D to query Q  There are many models available  Algebraic models  Probabilistic models  Set-theoretic models
  • 21. Basic Vector Space Model  Define a set of words and phases as terms  Text is represented by a vector of terms  User query is converted to a vector, too  Measure the vector “distance” between a document vector and the query vector business computer PowerPoint presentation user web Term Set We are doing an e-business presentation in PowerPoint. Document (1,0,1,1,0,0) computer presentation Query (0,1,0,1,0,0) 222222 )00()00()11()01()10()01(  Distance
  • 22. Probabilistic Models Overview Probabilistic Models  Ranking: the probability that a document is relevant to a query  Often denoted as Pr(R|D,Q)  In actual measure, log-odds transformation is used:  Probability values are estimated in applications ),|Pr( ),|Pr( log QDR QDR
  • 23. Information Retrieval Given  A source of textual documents  A well defined limited query (text based) Find  Sentences with relevant information  Extract the relevant information and ignore non- relevant information (important!)  Link related information and output in a predetermined format  Example: news stories, e-mails, web pages, photograph, music, statistical data, biomedical data, etc.  Information items can be in the form of text, image, video, audio, numbers, etc.
  • 24. Information Retrieval 2 basic information retrieval (IR) process: Browsing or navigation system  User skims document collection by jumping from one document to the other via hypertext or hypermedia links until relevant document found Classical IR system: question answering system  Query: question in natural language  Answer: directly extracted from text of document collection Text Based Information Retrieval: Information item (document)  Text format (written/spoken) or has textual description Information need (query)
  • 27. General concepts in IR Representation language  Typically a vector of d attribute values, e.g.,  set of color, intensity, texture, features characterizing images  word counts for text documents Data set D of N objects  Typically represented as an N x d matrix Query Q  User poses a query to search D  Query is typically expressed in the same representation language as the data, e.g.,  each text document is a set of words that occur in the document
  • 28. Query by Content Traditional DB query: exact matches  E.g. query Q = [level = MANAGER] AND [age < 30] or,  Boolean match on text  Query = “Irvine” AND “fun”: return all docs with “Irvine” and “fun”  Not useful when there are many matches  E.g., “data mining” in Google returns 60 million documents Query-by-content query: more general / less precise  E.g. what record is most similar to a query Q?  For text data, often called “information retrieval (IR)”  Can also be used for images, sequences, video, etc  Q can itself be an object (e.g., a document) or a shorter version (e.g., 1 word) Goal  Match query Q to the N objects in the database  Return a ranked list (typically) of the most similar/relevant objects in the data set D given Q
  • 29. Issues in Query by Content  What representation language to use  How to measure similarity between Q and each object in D  How to compute the results in real-time (for interactive querying)  How to rank the results for the user  Allowing user feedback (query modification)  How to evaluate and compare different IR algorithms/systems
  • 30. The Standard Approach  Fixed-length (d dimensional) vector representation  For query (1-by-d Q) and database (n-by-d X) objects  Use domain-specific higher-level features (vs raw)  Image “bag of features”: color (e.g. RGB), texture (e.g. Gabor, Fourier coeffs), …  Text “bag of words”: freq count for each word in each document, … Also known as the “vector-space” model  Compute distances between vectorized representation  Use k-NN to find k vectors in X closest to Q
  • 31. Text Retrieval  Document: book, paper, WWW page, ...  Term: word, word-pair, phrase, … (often: 50,000+)  query Q = set of terms, e.g., “data” + “mining”  NLP (natural language processing) too hard, so …  Want (vector) representation for text which  Retains maximum useful semantics  Supports efficient distance computes between docs and Q  Term weights  Boolean (e.g. term in document or not); “bag of words”  Real-valued (e.g. freq term in doc; relative to all docs) ...
  • 32. Practical Issues Tokenization  Convert document to word counts  word token = “any nonempty sequence of characters”  for HTML (etc) need to remove formatting Canonical forms, Stopwords, Stemming  Remove capitalization  Stopwords  Remove very frequent words (a, the, and…) – can use standard list  Can also remove very rare words  Stemming (next slide) Data representation  E.g., 3 column: <docid termid position>  Inverted index (faster)  List of sorted <termid docid> pairs: useful for finding docs containing certain terms  Equivalent to a sparse representation of term x doc matrix
  • 33. Intelligent Information Retrieval  Meaning of words  Synonyms “buy” / “purchase”  Ambiguity “bat” (baseball vs. mammal)  Order of words in the query  Hot dog stand in the amusement park  Hot amusement stand in the dog park
  • 34. Key Word Search  The technical goal for prediction is to classify new, unseen documents  The Prediction and IR are unified by the computation of similarity of documents  IR based on traditional keyword search through a search engine  So we should recognize that using a search engine is a special instance of prediction concept
  • 35. Key Word Search  We enter a key words to a search engine and expect relevant documents to be returned  These key words are words in a dictionary created from the document collection and can be viewed as a small document  So, we want to measuring how similar the new document (query) is to the documents in the collection
  • 36. Key Word Search  So, the notion of similarity is reduced to finding documents with the same keywords as posed to the search engine  But, the objective of the search engine is to rank the documents, not to assign a label  So we need additional techniques to break the expected ties (all retrieved documents match the search criteria)
  • 37. Key Word Search  In full text retrieval, all the words in each document are considered to be keywords.  We use the word term to refer to the words in a document  Information-retrieval systems typically allow query expressions formed using keywords and the logical connectives and, or, and not  Ands are implicit, even if not explicitly specified  Ranking of documents on the basis of estimated relevance to a query is critical  Relevance ranking is based on factors such as o Term frequency Frequency of occurrence of query keyword in document o Inverse document frequency How many documents the query keyword occurs in Fewer  give more importance to keyword o Hyperlinks to documents
  • 38. Relevance Ranking Using Terms TF-IDF (Term frequency/Inverse Document frequency) ranking:  Let n(d) = number of terms in the document d  n(d, t) = number of occurrences of term t in the document d.  Relevance of a document d to a term t  The log factor is to avoid excessive weight to frequent terms  Relevance of document to query Q n(d) n(d, t) 1 +TF (d, t) = log r (d, Q) =  TF (d, t) n(t)tQ
  • 39. Relevance Ranking Using Terms  Most systems add to the above model  Words that occur in title, author list, section headings, etc. are given greater importance  Words whose first occurrence is late in the document are given lower importance  Very common words such as “a”, “an”, “the”, “it” etc are eliminated Called stop words  Proximity: if keywords in query occur close together in the document, the document has higher importance than if they occur far apart  Documents are returned in decreasing order of relevance score  Usually only top few documents are returned, not all
  • 40. Similarity Based Retrieval  Similarity based retrieval - retrieve documents similar to a given document  Similarity may be defined on the basis of common words E.g. find k terms in A with highest TF (d, t ) / n (t ) and use these terms to find relevance of other documents.  Relevance feedback: Similarity can be used to refine answer set to keyword query  User selects a few relevant documents from those retrieved by keyword query, and system finds other documents similar to these  Vector space model: define an n-dimensional space, where n is the number of words in the document set.  Vector for document d goes from origin to a point whose i th coordinate is TF (d,t ) / n (t )  The cosine of the angle between the vectors of two
  • 41. Relevance Using Hyperlinks  Number of documents relevant to a query can be enormous if only term frequencies are taken into account  Using term frequencies makes “spamming” easy  E.g. a travel agency can add many occurrences of the words “travel” to its page to make its rank very high  Most of the time people are looking for pages from popular sites  Idea: use popularity of Web site (e.g. how many people visit it) to rank site pages that match given keywords
  • 42. Relevance Using Hyperlinks  Solution: use the number of hyperlinks to a site as a measure of the popularity or prestige of the site  Count only one hyperlink from each site  The popularity measure is for the site, not for an individual page; but most hyperlinks are to the root of a site. Also, the concept of “site” is difficult to define, since a URL prefix like cs.yale.edu contains many unrelated pages of varying popularity  Refinements  When computing prestige based on links to a site, give more weight to links from sites that themselves have higher prestige; the definition is circular, so set up and solve a system of simultaneous linear equations
  • 43. Relevance Using Hyperlinks  Connections to social-networking theories that rank the prestige of people  E.g. the president of the U.S.A. has high prestige since many people know him  Someone known by multiple prestigious people has high prestige  Hub and authority based ranking  A hub is a page that stores links to many pages (on a topic)  An authority is a page that contains actual information on a topic  Each page gets a hub prestige based on the prestige of the authorities that it points to  Each page gets an authority prestige based on the prestige of the hubs that point to it  Again, the prestige definitions are cyclic, and can be obtained by solving a system of simultaneous linear equations
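As a rough illustration of the hub/authority idea, here is a small iterative sketch in Python; the link graph (p1..p4) is invented, and the normalization step is one common way to keep the cyclic definitions from diverging:

```python
import math

# Made-up link graph: page -> pages it points to.
links = {"p1": ["p3", "p4"], "p2": ["p3"], "p3": [], "p4": ["p3"]}
hub = {p: 1.0 for p in links}
auth = {p: 1.0 for p in links}

for _ in range(20):                       # iterate until scores stabilize
    # authority score: sum of hub scores of pages that point to the page
    auth = {p: sum(hub[q] for q in links if p in links[q]) for p in links}
    # hub score: sum of authority scores of the pages the page points to
    hub = {p: sum(auth[t] for t in links[p]) for p in links}
    # normalize so the scores do not grow without bound
    for scores in (auth, hub):
        norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
        for p in scores:
            scores[p] /= norm

print("authorities:", {p: round(v, 2) for p, v in auth.items()})
print("hubs:       ", {p: round(v, 2) for p, v in hub.items()})
```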
  • 44. Nearest-Neighbor Methods  A method that compares vectors and measures similarity  In Prediction: the NNMs will collect the K most similar documents and then look at their labels  In IR: the NNMs will determine whether a satisfactory response to the search query has been found
  • 45. Measuring Similarity  These measures are used to examine how similar documents are, and the output is a numerical measure of similarity  Three increasingly complex measures:  Shared Word Count  Word Count and Bonus  Cosine Similarity
  • 46. Shared Word Count  Counts the shared words between documents  The words:  In IR we have a global dictionary where all potential words are included, with the exception of stopwords  In Prediction it is better to preselect the dictionary relative to the label
  • 47. Computing similarity by Shared words  Look at all the words in the new document  For each document in the collection, count how many of these words appear  No weighting is used, just a simple count  The dictionary has true key words (weak words removed)  The results of this measure are clearly intuitive  No one will question why a document was retrieved
  • 48. Computing similarity by Shared words  Each document is represented as a vector of key words (zeros and ones)  The similarity of 2 documents is the inner (dot) product of the 2 vectors  If 2 documents share a key word then this word is counted (1*1)  The performance of this measure depends mainly on the dictionary used
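A tiny sketch of this dot-product count, using a hypothetical five-word dictionary:

```python
# Shared Word Count: documents as 0/1 keyword vectors; similarity = dot product.
def shared_word_count(vec_a, vec_b):
    """Number of dictionary words occurring in both documents."""
    return sum(a * b for a, b in zip(vec_a, vec_b))

# Hypothetical 5-word dictionary; 1 means the word appears in the document.
doc   = [1, 0, 1, 1, 0]
query = [1, 1, 1, 0, 0]
print(shared_word_count(doc, query))   # -> 2 shared keywords
```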
  • 49. Computing similarity by Shared words  Shared words is an exact search  Either a document is retrieved or it is not  No weighting can be done on terms  In the query “A and B”, you can’t specify that A is more important than B  Every retrieved document is treated equally
  • 50. Word Count and Bonus  TF – term frequency  Number of times a term occurs in a document  DF – document frequency  Number of documents that contain the term  IDF – inverse document frequency  = log(N / df)  N: the total number of documents  A vector is a numerical representation for a point in a multi-dimensional space  (x1, x2, … , xn)  The dimensions of the space need to be defined  A measure on the space needs to be defined
  • 51. Word Count and Bonus  Each indexing term is a dimension  Each document is a vector Di = (ti1, ti2, ti3, ..., tik)  Similarity of document Di to the new document is defined as Sim(Di) = Σ j=1..K w(j) * (1 + 1/df(j)), where w(j) = 1 if word j occurs in both documents and 0 otherwise, and K = number of words in the dictionary
  • 52. Word Count and Bonus  The bonus 1/df(j) is a variant of idf; thus, if the word occurs in many documents, the bonus is small  This measure is better than the Shared Word Count because it discriminates between weakly and strongly predictive words
  • 53. Word Count and Bonus  Computing Similarity Scores with Bonus  A document space is defined by five terms: hardware, software, user, information, index  The query is “hardware, user, information”  Labeled spreadsheet (document vectors): 10101, 11000, 00010, 10001, 00100, 01010, 11001  New document (query) vector: 1101  Similarity scores with bonus for the seven documents: 2.83, 1.33, 0, 1.33, 1.5, 1.33, 2.67
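A small sketch of the bonus measure over a hypothetical collection of 0/1 keyword vectors; each shared word j contributes 1 + 1/df(j), as defined above:

```python
# Word Count and Bonus: each shared word j contributes 1 + 1/df(j),
# so rare (more predictive) words count for more than common ones.
def bonus_similarity(doc_vec, query_vec, df):
    return sum(1 + 1 / df[j]
               for j, (d, q) in enumerate(zip(doc_vec, query_vec)) if d and q)

# Hypothetical collection of 0/1 keyword vectors over a 5-word dictionary.
collection = [
    [1, 0, 1, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 0],
]
# df(j): number of documents in the collection containing word j.
df = [sum(doc[j] for doc in collection) for j in range(5)]
query = [1, 0, 1, 1, 0]
for i, doc in enumerate(collection):
    print(f"doc {i}: {bonus_similarity(doc, query, df):.2f}")
```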
  • 54. Cosine Similarity The Vector Space A document is represented as a vector: (W1, W2, … , Wn)  Binary:  Wi = 1 if the corresponding term is in the document  Wi = 0 if the term is not in the document  TF (Term Frequency):  Wi = tfi, where tfi is the number of times the term occurred in the document  TF*IDF (with Inverse Document Frequency):  Wi = tfi * idfi = tfi * (1 + log(N / dfi)), where dfi is the number of documents containing term i and N is the total number of documents in the collection
  • 55. Cosine Similarity The Vector Space  vec(D) = (w1, w2, ..., wt)  Sim(d1, d2) = cos(θ) = (vec(d1) · vec(d2)) / (|d1| * |d2|) = Σ j wd1(j) * wd2(j) / (|d1| * |d2|)  Since w(j) > 0 whenever j ∈ di, we have 0 <= Sim(d1, d2) <= 1  A document is retrieved even if it matches the query terms only partially
  • 56. Cosine Similarity  How to compute the weight wj?  A good weight must take into account two effects:  quantification of intra-document contents (similarity): the tf factor, the term frequency within a document  quantification of inter-document separation (dissimilarity): the idf factor, the inverse document frequency  wj = tf(j) * idf(j)
  • 57. Cosine Similarity  TF in the given document shows how important the term is in this document (it makes words that are frequent in the document more important)  IDF makes words that are rare across all documents more important  A high weight in a tf-idf ranking scheme is therefore reached by a high term frequency in the given document and a low document frequency of the term across the rest of the collection  Term weights in a document affect the position of the document vector
  • 58. Cosine Similarity TF-IDF definitions:  fik: number of occurrences of term ti in document Dk  tfik = fik / max(fik): normalized term frequency  dfk: number of documents which contain tk  idfk = log(N / dfk), where N is the total number of documents  wik = tfik * idfk: term weight  Intuition: rare words get more weight, common words less weight
  • 59. Example TF-IDF  Given a document containing terms with the given frequencies: Kent = 3; Ohio = 2; University = 1, and assume a collection of 10,000 documents in which the document frequencies of these terms are: Kent = 50; Ohio = 1300; University = 250. THEN (using the natural log) Kent: tf = 3/3; idf = log(10000/50) = 5.3; tf-idf = 5.3 Ohio: tf = 2/3; idf = log(10000/1300) = 2.0; tf-idf = 1.3 University: tf = 1/3; idf = log(10000/250) = 3.7; tf-idf = 1.2
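The slide's arithmetic can be checked with a few lines of Python (assuming tf is normalized by the largest raw frequency in the document and idf uses the natural log); the results agree up to rounding (Ohio comes out as roughly 1.3–1.4):

```python
import math

N = 10_000
freq = {"Kent": 3, "Ohio": 2, "University": 1}   # raw term frequencies
df   = {"Kent": 50, "Ohio": 1300, "University": 250}  # document frequencies

for term, f in freq.items():
    tf = f / max(freq.values())          # normalized term frequency
    idf = math.log(N / df[term])         # inverse document frequency
    print(f"{term:10s} tf={tf:.2f} idf={idf:.1f} tf-idf={tf * idf:.1f}")
```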
  • 60. Cosine Similarity  Cosine  W(j) = tf(j) * idf(j)  idf(j) = log(N / df(j))  cos(d1, d2) = Σ j wd1(j) * wd2(j) / ( sqrt(Σ j wd1(j)²) * sqrt(Σ j wd2(j)²) )
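A compact sketch of cosine similarity over tf*idf weights, following the idf definition on this slide; the toy documents and query are invented for illustration:

```python
import math
from collections import Counter

def tfidf_vector(tokens, docs):
    """w(j) = tf(j) * idf(j), with idf(j) = log(N / df(j))."""
    N = len(docs)
    return {t: tf * math.log(N / sum(1 for d in docs if t in d))
            for t, tf in Counter(tokens).items()}

def cosine(v1, v2):
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

docs = ["data mining of text data".split(),
        "text retrieval and text mining".split(),
        "databases store structured records".split()]
query = "text mining".split()          # query terms assumed to occur in the collection
qvec = tfidf_vector(query, docs)
for d in docs:
    print(round(cosine(qvec, tfidf_vector(d, docs)), 3), " ".join(d))
```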
  • 61. Why Mine the Web? Enormous wealth of textual information on the Web.  Book/CD/Video stores (e.g., Amazon)  Restaurant information (e.g., Zagats)  Car prices (e.g., Carpoint) Lots of data on user access patterns  Web logs contain sequences of URLs accessed by users Possible to retrieve “previously unknown” information  People who ski also frequently break their legs.  Restaurants that serve seafood in California are likely to be outside San Francisco
  • 62. Mining the Web  Diagram: a Web spider crawls the documents source; the IR / IE system takes a query and returns ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...)
  • 63. Web-based Retrieval Additional information in Web documents  Link structure (e.g., PageRank)  HTML structure  Link/anchor text  Title text, etc.  Can be leveraged for better retrieval Additional issues in Web retrieval  Scalability: the size of the “corpus” is huge (10 to 100 billion docs)  Constantly changing:  Crawlers to update document-term information  Need schemes for efficiently updating indices  Evaluation is more difficult:  How is relevance measured?  How many documents in total are relevant?
  • 64. Probabilistic Approaches to Retrieval Compute P(q | d) for each document d  Intuition: the relevance of d to q is related to how likely it is that q was generated by d, or “how likely is q under a model for d?” Simple model for P(q | d)  Pe(q | d) = empirical frequency of words in document d  “tuned” to d, but likely to be sparse (will contain many zeros) 2-stage probabilistic model (or linear interpolation model)  P(q | d) = λ Pe(q | d) + (1 - λ) Pe(q | corpus)  λ can be fixed, e.g., tuned to a particular data set  Or it can depend on d, e.g., λ = nd / (nd + m), where nd = number of words in doc d and m = a constant (e.g., 1000) Can also use more sophisticated models for P(q | d), e.g., topic-based models
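A rough sketch of the interpolated query-likelihood model under these definitions; the documents, query, and the small value of m are invented to suit the toy collection:

```python
from collections import Counter

def query_likelihood(query_tokens, doc_tokens, corpus_tokens, m=1000):
    """P(q|d) = product over query words of [lam*Pe(w|d) + (1-lam)*Pe(w|corpus)],
    with lam = n_d / (n_d + m)."""
    n_d = len(doc_tokens)
    lam = n_d / (n_d + m)
    doc_counts, corpus_counts = Counter(doc_tokens), Counter(corpus_tokens)
    p = 1.0
    for w in query_tokens:
        p_doc = doc_counts[w] / n_d                      # Pe(w | d)
        p_corpus = corpus_counts[w] / len(corpus_tokens) # Pe(w | corpus)
        p *= lam * p_doc + (1 - lam) * p_corpus
    return p

docs = ["text mining finds patterns in text".split(),
        "databases answer structured queries".split()]
corpus = [w for d in docs for w in d]
query = "text patterns".split()
for d in docs:
    print(round(query_likelihood(query, d, corpus, m=10), 6), " ".join(d))
```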
  • 65. Information Retrieval  Web-Based Document Search  Page Rank  Anchor Text  Document Matching  Inverted Lists
  • 66. Page Rank  PR(A): the page rank of page A  C(T): the number of outgoing links from page T  d: minimum value assigned to any page  Tj: a page pointing to A  PR(A) = d + (1 - d) * Σ j ( PR(Tj) / C(Tj) )
  • 67. Algorithm of Page Rank 1. Use the PageRank Equation to compute PageRank for each page in the collection using latest PageRanks of pages. 2. Repeat step 1 until no significant change to any PageRank.
  • 68. Example Initial values: PR(A)=PR(B)=PR(C)=1, d=0.1 In the first iteration:  PR(A)=0.1+0.9*(PR(B)+PR(C)) =0.1+0.9*(1+1) =1.9  PR(B)=0.1+0.9*(PR(A)/2) =0.1+0.9*(1.9/2) =0.95  PR(C)=0.1+0.9*(PR(A)/2) =0.1+0.9*(1.9/2) =0.95 After convergence: PR(A)=1.48, PR(B)=0.76, PR(C)=0.76
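The same example can be reproduced with a short Python loop (link structure assumed from the example: A links to B and C, and B and C each link back to A); the scores settle near the converged values on the slide:

```python
# PageRank iteration for the slide's 3-page example with d = 0.1.
links_to = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}   # outgoing links per page
d = 0.1
pr = {p: 1.0 for p in links_to}                        # initial PageRanks

for _ in range(50):                                    # repeat until values stabilize
    for page in links_to:
        incoming = [q for q, outs in links_to.items() if page in outs]
        # PR(page) = d + (1 - d) * sum of PR(T)/C(T) over pages T linking here,
        # using the latest PageRanks already computed in this sweep.
        pr[page] = d + (1 - d) * sum(pr[q] / len(links_to[q]) for q in incoming)

print({p: round(v, 2) for p, v in pr.items()})         # roughly A: 1.47, B: 0.76, C: 0.76
```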
  • 69. Anchor Text  The anchor text is the visible, clickable text in a hyperlink.  For example: <a href=“http://www.wikipedia.org”>Wikipedia</a>  The anchor text is Wikipedia; the complex URL http://www.wikipedia.org/ displays on the web page simply as Wikipedia, contributing to a clean, easy-to-read text or document.
  • 70. Anchor Text  Anchor text usually gives the user relevant descriptive or contextual information about the content of the link’s destination.  The anchor text may or may not be related to the actual text of the URL of the link.  The words contained in the Anchor Text can determine the ranking that the page will receive by search engines.
  • 71. Common Misunderstanding  Webmasters sometimes tend to misunderstand anchor text.  Instead of turning appropriate words inside a sentence into a clickable link, webmasters frequently insert extra text such as “click here”.
  • 72. Anchor Text  This proper method of linking is beneficial not only to users, but also to the webmasters as anchor text holds significant weight in search engine ranking.  Most search engine optimization experts recommend against using “click here” to designate a link.
  • 73. Document Matching  An arbitrarily long document is the query, not just a few key words.  But the goal is still to rank and output an ordered list of relevant documents.  The most similar documents are found using the measures described earlier.  Search engines and document matchers are not focused on classification of new documents.  Their primary goal is to retrieve the most relevant documents.
  • 74. Generalization of searching • Matching a document to a collection of documents looks like a tedious and expensive operation. • Even for a short query, comparison to every document in a large collection implies a relatively intensive computation.
  • 75. Example of document matching  Consider an online help desk, where a complete description of a problem is submitted.  That document could be matched to stored documents, hopefully finding descriptions of similar problems and solutions without having the user experiment with numerous key word searches.
  • 76. Inverted Lists  Instead of documents pointing to words, a list of words pointing to documents is the primary internal representation for processing queries and matching documents.
  • 77. Inverted Lists  The inverted list is the key to the efficiency of information retrieval systems.  The inverted list has also helped make nearest-neighbor methods a practical possibility for prediction.
  • 78. Example If the query contains words 100 and 200: 1) First process the inverted list W(100), updating the similarity score S(i) of each document i on the list: S(1)=0+1, S(2)=0+1, … 2) Then process W(200) in the same way: S(2)=1+1, …
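A minimal sketch of an inverted list and the query-scoring loop above, with a made-up three-document collection:

```python
from collections import defaultdict

# Build an inverted list: word -> set of document ids containing it.
docs = {1: "text mining and retrieval".split(),
        2: "mining massive data sets".split(),
        3: "relational database systems".split()}

inverted = defaultdict(set)
for doc_id, tokens in docs.items():
    for w in tokens:
        inverted[w].add(doc_id)

def score(query_tokens):
    """Walk only the inverted lists of the query words; documents sharing
    no word with the query are never touched."""
    s = defaultdict(int)
    for w in query_tokens:
        for doc_id in inverted.get(w, ()):
            s[doc_id] += 1
    return dict(s)

print(score("data mining".split()))   # doc 2 scores 2, doc 1 scores 1
```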
  • 79. Evaluating IE Accuracy  Always evaluate performance on independent, manually-annotated test data not used during system development.  Measure for each test document:  Total number of correct extractions in the solution template: N  Total number of slot/value pairs extracted by the system: E  Number of extracted slot/value pairs that are correct (i.e. in the solution template): C  Compute the average value of metrics adapted from IR:  Recall = C/N  Precision = C/E  F-Measure = harmonic mean of recall and precision
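A small sketch of these metrics on hypothetical slot/value pairs:

```python
def evaluate_extractions(system_pairs, gold_pairs):
    """Recall = C/N, Precision = C/E, F = harmonic mean of the two."""
    correct = len(set(system_pairs) & set(gold_pairs))                     # C
    precision = correct / len(system_pairs) if system_pairs else 0.0       # C/E
    recall = correct / len(gold_pairs) if gold_pairs else 0.0              # C/N
    f = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f

# Hypothetical slot/value pairs for one test document.
gold   = [("speaker", "J. Smith"), ("time", "3pm"), ("room", "EA170")]
system = [("speaker", "J. Smith"), ("time", "4pm")]
print(evaluate_extractions(system, gold))   # (0.5, 0.333..., 0.4)
```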
  • 80. Related Types of Data Sparse high-dimensional data sets with counts, like document-term matrices, are common in data mining, e.g.,  “transaction data”  Rows = customers; Columns = products  Web log data (ignoring sequence)  Rows = Web surfers; Columns = Web pages Recommender systems  Given some products from user i, suggest other products to the user  e.g., Amazon.com’s book recommender  Collaborative filtering:  use k-nearest-individuals as the basis for predictions  Many similarities with querying and information retrieval
  • 81. What is a Good IR System?  Minimize the overhead of a user locating needed information  Fast, accurate, comprehensive, easy to use, …  Objective measures  Precision: P = (No. of relevant documents retrieved) / (No. of all documents retrieved)  Recall: R = (No. of relevant documents retrieved) / (No. of all relevant documents in the data)
  • 82. Measuring Retrieval Effectiveness  Information-retrieval systems save space by using index structures that support only approximate retrieval. This may result in:  false negative (false drop) - some relevant documents may not be retrieved  false positive - some irrelevant documents may be retrieved For many applications a good index should not permit any false drops, but may permit a few false positives.  Relevant performance metrics:  precision - what percentage of the retrieved documents are relevant to the query  recall - what percentage of the documents relevant to the query were retrieved
  • 83. Measuring Retrieval Effectiveness Recall vs. precision tradeoff:  Can increase recall by retrieving many documents (down to a low level of relevance ranking), but many irrelevant documents would be fetched, reducing precision Measures of retrieval effectiveness:  Recall as a function of number of documents fetched, or  Precision as a function of recall Equivalently, as a function of number of documents fetched  E.g. “precision of 75% at recall of 50%, and 60% at a recall of 75%”
  • 84. Applications of Information Retrieval  Classic application  Library catalogue, e.g. the UofC library catalogue  Current applications  Digital libraries, e.g. http://www.acm.org/dl  WWW search engines, e.g. http://www.google.com
  • 85. Other applications of IE Systems  Job resumes  Seminar announcements  Molecular biology information from MEDLINE, e.g., extracting gene–drug interactions from biomedical texts  Summarizing medical patient records by extracting diagnoses, symptoms, physical findings, test results  Gathering earnings, profits, board members, etc. [corporate information] from the web and company reports  Verification of construction industry specifications documents (are the quantities correct/reasonable?)  Extraction of political/economic/business changes from newspaper articles
  • 86. Conclusion 1. Information retrieval methods are specialized nearest-neighbor methods, which are well-known prediction methods. 2. IR methods typically process unlabeled data and order and display the retrieved documents. 3. The IR methods have no training and induce no new rules for classification.