Info Retrieval
Aarti Dharmani
What do these images tell us?
Information Retrieval
• It is the process of obtaining information system resources that are
relevant to an information need from a collection of those resources.
• It is the science of searching for information in a document,
searching for documents themselves, and also searching for the
metadata that describes data, and for databases of texts, images or
sounds.
• In simple words, it works to sort and rank documents based on the
queries of a user.
• An information retrieval process begins when a user or searcher
enters a query into the system.
• Queries are formal statements of information needs, for example
search strings in web search engines.
• In information retrieval a query does not uniquely identify a single
object in the collection.
• Instead, several objects may match the query, perhaps with different
degrees of relevance.
Model Types
• To retrieve relevant documents effectively, IR strategies typically
transform the documents into a suitable representation.
• Each retrieval strategy incorporates a specific model for representing
documents.
• The models are categorized according to two dimensions: the
mathematical basis and the properties of the model.
First dimension: mathematical basis
• Set-theoretic models represent documents as sets of words or phrases.
Similarities are usually derived from set-theoretic operations on those sets.
Egs: Standard Boolean model, Extended Boolean model, Fuzzy retrieval
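To make the set-theoretic idea concrete, here is a minimal sketch of standard Boolean retrieval. The toy documents, the helper names `build_index` and `postings`, and the queries are all illustrative assumptions, not part of the slides. Each term maps to the set of documents containing it, and AND, OR, NOT become set intersection, union and difference.

```python
# Toy collection; documents and queries are illustrative assumptions.
docs = {
    1: "information retrieval ranks documents",
    2: "boolean retrieval uses set operations",
    3: "vector models rank documents by similarity",
}

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index

index = build_index(docs)

def postings(term):
    """Return the set of document ids for a term (empty set if unseen)."""
    return index.get(term, set())

# "retrieval AND documents" -> set intersection
and_result = postings("retrieval") & postings("documents")
# "retrieval OR similarity" -> set union
or_result = postings("retrieval") | postings("similarity")
# "documents AND NOT retrieval" -> set difference
not_result = postings("documents") - postings("retrieval")

print(and_result, or_result, not_result)
```

Note that a document either matches or does not: the standard Boolean model gives no degrees of relevance, which is one motivation for the extended and fuzzy variants.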
• Algebraic models represent documents and queries usually as vectors,
matrices, or tuples. The similarity of the query vector and document vector is
represented as a scalar value.
Egs: Vector space model, Generalized vector space model, Topic-based
Vector Space Model, Extended Boolean model, Latent semantic indexing
(a.k.a. latent semantic analysis)
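As a rough illustration of the algebraic view, the sketch below represents the query and each document as a raw term-frequency vector and uses cosine similarity as the scalar score. The function names (`tf_vector`, `cosine`, `rank`), the whitespace tokenizer and the toy documents are assumptions made for this example; real systems usually add weighting such as tf-idf.

```python
import math
from collections import Counter

def tf_vector(text):
    """Represent a text as a sparse term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query, docs):
    """Return (doc_index, score) pairs, highest score first."""
    q = tf_vector(query)
    scored = [(i, cosine(q, tf_vector(d))) for i, d in enumerate(docs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy collection and query, chosen only for illustration.
docs = [
    "information retrieval ranks documents",
    "boolean retrieval uses set operations",
    "vector models rank documents by similarity",
]
ranking = rank("rank documents", docs)
for doc_index, score in ranking:
    print(doc_index, round(score, 3))
```

Because there is no stemming, "ranks" and "rank" count as different terms here, so the third document scores highest for this query.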
• Probabilistic models treat the process of document retrieval as probabilistic
inference. Similarities are computed as probabilities that a document is
relevant for a given query. Theorems from probability, such as Bayes' theorem,
are often used in these models.
Egs: Binary Independence Model, Probabilistic relevance model (the basis of
the Okapi BM25 relevance function), Uncertain inference, Language
model, Divergence-from-randomness model, Latent Dirichlet allocation
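As one example from this family, the following is a minimal sketch of BM25 scoring. The parameter defaults `k1=1.5` and `b=0.75` are commonly used values assumed here, the IDF uses the smoothed, non-negative form `ln(1 + (N - n + 0.5) / (n + 0.5))`, and the function name, tokenizer and toy documents are illustrative only. It assumes a non-empty collection.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score every document against the query terms with BM25.

    k1 controls term-frequency saturation, b controls length normalization.
    Both defaults are assumed, commonly used values.
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n_docs  # average document length

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for d in tokenized:
        df.update(set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue  # term absent from this document contributes nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            denom = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

# Toy collection and query, chosen only for illustration.
docs = [
    "information retrieval ranks documents",
    "boolean retrieval uses set operations",
    "vector models rank documents by similarity",
]
scores = bm25_scores(["retrieval", "documents"], docs)
print([round(s, 3) for s in scores])
```

The first document contains both query terms and scores highest. The other two each contain one term with the same document frequency, so the shorter document wins through length normalization.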
• Feature-based retrieval models view documents as vectors of values of
feature functions (or just features) and seek the best way to combine these
features into a single relevance score, typically by learning-to-rank methods.
Feature functions are arbitrary functions of document and query, and as such
can easily incorporate almost any other retrieval model as just another
feature.
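The feature-based view can be sketched as a weighted sum of feature functions. Everything below is hypothetical: the three features, the hand-picked `WEIGHTS`, and the toy documents are assumptions for illustration. In a learning-to-rank setting the weights (or a more complex combining function) would be learned from relevance judgments rather than set by hand.

```python
def features(query, doc):
    """Compute a small, hypothetical feature vector for a (query, document) pair."""
    q_terms = set(query.lower().split())
    d_terms = doc.lower().split()
    # Number of document tokens that are query terms.
    overlap = sum(1 for t in d_terms if t in q_terms)
    # Fraction of distinct query terms that appear in the document.
    coverage = len(q_terms & set(d_terms)) / len(q_terms) if q_terms else 0.0
    # Mild preference for shorter documents.
    brevity = 1.0 / (1 + len(d_terms))
    return [overlap, coverage, brevity]

# Hypothetical hand-picked weights; a learning-to-rank method would fit these.
WEIGHTS = [0.5, 2.0, 1.0]

def score(query, doc, weights=WEIGHTS):
    """Combine the features into a single relevance score (linear model)."""
    return sum(w * f for w, f in zip(weights, features(query, doc)))

# Toy collection and query, chosen only for illustration.
docs = [
    "information retrieval ranks documents",
    "boolean retrieval uses set operations",
    "vector models rank documents by similarity",
]
query = "rank documents"
ranked = sorted(range(len(docs)), key=lambda i: score(query, docs[i]), reverse=True)
print(ranked)
```

Any other retrieval model's score, for example a cosine similarity or a BM25 value, could be appended to the feature list in the same way.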
Is there a difference between Information
Extraction and Information Retrieval?
Issues with IR systems
• Query evaluation
- Uncertainty
- Vagueness
• Ambiguity