The vector space model

Aug 30, 2015

The document discusses the vector space model for representing text documents and queries in information retrieval systems. It describes how documents and queries are represented as vectors of term weights, with each term being assigned a weight based on its frequency in the document or query. The vector space model allows documents and queries to be compared by calculating the similarity between their vector representations. Terms that are more frequent in a document and less frequent overall are given higher weights through techniques like TF-IDF weighting. This vector representation enables efficient retrieval of documents ranked by similarity to the query.

The Vector space model
Submitted By –
Deeksha Agarwal
Semester 5th
University of Allahabad

Boolean Model Disadvantages
• Similarity function is Boolean
– Exact-match only, no partial matches
– Retrieved documents are not ranked
• All terms are equally important
– Boolean operator usage has much more influence than a critical word
• Query language is expressive but complicated

Statistical Models
• A document is typically represented by a bag
of words (unordered words with frequencies).
• Bag = set that allows multiple occurrences of
the same element.
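
A bag of words can be sketched with Python's `collections.Counter`, which behaves as a multiset. The function name `bag_of_words` and the lowercase, whitespace-based tokenization are assumptions made for illustration; the slides do not prescribe a tokenizer.

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Return an unordered multiset of lowercase, whitespace-separated tokens."""
    return Counter(text.lower().split())

bag = bag_of_words("the cat sat on the mat")
print(bag["the"])  # frequency of "the" in the bag
print(bag["dog"])  # absent terms count as 0
```

Word order is discarded, so "cat sat" and "sat cat" produce the same bag.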

Statistical Retrieval
• Retrieval based on similarity between query and
documents.
• Output documents are ranked according to
similarity to query.
• Similarity based on occurrence frequencies of
keywords in query and document.
• Automatic relevance feedback can be supported:
– Relevant documents “added” to query.
– Irrelevant documents “subtracted” from query.
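
The "added" and "subtracted" wording can be read literally as vector arithmetic on the query. The sketch below is an unweighted version written under that assumption; the slides give no formula, the function name `apply_feedback` and the example vectors are hypothetical, and practical schemes such as Rocchio's weight each group of documents rather than summing them directly.

```python
def apply_feedback(query, relevant, irrelevant):
    """Add relevant document vectors to the query and subtract irrelevant ones."""
    updated = list(query)
    for doc in relevant:
        updated = [q + w for q, w in zip(updated, doc)]
    for doc in irrelevant:
        updated = [q - w for q, w in zip(updated, doc)]
    return updated

# Illustrative three-term vectors (not taken from the slides).
new_query = apply_feedback([1, 0, 1], relevant=[[2, 1, 0]], irrelevant=[[0, 0, 1]])
print(new_query)
```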

The Vector-Space Model
• Documents and queries are both vectors
• Each term i in a document or query j is given a real-valued weight wij.
• Both documents and queries are expressed as t-dimensional vectors:
dj = (w1j, w2j, …, wtj)

Graphic Representation
Example:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3
[Figure: D1, D2 and Q drawn as vectors in a three-dimensional term space with axes T1, T2 and T3.]
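
To make the geometry concrete, the example vectors can be compared with the query numerically. This slide does not name a similarity measure, so the inner product and cosine similarity used below are assumptions (both are common choices for the vector space model), and the function names are illustrative.

```python
import math

def inner_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )

D1 = [2, 3, 5]
D2 = [3, 7, 1]
Q = [0, 0, 2]

print(inner_product(D1, Q), inner_product(D2, Q))
print(round(cosine(D1, Q), 3), round(cosine(D2, Q), 3))
```

Under either measure D1 scores higher than D2 for this query, because only T3 appears in Q and D1 carries far more weight on T3.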

Document Collection
• A collection of n documents can be represented in the vector
space model by a term-document matrix.
• An entry in the matrix corresponds to the “weight” of a term in
the document; zero means the term has no significance in the
document or it simply doesn’t exist in the document.
        T1    T2    …    Tt
D1      w11   w21   …    wt1
D2      w12   w22   …    wt2
:       :     :          :
:       :     :          :
Dn      w1n   w2n   …    wtn
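
As a rough sketch, such a matrix can be built from raw term counts, with one row per document and one column per term as in the layout above. The two toy documents and the use of raw counts as weights are assumptions for illustration only.

```python
from collections import Counter

# Toy collection (hypothetical documents).
docs = ["apple banana apple", "banana cherry"]

bags = [Counter(d.split()) for d in docs]
vocab = sorted(set().union(*bags))  # one column per distinct term

# matrix[j][i] is the weight of term i in document j; 0 means the term is absent.
matrix = [[bag[term] for term in vocab] for bag in bags]

print(vocab)
for row in matrix:
    print(row)
```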

Term Weights: Term Frequency
• More frequent terms in a document are more
important, i.e. more indicative of the topic.
fij = frequency of term i in document j
• May want to normalize term frequency (tf) by
dividing by the frequency of the most
common term in the document:
tfij = fij / maxi{fij}
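
The normalization can be sketched in a few lines. The function name `normalized_tf` and the dictionary representation of a document's raw term frequencies are assumptions for illustration.

```python
def normalized_tf(freqs: dict) -> dict:
    """Divide each raw frequency by the frequency of the most common term."""
    max_f = max(freqs.values())
    return {term: f / max_f for term, f in freqs.items()}

tf = normalized_tf({"x": 4, "y": 2, "z": 1})
print(tf)
```

The most common term always receives tf = 1, and every other term falls between 0 and 1.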

Term Weights: Inverse Document Frequency
• Terms that appear in many different
documents are less indicative of overall topic.
dfi = document frequency of term i
= number of documents containing term i
idfi = inverse document frequency of term i
= log2(N / dfi)
(N: total number of documents)
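
A minimal sketch of the idf formula follows; the function name `idf` and the sample collection sizes are illustrative. Note that a term occurring in every document gets idf = 0, so it contributes nothing to a tf-idf weight.

```python
import math

def idf(n_docs: int, df: int) -> float:
    """Inverse document frequency: log base 2 of N / df."""
    return math.log2(n_docs / df)

print(idf(8, 2))    # term found in a quarter of the documents
print(idf(10, 10))  # term found in every document
```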

TF-IDF Weighting
• A typical combined term importance indicator
is tf-idf weighting:
wij = tfij × idfi = tfij × log2(N / dfi)
• A term occurring frequently in the document
but rarely in the rest of the collection is given
high weight.
• Many other ways of determining term weights
have been proposed.
• Experimentally, tf-idf has been found to work
well.

Computing TF-IDF -- An Example
Given a document containing terms with given frequencies:
A(3), B(2), C(1)
Assume collection contains 10,000 documents and document
frequencies of these terms are:
A(50), B(1300), C(250)
Then:
A: tf = 3/3; idf = log2(10000/50) ≈ 7.6; tf-idf ≈ 7.6
B: tf = 2/3; idf = log2(10000/1300) ≈ 2.9; tf-idf ≈ 2.0
C: tf = 1/3; idf = log2(10000/250) ≈ 5.3; tf-idf ≈ 1.8
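
The arithmetic above can be checked with a short script. It uses the frequencies and document frequencies given in the example, with tf normalized by the most frequent term in the document and idf taken as log base 2; the variable names are illustrative.

```python
import math

freqs = {"A": 3, "B": 2, "C": 1}            # term frequencies in the document
doc_freqs = {"A": 50, "B": 1300, "C": 250}  # documents containing each term
N = 10_000                                  # documents in the collection

max_f = max(freqs.values())
weights = {
    term: (f / max_f) * math.log2(N / doc_freqs[term])
    for term, f in freqs.items()
}

for term, w in weights.items():
    print(term, round(w, 1))
```

Although B occurs twice as often as C in the document, their weights are close because B appears in many more documents across the collection.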




The vector space model

1. The Vector Space Model
Submitted By – Deeksha Agarwal
Semester 5th
University of Allahabad

2. Boolean Model Disadvantages
• Similarity function is Boolean
– Exact-match only, no partial matches
– Retrieved documents not ranked
• All terms are equally important
– Boolean operator usage has much more influence than a critical word
• Query language is expressive but complicated

3. Statistical Models
• A document is typically represented by a bag of words (unordered words with frequencies).
• Bag = set that allows multiple occurrences of the same element.
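As a rough illustration of the bag-of-words idea on this slide, the sketch below counts words with Python's `collections.Counter`. The function name `bag_of_words` and the lowercase, whitespace-based tokenization are assumptions made for the example; the slides do not specify how text is split into words.

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Lowercase the text, split on whitespace, and count each word.

    Word order is discarded; only the frequencies are kept.
    """
    return Counter(text.lower().split())

bag = bag_of_words("the cat sat on the mat")
print(bag["the"])  # 2
print(bag["cat"])  # 1
```

Because a bag ignores order, "cat sat" and "sat cat" produce the same representation.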

4. Statistical Retrieval
• Retrieval based on similarity between query and documents.
• Output documents are ranked according to similarity to query.
• Similarity based on occurrence frequencies of keywords in query and document.
• Automatic relevance feedback can be supported:
– Relevant documents “added” to query.
– Irrelevant documents “subtracted” from query.
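The "added" and "subtracted" wording in the relevance feedback bullet can be read as plain vector arithmetic on term weights. The sketch below takes that reading literally; the function name `apply_feedback` and the example vectors are hypothetical, and the slides do not say how the contributions are weighted.

```python
def apply_feedback(query, relevant, irrelevant):
    """Add relevant document vectors to the query and subtract irrelevant ones.

    All vectors are lists of term weights of the same length.
    """
    updated = list(query)
    for doc in relevant:
        updated = [q + w for q, w in zip(updated, doc)]
    for doc in irrelevant:
        updated = [q - w for q, w in zip(updated, doc)]
    return updated

# Hypothetical term-weight vectors over three terms.
query = [0, 0, 2]
new_query = apply_feedback(query, relevant=[[2, 3, 5]], irrelevant=[[3, 7, 1]])
print(new_query)  # [-1, -4, 6]
```

The unscaled sum can produce negative weights, as it does here. Practical schemes typically scale each contribution and may clip negative values to zero, but that goes beyond what the slide states.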

5. The Vector-Space Model
• Documents and queries are both vectors
• Each term, i, in a document or query, j, is given a real-valued weight, wij.
• Both documents and queries are expressed as t-dimensional vectors:
dj = (w1j, w2j, …, wtj)

6. Graphic Representation
Example:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3
[Figure: D1, D2, and Q drawn as vectors on the axes T1, T2, and T3]
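To make the geometric picture concrete, the sketch below compares the example vectors numerically. This slide does not name a similarity measure, so the use of cosine similarity (the cosine of the angle between two vectors) is an assumption, as is the helper name `cosine`.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term-weight vectors.

    Assumes neither vector is all zeros.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Vectors from the slide, as weights on (T1, T2, T3).
D1 = [2, 3, 5]
D2 = [3, 7, 1]
Q = [0, 0, 2]

print(round(cosine(D1, Q), 2))  # 0.81
print(round(cosine(D2, Q), 2))  # 0.13
```

Under this measure D1 would rank above D2 for the query Q, since D1 carries much more weight on T3, the only term in the query.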

7. Document Collection
• A collection of n documents can be represented in the vector space model by a term-document matrix.
• An entry in the matrix corresponds to the “weight” of a term in the document; zero means the term has no significance in the document or it simply doesn’t exist in the document.

      T1    T2    …    Tt
D1    w11   w21   …    wt1
D2    w12   w22   …    wt2
:     :     :          :
Dn    w1n   w2n   …    wtn
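A small sketch of how such a matrix might be built follows, using raw counts as the weights. The two toy documents, the whitespace tokenization, and the variable names are assumptions for illustration only. The layout matches the slide: one row per document, one column per term.

```python
from collections import Counter

# Hypothetical two-document collection.
docs = {
    "D1": "apple banana apple",
    "D2": "banana cherry",
}

# Columns: every distinct term in the collection, in sorted order.
terms = sorted({word for text in docs.values() for word in text.split()})

# Rows: one weight per term for each document (raw counts here).
matrix = {}
for name, text in docs.items():
    counts = Counter(text.split())
    matrix[name] = [counts[term] for term in terms]

print(terms)         # ['apple', 'banana', 'cherry']
print(matrix["D1"])  # [2, 1, 0]
print(matrix["D2"])  # [0, 1, 1]
```

The zero entries mark terms that do not occur in a document, as the slide describes.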

8. Term Weights: Term Frequency
• More frequent terms in a document are more important, i.e. more indicative of the topic.
fij = frequency of term i in document j
• May want to normalize term frequency (tf) by dividing by the frequency of the most common term in the document:
tfij = fij / maxi{fij}
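The normalization tfij = fij / maxi{fij} can be sketched for a single document as below. The function name `normalized_tf` and the dictionary representation of the counts are assumptions made for the example.

```python
def normalized_tf(counts):
    """Divide each raw count by the count of the most common term."""
    max_count = max(counts.values())
    return {term: f / max_count for term, f in counts.items()}

# Raw frequencies of three terms in one document.
tf = normalized_tf({"A": 3, "B": 2, "C": 1})
print(tf["A"])            # 1.0
print(round(tf["B"], 2))  # 0.67
print(round(tf["C"], 2))  # 0.33
```

After normalization the most common term always has tf = 1, so long and short documents are put on a comparable scale.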

9. Term Weights: Inverse Document Frequency
• Terms that appear in many different documents are less indicative of overall topic.
dfi = document frequency of term i = number of documents containing term i
idfi = inverse document frequency of term i = log2(N / dfi)
(N: total number of documents)
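A minimal sketch of the idf formula follows, using `math.log2` from the standard library. The function name `idf` and the eight-document collection are hypothetical, chosen so the logarithms come out to whole numbers.

```python
import math

def idf(n_docs, doc_freq):
    """Inverse document frequency: log2(N / df)."""
    return math.log2(n_docs / doc_freq)

# In a hypothetical collection of 8 documents:
print(idf(8, 8))  # 0.0  (term appears in every document)
print(idf(8, 1))  # 3.0  (term appears in a single document)
```

A term found in every document gets weight zero, which matches the slide's point that widespread terms say little about the topic.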

10. TF-IDF Weighting
• A typical combined term importance indicator is tf-idf weighting:
wij = tfij · idfi = tfij · log2(N / dfi)
• A term occurring frequently in the document but rarely in the rest of the collection is given high weight.
• Many other ways of determining term weights have been proposed.
• Experimentally, tf-idf has been found to work well.

11. Computing TF-IDF -- An Example
Given a document containing terms with given frequencies:
A(3), B(2), C(1)
Assume collection contains 10,000 documents and document frequencies of these terms are:
A(50), B(1300), C(250)
Then:
A: tf = 3/3; idf = log2(10000/50) = 7.6; tf-idf = 7.6
B: tf = 2/3; idf = log2(10000/1300) = 2.9; tf-idf = 2.0
C: tf = 1/3; idf = log2(10000/250) = 5.3; tf-idf = 1.8
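The arithmetic in this example can be checked with a few lines of Python. The variable names are assumptions; the frequencies, document frequencies, and collection size are the ones given on the slide.

```python
import math

N = 10_000                                    # documents in the collection
freqs = {"A": 3, "B": 2, "C": 1}              # term frequencies in the document
doc_freqs = {"A": 50, "B": 1300, "C": 250}    # documents containing each term

max_f = max(freqs.values())
weights = {
    term: (f / max_f) * math.log2(N / doc_freqs[term])
    for term, f in freqs.items()
}

for term, w in weights.items():
    print(term, round(w, 1))
# A 7.6
# B 2.0
# C 1.8
```

Term B occurs more often in the document than C, yet ends up with a similar weight because it appears in far more documents across the collection.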

12. Thank You

Editor's Notes

#3:
1. Very rigid: AND means all; OR means any.
2. Difficult to express complex user requests.
3. Difficult to control the number of documents retrieved: all matched documents will be returned.
4. Difficult to rank output: all matched documents logically satisfy the query.
5. Difficult to perform relevance feedback: if a document is identified by the user as relevant or irrelevant, how should the query be modified?
#9: If a term t appears often in a document, then a query containing t should retrieve that document. Zipf’s law: term frequency ∝ 1/rank. Importance is inversely proportional to frequency of occurrence.
#12: tfij = fij / maxi{fij}
