Broad introduction to information retrieval and web search, used for teaching at the Yahoo Bangalore Summer School 2013. Slides are a mash-up from my own and other people's presentations.
INTRODUCTION TO INFORMATION RETRIEVAL
This lecture will introduce the information retrieval problem and its terminology, and trace the history of IR; in particular, it will discuss the history of the web and its impact on IR. Special attention and emphasis will be given to the concept of relevance in IR and the critical role it has played in the development of the subject. The lecture will end with a conceptual explanation of the IR process and its relationships with other domains, as well as current research developments.
INFORMATION RETRIEVAL MODELS
This lecture will present the models that have been used to rank documents according to their estimated relevance to user-given queries, where the most relevant documents are shown ahead of those less relevant. These models form the basis for many of the ranking algorithms used in past and present search applications. The lecture will describe models of IR such as Boolean retrieval, vector space, probabilistic retrieval, language models, and logical models. Relevance feedback, a technique that either implicitly or explicitly modifies user queries in light of the user's interaction with retrieval results, will also be discussed, as it is particularly relevant to web search and personalization.
2. Acknowledgements
• Many of these slides were taken from other presentations
  – P. Raghavan, C. Manning, H. Schütze IR lectures
  – Mounia Lalmas's personal stash
  – Other random slide decks
• Textbooks
  – Ricardo Baeza-Yates, Berthier Ribeiro-Neto
  – Raghavan, Manning, Schütze
  – … among other good books
• Many online tutorials, many online tools available (full toolkits)
3. Big Plan
• What is Information Retrieval?
  – Search engine history
  – Examples of IR systems (you might not have known!)
• Is IR hard?
  – Users and human cognition
  – What is it like to be a search engine?
• Web Search
  – Architecture
  – Differences between Web search and IR
  – Crawling
6. Information Retrieval
Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
(Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval)
7. Information Retrieval (II)
• What do we understand by documents? How do we decide what is a document and what is not?
• What is an information need? What types of information needs can we satisfy automatically?
• What is a large collection? Which environments are suitable for IR?
8. Basic assumptions of Information Retrieval
• Collection: A set of documents
  – Assume it is a static collection
• Goal: Retrieve documents with information that is relevant to the user's information need and helps the user complete a task
9. Key issues
• How to describe information resources or information-bearing objects in ways that they can be effectively used by those who need to use them?
  – Organizing / Indexing / Storing
• How to find the appropriate information resources or information-bearing objects for someone's (or your own) needs?
  – Retrieving / Accessing / Filtering
10. Unstructured data
Unstructured data?

A structured query over a relational table:

  SELECT * FROM HOTELS
  WHERE city = 'Bangalore' AND $$$ < 2

  CITY        $$$   NAME
  Bangalore   1.5   Cheapo one
  Barcelona   1     EvenCheapoer

The same need as an unstructured (keyword) query: "Cheap hotels in Bangalore"
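To make the slide's contrast concrete, here is a minimal Python sketch (the data, names, and descriptions are hypothetical, not from the slides): the structured query relies on an exact schema, while the keyword query is matched against free text, here by crude term overlap.

  # Hypothetical toy data for both sides of the contrast.
  hotels = [
      {"city": "Bangalore", "price": 1.5, "name": "Cheapo one"},
      {"city": "Barcelona", "price": 1.0, "name": "EvenCheapoer"},
  ]

  # Structured retrieval: the schema is known, the query is exact predicates.
  structured_hits = [h for h in hotels if h["city"] == "Bangalore" and h["price"] < 2]

  # Unstructured retrieval: free-text descriptions, ranked by query-term overlap.
  docs = {
      "Cheapo one": "cheap hotel in central bangalore near the station",
      "EvenCheapoer": "budget hostel in the old town of barcelona",
  }
  query_terms = set("cheap hotels in bangalore".lower().split())
  scores = {name: len(query_terms & set(text.split())) for name, text in docs.items()}
  ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

  print(structured_hits)  # exact answers
  print(ranked)           # best-effort ranking

Even this toy example shows why text processing matters: the query term "hotels" does not match "hotel" without stemming or normalization.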
41. IR issues
• Find out what the user needs
  … and do it quickly
• Challenges: user intention, accessibility, volatility, redundancy, lack of structure, low quality, different data sources, volume, scale
• The main bottleneck is human cognition, not computation
42. IR is mostly about relevance
• Relevance is the core concept in IR, but nobody has a good definition
• Relevance = useful
• Relevance = topically related
• Relevance = new
• Relevance = interesting
• Relevance = ???
• However, we still want relevant information
43. • Information needs must be expressed as a query
  – But users often don't know what they want
• Problems
  – Verbalizing information needs
  – Understanding query syntax
  – Understanding search engines
44. Understanding(?) the user
(Diagram: one need, three renderings, and what gets lost at each step)
Actual need: "I am a hungry tourist in Barcelona, and I want to find a place to eat; however I don't want to spend a lot of money"
Verbalized need: "I want information on places with cheap food in Barcelona"
Query: "info about bars in Barcelona", typed as "Bar celona"
The gaps in between: misconception, mistranslation, misformulation
45. Why is this hard?
• Documents, images, video, speech, etc. are complex. We need some representation
• Semantics
  – What do words mean?
• Natural language
  – How do we say things?
• Computers cannot deal with these easily
46. … and even harder
• Context
• Opinion
(Image caption: Funny? Talented? Honest?)
48. What is it like to be a search engine?
• How can we figure out what you're trying to do?
• The signal can sometimes be weak!
[ jaguar ]
[ iraq ]
[ latest release Thinkpad drivers touchpad ]
[ ebay ]
[ first ]
[ google ]
[ brittttteny spirs ]
49. Search is a multi-step process
• Session search
  – Verbalize your query
  – Look for a document
  – Find your information there
  – Refine
• Teleporting
  – Go directly to the site you like
  – Formulating the query is too hard, you trust the final site more, etc.
50. • Someone told me that in the mid-1800s, people often would carry around a special kind of notebook. They would use the notebook to write down quotations that they heard, or copy passages from books they'd read. The notebook was an important part of their education, and it had a particular name.
  – What was the name of the notebook?
(Examples from Dan Russell)
52. More tasks …
• Going beyond a search engine
  – Using images / multimedia content
  – Using maps
  – Using other sources
• Think of how to express things differently (synonyms)
  – A friend told me that there is an abandoned city in the waters of San Francisco Bay. Is that true? If it IS true, what was the name of the supposed city?
• Exploring a topic further in depth
• Refining a question
  – Suppose you want to buy a unicycle for your Mom or Dad. How would you find it?
• Looking for lists of information
  – Can you find a list of all the groups that inhabited California at the time of the missions?
53. IR tasks
• Known-item finding
  – You want to retrieve some data that you know exists
  – What year was Peter Mika born?
• Exploratory seeking
  – You want to find some information through an iterative process
  – There is no single answer to your query
• Exhaustive search
  – You want to find all the information possible about a particular issue
  – Issuing several queries to cover the user's information need
• Re-finding
  – You want to find an item you have found already
54. Scale
• >300 TB of print data produced per year
  – Plus video, speech, domain-specific information (>600 PB per year)
• IR has to be fast + scalable
• Information is dynamic
  – News, web pages, maps, …
  – Queries are dynamic (you might even change your information needs while searching)
• Cope with data and searcher change
  – This introduces tensions in every component of a search engine
55. Methodology
• Experimentation in IR
• Three fundamental types of IR research:
  – Systems (efficiency)
  – Methods (effectiveness)
  – Applications (user utility)
• Empirical evaluation plays a critical role across all three types of research
56. Methodology (II)
• Information retrieval (IR) is a highly applied scientific discipline
• Experimentation is a critical component of the scientific method
• Poor experimental methodologies are not scientifically sound and should be avoided
58. (Diagram: the search loop) A task gives rise to an information need, which is put into verbal form and then into a query; the search engine matches the query against the corpus and returns results, which may trigger query refinement and another pass through the loop.
59. (Diagram: anatomy of a search engine) On the query side, the user interface and query interpretation feed matching and ranking against the index. On the content side, crawling the document collection, text processing, document interpretation, and indexing (plus some general voodoo) build the index and its metadata.
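To make the matching and ranking boxes concrete, here is a minimal, self-contained TF-IDF ranker with cosine similarity (a toy stand-in, not the deck's own code; the three-document collection is hypothetical).

  import math
  from collections import Counter

  docs = {
      "d1": "cheap hotels in bangalore",
      "d2": "bangalore weather in summer",
      "d3": "cheap flights to barcelona",
  }

  # Indexing: document frequencies, then one TF-IDF vector per document.
  df = Counter(t for text in docs.values() for t in set(text.split()))
  n = len(docs)
  vectors = {
      doc_id: {t: tf * math.log(n / df[t]) for t, tf in Counter(text.split()).items()}
      for doc_id, text in docs.items()
  }

  def cosine(q, d):
      # Cosine similarity between two sparse term-weight vectors.
      dot = sum(w * d.get(t, 0.0) for t, w in q.items())
      norm_q = math.sqrt(sum(w * w for w in q.values()))
      norm_d = math.sqrt(sum(w * w for w in d.values()))
      return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

  # Matching + ranking: score the query vector against every document.
  query = "cheap hotels"
  q_vec = {t: math.log(n / df[t]) for t in query.split() if t in df}
  for doc_id in sorted(vectors, key=lambda d: cosine(q_vec, vectors[d]), reverse=True):
      print(doc_id, round(cosine(q_vec, vectors[doc_id]), 3))

Here "d1" ranks first (it matches both query terms, and "hotels" is rare in the collection), which is exactly the behavior the matching/ranking boxes in the diagram are meant to produce.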
64. Web Search
• Basic search technology shared with IR systems
  – Representation
  – Indexing
  – Ranking
• Scale (in terms of data and users) changes the game
  – Efficiency / architectural design decisions
• Link structure
  – For data acquisition (crawling)
  – For ranking (PageRank, HITS; see the sketch after this list)
  – For spam detection
  – For extending document representations (anchor text)
• Adversarial IR
• Monetization
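As a taste of link-based ranking, the sketch below runs textbook PageRank by power iteration over a hypothetical four-page graph (damping factor 0.85; this is the classic formulation, not any engine's production ranking).

  def pagerank(links, damping=0.85, iterations=50):
      # links: page -> list of pages it points to (a toy web graph).
      pages = list(links)
      n = len(pages)
      rank = {p: 1.0 / n for p in pages}
      for _ in range(iterations):
          new_rank = {p: (1.0 - damping) / n for p in pages}
          for page, outlinks in links.items():
              if outlinks:
                  share = damping * rank[page] / len(outlinks)
                  for target in outlinks:
                      new_rank[target] += share
              else:
                  # Dangling page: distribute its mass uniformly.
                  for target in pages:
                      new_rank[target] += damping * rank[page] / n
          rank = new_rank
      return rank

  toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a", "c"]}
  print(pagerank(toy_web))  # "a" and "c" accumulate most of the rank here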
65. User Needs
• Need
  – Informational: want to learn about something (~40% / 65%)
  – Navigational: want to go to that page (~25% / 15%)
  – Transactional: want to do something, web-mediated (~35% / 20%)
    • Access a service
    • Downloads
    • Shop
  – Gray areas
    • Find a good hub
    • Exploratory search: "see what's there"
Example queries: Low hemoglobin, United Airlines, Seattle weather, Mars surface images, Canon S410, Car rental Brasil
66. How far do people look for results?
(Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)
66
67. Users' empirical evaluation of results
• Quality of pages varies widely
– Relevance is not enough
– Other desirable qualities (non-IR!)
• Content: trustworthy, diverse, non-duplicated, well maintained
• Web readability: displays correctly and fast
• No annoyances: pop-ups, etc.
• Precision vs. recall
– On the web, recall seldom matters
• What matters
– Precision at 1? Precision above the fold?
– Comprehensiveness – must be able to deal with obscure queries
• Recall matters when the number of matches is very small
• User perceptions may be unscientific, but are significant over a large aggregate
67
68. Users' empirical evaluation of engines
• Relevance and validity of results
• UI – simple, no clutter, error tolerant
• Trust – results are objective
• Coverage of topics for ambiguous queries
• Pre/post-process tools provided
– Mitigate user errors (auto spell check, search assist, …)
– Explicit: search within results, more like this, refine, ...
– Anticipative: related searches
• Deal with idiosyncrasies
– Web-specific vocabulary
• Impact on stemming, spell-check, etc.
– Web addresses typed in the search box
• "The first, the last, the best and the worst …"
68
69. The Web document collection
• No design/co-ordination
• Distributed content creation, linking, democratization of publishing
• Content includes truth, lies, obsolete information, contradictions …
• Unstructured (text, html, …), semi-structured (XML, annotated photos), structured (databases) …
• Scale much larger than previous text collections … but corporate records are catching up
• Growth – slowed down from the initial "volume doubling every few months", but still expanding
• Content can be dynamically generated
69
70. Basic crawler operation
• Begin with known "seed" URLs
• Fetch and parse them
– Extract URLs they point to
– Place the extracted URLs on a queue
• Fetch each URL on the queue and repeat
70
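A minimal sketch of this loop in Python (standard library only; the names crawl and LinkExtractor are illustrative, and a real crawler adds the politeness, robustness, and distribution concerns discussed on the next slides):

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)   # the URL frontier (queue)
    seen = set(seed_urls)         # avoid re-crawling the same URL
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue              # skip unreachable or non-HTML pages
        fetched += 1
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)   # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)   # place extracted URLs on the queue
        yield url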
71. Crawling picture
[Diagram: seed pages feed the URL frontier; URLs are crawled and parsed, with the unseen Web lying beyond the frontier.]
71
72. Simple picture – complications
• Web crawling isn't feasible with one machine
– All of the above steps must be distributed
• Malicious pages
– Spam pages
– Spider traps – including dynamically generated ones
• Even non-malicious pages pose challenges
– Latency/bandwidth to remote servers vary
– Webmasters' stipulations
• How "deep" should you crawl a site's URL hierarchy?
– Site mirrors and duplicate pages
• Politeness – don't hit a server too often
72
73. What any crawler must do
• Be polite: respect implicit and explicit politeness considerations
– Only crawl allowed pages
– Respect robots.txt
• Be robust: be immune to spider traps and other malicious behavior from web servers
– Be efficient
73
74. What any crawler should do
• Be capable of distributed operation: designed to run on multiple distributed machines
• Be scalable: designed to increase the crawl rate by adding more machines
• Performance/efficiency: permit full use of available processing and network resources
74
75. What any crawler should do
• Fetch pages of "higher quality" first
• Continuous operation: continue fetching fresh copies of previously fetched pages
• Extensible: adapt to new data formats and protocols
75
76. Updated crawling picture
[Diagram: as before – seed pages, URL frontier, URLs crawled and parsed, unseen Web – now with multiple crawling threads working in parallel.]
76
78. Document views
[Diagram: four views of a document "Sailing in Greece" by B. Smith – a content view (index terms: sailing, greece, mediterranean, fish, sunset); a data view (Author = "B. Smith", Crdate = "14.12.96", Ladate = "11.07.02"); a structure view (head, title, author, chapters, sections); and a layout view.]
78
79. What is a document: document views
• The content view is concerned with representing the content of the document; that is, what the document is about.
• The data view is concerned with factual data associated with the document (e.g. author names, publishing date).
• The layout view is concerned with how documents are displayed to users; this view is related to user interface and visualization issues.
• The structure view is concerned with the logical structure of the document (e.g. a book being composed of chapters, themselves composed of sections, etc.).
79
80. Indexing language
• An indexing language:
– Is the language used to describe the content of documents (and queries)
– It usually consists of index terms that are derived from the text (automatic indexing), or arrived at independently (manual indexing), using a controlled or uncontrolled vocabulary
– Basic operation: is this query term present in this document?
80
81. Generating document representations
• Building the indexing language, that is, generating the document representation, is done in several steps:
– Character encoding
– Language recognition
– Page segmentation (boilerplate detection)
– Tokenization (identification of words)
– Term normalization
– Stopword removal
– Stemming
– Others (document expansion, etc.)
81
82. Generating document representations: overview
documents → (tokenization) → tokens → (remove noisy stop-words) → (reduce to stems) → stems → terms (index terms)
+ others: e.g. thesaurus, more complex processing
82
83. Parsing a document
• What format is it in?
– pdf/word/excel/html?
• What language is it in?
• What character set is in use?
– (ISO-8859, UTF-8, …)
But these tasks are often done heuristically …
83
84. Complications: Format/language
• Documents being indexed can include docs from many different languages
– A single index may contain terms from many languages.
• Sometimes a document or its components can contain multiple languages/formats
– A French email with a German pdf attachment.
– A French email quoting clauses from an English-language contract.
• There are commercial and open source libraries that can handle a lot of this stuff
84
85. Complications: What is a document?
We return "documents" from our query, but there are often interesting questions of grain size:
What is a unit document?
– A file?
– An email? (Perhaps one of many in a single mbox file)
• What about an email with 5 attachments?
– A group of files (e.g., a PPT or LaTeX document split over HTML pages)
85
86. Tokenization
• Input: "Friends, Romans and Countrymen"
• Output: tokens
– Friends
– Romans
– Countrymen
• A token is an instance of a sequence of characters
• Each such token is now a candidate for an index entry, after further processing
• But what are valid tokens to emit?
86
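As a rough illustration, a naive regex tokenizer along these lines (the \w+ pattern is one possible choice, not a recommendation; the next slides show why real tokenization needs many more decisions):

import re

def tokenize(text):
    # \w+ keeps runs of letters/digits and drops punctuation
    return re.findall(r"\w+", text)

print(tokenize("Friends, Romans and Countrymen"))
# ['Friends', 'Romans', 'and', 'Countrymen']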
87. Tokenization
• Issues in tokenization:
– Finland's capital → Finland AND s? Finlands? Finland's?
– Hewlett-Packard → Hewlett and Packard as two tokens?
• state-of-the-art: break up the hyphenated sequence?
• co-education
• lowercase, lower-case, lower case?
• It can be effective to get the user to put in possible hyphens
– San Francisco: one token or two?
• How do you decide it is one token?
87
88. Numbers
• 3/20/91 Mar. 12, 1991 20/3/91
• 55 B.C.
• B-52
• My PGP key is 324a3df234cb23e
• (800) 234-2333
• Numbers often have embedded spaces
• Older IR systems may not index numbers
– But numbers are often very useful: think about things like looking up error codes/stacktraces on the web
• "Metadata" (creation date, format, etc.) will often be indexed separately
88
89. Tokenization: language issues
• French
– L'ensemble → one token or two?
• L? L'? Le?
• Want l'ensemble to match with un ensemble
– Until at least 2003, it didn't on Google
» Internationalization!
• German noun compounds are not segmented
– Lebensversicherungsgesellschaftsangestellter
– "life insurance company employee"
– German retrieval systems benefit greatly from a compound splitter module
– Can give a 15% performance boost for German
89
90. Tokenization: language issues
• Chinese and Japanese have no spaces between words:
– 莎拉波娃现在居住在美国东南部的佛罗里达。
– Not always guaranteed a unique tokenization
• Further complicated in Japanese, with multiple alphabets intermingled
– Dates/amounts in multiple formats
フォーチュン500社は情報不足のため時間あたり$500K(約6,000万円)
Katakana Hiragana Kanji Romaji
End-user can express a query entirely in hiragana!
90
91. Tokenization: language issues
• Arabic (or Hebrew) is basically written right to left, but with certain items like numbers written left to right
• Words are separated, but letter forms within a word form complex ligatures
[Example: an Arabic sentence read right to left starting from the right, with the embedded numbers read left to right]
"Algeria achieved its independence in 1962 after 132 years of French occupation."
• With Unicode, the surface presentation is complex, but the stored form is straightforward
91
92. Stop words
• With a stop list, you exclude from the dictionary entirely the commonest words. Intuition:
– They have little semantic content: the, a, and, to, be
– There are a lot of them: ~30% of postings for the top 30 words
• But the trend is away from doing this:
– Good compression techniques mean the space for including stop words in a system can be small
– Good query optimization techniques mean you pay little at query time for including stop words.
– You need them for:
• Phrase queries: "King of Denmark"
• Various song titles, etc.: "Let it be", "To be or not to be"
• "Relational" queries: "flights to London"
92
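A minimal sketch of stop-word removal, assuming a small hand-picked stop list (illustrative only; real lists are derived from collection statistics, or removal is skipped altogether, as argued above):

STOP_WORDS = {"the", "a", "and", "to", "be", "of", "it", "or", "not"}

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["to", "be", "or", "not", "to", "be"]))
# [] -- which is exactly why phrase queries like "To be or not to be" need stop words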
93. Normalization to terms
• We want matches to occur despite superficial differences in the character sequences of the tokens
• We may need to "normalize" words in the indexed text as well as query words into the same form
– We want to match U.S.A. and USA
• The result is terms: a term is a (normalized) word type, which is an entry in our IR system dictionary
• We most commonly implicitly define equivalence classes of terms by, e.g.,
– deleting periods to form a term
• U.S.A., USA → USA
– deleting hyphens to form a term
• anti-discriminatory, antidiscriminatory → antidiscriminatory
93
94. Normalization: other languages
• Accents: e.g., French résumé vs. resume.
• Umlauts: e.g., German Tuebingen vs. Tübingen
– Should be equivalent
• Most important criterion:
– How are your users likely to write their queries for these words?
• Even in languages that standardly have accents, users often may not type them
– Often best to normalize to a de-accented term
• Tuebingen, Tübingen, Tubingen → Tubingen
94
95. Case folding
• Reduce all letters to lower case
– Exception: upper case in mid-sentence?
• e.g., General Motors
• Fed vs. fed
• SAIL vs. sail
– Often best to lower-case everything, since users will use lowercase regardless of "correct" capitalization…
• Longstanding Google example: [fixed in 2011…]
– Query C.A.T.
– #1 result was for "cats" (well, Lolcats), not Caterpillar Inc.
95
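A sketch of two of the normalizations above (case folding plus de-accenting via Unicode decomposition). Note it does not equate Tuebingen with Tübingen; that needs an extra language-specific rule:

import unicodedata

def normalize(token):
    token = token.lower()                         # case folding
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(c for c in decomposed
                   if not unicodedata.combining(c))   # drop accent marks

for t in ["Tübingen", "Tuebingen", "résumé", "SAIL"]:
    print(t, "->", normalize(t))
# Tübingen -> tubingen, Tuebingen -> tuebingen, résumé -> resume, SAIL -> sail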
96. Normalization to terms
• An alternative to equivalence classing is to do asymmetric expansion
• An example of where this may be useful
– Enter: window    Search: window, windows
– Enter: windows   Search: Windows, windows, window
– Enter: Windows   Search: Windows
• Potentially more powerful, but less efficient
96
97. Thesauri and soundex
• How do we handle synonyms and homonyms?
– E.g., by hand-constructed equivalence classes
• car = automobile    color = colour
– We can rewrite to form equivalence-class terms
• When the document contains automobile, index it under car-automobile (and vice versa)
– Or we can expand a query
• When the query contains automobile, look under car as well
• What about spelling mistakes?
– One approach is Soundex, which forms equivalence classes of words based on phonetic heuristics
97
98. Lemmatization
• Reduce inflectional/variant forms to the base form
• E.g.,
– am, are, is → be
– car, cars, car's, cars' → car
• the boy's cars are different colors → the boy car be different color
• Lemmatization implies doing "proper" reduction to the dictionary headword form
98
99. Stemming
• Reduce terms to their "roots" before indexing
• "Stemming" suggests crude affix chopping
– language dependent
– e.g., automate(s), automatic, automation all reduced to automat.
Before stemming: for example compressed and compression are both accepted as equivalent to compress.
After stemming: for exampl compress and compress ar both accept as equival to compress
99
100. – Affix removal
• remove the longest affix: {sailing, sailor} => sail
• simple and effective stemming
• a widely used such stemmer is Porter's algorithm
– Dictionary-based, using a look-up table
• look up the stem of a word in the table: play + ing => play
• space is required to store the (large) table, so often not practical
100
101. Stemming: some issues
• Detecting equivalent stems:
– {organize, organise}: e as the longest affix leads to {organiz, organis}, which should lead to one stem: organis
– Heuristics are therefore used to deal with such cases.
• Over-stemming:
– {organisation, organ} reduced to org, which is incorrect
– Again, heuristics are used to deal with such cases.
101
102. Porter's algorithm
• Commonest algorithm for stemming English
– Results suggest it's at least as good as other stemming options
• Conventions + 5 phases of reductions
– phases applied sequentially
– each phase consists of a set of commands
– sample convention: of the rules in a compound command, select the one that applies to the longest suffix.
102
103. Typical rules in Porter
• sses → ss
• ies → i
• ational → ate
• tional → tion
103
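A toy fragment applying just these four rules with the longest-suffix convention (not the full Porter algorithm; libraries such as NLTK provide complete implementations):

RULES = [("sses", "ss"), ("ies", "i"), ("ational", "ate"), ("tional", "tion")]

def apply_rules(word):
    # select the rule that applies to the longest suffix
    for suffix, repl in sorted(RULES, key=lambda r: -len(r[0])):
        if word.endswith(suffix):
            return word[: -len(suffix)] + repl
    return word

for w in ["caresses", "ponies", "relational", "conditional"]:
    print(w, "->", apply_rules(w))
# caresses -> caress, ponies -> poni, relational -> relate, conditional -> condition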
104. Language-specificity
• The above methods embody transformations that are
– Language-specific, and often
– Application-specific
• These are "plug-in" addenda to the indexing process
• Both open source and commercial plug-ins are available for handling these
104
105. Does stemming help?
• English: very mixed results. Helps recall for some queries but harms precision on others
– E.g., operative (dentistry) → oper
• Definitely useful for Spanish, German, Finnish, …
– 30% performance gains for Finnish!
105
106. Others: Using a thesaurus
• A thesaurus provides a standard vocabulary for indexing (and searching)
• More precisely, a thesaurus provides a classified hierarchy for broadening and narrowing terms
bank: 1. Finance institute
      2. River edge
– if a document is indexed with bank, then index it with "finance institute" or "river edge"
– we need to disambiguate the sense of bank in the text: e.g. if money appears in the document, then choose "finance institute"
• A widely used online thesaurus: WordNet
106
107. Information storage
• A whole topic on its own
• How do we keep fresh copies of the web manageable by a cluster of computers, and still answer millions of queries in milliseconds?
– Inverted indexes
– Compression
– Caching
– Distributed architectures
– … and a lot of tricks
• Inverted indexes: the cornerstone data structure of IR systems
– For each term t, we must store a list of all documents that contain t.
– Identify each doc by a docID, a document serial number
– Index construction is tricky (we can't hold all the information needed in memory)
107
109. • Most basic form:
– Document frequency
– Term frequency
– Document identifiers
109
term  term id  df  postings (docID, tf)
a     1        4   (1,2), (2,5), (10,1), (11,1)
as    2        3   (1,3), (3,4), (20,1)
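One possible in-memory layout of the table above, assuming postings are stored as (docID, tf) pairs (a sketch, not how production systems lay this out on disk):

index = {
    "a":  {"df": 4, "postings": [(1, 2), (2, 5), (10, 1), (11, 1)]},
    "as": {"df": 3, "postings": [(1, 3), (3, 4), (20, 1)]},
}

def docs_containing(term):
    # walk the postings list and return the document identifiers
    entry = index.get(term)
    return [doc_id for doc_id, tf in entry["postings"]] if entry else []

print(docs_containing("a"))   # [1, 2, 10, 11]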
110. • Indexes can contain more information
– Position in the document
• Useful for "phrase queries" or "proximity queries"
– Fields in which the term appears in the document
– Metadata …
– All of that can be used for ranking
110
Example posting: (1, 2, [1,1], [2,10]), … – docID 1, tf 2, with occurrences at field 1 (title), position 1 and at field 2, position 10
111. Queries
• How do we process a query?
• Several kinds of queries
– Boolean
• Chicken AND salt
• Gnome OR KDE
• Salt AND NOT pepper
– Phrase queries
– Ranked
111
112. List Merging
• "Exact match" queries
– Chicken AND curry
– Locate Chicken in the dictionary
– Fetch its postings
– Locate curry in the dictionary
– Fetch its postings
– Merge both postings lists
112
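A sketch of the merge step over two sorted docID lists (the postings below are hypothetical; walking both lists in parallel keeps the merge linear in their combined length):

def intersect(postings_a, postings_b):
    i, j, result = 0, 0, []
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i]); i += 1; j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1    # advance the list with the smaller docID
        else:
            j += 1
    return result

# chicken AND curry, with made-up sorted docID lists
print(intersect([1, 4, 7, 20], [3, 4, 20, 42]))   # [4, 20]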
116. Models of information retrieval
• A model:
– abstracts away from the real world
– uses a branch of mathematics
– possibly: uses a metaphor for searching
116
117. Short history of IR modelling
• Boolean model (±1950)
• Document similarity (±1957)
• Vector space model (±1970)
• Probabilistic retrieval (±1976)
• Language models (±1998)
• Linkage-based models (±1998)
• Positional models (±2004)
• Fielded models (±2005)
117
118. The Boolean model (±1950)
• Exact matching: data retrieval (instead of information retrieval)
– A term specifies a set of documents
– Boolean logic combines terms / document sets
– AND, OR and NOT: intersection, union, and difference
118
119. Statistical similarity between documents (±1957)
• The principle of similarity
"The more two representations agree in given elements and their distribution, the higher would be the probability of their representing similar information" (Luhn 1957)
"It is here proposed that the frequency of word [term] occurrence in an article [document] furnishes a useful measurement of word [term] significance"
119
121. Zipf's law
• Relative frequencies of terms.
• In natural language, there are a few very frequent terms and very many very rare terms.
• Zipf's law: the i-th most frequent term has frequency proportional to 1/i.
• cf_i ∝ 1/i, i.e. cf_i = K/i where K is a normalizing constant
• cf_i is collection frequency: the number of occurrences of the term t_i in the collection.
• Zipf's law holds for different languages
121
122. Zipf consequences
• If the most frequent term (the) occurs cf_1 times
– then the second most frequent term (of) occurs cf_1/2 times
– the third most frequent term (and) occurs cf_1/3 times …
• Equivalently: cf_i = K/i where K is a normalizing factor, so
– log cf_i = log K − log i
– Linear relationship between log cf_i and log i
• Another power-law relationship
122
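A quick way to eyeball this on a toy corpus, taking K = cf_1 (a sketch; with a real collection the fit is checked on a log-log plot):

from collections import Counter

def zipf_check(tokens, top=5):
    counts = Counter(tokens).most_common()
    K = counts[0][1]                     # frequency of the most frequent term
    for i, (term, cf) in enumerate(counts[:top], start=1):
        print(f"rank {i}: {term!r} observed cf={cf}, Zipf predicts {K / i:.1f}")

zipf_check("the cat and the dog and the bird the end".split())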
124. Luhn's analysis – Observation
[Figure: frequency of terms f plotted against terms by rank order, with the resolving power r overlaid; an upper cut-off and a lower cut-off separate the common terms and the rare terms from the significant terms in between.]
Resolving power of significant terms: the ability of terms to discriminate document content; it peaks at the rank-order position halfway between the two cut-offs.
124
125. Luhn's analysis – Implications
• Common terms are not good at representing document content
– partly implemented through the removal of stop words
• Rare words are also not good at representing document content
– usually nothing is done
– Not true for every "document"
• We need a means to quantify the resolving power of a term:
– associate weights to index terms
– the tf×idf approach
125
126. Ranked retrieval
• Boolean queries are good for expert users with a precise understanding of their needs and the collection.
– Also good for applications: applications can easily consume 1000s of results.
• Not good for the majority of users.
– Most users are incapable of writing Boolean queries (or they are, but they think it's too much work).
– Most users don't want to wade through 1000s of results.
• This is particularly true of web search.
127. Feast or Famine
• Boolean queries often result in either too few (=0) or too many (1000s) results.
• Query 1: "standard user dlink 650" → 200,000 hits
• Query 2: "standard user dlink 650 no card found" → 0 hits
• It takes a lot of skill to come up with a query that produces a manageable number of hits.
– AND gives too few; OR gives too many
128. Ranked retrieval models
• Rather than a set of documents satisfying a query expression, in ranked retrieval the system returns an ordering over the (top) documents in the collection for a query
• Free text queries: rather than a query language of operators and expressions, the user's query is just one or more words in a human language
• In principle, these are two separate choices, but in practice ranked retrieval has normally been associated with free text queries and vice versa
128
129. Feast or famine: not a problem in ranked retrieval
• When a system produces a ranked result set, large result sets are not an issue
– Indeed, the size of the result set is not an issue
– We just show the top k (≈ 10) results
– We do not overwhelm the user
– Premise: the ranking algorithm works
130. Scoring as the basis of ranked retrieval
• We wish to return, in order, the documents most likely to be useful to the searcher
• How can we rank-order the documents in the collection with respect to a query?
• Assign a score – say in [0, 1] – to each document
• This score measures how well document and query "match".
131. Query-document matching scores
• We need a way of assigning a score to a query/document pair
• Let's start with a one-term query
• If the query term does not occur in the document: the score should be 0
• The more frequent the query term in the document, the higher the score (should be)
• We will look at a number of alternatives for this.
132. Bag of words model
• The vector representation does not consider the ordering of words in a document
• John is quicker than Mary and Mary is quicker than John have the same vectors
• This is called the bag of words model.
133. Term frequency tf
• The term frequency tf(t,d) of term t in document d is defined as the number of times that t occurs in d.
• We want to use tf when computing query-document match scores. But how?
• Raw term frequency is not what we want:
– A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term.
– But not 10 times more relevant.
• Relevance does not increase proportionally with term frequency.
134. Log-frequency weighting
• The log frequency weight of term t in d is
$w_{t,d} = \begin{cases} 1 + \log_{10} \mathrm{tf}_{t,d}, & \text{if } \mathrm{tf}_{t,d} > 0 \\ 0, & \text{otherwise} \end{cases}$
• 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
• Score for a document-query pair: sum over terms t in both q and d:
$\mathrm{score}(q,d) = \sum_{t \in q \cap d} (1 + \log_{10} \mathrm{tf}_{t,d})$
• The score is 0 if none of the query terms is present in the document.
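The same weighting as a small function (a sketch; tf is computed here from an already-tokenized document):

import math
from collections import Counter

def log_tf_score(query_terms, doc_terms):
    tf = Counter(doc_terms)
    # sum 1 + log10(tf) over query terms that occur in the document
    return sum(1 + math.log10(tf[t]) for t in query_terms if tf[t] > 0)

doc = "the quick brown fox jumps over the lazy dog the".split()
print(log_tf_score(["the", "fox", "cat"], doc))
# "the" has tf=3 -> 1+log10(3) ≈ 1.48; "fox" tf=1 -> 1; "cat" absent -> contributes 0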
135. Document frequency
• Rare terms are more informative than frequent terms
– Recall stop words
• Consider a term in the query that is rare in the collection (e.g., arachnocentric)
• A document containing this term is very likely to be relevant to the query arachnocentric
• → We want a high weight for rare terms like arachnocentric.
136. Document frequency, continued
• Frequent terms are less informative than rare terms
• Consider a query term that is frequent in the collection (e.g., high, increase, line)
• A document containing such a term is more likely to be relevant than a document that does not
• But it's not a sure indicator of relevance.
• → For frequent terms, we want high positive weights for words like high, increase, and line
• But lower weights than for rare terms.
• We will use document frequency (df) to capture this.
137. idf weight
• df_t is the document frequency of t: the number of documents that contain t
– df_t is an inverse measure of the informativeness of t
– df_t ≤ N
• We define the idf (inverse document frequency) of t by
$\mathrm{idf}_t = \log_{10}(N/\mathrm{df}_t)$
– We use log(N/df_t) instead of N/df_t to "dampen" the effect of idf.
138. Effect of idf on ranking
• Does idf have an effect on ranking for one-term queries, like
– iPhone
• idf has no effect on ranking one-term queries
– idf affects the ranking of documents for queries with at least two terms
– For the query capricious person, idf weighting makes occurrences of capricious count for much more in the final document ranking than occurrences of person.
138
139. tf-idf weighting
• The tf-idf weight of a term is the product of its tf weight and its idf weight.
$w_{t,d} = \log(1 + \mathrm{tf}_{t,d}) \times \log_{10}(N/\mathrm{df}_t)$
• Best known weighting scheme in information retrieval
– Note: the "-" in tf-idf is a hyphen, not a minus sign!
– Alternative names: tf.idf, tf x idf
• Increases with the number of occurrences within a document
• Increases with the rarity of the term in the collection
140. Score for a document given a query
$\mathrm{Score}(q,d) = \sum_{t \in q \cap d} \mathrm{tf.idf}_{t,d}$
• There are many variants
– How "tf" is computed (with/without logs)
– Whether the terms in the query are also weighted
– …
140
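A minimal sketch combining the last three slides: log-tf times idf, summed over the query terms present in the document (the three-document collection below is made up for illustration):

import math
from collections import Counter

docs = [
    "new york pizza".split(),
    "pizza oven repair".split(),
    "new car sales".split(),
]
N = len(docs)
df = Counter(t for d in docs for t in set(d))   # document frequency per term

def tf_idf_score(query, doc):
    tf = Counter(doc)
    return sum(math.log(1 + tf[t]) * math.log10(N / df[t])
               for t in query if tf[t] > 0)

for d in docs:
    print(d, round(tf_idf_score(["new", "pizza"], d), 3))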
141. Documents as vectors
• So we have a |V|-dimensional vector space
• Terms are axes of the space
• Documents are points or vectors in this space
• Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine
• These are very sparse vectors – most entries are zero.
142. Statistical similarity between documents (±1957)
• Vector product
– If the vectors have binary components, then the product measures the number of shared terms
– Vector components might be "weights"
$\mathrm{score}(q,d) = \sum_{k \in \text{matching terms}} q_k \cdot d_k$
143. Why distance is a bad idea
The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.
144. Vector space model (±1970)
• Documents and queries are vectors in a high-dimensional space
• Geometric measures (distances, angles)
145. Vector space model (±1970)
• Cosine of an angle:
– close to 1 if the angle is small
– 0 if the vectors are orthogonal
$\cos(\vec{d},\vec{q}) = \frac{\sum_{k=1}^{m} d_k\, q_k}{\sqrt{\sum_{k=1}^{m} d_k^2}\,\sqrt{\sum_{k=1}^{m} q_k^2}}$
• Equivalently, with length-normalized vectors:
$\cos(\vec{d},\vec{q}) = n(\vec{d}) \cdot n(\vec{q}), \qquad n_k(\vec{v}) = \frac{v_k}{\sqrt{\sum_{i=1}^{m} v_i^2}}$
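The same computation over sparse term-weight vectors represented as dicts (a sketch; the query and document weights below are made up):

import math

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

q = {"sailing": 1.0, "greece": 1.0}
d = {"sailing": 3.0, "greece": 2.0, "sunset": 1.0}
print(round(cosine(q, d), 3))   # ≈ 0.945 -- small angle, similar term distributions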
146. Vector space model (±1970)
• PRO: nice metaphor, easily explained; mathematically sound: geometry; great for relevance feedback
• CON: needs term weighting (tf-idf); hard to model structured queries
147. Probabilistic IR
• An IR system has an uncertain understanding of users' queries and makes uncertain guesses on whether a document satisfies a query or not.
• Probability theory provides a principled foundation for reasoning under uncertainty.
• Probabilistic models build upon this foundation to estimate how likely it is that a document is relevant for a query.
147
148. Event Space
• Query representation q
• Document representation d
• Relevance r
• Event space: query-document pairs
• Conceptually there might be pairs with the same q and d, but different r
• Sometimes we also include the user u, the context c, etc.
148
149. Probability Ranking Principle
• Robertson (1977)
– "If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data."
• The basis for probabilistic approaches to IR
149
150. Dissecting PRP
• Probability of relevance
• Estimated accurately
• Based on whatever data are available
• Best possible accuracy
– The perfect IR system!
– Assumes relevance is independent of the other documents in the collection
150
151. Relevance?
• What is relevance?
– Isn't it decided by the user, by her opinion?
• "User" doesn't mean a human being!
– We are working with representations
– ... or the parts of reality available to us
• 2-3 keywords, no profile, no context ...
– relevance is uncertain
• it depends on what the system sees
• it may be marginalized over all the unseen contexts/profiles
151
152. Retrieval as binary classification
• For every (q,d), r takes two values
– Relevant and non-relevant documents
– can be extended to multiple values
• Retrieve using Bayes' decision rule
– PRP is related to the Bayes error rate (the lowest possible error rate for a classifier)
– How do we estimate this probability?
152
153. PRP ranking
• How do we represent the random variables?
• How do we estimate the model's parameters?
153
154. • d is a binary vector
• Multiple Bernoulli variables
• Under MB (multiple Bernoulli), we can decompose the likelihood into a product of per-term probabilities:
154
155. If the terms are not in the query:
Otherwise we need estimates for them!
155
156. Estimates
• Assign new weights to query terms based on relevant/non-relevant documents
• Give higher weights to important terms:
156
                     Relevant   Non-relevant   Total
Documents with t     r          n-r            n
Documents without t  R-r        N-n-R+r        N-n
Total                R          N-R            N
157. Robertson–Sparck Jones weight
157
– Relevant docs with t
– Relevant docs without t
– Non-relevant docs with t
– Non-relevant docs without t
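These four counts combine into the Robertson–Sparck Jones weight. The slide leaves the formula implicit; for reference, the usual smoothed form, with r, R, n, N as in the contingency table of the previous slide, is:

$w_t = \log \frac{(r+0.5)\,/\,(R-r+0.5)}{(n-r+0.5)\,/\,(N-n-R+r+0.5)}$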
158. Estimates without relevance info
• If we pick a relevant document, words are equally likely to be present or absent
• The non-relevant set can be approximated by the collection as a whole
158
160. Modeling TF
• Naïve estimation: a separate probability for every outcome
• BIR had only two parameters; now we have plenty (~as many as outcomes)
• We can plug in a parametric estimate for the term frequencies
• For instance, a Poisson mixture
160
161. Okapi BM25
• Same ranking function as before, but with new estimates. Models term frequencies and document length.
• Words are generated by a mixture of two Poissons
• Assumes an eliteness variable (elite ~ the word occurs unusually frequently, non-elite ~ the word occurs as expected by chance).
161
163. BM25
• In order to approximate the 2-Poisson formula, Robertson and Walker came up with a simple saturating function of term frequency
• Two model parameters
• Very effective
• The more words in common with the query, the better
• Repetitions are less important than different query words
– But more important if the document is relatively long
163
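The slide leaves the function implicit; a common textbook form of the BM25 term-saturation score, with the two parameters k1 and b, as a sketch (the toy collection is made up, and the idf variant shown is the relevance-free one from the earlier "estimates without relevance info" slide):

import math
from collections import Counter

def bm25_score(query, doc, docs, k1=1.2, b=0.75):
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))
    tf = Counter(doc)
    score = 0.0
    for t in query:
        if tf[t] == 0:
            continue
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
        # tf saturates as it grows; longer-than-average docs are penalized via b
        norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
        score += idf * norm
    return score

docs = ["new york pizza".split(), "pizza pizza pizza oven".split(), "new car".split()]
print(round(bm25_score(["pizza"], docs[1], docs), 3))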
164. Generative Probabilistic Language Models
• The generative approach – a generator which produces events/tokens with some probability
– A probability distribution over strings of text
– Urn metaphor – a bucket of different colour balls (10 red, 5 blue, 3 yellow, 2 white)
• What is the probability of drawing a yellow ball? 3/20
• What is the probability of drawing (with replacement) a red ball and a white ball? 1/2 × 1/10
– IR metaphor: documents are urns, full of tokens (balls) of different terms (colours)
165. What is a language model?
• How likely is a string of words in a "language"?
– P1("the cat sat on the mat")
– P2("the mat sat on the cat")
– P3("the cat sat en la alfombra")
– P4("el gato se sentó en la alfombra")
• Given a model M and an observation s we want
– The probability of getting s through random sampling from M
– A mechanism to produce observations (strings) legal in M
• The user thinks of a relevant document and then picks some keywords to use as a query
165
166. Generative Probabilistic Models
• What is the probability of producing the query from a document? p(q|d)
• Referred to as query likelihood
• Assumptions:
• The probability of a document being relevant is strongly correlated with the probability of a query given a document, i.e. p(d|r) is correlated with p(q|d)
• The user has a reasonable idea of the terms that are likely to appear in the "ideal" document
• The user's query terms can distinguish the "ideal" document from the rest of the corpus
• The query is generated as a representative of the "ideal" document
• The system's task is to estimate, for each of the documents in the collection, which is most likely to be the "ideal" document
167. Language Models (1998/2001)
• Let's assume we point blindly, one at a time, at 3 words in a document
– What is the probability that I, by accident, pointed at the words "Master", "computer" and "Science"?
– Compute the probability, and use it to rank the documents.
• Words are "sampled" independently of each other
– The joint probability is decomposed into a product of marginals
– Estimation of probabilities just by counting
• Higher-order models or unigrams?
– Parameter estimation can be very expensive
168. Standard LM Approach
• Assume that query terms are drawn identically and independently from a document
169. Estimating language models
• Usually we don't know M
• Maximum likelihood estimate:
– Simply use the number of times the query term occurs in the document divided by the total number of term occurrences.
• Zero probability (frequency) problem
169
170. Document Models
• Solution: infer a language model for each document
• Then we can estimate the query likelihood under it
• The standard approach is to use the probability of a term in the collection to smooth the document model.
• Interpolate the ML estimator with general language expectations
171. Estimating Document Models
• Basic components
– Probability of a term given a document (maximum likelihood estimate)
– Probability of a term given the collection
– tf(t,d) is the number of times term t occurs in document d (term frequency)
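A sketch of query likelihood with this kind of interpolation (Jelinek-Mercer smoothing with an assumed lambda = 0.5; Dirichlet prior smoothing is the other popular choice, as the notes at the end mention):

import math
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    tf_d, dlen = Counter(doc), len(doc)
    tf_c, clen = Counter(collection), len(collection)
    score = 0.0
    for t in query:
        p_doc = tf_d[t] / dlen          # ML estimate p(t|d)
        p_col = tf_c[t] / clen          # background collection model p(t|C)
        p = lam * p_doc + (1 - lam) * p_col   # interpolated (smoothed) estimate
        if p == 0:
            return float("-inf")        # term unseen even in the collection
        score += math.log(p)            # log p(q|d) under term independence
    return score

collection = "sailing in greece sailing boats pizza in new york".split()
doc = "sailing in greece".split()
print(round(query_likelihood(["sailing", "greece"], doc, collection), 3))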
173. Implementation as vector product
Recall:
$p(t) = \frac{\mathrm{df}(t)}{\sum_{t'} \mathrm{df}(t')}, \qquad p(t|D) = \frac{\mathrm{tf}(t,D)}{\sum_{t'} \mathrm{tf}(t',D)}$
$\mathrm{score}(q,d) = \sum_{k} \mathrm{tf}(k,q) \cdot \log\left(1 + \frac{\lambda\,\mathrm{tf}(k,d)\,\sum_{t} \mathrm{df}(t)}{(1-\lambda)\,\mathrm{df}(k)\,\sum_{t} \mathrm{tf}(t,d)}\right)$
– tf(k,d)/df(k): the tf.idf of term k in document d
– λ/(1−λ): the odds of the probability of matching text
– 1/Σ_t tf(t,d): the inverse length of d
– Σ_t df(t): term importance normalization
173
174. Document length normalization
• Probabilistic models assume causes for documents differing in length
– Scope
– Verbosity
• In practice, document length softens the term frequency contribution to the final score
– We've seen it in BM25 and LMs
– Usually with a tunable parameter that regulates the amount of softening
– Can be a function of the deviation from the average document length
– Can be incorporated into vanilla tf-idf
174
175. Other models
• Modeling term dependencies (positions) in the language modeling framework
– Markov Random Fields
• Modeling matches (occurrences of words) in different parts of a document -> fielded models
– BM25F
– Markov Random Fields can account for this as well
175
176. More involved signals for ranking
• From document understanding to query understanding
• Query rewrites (gazetteers, spell correction), named entity recognition, query suggestions, query categories, query segmentation ...
• Detecting query intent, triggering verticals
– directly target answers
– richer interfaces
176
177. Signals for Ranking
• Signals for ranking: matches of query terms in documents, query-independent quality measures, CTR, among others
• Probabilistic IR models are all about counting
– occurrences of terms in documents, in sets of documents, etc.
• How to efficiently aggregate a large number of "different" counts
– coming from the same terms
– no double counting!
177
178. Searching for food
• New York's greatest pizza
– New OR York's OR greatest OR pizza
– New AND York's AND greatest AND pizza
– New OR York OR great OR pizza
– "New York" OR "great pizza"
– "New York" AND "great pizza"
– York < New AND great OR pizza
• among many more.
178
179. "Refined" matching
• Extract a number of virtual regions in the document that match some version of the query (operators)
– Each region provides different evidence of relevance (i.e. a signal)
• Aggregate the scores over the different regions
• Example: "at least any two words in the query appear either consecutively or with an extra word between them"
179
181. Remember BM25
• Term (tf) independence
• Vague prior over terms not appearing in the query
• Eliteness – a topical model that perturbs the word distribution
• 2-Poisson distribution of term frequencies over relevant and non-relevant documents
181
182. Feature dependencies
• Class-linearly dependent (or affine) features
– add no extra evidence/signal
– model overfitting (vs capacity)
• Still, it is desirable to enrich the model with more involved features
• Some features are surprisingly correlated
• Positional information requires a large number of parameters to estimate
• Potentially up to
182
183. Query concept segmentation
• Queries are made up of basic conceptual units, comprising many words
– "Indian summer victor herbert"
• Spurious matches: "san jose airport" -> "san jose city airport"
• A model to detect segments based on generative language models and Wikipedia
• Relax matches using factors of the max ratio between span length and segment length
183
184. Virtual regions
• Different parts of the document provide different evidence of relevance
• Create a (finite) set of (latent) artificial regions and re-weight
184
185. Implementation
• An operator maps a query to a set of queries, which could match a document
• Each operator has a weight
• The average term frequency in a document is
185
186. Remarks
• A different saturation (eliteness) function?
– learn the real functional shape!
– log-logistic is good if the class-conditional distributions are drawn from an exponential family
• Positions as variables?
– kernel-like methods, or exponentially many parameters
• Apply operators on a per-query or per-query-class basis?
186
187. Operator examples
• BOW: maps a raw query to the set of queries whose elements are the single terms
• p-grams: the set of all p-grams of consecutive terms
• p-and: all conjunctions of p arbitrary terms
• segments: match only the "concepts"
• enlargement: some words might sneak in between the phrases/segments
187
189. ... not that far away
term frequency
link information
query intent information
editorial information
click-through information
geographical information
language information
user preferences
document length
document fields
other gazillion sources of information
189
190. Dictionaries
• Fast look-up
– Might need specific structures to scale up
• Hash tables
• Trees
– Tolerant retrieval (prefixes)
– Spell checking
• Document correction (OCR)
• Query misspellings (did you mean … ?)
• (Weighted) edit distance – dynamic programming
• Jaccard overlap (index character k-grams)
• Context sensitive
• http://norvig.com/spell-correct.html
– Wild-card queries
• Permuterm index
• k-gram indexes
190
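A sketch of the dynamic-programming edit distance mentioned above (unit costs, i.e. Levenshtein; weighted variants simply change the cost terms):

def edit_distance(a, b):
    # prev[j] holds the distance between a[:i-1] and b[:j], rolled row by row
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution (free on match)
        prev = cur
    return prev[-1]

print(edit_distance("brittttteny spirs", "britney spears"))  # e.g. for spell checking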
191. Hardware basics
• Access to data in memory is much faster than access to data on disk.
• Disk seeks: no data is transferred from disk while the disk head is being positioned.
• Therefore: transferring one large chunk of data from disk to memory is faster than transferring many small chunks.
• Disk I/O is block-based: reading and writing of entire blocks (as opposed to smaller chunks).
• Block sizes: 8KB to 256KB.
191
192. Hardware basics
• Many design decisions in information retrieval are based on the characteristics of hardware
• Servers used in IR systems now typically have several GB of main memory, sometimes tens of GB.
• Available disk space is several (2-3) orders of magnitude larger.
• Fault tolerance is very expensive: it is much cheaper to use many regular machines than one fault-tolerant machine.
192
194. MapReduce
• The index construction algorithm we just described is an instance of MapReduce.
• MapReduce (Dean and Ghemawat 2004) is a robust and conceptually simple framework for distributed computing …
• … without having to write code for the distribution part.
• They describe the Google indexing system (ca. 2002) as consisting of a number of phases, each implemented in MapReduce.
• Open source implementation: Hadoop
– Widely used throughout industry
194
195. MapReduce
• Index construction was just one phase.
• Another phase: transforming a term-partitioned index into a document-partitioned index.
– Term-partitioned: one machine handles a subrange of terms
– Document-partitioned: one machine handles a subrange of documents
• Most search engines use a document-partitioned index for better load balancing, etc.
195
196. Distributed IR
• Basic process
– All queries are sent to a director machine
– The director then sends messages to many index servers
• Each index server does some portion of the query processing
– The director organizes the results and returns them to the user
• Two main approaches
– Document distribution
• by far the most popular
– Term distribution
196
197. Distributed IR (II)
• Document distribution
– each index server acts as a search engine for a small fraction of the total collection
– the director sends a copy of the query to each of the index servers, each of which returns its top-k results
– results are merged into a single ranked list by the director
• Collection statistics should be shared for effective ranking
197
198. Caching
• Query distributions are similar to Zipf
• About half of the queries each day are unique, but some are very popular
– Caching can significantly improve effectiveness
• Cache popular query results
• Cache common inverted lists
– Inverted list caching can help with unique queries
– The cache must be refreshed to prevent stale data
198
199. Others
• Efficiency (compression, storage, caching, distribution)
• Novelty and diversity
• Evaluation
• Relevance feedback
• Learning to rank
• User models
– Context, personalization
• Sponsored search
• Temporal aspects
• Social aspects
199
#11: Not only the data is different; so are the queries, and the results we get from them!
#13: To the surprise of many, the search box has become the preferred method of information access.
Customers ask: Why can't I search my database in the same way?
#17: Archie is a tool for indexing FTP archives, allowing people to find specific files. It is considered to be the first Internet search engine.
In the summer of 1993, no search engine existed for the web, just catalogs.
One of the first "all text" crawler-based search engines was WebCrawler, which came out in 1994. Unlike its predecessors, it allowed users to search for any word in any webpage, which has become the standard for all major search engines since. It was also the first one widely known by the public. Also in 1994, Lycos (which started at Carnegie Mellon University) was launched and became a major commercial endeavor.
#18: In 1996, Netscape was looking to give a single search engine an exclusive deal as the featured search engine on Netscape's web browser. There was so much interest that instead Netscape struck deals with five of the major search engines: for $5 million a year, each search engine would be in rotation on the Netscape search engine page. The five engines were Yahoo!, Magellan, Lycos, Infoseek, and Excite.[7][8]
Google adopted the idea of selling search terms in 1998, from a small search engine company named goto.com. This move had a significant effect on the SE business, which went from struggling to one of the most profitable businesses on the internet.[6]
#20: Aardvark was a social search service that connected users live with friends or friends-of-friends who were able to answer their questions, also known as a knowledge market. Bought by Google in 2010.
Kaltix Corp., commonly known as Kaltix, is a personalized search engine company founded at Stanford University in June 2003 by Sep Kamvar, Taher Haveliwala and Glen Jeh.[1][2] It was acquired by Google in September 2003.
#45: Information needs must be expressed as a query
– But users often don't know what they want
ASK hypothesis, Belkin et al. (1982): proposed a model called Anomalous State of Knowledge (ASK)
– it is difficult for people to define exactly what their information need is, because that information is a gap in their knowledge
– search engines should look for information that fills those gaps
Interesting ideas, little practical impact (yet)
#49: Under-specified
Ambiguous
Context sensitive
Queries represent different types of search
– e.g. decision making
– background search
– fact search
#50: Need to have fairly deep knowledge...
– What sites are possible
– What's in a given site (what's likely to be there)
– Authority of source / site
– Index structure (time, place, person, ...) – what kinds of searches?
– How to read a SERP critically
#52: Start with the simplest search you can think of:
[ upper lip indentation ]
If it's not right, you can always modify it.
• When I did this, I clicked on the first result, which took me to Yahoo Answers. There's a nice article there about something called the philtrum.
#53: Ghost town vs abandoned
1750
Search for images with Creative Commons attributions
#60: Queries and documents must share a (at least comparable, if not the same) representation
#64: SCC – strongly connected component
IN – pages not discovered yet
OUT – sites that contain only in-host links
Tendrils – can't reach or be reached from the SCC
#74: Creation of indefinitely deep directory structures like http://foo.com/bar/foo/bar/foo/bar/foo/bar/.....
Dynamic pages like calendars that produce an infinite number of pages for a web crawler to follow.
Pages filled with a large number of characters, crashing the lexical analyzer parsing the page.
Pages with session-ids based on required cookies.
#80: Data view: this type of data is conventionally dealt with by a database management system.
Structure view: with this view, documents are not treated as flat entities, so a document and its components (e.g. sections) can be retrieved.
#83: How do we arrive at the content representation of a document?
#85: Nontrivial issues. Requires some design decisions.
#86: Nontrivial issues. Requires some design decisions.
Matches are then more likely to be relevant, and since the documents are smaller it will be much easier for the user to find the relevant passages in the document. But why stop there? We could treat individual sentences as mini-documents. It becomes clear that there is a precision/recall tradeoff here. If the units get too small, we are likely to miss important passages because terms were distributed over several mini-documents, while if units are too large we tend to get spurious matches and the relevant information is hard for the user to find.
The problems with large document units can be alleviated by the use of explicit or implicit proximity search.
#88: A simple strategy is to just split on all non-alphanumeric characters – bad.
You always want to do the exact same tokenization of document and query words, generally by processing queries with the same tokenizer.
Conceptually, splitting on white space can also split what should be regarded as a single token. This occurs most commonly with names (San Francisco, Los Angeles) but also with borrowed foreign phrases (au fait).
#89: Index numbers -> (One answer is using n-grams: IIR ch. 3)
#90: Methods of word segmentation vary from having a large vocabulary and taking the longest vocabulary match with some heuristics for unknown words, to the use of machine learning sequence models, such as hidden Markov models or conditional random fields, trained over hand-segmented words.
#91: No unique tokenization + completely different interpretation of a sequence depending on where you split.
#93: Nevertheless: "Google ignores common words and characters such as where, the, how, and other digits and letters which slow down your search without improving the results." (Though you can explicitly ask for them to remain.)
#94: Token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens. The most standard way to normalize is to implicitly create equivalence classes, which are normally named after one member of the set. For instance, if the tokens anti-discriminatory and antidiscriminatory are both mapped onto the term antidiscriminatory, in both the document text and queries, then searches for one term will retrieve documents that contain either.
The advantage of just using mapping rules that remove characters like hyphens is that the equivalence classing to be done is implicit, rather than being fully calculated in advance: the terms that happen to become identical as the result of these rules are the equivalence classes. It is only easy to write rules of this sort that remove characters. Since the equivalence classes are implicit, it is not obvious when you might want to add characters. For instance, it would be hard to know to turn antidiscriminatory into anti-discriminatory.
#95: An alternative to creating equivalence classes is to maintain relations between non-normalized tokens. This method can be extended to hand-constructed lists of synonyms such as car and automobile, a topic we discuss further in
#153: The classifier that assigns a vector x to the class with the highest posterior is called the Bayes classifier.
The error associated with this classifier is called the Bayes error. This is the lowest possible error rate for any classifier over the distribution of all examples and for a chosen hypothesis space.
#154: A complete probability distribution over documents
– defines a likelihood for any possible document d (observation)
– P(relevant) via P(document): P(R|d) ∝ P(d|R)P(R)
– can "generate" synthetic documents, which will share some properties of the original collection
Not all IR models do this – it is possible to estimate p(R|d) directly – logistic regression
Assumptions: one relevance value for every word w
Words are conditionally independent given R – false, but it lowers the number of parameters
All absent words are equally likely to be observed in the relevant and non-relevant classes
#156: One relevance status value per word
An empty document (all words absent) is equally likely to be observed in the relevant and non-relevant classes (provides a natural zero) – for practical reasons, only score terms that appear in the query (TAT)
#160: Now D_t = d_t account for the number of times we observe the term in the document (we have a vector of frequencies)
#165: Can we seen as a probabilisitic automata
They originate from probabilistic models of language gen-
eration developed for automatic speech recognition systems in the early 1980's
(see e.g. Rabiner 1990). Automatic speech recognition systems combine prob-
abilities of two distinct models: the acoustic model, and the language model.
The acoustic model might for instance produce the following candidate texts in
decreasing order of probability: \food born thing", \good corn sing", \mood
morning", and \good morning". Now, the language model would determine
that the phrase \good morning" is much more probable, i.e., it occurs more
frequently in English than the other phrases. When combined with the acoustic
model, the system is able to decide that \good morning" was the most likely
utterance, thereby increasing the system's performance.
For information retrieval, language models are built for each document. By
following this approach, the language model of the book you are reading now
would assign an exceptionally high probability to the word \retrieval" indicating
that this book would be a good candidate for retrieval if the query contains
this word.
#166: For some applications we want all this highly probable P3 In IR P1=P2
#169: Veto terms
Original multiple bernoulli, multinomial widely used now
accountsformultiplewordoccurrencesinthequery(primitive)โ wellunderstood:lotsofresearchinrelatedfields(andnowinIR) โ possibilityforintegrationwithASR/MT/NLP(sameeventspace)
#171: Discounting methods
Problem with all discounting methods:
โ discounting treats unseen words equally (add or subtract ฮต) โ somewordsaremorefrequentthanothers
Essentially, the data model and retrieval function are one and the same
#173: Different ways of smoothing, dirichlet priors smoothing particularly popular