The document discusses information retrieval models. It describes the Boolean retrieval model, which represents documents and queries as sets of terms combined with Boolean operators. Documents are retrieved if they satisfy the Boolean query, but there is no ranking of results. The Boolean model has limitations including difficulty expressing complex queries, controlling result size, and ranking results. It works best for simple, precise queries when users know exactly what they are searching for.
Ppt evaluation of information retrieval systems (silambu111)
The document discusses the evaluation of information retrieval systems. Evaluation is defined as systematically determining a subject's merit using a set of standards. The main purposes of evaluation are to compare the performance of different systems, assess how well systems meet their goals, and identify ways to improve effectiveness. Evaluation can consider managerial or user viewpoints. Common criteria include recall, precision, fallout, generality, effectiveness, efficiency, usability, satisfaction, and cost. Recall measures the proportion of relevant documents retrieved while precision measures the proportion of retrieved documents that are relevant. Evaluation helps identify ways to improve information retrieval system performance.
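The recall and precision measures described above reduce to simple set arithmetic over retrieved and relevant document sets. A minimal sketch in Python, with invented document IDs:

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved docs that are relevant.
    Recall: fraction of relevant docs that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 3 of the 4 retrieved documents are relevant; 3 of the 6 relevant documents were found.
p, r = precision_recall({"d1", "d2", "d3", "d4"},
                        {"d1", "d2", "d3", "d5", "d6", "d7"})
print(p, r)  # 0.75 0.5
```

The two measures pull in opposite directions: retrieving everything maximizes recall at the cost of precision, which is why evaluation typically reports both.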
The document summarizes a technical seminar on web-based information retrieval systems. It discusses information retrieval architecture and approaches, including syntactical, statistical, and semantic methods. It also covers web search analysis techniques like web structure analysis, content analysis, and usage analysis. The document outlines the process of web crawling and types of crawlers. It discusses challenges of web structure, crawling and indexing, and searching. Finally, it concludes that as unstructured online information grows, information retrieval techniques must continue to improve to leverage this data.
This document summarizes scholarly communication and e-journals. It defines scholarly communication as the process by which academic content is generated, reviewed, disseminated, and built upon. E-journals are described as journals available electronically over the internet or on CD-ROM. The benefits of e-journals include speed of publication and distribution, unlimited access, portability, and the ability to link to other resources. E-journals are now overtaking print journals due to factors like cost reductions and user expectations that shift with technology. However, issues remain, including the exponential rise in prices of some journals and licensing restrictions on electronic access.
Overview of a few Content Management Systems and how they can be used in libraries.
Final Project presentation for MLIS 7505 at Valdosta State University.
The document discusses the components and design of information storage and retrieval systems (ISRS). It describes ISRS as having three main components: the user interface, the knowledge base, and the search agent. The user interface allows users to input queries and view results, and should be intuitive. The knowledge base stores the information to be retrieved in a database. The search agent translates user queries and matches them against the knowledge base to retrieve relevant information. The document provides details on each of these components and discusses best practices for designing an effective ISRS.
Functions of information retrieval system (1) (silambu111)
The document discusses information retrieval systems. It defines information retrieval as the process of searching collections of documents to identify those dealing with a particular subject. Information retrieval systems aim to facilitate literature searching. They involve representing, storing, organizing, and providing access to information items so that users can easily find information of interest. Information retrieval draws from multiple disciplines and involves subsystems for documents, users, and searching/matching.
Webometrics is defined as the study of quantitative aspects of web construction and use through bibliometric and informetric approaches. It considers the linking relationships between websites, and their volume, to determine significance. Main areas of webometrics research include link analysis, web citation analysis, search engine studies, and web impact analysis. Link analysis quantitatively studies hyperlinks between pages, while web citation analysis looks at how often articles are cited on the web.
This document provides an introduction to digital libraries, including definitions, key components, and advantages and disadvantages. A digital library is a special library that stores digital objects like text, audio, video and images electronically rather than physically. It defines digital libraries as collections that can be accessed remotely and comprehensively collect, manage and preserve digital content. The document discusses how digital archives differ from physical libraries, strategies for searching digital libraries, common software used, and advantages like no physical boundaries but also challenges around access, organization and digital preservation.
Information Storage and Retrieval: A Case Study (Bhojaraju Gunjal)
Bhojaraju.G, M.S.Banerji and Muttayya Koganurmath (2004). Information Storage and Retrieval: A Case Study, In Proceedings of International Conference on Digital Libraries (ICDL 2004), New Delhi, Feb 24-27, 2004.
(Best Poster Presentation Award)
Knowledge organization (KO) refers to activities like document description, indexing and classification performed in libraries and databases to organize documents and concepts. Knowledge organization systems (KOS) include classification schemes, subject headings, thesauri and other systems used to organize information. KOS impose structures on collections and can be used in digital libraries to provide overviews and support retrieval, though different KOS may characterize entities differently. Common types of KOS include term lists, classifications and categories, and relationship lists.
Introduction into Search Engines and Information Retrieval (A. LE)
Gives a brief introduction to search engines and information retrieval. Covers basics about Google and Yahoo, fundamental terms in the area of information retrieval, and an introduction to the famous PageRank algorithm.
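The PageRank idea mentioned above can be illustrated with a simplified power-iteration sketch. The toy link graph is invented for the example, and real implementations differ in many details (sparse matrices, convergence tests):

```python
def pagerank(links, damping=0.85, iters=50):
    """Simplified PageRank by power iteration.
    links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # every page keeps a (1 - damping) baseline share of rank
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # dangling page: spread its rank evenly over all pages
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
# "c" ends up with the highest rank, since both "a" and "b" link to it.
```

The intuition is that a page is important if important pages link to it; the damping factor models a surfer who occasionally jumps to a random page.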
Broad introduction to information retrieval and web search, used for teaching at the Yahoo Bangalore Summer School 2013. Slides are a mash-up from my own and other people's presentations.
Chain indexing is a method of subject indexing developed by Dr. S. R. Ranganathan. It involves classifying documents using a preferred classification scheme and representing the class number as a chain of links moving from general to specific subjects. Specific subject headings and related references are then derived from analyzing the chain of links. The headings and references are alphabetically arranged to complete the chain indexing process.
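The chain procedure described above can be sketched in a few lines: each index entry leads with a specific term and qualifies it with the broader links above it, and the entries are then arranged alphabetically. The example chain is invented for illustration; real chain indexing works from class numbers in a preferred classification scheme:

```python
def chain_index_entries(chain):
    """Derive subject-index entries from a general-to-specific chain of links.
    Each entry leads with a specific term, qualified by the broader links above it."""
    entries = []
    for i in range(len(chain)):
        # take the chain down to link i, then reverse it: specific term first
        entries.append(", ".join(reversed(chain[: i + 1])))
    return sorted(entries)  # alphabetical arrangement completes the process

for entry in chain_index_entries(["Literature", "Indian", "Hindi", "Poetry"]):
    print(entry)
# Hindi, Indian, Literature
# Indian, Literature
# Literature
# Poetry, Hindi, Indian, Literature
```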
This document discusses search strategies and refinement techniques for online databases. It outlines the steps to develop an effective search strategy, including formulating a clear query, brainstorming keywords, choosing appropriate databases, and combining keywords using techniques like Boolean operators, nesting, truncation and proximity searching. The document also discusses evaluating search results and refining searches by applying limiters and conducting field-specific searches. The goal is to retrieve the most relevant and accurate information through a systematic search approach.
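One of the refinement techniques mentioned, truncation (e.g. the query term comput* matching computer or computing), can be sketched with a regular expression. The sample titles are invented for the example:

```python
import re

def truncation_search(term, titles):
    """Match a right-truncated term like 'comput*' against a list of titles."""
    stem = re.escape(term.rstrip("*"))
    # word boundary before the stem, then any trailing word characters
    rx = re.compile(r"\b" + stem + r"\w*", re.IGNORECASE)
    return [t for t in titles if rx.search(t)]

titles = ["Computing for Librarians", "Computer Networks", "Data Curation"]
print(truncation_search("comput*", titles))  # first two titles match
```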
1. The document defines key terms related to information retrieval systems such as information, retrieval, system, and discusses the basic components and functions of IRS.
2. It explains that the role of users is to formulate queries, and the role of librarians is to assist users in meeting their information needs.
3. The document contrasts older IRS that retrieved entire documents with modern IRS that allow storage, organization, and access to text and multimedia information through techniques like keyword searching and hyperlinks.
The Boolean model is a classical information retrieval model based on set theory and Boolean logic. Queries are specified as Boolean expressions, and documents are retrieved on an exact match: a document either satisfies the expression or it does not, with term occurrences treated as binary. However, the model has limitations: it does not rank documents, and users find it difficult to translate their information needs into Boolean expressions, which often return too few or too many results.
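The model's exact-match behaviour over binary term occurrences maps directly onto set operations on an inverted index. A minimal sketch, with invented terms and document IDs:

```python
# Toy inverted index: term -> set of IDs of documents containing it.
index = {
    "information": {1, 2, 3},
    "retrieval":   {1, 3},
    "storage":     {2, 4},
}
all_docs = {1, 2, 3, 4}

# Boolean operators become set operations; results are unranked sets.
and_result = index["information"] & index["retrieval"]   # docs with both terms
or_result  = index["retrieval"]  | index["storage"]      # docs with either term
not_result = all_docs - index["storage"]                 # docs without the term

print(and_result)  # {1, 3}
```

Note that nothing here distinguishes a document matching the query well from one matching it barely: every result is equally "in", which is exactly the model's ranking limitation.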
This talk covers the basics behind the science of Information Retrieval, with a story-mode treatment of information and its various aspects. It then takes you on a quick journey through the process of building a search engine.
INTRODUCTION TO INFORMATION RETRIEVAL
This lecture will introduce the information retrieval problem, introduce the terminology related to IR, and provide a history of IR. In particular, the history of the web and its impact on IR will be discussed. Special attention and emphasis will be given to the concept of relevance in IR and the critical role it has played in the development of the subject. The lecture will end with a conceptual explanation of the IR process, and its relationships with other domains as well as current research developments.
INFORMATION RETRIEVAL MODELS
This lecture will present the models that have been used to rank documents according to their estimated relevance to user-given queries, where the most relevant documents are shown ahead of those less relevant. These models form the basis for many of the ranking algorithms used in past and present search applications. The lecture will describe models of IR such as Boolean retrieval, vector space, probabilistic retrieval, language models, and logical models. Relevance feedback, a technique that either implicitly or explicitly modifies user queries in light of the user's interaction with retrieval results, will also be discussed, as it is particularly relevant to web search and personalization.
This document presents an overview of web mining techniques. It discusses how web mining uses data mining algorithms to extract useful information from the web. The document classifies web mining into three categories: web structure mining, web content mining, and web usage mining. It provides examples and explanations of techniques for each category such as document classification, clustering, association rule mining, and sequential pattern mining. The document also discusses opportunities and challenges of web mining as well as sources of web usage data like server logs.
Presents my findings from analyzing the Library, Information Sciences & Technology Abstracts (LISTA) database. Points of analysis included keyword versus natural language queries, specificity, exhaustivity, indexes and access points, types of searches and search protocols, coverage, currency, predictability, retrievability, user-friendliness, and search help.
This document provides an overview of taxonomy, ontology, folksonomies, and SKOS (Simple Knowledge Organization Systems). It defines each concept and provides examples. Taxonomy is described as a subject-based classification system. Ontology is defined as a formal specification of concepts and relationships. Folksonomies allow user-generated tagging. SKOS provides a standard for sharing and linking knowledge organization systems on the web. Bibliographies with relevant references are also included for each topic.
The vector space model, or term vector model, is an algebraic model for representing text documents as vectors of identifiers such as index terms. It is used in information filtering, information retrieval, indexing, and relevancy ranking. Its first use was in the SMART Information Retrieval System.
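A minimal sketch of the vector space idea: documents and queries become term-frequency vectors compared by cosine similarity. The example text is invented, and real systems typically weight terms with tf-idf rather than raw counts:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc = Counter("information retrieval ranks documents".split())
query = Counter("information retrieval".split())
print(round(cosine(doc, query), 3))  # 0.707
```

Unlike the Boolean model, this produces a graded score, so documents can be ranked by similarity to the query.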
The document discusses interoperability in digital libraries. It describes how digital libraries aim to support interoperability at three levels: data gathering, harvesting, and federation. It also discusses protocols used for interoperability such as OAI-PMH, DCMES, and LDAP. OAI-PMH allows harvesting of metadata using the OAI-PMH protocol, while DCMES defines a set of 15 elements for resource description. LDAP enables locating resources on a network.
The document describes PRECIS (PREserved Context Indexing System), an indexing system developed in the 1970s. It aims to represent meaning in index entries without disturbing user understanding. PRECIS uses role operators and strings of terms to preserve context across permuted index entries. It was used for indexing the British National Bibliography but was replaced by COMPASS in 1990. PRECIS requires analyzing documents, organizing concepts, and assigning role codes to terms to generate automated two-line index entries preserving semantics and syntax.
This document summarizes Nicholas Belkin's theory of anomalous state of knowledge (ASK), which proposes that information needs arise from gaps or anomalies in a person's knowledge. It compares the traditional information retrieval model to Belkin's ASK model, which recognizes that users may not be able to precisely specify their information need when they have an incomplete understanding. The document also outlines some applications of anomaly detection and discusses implications of Belkin's theory, such as the need to represent information needs differently than the best-match approach used by most search systems.
The document provides an overview of research data management and the importance of avoiding a "DATApocalypse" or data disaster. It discusses the definition of research data, why data management is important, questions to consider, best practices for data management planning, documentation, and long-term preservation. The goal is to help researchers and institutions properly manage data to enable sharing and preservation, as required by most major funders.
February 18 2015 NISO Virtual Conference Scientific Data Management: Caring for Your Institution and its Intellectual Wealth
Learning to Curate Research Data
Jennifer Doty, Research Data Librarian, Emory Center for Digital Scholarship, Emory University, Robert W. Woodruff Library
This presentation was delivered as part of a Digital Humanities workshop in Medieval Studies at the University of Toronto. Its aim was to engage with digital humanists in the area of data management and start a conversation about what good data management means (from collection to preservation). Included is a data management checklist for DH projects.
Human computation, crowdsourcing and social: An industrial perspective (oralonso)
This document summarizes a talk on human computation and crowdsourcing from an industrial perspective. It discusses how crowdsourcing can provide large amounts of cheap labeled data through platforms like Mechanical Turk, but ensuring high-quality labels requires careful task design, payment schemes, quality-control methods, and attention to issues like worker experience and content. Current trends include algorithms for optimizing human-machine workflows and routing tasks to crowds based on their expertise.
Librarians can provide valuable data management services to researchers on campus. An effective strategy includes surveying researchers to identify needs, communicating service offerings through workshops and consultations, and providing in-depth guidance on data management plans and long-term data preservation. Developing workshops involves setting learning objectives, evaluating content, and securing resources like space and food. Consultations allow librarians to help with specific topics like choosing file formats or finding metadata standards. Creating a data management plan requires detailing a data inventory, metadata description, long-term preservation and access methods. Trusted disciplinary repositories and use of stable identifiers help ensure long-term findability and access.
Managing Ireland's Research Data - 3 Research Methods (Rebecca Grant)
Slides providing an overview of the research methods used in the author's thesis, "Managing Ireland's Research Data: Recognising Roles for Recordkeepers". The methods discussed are online surveys, comparative case studies, and autoethnography.
Licensed as CC-BY.
Data Sets, Ensemble Cloud Computing, and the University Library: Getting the ... (SEAD)
This document discusses research data management and the role of university libraries. It describes the SEAD (Sustainable Environment Actionable Data) project, which provides data services like curation, preservation, and a social community network to support research data across its lifecycle. SEAD aims to support interdisciplinary research by allowing researchers to define and manage related collections of data and metadata called Research Objects in a scalable way. The document argues that research organizations are best positioned to provide comprehensive long-term data services that integrate across the entire research process.
The document discusses the human-centered design approach to data as a service. It emphasizes engaging with communities to understand local contexts and involving stakeholders throughout the research process. The presentation outlines steps for responsible research, including obtaining ethics approval, engaging gatekeepers, sensitizing researchers to cultural practices, and documenting engagement activities. It also discusses challenges around community research fatigue and ensuring information meets recipient needs in terms of being the right information, at the right time, for the right purpose.
Slides for the iDB summer school (Sapporo, Japan) http://db-event.jpn.org/idb2013/
Typically, Web mining approaches have ranged from enhancing or learning about user seeking behavior through query log analysis and click-through usage, to employing the web graph structure for ranking, to detecting spam or web page duplicates. Lately, there's a trend toward mining web content semantics and dynamics in order to enhance search capabilities, either by providing direct answers to users or by allowing for advanced interfaces or capabilities. In this tutorial we will look into different ways of mining textual information from Web archives, with a particular focus on how to extract and disambiguate entities and how to put them to use in various search scenarios. Further, we will discuss how web dynamics affect information access and how to exploit them in a search context.
Research Data Management in the Humanities and Social Sciences (Celia Emmelhainz)
This document provides an introduction to research data management for humanities and social sciences librarians. It discusses why data management is an important part of a librarian's role in supporting faculty research, and some key concepts in data management including data formats, storage, security, preservation, and sharing. The document emphasizes that while librarians do not need to be data experts, having a basic understanding of data management concepts can help librarians better serve faculty research needs and expand their role on campus.
This document provides an overview of an Information Retrieval Techniques course. It discusses the objectives of understanding IR basics, text classification, search engines, and recommender systems. The syllabus covers what information is, types of information, retrieval, how IR differs from data retrieval, components of an IR system including document, user and search subsystems, and early developments in the field of IR. It also discusses the software architecture of a traditional IR system including processes like document gathering, indexing, searching, and document management.
Promoting Data Literacy at the Grassroots (ACRL 2015, Portland, OR) (Adam Beauchamp)
Presentation given at ACRL 2015, with Christine Murray, on teaching undergraduate students to discover and evaluate datasets for secondary data analysis.
Big Data and Data Mining - Lecture 3 in Introduction to Computational Social ... (Lauri Eloranta)
Third lecture of the course CSS01: Introduction to Computational Social Science at the University of Helsinki, Spring 2015.(http://blogs.helsinki.fi/computationalsocialscience/).
Lecturer: Lauri Eloranta
Questions & Comments: https://twitter.com/laurieloranta
The document summarizes best practices for improving enterprise intranet search. It discusses how enterprise search differs from public web search due to more complex data, users, and information needs. It provides tips for understanding users and their search behavior through analytics, designing interfaces to support users of all skill levels, and implementing an iterative process of testing, measuring, and improving search performance over time.
K12 Tableau Tuesday - Algebra Equity and Access in Atlanta Public Schools (dogden2)
Algebra 1 is often described as a “gateway” class, a pivotal moment that can shape the rest of a student’s K–12 education. Early access is key: successfully completing Algebra 1 in middle school allows students to complete advanced math and science coursework in high school, which research shows leads to higher wages and lower rates of unemployment in adulthood.
Learn how Atlanta Public Schools is using its data to create more equitable enrollment in middle school Algebra classes.
This chapter provides an in-depth overview of the viscosity of macromolecules, an essential concept in biophysics and medical sciences, especially in understanding fluid behavior like blood flow in the human body.
Key concepts covered include:
✅ Definition and Types of Viscosity: Dynamic vs. Kinematic viscosity, cohesion, and adhesion.
⚙️ Methods of Measuring Viscosity:
Rotary Viscometer
Vibrational Viscometer
Falling Object Method
Capillary Viscometer
🌡️ Factors Affecting Viscosity: Temperature, composition, flow rate.
🩺 Clinical Relevance: Impact of blood viscosity in cardiovascular health.
🌊 Fluid Dynamics: Laminar vs. turbulent flow, Reynolds number.
🔬 Extension Techniques:
Chromatography (adsorption, partition, TLC, etc.)
Electrophoresis (protein/DNA separation)
Sedimentation and Centrifugation methods.
INTRO TO STATISTICS
INTRO TO SPSS INTERFACE
CLEANING MULTIPLE CHOICE RESPONSE DATA WITH EXCEL
ANALYZING MULTIPLE CHOICE RESPONSE DATA
INTERPRETATION
Q & A SESSION
PRACTICAL HANDS-ON ACTIVITY
A measles outbreak originating in West Texas has been linked to confirmed cases in New Mexico, with additional cases reported in Oklahoma and Kansas. The current case count is 795 from Texas, New Mexico, Oklahoma, and Kansas. 95 individuals have required hospitalization, and 3 deaths, 2 children in Texas and one adult in New Mexico. These fatalities mark the first measles-related deaths in the United States since 2015 and the first pediatric measles death since 2003.
The YSPH Virtual Medical Operations Center Briefs (VMOC) were created as a service-learning project by faculty and graduate students at the Yale School of Public Health in response to the 2010 Haiti Earthquake. Each year, the VMOC Briefs are produced by students enrolled in Environmental Health Science Course 581 - Public Health Emergencies: Disaster Planning and Response. These briefs compile diverse information sources – including status reports, maps, news articles, and web content– into a single, easily digestible document that can be widely shared and used interactively. Key features of this report include:
- Comprehensive Overview: Provides situation updates, maps, relevant news, and web resources.
- Accessibility: Designed for easy reading, wide distribution, and interactive use.
- Collaboration: The “unlocked" format enables other responders to share, copy, and adapt seamlessly. The students learn by doing, quickly discovering how and where to find critical information and presenting it in an easily understood manner.
Exploring Substances:
Acidic, Basic, and
Neutral
Welcome to the fascinating world of acids and bases! Join siblings Ashwin and
Keerthi as they explore the colorful world of substances at their school's
National Science Day fair. Their adventure begins with a mysterious white paper
that reveals hidden messages when sprayed with a special liquid.
In this presentation, we'll discover how different substances can be classified as
acidic, basic, or neutral. We'll explore natural indicators like litmus, red rose
extract, and turmeric that help us identify these substances through color
changes. We'll also learn about neutralization reactions and their applications in
our daily lives.
by sandeep swamy
How to Customize Your Financial Reports & Tax Reports With Odoo 17 AccountingCeline George
The Accounting module in Odoo 17 is a complete tool designed to manage all financial aspects of a business. Odoo offers a comprehensive set of tools for generating financial and tax reports, which are crucial for managing a company's finances and ensuring compliance with tax regulations.
The ever evoilving world of science /7th class science curiosity /samyans aca...Sandeep Swamy
The Ever-Evolving World of
Science
Welcome to Grade 7 Science4not just a textbook with facts, but an invitation to
question, experiment, and explore the beautiful world we live in. From tiny cells
inside a leaf to the movement of celestial bodies, from household materials to
underground water flows, this journey will challenge your thinking and expand
your knowledge.
Notice something special about this book? The page numbers follow the playful
flight of a butterfly and a soaring paper plane! Just as these objects take flight,
learning soars when curiosity leads the way. Simple observations, like paper
planes, have inspired scientific explorations throughout history.
How to track Cost and Revenue using Analytic Accounts in odoo Accounting, App...Celine George
Analytic accounts are used to track and manage financial transactions related to specific projects, departments, or business units. They provide detailed insights into costs and revenues at a granular level, independent of the main accounting system. This helps to better understand profitability, performance, and resource allocation, making it easier to make informed financial decisions and strategic planning.
Understanding P–N Junction Semiconductors: A Beginner’s GuideGS Virdi
Dive into the fundamentals of P–N junctions, the heart of every diode and semiconductor device. In this concise presentation, Dr. G.S. Virdi (Former Chief Scientist, CSIR-CEERI Pilani) covers:
What Is a P–N Junction? Learn how P-type and N-type materials join to create a diode.
Depletion Region & Biasing: See how forward and reverse bias shape the voltage–current behavior.
V–I Characteristics: Understand the curve that defines diode operation.
Real-World Uses: Discover common applications in rectifiers, signal clipping, and more.
Ideal for electronics students, hobbyists, and engineers seeking a clear, practical introduction to P–N junction semiconductors.
World war-1(Causes & impacts at a glance) PPT by Simanchala Sarab(BABed,sem-4...larencebapu132
This is short and accurate description of World war-1 (1914-18)
It can give you the perfect factual conceptual clarity on the great war
Regards Simanchala Sarab
Student of BABed(ITEP, Secondary stage)in History at Guru Nanak Dev University Amritsar Punjab 🙏🙏
CBSE - Grade 8 - Science - Chemistry - Metals and Non Metals - WorksheetSritoma Majumder
Introduction
All the materials around us are made up of elements. These elements can be broadly divided into two major groups:
Metals
Non-Metals
Each group has its own unique physical and chemical properties. Let's understand them one by one.
Physical Properties
1. Appearance
Metals: Shiny (lustrous). Example: gold, silver, copper.
Non-metals: Dull appearance (except iodine, which is shiny).
2. Hardness
Metals: Generally hard. Example: iron.
Non-metals: Usually soft (except diamond, a form of carbon, which is very hard).
3. State
Metals: Mostly solids at room temperature (except mercury, which is a liquid).
Non-metals: Can be solids, liquids, or gases. Example: oxygen (gas), bromine (liquid), sulphur (solid).
4. Malleability
Metals: Can be hammered into thin sheets (malleable).
Non-metals: Not malleable. They break when hammered (brittle).
5. Ductility
Metals: Can be drawn into wires (ductile).
Non-metals: Not ductile.
6. Conductivity
Metals: Good conductors of heat and electricity.
Non-metals: Poor conductors (except graphite, which is a good conductor).
7. Sonorous Nature
Metals: Produce a ringing sound when struck.
Non-metals: Do not produce sound.
Chemical Properties
1. Reaction with Oxygen
Metals react with oxygen to form metal oxides.
These metal oxides are usually basic.
Non-metals react with oxygen to form non-metallic oxides.
These oxides are usually acidic.
2. Reaction with Water
Metals:
Some react vigorously (e.g., sodium).
Some react slowly (e.g., iron).
Some do not react at all (e.g., gold, silver).
Non-metals: Generally do not react with water.
3. Reaction with Acids
Metals react with acids to produce salt and hydrogen gas.
Non-metals: Do not react with acids.
4. Reaction with Bases
Some non-metals react with bases to form salts, but this is rare.
Metals generally do not react with bases directly (except amphoteric metals like aluminum and zinc).
Displacement Reaction
More reactive metals can displace less reactive metals from their salt solutions.
Uses of Metals
Iron: Making machines, tools, and buildings.
Aluminum: Used in aircraft, utensils.
Copper: Electrical wires.
Gold and Silver: Jewelry.
Zinc: Coating iron to prevent rusting (galvanization).
Uses of Non-Metals
Oxygen: Breathing.
Nitrogen: Fertilizers.
Chlorine: Water purification.
Carbon: Fuel (coal), steel-making (coke).
Iodine: Medicines.
Alloys
An alloy is a mixture of metals or a metal with a non-metal.
Alloys have improved properties like strength, resistance to rusting.
The *nervous system of insects* is a complex network of nerve cells (neurons) and supporting cells that process and transmit information. Here's an overview:
Structure
1. *Brain*: The insect brain is a complex structure that processes sensory information, controls behavior, and integrates information.
2. *Ventral nerve cord*: A chain of ganglia (nerve clusters) that runs along the insect's body, controlling movement and sensory processing.
3. *Peripheral nervous system*: Nerves that connect the central nervous system to sensory organs and muscles.
Functions
1. *Sensory processing*: Insects can detect and respond to various stimuli, such as light, sound, touch, taste, and smell.
2. *Motor control*: The nervous system controls movement, including walking, flying, and feeding.
3. *Behavioral responThe *nervous system of insects* is a complex network of nerve cells (neurons) and supporting cells that process and transmit information. Here's an overview:
Structure
1. *Brain*: The insect brain is a complex structure that processes sensory information, controls behavior, and integrates information.
2. *Ventral nerve cord*: A chain of ganglia (nerve clusters) that runs along the insect's body, controlling movement and sensory processing.
3. *Peripheral nervous system*: Nerves that connect the central nervous system to sensory organs and muscles.
Functions
1. *Sensory processing*: Insects can detect and respond to various stimuli, such as light, sound, touch, taste, and smell.
2. *Motor control*: The nervous system controls movement, including walking, flying, and feeding.
3. *Behavioral responses*: Insects can exhibit complex behaviors, such as mating, foraging, and social interactions.
Characteristics
1. *Decentralized*: Insect nervous systems have some autonomy in different body parts.
2. *Specialized*: Different parts of the nervous system are specialized for specific functions.
3. *Efficient*: Insect nervous systems are highly efficient, allowing for rapid processing and response to stimuli.
The insect nervous system is a remarkable example of evolutionary adaptation, enabling insects to thrive in diverse environments.
The insect nervous system is a remarkable example of evolutionary adaptation, enabling insects to thrive
How to Manage Opening & Closing Controls in Odoo 17 POSCeline George
In Odoo 17 Point of Sale, the opening and closing controls are key for cash management. At the start of a shift, cashiers log in and enter the starting cash amount, marking the beginning of financial tracking. Throughout the shift, every transaction is recorded, creating an audit trail.
2. A Quick Introduction
• What do we do at InfoSense
• Dynamic Search
• IR and AI
• Privacy and IR
• Today’s lecture is on IR fundamentals
• Textbooks and some of their slides are referenced and used here
• Modern Information Retrieval: The Concepts and Technology behind Search. by Ricardo Baeza-Yates,
Berthier Ribeiro-Neto. Second edition. 2011.
• Introduction to Information Retrieval. C.D. Manning, P. Raghavan, H. Schütze. Cambridge UP, 2008.
• Foundations of Statistical Natural Language Processing. Christopher D. Manning and Hinrich Schütze.
• Search Engines: Information Retrieval in Practice. W. Bruce Croft, Donald Metzler, and Trevor Strohman.
2009.
• Personal views are also presented here
• Especially in the Introduction and Summary sections
2
3. Outline
• What is Information Retrieval
• Task, Scope, Relations to other disciplines
• Process
• Preprocessing, Indexing, Retrieval, Evaluation, Feedback
• Retrieval Approaches
• Boolean
• Vector Space Model
• BM25
• Language Modeling
• Summary
• What works
• State-of-the-art retrieval effectiveness
• Relation to the learning-based approaches
3
4. What is Information Retrieval (IR)?
• Task: To find a few among many
• It is probably motivated by the situation of information overload and
acts as a remedy to it
• When defining IR, we need to be aware that there is a broad sense
and a narrow sense
4
5. Broad Sense of IR
• It is a discipline that finds information that people want
• The motivations behind it would include
• Humans’ desire to understand the world and to gain knowledge
• Acquiring sufficient and accurate information/answers to accomplish a task
• Because finding information can be done in so many different ways, IR would involve:
• Classification (Wednesday lecture by Fabrizio Sebastiani and Alejandro Moreo)
• Clustering
• Recommendation
• Social network
• Interpreting natural languages (Wednesday lecture by Fabrizio Sebastiani and Alejandro Moreo)
• Question answering
• Knowledge bases
• Human-computer interaction (Friday lecture by Rishabh Mehrotra)
• Psychology, Cognitive Science, (Thursday lecture by Joshua Kroll), …
• Any topic that is listed at IR conferences such as SIGIR/ICTIR/CHIIR/CIKM/WWW/WSDM…
5
6. Narrow Sense of IR
• It is ‘search’
• Mostly searching for documents
• It is a computer science discipline that designs and implements
algorithms and tools to help people find information that they want
• from one or multiple large collections of materials (text or multimedia,
structured or unstructured, with or without hyperlinks, with or without
metadata, in a foreign language or not – Monday Lecture Multilingual IR by
Doug Oard),
• where people can be a single user or a group
• who initiate the search process by an information need,
• and, the resulting information should be relevant to the information need
(based on the judgement by the person who starts the search)
6
7. Narrowest Sense of IR
• It helps people find relevant documents
• from one large collection of material (which is the Web or a TREC collection),
• where there is a single user,
• who initiates the search process by a query driven by an information need,
• and, the resulting documents should be ranked (from the most relevant to the
least) and returned in a list
7
9. A Brief Historical Line of Information Retrieval
[Timeline figure, 1940s–2020: Memex; Vector Space Model; Probabilistic Theory; Okapi BM25; TREC; LM; Learning to Rank; Deep Learning; QA; Filtering; Query; User]
9
10. Relationships to Sister Disciplines
[Diagram: IR at the center, linked to its sister disciplines. Solid lines mark transformations or special cases; dashed lines mark overlap.]
• DB: tabulated data and Boolean queries, versus IR’s unstructured data and natural-language queries
• NLP: understanding of data and semantics; IR loses semantics and only counts terms
• QA: returns answers instead of documents; IR is an intermediate step before answers are extracted
• AI / Supervised ML: data-driven, using training data, versus IR’s expert-crafted models with no training data
• Recommendation: no query but a user profile, versus IR’s human-issued queries and non-exhaustive search
• HCI: user-centered study; UI/UX for IR systems
• Information seeking, information exploration, sense-making: interactive, complex information needs, exploratory and curiosity-driven, versus IR’s single-iteration lookup
• Library Science: controlled vocabulary and browsing, versus IR’s large scale and use of algorithms
• Big data / distributed systems: inverted index
10
11. Outline
• What is Information Retrieval
• Task, Scope, Relations to other disciplines
• Process
• Preprocessing, Indexing, Retrieval, Evaluation, Feedback
• Retrieval Approaches
• Boolean
• Vector Space Model
• BM25
• Language Modeling
• Summary
• What works
• State-of-the-art retrieval effectiveness
• Relations to the learning-based approaches
11
12. Process of Information Retrieval
[Diagram: an Information Need becomes a Query Representation; the Corpus becomes a Document Representation, which Indexing turns into an Index; Retrieval Models match the query representation against the Index to produce Retrieval Results, which feed Evaluation/Feedback]
12
13. Terminology
• Query: text to represent an information need
• Document: a returned item in the index
• Term/token: a word, a phrase, an index unit
• Vocabulary: set of the unique tokens
• Corpus/Text collection
• Index/database: index built for a corpus
• Relevance feedback: judgment from human
• Evaluation Metrics: how good is a search system?
• Precision, Recall, F1
13
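The last bullet can be made concrete with a minimal sketch; the document IDs and relevance judgments below are invented for illustration:

```python
# Precision, recall, and F1 for a single query, computed from the set of
# retrieved document IDs and the set of documents judged relevant.

def precision_recall_f1(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents actually returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# 3 of the 4 retrieved docs are relevant; 3 of the 6 relevant docs were found.
p, r, f = precision_recall_f1({1, 2, 3, 4}, {2, 3, 4, 7, 8, 9})
```

Here p = 0.75, r = 0.5, and F1 is their harmonic mean, 0.6.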
15. From Information Need to Query
TASK: Get rid of mice in a politically correct way
Info Need: Info about removing mice without killing them
Verbal form: How do I trap mice alive?
Query: mouse trap
15
Textbook slides for “Introduction to Information Retrieval” by Hinrich Schütze and Christina Lioma. Chap 1
17. Tokenizer
[Diagram: inverted index construction pipeline]
Documents to be indexed: “Friends, Romans, countrymen.”
Tokenizer → Tokens: Friends Romans Countrymen
Linguistic modules → Normalized tokens: friend roman countryman
Indexer → Inverted index: friend → 2, 4; roman → 1, 2; countryman → 13, 16
Sec. 1.2
17
Textbook slides for “Introduction to Information Retrieval” by Hinrich Schütze and Christina Lioma. Ch 1
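The pipeline above (tokenize, normalize, index) can be sketched end to end. This is a deliberately minimal version: the `normalize` step (lowercasing plus a naive plural strip) stands in for the real linguistic modules, and the two sample documents are invented:

```python
import re
from collections import defaultdict

def normalize(token):
    # Stand-in for the linguistic modules: lowercase and strip a plural 's'.
    token = token.lower()
    return token[:-1] if token.endswith("s") else token

def build_inverted_index(docs):
    # docs: {doc_id: text}; returns {normalized token: sorted list of doc IDs}
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"[A-Za-z]+", text):
            index[normalize(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

index = build_inverted_index({
    1: "Friends, Romans, countrymen.",
    2: "Romans hail friends.",
})
```

The resulting postings lists (doc IDs per term) are exactly what Boolean and ranked retrieval operate on later.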
18. An Index
• Sequence of (Normalized token, Document ID) pairs.
I did enact Julius
Caesar I was killed
i' the Capitol;
Brutus killed me.
Doc 1
So let it be with
Caesar. The noble
Brutus hath told you
Caesar was ambitious
Doc 2
Sec. 1.2
18
Textbook slides for “Introduction to Information Retrieval” by Hinrich Schütze and Christina Lioma. Chap 1
20. Evaluation
• Implicit (clicks, time spent) vs. Explicit (yes/no, grades)
• Done by the same user or by a third party (TREC-style)
• Judgments can be binary (Yes/No) or graded
• Assuming ranked or not
• Dimensions under consideration
• Relevance (Precision, nDCG)
• Novelty/diversity
• Usefulness
• Effort/cost
• Completeness/coverage (Recall)
• Combinations of some of the above (F1), and many more
• Relevance is the main consideration. It means
• If a document (a result) can satisfy the information need
• If a document contains the answer to my query
• The evaluation lecture (Tuesday by Nicola Ferro and Maria Maistro) will share many more interesting details
20
22. Outline
• What is Information Retrieval
• Task, Scope, Relations to other disciplines
• Process
• Preprocessing, Indexing, Retrieval, Evaluation, Feedback
• Retrieval Approaches
• Boolean
• Vector Space Model
• BM25
• Language Modeling
• Summary
• What works
• State-of-the-art retrieval effectiveness
• Relations to the learning-based approaches
22
23. How to find relevant documents for a query?
• By keyword matching
• boolean model
• By similarity
• vector space model
• By imagining how a query would be written out
• how likely the query is to be written with this document in mind
• generate with some randomness
• query generation language model
• By trusting how other web pages think about the web page
• pagerank, hits
• By trusting how other people find relevant documents for the same/similar query
• Learning to rank
23
24. Boolean Retrieval
• Views each document as a set of words
• Boolean Queries use AND, OR and NOT to join query terms
• Simple SQL-like queries
• Sometimes with weights attached to each component
• It is like exact match: document matches condition or not
• Perhaps the simplest model to build an IR system
• Many current search systems still use Boolean retrieval
• Professional searchers want to be in control of the search process
• e.g. doctors and lawyers write very long and complex queries with Boolean operators
24
Sec. 1.3
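With an inverted index in hand, the Boolean operators reduce to set operations over postings. A minimal sketch (the postings below are invented, loosely echoing the textbook’s Brutus/Caesar example):

```python
# Boolean retrieval: each query operator maps to a set operation on the
# postings (the set of documents containing each term).

postings = {
    "brutus": {1, 2, 4},
    "caesar": {1, 2, 4, 5},
    "calpurnia": {2},
}
all_docs = {1, 2, 3, 4, 5}

def AND(a, b):
    return a & b

def OR(a, b):
    return a | b

def NOT(a):
    return all_docs - a

# brutus AND caesar AND NOT calpurnia
result = AND(AND(postings["brutus"], postings["caesar"]),
             NOT(postings["calpurnia"]))
```

A document either satisfies the expression or it does not, which is also why results come back as an unranked set.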
25. Summary: Boolean Retrieval
• Advantages:
• Users are under control of the search results
• The system is nearly transparent to the user
• Disadvantages:
• Only give inclusion or exclusion of docs, not rankings
• Users would need to spend more effort in manually examining the returned
sets; sometimes it is very labor intensive
• No fuzziness allowed so the user must be very precise and good at writing
their queries
• However, in many cases users start a search because they don’t know the answer
(document)
25
26. Ranked Retrieval
• Often we want to rank results
• from the most relevant to the least relevant
• Users are lazy
• maybe only look at the first 10 results
• A good ranking is important
• Given a query q, and a set of documents D, the task is to rank those
documents based on a ranking score or relevance score:
• score(q, di), often in the range [0, 1]
• from the most relevant to the least relevant
• A lot of IR research is about determining score(q, di)
26
28. Vector Space Model
• Treat the query as a tiny document
• Represent the query and every document each as a word vector
in a word space
• Rank documents according to their proximity to the query in the
word space
Sec. 6.3
28
29. Represent Documents in a Space of Word Vectors
29
Sec. 6.3
Suppose the corpus only has two
words: ’Jealous’ and ‘Gossip’
They form a space of “Jealous” and
“Gossip”
d1: gossip gossip jealous
gossip gossip gossip gossip
gossip gossip gossip gossip
d2: gossip gossip jealous
gossip gossip gossip gossip
gossip gossip gossip jealous
jealous jealous jealous jealous
jealous jealous gossip jealous
d3: jealous gossip jealous
jealous jealous jealous jealous
jealous jealous jealous jealous
q: gossip gossip jealous
gossip gossip gossip gossip
gossip jealous jealous
jealous jealous
Adapted from textbook slides for “Introduction to Information Retrieval” by Hinrich Schütze and Christina Lioma. Chap 6
30. Euclidean Distance
• If p = (p1, p2, ..., pn) and q = (q1, q2, ..., qn) are two points in the
Euclidean space, their Euclidean distance is
d(p, q) = sqrt((p1 − q1)² + (p2 − q2)² + ... + (pn − qn)²)
30
31. In a space of ‘Jealous’ and ‘Gossip’
31
Sec. 6.3
d1: gossip gossip jealous
gossip gossip gossip gossip
gossip gossip gossip gossip
d2: gossip gossip jealous
gossip gossip gossip gossip
gossip gossip gossip jealous
jealous jealous jealous jealous
jealous jealous gossip jealous
d3: jealous gossip jealous
jealous jealous jealous jealous
jealous jealous jealous jealous
q: gossip gossip jealous
gossip gossip gossip gossip
gossip jealous jealous
jealous jealous
Here, if you look at the content (or, as we say, the word distributions) of each document, d2 is actually the most similar document to q.
However, d2 produces a bigger Euclidean distance score to q.
Adapted from textbook slides for “Introduction to Information Retrieval” by Hinrich Schütze and Christina Lioma. Chap 6
32. Use angle instead of distance
• Short query and long documents will
always have big Euclidean distance
• Key idea: Rank documents according
to their angles with query
• The angle between similar vectors is
small, between dissimilar vectors is
large
• This is equivalent to performing a
document length normalization
Sec. 6.3
32
Adapted from textbook slides for “Introduction to Information Retrieval” by Hinrich Schütze and Christina Lioma. Chap 6
33. Cosine Similarity
cos(q, d) = (q · d) / (|q| |d|) = Σi qi di / (sqrt(Σi qi²) · sqrt(Σi di²))
qi is the tf-idf weight of term i in the query
di is the tf-idf weight of term i in the document
cos(q, d) is the cosine similarity of q and d, or, equivalently, the cosine of
the angle between q and d.
Sec. 6.3
33
34. Exercise: Cosine Similarity
Consider two documents D1, D2 and a query Q, which
document is more similar to the query?
D1 = (0.5, 0.8, 0.3), D2 = (0.9, 0.4, 0.2),
Q = (1.5, 1.0, 0)
34
Example from textbook “Search Engines: Information Retrieval in Practice” Chap 7
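Working the exercise numerically, a small sketch in plain Python:

```python
import math

def cosine(u, v):
    # Cosine of the angle between two term-weight vectors.
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

D1 = (0.5, 0.8, 0.3)
D2 = (0.9, 0.4, 0.2)
Q = (1.5, 1.0, 0.0)

s1, s2 = cosine(D1, Q), cosine(D2, Q)
```

s1 ≈ 0.87 and s2 ≈ 0.97, so D2 is the more similar document.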
39. • Some terms are common,
• less common than the stop words
• but still quite common
• e.g. “Information Retrieval” is uniquely important on NBA.com
• e.g. “Information Retrieval” appears on too many pages of the SIGIR web site, so it is not a
very important term in those pages.
• How to discount their term weights?
39
40. Inverse Document Frequency (idf)
• dft is the document frequency of t
• the number of documents that contain t
• it inversely measures how informative a term is
• The idf of a term t is defined as idft = log10(N / dft)
• The log is used here to “dampen” the effect of idf.
• N is the total number of documents
• Note it is a property of the term and it is query independent
40
Sec. 6.2.1
41. tf-idf weighting
• Product of a term’s tf weight and idf weight regarding a document: tf-idft,d = tft,d × idft
• Best known term weighting scheme in IR
• Increases with the number of occurrences within a document
• Increases with the rarity of the term in the collection
• Note: term frequency takes two inputs (the term and the document) while IDF
only takes one (the term)
41
Sec. 6.2.2
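One common instantiation of the scheme, logarithmic tf with a base-10 idf (a sketch of one variant, not the only one):

```python
import math

def tf_weight(tf):
    # Logarithmic term frequency: 1 + log10(tf) for tf > 0, else 0.
    return 1 + math.log10(tf) if tf > 0 else 0.0

def idf_weight(df, n_docs):
    # Inverse document frequency: log10(N / df).
    return math.log10(n_docs / df)

def tf_idf(tf, df, n_docs):
    return tf_weight(tf) * idf_weight(df, n_docs)

# A term occurring 10 times in a document and in 100 of 10,000 documents:
w = tf_idf(tf=10, df=100, n_docs=10_000)  # (1 + log10(10)) * log10(100) = 2 * 2
```

The weight grows with in-document occurrences and with the rarity of the term in the collection, matching the two bullets above.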
42. tf-idf weighting has many variants
Sec. 6.4
42
Textbook slides for “Introduction to Information Retrieval” by Hinrich Schütze and Christina Lioma. Chap 6
43. Standard tf-idf weighting scheme: lnc.ltc
• A very standard weighting scheme is: lnc.ltc
• Document: lnc
• l: logarithmic tf
• n: no idf
• c: cosine normalization
• Query: ltc
• l: logarithmic tf
• t: idf
• c: cosine normalization
• Note: here the weightings differ in queries and in documents
Sec. 6.4
43
44. Summary: Vector Space Model
• Advantages
• Simple computational framework for ranking documents given a query
• Any similarity measure or term weighting scheme could be used
• Disadvantages
• Assumption of term independence
• Ad hoc
44
46. The (Magical) Okapi BM25 Model
• BM25 is one of the most successful retrieval models
• It is a special case of the Okapi models
• Its full name is Okapi BM25
• It considers the length of documents and uses it to normalize the
term frequency
• It is virtually a probabilistic ranking algorithm, though it looks very ad hoc
• It is intended to behave similarly to a two-Poisson model
• We will talk about Okapi in general
47. What is Behind Okapi?
• [Robertson and Walker 94]
• A two-Poisson document-likelihood Language model
• Models within-document term frequencies by means of a mixture of two Poisson
distributions
• Hypothesize that occurrences of a term in a document have a random or
stochastic element
• It reflects a real but hidden distinction between those documents which are “about” the concept
represented by the term and those which are not.
• Documents which are “about” this concept are described as “elite” for the term.
• Relevance to a query is related to eliteness rather than directly to term
frequency, which is assumed to depend only on eliteness.
47
48. Two-Poisson Model
• Term weight for a term t:
[Formula adapted from “Search Engines: Information Retrieval in Practice” Chap 7]
• where lambda and mu are the Poisson means for tf in the elite and non-elite sets for t
• p’ = P(document elite for t | R)
• q’ = P(document elite for t | NR)
48
49. Characteristics of Two-Poisson Model
• It is zero for tf=0;
• It increases monotonically with tf;
• but to an asymptotic maximum;
• The maximum approximates to the Robertson/Sparck-Jones weight
that would be given to a direct indicator of eliteness.
49
p = P(term present| R)
q = P(term present| NR)
50. Constructing a Function
• Constructing a function such that tf/(constant + tf) increases from 0 to an asymptotic maximum
• A rough estimation of the two-Poisson model
[Figure: the Robertson/Sparck-Jones weight becomes the idf component of Okapi; the approximated term weight tf/(constant + tf) becomes the tf component of Okapi]
50
51. Okapi Model
• The complete version of the Okapi BMxx models
[Formula: the product of an idf component (the Robertson-Sparck Jones weight), a tf component, and a user-related (query term) weight]
• Original Okapi: k1 = 2, b = 0.75, k3 = 0
• BM25: k1 = 1.2, b = 0.75, k3 = a number from 0 to 1000
51
52. Exercise: Okapi BM25
• Query with two terms, “president lincoln”, (qtf = 1)
• No relevance information (r and R are zero)
• N = 500,000 documents
• “president” occurs in 40,000 documents (df1 = 40, 000)
• “lincoln” occurs in 300 documents (df2 = 300)
• “president” occurs 15 times in the doc (tf1 = 15)
• “lincoln” occurs 25 times in the doc (tf2 = 25)
• document length is 90% of the average length (dl/avdl = .9)
• k1 = 1.2, b = 0.75, and k3 = 100
• K = 1.2 · (0.25 + 0.75 · 0.9) = 1.11
52
Example from textbook “Search Engines: Information Retrieval in Practice” Chap 7
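Plugging the exercise’s numbers into BM25, a sketch under the stated assumptions (no relevance information, so the Robertson/Sparck-Jones weight reduces to log((N − df + 0.5)/(df + 0.5))):

```python
import math

def bm25_term(tf, qtf, df, N, dl_over_avdl, k1=1.2, b=0.75, k3=100):
    # One query term's contribution to the BM25 score of one document.
    K = k1 * ((1 - b) + b * dl_over_avdl)
    idf = math.log((N - df + 0.5) / (df + 0.5))  # RSJ weight with r = R = 0
    tf_part = ((k1 + 1) * tf) / (K + tf)
    qtf_part = ((k3 + 1) * qtf) / (k3 + qtf)
    return idf * tf_part * qtf_part

score = (bm25_term(tf=15, qtf=1, df=40_000, N=500_000, dl_over_avdl=0.9)
         + bm25_term(tf=25, qtf=1, df=300, N=500_000, dl_over_avdl=0.9))
# "lincoln" is far rarer than "president", so it dominates the score.
```

With these numbers the score comes out to roughly 20.6, with “lincoln” contributing about three times as much as “president”.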
57. Using language models in IR
§ Each document is treated as (the basis for) a language model
§ Given a query q, rank documents based on P(d|q)
§ P(q) is the same for all documents, so ignore
§ P(d) is the prior – often treated as the same for all d
§ But we can give a prior to high-quality documents, e.g., those with high PageRank.
§ P(q|d) is the probability of q given d
§ Ranking according to P(q|d) and P(d|q) is equivalent
57
Textbook slides for “Introduction to Information Retrieval” by Hinrich Schütze and Christina Lioma.
58. Query-likelihood LM
• Scoring documents with query likelihood
• Known as the language modeling (LM) approach to IR
[Diagram: each document d1, d2, ..., dN has its own document language model θd1, θd2, ..., θdN; a query q is scored against each of them by the query likelihood p(q|θdi)]
58
Adapted from Mei, Fang and Zhai’s “A Study of Poisson Query Generation Model in IR”
59. String = frog said that toad likes frog STOP
P(string|Md1) = 0.01 · 0.03 · 0.04 · 0.01 · 0.02 · 0.01 · 0.2 = 0.0000000000048 = 4.8 · 10^-12
P(string|Md2) = 0.01 · 0.03 · 0.05 · 0.02 · 0.02 · 0.01 · 0.2 = 0.000000000012 = 12 · 10^-12
P(string|Md1) < P(string|Md2)
Thus, document d2 is more relevant to the string “frog said that toad likes frog STOP” than d1 is.
A different language model for each document
59
Textbook slides for “Introduction to Information Retrieval” by Hinrich Schütze and Christina Lioma.
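The slide’s computation as code, a sketch: each document’s unigram model is written out explicitly with the slide’s per-term probabilities (using P(STOP) = 0.2, as in the textbook’s version of this example):

```python
def query_likelihood(tokens, model):
    # p(q | Md): product of per-token probabilities under the document's
    # unigram language model (no smoothing, as on the slide).
    p = 1.0
    for t in tokens:
        p *= model.get(t, 0.0)
    return p

Md1 = {"frog": 0.01, "said": 0.03, "that": 0.04,
       "toad": 0.01, "likes": 0.02, "STOP": 0.2}
Md2 = {"frog": 0.01, "said": 0.03, "that": 0.05,
       "toad": 0.02, "likes": 0.02, "STOP": 0.2}

s = "frog said that toad likes frog STOP".split()
p1, p2 = query_likelihood(s, Md1), query_likelihood(s, Md2)  # p2 > p1, so d2 wins
```

Any query term with zero probability under a model drives the whole product to zero, which is exactly the issue smoothing addresses later.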
60. Binomial Distribution
• Discrete
• Series of trials with only two outcomes, each trial being independent from all the others
• Number r of successes out of n trials, given that the probability of success in any trial is θ:
b(r; n, θ) = (n choose r) θ^r (1 − θ)^(n−r)
60
61. Multinomial Distribution
• The multinomial distribution is a generalization of the binomial distribution.
• The binomial distribution counts successes of an event (for example, heads in coin tosses).
• The parameters:
– N (number of trials)
– θ (the probability of success of the event)
• The multinomial counts the number of a set of events (for example, how many times each side of a die comes up in a set of rolls).
– The parameters:
– N (number of trials)
– θ1 .. θk (the probability of success for each category)
61
62. Multinomial Distribution
• W1, W2, .., Wk are variables
P(W1 = n1, .., Wk = nk | N, θ1, .., θk) = (N! / (n1! n2! .. nk!)) · θ1^n1 · θ2^n2 · .. · θk^nk
where Σi ni = N and Σi θi = 1
• N!/(n1! .. nk!) is the number of possible orderings of N balls (order-invariant selections)
• Assume events (terms being generated) are independent
• A binomial distribution is the multinomial distribution with k = 2 and θ1, θ2 = 1 − θ1
• Each θi is estimated by Maximum Likelihood Estimation (MLE)
62
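The formula can be evaluated directly; a small sketch using only the standard library, with the die example from the previous slide:

```python
from math import factorial, prod

def multinomial_pmf(counts, thetas):
    # P(W1 = n1, .., Wk = nk | N, theta) = N!/(n1!..nk!) * prod(theta_i^n_i)
    n = sum(counts)
    coeff = factorial(n)
    for c in counts:
        coeff //= factorial(c)
    return coeff * prod(t ** c for t, c in zip(thetas, counts))

# A fair die rolled 6 times, each side coming up exactly once:
p_die = multinomial_pmf([1] * 6, [1 / 6] * 6)  # 6! * (1/6)^6
# The binomial special case (k = 2): one head and one tail in two fair flips.
p_coin = multinomial_pmf([1, 1], [0.5, 0.5])
```

The second call illustrates the reduction stated above: with k = 2 the multinomial pmf is just the binomial pmf.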
63. Multi-Bernoulli vs. Multinomial
• Multi-Bernoulli: flip a coin for each word
p(q|d) = Π{w∈q} p(w = 1|d) · Π{w∉q} p(w = 0|d)
• Multinomial: roll a die to choose a word
p(q|d) = Π{j=1..|V|} p(wj|d)^c(wj,q)
[Example: document d = “text mining … model”; query q = “text mining”]
63
Adapted from Mei, Fang and Zhai’s “A Study of Poisson Query Generation Model in IR”
64. Issue
§ A single t with P(t|Md) = 0 will make P(q|Md) zero
§ Smooth the estimates to avoid zeros
64
65. Dirichlet Distribution & Conjugate Prior
• If the prior and the posterior are the same distribution, the prior is called a conjugate prior for the likelihood
• The Dirichlet distribution is the conjugate prior for the multinomial, just as the beta is the conjugate prior for the binomial.
65
66. Dirichlet Smoothing
• Let’s say the prior for θ1, .., θk is Dir(α1, .., αk)
• From observations of the data, we have the counts n1, .., nk
• The posterior distribution for θ1, .., θk, given the data, is Dir(α1 + n1, .., αk + nk)
• So the prior works like pseudo-counts
• it can be used for smoothing
66
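In IR this pseudo-count view is commonly instantiated by setting the pseudo-counts to μ · p(w|C), where p(w|C) is the collection language model and μ a smoothing parameter. A sketch (the counts in the calls are invented):

```python
def dirichlet_smoothed(tf_wd, doc_len, p_w_coll, mu=2000.0):
    # p(w|d) with Dirichlet pseudo-counts mu * p(w|C): short documents lean
    # on the collection model, long documents on their own term counts.
    return (tf_wd + mu * p_w_coll) / (doc_len + mu)

p_seen = dirichlet_smoothed(tf_wd=5, doc_len=100, p_w_coll=0.001)
p_unseen = dirichlet_smoothed(tf_wd=0, doc_len=100, p_w_coll=0.001)  # nonzero
```

Because the pseudo-counts never vanish, a query term absent from the document no longer zeroes out the whole query likelihood.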
67. JM Smoothing
§ Also known as the Mixture Model
§ Mixes the probability from the document with the general collection frequency of the word:
P(t|d) = λ P(t|Md) + (1 − λ) P(t|Mc)
§ Correctly setting λ is very important for good performance.
§ High value of λ: conjunctive-like search – tends to retrieve documents containing all query words.
§ Low value of λ: more disjunctive, suitable for long queries
67
Textbook slides for “Introduction to Information Retrieval” by Hinrich Schütze and Christina Lioma.
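The mixture itself is one line of code; a sketch (the counts and λ below are invented):

```python
def jm_smoothed(tf_wd, doc_len, p_w_coll, lam=0.5):
    # p(t|d) = lam * ML estimate from the document + (1 - lam) * collection model
    p_ml = tf_wd / doc_len if doc_len else 0.0
    return lam * p_ml + (1 - lam) * p_w_coll

p_seen = jm_smoothed(tf_wd=5, doc_len=100, p_w_coll=0.001)
p_unseen = jm_smoothed(tf_wd=0, doc_len=100, p_w_coll=0.001)  # nonzero
```

Unlike Dirichlet smoothing, the interpolation weight here is fixed by λ rather than scaling with document length.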
68. Poisson Query-likelihood LM

Doc d: text mining model mining text clustering text
Rates of arrival: λ_text = 3/7, λ_mining = 2/7, λ_model = 1/7, λ_clustering = 1/7

Query q: "mining text mining systems" (duration |q|; receiver: the query)
Counts in q: mining 2, text 1, model 0, clustering 0

Poisson: each term w_i arrives independently at rate λ_i, so

p(q|d) = ∏_i e^{−λ_i|q|} (λ_i|q|)^{c(w_i,q)} / c(w_i,q)!
       = e^{−(3/7)|q|} ((3/7)|q|)^1/1! × e^{−(2/7)|q|} ((2/7)|q|)^2/2! × e^{−(1/7)|q|} ((1/7)|q|)^0/0! × e^{−(1/7)|q|} ((1/7)|q|)^0/0!

Slides adapted from Mei, Fang and Zhai's "A study of Poisson query generation model in IR"
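A sketch of the Poisson query likelihood with the toy numbers above. Only words seen in the document contribute a rate; the out-of-vocabulary query word "systems" is ignored here, where a real system would smooth the rates.

```python
from math import exp, factorial
from collections import Counter

doc = "text mining model mining text clustering text".split()
query = "mining text mining systems".split()

rates = {w: c / len(doc) for w, c in Counter(doc).items()}  # lambda_w per token
q_counts = Counter(query)
q_len = len(query)  # the "duration" |q|

# p(q|d) = prod over vocabulary words of Poisson(c(w,q); lambda_w * |q|)
lik = 1.0
for w, lam in rates.items():
    mean = lam * q_len
    c = q_counts[w]
    lik *= exp(-mean) * mean**c / factorial(c)
print(lik)
```

Unlike the multinomial, the per-term factors need not sum to one over the vocabulary, which is what makes per-term smoothing easy in this family.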
69. Comparison
                            multi-Bernoulli       multinomial       Poisson
Event space                 appearance/absence    vocabulary        frequency
Model frequency?            No                    Yes               Yes
Model length?
(document/query)            No                    Implicitly yes    Yes
w/o sum-to-one constraint?  Yes                   No                Yes
Per-term smoothing          Easy                  Hard              Easy
Closed-form solution for
mixture of models?          No                    No                Yes

multi-Bernoulli: p(q|d) = ∏_{w∈q} p(w=1|d) × ∏_{w∉q} p(w=0|d)
multinomial:     p(q|d) = ∏_{j=1}^{|V|} p(w_j|d)^{c(w_j,q)}
Poisson:         p(q|d) = ∏_{j=1}^{|V|} p(c(w_j,q)|d)
Slides adapted from Mei, Fang and Zhai's "A study of Poisson query generation model in IR"
70. Summary: Language Modeling
• LM vs. VSM:
• LM: based on probability theory
• VSM: based on similarity, a geometric / linear-algebra notion
• Modeling term frequency in LM is better than just modeling term presence/absence
• Multinomial model performs better than multi-Bernoulli
• Mixture of Multinomials for the background smoothing model has been shown to be
effective for IR
• LDA-based retrieval [Wei & Croft SIGIR 2006]
• PLSI [Hofmann SIGIR 99]
§ Probabilities are inherently length-normalized when doing parameter estimation
§ Mixing document and collection frequencies has an effect similar to idf
§ Terms rare in the general collection, but common in some documents will have a
greater influence on the ranking.
71. Outline
• What is Information Retrieval
• Task, Scope, Relations to other disciplines
• Process
• Preprocessing, Indexing, Retrieval, Evaluation, Feedback
• Retrieval Approaches
• Boolean
• Vector Space Model
• BM25
• Language Modeling
• Summary
• What works?
• State-of-the-art retrieval effectiveness – what should you expect?
• Relations to the learning-based approaches
72. What works?
• Term Frequency (tf)
• Inverse Document Frequency (idf)
• Document length normalization
• Okapi BM25
• Seems ad hoc, but works very well (popularly used as a baseline)
• Created by human experts, not learned from data
• Other, better-justified methods can achieve effectiveness similar to BM25
• They help build a deeper understanding of IR and related disciplines
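For concreteness, here is a sketch of one common textbook form of Okapi BM25 (the "+1" inside the log is one of several idf variants; the defaults k1 = 1.2, b = 0.75 are conventional; the corpus and function name are my own):

```python
from math import log
from collections import Counter

def bm25_score(query_tokens, doc_tokens, corpus, k1=1.2, b=0.75):
    """BM25 score of one document for a query, against a small corpus."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_tokens)
    score = 0.0
    for w in set(query_tokens):
        df = sum(1 for d in corpus if w in d)  # document frequency
        if df == 0:
            continue
        idf = log((N - df + 0.5) / (df + 0.5) + 1)  # "+1" keeps idf positive
        # tf saturation plus document-length normalization
        norm = (tf[w] * (k1 + 1)) / (tf[w] + k1 * (1 - b + b * len(doc_tokens) / avgdl))
        score += idf * norm
    return score

docs = [
    "text mining and text clustering".split(),
    "web search ranking".split(),
    "mining massive datasets".split(),
]
q = "text mining".split()
scores = [bm25_score(q, d, docs) for d in docs]
print(scores)  # the first document should score highest
```

The three ingredients listed above are all visible: tf in the numerator, idf as the log factor, and length normalization via b and avgdl.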
73. What might not work?
• You might have heard of other topics/techniques, such as
• Pseudo-relevance feedback
• Query expansion
• N-grams instead of unigrams
• Semantically-heavy annotations
• Sophisticated understanding of documents
• Personalization (reading a lot into the user)
• But they usually don't work reliably (not as well as we expect, and they sometimes worsen performance)
• Maybe more research needs to be done
• Or maybe they are not the right directions
74. At the heart is the metric
• How good our users feel about the search results
• Sometimes it can be subjective
• The approaches we discussed today do not directly optimize the metrics (P, R, nDCG, MAP, etc.)
• These approaches are considered more conventional: they do not make use of the large amounts of data that models could be learned from
• Instead, they were created by researchers based on their own understanding of IR; most of the models were hand-crafted or imagined
• And these models work very well
• Salute to the brilliant minds
75. Learning-based Approaches
• More recently, learning-to-rank has become the dominant approach
• Due to the vast amount of logged data from Web search engines
• The retrieval algorithm paradigm
• Has become data-driven
• Requires large amounts of data from massive numbers of users
• IR is formulated as a supervised learning problem
• Directly uses the metrics as the optimization objectives
• No longer guess what a good model should be, but leave it to the data to decide
• The Deep learning lecture (Thursday by Bhaskar Mitra, Nick Craswell,
and Emine Yilmaz) will introduce them in depth
76. References
• IR Textbooks used for this talk:
• Introduction to Information Retrieval. C.D. Manning, P. Raghavan, H. Schütze. Cambridge UP, 2008.
• Foundations of Statistical Natural Language Processing. Christopher D. Manning and Hinrich Schütze.
• Search Engines: Information Retrieval in Practice. W. Bruce Croft, Donald Metzler, and Trevor Strohman. 2009.
• Modern Information Retrieval: The Concepts and Technology behind Search. Ricardo Baeza-Yates, Berthier Ribeiro-Neto. Second edition, 2011.
• Main IR research papers used for this talk:
• Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval. Robertson, S. E., & Walker, S.
SIGIR 1994.
• Document Language Models, Query Models, and Risk Minimization for Information Retrieval. Lafferty, John and Zhai, Chengxiang.
SIGIR 2001.
• A study of Poisson query generation model for information retrieval. Qiaozhu Mei, Hui Fang, Chengxiang Zhai. SIGIR 2007.
• Course Materials/presentation slides used in this talk:
• Barbara Rosario’s “Mathematical Foundations” lecture notes for textbook “Statistical Natural Language Processing”
• Textbook slides for “Search Engines: Information Retrieval in Practice” by its authors
• Oznur Tastan's recitation for 10601 Machine Learning
• Textbook slides for “Introduction to Information Retrieval” by Hinrich Schütze and Christina Lioma
• CS276: Information Retrieval and Web Search by Pandu Nayak and Prabhakar Raghavan
• 11-441: Information Retrieval by Jamie Callan
• A study of Poisson query generation model for information retrieval. Qiaozhu Mei, Hui Fang, Chengxiang Zhai