
Search Engines

Information Retrieval in Practice

All slides ©Addison Wesley, 2008


Search and Information Retrieval
• Search on the Web is a daily activity for many people
throughout the world
• Search and communication are the most popular uses of the computer
• Applications involving search are everywhere
• The field of computer science that is most involved with R&D
for search is information retrieval (IR)
Information Retrieval
• “Information retrieval is a field concerned with the structure,
analysis, organization, storage, searching, and retrieval of
information.” (Salton, 1968)
• General definition that can be applied to many types of
information and search applications
• Primary focus of IR since the 50s has been on text and
documents
Information Retrieval
[Diagram: IR shown at the intersection of natural language processing (NLP), databases (DB), and machine learning (ML)]
Web Search
• Goal is to find information relevant to a user’s interests
• Challenge 1: A significant amount of content on the web is
not quality information
– Many pages contain nonsensical rants, etc.
– The web is full of misspellings, multiple languages, etc.
– Many pages are designed not to convey information – but to get a
high ranking (e.g., “search engine optimization”)
• Challenge 2: billions of documents
• Challenge 3: hyperlinks encode information

Characteristics of the Web
1. Huge (1.75 terabytes of text)
2. Allows people to share information globally and freely
3. Hides the details of communication protocols, machine locations, and operating systems
4. Data are unstructured
5. Exponential growth
6. Increasingly commercial over time (from 1.5% .com in 1993 to 60% .com in 1997)
Difficulties of Building a Search Engine
• Built by companies that hide the technical details
• Distributed data
• High percentage of volatile data
• Large volume
• Unstructured and redundant data
• Quality of data
• Heterogeneous data
• Dynamic data
• How to specify a query from the user
• How to interpret the answers provided by the system
User Problems
• Do not know exactly how to phrase a query as a sequence of words
• Not aware of the input requirements of the search engine
• Have problems understanding Boolean logic, so they cannot use advanced search
• Novice users do not know how to start using a search engine
• Do not care about advertisements, so the search engine gets no funding
• Around 85% of users only look at the first page of results, so relevant answers might be skipped
Searching Guidelines
• Specify the words clearly (+, -)
• Use Advanced Search when necessary
• Provide as many specific terms as possible
• If looking for a company, institution, or organization, try:
www.name [.com | .edu | .org | .gov | country code]
• Some search engines specialize in particular areas
• For broad queries, try using Web directories as starting points
• Remember that anyone can publish data on the Web, so information returned by search engines might not be accurate
The Largest Search Engines (1998)
[Table omitted]
AltaVista Architecture
[Diagram omitted]
Information Retrieval
• Traditional information retrieval is basically text search
– A corpus or body of text documents, e.g., in a document collection in a library or on a CD
– Documents are generally high-quality and designed to convey information
– Documents are assumed to have no structure beyond words
• Searches are generally based on meaningful phrases, perhaps including
predicates over categories, dates, etc.
• The goal is to find the document(s) that best match the search phrase, according
to a search model
• Assumptions are typically different from Web: quality text, limited-size corpus,
no hyperlinks

Motivation for Information Retrieval
• Information Retrieval (IR) is about:
– Representation, storage, organization of, and access to “information items”
• Focus is on user’s “information need” rather than a precise query:
– “March Madness” – Find information on college basketball teams which: (1) are maintained
by a US university and (2) participate in the NCAA tournament
• Emphasis is on the retrieval of information (not data)

Data vs. Information Retrieval
• Data retrieval, analogous to database querying: which docs contain a set of
keywords?
– Well-defined, precise logical semantics
– A single erroneous object implies failure!
• Information retrieval:
– Information about a subject or topic
– Semantics is frequently loose; we want approximate matches
– Small errors are tolerated (and in fact inevitable)
• IR system:
– Interpret contents of information items
– Generate a ranking which reflects relevance
– Notion of relevance is most important – needs a model

Basic Model
[Diagram: documents are represented by index terms and the information need is expressed as a query; the two representations are matched to produce a ranking]
The Full Info Retrieval Process
[Diagram: the user’s interest is expressed through a browser/UI as a query, refined by query operations and user feedback; documents (Web or DB) are gathered by a crawler via the data access layer; text processing and modeling produce logical views of queries and documents, indexing builds an inverted index, searching retrieves candidate documents, and ranking returns ranked docs to the user]
Terminology
• IR systems usually adopt index terms to process queries
• Index term:
– a keyword or group of selected words
– any word (more general)
• Stemming might be used:
– connect: connecting, connection, connections
• An inverted index is built for the chosen index terms

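To make these terms concrete, here is a minimal Python sketch (my own illustration, not from the slides) of building an inverted index, with a crude suffix-stripping rule standing in for a real stemmer such as Porter's; the sample documents are invented.

```python
from collections import defaultdict

def stem(word):
    # Crude suffix stripping; a real system would use e.g. the Porter stemmer.
    for suffix in ("ions", "ion", "ings", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def build_inverted_index(docs):
    # Map each index term to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[stem(word)].add(doc_id)
    return index

docs = {
    1: "connecting a modem",
    2: "network connections and connection errors",
    3: "retrieval of text documents",
}
index = build_inverted_index(docs)
print(index["connect"])  # {1, 2}: 'connecting', 'connections', 'connection' map to one term
```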
What’s a Meaningful Result?
• Matching at index term level is quite imprecise
– Users are frequently dissatisfied
– One problem: users are generally poor at posing queries
– Frequent dissatisfaction of Web users (who often give single-keyword
queries)
• Issue of deciding relevance is critical for IR systems: ranking

Rankings
• A ranking is an ordering of the documents retrieved that (hopefully)
reflects the relevance of the documents to the user query
• A ranking is based on fundamental premises regarding the notion of
relevance, such as:
– common sets of index terms
– sharing of weighted terms
– likelihood of relevance
• Each set of premises leads to a distinct IR model
In information retrieval (IR), premises are foundational assumptions or underlying principles that guide the design,
implementation, and evaluation of IR systems. These premises influence how data is indexed, queried, and retrieved.
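As a toy illustration of the “sharing of weighted terms” premise, the following sketch (not from the slides; the documents and the simple tf-idf weighting are my own assumptions) ranks documents by summing the weights of the query terms they share:

```python
import math
from collections import Counter

def rank_by_weighted_terms(query, docs):
    # docs: {doc_id: list of terms}; query: list of terms.
    n = len(docs)
    # Document frequency: number of docs containing each term.
    df = Counter(t for terms in docs.values() for t in set(terms))
    scores = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        # Score = sum over shared query terms of tf * idf (simplified weighting).
        scores[doc_id] = sum(tf[t] * math.log(n / df[t]) for t in query if t in tf)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

docs = {
    "d1": "bank director steals funds".split(),
    "d2": "bank opens new branch in amherst".split(),
    "d3": "college basketball tournament".split(),
}
print(rank_by_weighted_terms("bank scandals amherst".split(), docs))
# d2 ranks first (shares 'bank' and 'amherst'), then d1, then d3
```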
Types of IR Models
[Taxonomy of IR models, organized by user task:
• Retrieval (ad hoc, filtering)
– Classic models: Boolean, vector, probabilistic
– Set theoretic: fuzzy, extended Boolean
– Algebraic: generalized vector, latent semantic indexing, neural networks
– Probabilistic: inference network, belief network
– Structured models: non-overlapping lists, proximal nodes
• Browsing: flat, structure guided, hypertext]
Classic IR Models – Basic Concepts
• Each document represented by a set of representative
keywords or index terms
• An index term is a document word useful for remembering
the document main themes
• Traditionally, index terms were nouns because nouns have
meaning by themselves
• However, search engines assume that all words are index
terms (full text representation)

What is a Document?
• Documents are the basic units of retrieval in an IR system.
• In practice they might be: Web pages, email messages, LaTeX files, news articles, phone messages, etc.
• Update: add, delete, append(?), modify(?)
• Passages and XML elements are other possible units of
retrieval.
What is a Document?
• Examples:
– web pages, email, books, news stories, scholarly papers, text
messages, Word™, Powerpoint™, PDF, forum postings, patents, IM
sessions, etc.
• Common properties
– Significant text content
– Some structure (e.g., title, author, date for papers; subject, sender,
destination for email)
Documents vs. Database Records
• Database records (or tuples in relational databases) are typically
made up of well-defined fields (or attributes)
– e.g., bank records with account numbers, balances, names, addresses,
social security numbers, dates of birth, etc.
• Easy to compare fields with well-defined semantics to queries in
order to find matches
• Text is more difficult
Documents vs. Records
• Example bank database query
– Find records with balance > $50,000 in branches located in Amherst,
MA.
– Matches easily found by comparison with field values of records
• Example search engine query
– bank scandals in western mass
– This text must be compared to the text of entire news stories
Comparing Text
• Comparing the query text to the document text and determining
what is a good match is the core issue of information retrieval
• Exact matching of words is not enough
– Many different ways to write the same thing in a “natural language”
like English
– e.g., does a news story containing the text “bank director in Amherst
steals funds” match the query?
– Some stories will be better matches than others
Basic assumptions of Information Retrieval
• Collection: A set of documents
– Assume it is a static collection for the moment
• Goal: Retrieve documents with information that is relevant to the user’s information need and helps the user complete a task
The classic search model
[Diagram: User task: “Get rid of mice in a politically correct way” → (misconception?) → Info need: “Info about removing mice without killing them” → (misformulation?) → Query: “how trap mice alive” → search engine searches the collection → results, with query refinement feeding back into the query]
How good are the retrieved docs?
• Precision: Fraction of retrieved docs that are relevant to the user’s information need
• Recall: Fraction of relevant docs in the collection that are retrieved
• More precise definitions and measurements to follow later
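As a preview, here is a minimal sketch (with hypothetical document ids) that computes both measures from a set of retrieved documents and a set of relevant documents:

```python
def precision_recall(retrieved, relevant):
    # Precision: fraction of retrieved docs that are relevant.
    # Recall: fraction of relevant docs that were retrieved.
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {1, 2, 3, 4, 5}
relevant = {2, 5, 8}
print(precision_recall(retrieved, relevant))  # (0.4, 0.666...)
```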
IR vs. databases: Structured vs unstructured data
• Structured data tends to refer to information in “tables”

Employee   Manager   Salary
Smith      Jones     50000
Chang      Smith     60000
Ivy        Smith     50000

• Typically allows numerical range and exact match (for text) queries, e.g.,
Salary < 60000 AND Manager = Smith.
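To contrast the two query styles, here is a small sketch (records invented to match the table above): an exact-match filter over structured records next to a naive keyword-overlap match over free text, reusing the “bank scandals” query from an earlier slide:

```python
records = [
    {"employee": "Smith", "manager": "Jones", "salary": 50000},
    {"employee": "Chang", "manager": "Smith", "salary": 60000},
    {"employee": "Ivy",   "manager": "Smith", "salary": 50000},
]

# Structured query: Salary < 60000 AND Manager = Smith (precise semantics).
matches = [r for r in records if r["salary"] < 60000 and r["manager"] == "Smith"]
print([r["employee"] for r in matches])  # ['Ivy']

# Unstructured query: keyword overlap with free text (approximate semantics).
story = "bank director in amherst steals funds"
query = "bank scandals in western mass"
overlap = set(story.split()) & set(query.split())
print(overlap)  # {'bank', 'in'} -- partial overlap; needs ranking, not exact match
```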
Unstructured data
• Typically refers to free text
• Allows
– Keyword queries including operators
– More sophisticated “concept” queries, e.g.,
• find all web pages dealing with drug abuse
• Classic model for searching text documents
Semi-structured data
• In fact almost no data is “unstructured”
• E.g., this slide has distinctly identified zones such as the Title and
Bullets
• … to say nothing of linguistic structure
• Facilitates “semi-structured” search such as
– Title contains data AND Bullets contain search
• Or even
– Title is about Object Oriented Programming AND Author something like
stro*rup
– where * is the wild-card operator
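A minimal sketch (fields and documents are hypothetical) of the kind of field-restricted search with a “*” wildcard on the author field described above, using Python's fnmatch:

```python
from fnmatch import fnmatch

slides = [
    {"title": "Unstructured data", "bullets": "keyword queries and concept queries", "author": "unknown"},
    {"title": "Semi-structured data", "bullets": "search within identified zones", "author": "stroustrup"},
]

def field_query(docs, title_word=None, bullet_word=None, author_pattern=None):
    # Return titles of docs whose fields satisfy all the given constraints.
    out = []
    for d in docs:
        if title_word and title_word not in d["title"].lower():
            continue
        if bullet_word and bullet_word not in d["bullets"].lower():
            continue
        if author_pattern and not fnmatch(d["author"].lower(), author_pattern):
            continue
        out.append(d["title"])
    return out

# Title contains "data" AND Bullets contain "search" AND Author matches stro*rup
print(field_query(slides, title_word="data", bullet_word="search", author_pattern="stro*rup"))
# ['Semi-structured data']
```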
Dimensions of IR
• IR is more than just text, and more than just web search
– although these are central
• People doing IR work with different media, different types of
search applications, and different tasks
Other Media
• New applications increasingly involve new media
– e.g., video, photos, music, speech
• Like text, content is difficult to describe and compare
– text may be used to represent them (e.g. tags)
• IR approaches to search and evaluation are appropriate
Dimensions of IR
Content: text, images, video, scanned documents, audio, music
Applications: web search, vertical search, enterprise search, desktop search, forum search, P2P search, literature search
Tasks: ad hoc search, filtering, classification, question answering

P2P search, short for peer-to-peer search, refers to a decentralized search approach where the data is distributed across multiple nodes in a
network, rather than being stored centrally. It allows users to query and retrieve data directly from peers in the network.
IR Tasks
• Ad-hoc search
– Find relevant documents for an arbitrary text query
• Filtering
– Identify relevant user profiles for a new document
• Classification
– Identify relevant labels for documents
• Question answering
– Give a specific answer to a question
How far do people look for results?
[Chart omitted]
(Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)
Users’ empirical evaluation of results
• Quality of pages varies widely
– Relevance is not enough
– Other desirable qualities (non IR!!)
• Content: Trustworthy, diverse, non-duplicated, well maintained
• Web readability: display correctly & fast
• No annoyances: pop-ups, etc.
• Precision vs. recall
– On the web, recall seldom matters
• What matters
– Precision at 1? Precision above the fold?
– Comprehensiveness – must be able to deal with obscure queries
• Recall matters when the number of matches is very small
• User perceptions may be unscientific, but are significant over a large aggregate

Users’ empirical evaluation of engines
• Relevance and validity of results
• UI – Simple, no clutter, error tolerant
• Trust – Results are objective
• Coverage of topics for polysemic queries
• Pre/Post process tools provided
– Mitigate user errors (auto spell check, search assist,…)
– Explicit: Search within results, more like this, refine ...
– Anticipative: related searches
• Deal with idiosyncrasies
– Web specific vocabulary
• Impact on stemming, spell-check, etc.
– Web addresses typed in the search box
• “The first, the last, the best and the worst …”
The Web document collection
• No design/co-ordination
• Distributed content creation, linking,
democratization of publishing
• Content includes truth, lies, obsolete information,
contradictions …
• Unstructured (text, html, …), semi-structured
(XML, annotated photos), structured (Databases)

• Scale much larger than previous text collections …
but corporate records are catching up
• Growth – slowed down from initial “volume
doubling every few months” but still expanding
• Content can be dynamically generated

Big Issues in IR
• Relevance
– What is it?
– Simple (and simplistic) definition: A relevant document contains the
information that a person was looking for when they submitted a
query to the search engine
– Many factors influence a person’s decision about what is relevant:
e.g., task, context, novelty, style
– Topical relevance (same topic) vs. user relevance (everything else)
Big Issues in IR
• Relevance
– Retrieval models define a view of relevance
– Ranking algorithms used in search engines are based on retrieval models
– Most models describe statistical properties of text rather than linguistic
• i.e. counting simple text features such as words instead of parsing and analyzing
the sentences
• Statistical approach to text processing started with Luhn in the 50s
• Linguistic features can be part of a statistical model
Big Issues in IR
• Evaluation
– Experimental procedures and measures for comparing system output
with user expectations
• Originated in Cranfield experiments in the 60s
– IR evaluation methods now used in many fields
– Typically use test collection of documents, queries, and relevance
judgments
• Most commonly used are TREC collections
– Recall and precision are two examples of effectiveness measures
Big Issues in IR
• Users and Information Needs
– Search evaluation is user-centered
– Keyword queries are often poor descriptions of actual information
needs
– Interaction and context are important for understanding user intent
– Query refinement techniques such as query expansion, query
suggestion, relevance feedback improve ranking
IR and Search Engines
• A search engine is the practical application of information
retrieval techniques to large scale text collections
• Web search engines are best-known examples, but many
others
– Open source search engines are important for research and
development
• e.g., Lucene, Lemur/Indri, Galago
• Big issues include main IR issues but also some others
IR and Search Engines
Information Retrieval:
– Relevance: effective ranking
– Evaluation: testing and measuring
– Information needs: user interaction
Search Engines:
– Performance: efficient search and indexing
– Incorporating new data: coverage and freshness
– Scalability: growing with data and users
– Adaptability: tuning for applications
– Specific problems: e.g., spam
Search Engine Issues
• Performance
– Measuring and improving the efficiency of search
• e.g., reducing response time, increasing query throughput, increasing indexing
speed
– Indexes are data structures designed to improve search efficiency
• designing and implementing them are major issues for search engines
Search Engine Issues
• Dynamic data
– The “collection” for most real applications is constantly changing in
terms of updates, additions, deletions
• e.g., web pages
– Acquiring or “crawling” the documents is a major task
• Typical measures are coverage (how much has been indexed) and freshness
(how recently was it indexed)
– Updating the indexes while processing queries is also a design issue
Search Engine Issues
• Scalability
– Making everything work with millions of users every day, and many
terabytes of documents
– Distributed processing is essential
• Adaptability
– Changing and tuning search engine components such as ranking
algorithm, indexing strategy, interface for different applications
Spam
• For Web search, spam in all its forms is one of the major issues
• Affects the efficiency of search engines and, more seriously, the
effectiveness of the results
• Many types of spam
– e.g. spamdexing or term spam, link spam, “optimization”
• New subfield called adversarial IR, since spammers are
“adversaries” with different goals
Course Goals
• To help you to understand search engines, evaluate and compare them, and modify them for specific applications
• Provide broad coverage of the important issues in information retrieval and search engines
– includes underlying models and current research directions
