
Search Engines

Information Retrieval in Practice

All slides ©Addison Wesley, 2008


Search and Information Retrieval
• Search on the Web is a daily activity for many people
throughout the world
• Search and communication are the most popular uses of the computer
• Applications involving search are everywhere
• The field of computer science that is most involved with R&D
for search is information retrieval (IR)
Information Retrieval
• “Information retrieval is a field concerned with the structure,
analysis, organization, storage, searching, and retrieval of
information.” (Salton, 1968)
• General definition that can be applied to many types of
information and search applications
• Primary focus of IR since the 50s has been on text and
documents
Information Retrieval
[Diagram: IR shown at the intersection of natural language processing (NLP), databases (DB), and machine learning (ML)]
Web Search
• Goal is to find information relevant to a user’s interests
• Challenge 1: A significant amount of content on the web is
not quality information
– Many pages contain nonsensical rants, etc.
– The web is full of misspellings, multiple languages, etc.
– Many pages are designed not to convey information – but to get a
high ranking (e.g., “search engine optimization”)
• Challenge 2: billions of documents
• Challenge 3: hyperlinks encode information

Characteristics of the Web
1. Huge (1.75 terabytes of text)
2. Allows people to share information globally and freely
3. Hides the details of communication protocols, machine locations, and operating systems
4. Data are unstructured
5. Exponential growth
6. Increasingly commercial over time (from 1.5% .com in 1993 to 60% .com in 1997)
Difficulties of Building a Search Engine
• Built by companies that hide the technical details
• Distributed data
• High percentage of volatile data
• Large volume
• Unstructured and redundant data
• Quality of data
• Heterogeneous data
• Dynamic data
• How to specify a query from the user
• How to interpret the answers provided by the system
User Problems
• Do not know exactly how to phrase a query as a sequence of words
• Not aware of the input requirements of the search engine
• Have problems understanding Boolean logic, so they cannot use advanced search
• Novice users do not know how to start using a search engine
• Do not care about advertisements, so the search engine gets no funding
• Around 85% of users only look at the first page of results, so relevant answers might be skipped
Searching Guidelines
• Specify the words clearly (+, -)
• Use Advanced Search when necessary
• Provide as many specific terms as possible
• If looking for a company, institution, or organization, try:
www.name [.com | .edu | .org | .gov | country code]
• Some search engines specialize in particular areas
• For broad queries, try using Web directories as starting points
• Remember that anyone can publish data on the Web, so information returned by search engines might not be accurate
The Largest Search Engines (1998)
[Table omitted]
AltaVista Architecture
[Diagram omitted]
Information Retrieval
• Traditional information retrieval is basically text search
– A corpus or body of text documents, e.g., in a document collection in a library or on a CD
– Documents are generally high-quality and designed to convey information
– Documents are assumed to have no structure beyond words
• Searches are generally based on meaningful phrases, perhaps including
predicates over categories, dates, etc.
• The goal is to find the document(s) that best match the search phrase, according
to a search model
• Assumptions are typically different from Web: quality text, limited-size corpus,
no hyperlinks

Motivation for Information Retrieval
• Information Retrieval (IR) is about:
– Representation, storage, organization of, and access to “information items”
• Focus is on user’s “information need” rather than a precise query:
– “March Madness” – Find information on college basketball teams which: (1) are maintained
by a US university and (2) participate in the NCAA tournament
• Emphasis is on the retrieval of information (not data)

Data vs. Information Retrieval
• Data retrieval, analogous to database querying: which docs contain a set of
keywords?
– Well-defined, precise logical semantics
– A single erroneous object implies failure!
• Information retrieval:
– Information about a subject or topic
– Semantics is frequently loose; we want approximate matches
– Small errors are tolerated (and in fact inevitable)
• IR system:
– Interpret contents of information items
– Generate a ranking which reflects relevance
– Notion of relevance is most important – needs a model

Basic Model
[Diagram: documents are represented by index terms and the information need is expressed as a query; the two representations are matched to produce a ranking]
The Full Info Retrieval Process
[Diagram: the user’s interest is expressed through a browser/UI as a query, refined by query operations and user feedback; documents (Web or DB) are gathered by a crawler via the data access layer; text processing and modeling produce logical views of queries and documents, indexing builds an inverted index, searching retrieves candidate documents, and ranking returns ranked docs to the user]
Terminology
• IR systems usually adopt index terms to process queries
• Index term:
– a keyword or group of selected words
– any word (more general)
• Stemming might be used:
– connect: connecting, connection, connections
• An inverted index is built for the chosen index terms

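To make these terms concrete, here is a minimal Python sketch (my own illustration, not from the slides) of building an inverted index, with a crude suffix-stripping rule standing in for a real stemmer such as Porter's; the sample documents are invented.

```python
from collections import defaultdict

def stem(word):
    # Crude suffix stripping; a real system would use e.g. the Porter stemmer.
    for suffix in ("ions", "ion", "ings", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def build_inverted_index(docs):
    # Map each index term to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[stem(word)].add(doc_id)
    return index

docs = {
    1: "connecting a modem",
    2: "network connections and connection errors",
    3: "retrieval of text documents",
}
index = build_inverted_index(docs)
print(index["connect"])  # {1, 2}: 'connecting', 'connections', 'connection' map to one term
```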
What’s a Meaningful Result?
• Matching at index term level is quite imprecise
– Users are frequently dissatisfied
– One problem: users are generally poor at posing queries
– Frequent dissatisfaction of Web users (who often give single-keyword
queries)
• Issue of deciding relevance is critical for IR systems: ranking

Rankings
• A ranking is an ordering of the documents retrieved that (hopefully)
reflects the relevance of the documents to the user query
• A ranking is based on fundamental premises regarding the notion of
relevance, such as:
– common sets of index terms
– sharing of weighted terms
– likelihood of relevance
• Each set of premises leads to a distinct IR model
In information retrieval (IR), premises are foundational assumptions or underlying principles that guide the design,
implementation, and evaluation of IR systems. These premises influence how data is indexed, queried, and retrieved.
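As a toy illustration of the “sharing of weighted terms” premise, the following sketch (not from the slides; the documents and the simple tf-idf weighting are my own assumptions) ranks documents by summing the weights of the query terms they share:

```python
import math
from collections import Counter

def rank_by_weighted_terms(query, docs):
    # docs: {doc_id: list of terms}; query: list of terms.
    n = len(docs)
    # Document frequency: number of docs containing each term.
    df = Counter(t for terms in docs.values() for t in set(terms))
    scores = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        # Score = sum over shared query terms of tf * idf (simplified weighting).
        scores[doc_id] = sum(tf[t] * math.log(n / df[t]) for t in query if t in tf)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

docs = {
    "d1": "bank director steals funds".split(),
    "d2": "bank opens new branch in amherst".split(),
    "d3": "college basketball tournament".split(),
}
print(rank_by_weighted_terms("bank scandals amherst".split(), docs))
# d2 ranks first (shares 'bank' and 'amherst'), then d1, then d3
```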
Types of IR Models
[Taxonomy of IR models, organized by user task:
• Retrieval (ad hoc, filtering)
– Classic models: Boolean, vector, probabilistic
– Set theoretic: fuzzy, extended Boolean
– Algebraic: generalized vector, latent semantic indexing, neural networks
– Probabilistic: inference network, belief network
– Structured models: non-overlapping lists, proximal nodes
• Browsing: flat, structure guided, hypertext]
Classic IR Models – Basic Concepts
• Each document represented by a set of representative
keywords or index terms
• An index term is a document word useful for remembering
the document main themes
• Traditionally, index terms were nouns because nouns have
meaning by themselves
• However, search engines assume that all words are index
terms (full text representation)

What is a Document?
• Documents are the basic units of retrieval in an IR system.
• In practice they might be: Web pages, email messages, LaTeX files, news articles, phone messages, etc.
• Update: add, delete, append(?), modify(?)
• Passages and XML elements are other possible units of
retrieval.
What is a Document?
• Examples:
– web pages, email, books, news stories, scholarly papers, text
messages, Word™, Powerpoint™, PDF, forum postings, patents, IM
sessions, etc.
• Common properties
– Significant text content
– Some structure (e.g., title, author, date for papers; subject, sender,
destination for email)
Documents vs. Database Records
• Database records (or tuples in relational databases) are typically
made up of well-defined fields (or attributes)
– e.g., bank records with account numbers, balances, names, addresses,
social security numbers, dates of birth, etc.
• Easy to compare fields with well-defined semantics to queries in
order to find matches
• Text is more difficult
Documents vs. Records
• Example bank database query
– Find records with balance > $50,000 in branches located in Amherst,
MA.
– Matches easily found by comparison with field values of records
• Example search engine query
– bank scandals in western mass
– This text must be compared to the text of entire news stories
Comparing Text
• Comparing the query text to the document text and determining
what is a good match is the core issue of information retrieval
• Exact matching of words is not enough
– Many different ways to write the same thing in a “natural language”
like English
– e.g., does a news story containing the text “bank director in Amherst
steals funds” match the query?
– Some stories will be better matches than others
Basic assumptions of Information Retrieval
• Collection: A set of documents
– Assume it is a static collection for the moment
• Goal: Retrieve documents with information that is relevant to the user’s information need and helps the user complete a task
The classic search model
[Diagram: User task: “Get rid of mice in a politically correct way” → (misconception?) → Info need: “Info about removing mice without killing them” → (misformulation?) → Query: “how trap mice alive” → search engine searches the collection → results, with query refinement feeding back into the query]
How good are the retrieved docs?
• Precision: Fraction of retrieved docs that are relevant to the user’s information need
• Recall: Fraction of relevant docs in the collection that are retrieved
• More precise definitions and measurements to follow later
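As a preview, here is a minimal sketch (with hypothetical document ids) that computes both measures from a set of retrieved documents and a set of relevant documents:

```python
def precision_recall(retrieved, relevant):
    # Precision: fraction of retrieved docs that are relevant.
    # Recall: fraction of relevant docs that were retrieved.
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {1, 2, 3, 4, 5}
relevant = {2, 5, 8}
print(precision_recall(retrieved, relevant))  # (0.4, 0.666...)
```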
IR vs. databases: Structured vs unstructured data
• Structured data tends to refer to information in “tables”

Employee   Manager   Salary
Smith      Jones     50000
Chang      Smith     60000
Ivy        Smith     50000

• Typically allows numerical range and exact match (for text) queries, e.g.,
Salary < 60000 AND Manager = Smith.
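To contrast the two query styles, here is a small sketch (records invented to match the table above): an exact-match filter over structured records next to a naive keyword-overlap match over free text, reusing the “bank scandals” query from an earlier slide:

```python
records = [
    {"employee": "Smith", "manager": "Jones", "salary": 50000},
    {"employee": "Chang", "manager": "Smith", "salary": 60000},
    {"employee": "Ivy",   "manager": "Smith", "salary": 50000},
]

# Structured query: Salary < 60000 AND Manager = Smith (precise semantics).
matches = [r for r in records if r["salary"] < 60000 and r["manager"] == "Smith"]
print([r["employee"] for r in matches])  # ['Ivy']

# Unstructured query: keyword overlap with free text (approximate semantics).
story = "bank director in amherst steals funds"
query = "bank scandals in western mass"
overlap = set(story.split()) & set(query.split())
print(overlap)  # {'bank', 'in'} -- partial overlap; needs ranking, not exact match
```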
Unstructured data
• Typically refers to free text
• Allows
– Keyword queries including operators
– More sophisticated “concept” queries, e.g.,
• find all web pages dealing with drug abuse
• Classic model for searching text documents
Semi-structured data
• In fact almost no data is “unstructured”
• E.g., this slide has distinctly identified zones such as the Title and
Bullets
• … to say nothing of linguistic structure
• Facilitates “semi-structured” search such as
– Title contains data AND Bullets contain search
• Or even
– Title is about Object Oriented Programming AND Author something like
stro*rup
– where * is the wild-card operator
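A minimal sketch (fields and documents are hypothetical) of the kind of field-restricted search with a “*” wildcard on the author field described above, using Python's fnmatch:

```python
from fnmatch import fnmatch

slides = [
    {"title": "Unstructured data", "bullets": "keyword queries and concept queries", "author": "unknown"},
    {"title": "Semi-structured data", "bullets": "search within identified zones", "author": "stroustrup"},
]

def field_query(docs, title_word=None, bullet_word=None, author_pattern=None):
    # Return titles of docs whose fields satisfy all the given constraints.
    out = []
    for d in docs:
        if title_word and title_word not in d["title"].lower():
            continue
        if bullet_word and bullet_word not in d["bullets"].lower():
            continue
        if author_pattern and not fnmatch(d["author"].lower(), author_pattern):
            continue
        out.append(d["title"])
    return out

# Title contains "data" AND Bullets contain "search" AND Author matches stro*rup
print(field_query(slides, title_word="data", bullet_word="search", author_pattern="stro*rup"))
# ['Semi-structured data']
```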
Dimensions of IR
• IR is more than just text, and more than just web search
– although these are central
• People doing IR work with different media, different types of
search applications, and different tasks
Other Media
• New applications increasingly involve new media
– e.g., video, photos, music, speech
• Like text, content is difficult to describe and compare
– text may be used to represent them (e.g. tags)
• IR approaches to search and evaluation are appropriate
Dimensions of IR
Content: text, images, video, scanned documents, audio, music
Applications: web search, vertical search, enterprise search, desktop search, forum search, P2P search, literature search
Tasks: ad hoc search, filtering, classification, question answering

P2P search, short for peer-to-peer search, refers to a decentralized search approach where the data is distributed across multiple nodes in a
network, rather than being stored centrally. It allows users to query and retrieve data directly from peers in the network.
IR Tasks
• Ad-hoc search
– Find relevant documents for an arbitrary text query
• Filtering
– Identify relevant user profiles for a new document
• Classification
– Identify relevant labels for documents
• Question answering
– Give a specific answer to a question
How far do people look for results?
[Chart omitted]
(Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)
Users’ empirical evaluation of results
• Quality of pages varies widely
– Relevance is not enough
– Other desirable qualities (non IR!!)
• Content: Trustworthy, diverse, non-duplicated, well maintained
• Web readability: display correctly & fast
• No annoyances: pop-ups, etc.
• Precision vs. recall
– On the web, recall seldom matters
• What matters
– Precision at 1? Precision above the fold?
– Comprehensiveness – must be able to deal with obscure queries
• Recall matters when the number of matches is very small
• User perceptions may be unscientific, but are significant over a large aggregate

Users’ empirical evaluation of engines
• Relevance and validity of results
• UI – Simple, no clutter, error tolerant
• Trust – Results are objective
• Coverage of topics for polysemic queries
• Pre/Post process tools provided
– Mitigate user errors (auto spell check, search assist,…)
– Explicit: Search within results, more like this, refine ...
– Anticipative: related searches
• Deal with idiosyncrasies
– Web specific vocabulary
• Impact on stemming, spell-check, etc.
– Web addresses typed in the search box
• “The first, the last, the best and the worst …”
The Web document collection
• No design/co-ordination
• Distributed content creation, linking,
democratization of publishing
• Content includes truth, lies, obsolete information,
contradictions …
• Unstructured (text, html, …), semi-structured
(XML, annotated photos), structured (Databases)

• Scale much larger than previous text collections …
but corporate records are catching up
• Growth – slowed down from initial “volume
doubling every few months” but still expanding
• Content can be dynamically generated

Big Issues in IR
• Relevance
– What is it?
– Simple (and simplistic) definition: A relevant document contains the
information that a person was looking for when they submitted a
query to the search engine
– Many factors influence a person’s decision about what is relevant:
e.g., task, context, novelty, style
– Topical relevance (same topic) vs. user relevance (everything else)
Big Issues in IR
• Relevance
– Retrieval models define a view of relevance
– Ranking algorithms used in search engines are based on retrieval models
– Most models describe statistical properties of text rather than linguistic
• i.e. counting simple text features such as words instead of parsing and analyzing
the sentences
• Statistical approach to text processing started with Luhn in the 50s
• Linguistic features can be part of a statistical model
Big Issues in IR
• Evaluation
– Experimental procedures and measures for comparing system output
with user expectations
• Originated in Cranfield experiments in the 60s
– IR evaluation methods now used in many fields
– Typically use test collection of documents, queries, and relevance
judgments
• Most commonly used are TREC collections
– Recall and precision are two examples of effectiveness measures
Big Issues in IR
• Users and Information Needs
– Search evaluation is user-centered
– Keyword queries are often poor descriptions of actual information
needs
– Interaction and context are important for understanding user intent
– Query refinement techniques such as query expansion, query
suggestion, relevance feedback improve ranking
IR and Search Engines
• A search engine is the practical application of information
retrieval techniques to large scale text collections
• Web search engines are best-known examples, but many
others
– Open source search engines are important for research and
development
• e.g., Lucene, Lemur/Indri, Galago
• Big issues include main IR issues but also some others
IR and Search Engines
Information Retrieval:
– Relevance: effective ranking
– Evaluation: testing and measuring
– Information needs: user interaction
Search Engines:
– Performance: efficient search and indexing
– Incorporating new data: coverage and freshness
– Scalability: growing with data and users
– Adaptability: tuning for applications
– Specific problems: e.g., spam
Search Engine Issues
• Performance
– Measuring and improving the efficiency of search
• e.g., reducing response time, increasing query throughput, increasing indexing
speed
– Indexes are data structures designed to improve search efficiency
• designing and implementing them are major issues for search engines
Search Engine Issues
• Dynamic data
– The “collection” for most real applications is constantly changing in
terms of updates, additions, deletions
• e.g., web pages
– Acquiring or “crawling” the documents is a major task
• Typical measures are coverage (how much has been indexed) and freshness
(how recently was it indexed)
– Updating the indexes while processing queries is also a design issue
Search Engine Issues
• Scalability
– Making everything work with millions of users every day, and many
terabytes of documents
– Distributed processing is essential
• Adaptability
– Changing and tuning search engine components such as ranking
algorithm, indexing strategy, interface for different applications
Spam
• For Web search, spam in all its forms is one of the major issues
• Affects the efficiency of search engines and, more seriously, the
effectiveness of the results
• Many types of spam
– e.g. spamdexing or term spam, link spam, “optimization”
• New subfield called adversarial IR, since spammers are
“adversaries” with different goals
Course Goals
• To help you to understand search engines, evaluate and compare them, and modify them for specific applications
• Provide broad coverage of the important issues in information retrieval and search engines
– includes underlying models and current research directions
