IRT Unit_I.pptx

UNIT-I
IV Year / VIII Semester
By
P.Thenmozhi AP/CSE
KNCET.
KONGUNADU COLLEGE OF ENGINEERING AND
TECHNOLOGY
(Autonomous)
NAMAKKAL- TRICHY MAIN ROAD, THOTTIAM
DEPARTMENT OF COMPUTER SCIENCE AND
ENGINEERING
CS8080 – Information Retrieval Techniques

Objectives
To understand the basics of information retrieval.
To understand machine learning techniques for
text classification and clustering.
To understand various search engine operations
To learn different techniques of recommender
systems

Information Retrieval Techniques
• What is Information?
• There is no “correct” definition
• Information:
– Informing, telling; thing told, knowledge, items of
knowledge, news
– Knowledge communicated or received concerning a
particular fact or circumstance; news
• Knowledge: knowing; familiarity gained by
experience; a person’s range of information; a
theoretical or practical understanding of a subject; the sum
of what is known

• Types of information
• Text (Documents and portions thereof)
• XML and structured documents
• Images
• Audio (sound effects, songs, etc.)
• Video
• Source code
• Applications/Web services

• Retrieval?
• “Fetch something” that’s been stored
• Recover a stored state of knowledge
• Search through stored messages to find some
messages relevant to the task at hand.

• What is IR?
• Information retrieval is a problem-oriented
discipline, concerned with the problem of the
effective and efficient transfer of desired
information between human generator and
human user

Main Objective of IR:
• Provide the users with effective access to &
interaction with information resources.
• Goal of IR is to retrieve all and only the
“relevant” documents in a collection for a
particular user with a particular need for
information
– Relevance is a central concept in IR theory

– Web search engines have been stress-testing the
traditional IR models (and inventing new ways of
ranking)
• The goal is to search large document
collections (millions of documents) to retrieve
small subsets relevant to the user’s
information need
• Examples are:
• Internet search engines (Google, Yahoo! web
search, etc.)
• Digital library catalogues (MELVYL, GLADYS)

• What do we want from an IRS?
• Systemic approach
– Goal (for a known information need): Return as many
relevant documents as possible and as few
non-relevant documents as possible
• Cognitive approach
– Goal (in an interactive information-seeking environment,
with a given IRS): Support the user’s exploration of the
problem domain and the task completion.

• Information Retrieval vs. Information Extraction
• Information Retrieval
• Given a set of query terms and a set of document terms,
select only the most relevant documents
(precision), and preferably all the relevant ones
(recall)
• Information Extraction
• Extract from the text what the document means.

Databases vs. IR
                          Databases                                IR
What we are retrieving    Structured data                          Mostly unstructured
Queries we are posing     Formally defined, unambiguous queries    Expressed in natural language
Results we get            Exact; always correct in a formal sense  Sometimes relevant, often not
Interaction with system   One-shot queries                         Interaction is important

Performance and correctness measures
Precision
Precision is the fraction of the documents retrieved that are relevant to the user’s information need.

Recall
Recall is the fraction of the documents that are relevant to the query that are successfully
retrieved

Fall-out
The proportion of non-relevant documents that are retrieved, out of all non-relevant documents
available
F-score / F-measure
The weighted harmonic mean of precision and recall. The traditional F-measure, or balanced
F-score, is:
F = 2 · (precision · recall) / (precision + recall)
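
The four measures above can be computed directly from sets of document ids. The following is a minimal sketch, assuming a small made-up collection of 10 documents; the function name `retrieval_metrics` and the example ids are hypothetical and not part of the original slides.

```python
def retrieval_metrics(retrieved, relevant, collection_size):
    """Return (precision, recall, fall-out, balanced F-score) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)            # relevant documents that were retrieved
    false_alarms = len(retrieved - relevant)    # non-relevant documents that were retrieved
    non_relevant_total = collection_size - len(relevant)

    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    fall_out = false_alarms / non_relevant_total if non_relevant_total else 0.0
    # Balanced F-score: harmonic mean of precision and recall.
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, fall_out, f_score


# Hypothetical example: 10 documents, 4 retrieved, 3 relevant.
p, r, fo, f = retrieval_metrics(retrieved=[1, 2, 3, 4],
                                relevant=[2, 4, 6],
                                collection_size=10)
print(f"precision={p:.3f} recall={r:.3f} fall-out={fo:.3f} F={f:.3f}")
```

In this example two of the four retrieved documents are relevant, so precision is 2/4, recall is 2/3, and fall-out is 2/7.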

• Information Retrieval:
• IR deals with the representation, storage, organization
of, and access to information items
• Types of information items: documents, Web pages,
online catalogs, structured records, multimedia objects
• Early goals of the IR area: indexing text and searching
for useful documents in a collection
• Nowadays, research in IR includes: Modeling, Web
search, text classification, systems architecture, user
interfaces, data visualization, filtering and languages.

Early Developments
• For more than 5,000 years, man has organized
information for later retrieval and searching
• This has been done by compiling, storing, organizing,
and indexing papyrus, hieroglyphics, and books
• For holding the various items, special purpose buildings
called libraries, or bibliothekes, are used
– The oldest known library was created in Ebla, in the Fertile
Crescent, between 3,000 and 2,500 BC
– By 300 BC, Ptolemy Soter, a Macedonian general, created
the Great Library at Alexandria
– Nowadays, libraries are everywhere

• In 2008, more than 2 billion items were checked
out from libraries in the US—an increase of 10%
over the previous year
• Since the volume of information in libraries is
always growing, it is necessary to build
specialized data structures for fast search — the
indexes
• For centuries indexes have been created
manually as sets of categories, with labels
associated with each category
• The advent of modern computers has allowed the
construction of large indexes automatically

• During the 1950s, research efforts in IR were initiated by
pioneers such as Hans Peter Luhn, Eugene Garfield,
Philip Bagley, and Calvin Mooers, who allegedly coined
the term Information Retrieval
• In 1962, Cyril Cleverdon published the Cranfield studies
on retrieval evaluation
• In 1963, Joseph Becker and Robert Hayes published the
first book on IR
• In the late 1960s, key research conducted by Karen
Spärck Jones and Gerard Salton, among others, led to
the definition of the TF-IDF term weighting scheme
• In 1971, Jardine and van Rijsbergen articulated the
cluster hypothesis

• In 1978, the first ACM SIGIR International
Conference on Information Retrieval was held
in Rochester
• In 1979, van Rijsbergen published a classic
book entitled Information Retrieval, which
focused on the Probabilistic Model
• In 1983, Salton and McGill published a classic
book entitled Introduction to Modern
Information Retrieval, which focused on the
Vector Model

The IR Problem
• Users of modern IR systems, such as search
engine users, have information needs of varying
complexity
• An example of complex information need is as
follows:
– “ Find all documents that address the role of the
Federal Government in financing the operation of the
National Railroad Transportation Corporation
(AMTRAK)”
• This full description of the user information need
is not necessarily a good query to be submitted
to the IR system

• Instead, the user might want to first translate
this information need into a query
• This translation process yields a set of
keywords, or index terms, which summarize
the user information need
• Given the user query, the key goal of the IR
system is to retrieve information that is useful
or relevant to the user

• That is, the IR system must rank the
information items according to a degree of
relevance to the user query
• The IR Problem
– “The key goal of an IR system is to retrieve all the
items that are relevant to a user query, while
retrieving as few non relevant items as possible”.
• The notion of relevance is of central
importance in IR

• Consider a user who seeks information on a topic
of their interest
• This user first translates their information need
into a query, which requires specifying the words
that compose the query
• In this case, we say that the user is searching or
querying for information of their interest
• Consider now a user who has an interest that is
either poorly defined or inherently broad
• For instance, the user has an interest in car racing
and wants to browse documents on Formula 1
and Formula Indy
• In this case, we say that the user is browsing or
navigating the documents of the collection

• The User Task:
• The user is first supposed to translate the information
need into a query.
• In an information retrieval system, the query is a set of
words that convey the semantics of the information
that is required, whereas in a data retrieval system a
query expression conveys the constraints
that must be satisfied by the objects.
• Example: A user sets out to search for one thing but
ends up looking at something else.
• This means that the user is browsing and not
searching.
• The above figure shows the interaction of the user
through different tasks.

Information Retrieval Vs Data
Retrieval
• Information Retrieval: Given a set of query terms
and a set of document terms select only the most
relevant documents [precision], and preferably all
the relevant [recall].
• Data retrieval: the task of determining which
documents of a collection contain the keywords
in the user query
• Data retrieval system
• Ex: relational databases
• Deals with data that has a well defined structure
and semantics

• A single erroneous object among a thousand
retrieved objects means total failure
• Data retrieval does not solve the problem of
retrieving information about a subject or topic

Information Retrieval | Data Retrieval
The software program that deals with the organization, storage, retrieval, and evaluation of information from document repositories, particularly textual information. | Data retrieval deals with obtaining data from a database management system such as ODBMS. It is a process of identifying and retrieving the data from the database, based on the query provided by the user or application.
Retrieves information about a subject. | Determines the keywords in the user query and retrieves the data.
Small errors are likely to go unnoticed. | A single error object means total failure.
Not always well structured and is semantically ambiguous. | Has a well-defined structure and semantics.
Does not provide a solution to the user of the database system. | Provides solutions to the user of the database system.
The results obtained are approximate matches. | The results obtained are exact matches.
Results are ordered by relevance. | Results are unordered by relevance.
It is a probabilistic model. | It is a deterministic model.

The IR System
• An IR system has three major components
• Document subsystem
– Acquisition
– Representation
– File organization
• User sub system
– Problem
– Representation
– Query
• Searching /Retrieval subsystem
– Matching
– Retrieved objects

• An information retrieval system thus has three
major components- the document subsystem,
the users subsystem, and the searching/retrieval
subsystem.
• These divisions are quite broad and each one is
designed to serve one or more functions, such as:
Analysis of documents and organization of
information(creation of a document database)
Analysis of user’s queries, preparation of a
strategy to search the database
Actual searching or matching of users queries
with the database, and finally
Retrieval of items that fully or partially match the
search statement.

• Acquisition (Document subsystem)
• Selection of documents & other objects from various web
resources
• Mostly text based documents
– full texts, titles, abstracts ...
– but also other objects:
• data, statistics, images, maps, trade marks, sounds ...
• The data are collected by a web crawler and stored in a database.
• Representation of documents, objects(document subsystem)
• Indexing – many ways :
– free text terms (even in full texts)
– controlled vocabulary - thesaurus
– manual & automatic techniques
• Abstracting; summarizing
• Bibliographic description:
– author, title, sources, date…
– metadata
• Classifying, clustering

• File organization (Document subsystem)
• Sequential
– record (document) by record
• Inverted
– term by term; list of records under each term
• Combination
• Problem (user subsystem)
• Related to user’s task, situation
– vary in specificity, clarity
• Produces information need
• ultimate criterion for effectiveness of retrieval
– how well was the need met?
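
The two file organizations named on this slide, sequential and inverted, can be contrasted with a small sketch. The three-document collection and the function names below are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical three-document collection (ids and texts are made up).
DOCS = {
    1: "information retrieval systems",
    2: "database systems store structured data",
    3: "retrieval of relevant information",
}


def sequential_search(docs, term):
    """Sequential organization: scan the file record (document) by record."""
    return sorted(doc_id for doc_id, text in docs.items() if term in text.split())


def build_inverted_file(docs):
    """Inverted organization: term by term, with a list of records under each term."""
    inverted = defaultdict(list)
    for doc_id in sorted(docs):
        for term in sorted(set(docs[doc_id].split())):
            inverted[term].append(doc_id)
    return dict(inverted)


INVERTED = build_inverted_file(DOCS)
print(sequential_search(DOCS, "retrieval"))  # every record is scanned
print(INVERTED["retrieval"])                 # a single dictionary lookup
```

Both calls return the same records; the difference is that the sequential scan touches every document for every query, while the inverted file pays that cost once, when it is built.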

• Representation (user subsystem)
• Converting a concept to query.
• What we search for.
• Query terms are stemmed and spell-corrected using a
dictionary.
• Focus toward a good result
• Query - search statement (user & system)
• Translation into systems requirements & limits
– start of human-computer interaction
• query is the thing that goes into the computer

• Matching - searching (Searching subsystem)
• Process of matching, comparing
– search: what documents in the file match the query as
stated?
• Various search algorithms:
– exact match - Boolean
• still available in most, if not all systems
– best match - ranking by relevance
• increasingly used e.g. on the web
• Retrieved documents -from system to user (IR
Subsystem)
• Various order of output:
– Last In First Out (LIFO); sorted
– ranked by relevance
– ranked by other characteristics
• Various forms of output
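
The difference between exact match (Boolean) and best match (ranking) can be illustrated as follows. This is a sketch under simple assumptions: documents are bags of whitespace-separated words, the Boolean query is a pure AND, and the best-match score is just the number of distinct query terms a document contains. The collection and function names are hypothetical.

```python
# Hypothetical collection for illustration.
DOCS = {
    "d1": "web search engines rank pages",
    "d2": "boolean search returns exact matches",
    "d3": "search engines crawl the web and rank results",
}


def boolean_and(docs, query):
    """Exact match: keep only documents that contain every query term."""
    terms = set(query.split())
    return sorted(d for d, text in docs.items() if terms <= set(text.split()))


def best_match(docs, query):
    """Best match: rank documents by how many distinct query terms they contain."""
    terms = set(query.split())
    scored = [(len(terms & set(text.split())), d) for d, text in docs.items()]
    # Highest score first; ties are broken by document id for a stable order.
    ranked = sorted(scored, key=lambda s: (-s[0], s[1]))
    return [d for score, d in ranked if score > 0]


print(boolean_and(DOCS, "web search rank"))
print(best_match(DOCS, "web search rank"))
```

The Boolean query drops `d2` entirely because one term is missing, whereas the ranked output still returns it, below the documents that match all three terms.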

Relevance feedback


High-level software architecture of an IR system

The Software Architecture of the IR
System

The Retrieval and Ranking Processes

• Text Operations form index words (tokens).
– Tokenization – Given a character sequence and a
defined document unit, tokenization is the task of
chopping it up into pieces called tokens.
– Stopword removal – Remove non-informative or
common words (tokens) from the stream, e.g.
is, was, and, it, a.
– Stemming – Replace word variants with a single
stem, e.g. education, educated, and educate
are all replaced with the single stem educate.
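
The three text operations can be chained as in the sketch below. It is a toy pipeline, not a production one: the stopword list is a small made-up sample, and the stemmer is a naive suffix stripper written for this example (it is not the Porter algorithm or any library stemmer). Note that it maps the three example words to the stem `educ` rather than `educate`; a stem only needs to be shared by the variants, not to be a dictionary word.

```python
import re

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"is", "was", "and", "it", "a", "the", "of", "to", "in"}
# Toy suffix list, checked in this order; the first match is stripped.
SUFFIXES = ("ation", "ated", "ate", "ing", "ed", "s")


def tokenize(text):
    """Chop a character sequence into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stopwords(tokens):
    """Drop common, non-informative tokens from the stream."""
    return [t for t in tokens if t not in STOPWORDS]


def stem(token):
    """Strip the first matching suffix, keeping a stem of at least 3 characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def text_operations(text):
    """Tokenize, remove stopwords, then stem: the resulting list holds index terms."""
    return [stem(t) for t in remove_stopwords(tokenize(text))]


print(text_operations("It was a search engine and it is indexing pages"))
print([stem(w) for w in ("education", "educated", "educate")])
```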

• Indexing: Documents are converted into a fast
searchable internal representation using a
language-independent data structure called the
inverted index.
• Searching: Retrieves the documents that contain a
given query token from the inverted index and
calculates the degree of similarity between each
document and the query terms.
• Ranking: Scores all retrieved documents
according to a relevance metric (e.g. term frequency
or cosine similarity)
• User Interface manages interaction with the user:
– Query input and document output.
– Relevance feedback.
– Visualization of results.
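
Indexing, searching, and ranking can be put together in a few lines. The sketch below assumes raw term frequency as the term weight and cosine similarity as the relevance metric; the collection, `build_index`, `cosine`, and `search` are hypothetical names for this illustration, and real systems usually add weighting schemes such as TF-IDF.

```python
import math
from collections import Counter, defaultdict

# Hypothetical collection for illustration.
DOCS = {
    "d1": "web search engine ranking",
    "d2": "library catalogue search",
    "d3": "web crawler collects web pages",
}


def build_index(docs):
    """Indexing: inverted index mapping term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.split()).items():
            index[term][doc_id] = tf
    return index


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(docs, index, query):
    """Searching + ranking: fetch candidates from the index, then sort by cosine."""
    q_vec = Counter(query.split())
    # Candidates are the documents that contain at least one query token.
    candidates = {d for term in q_vec for d in index.get(term, {})}
    scored = [(cosine(q_vec, Counter(docs[d].split())), d) for d in candidates]
    return [d for score, d in sorted(scored, key=lambda s: (-s[0], s[1]))]


INDEX = build_index(DOCS)
print(search(DOCS, INDEX, "web search"))
```

For the query `web search`, `d1` contains both terms and ranks first; `d3` contains `web` twice but not `search`, and `d2` contains only `search`.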

• Query Operations transform the query to
improve retrieval:
– Query expansion using a
thesaurus (vocabulary/terms). A thesaurus is a data
structure that defines semantic relatedness
between words; e.g. car, auto, automobile, and vehicle
are semantically related words
– Query transformation using relevance feedback
(the user gives feedback on the relevance of
document in an initial set of results)
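
Thesaurus-based query expansion can be sketched as a dictionary lookup. The thesaurus below contains only the example words from the slide and is an assumption made for illustration; `expand_query` is a hypothetical helper, not an API of any library.

```python
# Tiny hand-made thesaurus: term -> semantically related terms.
THESAURUS = {
    "car": ["auto", "automobile", "vehicle"],
    "auto": ["car", "automobile", "vehicle"],
}


def expand_query(query, thesaurus):
    """Add related terms after each query term, without duplicates."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term] + thesaurus.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query("car rental", THESAURUS))
```

The expanded term list can then be submitted in place of the original query, so that documents mentioning "automobile" are also matched by a query for "car".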

• Document Gathering
• This is the process of gathering the documents
that are to form the core content of the IR
system, these documents could be text, images,
audio files, video clips, entire movies, etc.
• Document Indexing
• The documents gathered in the document
gathering phase are converted into a fast
searchable internal representation.
• This will usually be implemented using some
programming language dependent data
structures which provide fast searching facilities
such as array lists, vectors, sets, multi-sets, maps

• Searching Support
• This process involves accepting a query, processing it,
finding possibly relevant documents, calculating the
degree of similarity between the query and each
possibly relevant document, sorting
the documents by score, and returning the highest-ranked
ones to the user in groups of (usually) 10.
• All this has to be done as efficiently and quickly as
possible. For example, the IR system that operates as
the Google search engine accepts and processes
– 150 million queries per day.
– 6.25 million per hour.
– 105,000 per minute.
– 1,700 per second.
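
The per-hour, per-minute, and per-second rates follow from the daily figure by simple division, which shows that the last two values on the slide are rounded. The sketch below takes the 150 million figure from the slide as given; `page_of` is a hypothetical helper that illustrates returning results in groups of 10.

```python
QUERIES_PER_DAY = 150_000_000  # daily figure quoted on the slide

per_hour = QUERIES_PER_DAY / 24
per_minute = per_hour / 60
per_second = per_minute / 60

print(f"{per_hour:,.0f} per hour")
print(f"{per_minute:,.0f} per minute")
print(f"{per_second:,.0f} per second")


def page_of(ranked_results, page, size=10):
    """Return one group of results (pages are numbered from 1)."""
    start = (page - 1) * size
    return ranked_results[start:start + size]
```

For example, `page_of(results, 2)` returns the results ranked 11 to 20.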

• Document Management
• In the previous three steps, we have gathered
documents, indexed them and are now
allowing users to search their content.
However, in many scenarios such as web
searching, the documents that have been
indexed will be unstable and constantly
changing.

• Dimensions of IR
• IR is more than just text, and more than just
web search
• although these are central
• People doing IR work with different media,
different types of search applications, and
different tasks

THE WEB
• At the end of World War II, Vannevar Bush looked for applications
of new technologies to peacetime
• Bush first produced a report entitled Science, The Endless Frontier
• This report directly influenced the creation of the National Science
Foundation
• He then wrote As We May Think, a remarkable paper which
discussed new hardware and software gadgets
• In Bush’s words: Whole new forms of encyclopedias will appear,
ready-made with a mesh of associative trails running through them,
ready to be dropped into the memex and there amplified
• As We May Think influenced people like Douglas Engelbart, who
invented the computer mouse and introduced the concept of
hyperlinked texts
• Ted Nelson, working in his Project Xanadu, pushed the concept
further and coined the term hypertext

• A hypertext allows the reader to jump from one
electronic document to another, which was one
important property regarding the problem that Tim
Berners-Lee faced in 1989
• At the time, Berners-Lee worked in Geneva at the
CERN—Conseil Européen pour la Recherche Nucléaire
• There, researchers who wanted to share
documentation with others had to reformat their
documents to make them compatible with an internal
publishing system
• Berners-Lee reasoned that it would be nice if the
solution of sharing documents were decentralized
• He saw that a networked hypertext would be a good
solution and started working on its implementation

• In 1990, Berners-Lee
• Wrote the HTTP protocol
• Defined the HTML language
• Wrote the first browser, which he called World
Wide Web
• Wrote the first Web server
• In 1991, he made his browser and server
software available on the Internet
• The Web was born!

The E-Publishing Era
• Well over 20 billion pages are now available
and accessible on the Web
• More than one fourth of humanity now accesses
the Web on a regular basis
• Why is the Web such a success? What is the
single most important characteristic of the
Web that makes it so revolutionary?

• In search of an answer, let us delve into the
life of a writer who lived at the end of the
18th century
• She finished the first draft of her novel in 1796
• The first attempt of publication was refused
without a reading
• The novel was only published 15 years later!

• Jane Austen was discriminated against because there
was no freedom to publish at the beginning of
the 19th century
• The Web, unleashed by the inventiveness of
Tim Berners-Lee, changed this once and for all
• It did so by universalizing freedom to publish
• The Web moved mankind into a new era, into
a new time, into The e-Publishing Era.

How the Web Changed Search
• Web search is today the most prominent
application of IR and its techniques—the
ranking and indexing components of any
search engine are fundamentally IR pieces of
technology

• The first major impact of the Web on search is
related to the characteristics of the document
collection itself.
• The Web is composed of pages distributed
over millions of sites and connected through
hyperlinks.
• This requires collecting all documents and
storing copies of them in a central repository,
prior to indexing.
• This new phase in the IR process, introduced
by the Web, is called crawling
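
Crawling can be sketched as a breadth-first traversal that follows hyperlinks from a seed page and stores a copy of each page in a central repository. To stay self-contained the example uses an in-memory dictionary in place of real HTTP requests; the URLs, page texts, and the `crawl` function are all hypothetical.

```python
from collections import deque

# Hypothetical miniature Web: URL -> (page text, outgoing links).
WEB = {
    "a.example/": ("home page", ["a.example/about", "b.example/"]),
    "a.example/about": ("about us", ["a.example/"]),
    "b.example/": ("other site", ["c.example/"]),
    "c.example/": ("third site", []),
    "d.example/": ("unlinked page", []),
}


def crawl(seed, fetch, max_pages=100):
    """Collect pages reachable from the seed and return the repository of copies."""
    repository = {}
    frontier = deque([seed])   # URLs waiting to be fetched
    seen = {seed}              # URLs already queued, to avoid loops
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        page = fetch(url)      # stands in for an HTTP request
        if page is None:
            continue           # dead link
        text, links = page
        repository[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return repository


REPOSITORY = crawl("a.example/", WEB.get)
print(list(REPOSITORY))
```

The page `d.example/` never enters the repository because no crawled page links to it, which is one reason crawlers need good seed lists.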

• The second major impact of the Web on
search is related to:
• The size of the collection
• The volume of user queries submitted on a
daily basis
• As a consequence, performance and
scalability have become critical characteristics
of the IR system

• The third major impact : in a very large
collection, predicting relevance is much
harder than before
• Fortunately, the Web also includes new
sources of evidence
• Ex: hyperlinks and user clicks in documents in
the answer set

• The fourth major impact derives from the fact
that the Web is also a medium to do business
• Search problem has been extended beyond
the seeking of text information to also
encompass other user needs
• Ex: the price of a book, the phone number of a
hotel, the link for downloading a software

• The fifth major impact of the Web on search is
Web spam
• Web spam: abusive availability of commercial
information disguised in the form of
informational content
• This difficulty is so large that today we talk of
Adversarial Web Retrieval

Practical Issues on the Web
• 1. Security
• Commercial transactions over the Internet are not yet
a completely safe procedure
• 2. Privacy
• Frequently, people are willing to exchange information
as long as it does not become public
• 3. Copyright and patent rights
• It is far from clear how the wide spread of data on the
Web affects copyright and patent laws in the various
countries
• 4. Scanning, optical character recognition (OCR), and cross-language
retrieval

How People Search
• Search tasks range from the relatively simple
(e.g., looking up disputed facts or finding
weather information) to the rich and complex
(e.g., job seeking and planning vacations).
• Search interfaces should support a range of
tasks, while taking into account how people
think about searching for information.

Information Lookup versus
Exploratory Search
• User interaction with search interfaces differs depending
on
• the type of task
• the domain expertise of the information seeker
• the amount of time and effort available to invest in the
process
• Marchionini makes a distinction between information
lookup and exploratory search
• Information lookup tasks
• are akin to fact retrieval or question answering
• can be satisfied by discrete pieces of information: numbers,
dates, names, or Web sites
• can work well for standard Web search interactions

• Exploratory search is divided into learning and
investigating tasks
• Learning search
• i) requires more than single query-response pairs
• ii) requires the searcher to spend time
• scanning and reading multiple information items
• synthesizing content to form new understanding

• Investigating refers to a longer-term process
which
• involves multiple iterations that take place over
perhaps very long periods of time
• may return results that are critically assessed
before being integrated into personal and
professional knowledge bases
• may be concerned with finding a large proportion
of the relevant information available
• Information seeking can be seen as being part of
a larger process referred to as sensemaking

The Classic versus the Dynamic Model
of Information Seeking
• Classic notion of the information seeking
process:
• problem identification
• articulation of information need(s)
• query formulation
• results evaluation

Navigation versus Search
• Navigation: the searcher looks at an information
structure and browses among the available information
• This browsing strategy is preferable when the
information structure is well-matched to the user’s
information need
• it is mentally less taxing to recognize a piece of
information than it is to recall it
• it works well only so long as appropriate links are
available
• If the links are not available, then the browsing
experience might be frustrating
• Spool discusses an example of a user looking for a
software driver for a particular laser printer

Search Process
• Numerous studies have been made of people engaged
in the search process
• The results of these studies can help guide the design
of search interfaces
• One common observation is that users often
reformulate their queries with slight modifications
• Another is that searchers often search for information
that they have previously accessed
• The users’ search strategies differ when searching over
previously seen materials
• Researchers have developed search interfaces that support
both query history and revisitation

Search Interface Today
• Getting Started
• How does an information seeking session begin in
online information systems?
• The most common way is to use a Web search engine
• Another method is to select a Web site from a personal
collection of already-visited sites
• which are typically stored in a browser’s bookmarks
• Online bookmark systems are popular among a smaller
segment of users
• Ex: Delicious.com
• Web directories are also used as a common starting
point, but have been largely replaced by search engines

Query Specification
• The primary methods for a searcher to express their information
need are either
• entering words into a search entry form
• selecting links from a directory or other information organization
display
• For Web search engines, the query is specified in textual form
• Typically, Web queries today are very short, consisting of one to
three words
• Short queries reflect the standard usage scenario in which the user
tests the waters:
• If the results do not look relevant, then the user reformulates their
query
• If the results are promising, then the user navigates to the most
relevant-looking Web site

Query Specification Interfaces
• The standard interface for a textual query is a
search box entry form
• Studies suggest a relationship between query
length and the width of the entry form
• Results found that either small forms
discourage long queries or wide forms
encourage longer queries

• Some interfaces show a list of query
suggestions as the user types the query
• This is referred to as auto-complete, auto-
suggest, or dynamic query suggestions
• Anick et al. found that users clicked on
dynamic Yahoo suggestions one third of the
time
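
Dynamic query suggestions can be approximated by prefix matching against a log of past queries, most frequent first. The query log and its counts below are invented for the example, and `suggest` is a hypothetical helper; production systems use more elaborate signals.

```python
from collections import Counter

# Hypothetical query log: query -> number of times it was submitted.
QUERY_LOG = Counter({
    "information retrieval": 40,
    "information extraction": 25,
    "inverted index": 30,
    "web crawler": 10,
    "information theory": 5,
})


def suggest(prefix, log, k=3):
    """Return up to k logged queries that start with the prefix, most frequent first."""
    prefix = prefix.lower()
    matches = [(count, q) for q, count in log.items() if q.startswith(prefix)]
    return [q for count, q in sorted(matches, key=lambda m: (-m[0], m[1]))[:k]]


print(suggest("inf", QUERY_LOG))
print(suggest("in", QUERY_LOG, k=2))
```

Calling `suggest` again after every keystroke gives the behaviour described above: the list narrows as the prefix grows.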

Retrieval Result Display
• When displaying search results, either
• the documents must be shown in full, or else
• the searcher must be presented with some kind of
representation of the content of those documents
• The document surrogate refers to the information that
summarizes the document
• This information is a key part of the success of the search
interface
• The design of document surrogates is an active area of
research and experimentation
• The quality of the surrogate can greatly affect the
perceived relevance of the search results listing
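
A very simple document surrogate is the title plus a short excerpt around the first query-term hit, with the hits marked. The sketch below assumes whitespace-separated words and marks hits with asterisks; `make_surrogate` and the sample document are hypothetical.

```python
def make_surrogate(title, text, query, window=5):
    """Return the title and a snippet of +/- `window` words around the first hit."""
    terms = {t.lower() for t in query.split()}
    words = text.split()

    def is_hit(word):
        return word.lower().strip(".,;:") in terms

    # Position of the first query-term hit (fall back to the start of the text).
    hit = next((i for i, w in enumerate(words) if is_hit(w)), 0)
    start = max(0, hit - window)
    end = min(len(words), hit + window + 1)
    snippet = " ".join(f"*{w}*" if is_hit(w) else w for w in words[start:end])
    prefix = "... " if start > 0 else ""
    suffix = " ..." if end < len(words) else ""
    return f"{title}\n{prefix}{snippet}{suffix}"


TEXT = ("Libraries have organized books for centuries. An inverted index "
        "maps each term to the documents that contain it.")
print(make_surrogate("Indexing basics", TEXT, "inverted index", window=2))
```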

Query Reformulation
• There are tools to help users reformulate their
query
• One technique consists of showing terms related
to the query or to the documents retrieved in
response to the query
• A special case of this is spelling corrections or
suggestions
• Usually only one suggested alternative is shown:
clicking on that alternative re-executes the query
• In earlier years, the search results were shown
using the purportedly incorrect spelling
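
A single spelling suggestion can be produced by comparing the query term with a vocabulary of known terms. This sketch uses `difflib.get_close_matches` from the Python standard library; the vocabulary and the 0.8 similarity cutoff are assumptions chosen for the example, and real search engines rely on query logs and more refined models.

```python
import difflib

# Hypothetical vocabulary of correctly spelled index terms.
VOCABULARY = ["retrieval", "relevance", "ranking", "indexing", "crawler"]


def suggest_spelling(term, vocabulary, cutoff=0.8):
    """Return the closest vocabulary term, or None if the term is known or nothing is close."""
    if term in vocabulary:
        return None
    matches = difflib.get_close_matches(term, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(suggest_spelling("retreival", VOCABULARY))
```

Returning only one alternative matches the interface behaviour described above, where clicking the suggestion re-executes the query.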

Organizing Search results
• Organizing results into meaningful groups can
help users understand the results and decide
what to do next
• Popular methods for grouping search results:
category systems and clustering
• Category system: meaningful labels organized
in such a way as to reflect the concepts
relevant to a domain

Visualization in Search Interfaces
• Experimentation with visualization for search
has been primarily applied in the following
ways:
• Visualizing Boolean syntax
• Visualizing query terms within retrieval results
• Visualizing relationships among words and
documents
• Visualization for text mining

Visualizing Boolean syntax
• Boolean query syntax is difficult for most users
and is rarely used in Web search
• For many years, researchers have experimented
with how to visualize Boolean query specification
• A common approach is to show Venn diagrams
• A more flexible version of this idea was seen in
the VQuery system, proposed by Steve Jones
• The VQuery interface for Boolean query
specification

Visualizing query terms within
retrieval results
• Understanding the role of the query terms within
the retrieved docs can help relevance assessment
• Experimental visualizations have been designed
that make this role more explicit
• In the TileBars interface, for instance, documents
are shown as horizontal glyphs
• The locations of the query term hits are marked
along the glyph
• The user is encouraged to break the query into its
different facets, with one concept per line
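
The data behind a TileBars-style glyph can be approximated by splitting a document into equal segments and counting, for each query facet, the hits in each segment. This is only a sketch of the underlying counts under that assumption, not the actual TileBars implementation; `tilebar` and the sample text are hypothetical.

```python
def tilebar(text, facets, segments=4):
    """Map each facet (one concept per line) to its hit counts per document segment."""
    words = [w.lower().strip(".,;:") for w in text.split()]
    size = max(1, -(-len(words) // segments))  # ceiling division
    rows = {}
    for facet in facets:
        terms = set(facet.lower().split())
        rows[facet] = [
            sum(1 for w in words[i:i + size] if w in terms)
            for i in range(0, size * segments, size)
        ]
    return rows


TEXT = "web search engines crawl the web then index pages and rank pages for search"
for facet, counts in tilebar(TEXT, ["web", "search"]).items():
    print(f"{facet:>8}: {counts}")
```

Each list is one row of the glyph: a darker tile would correspond to a higher count, and columns where several rows are non-zero show passages where the facets co-occur.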

Visualizing relationships among
words and documents
• Numerous works proposed variations on the idea
of placing words and docs on a two-dimensional
canvas
• In these works, proximity of glyphs represents
semantic relationships among the terms or
documents
• An early version of this idea is the VIBE interface
• Documents containing combinations of the query
terms are placed midway between the icons
representing those terms

Visualization for text mining
• Visualization is also used for purposes of analysis
and exploration of textual data
• Visualizations such as the Word Tree show a piece
of a text concordance
• It allows the user to view which words and
phrases commonly precede or follow a given
word
• Another example is the NameVoyager, which
shows frequencies of names for U.S. children
across time
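
The counts behind a concordance view such as the Word Tree can be sketched by tallying which words follow a given word. The function below is a hypothetical, simplified illustration that looks only one word ahead; the sample text is made up.

```python
from collections import Counter


def following_words(text, keyword, k=3):
    """Return the k most common words that directly follow `keyword`."""
    words = [w.lower().strip(".,;:!?") for w in text.split()]
    followers = Counter(
        words[i + 1]
        for i in range(len(words) - 1)
        if words[i] == keyword.lower()
    )
    return followers.most_common(k)


TEXT = ("Information retrieval and information extraction differ; "
        "information retrieval ranks documents.")
print(following_words(TEXT, "information"))
```

Applying the same tally to `words[i - 1]` would give the words that commonly precede the keyword.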
