UNIT-I
IV Year / VIII Semester
By
P.Thenmozhi AP/CSE
KNCET.
KONGUNADU COLLEGE OF ENGINEERING AND
TECHNOLOGY
(Autonomous)
NAMAKKAL- TRICHY MAIN ROAD, THOTTIAM
DEPARTMENT OF COMPUTER SCIENCE AND
ENGINEERING
CS8080 – Information Retrieval Techniques
Objectives
To understand the basics of information retrieval.
To understand machine learning techniques for
text classification and clustering.
To understand various search engine operations
To learn different techniques of recommender
systems
Syllabus
Information Retrieval Techniques
• What is Information?
• There is no “correct” definition
• Information:
– Informing, telling; thing told, knowledge, items of
knowledge, news
– Knowledge communicated or received concerning a
particular fact or circumstance; news
• Knowledge: knowing familiarity gained by
experience; person’s range of information; a
theoretical or practical understanding of; the sum
of what is known
• Types of information
• Text (Documents and portions thereof)
• XML and structured documents
• Images
• Audio (sound effects, songs, etc.)
• Video
• Source code
• Applications/Web services
• Retrieval?
• “Fetch something” that’s been stored
• Recover a stored state of knowledge
• Search through stored messages to find some
messages relevant to the task at hand.
• What is IR?
• Information retrieval is a problem-oriented
discipline, concerned with the problem of the
effective and efficient transfer of desired
information between human generator and
human user
Main Objective of IR:
• Provide the users with effective access to &
interaction with information resources.
• Goal of IR is to retrieve all and only the
“relevant” documents in a collection for a
particular user with a particular need for
information
– Relevance is a central concept in IR theory
– Web search engines have been stress-testing the
traditional IR models (and inventing new ways of
ranking)
• The goal is to search large document
collections (millions of documents) to retrieve
small subsets relevant to the user’s
information need
• Examples are:
• Internet search engines (Google, Yahoo! web
search, etc.)
• Digital library catalogues (MELVYL, GLADYS)
• What do we want from an IRS ?
• Systemic approach
– Goal (for a known information need):Return as many
relevant documents as possible and as few non-
relevant documents as possible
– Cognitive approach
• Goal (in an interactive information-seeking environment,
with a given IRS): Support the user’s exploration of the
problem domain and the task completion.
• Information Retrieval vs. Information Extraction
• Information Retrieval
• Given a set of query terms and a set of document
terms, select only the most relevant documents
(precision), and preferably all the relevant ones
(recall)
• Information Extraction
• Extract from the text what the document means.
Databases vs. IR
Aspect | Databases | IR
What we are retrieving | Structured data | Mostly unstructured data
Queries we are posing | Formally defined, unambiguous queries | Expressed in natural language
Results we get | Exact. Always correct in a formal sense. | Sometimes relevant, often not
Interaction with system | One-shot queries | Interaction is important
Performance and correctness measures
Precision
Precision is the fraction of the documents retrieved that are relevant to the user’s information need.

Recall
Recall is the fraction of the documents that are relevant to the query that are successfully
retrieved
Fall-out
The proportion of non-relevant documents that are retrieved, out of all non-relevant documents
available
F-score / F-measure
The weighted harmonic mean of precision and recall. The traditional F-measure, or balanced
F-score, is:
F = 2 · (precision · recall) / (precision + recall)
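The four measures above can be computed for a single query as in the following minimal sketch. The function name, the document identifiers, and the toy numbers are assumptions made for illustration; they do not come from the slides.

```python
def retrieval_metrics(retrieved, relevant, collection_size):
    """Return (precision, recall, fall-out, balanced F-score) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)              # relevant documents retrieved
    false_alarms = len(retrieved - relevant)      # non-relevant documents retrieved
    non_relevant_total = collection_size - len(relevant)

    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    fall_out = false_alarms / non_relevant_total if non_relevant_total else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, fall_out, f_score


# Hypothetical example: 10 documents in the collection, 4 relevant, 5 retrieved.
p, r, fo, f = retrieval_metrics(
    retrieved=["d1", "d2", "d3", "d4", "d5"],
    relevant=["d1", "d2", "d6", "d7"],
    collection_size=10,
)
print(p, r, fo, round(f, 3))  # 0.4 0.5 0.5 0.444
```

Here 2 of the 5 retrieved documents are relevant (precision 0.4), 2 of the 4 relevant documents are found (recall 0.5), and 3 of the 6 non-relevant documents are retrieved (fall-out 0.5).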
• Information Retrieval:
• IR deals with the representation, storage, organization
of, and access to information items
• Types of information items: documents, Web pages,
online catalogs, structured records, multimedia objects
• Early goals of the IR area: indexing text and searching
for useful documents in a collection
• Nowadays, research in IR includes: Modeling, Web
search, text classification, systems architecture, user
interfaces, data visualization, filtering and languages.
Early Developments
• For more than 5,000 years, man has organized
information for later retrieval and searching
• This has been done by compiling, storing, organizing,
and indexing papyrus, hieroglyphics, and books
• For holding the various items, special purpose buildings
called libraries, or bibliothekes, are used
– The oldest known library was created in Ebla, in the Fertile
Crescent, between 3,000 and 2,500 BC
– By 300 BC, Ptolemy Soter, a Macedonian general, created
the Great Library at Alexandria
– Nowadays, libraries are everywhere
• In 2008, more than 2 billion items were checked
out from libraries in the US—an increase of 10%
over the previous year
• Since the volume of information in libraries is
always growing, it is necessary to build
specialized data structures for fast search — the
indexes
• For centuries indexes have been created
manually as sets of categories, with labels
associated with each category
• The advent of modern computers has allowed the
construction of large indexes automatically
• During the 1950s, research efforts in IR were initiated by
pioneers such as Hans Peter Luhn, Eugene Garfield,
Philip Bagley, and Calvin Mooers, who allegedly coined
the term Information Retrieval
• In 1962, Cyril Cleverdon published the Cranfield studies
on retrieval evaluation
• In 1963, Joseph Becker and Robert Hayes published the
first book on IR
• In the late 60’s, key research conducted by Karen
Sparck Jones and Gerard Salton, among others, led to
the definition of the TF-IDF term weighting scheme
• In 1971, Jardine and van Rijsbergen articulated the
cluster hypothesis
• In 1978, the first ACM SIGIR International
Conference on Information Retrieval was held
in Rochester
• In 1979, van Rijsbergen published a classic
book entitled Information Retrieval, which
focused on the Probabilistic Model
• In 1983, Salton and McGill published a classic
book entitled Introduction to Modern
Information Retrieval, which focused on the
Vector Model
The IR Problem
• Users of modern IR systems, such as search
engine users, have information needs of varying
complexity
• An example of complex information need is as
follows:
– “ Find all documents that address the role of the
Federal Government in financing the operation of the
National Railroad Transportation Corporation
(AMTRAK)”
• This full description of the user information need
is not necessarily a good query to be submitted
to the IR system
• Instead, the user might want to first translate
this information need into a query
• This translation process yields a set of
keywords, or index terms, which summarize
the user information need
• Given the user query, the key goal of the IR
system is to retrieve information that is useful
or relevant to the user
• That is, the IR system must rank the
information items according to a degree of
relevance to the user query
• The IR Problem
– “The key goal of an IR system is to retrieve all the
items that are relevant to a user query, while
retrieving as few non relevant items as possible”.
• The notion of relevance is of central
importance in IR
The User’s Task
• Consider a user who seeks information on a topic
of their interest
• This user first translates their information need
into a query, which requires specifying the words
that compose the query
• In this case, we say that the user is searching or
querying for information of their interest
• Consider now a user who has an interest that is
either poorly defined or inherently broad
• For instance, the user has an interest in car racing
and wants to browse documents on Formula 1
and Formula Indy
• In this case, we say that the user is browsing or
navigating the documents of the collection
• The User Task:
• The user first translates the information need into
a query.
• In an information retrieval system, the query is a set of
words that conveys the semantics of the information
need, whereas in a data retrieval system a query
expression conveys the constraints that the retrieved
objects must satisfy.
• Example: A user starts out looking for one thing but
ends up following links to something else.
• In that case the user is browsing and not
searching.
• The above figure shows the interaction of the user
through different tasks.
Information Retrieval Vs Data
Retrieval
• Information Retrieval: Given a set of query terms
and a set of document terms select only the most
relevant documents [precision], and preferably all
the relevant [recall].
• Data retrieval: the task of determining which
documents of a collection contain the keywords
in the user query
• Data retrieval system
• Ex: relational databases
• Deals with data that has a well defined structure
and semantics
• A single erroneous object among a thousand
retrieved objects means total failure
• Data retrieval does not solve the problem of
retrieving information about a subject or topic
Information Retrieval | Data Retrieval
A software program that deals with the organization, storage, retrieval, and evaluation of information from document repositories, particularly textual information. | Deals with obtaining data from a database management system such as an ODBMS. It is a process of identifying and retrieving data from the database, based on the query provided by a user or application.
Retrieves information about a subject. | Determines the keywords in the user query and retrieves the data.
Small errors are likely to go unnoticed. | A single erroneous object means total failure.
Not always well structured and is semantically ambiguous. | Has a well-defined structure and semantics.
Does not provide a solution to the user of the database system. | Provides solutions to the user of the database system.
The results obtained are approximate matches. | The results obtained are exact matches.
Results are ordered by relevance. | Results are not ordered by relevance.
It is a probabilistic model. | It is a deterministic model.
The IR System
• An IR system has three major components
• Document subsystem
– Acquisition
– Representation
– File organization
• User sub system
– Problem
– Representation
– Query
• Searching /Retrieval subsystem
– Matching
– Retrieved objects
• An information retrieval system thus has three
major components- the document subsystem,
the users subsystem, and the searching/retrieval
subsystem.
• These divisions are quite broad and each one is
designed to serve one or more functions, such as:
Analysis of documents and organization of
information(creation of a document database)
Analysis of user’s queries, preparation of a
strategy to search the database
Actual searching or matching of users queries
with the database, and finally
Retrieval of items that fully or partially match the
search statement.
Traditional IR System
• Acquisition (Document subsystem)
• Selection of documents & other objects from various web
resources
• Mostly text based documents
– full texts, titles, abstracts ...
– but also other objects:
• data, statistics, images, maps, trade marks, sounds ...
• The data are collected by web crawler and stored in data base.
• Representation of documents, objects(document subsystem)
• Indexing – many ways :
– free text terms (even in full texts)
– controlled vocabulary - thesaurus
– manual & automatic techniques
• Abstracting; summarizing
• Bibliographic description:
– author, title, sources, date…
– metadata
• Classifying, clustering
• File organization (Document subsystem)
• Sequential
– record (document) by record
• Inverted
– term by term; list of records under each term
• Combination
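The difference between sequential and inverted file organization can be sketched as follows. The toy collection and the function names are hypothetical and serve only as an illustration: the sequential search scans record by record, while the inverted file keeps a list of records under each term.

```python
from collections import defaultdict

# Hypothetical toy collection: record id -> text.
docs = {
    1: "information retrieval and web search",
    2: "database systems and structured data",
    3: "web search engines rank documents",
}


def sequential_search(term, collection=docs):
    """Sequential organization: scan the file record by record."""
    return [doc_id for doc_id, text in collection.items() if term in text.split()]


def build_inverted_file(collection):
    """Inverted organization: term -> sorted list of record ids."""
    index = defaultdict(set)
    for doc_id, text in collection.items():
        for term in text.split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


inverted = build_inverted_file(docs)
print(sequential_search("web"))  # [1, 3], found by scanning every record
print(inverted["web"])           # [1, 3], found by a single dictionary lookup
```

Both organizations return the same records; the inverted file pays the cost once at indexing time so that each later lookup avoids a full scan.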
• Problem (user subsystem)
• Related to user’s task, situation
– vary in specificity, clarity
• Produces information need
• ultimate criterion for effectiveness of retrieval
– how well was the need met?
• Representation (user subsystem)
• Converting a concept to query.
• What we search for.
• These are stemmed and corrected using
dictionary.
• Focus toward a good result
• Query - search statement (user & system)
• Translation into systems requirements & limits
– start of human-computer interaction
• query is the thing that goes into the computer
• Matching - searching (Searching subsystem)
• Process of matching, comparing
– search: what documents in the file match the query as
stated?
• Various search algorithms:
– exact match - Boolean
• still available in most, if not all systems
– best match - ranking by relevance
• increasingly used e.g. on the web
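The contrast between exact match and best match can be sketched as follows. The postings dictionary is hypothetical, and the best-match score used here is a plain count of matching query terms, chosen only for brevity; it is an assumption, not the scoring function of any particular system.

```python
def boolean_and(postings, terms):
    """Exact match: documents that contain every query term."""
    sets = [set(postings.get(t, [])) for t in terms]
    return sorted(set.intersection(*sets)) if sets else []


def best_match(postings, terms):
    """Best match: rank documents by how many query terms they contain."""
    scores = {}
    for t in terms:
        for doc_id in postings.get(t, []):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical postings: term -> ids of documents containing it.
postings = {"web": [1, 3], "search": [1, 3], "retrieval": [1], "data": [2]}
query = ["web", "retrieval", "data"]

print(boolean_and(postings, query))  # [] because no document has all three terms
print(best_match(postings, query))   # [(1, 2), (2, 1), (3, 1)]
```

The Boolean query returns nothing because no single document satisfies it exactly, while the best-match ranking still surfaces partially matching documents in order of overlap.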
• Retrieved documents -from system to user (IR
Subsystem)
• Various order of output:
– Last In First Out (LIFO); sorted
– ranked by relevance
– ranked by other characteristics
• Various forms of output
• Relevance feedback
• High-level software architecture of an IR system
The Software Architecture of the IR
System
The Retrieval and Ranking Processes
• Text Operations forms index words (tokens).
– Tokenization – Given a character sequence and a
defined document unit, tokenization is the task of
chopping it up into pieces called tokens.
– Stopword removal – Remove non-informative or
common words (tokens) from the stream, e.g.
is, was, and, it, a.
– Stemming – Replace word variants with a single
stem, e.g. education, educated and educate
are all replaced with the single stem educate.
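The three text operations can be chained as in the sketch below. The stopword list and the suffix rules are illustrative assumptions; the suffix stripper is a toy, not the Porter stemmer or any other published algorithm, and the stem it produces (educ) need not be a dictionary word.

```python
import re

# Illustrative stopword list and toy suffix rules (both are assumptions).
STOPWORDS = {"is", "was", "and", "it", "a", "an", "the"}
SUFFIXES = ("ation", "ating", "ated", "ate", "ing", "ed", "s")


def tokenize(text):
    """Chop a character sequence into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stopwords(tokens):
    """Drop common, non-informative tokens."""
    return [t for t in tokens if t not in STOPWORDS]


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def text_operations(text):
    """Tokenization, then stopword removal, then stemming."""
    return [stem(t) for t in remove_stopwords(tokenize(text))]


print(text_operations("The student was educated and education is a goal"))
# ['student', 'educ', 'educ', 'goal']
```

With these rules, education, educated and educate all map to the same index word, which is the effect stemming is meant to achieve.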
• Indexing : Documents are converted into fast
searchable internal representation using
language independent data structure called
Inverted Index.
• Searching : Calculate degree of similarity
between document and query terms; retrieves
documents that contain a given query token from
the inverted index.
• Ranking : Scores all retrieved documents
according to a relevance metric (term frequency
or cosine similarity)
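A minimal sketch of the searching and ranking steps, assuming raw term frequencies as vector weights and cosine similarity as the relevance metric. The document tokens are hypothetical and assumed to have already gone through the text operations.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query_tokens, docs_tokens):
    """Score every document against the query and sort by decreasing score."""
    q = Counter(query_tokens)
    scored = [(doc_id, cosine(q, Counter(tokens)))
              for doc_id, tokens in docs_tokens.items()]
    # Keep only documents that share at least one term with the query.
    return sorted((s for s in scored if s[1] > 0), key=lambda s: (-s[1], s[0]))


# Hypothetical pre-processed documents.
docs_tokens = {
    "d1": ["web", "search", "engine"],
    "d2": ["web", "web", "page"],
    "d3": ["library", "catalogue"],
}
for doc_id, score in rank(["web", "search"], docs_tokens):
    print(doc_id, round(score, 3))
# d1 0.816
# d2 0.632
```

Document d1 matches both query terms and ranks first; d2 repeats one query term but misses the other; d3 shares no term with the query and is not retrieved.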
• User Interface manages interaction with the user:
– Query input and document output.
– Relevance feedback.
– Visualization of results.
• Query Operations transform the query to
improve retrieval:
– Query expansion using a thesaurus
(vocabulary/terms); a thesaurus is a data
structure that defines semantic relatedness
between words, e.g. car, auto, automobile and
vehicle are semantically related words
– Query transformation using relevance feedback
(the user gives feedback on the relevance of
document in an initial set of results)
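Thesaurus-based query expansion can be sketched as follows. The thesaurus here is a plain dictionary holding only the car example given above; the structure and the function name are assumptions for illustration.

```python
# Hypothetical thesaurus: term -> semantically related terms.
THESAURUS = {
    "car": ["auto", "automobile", "vehicle"],
}


def expand_query(terms, thesaurus=THESAURUS):
    """Add related terms after each query term, without duplicates."""
    expanded = []
    for term in terms:
        for candidate in [term] + thesaurus.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query(["car", "racing"]))
# ['car', 'auto', 'automobile', 'vehicle', 'racing']
```

The expanded query can then match documents that mention automobile or vehicle even though the user typed only car.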
• Document Gathering
• This is the process of gathering the documents
that are to form the core content of the IR
system, these documents could be text, images,
audio files, video clips, entire movies, etc.
• Document Indexing
• The documents gathered in the document
gathering phase are converted into a fast
searchable internal representation.
• This will usually be implemented using some
programming language dependent data
structures which provide fast searching facilities
such as array lists, vectors, sets, multi-sets, maps
• Searching Support
• This process involves accepting a query, processing it,
finding possibly relevant documents, calculating the
degree of similarity between the query and each
possibly relevant document, sorting the documents by
score, and returning the highest-ranked ones to the
user in groups of (usually) 10.
• All this has to be done as efficiently and quickly as
possible. For example, the IR system that operates as
the Google search engine accepts and processes
– 150 million queries per day.
– 6.25 million per hour.
– 105,000 per minute.
– 1,700 per second.
• Document Management
• In the previous three steps, we have gathered
documents, indexed them and are now
allowing users to search their content.
However, in many scenarios such as web
searching, the documents that have been
indexed will be unstable and constantly
changing.
• Dimensions of IR
• IR is more than just text, and more than just
web search
• although these are central
• People doing IR work with different media,
different types of search applications, and
different tasks
THE WEB
• At the end of World War II, Vannevar Bush looked for applications
of new technologies to peace times
• Bush first produced a report entitled Science, The Endless Frontier
• This report directly influenced the creation of the National Science
Foundation
• Following that, he wrote As We May Think, a remarkable paper which
discussed new hardware and software gadgets
• In Bush’s words: Whole new forms of encyclopedias will appear,
ready-made with a mesh of associative trails running through them,
ready to be dropped into the memex and there amplified
• As We May Think influenced people like Douglas Engelbart, who
invented the computer mouse and introduced the concept of
hyperlinked texts
• Ted Nelson, working in his Project Xanadu, pushed the concept
further and coined the term hypertext
• A hypertext allows the reader to jump from one
electronic document to another, which was one
important property regarding the problem that Tim
Berners-Lee faced in 1989
• At the time, Berners-Lee worked in Geneva at the
CERN—Conseil Européen pour la Recherche Nucléaire
• There, researchers who wanted to share
documentation with others had to reformat their
documents to make them compatible with an internal
publishing system
• Berners-Lee reasoned that it would be nice if the
solution of sharing documents were decentralized
• He saw that a networked hypertext would be a good
solution and started working on its implementation
• In 1990, Berners-Lee
• Wrote the HTTP protocol
• Defined the HTML language
• Wrote the first browser, which he called World
Wide Web
• Wrote the first Web server
• In 1991, he made his browser and server
software available on the Internet
• The Web was born!
The E-Publishing Era
• Well over 20 billion pages are now available
and accessible in the Web
• More than one fourth of humanity now access
the Web on a regular basis
• Why is the Web such a success? What is the
single most important characteristic of the
Web that makes it so revolutionary?
• In search of an answer, let us delve into the
life of a writer who lived at the end of the
18th Century
• She finished the first draft of her novel in 1796
• The first attempt of publication was refused
without a reading
• The novel was only published 15 years later!
• Jane Austen was discriminated against because there
was no freedom to publish at the beginning of
the 19th century
• The Web, unleashed by the inventiveness of
Tim Berners-Lee, changed this once and for all
• It did so by universalizing freedom to publish
• The Web moved mankind into a new era, into
a new time, into The e-Publishing Era.
How the Web Changed Search
• Web search is today the most prominent
application of IR and its techniques—the
ranking and indexing components of any
search engine are fundamentally IR pieces of
technology
• The first major impact of the Web on search is
related to the characteristics of the document
collection itself.
• The Web is composed of pages distributed
over millions of sites and connected through
hyperlinks.
• This requires collecting all documents and
storing copies of them in a central repository,
prior to indexing.
• This new phase in the IR process, introduced
by the Web, is called crawling
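The crawling phase can be sketched as a breadth-first traversal that follows hyperlinks and stores a copy of every page in a central repository. To keep the example self-contained, the Web is simulated by a hypothetical in-memory link graph; the URLs, the fetch function, and the max_pages limit are all assumptions.

```python
from collections import deque

# Hypothetical link graph standing in for the Web: URL -> (page text, out-links).
WEB = {
    "a.example/": ("home page", ["a.example/about", "b.example/"]),
    "a.example/about": ("about us", ["a.example/"]),
    "b.example/": ("another site", ["c.example/"]),
    "c.example/": ("third site", []),
}


def crawl(seed, fetch=WEB.get, max_pages=100):
    """Collect pages reachable from the seed into a central repository."""
    repository = {}            # URL -> stored copy of the page text
    frontier = deque([seed])   # URLs waiting to be fetched
    seen = {seed}              # URLs already queued, to avoid refetching
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        page = fetch(url)
        if page is None:       # unreachable or unknown page
            continue
        text, links = page
        repository[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return repository


print(list(crawl("a.example/")))
# ['a.example/', 'a.example/about', 'b.example/', 'c.example/']
```

The repository returned by the crawl is what the indexing step then processes. A real crawler would add network fetching, politeness delays, and robots.txt handling, all of which are omitted here.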
• The second major impact of the Web on
search is related to:
• The size of the collection
• The volume of user queries submitted on a
daily basis
• As a consequence, performance and
scalability have become critical characteristics
of the IR system
• The third major impact : in a very large
collection, predicting relevance is much
harder than before
• Fortunately, the Web also includes new
sources of evidence
• Ex: hyperlinks and user clicks in documents in
the answer set
• The fourth major impact derives from the fact
that the Web is also a medium to do business
• Search problem has been extended beyond
the seeking of text information to also
encompass other user needs
• Ex: the price of a book, the phone number of a
hotel, the link for downloading a software
• The fifth major impact of the Web on search is
Web spam
• Web spam: abusive availability of commercial
information disguised in the form of
informational content
• This difficulty is so large that today we talk of
Adversarial Web Retrieval
Practical Issues on the Web
• 1. Security
• Commercial transactions over the Internet are not yet
a completely safe procedure
• 2. Privacy
• Frequently, people are willing to exchange information
as long as it does not become public
• 3. Copyright and patent rights
• It is far from clear how the wide spread of data on the
Web affects copyright and patent laws in the various
countries
• 4. Scanning, optical character recognition (OCR), and
cross-language retrieval
How People Search
• Search tasks range from the relatively simple
(e.g., looking up disputed facts or finding
weather information) to the rich and complex
(e.g., job seeking and planning vacations).
• Search interfaces should support a range of
tasks, while taking into account how people
think about searching for information.
Information Lookup versus
Exploratory Search
• User interaction with search interfaces differs depending
on
• the type of task
• the domain expertise of the information seeker
• the amount of time and effort available to invest in the
process
• Marchionini makes a distinction between information
lookup and exploratory search
• Information lookup tasks
• are akin to fact retrieval or question answering
• can be satisfied by discrete pieces of information: numbers,
dates, names, or Web sites
• can work well for standard Web search interactions
• Exploratory search is divided into learning and
investigating tasks
• Learning search
• i) requires more than single query-response pairs
• ii) requires the searcher to spend time
• scanning and reading multiple information items
• synthesizing content to form new understanding
• Investigating refers to a longer-term process
which
• involves multiple iterations that take place over
perhaps very long periods of time
• may return results that are critically assessed
before being integrated into personal and
professional knowledge bases
• may be concerned with finding a large proportion
of the relevant information available
• Information seeking can be seen as being part of
a larger process referred to as sensemaking
The Classic versus the Dynamic Model
of Information Seeking
• Classic notion of the information seeking
process:
• problem identification
• articulation of information need(s)
• query formulation
• results evaluation
Navigation versus Search
• Navigation: the searcher looks at an information
structure and browses among the available information
• This browsing strategy is preferable when the
information structure is well-matched to the user’s
information need
• it is mentally less taxing to recognize a piece of
information than it is to recall it
• it works well only so long as appropriate links are
available
• If the links are not available, then the browsing
experience might be frustrating
• Spool discusses an example of a user looking for a
software driver for a particular laser printer
Search Process
• Numerous studies have been made of people engaged
in the search process
• The results of these studies can help guide the design
of search interfaces
• One common observation is that users often
reformulate their queries with slight modifications
• Another is that searchers often search for information
that they have previously accessed
• The users’ search strategies differ when searching over
previously seen materials
• Researchers have developed search interfaces that support
both query history and revisitation
Search Interface Today
• Getting Started
• How does an information seeking session begin in
online information systems?
• The most common way is to use a Web search engine
• Another method is to select a Web site from a personal
collection of already-visited sites
• which are typically stored in a browser’s bookmark
• Online bookmark systems are popular among a smaller
segment of users
• Ex: Delicious.com
• Web directories are also used as a common starting
point, but have been largely replaced by search engines
Query Specification
• The primary methods for a searcher to express their information
need are either
• entering words into a search entry form
• selecting links from a directory or other information organization
display
• For Web search engines, the query is specified in textual form
• Typically, Web queries today are very short consisting of one to
three words
• Short queries reflect the standard usage scenario in which the user
tests the waters:
• If the results do not look relevant, then the user reformulates their
query
• If the results are promising, then the user navigates to the most
relevant-looking Web site
Query Specification Interfaces
• The standard interface for a textual query is a
search box entry form
• Studies suggest a relationship between query
length and the width of the entry form
• Results found that either small forms
discourage long queries or wide forms
encourage longer queries
IRT Unit_I.pptx
• Some interfaces show a list of query
suggestions as the user types the query
• This is referred to as auto-complete, auto-
suggest, or dynamic query suggestions
• Anick et al found that users clicked on
dynamic Yahoo suggestions one third of the
time
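Dynamic query suggestions can be sketched as a prefix lookup over previously seen queries. The query log below is hypothetical, and production systems typically also rank suggestions by popularity, which this sketch leaves out.

```python
from bisect import bisect_left

# Hypothetical log of past queries, kept sorted so prefixes are contiguous.
QUERY_LOG = sorted([
    "information retrieval",
    "information extraction",
    "inverted index",
    "web search",
    "web spam",
])


def suggest(prefix, queries=QUERY_LOG, limit=5):
    """Return up to `limit` logged queries that start with the typed prefix."""
    start = bisect_left(queries, prefix)   # first query >= prefix
    suggestions = []
    for query in queries[start:]:
        if not query.startswith(prefix) or len(suggestions) == limit:
            break
        suggestions.append(query)
    return suggestions


print(suggest("info"))   # ['information extraction', 'information retrieval']
print(suggest("web s"))  # ['web search', 'web spam']
```

Because the log is sorted, all queries sharing a prefix sit next to each other, so one binary search finds the start of the suggestion list as the user types.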
Retrieval Result Display
• When displaying search results, either
• the documents must be shown in full, or else
• the searcher must be presented with some kind of
representation of the content of those documents
• The document surrogate refers to the information that
summarizes the document
• This information is a key part of the success of the search
interface
• The design of document surrogates is an active area of
research and experimentation
• The quality of the surrogate can greatly affect the
perceived relevance of the search results listing
Query Reformulation
• There are tools to help users reformulate their
query
• One technique consists of showing terms related
to the query or to the documents retrieved in
response to the query
• A special case of this is spelling corrections or
suggestions
• Usually only one suggested alternative is shown:
clicking on that alternative re-executes the query
• In earlier years, the search results were shown
using the purportedly incorrect spelling
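A single spelling suggestion can be sketched with the standard library function difflib.get_close_matches, which returns the vocabulary entries most similar to the typed word. The vocabulary is hypothetical, and a real search engine would draw its candidates from query logs or the index rather than a fixed list.

```python
import difflib

# Hypothetical vocabulary of known-good query terms.
VOCABULARY = ["retrieval", "relevance", "ranking", "recall", "precision"]


def suggest_spelling(word, vocabulary=VOCABULARY):
    """Return one suggested alternative, or None if the word looks fine."""
    if word in vocabulary:
        return None
    matches = difflib.get_close_matches(word, vocabulary, n=1)
    return matches[0] if matches else None


print(suggest_spelling("retreival"))  # retrieval
print(suggest_spelling("recall"))     # None
```

Only the best alternative is returned (n=1), mirroring the interface behaviour described above where a single suggestion is shown.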
Organizing Search results
• Organizing results into meaningful groups can
help users understand the results and decide
what to do next
• Popular methods for grouping search results:
category systems and clustering
• Category system: meaningful labels organized
in such a way as to reflect the concepts
relevant to a domain
Visualization in Search Interfaces
• Experimentation with visualization for search
has been primarily applied in the following
ways:
• Visualizing Boolean syntax
• Visualizing query terms within retrieval results
• Visualizing relationships among words and
documents
• Visualization for text mining
Visualizing Boolean syntax
• Boolean query syntax is difficult for most users
and is rarely used in Web search
• For many years, researchers have experimented
with how to visualize Boolean query specification
• A common approach is to show Venn diagrams
• A more flexible version of this idea was seen in
the VQuery system, proposed by Steve Jones
• The VQuery interface for Boolean query
specification
Visualizing query terms within
retrieval results
• Understanding the role of the query terms within
the retrieved docs can help relevance assessment
• Experimental visualizations have been designed
that make this role more explicit
• In the TileBars interface, for instance, documents
are shown as horizontal glyphs
• The locations of the query term hits are marked
along the glyph
• The user is encouraged to break the query into its
different facets, with one concept per line
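The data behind a TileBars-style glyph can be sketched as a small grid: one row per query facet and one cell per text segment, where each cell counts the facet hits in that segment. Fixed-length segments are an assumption made here for simplicity, and the function name, facets, and sample text are hypothetical.

```python
def tilebar(tokens, facets, segment_size=4):
    """Return {facet name: [hit count per segment]} for one document."""
    segments = [tokens[i:i + segment_size]
                for i in range(0, len(tokens), segment_size)]
    return {
        name: [sum(tok in terms for tok in seg) for seg in segments]
        for name, terms in facets.items()
    }


# Hypothetical document and a query broken into two facets.
tokens = ("web search engines crawl pages then index pages "
          "and rank pages for search").split()
facets = {"search": {"search"}, "pages": {"page", "pages"}}

for name, row in tilebar(tokens, facets).items():
    print(f"{name:8s}", row)
# search   [1, 0, 0, 1]
# pages    [0, 2, 1, 0]
```

Reading the rows side by side shows where in the document each facet occurs and whether the facets appear in the same segments, which is the information the glyph conveys visually.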
Visualizing relationships among
words and documents
• Numerous works proposed variations on the idea
of placing words and docs on a two-dimensional
canvas
• In these works, proximity of glyphs represents
semantic relationships among the terms or
documents
• An early version of this idea is the VIBE interface
• Documents containing combinations of the query
terms are placed midway between the icons
representing those terms
Visualization for text mining
• Visualization is also used for purposes of analysis
and exploration of textual data
• Visualizations such as the Word Tree show a piece
of a text concordance
• It allows the user to view which words and
phrases commonly precede or follow a given
word
• Another example is the NameVoyager, which
shows frequencies of names for U.S. children
across time
IRT Unit_I.pptx

More Related Content

PPTX
Chapter 1 - Introduction to IR Information retrieval ch1 Information retrieva...
DOCX
unit 1 INTRODUCTION
PPTX
Informationa Retrieval Techniques .pptx
PPTX
Information retrieval 1 introduction to ir
PPT
information retirval system,search info insights in unsturtcured data
PPT
Information retrival system it is part and parcel
PPTX
Introduction to Information Retrieval (concepts and principles)
PPTX
Chapter 1.pptx
Chapter 1 - Introduction to IR Information retrieval ch1 Information retrieva...
unit 1 INTRODUCTION
Informationa Retrieval Techniques .pptx
Information retrieval 1 introduction to ir
information retirval system,search info insights in unsturtcured data
Information retrival system it is part and parcel
Introduction to Information Retrieval (concepts and principles)
Chapter 1.pptx

Similar to IRT Unit_I.pptx (20)

PDF
Chapter 1: Introduction to Information Storage and Retrieval
PPTX
Information storage and retrieval
PPTX
information Storage nd retrieval.pptx
PDF
PDF
CS8080_IRT__UNIT_I_NOTES.pdf
PPTX
Chapter 1 Intro Information Rerieval.pptx
PPT
Data mining concept and methods for basic
PPT
INTRODUCTION TO INFORMATION RETRIEVALChapter 1-IR.ppt
PPTX
Indexing Techniques: Their Usage in Search Engines for Information Retrieval
PPTX
DMDA Unit-1.pptx .
PDF
chapter 2 Data Science.pdf emerging ecnology freshman course
PPTX
information retrieval in artificial intelligence
PPT
PPTX
CHAPTER -12 it.pptx
PPT
chap1.ppt
PPT
chap1.ppt
PPT
chap1.ppt
PPT
Information_System_and_Data_mining12.ppt
PPTX
PPT
Information retrieval system
Chapter 1: Introduction to Information Storage and Retrieval
Information storage and retrieval
information Storage nd retrieval.pptx
CS8080_IRT__UNIT_I_NOTES.pdf
Chapter 1 Intro Information Rerieval.pptx
Data mining concept and methods for basic
INTRODUCTION TO INFORMATION RETRIEVALChapter 1-IR.ppt
Indexing Techniques: Their Usage in Search Engines for Information Retrieval
DMDA Unit-1.pptx .
chapter 2 Data Science.pdf emerging ecnology freshman course
information retrieval in artificial intelligence
CHAPTER -12 it.pptx
chap1.ppt
chap1.ppt
chap1.ppt
Information_System_and_Data_mining12.ppt
Information retrieval system
Ad

More from thenmozhip8 (14)

PPTX
U5 SPC.pptx
PDF
Unit 4.pdf
PPTX
unit 3 ppt.pptx
PPT
U2.ppt
PPT
Unit 1 .ppt
DOCX
IR UNIT V.docx
PPTX
IRT Unit_4.pptx
DOCX
UNIT 3 IRT.docx
PPTX
IRT Unit_ 2.pptx
PPT
packages unit 5 .ppt
PPT
unit 4 .ppt
PPTX
Definning class.pptx unit 3
PPT
exception-handling-in-java.ppt unit 2
PPTX
unit 1 full ppt.pptx
U5 SPC.pptx
Unit 4.pdf
unit 3 ppt.pptx
U2.ppt
Unit 1 .ppt
IR UNIT V.docx
IRT Unit_4.pptx
UNIT 3 IRT.docx
IRT Unit_ 2.pptx
packages unit 5 .ppt
unit 4 .ppt
Definning class.pptx unit 3
exception-handling-in-java.ppt unit 2
unit 1 full ppt.pptx
Ad

IRT Unit_I.pptx

  • 1. UNIT-I IV Year / VIII Semester By P.Thenmozhi AP/CSE KNCET. KONGUNADU COLLEGE OF ENGINEERING AND TECHNOLOGY (Autonomous) NAMAKKAL- TRICHY MAIN ROAD, THOTTIAM DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING CS8080 – Information Retrieval Techniques
  • 2. Objectives To understand the basics of information retrieval. To understand machine learning techniques for text classification and clustering . To understand various search engine operations To learn different techniques of recommender systems
  • 4. Information Retrieval Techniques • What is Information? • There is no “correct” definition • Information: – Informing, telling; thing told, knowledge, items of knowledge, news – Knowledge communicated or received concerning a particular fact or circumstance; news • Knowledge: knowing familiarity gained by experience; person’s range of information; a theoretical or practical understanding of; the sum of what is known
  • 5. • Types of information • Text (Documents and portions thereof) • XML and structured documents • Images • Audio (sound effects, songs, etc.) • Video • Source code • Applications/Web services
  • 6. • Retrieval? • “Fetch something” that’s been stored • Recover a stored state of knowledge • Search through stored messages to find some messages relevant to the task at hand.
  • 7. • What is IR? • Information retrieval is a problem-oriented discipline, concerned with the problem of the effective and efficient transfer of desired information between human generator and human user
  • 8. Main Objective of IR: • Provide the users with effective access to & interaction with information resources. • Goal of IR is to retrieve all and only the “relevant” documents in a collection for a particular user with a particular need for information – Relevance is a central concept in IR theory
  • 9. – Web search engines have been stress-testing the traditional IR models (and inventing new ways of ranking) • The goal is to search large document collections (millions of documents) to retrieve small subsets relevant to the user’s information need • Examples are: • Internet search engines (Google, Yahoo! web search, etc.) • Digital library catalogues (MELVYL, GLADYS)
  • 10. • What do we want from an IRS? • Systemic approach – Goal (for a known information need): Return as many relevant documents as possible and as few non-relevant documents as possible • Cognitive approach – Goal (in an interactive information-seeking environment, with a given IRS): Support the user’s exploration of the problem domain and the task completion.
  • 11. • Information Retrieval vs. Information Extraction • Information Retrieval • Given a set of terms and a set of document terms select only the most relevant document (precision), and preferably all the relevant ones (recall) • Information Extraction • Extract from the text what the document means.
  • 12. Databases vs. IR
      Aspect | Databases | IR
      What we are retrieving | Structured data | Mostly unstructured
      Queries we are posing | Formally defined queries, unambiguous | Expressed in natural language
      Results we get | Exact. Always correct in formal sense. | Sometimes relevant, often not
      Interaction with system | One-shot queries | Interaction is important
  • 13. Performance and correctness measures • Precision: Precision is the fraction of the documents retrieved that are relevant to the user’s information need. • Recall: Recall is the fraction of the documents that are relevant to the query that are successfully retrieved.
  • 14. • Fall-out: The proportion of non-relevant documents that are retrieved, out of all non-relevant documents available. • F-score / F-measure: The weighted harmonic mean of precision and recall. The traditional F-measure or balanced F-score is: F = 2 · (precision · recall) / (precision + recall)
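The four measures above can be sketched in a few lines of Python. This is a minimal illustration, not a standard library API: the function name `retrieval_metrics`, the document ids, and the collection size of 10 are assumptions made for the example.

```python
def retrieval_metrics(retrieved, relevant, collection_size):
    """Compute precision, recall, fall-out and balanced F-score for one query.

    retrieved / relevant are collections of document ids (hypothetical data);
    collection_size is the total number of documents in the collection.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)            # relevant documents that were retrieved
    non_relevant_total = collection_size - len(relevant)

    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # Fall-out: retrieved non-relevant documents over all non-relevant documents.
    fallout = (len(retrieved) - hits) / non_relevant_total if non_relevant_total else 0.0
    # Balanced F-score: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "fallout": fallout, "f1": f1}


metrics = retrieval_metrics(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6], collection_size=10)
print(metrics)
```

In this toy run, two of the four retrieved documents are relevant, so precision is 0.5 and recall is 2/3.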
  • 15. • Information Retrieval: • IR deals with the representation, storage, organization of, and access to information items • Types of information items: documents, Web pages, online catalogs, structured records, multimedia objects • Early goals of the IR area: indexing text and searching for useful documents in a collection • Nowadays, research in IR includes: Modeling, Web search, text classification, systems architecture, user interfaces, data visualization, filtering and languages.
  • 16. Early Developments • For more than 5,000 years, man has organized information for later retrieval and searching • This has been done by compiling, storing, organizing, and indexing papyrus, hieroglyphics, and books • For holding the various items, special purpose buildings called libraries, or bibliothekes, are used – The oldest known library was created in Ebla, in the Fertile Crescent, between 3,000 and 2,500 BC – By 300 BC, Ptolemy Soter, a Macedonian general, created the Great Library at Alexandria – Nowadays, libraries are everywhere
  • 17. • In 2008, more than 2 billion items were checked out from libraries in the US—an increase of 10% over the previous year • Since the volume of information in libraries is always growing, it is necessary to build specialized data structures for fast search — the indexes • For centuries indexes have been created manually as sets of categories, with labels associated with each category • The advent of modern computers has allowed the construction of large indexes automatically
  • 18. • During the 1950s, research efforts in IR were initiated by pioneers such as Hans Peter Luhn, Eugene Garfield, Philip Bagley, and Calvin Mooers, who allegedly coined the term Information Retrieval • In 1962, Cyril Cleverdon published the Cranfield studies on retrieval evaluation • In 1963, Joseph Becker and Robert Hayes published the first book on IR • In the late 1960s, key research conducted by Karen Spärck Jones and Gerard Salton, among others, led to the definition of the TF-IDF term weighting scheme • In 1971, Jardine and van Rijsbergen articulated the cluster hypothesis
  • 19. • In 1978, the first ACM SIGIR International Conference on Information Retrieval was held in Rochester • In 1979, van Rijsbergen published a classic book entitled Information Retrieval, which focused on the Probabilistic Model • In 1983, Salton and McGill published a classic book entitled Introduction to Modern Information Retrieval, which focused on the Vector Model
  • 20. The IR Problem • Users of modern IR systems, such as search engine users, have information needs of varying complexity • An example of complex information need is as follows: – “Find all documents that address the role of the Federal Government in financing the operation of the National Railroad Transportation Corporation (AMTRAK)” • This full description of the user information need is not necessarily a good query to be submitted to the IR system
  • 21. • Instead, the user might want to first translate this information need into a query • This translation process yields a set of keywords, or index terms, which summarize the user information need • Given the user query, the key goal of the IR system is to retrieve information that is useful or relevant to the user
  • 22. • That is, the IR system must rank the information items according to a degree of relevance to the user query • The IR Problem – “The key goal of an IR system is to retrieve all the items that are relevant to a user query, while retrieving as few non relevant items as possible”. • The notion of relevance is of central importance in IR
  • 24. • Consider a user who seeks information on a topic of their interest • This user first translates their information need into a query, which requires specifying the words that compose the query • In this case, we say that the user is searching or querying for information of their interest • Consider now a user who has an interest that is either poorly defined or inherently broad • For instance, the user has an interest in car racing and wants to browse documents on Formula 1 and Formula Indy • In this case, we say that the user is browsing or navigating the documents of the collection
  • 25. • The User Task: • The user first translates the information need into a query. • In an information retrieval system, the query is a set of words that convey the semantics of the information that is required, whereas in a data retrieval system, a query expression conveys the constraints that the retrieved objects must satisfy. • Example: A user starts out searching for one thing but ends up looking at something else. • This means that the user is browsing and not searching. • The accompanying figure shows the interaction of the user through different tasks.
  • 26. Information Retrieval Vs Data Retrieval • Information Retrieval: Given a set of query terms and a set of document terms select only the most relevant documents [precision], and preferably all the relevant [recall]. • Data retrieval: the task of determining which documents of a collection contain the keywords in the user query • Data retrieval system • Ex: relational databases • Deals with data that has a well defined structure and semantics
  • 27. • A single erroneous object among a thousand retrieved objects means total failure • Data retrieval does not solve the problem of retrieving information about a subject or topic
  • 28. Information Retrieval vs. Data Retrieval
      Information Retrieval | Data Retrieval
      The software program that deals with the organization, storage, retrieval, and evaluation of information from document repositories, particularly textual information. | Deals with obtaining data from a database management system such as an ODBMS. It is a process of identifying and retrieving the data from the database, based on the query provided by the user or application.
      Retrieves information about a subject. | Determines the keywords in the user query and retrieves the data.
      Small errors are likely to go unnoticed. | A single error object means total failure.
      Not always well structured and is semantically ambiguous. | Has a well-defined structure and semantics.
      Does not provide a solution to the user of the database system. | Provides solutions to the user of the database system.
      The results obtained are approximate matches. | The results obtained are exact matches.
      Results are ordered by relevance. | Results are unordered by relevance.
      It is a probabilistic model. | It is a deterministic model.
  • 29. The IR System • It has three major components in IR • Document subsystem – Acquisition – Representation – File organization • User sub system – Problem – Representation – Query • Searching /Retrieval subsystem – Matching – Retrieved objects
  • 30. • An information retrieval system thus has three major components: the document subsystem, the user subsystem, and the searching/retrieval subsystem. • These divisions are quite broad and each one is designed to serve one or more functions, such as: – Analysis of documents and organization of information (creation of a document database) – Analysis of users’ queries, preparation of a strategy to search the database – Actual searching or matching of users’ queries with the database – Retrieval of items that fully or partially match the search statement.
  • 32. • Acquisition (Document subsystem) • Selection of documents & other objects from various web resources • Mostly text based documents – full texts, titles, abstracts ... – but also other objects: • data, statistics, images, maps, trade marks, sounds ... • The data are collected by a web crawler and stored in a database. • Representation of documents, objects (Document subsystem) • Indexing – many ways: – free text terms (even in full texts) – controlled vocabulary - thesaurus – manual & automatic techniques • Abstracting; summarizing • Bibliographic description: – author, title, sources, date… – metadata • Classifying, clustering
  • 33. • File organization (Document subsystem) • Sequential – record (document) by record • Inverted – term by term; list of records under each term • Combination • Problem (user subsystem) • Related to user’s task, situation – vary in specificity, clarity • Produces information need • ultimate criterion for effectiveness of retrieval – how well was the need met?
  • 34. • Representation (user subsystem) • Converting a concept to query. • What we search for. • These are stemmed and corrected using dictionary. • Focus toward a good result • Query - search statement (user & system) • Translation into systems requirements & limits – start of human-computer interaction • query is the thing that goes into the computer
  • 35. • Matching - searching (Searching subsystem) • Process of matching, comparing – search: what documents in the file match the query as stated? • Various search algorithms: – exact match - Boolean • still available in most, if not all systems – best match - ranking by relevance • increasingly used e.g. on the web • Retrieved documents -from system to user (IR Subsystem) • Various order of output: – Last In First Out (LIFO); sorted – ranked by relevance – ranked by other characteristics • Various forms of output
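The difference between exact (Boolean) match and best match can be illustrated with a small sketch. The three documents, the function names `boolean_and` and `best_match`, and the use of simple term overlap as the ranking score are all assumptions for illustration; real systems use richer relevance scores.

```python
# Hypothetical toy collection: document id -> text.
DOCS = {
    "d1": "information retrieval ranks documents",
    "d2": "database systems answer exact queries",
    "d3": "web search engines rank documents by relevance",
}


def terms(text):
    """Split a text into a set of lowercase terms."""
    return set(text.lower().split())


def boolean_and(query_terms, docs):
    """Exact match: return only documents that contain every query term."""
    wanted = set(query_terms)
    return sorted(doc_id for doc_id, text in docs.items() if wanted <= terms(text))


def best_match(query_terms, docs):
    """Best match: rank documents by how many query terms they contain."""
    wanted = set(query_terms)
    scored = [(doc_id, len(wanted & terms(text))) for doc_id, text in docs.items()]
    scored = [(doc_id, score) for doc_id, score in scored if score > 0]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


print(boolean_and(["rank", "documents"], DOCS))
print(best_match(["rank", "documents", "retrieval"], DOCS))
```

The Boolean query either returns a document or not, while the best-match query also returns partially matching documents, ordered by score.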
  • 37. The Software Architecture of the IR System
  • 38. The Retrieval and Ranking Processes
  • 39. • Text Operations form index words (tokens). – Tokenization – Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces called tokens. – Stopword removal – Remove non-informative or common words (tokens) from the stream, e.g. is, was, and, it, a. – Stemming – Replace the variants of a word with a single stem. E.g. education, educated, and educate are all replaced with the single stem educate.
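The three text operations can be sketched as a small pipeline. This is an illustrative toy, not the Porter stemmer or any library implementation: the stopword list is the short one from the slide, and `naive_stem` is a hypothetical suffix stripper whose stems (such as `educat`) need not be dictionary words.

```python
import re

# Stopwords taken from the example on the slide; real lists are much longer.
STOPWORDS = {"is", "was", "and", "it", "a"}

# Suffixes tried in order by the toy stemmer (an assumption for illustration).
SUFFIXES = ("ion", "ing", "ed", "es", "e", "s")


def tokenize(text):
    """Chop a character sequence into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stopwords(tokens):
    """Drop common, non-informative tokens."""
    return [t for t in tokens if t not in STOPWORDS]


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_terms(text):
    """Run tokenization, stopword removal and stemming in sequence."""
    return [naive_stem(t) for t in remove_stopwords(tokenize(text))]


print(index_terms("Education was educated and it is a way to educate"))
```

All three variants of the word collapse to the same stem, so a query containing any of them would match documents containing the others.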
  • 40. • Indexing: Documents are converted into a fast searchable internal representation using a language-independent data structure called an Inverted Index. • Searching: Calculate the degree of similarity between document and query terms; retrieve documents that contain a given query token from the inverted index. • Ranking: Score all retrieved documents according to a relevance metric (term frequency or cosine similarity). • User Interface manages interaction with the user: – Query input and document output. – Relevance feedback. – Visualization of results.
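Indexing, searching and ranking can be sketched together as follows. This is a minimal sketch under stated assumptions: the three documents are invented, terms are split on whitespace, and the ranking uses cosine similarity over raw term frequencies (no IDF weighting). The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index


def cosine_rank(query, docs, index):
    """Look up candidates in the index, then rank them by cosine similarity."""
    q_tf = Counter(query.lower().split())
    # Searching: only documents that contain at least one query term.
    candidates = {doc_id for term in q_tf for doc_id in index.get(term, {})}
    q_norm = math.sqrt(sum(v * v for v in q_tf.values()))
    scores = {}
    for doc_id in candidates:
        d_tf = Counter(docs[doc_id].lower().split())
        dot = sum(q_tf[t] * d_tf[t] for t in q_tf)
        d_norm = math.sqrt(sum(v * v for v in d_tf.values()))
        scores[doc_id] = dot / (q_norm * d_norm)
    # Ranking: highest score first, ties broken by document id.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


docs = {
    "d1": "car engine repair",
    "d2": "car racing car",
    "d3": "library catalog",
}
index = build_inverted_index(docs)
ranking = cosine_rank("car racing", docs, index)
print(ranking)
```

Document d3 shares no term with the query, so it is never scored; the index lookup is what keeps the search from scanning the whole collection.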
  • 41. • Query Operations transform the query to improve retrieval: – Query expansion using a thesaurus.(vocabulary/terms); thesaurus is a data structure that defines semantic relatedness between words e.g. Semantic related words are car, auto, automobile and vehicle – Query transformation using relevance feedback (the user gives feedback on the relevance of document in an initial set of results)
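Thesaurus-based query expansion can be sketched as a dictionary lookup. The thesaurus below contains only the related words from the slide's example; the function name `expand_query` and the query terms are assumptions for illustration.

```python
# Hypothetical thesaurus: term -> semantically related terms.
THESAURUS = {
    "car": ["auto", "automobile", "vehicle"],
}


def expand_query(query_terms, thesaurus):
    """Append related terms to the query, keeping order and avoiding duplicates."""
    expanded = []
    for term in query_terms:
        for candidate in [term, *thesaurus.get(term, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query(["car", "rental"], THESAURUS))
```

Terms without a thesaurus entry, such as "rental" here, pass through unchanged.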
  • 42. • Document Gathering • This is the process of gathering the documents that are to form the core content of the IR system. These documents could be text, images, audio files, video clips, entire movies, etc. • Document Indexing • The documents gathered in the document gathering phase are converted into a fast searchable internal representation. • This will usually be implemented using programming-language-dependent data structures that provide fast searching facilities, such as array lists, vectors, sets, multi-sets, and maps.
  • 43. • Searching Support • This process involves accepting a query, processing it, finding possibly relevant documents, calculating the degree of similarity between the query and each possibly relevant document, sorting the set of highly ranked documents, and returning these to the user in groups of (usually) 10. • All this has to be done as efficiently and quickly as possible. For example, the IR system that operates as the Google search engine accepts and processes – 150 million queries per day. – 6.25 million per hour. – 105,000 per minute. – 1,700 per second.
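The per-hour, per-minute and per-second figures follow from the daily figure by simple division, as this short check shows. The only input is the 150 million queries per day quoted above; the slide's last two numbers are rounded.

```python
QUERIES_PER_DAY = 150_000_000  # figure quoted on the slide

per_hour = QUERIES_PER_DAY / 24
per_minute = per_hour / 60
per_second = per_minute / 60

print(f"{per_hour:,.0f} per hour")
print(f"{per_minute:,.0f} per minute")
print(f"{per_second:,.0f} per second")
```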
  • 44. • Document Management • In the previous three steps, we have gathered documents, indexed them and are now allowing users to search their content. However, in many scenarios such as web searching, the documents that have been indexed will be unstable and constantly changing.
  • 45. • Dimensions of IR • IR is more than just text, and more than just web search • although these are central • People doing IR work with different media, different types of search applications, and different tasks
  • 46. THE WEB • At the end of World War II, Vannevar Bush looked for applications of new technologies to peace times • Bush first produced a report entitled Science, The Endless Frontier • This report directly influenced the creation of the National Science Foundation • Following, he wrote As We May Think, a remarkable paper which discussed new hardware and software gadgets • In Bush’s words: Whole new forms of encyclopedias will appear, ready-made with a mesh of associative trails running through them, ready to be dropped into the memex and there amplified • As We May Think influenced people like Douglas Engelbart, who invented the computer mouse and introduced the concept of hyperlinked texts • Ted Nelson, working in his Project Xanadu, pushed the concept further and coined the term hypertext
  • 47. • A hypertext allows the reader to jump from one electronic document to another, which was one important property regarding the problem that Tim Berners-Lee faced in 1989 • At the time, Berners-Lee worked in Geneva at the CERN—Conseil Européen pour la Recherche Nucléaire • There, researchers who wanted to share documentation with others had to reformat their documents to make them compatible with an internal publishing system • Berners-Lee reasoned that it would be nice if the solution of sharing documents were decentralized • He saw that a networked hypertext would be a good solution and started working on its implementation
  • 48. • In 1990, Berners-Lee • Wrote the HTTP protocol • Defined the HTML language • Wrote the first browser, which he called World Wide Web • Wrote the first Web server • In 1991, he made his browser and server software available on the Internet • The Web was born!
  • 49. The E-Publishing Era • Well over 20 billion pages are now available and accessible on the Web • More than one fourth of humanity now accesses the Web on a regular basis • Why is the Web such a success? What is the single most important characteristic of the Web that makes it so revolutionary?
  • 50. • In search of an answer, let us delve into the life of a writer who lived at the end of the 18th century • She finished the first draft of her novel in 1796 • The first attempt at publication was refused without a reading • The novel was only published 15 years later!
  • 51. • Jane Austen was discriminated against because there was no freedom to publish at the beginning of the 19th century • The Web, unleashed by the inventiveness of Tim Berners-Lee, changed this once and for all • It did so by universalizing freedom to publish • The Web moved mankind into a new era, into a new time, into The e-Publishing Era.
  • 52. How the Web Changed Search • Web search is today the most prominent application of IR and its techniques—the ranking and indexing components of any search engine are fundamentally IR pieces of technology
  • 53. • The first major impact of the Web on search is related to the characteristics of the document collection itself. • The Web is composed of pages distributed over millions of sites and connected through hyperlinks. • This requires collecting all documents and storing copies of them in a central repository, prior to indexing. • This new phase in the IR process, introduced by the Web, is called crawling
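The crawling phase can be sketched as a breadth-first traversal of the link graph. To stay self-contained, this sketch does no network access: the `WEB` dictionary is a hypothetical in-memory link graph, and `fetch_links` stands in for downloading a page and extracting its hyperlinks.

```python
from collections import deque


def crawl(seed, fetch_links, max_pages=100):
    """Visit pages breadth-first from a seed, following hyperlinks once each."""
    seen = {seed}
    queue = deque([seed])
    visited = []                      # pages "stored in the central repository"
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:      # avoid re-crawling pages already queued
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical link graph: page -> pages it links to.
WEB = {
    "a.example/": ["a.example/x", "b.example/"],
    "a.example/x": ["a.example/"],
    "b.example/": ["c.example/"],
    "c.example/": [],
}

pages = crawl("a.example/", lambda url: WEB.get(url, []))
print(pages)
```

A production crawler would also need politeness delays, robots.txt handling and revisit policies, which are outside this sketch.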
  • 54. • The second major impact of the Web on search is related to: • The size of the collection • The volume of user queries submitted on a daily basis • As a consequence, performance and scalability have become critical characteristics of the IR system
  • 55. • The third major impact : in a very large collection, predicting relevance is much harder than before • Fortunately, the Web also includes new sources of evidence • Ex: hyperlinks and user clicks in documents in the answer set
  • 56. • The fourth major impact derives from the fact that the Web is also a medium to do business • Search problem has been extended beyond the seeking of text information to also encompass other user needs • Ex: the price of a book, the phone number of a hotel, the link for downloading a software
  • 57. • The fifth major impact of the Web on search is Web spam • Web spam: abusive availability of commercial information disguised in the form of informational content • This difficulty is so large that today we talk of Adversarial Web Retrieval
  • 58. Practical Issues on the Web • 1. Security • Commercial transactions over the Internet are not yet a completely safe procedure • 2. Privacy • Frequently, people are willing to exchange information as long as it does not become public • 3. Copyright and patent rights • It is far from clear how the wide spread of data on the Web affects copyright and patent laws in the various countries • 4. Scanning, optical character recognition (OCR), and cross-language retrieval
  • 59. How People Search • Search tasks range from the relatively simple (e.g., looking up disputed facts or finding weather information) to the rich and complex (e.g., job seeking and planning vacations). • Search interfaces should support a range of tasks, while taking into account how people think about searching for information.
  • 60. Information Lookup versus Exploratory Search • User interaction with search interfaces differs depending on • the type of task • the domain expertise of the information seeker • the amount of time and effort available to invest in the process • Marchionini makes a distinction between information lookup and exploratory search • Information lookup tasks • are akin to fact retrieval or question answering • can be satisfied by discrete pieces of information: numbers, dates, names, or Web sites • can work well for standard Web search interactions
  • 61. • Exploratory search is divided into learning and investigating tasks • Learning search • i) requires more than single query-response pairs • ii) requires the searcher to spend time • scanning and reading multiple information items • synthesizing content to form new understanding
  • 62. • Investigating refers to a longer-term process which • involves multiple iterations that take place over perhaps very long periods of time • may return results that are critically assessed before being integrated into personal and professional knowledge bases • may be concerned with finding a large proportion of the relevant information available • Information seeking can be seen as being part of a larger process referred to as sensemaking
  • 63. The Classic versus the Dynamic Model of Information Seeking • Classic notion of the information seeking process: • problem identification • articulation of information need(s) • query formulation • results evaluation
  • 64. Navigation versus Search • Navigation: the searcher looks at an information structure and browses among the available information • This browsing strategy is preferable when the information structure is well-matched to the user’s information need • it is mentally less taxing to recognize a piece of information than it is to recall it • it works well only so long as appropriate links are available • If the links are not available, then the browsing experience might be frustrating • Spool discusses an example of a user looking for a software driver for a particular laser printer
  • 65. Search Process • Numerous studies have been made of people engaged in the search process • The results of these studies can help guide the design of search interfaces • One common observation is that users often reformulate their queries with slight modifications • Another is that searchers often search for information that they have previously accessed • The users’ search strategies differ when searching over previously seen materials • Researchers have developed search interfaces that support both query history and revisitation
  • 66. Search Interface Today • Getting Started • How does an information seeking session begin in online information systems? • The most common way is to use a Web search engine • Another method is to select a Web site from a personal collection of already-visited sites • which are typically stored in a browser’s bookmark • Online bookmark systems are popular among a smaller segment of users • Ex: Delicious.com • Web directories are also used as a common starting point, but have been largely replaced by search engines
  • 67. Query Specification • The primary methods for a searcher to express their information need are either • entering words into a search entry form • selecting links from a directory or other information organization display • For Web search engines, the query is specified in textual form • Typically, Web queries today are very short consisting of one to three words • Short queries reflect the standard usage scenario in which the user tests the waters: • If the results do not look relevant, then the user reformulates their query • If the results are promising, then the user navigates to the most relevant-looking Web site
  • 68. Query Specification Interfaces • The standard interface for a textual query is a search box entry form • Studies suggest a relationship between query length and the width of the entry form • Results found that either small forms discourage long queries or wide forms encourage longer queries
  • 70. • Some interfaces show a list of query suggestions as the user types the query • This is referred to as auto-complete, auto-suggest, or dynamic query suggestions • Anick et al. found that users clicked on dynamic Yahoo suggestions one third of the time
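One simple way to produce such suggestions is to match the typed prefix against a log of past queries and show the most frequent matches. This is only a sketch of that idea, not how any particular search engine does it; the query log and the function name `suggest` are invented for the example.

```python
from collections import Counter


def suggest(prefix, query_log, k=3):
    """Return up to k past queries starting with prefix, most frequent first."""
    prefix = prefix.lower()
    counts = Counter(q.lower() for q in query_log)
    matches = [(q, c) for q, c in counts.items() if q.startswith(prefix)]
    # Sort by descending frequency, then alphabetically for a stable order.
    matches.sort(key=lambda pair: (-pair[1], pair[0]))
    return [q for q, _ in matches[:k]]


# Hypothetical query log.
QUERY_LOG = [
    "information retrieval",
    "information retrieval",
    "inverted index",
    "information extraction",
    "indexing techniques",
    "inverted index",
    "information retrieval",
]

print(suggest("in", QUERY_LOG))
print(suggest("inf", QUERY_LOG, k=2))
```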
  • 72. Retrieval Result Display • When displaying search results, either • the documents must be shown in full, or else • the searcher must be presented with some kind of representation of the content of those documents • The document surrogate refers to the information that summarizes the document • This information is a key part of the success of the search interface • The design of document surrogates is an active area of research and experimentation • The quality of the surrogate can greatly affect the perceived relevance of the search results listing
  • 74. Query Reformulation • There are tools to help users reformulate their query • One technique consists of showing terms related to the query or to the documents retrieved in response to the query • A special case of this is spelling corrections or suggestions • Usually only one suggested alternative is shown: clicking on that alternative re-executes the query • In years past, the search results were shown using the purportedly incorrect spelling
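A single spelling suggestion can be produced by picking the vocabulary word closest to the typed word. The sketch below uses Levenshtein edit distance for "closest", which is one common choice and an assumption here; the vocabulary, the `max_distance` threshold and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def suggest_spelling(word, vocabulary, max_distance=2):
    """Return the closest vocabulary word, or None if nothing is close enough."""
    if not vocabulary:
        return None
    best = min(vocabulary, key=lambda v: (edit_distance(word, v), v))
    return best if edit_distance(word, best) <= max_distance else None


VOCABULARY = ["retrieval", "relevance", "ranking"]
print(suggest_spelling("retreival", VOCABULARY))
```

Returning only the single best candidate mirrors the interface behavior described above, where one alternative is shown and clicking it re-executes the query.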
  • 76. Organizing Search results • Organizing results into meaningful groups can help users understand the results and decide what to do next • Popular methods for grouping search results: category systems and clustering • Category system: meaningful labels organized in such a way as to reflect the concepts relevant to a domain
  • 78. Visualization in Search Interfaces • Experimentation with visualization for search has been primarily applied in the following ways: • Visualizing Boolean syntax • Visualizing query terms within retrieval results • Visualizing relationships among words and documents • Visualization for text mining
  • 79. Visualizing Boolean syntax • Boolean query syntax is difficult for most users and is rarely used in Web search • For many years, researchers have experimented with how to visualize Boolean query specification • A common approach is to show Venn diagrams • A more flexible version of this idea was seen in the VQuery system, proposed by Steve Jones • The VQuery interface for Boolean query specification
  • 81. Visualizing query terms within retrieval results • Understanding the role of the query terms within the retrieved docs can help relevance assessment • Experimental visualizations have been designed that make this role more explicit • In the TileBars interface, for instance, documents are shown as horizontal glyphs • The locations of the query term hits are marked along the glyph • The user is encouraged to break the query into its different facets, with one concept per line
  • 83. Visualizing relationships among words and documents • Numerous works proposed variations on the idea of placing words and docs on a two-dimensional canvas • In these works, proximity of glyphs represents semantic relationships among the terms or documents • An early version of this idea is the VIBE interface • Documents containing combinations of the query terms are placed midway between the icons representing those terms
  • 84. Visualization for text mining • Visualization is also used for purposes of analysis and exploration of textual data • Visualizations such as the Word Tree show a piece of a text concordance • It allows the user to view which words and phrases commonly precede or follow a given word • Another example is the NameVoyager, which shows frequencies of names for U.S. children across time
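The data behind a concordance view like the Word Tree is a count of which words follow (or precede) a given word. The sketch below computes only those counts, not the visualization; the sample text and the function name `following_words` are assumptions for illustration.

```python
import re
from collections import Counter


def following_words(text, word):
    """Count the words that immediately follow `word` in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == word)


SAMPLE = "If love is blind, love cannot hit the mark. Love is a smoke."
counts = following_words(SAMPLE, "love")
print(counts.most_common())
```

A Word Tree would draw these continuations as branches, sized by their counts.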