
2.5 Pre- and Post-Coordinate Indexing


2.5 PRE- AND POST-COORDINATE INDEXING

IMT 530 | 2008 | Tennis


Outline
• Indexing
• Some Principles of Indexing
• Coordination in Indexing
• Pre-Coordinate Indexing
• Post-Coordinate Indexing
• Evaluation


2.5.1 INDEXING

2.5.1 Indexing
• Purposes of indexing subjects
◦ According to Shera and Egan [4], subject indexing should ideally:
  ▪ Provide access by subject to all relevant material.
  ▪ Provide subject access to materials through all suitable principles of subject organization, e.g., matter, process, applications, etc.
  ▪ Bring together references to materials which treat of substantially the same subject regardless of disparities in terminology, disparities which may have resulted from national differences, differences among groups of subject specialists, and/or from the changing nature of the concepts within the discipline itself.
  ▪ Show affiliations among subject fields, affiliations which may depend upon similarities of matter studied, of methods, or of point of view, or upon use or application of knowledge.


2.5.1 Indexing
• Purposes of indexing subjects (cont.)
◦ According to Shera and Egan [4], subject indexing should ideally:
  ▪ Provide entry to any subject field at any level of analysis, from the most general to the most specific.
  ▪ Provide entry through any vocabulary common to any considerable group of users, specialized or lay.
  ▪ Provide a formal description of the subject content of any bibliographic unit in the most precise, or specific, terms possible, whether the description be in the form of a word or brief phrase or in the form of a class number or symbol.
  ▪ Provide means for the user to make selection from among all items in any particular category, according to any chosen set of criteria such as: most thorough, most recent, most elementary, etc.


2.5.1 Indexing
• Indexing is:
◦ Analysis of documents for significant characteristics in order to represent those characteristics in an information system for some user(s)


2.5.1 Indexing
• This definition has many parts that we want to understand:

1. Document analysis
2. Significant characteristics
3. Representing significant characteristics
4. Information system
5. User(s)


2.5.1 Indexing
• We have begun to understand document analysis
• Weeks 1.5 and 2.0 helped us start this understanding, as did Assignment #1


2.5.1 Indexing
• We know that documents have attributes, some of which they share and some of which differ
• We know we can add value to document analysis by identifying content, subject matter, and even different types of subject matter (e.g., through facet analysis)
• These attributes are the documents’ significant characteristics (they are significant to the user and to the purposes/functional requirements of the information system)


2.5.1 Indexing
• After we’ve done the analysis, we then try to represent what we’ve found (e.g., content, subject, etc.) using special terms
• Sometimes these terms are controlled, and sometimes they are not controlled (free)


2.5.1 Indexing
• Thus there are two major steps in indexing:
◦ Document analysis for significant characteristics
◦ Representing those characteristics using terms


2.5.1 Indexing
• Indexing is either derived or assigned:
◦ Derived indexing uses terms taken from the document itself
◦ Assigned indexing uses terms taken from another tool (an indexing language)
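
The difference between derived and assigned indexing can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stopword list, the `CONTROLLED` phrase-to-term mapping and both function names are hypothetical, and real assigned indexing relies on an indexer's judgment rather than simple phrase matching. The title is the one from the GridVine record shown later in these slides.

```python
import re

# Hypothetical stopword list; real systems use much longer ones.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}

# Hypothetical indexing language: trigger phrase -> controlled term.
CONTROLLED = {
    "peer information management": "information management",
    "peer-to-peer": "peer-to-peer computing",
}


def derived_terms(text):
    """Derived indexing: keep the document's own words, minus stopwords."""
    words = re.findall(r"[a-z0-9-]+", text.lower())
    kept = []
    for word in words:
        if word not in STOPWORDS and word not in kept:
            kept.append(word)
    return kept


def assigned_terms(text, vocabulary):
    """Assigned indexing: emit only terms that exist in the indexing language."""
    lowered = text.lower()
    return sorted({term for phrase, term in vocabulary.items() if phrase in lowered})


title = "GridVine: an infrastructure for peer information management"
print(derived_terms(title))
print(assigned_terms(title, CONTROLLED))
```

Derived terms can only ever be words the author happened to use, while assigned terms can only ever be entries the vocabulary already contains.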



2.5.1 Indexing
• Indexing builds indexes and indexing languages
• These are tools that help people find documents
• And there are multiple ways of building these tools and multiple types of tools


2.5.1 Indexing
• Index – Latin origin – to point
• Terms (sometimes called Descriptors or Key Words, depending on the type of indexing tool) are used to signal the presence of a concept in a document, sometimes to signal its subject, sometimes its content, or both


2.5.1 Indexing
• An Index or Indexing Language is:
◦ A set of representations
◦ Systematically ordered
◦ Provides access to the conceptual content
◦ Indicates or establishes relationships
  ▪ Between terms to denote concepts
  ▪ Between natural language and terms used to denote concepts


2.5.1 Indexing
• An index is used for one and only one resource (document)
• An indexing language is used for many resources (documents)
• An index manifests as a list that points to content within the resource.


2.5.1 Indexing
• An indexing language (sometimes called a controlled vocabulary) is a list used to index multiple documents, and is a tool for databases


2.5.1 Indexing
Information Management

Used For: Data Management

Broader Terms    Related Terms                     Narrower Terms
Management       Information Retrieval Systems     Competitive Intelligence
                 Information Science               Knowledge Management
                 Management Information Systems    Markup Languages
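
One way to picture such an entry as a data structure is sketched below. The `ThesaurusEntry` class and the `preferred_term` helper are hypothetical names, and the assignment of terms to the broader, related and narrower lists is an assumption about the slide's column layout (one broader term, three related, three narrower). The sketch shows how a "Used For" reference links a searcher's natural-language wording to the term the indexing language prefers.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ThesaurusEntry:
    """One preferred term in an indexing language (hypothetical structure)."""
    term: str
    used_for: List[str] = field(default_factory=list)
    broader: List[str] = field(default_factory=list)
    related: List[str] = field(default_factory=list)
    narrower: List[str] = field(default_factory=list)


# Column assignment below is an assumed reading of the slide.
entry = ThesaurusEntry(
    term="Information Management",
    used_for=["Data Management"],
    broader=["Management"],
    related=[
        "Information Retrieval Systems",
        "Information Science",
        "Management Information Systems",
    ],
    narrower=["Competitive Intelligence", "Knowledge Management", "Markup Languages"],
)


def preferred_term(query: str, entries: List[ThesaurusEntry]) -> Optional[str]:
    """Map a searcher's wording to the preferred term, if the vocabulary has it."""
    wanted = query.strip().lower()
    for e in entries:
        if wanted == e.term.lower() or wanted in (u.lower() for u in e.used_for):
            return e.term
    return None


print(preferred_term("data management", [entry]))
```

A searcher who types "data management" is led to "Information Management", which is the relationship between natural language and concept-denoting terms described earlier.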



2.5.1 Indexing
• An indexing language is used in databases for retrieval, in conjunction with other attributes of documents.
• Once assigned to documents, the terms from the indexing language manifest as an inverted file, which lists the connections between terms and the documents containing those terms:


2.5.1 Indexing
Indexing in information systems creates an inverted file
T = Term
D = Document

D1 D2 D3 D4 D5 D6 D7
T1 X X X
T2 X X
T3 X X X X
T4 X X
T5 X X
T6 X X
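
A minimal sketch of how an inverted file is built from indexing decisions follows. The document-to-term assignments in `assigned_terms` are invented for illustration and do not reproduce the matrix above; `build_inverted_file` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical assignments: document ID -> terms the indexer assigned to it.
assigned_terms = {
    "D1": ["T1", "T3"],
    "D2": ["T2", "T3"],
    "D3": ["T1", "T4"],
    "D4": ["T3", "T5"],
}


def build_inverted_file(doc_terms):
    """Invert document -> terms into term -> sorted list of documents."""
    inverted = defaultdict(set)
    for doc_id, terms in doc_terms.items():
        for term in terms:
            inverted[term].add(doc_id)
    return {term: sorted(docs) for term, docs in sorted(inverted.items())}


inverted_file = build_inverted_file(assigned_terms)
for term, docs in inverted_file.items():
    print(term, "->", " ".join(docs))
```

Each row of the result corresponds to one row of a term-by-document matrix: the listed documents are the columns that would carry an X.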

Attributes: Values of Attributes

Title: GridVine: an infrastructure for peer information management

Authors: Cudre-Mauroux, P. (1); Agarwal, S. (1); Aberer, K. (1)

Author affiliation: (1) Ecole Polytechnique Fed. de Lausanne, Lausanne, Switzerland

Serial title: IEEE Internet Computing

Abstract: GridVine is a semantic overlay infrastructure based on a peer-to-peer (P2P) access structure. Built following the principle of data independence, it separates a logical layer - in which data, schemas, and schema mappings are managed - from a physical layer consisting of a structured P2P network supporting decentralized indexing, key load-balancing, and efficient routing. The system is decentralized, yet fosters semantic interoperability through pair-wise schema mappings and query reformulation. GridVine's heterogeneous but semantically related information sources can be queried transparently using iterative query reformulation. The authors discuss a reference implementation of the system and several mechanisms for resolving queries collaboratively.

Inspec controlled terms: grid computing - information management - open systems - peer-to-peer computing - query formulation - semantic Web

Uncontrolled terms: GridVine - peer information management - semantic overlay infrastructure - peer-to-peer access structure - data independence - logical layer - physical layer - decentralized indexing - key load-balancing - decentralized system - semantic interoperability - pair-wise schema mappings - iterative query reformulation

Inspec classification codes: C7250R Information retrieval techniques - C6150N Distributed systems software - C7210N Information networks


2.5.2 SOME PRINCIPLES OF INDEXING


2.5.2 Some Principles of Indexing
• Indexing aligns terms with concepts, as we said above:
• Indexing indicates or establishes relationships
  ▪ Between terms to denote concepts
  ▪ Between natural language and terms used to denote concepts


2.5.2 Some Principles of Indexing
• Exhaustivity
◦ identification of all concepts (terms) to signal content of document
◦ representation of ALL concepts (terms) to signal content of document

• Specificity
◦ identification of the precise concepts (terms) to signal content of document
◦ representation of precisely those and only those concepts (terms) to signal content of document


2.5.2 Some Principles of Indexing
Specificity

Here we have a document with the title: Architecture of Cathedrals in Spain

Terms, from most general to most specific:
Architecture
Ecclesiastical Architecture
Architecture of Cathedrals
Architecture of Cathedrals in Europe
Architecture of Cathedrals in Spain
Architecture of Cathedrals in Granada


2.5.2 Some Principles of Indexing
Exhaustivity

Here we have a document with the title: Architecture of Cathedrals in Spain

Architecture
Holy Places in Spain
Alhambra
Granada, Spain
Ecclesiastical Architecture
Architecture of Cathedrals


2.5.3 COORDINATION IN INDEXING

2.5.3 Coordination in Indexing
• Coordination of Terms for Indexing and Searching
• Pre-coordinate – indexer "coordinates" terms, pulling them together
◦ United States–History–Civil War, 1861-1865–Literature and the war
◦ Church maintenance and repair (Ecclesiastical law)
• Post-coordinate – searcher "coordinates" terms, pulling them together, not the indexer
◦ United States; Civil War, 1861-1865; Literature; War
◦ Churches; Maintenance and Repair; Ecclesiastical Law


2.5.4 PRE-COORDINATE INDEXING

2.5.4 Pre-Coordinate Indexing
• Since the indexer is coordinating (combining, pulling together) concepts, they engage in an act of synthesis to build one long index entry.
• This can happen in three ways:


2.5.4 Pre-Coordinate Indexing
• Represent a single subject
◦ flowers, or flowers and shrubs
• Represent an aspect of a single subject
◦ fertilization of flowers, arrangement of flowers
• Represent two or more subjects treated in relation to one another
◦ flowers in art, flowers in religion, folklore, etc.


2.5.4 Pre-Coordinate Indexing
• Synthesis may require:
◦ one specific subject heading
◦ more than one subject heading

• One Specific Index Entry:
◦ Opening of the eyes of one blind at Bethsaida (Miracle)
◦ Church maintenance and repair (Ecclesiastical law)
◦ Suites (Clarinets (2), horns (2), violins (2), viola, double bass)


2.5.4 Pre-Coordinate Indexing
• More than One Index Entry linked together:
◦ Military communication equipment industry–United States.
◦ Artificial satellites in telecommunication.
◦ Defense contracts–United States.


2.5.5 POST-COORDINATE INDEXING
• The opposite of pre-coordinate indexing
• The searcher controls how terms are combined for a search, and therefore the set of documents retrieved.


2.5.5 Post-Coordinate Indexing
• grid computing
• information management
• open systems
• peer-to-peer computing
• query formulation
• semantic Web
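
Because each of these terms is stored separately, the searcher can combine them at search time. The sketch below shows that idea with Boolean AND and OR over a small inverted file. The document IDs and postings are invented for illustration, and `search_all` and `search_any` are hypothetical helper names.

```python
# Hypothetical inverted file: term -> set of document IDs indexed with it.
inverted_file = {
    "grid computing": {"D1", "D2", "D5"},
    "information management": {"D1", "D3", "D5"},
    "peer-to-peer computing": {"D1", "D4", "D5"},
    "semantic Web": {"D1", "D6"},
}


def search_all(terms, index):
    """Boolean AND: documents indexed with every one of the terms."""
    postings = [index.get(term, set()) for term in terms]
    if not postings:
        return set()
    return set.intersection(*postings)


def search_any(terms, index):
    """Boolean OR: documents indexed with at least one of the terms."""
    return set().union(*(index.get(term, set()) for term in terms))


narrow = search_all(
    ["grid computing", "peer-to-peer computing", "semantic Web"], inverted_file
)
broad = search_any(["grid computing", "semantic Web"], inverted_file)
print(sorted(narrow))
print(sorted(broad))
```

The indexer only assigned single terms; which combination counts as "the subject" of a query is decided by the searcher, which is what distinguishes this from a pre-coordinated heading.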



2.5.6 Evaluation
• Control of concept formation
• Relevance of documents retrieved
• Trade-offs between methods of indexing


2.5.6 Evaluation – Concept Control

Control over search interaction and indexing language:

Indexer Document Searcher

Pre-coordinate X
Post-coordinate, Assignment X
Post-coordinate, Derived X X
Free Text X X


2.5.6 Evaluation – Concept Control
Control over concept formation:

Indexer Searcher

Pre-coordinate X
Post-coordinate, Assignment X X
Post-coordinate, Derived X X
Free Text X


2.5.6 Evaluation – Relevance
• Measuring Indexing Language Effectiveness [1]
• Precision and Recall
◦ measures of effective information retrieval
◦ measures of search results
◦ NOT measures of analysis, but helpful in understanding this type of system


2.5.6 Evaluation - Relevance
• Recall:
◦ Relevant documents retrieved / Total relevant documents in the system = % recall

• Precision:
◦ Relevant documents retrieved / Total documents retrieved = % precision


2.5.6 Evaluation - Relevance
Action/Judgment by        Judgment by Searcher
Information System        Relevant                       Not Relevant                     Total

Retrieved                 A (correctly retrieved): 30    B (falsely retrieved): 35        65
Not retrieved             C (missed): 20                 D (correctly rejected): 9,915    9,935
Total                     50                             9,950                            10,000

Table taken from [3]


2.5.6 Evaluation – Relevance
• Recall =
◦ A/(A+C)
◦ 30/50
◦ 60%

• Precision =
◦ A/(A+B)
◦ 30/65
◦ 46%

• Thus, in this search, we retrieved 60% of the relevant documents in the system, and 46% of the documents retrieved were relevant
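
The same arithmetic can be checked with a short Python sketch. The cell values A, B, C and D are the ones from the table taken from [3]; the function names are chosen here only for illustration.

```python
def recall(a, c):
    """Share of all relevant documents that the search retrieved: A / (A + C)."""
    return a / (a + c)


def precision(a, b):
    """Share of retrieved documents that were relevant: A / (A + B)."""
    return a / (a + b)


# Cells of the retrieval table: A correctly retrieved, B falsely retrieved,
# C missed, D correctly rejected.
A, B, C, D = 30, 35, 20, 9915

print(f"recall    = {recall(A, C):.0%}")
print(f"precision = {precision(A, B):.0%}")
```

Note that D, the 9,915 correctly rejected documents, enters neither formula, so the size of the collection does not affect either measure directly.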



2.5.6 Evaluation - Relevance
• Relationships between Recall and Precision, and Specificity and Exhaustivity

• High Specificity =
◦ High precision
◦ Low recall
• High Exhaustivity =
◦ Low precision
◦ High recall


2.5.6 Evaluation – Challenges
• Challenges for Indexing
◦ Source(s) of evidence for indexing (warrant)
◦ Consistency in indexing


2.5.6 Evaluation - Challenges
• Valid source of evidence for indexing (warrant)
◦ Text?
◦ Users?
◦ Experts?
◦ Domain?
◦ Structure of Vocabulary (e.g., Masonry Vaults alongside Brick Vaults)
◦ Mixture of them?


2.5.6 Evaluation – Challenges
• Consistency
◦ Most controlled vocabularies are built with the assumption of consistency
◦ How is this instructed, maintained, tested, evaluated?
◦ Are two indexers consistent?
◦ Are all the concepts consistently represented for users across various documents?
◦ Are all relationships between concepts represented to lead users through the system?


References
1. Cleverdon, C. W. (1962). ASLIB Cranfield research project: report on the test and analysis of an investigation into the comparative efficiency of indexing systems. An investigation supported by a grant from the National Science Foundation, Washington.

2. Lancaster, F. W. (2003). Indexing and abstracting in theory and practice. Champaign, IL: University of Illinois Press, Graduate School of Library and Information Science.

3. Soergel, D. (1985). Organizing information: principles of data base and retrieval systems. San Diego, CA: Academic Press.

4. Shera, J. H. and Egan, M. (1956). Classified catalog: basic principles and practices. Chicago: American Library Association.