UNIT-3

UNIT-3: SYLLABUS

• Automatic Indexing: Classes of automatic indexing, Statistical indexing, Natural language, Concept indexing, Hypertext linkages.
• Document and Term Clustering: Introduction, Thesaurus generation, Item clustering, Hierarchy of clusters.
AUTOMATIC INDEXING
 Automatic indexing is the computerized process of scanning large
volumes of documents against a controlled vocabulary, taxonomy,
thesaurus or ontology and using those controlled terms to quickly and
effectively index large electronic document depositories.
 Automatic indexing is the process of analyzing an item to extract
the information to be permanently kept in an index.
 This process is associated with the generation of the searchable data
structures associated with an item.
AUTOMATIC INDEXING

Figure: Data Flow in an Information Processing System


CLASSES OF AUTOMATIC INDEXING

• Search strategies can be classified as:
 Statistical
 Natural language
 Concept
 Hypertext linkages (a special class of indexing)


STATISTICAL
 Statistical strategies cover the broadest range of indexing techniques and are the most prevalent in commercial systems. The basis for a statistical approach is the use of frequency of occurrence of events.
 The statistics that are applied to the event data are probabilistic, Bayesian, vector space, and neural net.
NATURAL LANGUAGE
 Natural language approaches perform the same processing token identification as statistical techniques, but additionally perform varying levels of natural language parsing of the item.
 This parsing disambiguates the context of the
processing tokens and generalizes to more abstract
concepts within an item (e.g., present, past, future
actions). This additional information is stored within the
index to be used to enhance the search precision.
CONCEPT
 Concept indexing uses the words within an item to
correlate to concepts discussed in the item.
 This is a generalization of the specific words to
values used to index the item.
 When generating the concept classes automatically,
there may not be a name applicable to the concept but
just a statistical significance.
HYPERTEXT LINKAGES
o Finally, a special class of indexing can be defined by the creation of hypertext linkages.
o These linkages provide virtual threads of concepts between items, versus directly defining the concept within an item.
STATISTICAL INDEXING
 Statistical indexing uses frequency of occurrence of
events to calculate a number that is used to indicate the
potential relevance of an item.
 Probabilistic systems attempt to calculate a probability
value that should be invariant to both calculation method
and text corpora.
 The Bayesian and vector approaches calculate a relative relevance value indicating how likely a particular item is to be relevant. Quite often, term distributions across the searchable database are used in the calculations.
STATISTICAL INDEXING
• Methods:
 Probabilistic Weighting
⚫ Logistic regression
 Vector Weighting
⚫ Simple Term Frequency Algorithm
⚫ Inverse Document Frequency
⚫ Signal Weighting
⚫ Discrimination Value
 Bayesian Model
PROBABILISTIC WEIGHTING
• The probabilistic approach is based upon direct application of the theory of probability to information retrieval systems.
 This has the advantage of being able to use the developed formal theory of probability to direct the algorithmic development.
 It also leads to an invariant result that facilitates integration of results from different databases.
 The use of probability theory is a natural choice because it is the basis of evidential reasoning (i.e., drawing conclusions from evidence).
 This is summarized by the Probability Ranking Principle (PRP).
PROBABILISTIC WEIGHTING
 There are many different areas in which the probabilistic approach may be applied.
 In this method, the significance of a document is estimated on the basis of a probability value.
 Example: a user types the query "What is Java?"
• The query is compared against the documents in the database (Doc 1, Doc 3, Doc 7, ..., Doc m) to decide which document is most specific to the user query.
• The documents are ranked according to their probability of relevance and displayed by the system to the user in that order.
PROBABILISTIC WEIGHTING
Applications of Probabilistic Statistical Indexing:
• It is used in the logistic regression / logistic reference model. This model consists of a special system called the "Model 0" system.
• This system includes the following components:
(a) the number of words in the document (d);
(b) the number of words in the query (q).
• In addition to these there are attributes. Attributes are classified into the following types:
(i) Query attributes: how many times a particular word has occurred in the query.
(ii) Document attributes: how many times a particular word has occurred in the document.
• A set of attribute values (v_1, ..., v_n) is derived from the query.
PROBABILISTIC WEIGHTING
 Log O is the logarithm of the odds (log odds) of relevance for term t_k, which is present in document d_j and query q_i:

log(O(R | q_i, d_j, t_k)) = c_0 + c_1*v_1 + c_2*v_2 + ... + c_n*v_n

• The log odds that the i-th query is relevant to the j-th document is the sum of the log odds for all terms:

log(O(R | Q_i, D_j)) = Σ_k [ log(O(R | q_i, d_j, t_k)) − log(O(R)) ] + log(O(R))

• where O(R) is the odds that a document chosen at random from the database is relevant to query Q_i.
• The inverse logistic transformation is applied to obtain the probability of relevance of a document to a query:

P(R | Q_i, D_j) = 1 / (1 + e^(−log O(R | Q_i, D_j)))
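Below is a minimal sketch of this computation in Python, assuming invented coefficients and attribute values (a real Model 0 system estimates the coefficients by fitting logistic regression to relevance-judged training data):

```python
import math

# Hypothetical coefficients c0..c3, invented for illustration; a real
# system estimates them from relevance-judged training data.
COEFFS = [-3.70, 1.27, -0.31, 0.68]

def term_log_odds(v):
    """log O(R | q, d, t_k) = c0 + c1*v1 + ... + cn*vn for one term."""
    return COEFFS[0] + sum(c * x for c, x in zip(COEFFS[1:], v))

def relevance_probability(term_attributes, log_odds_prior=-4.0):
    """Sum per-term log odds, then apply the inverse logistic transform."""
    total = sum(term_log_odds(v) - log_odds_prior
                for v in term_attributes) + log_odds_prior
    return 1.0 / (1.0 + math.exp(-total))

# Two query terms, each with attribute values (v1, v2, v3),
# e.g. query TF, document TF, and IDF.
print(relevance_probability([[1.0, 2.0, 3.5], [2.0, 1.0, 5.1]]))
```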


VECTOR WEIGHTING
 One of the earliest systems that investigated statistical approaches to information retrieval was the SMART system at Cornell University. The system is based upon a vector model.
 In information retrieval, each position in the vector typically represents a processing token.
 There are two approaches to the domain of values in the vector:
1) Binary
2) Weighted
 In the binary approach the domain contains the values one and zero:
 1 represents the existence of the processing token in the item.
 0 represents the non-existence of the processing token in the item.
VECTOR WEIGHTING
 In the weighted approach the domain is typically the set of all positive real numbers.
 The value for each processing token represents the relative importance of that processing token in representing the semantics of the item.
VECTOR WEIGHTING
• The figure shows how an item that discusses petroleum refineries in Mexico would be represented. In the example, the major topics discussed are indicated by the index terms for each column.

Figure: Binary and Vector Representation of an Item

Figure: Three-dimensional vector representation for the processing tokens Petroleum, Mexico and Oil.
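A minimal sketch of the two value domains, with invented index terms and token counts:

```python
# Illustrative index terms for a small collection; not from a real corpus.
INDEX_TERMS = ["petroleum", "mexico", "oil", "taxes", "refineries", "shipping"]

def binary_vector(item_tokens):
    """1 if the processing token occurs in the item, else 0."""
    present = set(item_tokens)
    return [1 if term in present else 0 for term in INDEX_TERMS]

def weighted_vector(item_tokens):
    """Raw term frequency as the (simplest possible) weight."""
    return [item_tokens.count(term) for term in INDEX_TERMS]

tokens = ["petroleum", "mexico", "oil", "petroleum", "refineries", "oil", "oil"]
print(binary_vector(tokens))    # [1, 1, 1, 0, 1, 0]
print(weighted_vector(tokens))  # [2, 1, 3, 0, 1, 0]
```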
VECTOR WEIGHTING
 There are many algorithms that can be used in calculating the weights used to represent a processing token. The following subsections present the major algorithms:
⚫ Simple Term Frequency Algorithm
⚫ Inverse Document Frequency
⚫ Signal Weighting
⚫ Discrimination Value
SIMPLE TERM FREQUENCY ALGORITHM
 This is one of the algorithms in vector weighting, used to estimate the weights of terms in a particular document.
 In a statistical system, the data that are potentially available for
calculating a weight are
 The frequency of occurrence of the processing token in an existing
item (i.e., Term frequency - TF),
 The frequency of occurrence of the processing token in the existing
database (i.e., total frequency -TOTF)
 The number of unique items in the database that contain the
processing token (i.e., item frequency - I F , frequently labeled in
other publications as document frequency - DF).
 The simplest approach is to have the weight equal to the term
frequency.
• Ex: If the word "computer" occurs 15 times within an item it has
a weight of 15.
• In short documents, the frequency of occurrence of a term is low; in longer documents, the frequency of occurrence is high.
• Objective: reduce this bias across the different documents and words in a database. That is achieved through normalization.
• An example of this normalization in calculating term frequency is the algorithm used in the SMART system at Cornell. The term frequency weighting formula used in TREC 4 was:

TF_weight = (1 + log(TF)) / (1 + log(average TF)) / ((1 − slope) * pivot + slope * (number of unique terms))

• where slope was set at 0.2 and the pivot was set to the average number of unique terms occurring in the collection.
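A sketch of this pivoted normalization, with the formula's inputs passed in explicitly (the collection statistics shown are invented):

```python
import math

def pivoted_tf_weight(tf, avg_tf, n_unique, pivot, slope=0.2):
    """Pivoted-unique-normalized term frequency (TREC-4 SMART style).

    tf       -- occurrences of the term in the item
    avg_tf   -- average term frequency over the terms in the item
    n_unique -- number of unique terms in the item
    pivot    -- average number of unique terms across the collection
    """
    if tf == 0:
        return 0.0
    damped = (1.0 + math.log(tf)) / (1.0 + math.log(avg_tf))
    norm = (1.0 - slope) * pivot + slope * n_unique
    return damped / norm

# An item with 120 unique terms in a collection averaging 100 unique terms.
print(pivoted_tf_weight(tf=15, avg_tf=2.5, n_unique=120, pivot=100))
```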
INVERSE DOCUMENT FREQUENCY

• The basic algorithm is improved by taking into consideration the frequency of occurrence of the processing token in the database.
• One of the objectives of indexing an item is to discriminate the semantics of that item from other items in the database.
• If the token "computer" occurs in every item in the database, it does not discriminate between items, even though its meaning may differ from one document to another.
• The term "computer" represents a concept used in an item, but it does not help a user find the specific information being sought, since a search on it returns the complete database.
INVERSE DOCUMENT FREQUENCY

• This leads to the general statement, enhancing weighting algorithms, that the weight assigned to a term should be inversely proportional to the frequency of occurrence of that term in the database.
• This algorithm is called inverse document frequency (IDF). The un-normalized weighting formula is:

WEIGHT_ij = TF_ij * (log2(n / IF_j) + 1)

• where WEIGHT_ij is the vector weight that is assigned to term "j" in item "i",
• TF_ij (term frequency) is the frequency of term "j" in item "i",
• "n" is the number of items in the database, and
• IF_j (item frequency or document frequency) is the number of items in the database that contain term "j".
For example:
• Total items in the database = 2048
• Term "oil" is found in 128 items
• Term "Mexico" is found in 8 items
• Term "refinery" is found in 10 items
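A small worked sketch of these numbers under the IDF formula above, assuming each term occurs once in the item (TF = 1):

```python
import math

def idf_weight(tf, n_items, item_freq):
    """Un-normalized IDF weight: TF * (log2(n / IF) + 1)."""
    return tf * (math.log2(n_items / item_freq) + 1)

N = 2048
print(idf_weight(1, N, 128))  # oil:      log2(16)    + 1 = 5.0
print(idf_weight(1, N, 8))    # mexico:   log2(256)   + 1 = 9.0
print(idf_weight(1, N, 10))   # refinery: log2(204.8) + 1 ~= 8.68
```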
SIGNAL WEIGHTING
• A drawback of the inverse document frequency algorithm is that it does not reflect how the occurrences of a term are distributed among the documents that contain it.
• Two terms can have the same total frequency but very different distributions across documents; because IDF ignores this, documents are not properly ranked.
SIGNAL WEIGHTING
• Signal weighting addresses this by using a concept from Shannon's information theory.
• In information theory, the information content value of an object is inversely proportional to the probability of occurrence of the object:

INFORMATION = −log2(p)

• where p is the probability of occurrence of the event.
• The information value for an event that occurs 50 per cent of the time is:

INFORMATION = −log2(0.50) = −(−1) = 1
SIGNAL WEIGHTING
• The above formula applies to a single event. When a term k occurs in multiple documents, the average information is:

AVE_INFO_k = Σ_{j=1..n} p_j * log2(1 / p_j)

• where p_j is the ratio of the frequency of term k in item j to the total frequency of term k in the database (p_j = TF_jk / TOTF_k).
• The following formula for calculating the weighting factor, called Signal, can be used:

Signal_k = log2(TOTF_k) − AVE_INFO_k
SIGNAL WEIGHTING
• Producing a final formula of:

Weight_ik = TF_ik * Signal_k
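A minimal sketch of the signal computation for a single term, given its per-item frequencies (counts invented); note that a term concentrated in few items gets a higher signal than one spread uniformly:

```python
import math

def signal_weight(item_tfs):
    """Signal = log2(TOTF) - sum over items of p * log2(1/p), p = tf/TOTF."""
    totf = sum(item_tfs)
    ave_info = sum((tf / totf) * math.log2(totf / tf)
                   for tf in item_tfs if tf > 0)
    return math.log2(totf) - ave_info

print(signal_weight([10, 1, 1]))  # concentrated occurrences: ~2.77
print(signal_weight([4, 4, 4]))   # uniform occurrences:      2.0
```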
DISCRIMINATION VALUE
 Another approach to creating a weighting algorithm is to base it upon the discrimination value of a term.
 To achieve the objective of finding relevant items, it is important that the index discriminates among items.
 The more all items appear the same, the harder it is to identify those that are needed.
 Salton and Yang proposed a weighting algorithm that takes into consideration the ability of a search term to discriminate among items.
 They proposed use of a discrimination value for each term "i":

DISCRIM_i = AVESIM_i − AVESIM

• where AVESIM is the average similarity between every item in the database, and AVESIM_i is the same calculation except that term "i" is removed from all items. The final weight is:

Weight_ik = TF_ik * DISCRIM_k
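A sketch of the discrimination value, assuming cosine similarity as the item-to-item measure (the source does not fix a particular similarity function):

```python
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def average_similarity(vectors):
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def discrimination_value(vectors, i):
    """DISCRIM_i = AVESIM_i (term i removed) - AVESIM (all terms)."""
    without = [[w for j, w in enumerate(v) if j != i] for v in vectors]
    return average_similarity(without) - average_similarity(vectors)

items = [[5, 2, 0], [5, 0, 3], [5, 1, 1]]  # rows = items, cols = terms
print(discrimination_value(items, 0))  # common term 0: negative (poor discriminator)
print(discrimination_value(items, 2))  # rarer term 2:  positive (good discriminator)
```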
Problems With Weighting Schemes
• The two weighting schemes, inverse document frequency and signal, use total frequency and item frequency factors, which makes them dependent upon the distribution of processing tokens within the database. As the database changes, so do the weights. There are three options for handling this:
– a. Ignore the variances and calculate weights based upon current
values, with the factors changing over time. Periodically rebuild the
complete search database.
– b. Use a fixed value while monitoring changes in the factors. When the
changes reach a certain threshold, start using the new value and update
all existing vectors with the new value.
– c. Store the invariant variables (e.g., term frequency within an item) and, at search time, calculate the latest weights for the processing tokens in the items needed for the search terms.
Problems With the Vector Model
• Each processing token must be viewed as a separate semantic topic, even though terms in real text are related.
• Another major limitation of a vector space is that proximity searching (e.g., term "a" within 10 words of term "b") is not possible, because positional information is lost.
BAYESIAN MODEL
• One way of overcoming the restrictions inherent in a vector model is
to use a Bayesian approach to maintaining information on
processing tokens.
• In its most general definition, the Bayesian approach is based upon
conditional probabilities (e.g., Probability of Event 1 given Event 2
occurred).
• The Bayesian formula is P(REL | DOC_i, Query_j),
– where REL denotes the event that the document is relevant.
• A Bayesian network can be used to determine the weights of specific items.
• The network has at most "m" topics, and each topic has at most "n" words.
BAYESIAN MODEL

Figure: Bayesian network of "m" topics and the "n" words under each topic.
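A minimal sketch of the conditional-probability idea, assuming naive independence between query terms and invented per-document term probabilities (neither assumption comes from the source):

```python
import math

def log_p_relevance(query_terms, doc_term_probs, prior_rel=0.5):
    """log P(REL) + sum of log P(term | REL, doc), naive independence."""
    score = math.log(prior_rel)
    for term in query_terms:
        # Unseen terms get a tiny smoothing probability.
        score += math.log(doc_term_probs.get(term, 1e-6))
    return score

# Invented P(term | REL, doc) estimates for one document.
doc_probs = {"petroleum": 0.05, "mexico": 0.02, "oil": 0.08}
print(log_p_relevance(["petroleum", "mexico"], doc_probs))
```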
NATURAL LANGUAGE PROCESSING
• The goal of natural language processing is to use the semantic information, in addition to the statistical information, to enhance the indexing of the item.
• This improves the precision of searches, reducing the number of false hits a user reviews.
– Document indexing can be improved semantically, by treating a multi-word phrase as a single entity when processing it.
– Document indexing can be improved statistically, by using proximity searching.
For example, for "computer programming":
computer ADJ programming
programming ADJ computer
NATURAL LANGUAGE PROCESSING
• The NLP system also plays an important role in specifying relationships such as semantic and general statistical relationships.
• The determination of each relationship is carried out through the following steps (in the DR-LINK system):
• Step 1: Map the concepts in a document to special subject codes.
– These codes are specified in LDOCE. Through them, theoretical as well as statistical relationships are established among the word phrases in a document.
• Step 2: Once relationships have been identified, the DR-LINK system determines the general relationships among the concepts in a document by using a special tool known as text structure.
• Step 3: The DR-LINK system identifies semantic relationships called connector relationships.
• Step 4: In this step, the DR-LINK system assigns final weight values to the above-mentioned relationships.
NATURAL LANGUAGE PROCESSING
• Index Phrase Generation:
• Another method used in NLP is the creation of index phrases. To implement this method, a special algorithm due to Salton is used.
• This algorithm uses a special factor called the "cohesion" factor for the creation of phrases:

COHESION_k,h = SIZE-FACTOR * (PAIR-FREQ_k,h / (TOTF_k * TOTF_h))

• where PAIR-FREQ_k,h is the rate of co-occurrence of the pair of terms "k" and "h" in the same document, TOTF_k and TOTF_h are the total frequencies of the two terms, and SIZE-FACTOR is a normalization factor.
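A minimal sketch of the cohesion calculation (all counts invented; SIZE-FACTOR is assumed to be 1.0):

```python
def cohesion(pair_freq, totf_k, totf_h, size_factor=1.0):
    """COHESION(k, h) = SIZE-FACTOR * PAIR-FREQ / (TOTF_k * TOTF_h)."""
    return size_factor * pair_freq / (totf_k * totf_h)

# A pair that co-occurs often relative to its terms' total frequencies
# scores far higher than an incidental pair (all counts invented).
print(cohesion(pair_freq=40, totf_k=100, totf_h=60))   # ~6.7e-03
print(cohesion(pair_freq=2, totf_k=500, totf_h=300))   # ~1.3e-05
```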
NATURAL LANGUAGE PROCESSING
• This initial algorithm has been modified in the SMART system to be based on the following guidelines:
– Any pair of adjacent non-stop words is a potential phrase
– Any pair must exist in 25 or more items
– Phrase weighting uses a modified version of the SMART system single
term algorithm
– Normalization is achieved by dividing by the length of the single-term
subvector.
• Statistical approaches tend to focus on two term phrases. A
major advantage of natural language approaches is their
ability to produce multiple-term phrases to denote a single
concept.
• Tagged Text Parser (TTP): phrases can also be determined in a lexical format by using part-of-speech tags. These tags can be divided into various classes.

Table: part-of-speech tag classes with examples.

• This structure allows for identification of potential term phrases, usually based upon noun identification. To determine whether a header-modifier pair warrants indexing, TTP calculates a value for Informational Contribution (IC) for each element in the pair:

IC(x, [x,y]) = f_x,y / (n_x + D_x − 1)

• where f_x,y is the frequency of the pair (x,y) in the database, n_x is the number of pairs in which x occurs, and D_x is the dispersion of x (the number of distinct words with which x forms pairs).
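A minimal sketch of the IC calculation with invented counts:

```python
def informational_contribution(pair_freq, n_x, d_x):
    """IC(x, [x, y]) = f_xy / (n_x + D_x - 1).

    pair_freq -- frequency of the pair (x, y) in the database
    n_x       -- number of pairs in which x occurs
    d_x       -- dispersion: number of distinct words paired with x
    """
    return pair_freq / (n_x + d_x - 1)

# A head word that mostly appears with one modifier contributes more
# information than one paired with many different words (counts invented).
print(informational_contribution(12, 15, 3))   # ~0.71
print(informational_contribution(12, 90, 40))  # ~0.09
```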
CONCEPT INDEXING
• Through the natural language processing method, a concept in a document can be represented in the following ways:
– i. Semantically
– ii. Statistically
– iii. Generally
• In concept indexing, by contrast, concepts are represented in a more generalized format. Consider a document containing the term "automobile". The term could be associated with concepts such as:
• "vehicle,"
• "transportation,"
• "mechanical device,"
• "fuel," and "environment."
• The disadvantage of this approach is that it is difficult to determine which concept is most appropriate for "automobile". To overcome this, special types of algorithms, called neural network algorithms, are implemented by this indexing method.
• A term can have different weights associated with different concepts, as described. The definition of a similar context is typically defined by the number of non-stop words separating the terms.

Figure: Concept Vector for "Automobile"
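A sketch of such a concept vector for "automobile" (the weights are invented; a real system would learn them, e.g., with the neural network algorithms mentioned above):

```python
# Concept vector: weights of the term "automobile" toward each concept.
concept_vector = {
    "vehicle": 0.45,
    "transportation": 0.30,
    "mechanical device": 0.15,
    "fuel": 0.07,
    "environment": 0.03,
}

def best_concept(vector):
    """Return the concept the term is most strongly associated with."""
    return max(vector, key=vector.get)

print(best_concept(concept_vector))  # vehicle
```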


CONCEPT INDEXING
• One of the methods that implements a neural network algorithm is the LSI ("Latent Semantic Indexing") approach. LSI analyses the relationships among documents and terms by deriving concepts that relate them.
• This type of document representation is done with the help of a term-document matrix, where all the words are represented as rows and the documents as columns.
• Any rectangular matrix can be decomposed into the product of three matrices. Let O be an M x N matrix such that:
• O = A * B * C
CONCEPT INDEXING
• A consists of M rows and R columns (M x R)
• B consists of R rows and R columns (R x R); it is a diagonal matrix holding the singular values
• C consists of R rows and N columns (R x N)
• M x N = (M x R) * (R x R) * (R x N)
• R is the rank of O. The decomposition above is exact; to obtain a smaller concept space, only the n largest singular values in B are retained, giving the approximation:
• O ≈ An * Bn * Cn
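A minimal sketch of this decomposition and truncation using NumPy's SVD (the term-document counts are invented):

```python
import numpy as np

# Term-document matrix O: rows = terms, columns = documents (toy counts).
O = np.array([
    [2.0, 0.0, 1.0, 0.0],   # petroleum
    [1.0, 0.0, 2.0, 0.0],   # oil
    [0.0, 3.0, 0.0, 1.0],   # java
    [0.0, 2.0, 0.0, 2.0],   # programming
])

# Full decomposition: O = A @ diag(B) @ C.
A, B, C = np.linalg.svd(O, full_matrices=False)

# Keep only the n largest singular values for the reduced concept space.
n = 2
O_approx = A[:, :n] @ np.diag(B[:n]) @ C[:n, :]
print(np.round(O_approx, 2))  # rank-2 approximation of O
```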
HYPERTEXT LINKAGES
• One of the commonly used methods for retrieval of information is hypertext with linkages.
• Example: when you go to any search engine and type your search statement, various sites are displayed based on that statement.
• Hypertext linkages not only help you find relevant information within a particular document, but also additional information related to it.
HYPERTEXT LINKAGES
⚫ Manually generated (e.g., Yahoo!): pages are indexed manually into a linked hierarchy (an "index"). Users browse the hierarchy by following links until they reach the "end documents".
⚫ Automatically generated (e.g., AltaVista): pages at each Internet site are indexed automatically.
⚫ Crawlers (e.g., WebCrawler): no a priori indexing. Users define search terms, and the crawler goes to various sites searching for the desired information.
