Automatic Classification in Information Retrieval: Automatic Classification of Documents
Chapter 3 from IR_VAN_Book
INFORMATION RETRIEVAL
C. J. van RIJSBERGEN B.Sc., Ph.D., M.B.C.S.
2. Outline
Introduction
Classification for Information Retrieval
Classification methods
Measures of association
Clustering
The use of clustering in information retrieval
Graph-theoretic methods
'Single-Pass Algorithm'
Single-link
3. Introduction
What is classification?
A formal definition of classification will not be attempted.
The word 'classification' is used to describe both the process of grouping objects and the result of that process.
Classification is one of the core areas in machine learning.
4. Classification for Information
Retrieval
In the context of information retrieval, a classification is
required for a purpose.
The purpose may be to group the documents in such a
way that retrieval will be faster or alternatively it may be to
construct a thesaurus automatically.
Classification is used in various subtasks such as preprocessing, content filtering, sorting, and ranking.
There are two main areas of application of classification
methods in IR:
• keyword clustering;
• document clustering.
5. In the main, people have achieved this 'logical organization' in two different ways:
1. Direct classification of the documents;
2. The intermediate calculation of a measure of closeness between documents.
The first approach has proved theoretically intractable, so any experimental test results cannot be considered reliable.
The second approach to classification is fairly well documented now.
6. Classification methods
The data consists of objects and their corresponding descriptions. The objects may be documents, keywords, handwritten characters, or species (in the last case the objects themselves are classes as opposed to individuals).
The descriptors come under various names depending on their structure:
• (1) multi-state attributes (e.g. color)
• (2) binary-state attributes (e.g. keywords)
• (3) numerical attributes (e.g. a hardness scale, or weighted keywords)
• (4) probability distributions
The fourth category of descriptors is applicable when the objects are classes.
7. Measures of association
Some classification methods are based on a binary relationship between objects; on the basis of this relationship a classification method can construct a system of clusters.
The relationship is described variously as 'similarity', 'association' and 'dissimilarity'.
There are five commonly used measures of association in information retrieval. Since in information retrieval documents and requests are most commonly represented by term or keyword lists, an object is represented by a set of keywords, and the counting measure |.| gives the size of the set.
|X ∩ Y|   Simple matching coefficient,
which is the number of shared index terms.
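The slide spells out only the simple matching coefficient; the other four of the five measures listed in van Rijsbergen's chapter are Dice's, Jaccard's, the cosine, and the overlap coefficients. A minimal sketch of all five, treating each object as a keyword set as the slide assumes:

```python
# The five set-based measures of association from van Rijsbergen, Ch. 3.
# Objects are represented as sets of keywords; |.| is the set size.
from math import sqrt

def simple_matching(x: set, y: set) -> int:
    """|X n Y| -- the number of shared index terms."""
    return len(x & y)

def dice(x: set, y: set) -> float:
    """2|X n Y| / (|X| + |Y|)"""
    return 2 * len(x & y) / (len(x) + len(y))

def jaccard(x: set, y: set) -> float:
    """|X n Y| / |X u Y|"""
    return len(x & y) / len(x | y)

def cosine(x: set, y: set) -> float:
    """|X n Y| / (|X|^(1/2) * |Y|^(1/2))"""
    return len(x & y) / (sqrt(len(x)) * sqrt(len(y)))

def overlap(x: set, y: set) -> float:
    """|X n Y| / min(|X|, |Y|)"""
    return len(x & y) / min(len(x), len(y))

doc = {"cluster", "retrieval", "index", "term"}
query = {"cluster", "index", "search"}
print(simple_matching(doc, query))   # 2 shared index terms
print(round(jaccard(doc, query), 3)) # 2/5 = 0.4
```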
9. Clustering in information retrieval
Clustering is used in information retrieval systems to enhance the efficiency and effectiveness of the retrieval process.
Clustering is achieved by partitioning the documents in a collection into classes, such that documents that are associated with each other are assigned to the same cluster.
In order to cluster the items in a data set, some means of quantifying the degree of association between them is required.
A cluster method depending only on the rank-ordering of the association values would give identical clusterings for measures that are monotone with respect to one another.
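For instance, Jaccard's coefficient is a monotone transformation of Dice's (J = D / (2 - D)), so the two induce the same rank-ordering of object pairs. A small check, with invented keyword sets:

```python
# Jaccard is a monotone transformation of Dice, so a cluster method that
# depends only on the rank-ordering of association values makes identical
# clustering decisions under either coefficient.
def dice(x, y):
    return 2 * len(x & y) / (len(x) + len(y))

def jaccard(x, y):
    return len(x & y) / len(x | y)

pairs = [({"a", "b", "c"}, {"a", "b"}),
         ({"a", "b", "c"}, {"a", "d", "e"}),
         ({"a", "b"}, {"c", "d"})]

def rank(f):
    # indices of pairs, sorted from most to least associated under f
    return sorted(range(len(pairs)), key=lambda i: f(*pairs[i]), reverse=True)

assert rank(dice) == rank(jaccard)  # same rank-ordering, hence same clustering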
11. The use of clustering in information retrieval
"Theoretical soundness of the method"
• A method should satisfy certain criteria of adequacy. To list some of the more important of these:
1. The method produces a clustering which is unlikely to be altered drastically when further objects are incorporated, i.e. it is stable under growth;
2. The method is stable in the sense that small errors in the description of the objects lead to small changes in the clustering;
3. The method is independent of the initial ordering of the objects.
12. "The efficiency"
• We know much about the behavior of clustered files in terms of the effectiveness of retrieval (i.e. the ability to retrieve wanted and hold back unwanted documents).
• Efficiency is really a property of the algorithm implementing the cluster method.
• It is sometimes useful to distinguish the cluster method from its algorithm, but in the context of IR this distinction becomes slightly less than useful, since many cluster methods are defined by their algorithm.
• Two distinct approaches to clustering can be identified:
1. The clustering is based on a measure of similarity between the objects to be clustered;
2. The cluster method proceeds directly from the object descriptions.
15. A large class of hierarchic cluster methods
• Such methods are based on the initial measurement of similarity. The most important of these is single-link, which is the only one to have been extensively used in document retrieval. It satisfies all the criteria of adequacy mentioned above.
• A further class of cluster methods based on the measurement of similarity is the class of so-called 'clump' methods.
18. • The algorithms also use a number of empirically determined parameters, such as:
1. The number of clusters desired;
2. A minimum and maximum size for each cluster;
3. A threshold value on the matching function, below which an object will not be included in a cluster;
4. The control of overlap between clusters;
5. An arbitrarily chosen objective function which is optimized.
19. 'Single-Pass Algorithm'
(1) The object descriptions are processed serially;
(2) The first object becomes the cluster representative of the first cluster;
(3) Each subsequent object is matched against all cluster representatives existing at its processing time;
(4) A given object is assigned to one cluster (or more, if overlap is allowed) according to some condition on the matching function;
(5) When an object is assigned to a cluster, the representative for that cluster is recomputed;
(6) If an object fails a certain test, it becomes the cluster representative of a new cluster.
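A minimal sketch of the six steps above, under choices the slide leaves open: objects are keyword sets, the matching function is Dice's coefficient against a threshold, and a cluster representative is the union of its members' keyword sets:

```python
# Single-pass clustering: one serial scan over the objects; steps from the
# slide are marked. Matching function and representative are assumed choices.
def dice(x, y):
    return 2 * len(x & y) / (len(x) + len(y)) if x or y else 0.0

def single_pass(objects, threshold=0.3):
    clusters = []         # each cluster: list of object indices
    representatives = []  # parallel list of representative keyword sets
    for i, obj in enumerate(objects):                         # (1) serial scan
        scores = [dice(obj, rep) for rep in representatives]  # (3) match reps
        if scores and max(scores) >= threshold:   # (4) condition on matching fn
            best = scores.index(max(scores))
            clusters[best].append(i)
            representatives[best] |= obj          # (5) recompute representative
        else:                                     # (2)/(6) seed a new cluster
            clusters.append([i])
            representatives.append(set(obj))
    return clusters

docs = [{"web", "crawl"}, {"web", "index"}, {"cluster", "tree"}, {"tree", "mst"}]
print(single_pass(docs, threshold=0.3))  # [[0, 1], [2, 3]]
```

Note that the result depends on the input order and the threshold, which is exactly why the slide lists those empirically determined parameters.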
20. Single-link
• The output is a hierarchy with associated numerical levels, called a dendrogram.
• Frequently the hierarchy is represented by a tree structure such that each node represents a cluster.
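A quick illustration, assuming SciPy is available and using invented point coordinates: a standard single-link run returns a merge table whose third column holds the numerical levels of the dendrogram.

```python
# Single-link clustering with SciPy: the linkage matrix encodes the
# dendrogram, one merge per row as [cluster_i, cluster_j, level, size].
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

points = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]])
Z = linkage(pdist(points), method="single")
print(Z)  # each row: the two clusters merged and the dissimilarity level
```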
23. Single-link and the minimum spanning tree
• The single-link tree is closely related to another kind of tree: the minimum spanning tree, or MST, also derived from a dissimilarity coefficient.
• This second tree is quite different from the first: the nodes, instead of representing clusters, represent the individual objects to be clustered.
• The MST is the tree of minimum length connecting the objects, where by 'length' I mean the sum of the weights of the connecting links in the tree.
• We can equally define a maximum spanning tree as one of maximum length; an example is the maximum spanning tree based on the expected mutual information measure.
24. • Given the minimum spanning tree, the single-link clusters are obtained by deleting links from the MST in order of decreasing length;
• The connected sets after each deletion are the single-link clusters;
• The order of deletion and the structure of the MST ensure that the clusters will be nested into a hierarchy.
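A minimal sketch of that procedure, with an invented dissimilarity matrix: build the MST (here via Prim's algorithm), then delete its links in order of decreasing length and read off the connected components at each level.

```python
# From MST to single-link clusters: delete MST edges in decreasing length;
# the connected components after each deletion are the single-link clusters.
def prim_mst(dist):
    """Return MST edges (i, j, length) of a full dissimilarity matrix."""
    n = len(dist)
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        i, j = min(((a, b) for a in in_tree for b in range(n) if b not in in_tree),
                   key=lambda e: dist[e[0]][e[1]])
        edges.append((i, j, dist[i][j]))
        in_tree.add(j)
    return edges

def components(n, edges):
    """Connected components of n objects, given the surviving edges."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j, _ in edges:
        parent[find(i)] = find(j)
    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return list(groups.values())

dist = [[0.0, 0.2, 0.9, 0.8],
        [0.2, 0.0, 0.7, 0.9],
        [0.9, 0.7, 0.0, 0.1],
        [0.8, 0.9, 0.1, 0.0]]
mst = sorted(prim_mst(dist), key=lambda e: e[2], reverse=True)
for k in range(len(mst) + 1):
    # deleting the k longest links gives the single-link clusters at that level
    print(components(len(dist), mst[k:]))
```

Because links are removed from longest to shortest, each partition refines the previous one, which is how the nested hierarchy arises.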
25. Implications of classification methods
• The classification process can usually be speeded up by using extra storage.
• In experiments the classification structure is kept in fast store, but this is impossible in an operational system, where the document collections are so much bigger.
• In experiments we may want to vary cluster representatives at search time, but in an operational classification the cluster representatives would be constructed once and for all at cluster time.
• In IR the classification file structure should be:
– easily updated;
– easily searched;
– reasonably compact.
26. References
Classification for Information Retrieval
http://bit.ly/2zff6cA
Conceptual clustering in information retrieval
https://ieeexplore.ieee.org/document/678640
Clustering Algorithms
http://orion.lcg.ufrj.br/Dr.Dobbs/books/book5/chap16.htm