A Tutorial Review On Text Mining Algorithms: Mrs. Sayantani Ghosh, Mr. Sudipta Roy, and Prof. Samir K. Bandyopadhyay
ABSTRACT- As we enter the third decade of the World Wide Web (WWW), the textual revolution has seen a tremendous change in the availability of online information: finding information for just about any need has never been more automatic, just a keystroke or mouse click away. Text mining can be viewed as one of a class of non-traditional Information Retrieval (IR) strategies which attempt to treat entire text collections holistically, avoid the bias of human queries, objectify the IR process with principled algorithms, and "let the data speak for itself." These strategies share many techniques, such as semantic parsing and statistical clustering, and the boundaries between them are fuzzy. In this paper different existing Text Mining Algorithms, i.e. the Classification Algorithm, the Association Algorithm and the Clustering Algorithm, are briefly reviewed, stating the merits and demerits of each. In addition, some alternate implementations of the algorithms are proposed. Finally, the logic of these algorithms is merged to generate an algorithm which will classify a data set into some predefined classes, establish relationships between the classified data, and finally cluster the data into groups based on the associations between them.
Keywords: Data Mining, Text Mining, Classification, Clustering, Association, Agglomerative, Divisive, Information
Retrieval, Information Extraction.
contain the name of a protein, or some form of the verb ‘to interact’ or one of its synonyms.
2.2. Natural Language Processing (NLP) is one of the oldest and most difficult problems in the field of artificial intelligence. It is the analysis of human language so that computers can understand natural languages as humans do. Although this goal is still some way off, NLP can perform some types of analysis with a high degree of success. Shallow parsers identify only the main grammatical elements in a sentence, such as noun phrases and verb phrases, whereas deep parsers generate a complete representation of the grammatical structure of a sentence. The role of NLP in text mining is to provide the systems in the information extraction phase (see below) with linguistic data that they need to perform their task. Often this is done by annotating documents with information like sentence boundaries, part-of-speech tags and parsing results, which can then be read by the information extraction tools.
2.3. Data Mining (DM) is the process of identifying patterns in large sets of data. The aim is to uncover previously unknown, useful knowledge. When used in text mining, DM is applied to the facts generated by the information extraction phase. We put the results of our DM process into another database that can be queried by the end-user via a suitable graphical interface. The data generated by such queries can also be represented visually.
2.4. Information Extraction (IE) is the process of automatically obtaining structured data from an unstructured natural language document. Often this involves defining the general form of the information that we are interested in as one or more templates, which are then used to guide the extraction process. IE systems rely heavily on the data generated by NLP systems.

3. PROBLEMS OF TEXT MINING
One main reason for applying data mining methods to text document collections is to structure them. A structure can significantly simplify the access to a document collection for a user. Well known access structures are library catalogues or book indexes. However, the problem with manually designed indexes is the time required to maintain them. Therefore, they are very often not up-to-date and thus not usable for recent publications or frequently changing information sources like the World Wide Web. The existing methods for structuring collections either try to assign keywords to documents based on a given keyword set (classification or categorization methods) or automatically structure document collections to find groups of similar documents (clustering based methods). The problem of Text Mining is therefore the Classification of a data set and the Discovery of Associations among data. In order to overcome these problems of Data Mining, the following algorithms have been designed.
4. TASKS OF TEXT MINING ALGORITHMS
Text categorization: assigning documents to pre-defined categories (e.g. decision tree induction).
Text clustering: a descriptive activity which groups similar documents together (e.g. self-organizing maps).
Concept mining: modelling and discovering of concepts; sometimes combines categorization and clustering approaches with concept/logic based ideas in order to find concepts and their relations from text collections (e.g. the formal concept analysis approach for building a concept hierarchy).
Information retrieval: retrieving the documents relevant to the user's query.
Information extraction: question answering.
usable for recent publications or frequently 1. Decision Tree
changing information sources like the World Wide 2. Neural Network
Web. The existing methods for structuring 3. Genetic Algorithm
collections either try to assign keywords to 5.1.2.1. Classification Using Decision Tree:
documents based on a given keyword set Sequential Decision Tree based
5.1.1 Objective:
Classification is the process of discovering a model for the class in terms of the remaining attributes. The objective is to use the training data set to build a model of the class label based on the other attributes, such that the model can be used to classify new data not from the training data set.
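As an illustration, a minimal Python sketch of such a training data set is given below. The attribute names, values and the classifying attribute "Play" are hypothetical and chosen only to mirror the weather-style example used with Hunt's method (Figure 1); they are not part of the original specification.

# A hypothetical training data set: each record has a record id, ordinary
# attributes (categorical and continuous), and one classifying attribute
# ("Play") whose values are the class labels.
TRAINING_SET = [
    {"id": 1, "Outlook": "sunny",    "Temperature": 85, "Windy": False, "Play": "no"},
    {"id": 2, "Outlook": "sunny",    "Temperature": 80, "Windy": True,  "Play": "no"},
    {"id": 3, "Outlook": "overcast", "Temperature": 83, "Windy": False, "Play": "yes"},
    {"id": 4, "Outlook": "rain",     "Temperature": 70, "Windy": False, "Play": "yes"},
    {"id": 5, "Outlook": "rain",     "Temperature": 65, "Windy": True,  "Play": "no"},
]

CLASS_ATTRIBUTE = "Play"              # the classifying attribute
CATEGORICAL = {"Outlook", "Windy"}    # finite, discrete domains
CONTINUOUS = {"Temperature"}          # ordered, numeric domain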
5.1.2 Classification Models:
The different types of classification models are as follows:
1. Decision Tree
2. Neural Network
3. Genetic Algorithm
5.1.2.1. Classification Using Decision Tree:
Sequential Decision Tree based Classification
Parallel Formulation of Decision Tree based Classification
5.1.2.1.1. Sequential Decision Tree based Classification:
A decision tree model consists of internal nodes and leaves. Each of the internal nodes has a decision associated with it and each of the leaves has a class label attached to it. A decision tree based classification consists of two steps.
1. Tree induction – A tree is induced from the given training set.
2. Tree pruning – The induced tree is made more concise and robust by removing any statistical dependencies on the specific training data set.
5.1.2.1.1.1. Hunt's method: The following gives the recursive description of Hunt's method for constructing a decision tree from a set T of training cases with classes denoted {C1, C2, . . . , Ck}.
Case 1: T contains cases all belonging to a single class Cj. The decision tree for T is a leaf identifying class Cj.
Case 2: T contains cases that belong to a mixture of classes. A test is chosen, based on a single attribute, that has one or more mutually exclusive outcomes {O1, O2, . . . , On}. Note that in many implementations, n is chosen to be 2 and this leads to a binary decision tree. T is partitioned into subsets T1, T2, . . . , Tn, where Ti contains all the cases in T that have outcome Oi of the chosen test. The decision tree for T consists of a decision node identifying the test, and one branch for each possible outcome. The same tree building machinery is applied recursively to each subset of training cases.
Case 3: T contains no cases. The decision tree for T is a leaf, but the class to be associated with the leaf must be determined from information other than T. For example, C4.5 chooses this to be the most frequent class at the parent of this node.
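To make the recursion concrete, a minimal Python sketch of Hunt's method is given below. The test attribute is simply the first candidate here; a real implementation would pick the attribute with the best entropy or Gini gain, as discussed next. Function and variable names are illustrative only.

from collections import Counter

def hunts_method(cases, attributes, class_attr, parent_majority=None):
    """Recursive Hunt's method: returns a leaf label or a decision node.

    cases      : list of dicts (training records)
    attributes : list of candidate categorical attribute names
    class_attr : name of the classifying attribute
    """
    # Case 3: no cases -- the class comes from information other than T
    # (here, the majority class of the parent node).
    if not cases:
        return parent_majority
    labels = [c[class_attr] for c in cases]
    # Case 1: all cases belong to a single class (or no attributes left) -> leaf.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Case 2: choose a test on a single attribute and partition T by its outcomes.
    test_attr = attributes[0]
    majority = Counter(labels).most_common(1)[0][0]
    node = {"test": test_attr, "branches": {}}
    remaining = [a for a in attributes if a != test_attr]
    for outcome in set(c[test_attr] for c in cases):
        subset = [c for c in cases if c[test_attr] == outcome]
        node["branches"][outcome] = hunts_method(subset, remaining, class_attr, majority)
    return node

# Example with the hypothetical training set defined earlier:
# tree = hunts_method(TRAINING_SET, ["Outlook", "Windy"], "Play")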
Figure 1: Hunt's method
Figure 1 shows how Hunt's method works with the training data set. In Case 2 of Hunt's method, a test based on a single attribute is chosen for expanding the current node. The choice of an attribute is normally based on the entropy gains of the attributes. The entropy of an attribute is calculated from class distribution information. For a discrete attribute, class distribution information of each value of the attribute is required, e.g. for the attribute Outlook at the root of the decision tree shown in Figure 1. Once the class distribution information of all the attributes is gathered, each attribute is evaluated in terms of either entropy [Qui93] or the Gini index [BFOS84]. The best attribute is selected as a test for the node expansion.
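A small Python sketch of this entropy-based evaluation is given below; it computes the class entropy of a set of records and the entropy (information) gain of a categorical attribute. The Gini index could be substituted in the same way. The names are illustrative only.

import math
from collections import Counter

def entropy(records, class_attr):
    """Entropy of the class distribution of a set of records."""
    counts = Counter(r[class_attr] for r in records)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(records, attr, class_attr):
    """Entropy gain obtained by splitting the records on a categorical attribute."""
    total = len(records)
    remainder = 0.0
    for value in set(r[attr] for r in records):
        subset = [r for r in records if r[attr] == value]
        remainder += (len(subset) / total) * entropy(subset, class_attr)
    return entropy(records, class_attr) - remainder

# The attribute with the highest gain is selected as the test for node expansion:
# best = max(["Outlook", "Windy"], key=lambda a: information_gain(TRAINING_SET, a, "Play"))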
5.1.2.1.1.2. C4.5 Algorithm: The C4.5 algorithm generates a classification decision tree for the given training data set by recursively partitioning the data. The decision tree is grown using a depth-first strategy. The algorithm considers all the possible tests that can split the data set and selects the test that gives the best information gain. For each discrete attribute, one test with as many outcomes as the number of distinct values of the attribute is considered. For each continuous attribute, binary tests involving every distinct value of the attribute are considered. In order to gather the entropy gain of all these binary tests efficiently, the training data set belonging to the node in consideration is sorted on the values of the continuous attribute, and the entropy gains of the binary cuts based on each distinct value are calculated in one scan of the sorted data. This process is repeated for each continuous attribute.
Recently proposed classification algorithms SLIQ [MAR96] and SPRINT [SAM96] avoid costly sorting at each node by pre-sorting continuous attributes once in the beginning.
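The evaluation of binary cuts on a sorted continuous attribute described above can be sketched in Python as follows; this is an illustrative sketch, not the actual C4.5 code, and it reuses the entropy helper defined above. A production implementation would maintain running class counts during the scan so that each candidate cut is evaluated in constant time.

def best_binary_split(records, attr, class_attr):
    """Scan a continuous attribute in sorted order and return (threshold, gain)
    for the binary cut 'attr <= threshold' with the highest entropy gain."""
    records = sorted(records, key=lambda r: r[attr])
    base = entropy(records, class_attr)
    total = len(records)
    best = (None, 0.0)
    for i in range(1, total):
        # Only cut between two distinct attribute values.
        if records[i - 1][attr] == records[i][attr]:
            continue
        threshold = (records[i - 1][attr] + records[i][attr]) / 2.0
        left, right = records[:i], records[i:]
        gain = base - (len(left) / total) * entropy(left, class_attr) \
                    - (len(right) / total) * entropy(right, class_attr)
        if gain > best[1]:
            best = (threshold, gain)
    return best

# Example: best_binary_split(TRAINING_SET, "Temperature", "Play")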
5.1.2.1.1.3. SPRINT Algorithm: In SPRINT, each continuous attribute is maintained in a sorted attribute list. In this list, each entry contains a value of the attribute and its corresponding record id. Once the best attribute to split a node in a classification tree is determined, each attribute list has to be split according to the split decision. A hash table, of the same order as the number of training cases, holds the mapping between record ids and the node to which each record belongs according to the split decision. Each entry in the attribute list is moved to a classification tree node according to the information retrieved by probing the hash table. The sorted order is maintained, as the entries are moved in pre-sorted order.
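A small Python sketch of this attribute-list splitting step is shown below; the data layout and names are illustrative, not the actual SPRINT data structures.

def split_attribute_list(attr_list, record_to_node):
    """Split one sorted attribute list among child nodes.

    attr_list      : list of (attribute_value, record_id) pairs, already sorted by value
    record_to_node : hash table mapping record_id -> child node id (the split decision)

    Returns a dict: child node id -> its own attribute list, still in sorted order,
    because entries are appended while walking the parent list in pre-sorted order.
    """
    child_lists = {}
    for value, rid in attr_list:
        node = record_to_node[rid]            # probe the hash table
        child_lists.setdefault(node, []).append((value, rid))
    return child_lists

# Example: entries of the "Temperature" list are routed to child nodes 0 and 1.
temperature_list = [(65, 5), (70, 4), (80, 2), (83, 3), (85, 1)]
print(split_attribute_list(temperature_list, {1: 0, 2: 0, 3: 1, 4: 1, 5: 1}))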
Decision trees are usually built in two steps. First, an initial tree is built till the leaf nodes belong to a single class only. Second, pruning is done to remove any overfitting to the training data. Typically, the time spent on pruning for a large dataset is a small fraction, less than 1%, of the time spent on the initial tree generation.
Advantages of sequential decision tree based classification are that the trees are inexpensive to construct, easy to interpret, easy to integrate with commercial databases, and they yield good accuracy. Disadvantages are that it cannot handle very large data sets, since it suffers from memory limitations, and it has low computational speed.
5.1.2.1.2. Parallel Formulation of Decision Tree based Classification
The goals of the parallel formulation of decision tree based classification algorithms are scalability in both runtime and memory requirements. The parallel formulation overcomes the memory limitation faced by the sequential algorithms, that is, it should make it possible to handle larger data sets without requiring redundant disk I/O. The parallel formulation should also offer good speedup over the serial algorithm.
The types of parallel formulation for classification decision tree construction are:
Synchronous Tree Construction Approach
Partitioned Tree Construction Approach
Hybrid Parallel Formulation
5.1.2.1.2.1. Synchronous Tree Construction Approach
In this approach, all processors construct a decision tree synchronously by sending and receiving class distribution information of local data. The major steps for the approach are shown below; a small sketch of the class-distribution exchange in step 3 follows the list.
1. Select a node to expand according to a decision tree expansion strategy (e.g. Depth-First or Breadth-First), and call that node the current node. At the beginning, the root node is selected as the current node.
2. For each data attribute, collect class distribution information of the local data at the current node.
3. Exchange the local class distribution information using global reduction [KGGK94] among processors.
4. Simultaneously compute the entropy gains of each attribute at each processor and select the best attribute for child node expansion.
5. Depending on the branching factor of the tree desired, create child nodes for the same number of partitions of attribute values, and split the training cases accordingly.
6. Repeat the above steps (1–5) until no more nodes are available for expansion.
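The exchange in step 3 can be illustrated with the following Python sketch, which simulates the global reduction by summing per-processor class-distribution counts; on an actual parallel machine this would be a single all-reduce (e.g. MPI_Allreduce) over the count arrays. All names and the toy data partitions are illustrative.

from collections import Counter

def local_class_distribution(local_records, attr, class_attr):
    """Step 2: per-processor counts of (attribute value, class label) pairs."""
    return Counter((r[attr], r[class_attr]) for r in local_records)

def global_reduction(local_distributions):
    """Step 3 (simulated): sum the local distributions of all processors.
    In a real implementation each processor obtains this total via an all-reduce."""
    total = Counter()
    for dist in local_distributions:
        total += dist
    return total

# Two simulated processors holding different partitions of the training data:
p0 = [{"Outlook": "sunny", "Play": "no"}, {"Outlook": "rain", "Play": "yes"}]
p1 = [{"Outlook": "sunny", "Play": "no"}, {"Outlook": "overcast", "Play": "yes"}]
locals_ = [local_class_distribution(p, "Outlook", "Play") for p in (p0, p1)]
print(global_reduction(locals_))   # the same totals become available to every processor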
Figure 2: Synchronous Tree Construction Approach with Depth-First Expansion Strategy
In Figure 2 the root node has already been expanded and the current node is the leftmost child of the root (as shown in the top part of the figure). All four processors cooperate to expand this node to have two child nodes. Next, the leftmost of these child nodes is selected as the current node (in the bottom of the figure) and all four processors again cooperate to expand the node. The advantage of this approach is that it does not require any movement of the training data items. Its disadvantages are that the algorithm suffers from high communication cost and load imbalance. For each node in the decision tree, after collecting the class distribution information, all the processors need to synchronize and exchange the distribution information. Hence, as the tree deepens, the communication overhead dominates the overall processing time. The other problem is load imbalance: even though each processor starts out with the same number of training data items, the number of items belonging to the same node of the decision tree can vary substantially among processors.
5.1.2.1.2.2. Partitioned Tree Construction Approach
The second disadvantage is the poor load balancing inherent in the algorithm. Assignment of nodes to processors is done based on the number of training cases in the successor nodes. However, the number of training cases associated with a node does not necessarily correspond to the amount of work needed to process the subtree rooted at the node.
5.1.2.1.2.3. Hybrid Parallel Formulation
The hybrid parallel formulation has elements of both schemes. The Synchronous Tree Construction Approach incurs a high communication overhead as the frontier gets larger. The Partitioned Tree Construction Approach incurs the cost of load balancing after each step. The hybrid scheme keeps continuing with the first approach as long as the communication cost incurred by the first formulation is not too high. Once this cost becomes high, the processors as well as the current frontier of the classification tree are partitioned into two parts. The description assumes that the number of processors is a power of 2, and that these processors are connected in a hypercube configuration. The algorithm can be appropriately modified if P is not a power of 2. Also, this algorithm can be mapped onto any parallel architecture by simply embedding a virtual hypercube in the architecture.
to infer the mapping implied by the data. The cost function is related to the mismatch between our mapping and the data, and it implicitly contains prior knowledge about the problem domain. Tasks that fall within the paradigm of supervised learning are pattern recognition (also known as classification) and regression (also known as function approximation). The supervised learning paradigm is also applicable to sequential data (e.g., for speech and gesture recognition).
With respect to the above specification the following assumptions have been considered.
(1) Multi-Layer Perceptron: the simple feed-forward neural network is actually called a Multilayer Perceptron (MLP). An MLP is a network of perceptrons. The neurons are placed in layers with outputs always flowing toward the output layer. If only one layer exists, it is called a perceptron. If multiple layers exist, it is an MLP.
(2) Back Propagation algorithm: a learning technique that adjusts weights in a neural network by propagating weight changes backward from the sink to the source nodes.
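As an illustration of assumptions (1) and (2), a minimal NumPy sketch of a two-layer MLP trained with back propagation is given below; the XOR data, network size and learning rate are arbitrary toy choices and are not part of the original specification.

import numpy as np

rng = np.random.default_rng(0)

# Toy data set (XOR): 2 inputs, 1 output.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer with 4 neurons; outputs always flow toward the output layer.
W1 = rng.normal(scale=0.5, size=(2, 4)); b1 = np.zeros(4)
W2 = rng.normal(scale=0.5, size=(4, 1)); b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for epoch in range(5000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)          # hidden activations
    out = sigmoid(h @ W2 + b2)        # network output
    # Backward pass: propagate the error from the output (sink) toward the inputs (source).
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Gradient-descent weight updates.
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0)

print(np.round(out, 2))   # approaches [[0], [1], [1], [0]]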
from the algorithmic point of view. Bringing in the sequential relationships increases the combinatorial complexity of the problem enormously. The reason is that the maximum number of sequences having k events is O(m^k * 2^(k-1)), where m is the total number of distinct events in the input data. In contrast, there are only C(m, k) size-k item-sets possible while discovering non-sequential associations from m distinct items.
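The gap between the two counts can be illustrated with a short Python computation; it simply evaluates m^k * 2^(k-1) and C(m, k) for a few small values and is meant only to show the combinatorial blow-up.

from math import comb

def max_sequences(m, k):
    """Upper bound on the number of sequences of k events drawn from m event types."""
    return m**k * 2**(k - 1)

def itemset_count(m, k):
    """Number of possible size-k item-sets over m distinct items."""
    return comb(m, k)

for m, k in [(10, 3), (10, 5), (100, 5)]:
    print(f"m={m}, k={k}: sequences <= {max_sequences(m, k):,}  vs  item-sets = {itemset_count(m, k):,}")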
5.3. Clustering Algorithm:
5.3.1. Objective: Clustering is a division of data into groups of similar objects. Each group, called a cluster, consists of objects that are similar to one another and dissimilar to objects of other groups. Representing data by fewer clusters necessarily loses certain fine details (akin to lossy data compression), but achieves simplification: it represents many data objects by a few clusters, and hence models data by its clusters. Data modelling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis.
5.3.2. Clustering Algorithms:
Clustering Algorithms are classified into the following two methods:
5.3.2.1. Hierarchical Methods: Hierarchical clustering builds a cluster hierarchy or, in other words, a tree of clusters, also known as a dendrogram. Every cluster node contains child clusters; sibling clusters partition the points covered by their common parent. Such an approach allows exploring data on different levels of granularity.
Hierarchical clustering methods are categorized into agglomerative (bottom-up) and divisive (top-down) methods.
An agglomerative clustering starts with one-point (singleton) clusters and recursively merges the two or more most appropriate clusters (a minimal sketch of this bottom-up merging is given after the advantages and disadvantages below).
A divisive clustering starts with one cluster of all data points and recursively splits the most appropriate cluster. The process continues until a stopping criterion (frequently, the requested number k of clusters) is achieved.
Advantages are 1) embedded flexibility regarding the level of granularity, 2) ease of handling any form of similarity or distance, and 3) consequently, applicability to any attribute type. Disadvantages are 1) vagueness of the termination criteria and 2) the fact that most hierarchical algorithms do not revisit once-constructed (intermediate) clusters with the purpose of their improvement.
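A minimal Python sketch of agglomerative (bottom-up) clustering with single-link merging is given below; the two-dimensional points and the requested number of clusters k are arbitrary illustrative choices.

def single_link_distance(c1, c2):
    """Distance between two clusters = smallest pairwise point distance (single link)."""
    return min(((p[0] - q[0])**2 + (p[1] - q[1])**2) ** 0.5 for p in c1 for q in c2)

def agglomerative(points, k):
    """Start from singleton clusters and repeatedly merge the two closest clusters
    until only k clusters remain (the stopping criterion)."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_link_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge the most appropriate pair
        del clusters[j]
    return clusters

points = [(0, 0), (0, 1), (1, 0), (8, 8), (8, 9), (9, 8)]
print(agglomerative(points, 2))   # two well-separated groups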
5.3.2.2. Partitioning Methods: Data partitioning algorithms divide data into several subsets. Since checking all possible subset systems is computationally infeasible, certain greedy heuristics are used in the form of iterative optimization. Specifically, this means different relocation schemes that iteratively reassign points between the k clusters. Unlike traditional hierarchical methods, in which clusters are not revisited after being constructed, relocation algorithms gradually improve clusters, and with appropriate data this results in high quality clusters. One approach to data partitioning is to take a conceptual point of view that identifies the cluster with a certain model whose unknown parameters have to be found. More specifically, probabilistic models assume that the data comes from a mixture of several populations whose distributions and priors we want to find. One advantage of probabilistic methods is the interpretability of the constructed clusters. Having a concise cluster representation also allows inexpensive computation of intra-cluster measures of fit that give rise to a global objective function.
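An iterative relocation scheme of this kind can be illustrated with the following plain-Python k-means sketch; k-means itself is not named in the text above but is the most common such relocation method, and the points, k and iteration count are illustrative choices.

def kmeans(points, k, iterations=20):
    """Iteratively reassign points to the nearest of k centroids and update the centroids."""
    centroids = points[:k]                        # naive initialisation for the sketch
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                          # relocation step: assign each point
            d = [(p[0] - c[0])**2 + (p[1] - c[1])**2 for c in centroids]
            clusters[d.index(min(d))].append(p)
        for i, cl in enumerate(clusters):         # update step: recompute each centroid
            if cl:
                centroids[i] = (sum(p[0] for p in cl) / len(cl),
                                sum(p[1] for p in cl) / len(cl))
    return clusters, centroids

points = [(0, 0), (0, 1), (1, 0), (8, 8), (8, 9), (9, 8)]
print(kmeans(points, 2))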
6. PROPOSALS
In this section we make certain proposals that can be implemented as modifications to the existing Text Mining Algorithms described in the previous sections.
Association Algorithm: In the Sequential Algorithm, the sequential pattern between the data elements can be determined by associating a timestamp with each data item; the timestamp is assigned based on the arrival time of each item. If this information can be put to use, one can find relationships such as: if a customer bought a book today, then he/she is likely to buy another book in a few days' time.
However, this technique of finding the sequential pattern between the data items is more applicable if we are considering a dynamic data set, where the number of data items in the data set varies dynamically with time, as sketched below.
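A minimal Python sketch of this timestamp-based idea is given below; the transaction data, the item field and the "few days" window are hypothetical choices used only to illustrate how arrival-time stamps expose sequential patterns such as repeat purchases.

from datetime import datetime, timedelta

# Hypothetical arrival-time-stamped transactions: (customer, item, timestamp).
transactions = [
    ("alice", "book", datetime(2024, 1, 1)),
    ("alice", "book", datetime(2024, 1, 4)),
    ("bob",   "book", datetime(2024, 1, 2)),
    ("bob",   "pen",  datetime(2024, 1, 20)),
]

def repeat_purchases(transactions, item, window_days=7):
    """Find customers who bought `item` again within `window_days` of a previous purchase."""
    by_customer = {}
    for customer, it, ts in transactions:
        if it == item:
            by_customer.setdefault(customer, []).append(ts)
    window = timedelta(days=window_days)
    repeats = set()
    for customer, stamps in by_customer.items():
        stamps.sort()
        for earlier, later in zip(stamps, stamps[1:]):
            if later - earlier <= window:
                repeats.add(customer)
    return repeats

print(repeat_purchases(transactions, "book"))   # {'alice'}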
heuristics are used in the form of iterative Clustering Algorithm: In the Hierarchical
optimization. Specifically, this means different Clustering Algorithm, there are two approaches
Samir K. Bandyopadhyay: He is Professor of the Dept. of Computer Science & Engineering, University of Calcutta, Kolkata, India.