
ISSN : 2278 – 1021

International Journal of Advanced Research in Computer and Communication Engineering


Vol. 1, Issue 4, June 2012

A tutorial review on Text Mining Algorithms


Mrs. Sayantani Ghosh1, Mr. Sudipta Roy2, and Prof. Samir K. Bandyopadhyay3
Department of Computer Science and Engineering1,2,3
University of Calcutta, 92 A.P.C. Road,
Kolkata-700009, India.

ABSTRACT - As we enter the third decade of the World Wide Web (WWW), the textual revolution has seen a tremendous change in the availability of online information. Finding information for just about any need has never been more automatic: just a keystroke or mouse click away. Text mining can be viewed as one of a class of non-traditional Information Retrieval (IR) strategies which attempt to treat entire text collections holistically, avoid the bias of human queries, objectify the IR process with principled algorithms, and "let the data speak for itself." These strategies share many techniques, such as semantic parsing and statistical clustering, and the boundaries between them are fuzzy. In this paper the different existing Text Mining Algorithms, i.e. the Classification Algorithm, the Association Algorithm and the Clustering Algorithm, are briefly reviewed, stating the merits/demerits of each. In addition, some alternate implementations of the algorithms are proposed. Finally, the logic of these algorithms is merged to generate an algorithm which will perform the task of classification of a data set into some predefined classes, establish relationships between the classified data, and finally cluster the data into groups based on the associations between them.

Keywords: Data Mining, Text Mining, Classification, Clustering, Association, Agglomerative, Divisive, Information
Retrieval, Information Extraction.

1. INTRODUCTION

Labour-intensive manual text-mining approaches first surfaced in the mid-1980s, but technological advances have enabled the field to advance swiftly during the past decade. Text mining is an interdisciplinary field which draws on information retrieval, data mining, machine learning, statistics, and computational linguistics. As most information (common estimates say over 80%) is currently stored as text, text mining is believed to have a high commercial potential value. Increasing interest is being paid to multilingual data mining: the ability to gain information across languages and cluster similar items from different linguistic sources according to their meaning. Text mining, sometimes alternately referred to as text data mining and roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. High-quality information is typically derived through the divining of patterns and trends through means such as statistical pattern learning. Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness. Regarded by many as the next wave of knowledge discovery, text mining has very high commercial value.

2. STAGES OF TEXT MINING PROCESS

Text mining involves the application of techniques from areas such as information retrieval, natural language processing, information extraction and data mining. These various stages of a text-mining process can be combined into a single workflow.
2.1. Information Retrieval (IR) systems identify the documents in a collection which match a user's query. The most well known IR systems are search engines such as Google, which identify those documents on the World Wide Web that are relevant to a set of given words. IR systems are often used in libraries, where the documents are typically not the books themselves but digital records containing information about the books. IR systems allow us to narrow down the set of documents that are relevant to a particular problem. As text mining involves applying very computationally-intensive algorithms to large document collections, IR can speed up the analysis considerably by reducing the number of documents for analysis. For example, if we are interested in mining information only about protein interactions, we might restrict our analysis to documents that contain the name of a protein, or some form of the verb 'to interact' or one of its synonyms.
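To make the protein-interaction example concrete, the following is a minimal sketch of such an IR filter (in Python; the corpus, the protein list and the pattern for forms of 'to interact' are illustrative assumptions, not part of the original):

```python
import re

# Hypothetical inputs: a document collection, known protein names,
# and inflected forms of the verb "to interact".
documents = {
    "doc1": "BRCA1 interacts with BARD1 to form a stable complex.",
    "doc2": "The weather in Kolkata was pleasant throughout June.",
}
protein_names = {"BRCA1", "BARD1", "p53"}
interact_forms = re.compile(r"\binteract(s|ed|ing|ion)?\b", re.IGNORECASE)

def is_relevant(text: str) -> bool:
    """Keep a document only if it names a protein or a form of 'interact'."""
    has_protein = any(name in text for name in protein_names)
    return has_protein or bool(interact_forms.search(text))

relevant = {doc_id: text for doc_id, text in documents.items() if is_relevant(text)}
print(sorted(relevant))  # ['doc1'] -- doc2 is filtered out before mining
```

Only the surviving documents would then be passed to the computationally heavier NLP, IE and DM stages.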


2.2. Natural Language Processing (NLP) is one of the oldest and most difficult problems in the field of artificial intelligence. It is the analysis of human language so that computers can understand natural languages as humans do. Although this goal is still some way off, NLP can perform some types of analysis with a high degree of success. Shallow parsers identify only the main grammatical elements in a sentence, such as noun phrases and verb phrases, whereas deep parsers generate a complete representation of the grammatical structure of a sentence. The role of NLP in text mining is to provide the systems in the information extraction phase (see below) with the linguistic data that they need to perform their task. Often this is done by annotating documents with information like sentence boundaries, part-of-speech tags and parsing results, which can then be read by the information extraction tools.
2.3. Data Mining (DM) is the process of identifying patterns in large sets of data. The aim is to uncover previously unknown, useful knowledge. When used in text mining, DM is applied to the facts generated by the information extraction phase. We put the results of our DM process into another database that can be queried by the end-user via a suitable graphical interface. The data generated by such queries can also be represented visually.
2.4. Information Extraction (IE) is the process of automatically obtaining structured data from an unstructured natural language document. Often this involves defining the general form of the information that we are interested in as one or more templates, which are then used to guide the extraction process. IE systems rely heavily on the data generated by NLP systems.

3. PROBLEMS OF TEXT MINING

One main reason for applying data mining methods to text document collections is to structure them. A structure can significantly simplify the access to a document collection for a user. Well known access structures are library catalogues or book indexes. However, the problem of manually designed indexes is the time required to maintain them. Therefore, they are very often not up-to-date and thus not usable for recent publications or frequently changing information sources like the World Wide Web. The existing methods for structuring collections either try to assign keywords to documents based on a given keyword set (classification or categorization methods) or automatically structure document collections to find groups of similar documents (clustering methods). The problem of Text Mining is therefore the Classification of a data set and the Discovery of Associations among data. In order to overcome the problems of Data Mining, the following algorithms have been designed.

4. TASKS OF TEXT MINING ALGORITHMS

• Text categorization: assigning the documents to pre-defined categories (e.g. decision tree induction).
• Text clustering: a descriptive activity, which groups similar documents together (e.g. self-organizing maps).
• Concept mining: modelling and discovering of concepts; sometimes combines categorization and clustering approaches with concept/logic based ideas in order to find concepts and their relations from text collections (e.g. the formal concept analysis approach for building a concept hierarchy).
• Information retrieval: retrieving the documents relevant to the user's query.
• Information extraction: question answering.

5. TYPE OF TEXT MINING ALGORITHM

5.1 Classification Algorithm
The Classification problem can be stated as a training data set consisting of records. Each record is identified by a unique record id, and consists of fields corresponding to the attributes. An attribute with a continuous domain is called a continuous attribute. An attribute with a finite domain of discrete values is called a categorical attribute. One of the categorical attributes is the classifying attribute or class, and the values in its domain are called class labels.
5.1.1 Objective:
Classification is the process of discovering a model for the class in terms of the remaining attributes. The objective is to use the training data set to build a model of the class label based on the other attributes, such that the model can be used to classify new data not from the training data set (a small sketch appears below, after the list of classification models).
5.1.2 Classification Models:
The different types of classification models are as follows:
1. Decision Tree
2. Neural Network
3. Genetic Algorithm
5.1.2.1. Classification Using Decision Tree:
• Sequential Decision Tree based Classification
• Parallel Formulation of Decision Tree based Classification
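To illustrate the classification task stated in 5.1.1, here is a minimal sketch (the weather-style records and the use of scikit-learn's decision tree are our own assumptions; the paper does not prescribe a library):

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training set: each record has a record id (the row index),
# a categorical attribute (outlook), a continuous attribute (humidity),
# and a classifying attribute (play) whose values are the class labels.
outlook_codes = {"sunny": 0, "overcast": 1, "rain": 2}
train = [
    ("sunny", 85.0, "no"),
    ("sunny", 70.0, "yes"),
    ("overcast", 78.0, "yes"),
    ("rain", 96.0, "no"),
]
X = [[outlook_codes[o], h] for o, h, _ in train]
y = [label for _, _, label in train]

model = DecisionTreeClassifier(criterion="entropy").fit(X, y)

# Classify new data that was not part of the training set.
print(model.predict([[outlook_codes["overcast"], 80.0]]))  # e.g. ['yes']
```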


5.1.2.1.1. Sequential Decision Tree based Classification:
A decision tree model consists of internal nodes and leaves. Each of the internal nodes has a decision associated with it and each of the leaves has a class label attached to it. A decision tree based classification consists of two steps:
1. Tree induction - A tree is induced from the given training set.
2. Tree pruning - The induced tree is made more concise and robust by removing any statistical dependencies on the specific training data set.
5.1.2.1.1.1. Hunt's method: The following gives the recursive description of Hunt's method for constructing a decision tree from a set T of training cases with classes denoted {C1, C2, ..., Ck}.
Case 1: T contains cases all belonging to a single class Cj. The decision tree for T is a leaf identifying class Cj.
Case 2: T contains cases that belong to a mixture of classes. A test is chosen, based on a single attribute, that has one or more mutually exclusive outcomes {O1, O2, ..., On}. Note that in many implementations, n is chosen to be 2 and this leads to a binary decision tree. T is partitioned into subsets T1, T2, ..., Tn, where Ti contains all the cases in T that have outcome Oi of the chosen test. The decision tree for T consists of a decision node identifying the test, and one branch for each possible outcome. The same tree building machinery is applied recursively to each subset of training cases.
Case 3: T contains no cases. The decision tree for T is a leaf, but the class to be associated with the leaf must be determined from information other than T. For example, C4.5 chooses this to be the most frequent class at the parent of this node.

Figure 1: Hunt's method

Figure 1 shows how Hunt's method works with the training data set. In case 2 of Hunt's method, a test based on a single attribute is chosen for expanding the current node. The choice of an attribute is normally based on the entropy gains of the attributes. The entropy of an attribute is calculated from class distribution information. For a discrete attribute, class distribution information of each value of the attribute is required (e.g. Outlook at the root of the decision tree shown in Figure 1). Once the class distribution information of all the attributes is gathered, each attribute is evaluated in terms of either entropy [Qui93] or Gini Index [BFOS84]. The best attribute is selected as a test for the node expansion.
5.1.2.1.1.2. C4.5 Algorithm: The C4.5 algorithm generates a classification decision tree for the given training data set by recursively partitioning the data. The decision tree is grown using a depth-first strategy. The algorithm considers all the possible tests that can split the data set and selects the test that gives the best information gain. For each discrete attribute, one test with as many outcomes as the number of distinct values of the attribute is considered. For each continuous attribute, binary tests involving every distinct value of the attribute are considered. In order to gather the entropy gain of all these binary tests efficiently, the training data set belonging to the node in consideration is sorted for the values of the continuous attribute, and the entropy gains of the binary cuts based on each distinct value are calculated in one scan of the sorted data. This process is repeated for each continuous attribute.
Recently proposed classification algorithms SLIQ [MAR96] and SPRINT [SAM96] avoid costly sorting at each node by pre-sorting continuous attributes once in the beginning.
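The entropy-gain test selection used by Hunt's method and C4.5 can be sketched as follows (an illustrative Python fragment; the records and the two candidate tests are assumed for the example):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of the class distribution over a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(records, labels, test):
    """Entropy gain of a test whose outcomes partition the cases."""
    parts = {}
    for record, label in zip(records, labels):
        parts.setdefault(test(record), []).append(label)
    remainder = sum(len(p) / len(labels) * entropy(p) for p in parts.values())
    return entropy(labels) - remainder

# Assumed training cases: (outlook, humidity) with play/don't-play labels.
records = [("sunny", 85), ("sunny", 70), ("overcast", 78), ("rain", 96)]
labels = ["no", "yes", "yes", "no"]

# One multi-way test on the discrete attribute, one binary cut on the
# continuous attribute (C4.5 would evaluate a cut at every distinct value).
tests = {
    "outlook": lambda r: r[0],
    "humidity <= 81": lambda r: r[1] <= 81,
}
best = max(tests, key=lambda name: information_gain(records, labels, tests[name]))
print(best)  # the test with the highest gain is chosen to expand the node
```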


5.1.2.1.1.3. SPRINT Algorithm: In SPRINT, each continuous attribute is maintained in a sorted attribute list. In this list, each entry contains a value of the attribute and its corresponding record id. Once the best attribute to split a node in a classification tree is determined, each attribute list has to be split according to the split decision. A hash table, of the same order as the number of training cases, has the mapping between record ids and where each record belongs according to the split decision. Each entry in the attribute list is moved to a classification tree node according to the information retrieved by probing the hash table. The sorted order is maintained as the entries are moved in pre-sorted order.
Decision trees are usually built in two steps. First, an initial tree is built till the leaf nodes belong to a single class only. Second, pruning is done to remove any overfitting to the training data. Typically, the time spent on pruning for a large dataset is a small fraction, less than 1%, of the initial tree generation.
Advantages are that decision trees are inexpensive to construct, easy to interpret, easy to integrate with commercial databases, and they yield better accuracy. Disadvantages are that the sequential formulation cannot handle larger data sets, that is, it suffers from memory limitations, and it has low computational speed.

5.1.2.1.2. Parallel Formulation of Decision Tree based Classification
The goal of parallel formulations of decision tree based classification algorithms is scalability in both runtime and memory requirements. The parallel formulation overcomes the memory limitation faced by the sequential algorithms, that is, it should make it possible to handle larger data sets without requiring redundant disk I/O. Also, parallel formulations offer good speedup over the serial algorithm.
The types of parallel formulations for classification decision tree construction are:
• Synchronous Tree Construction Approach
• Partitioned Tree Construction Approach
• Hybrid Parallel Formulation

5.1.2.1.2.1. Synchronous Tree Construction Approach
In this approach, all processors construct a decision tree synchronously by sending and receiving class distribution information of local data. The major steps for the approach are shown below (a small simulation of the reduction in steps 2-4 appears at the end of this subsection):
1. Select a node to expand according to a decision tree expansion strategy (e.g. Depth-First or Breadth-First), and call that node the current node. At the beginning, the root node is selected as the current node.
2. For each data attribute, collect class distribution information of the local data at the current node.
3. Exchange the local class distribution information using global reduction [KGGK94] among processors.
4. Simultaneously compute the entropy gains of each attribute at each processor and select the best attribute for child node expansion.
5. Depending on the branching factor of the tree desired, create child nodes for the same number of partitions of attribute values, and split training cases accordingly.
6. Repeat the above steps (1-5) until no more nodes are available for expansion.

Figure 2: Synchronous Tree Construction Approach with Depth-First Expansion Strategy

In Figure 2 the root node has already been expanded and the current node is the leftmost child of the root (as shown in the top part of the figure). All four processors cooperate to expand this node to have two child nodes. Next, the leftmost node of these child nodes is selected as the current node (in the bottom of the figure) and all four processors again cooperate to expand the node. The advantage of this approach is that it does not require any movement of the training data items. The disadvantages are that this algorithm suffers from high communication cost and load imbalance. For each node in the decision tree, after collecting the class distribution information, all the processors need to synchronize and exchange the distribution information. Hence, as the tree deepens, the communication overhead dominates the overall processing time. The other problem is due to load imbalance. Even though each processor starts out with the same number of training data items, the number of items belonging to the same node of the decision tree can vary substantially among processors.
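The reduction in steps 2-4 can be pictured with a small single-process simulation (our own sketch; a real implementation would exchange these counts with message passing, e.g. an MPI-style global reduction, rather than a Python loop):

```python
from collections import Counter

# Four simulated "processors", each holding a local partition of the
# training cases at the current node: (attribute_value, class_label).
local_data = [
    [("sunny", "no"), ("rain", "yes")],       # processor 0
    [("sunny", "yes"), ("overcast", "yes")],  # processor 1
    [("rain", "no"), ("rain", "yes")],        # processor 2
    [("overcast", "yes"), ("sunny", "no")],   # processor 3
]

# Step 2: each processor builds class-distribution counts from local data.
local_counts = [Counter(partition) for partition in local_data]

# Step 3: global reduction -- in a real system this is a message-passing
# allreduce; here it is simulated by summing the local counters.
global_counts = sum(local_counts, Counter())

# Step 4: every processor now holds the same global distribution, so all
# of them compute identical entropy gains and select the same attribute.
print(global_counts.most_common(2))
```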


5.1.2.1.2.2. Partitioned Tree Construction Approach
In this approach, whenever feasible, different processors work on different parts of the classification tree. In particular, if more than one processor cooperates to expand a node, then these processors are partitioned to expand the successors of this node. Consider the case in which a group of processors Pn cooperates to expand node n. The algorithm consists of the following steps:
Step 1: Processors in Pn cooperate to expand node n using the method described above.
Step 2: Once the node n is expanded into successor nodes n1, n2, ..., nk, the processor group Pn is also partitioned, and the successor nodes are assigned to processors as follows:
Case 1: If the number of successor nodes is greater than |Pn|,
1. Partition the successor nodes into |Pn| groups such that the total number of training cases corresponding to each node group is roughly equal. Assign each processor to one node group.
2. Shuffle the training data such that each processor has the data items that belong to the nodes it is responsible for.
3. Now the expansion of the subtrees rooted at a node group proceeds completely independently at each processor, as in the serial algorithm.
Case 2: Otherwise (if the number of successor nodes is less than |Pn|),
1. Assign a subset of processors to each node such that the number of processors assigned to a node is proportional to the number of training cases corresponding to the node.
2. Shuffle the training cases such that each subset of processors has the training cases that belong to the nodes it is responsible for.
3. Processor subsets assigned to different nodes develop subtrees independently. Processor subsets that contain only one processor use the sequential algorithm to expand the part of the classification tree rooted at the node assigned to them. Processor subsets that contain more than one processor proceed by following the above steps recursively.

Figure 3: Partitioned Tree Construction Approach

At the beginning, all processors work together to expand the root node of the classification tree. At the end, the whole classification tree is constructed by combining the subtrees of each processor. Figure 3 shows an example. First (at the top of the figure), all four processors cooperate to expand the root node just like they do in the synchronous tree construction approach. Next (in the middle of the figure), the set of four processors is partitioned in three parts. The leftmost child is assigned to processors 0 and 1, while the other nodes are assigned to processors 2 and 3, respectively. Now these sets of processors proceed independently to expand their assigned nodes. In particular, processor 2 and processor 3 proceed to expand their part of the tree using the serial algorithm. The group containing processors 0 and 1 splits the leftmost child node into three nodes. These three new nodes are partitioned in two parts (shown in the bottom of the figure); the leftmost node is assigned to processor 0, while the other two are assigned to processor 1. From now on, processors 0 and 1 also independently work on their respective subtrees.
Advantages:
• The advantage of this approach is that once a processor becomes solely responsible for a node, it can develop a subtree of the classification tree independently without any communication overhead.
Disadvantages:
• The first disadvantage is that it requires data movement after each node expansion until one processor becomes responsible for an entire subtree. The communication cost is expensive in the expansion of the upper part of the classification tree.


• The second disadvantage is poor load balancing inherent in the algorithm. Assignment of nodes to processors is done based on the number of training cases in the successor nodes. However, the number of training cases associated with a node does not necessarily correspond to the amount of work needed to process the subtree rooted at the node.

5.1.2.1.2.3. Hybrid Parallel Formulation
The hybrid parallel formulation has elements of both schemes. The Synchronous Tree Construction Approach incurs a high communication overhead as the frontier gets larger. The Partitioned Tree Construction Approach incurs the cost of load balancing after each step. The hybrid scheme keeps continuing with the first approach as long as the communication cost incurred by the first formulation is not too high. Once this cost becomes high, the processors as well as the current frontier of the classification tree are partitioned into two parts. The description assumes that the number of processors is a power of 2, and that these processors are connected in a hypercube configuration. The algorithm can be appropriately modified if P is not a power of 2. Also, this algorithm can be mapped on to any parallel architecture by simply embedding a virtual hypercube in the architecture.

Figure 4: The computation front during the computation phase

Figure 5: Binary partitioning of the tree to reduce communication costs

5.1.2.2. Classification Using Neural Network:
In supervised learning, we are given a set of example pairs (x, y), x ∈ X, y ∈ Y, and the aim is to find a function f in the allowed class of functions that matches the examples. In other words, we wish to infer the mapping implied by the data. The cost function is related to the mismatch between our mapping and the data, and it implicitly contains prior knowledge about the problem domain. Tasks that fall within the paradigm of supervised learning are pattern recognition (also known as classification) and regression (also known as function approximation). The supervised learning paradigm is also applicable to sequential data (e.g., for speech and gesture recognition).
With respect to the above specification, the following assumptions have been considered:
(1) Multi-Layer Perceptron: the simple feed-forward neural network is actually called a multilayer perceptron (MLP). An MLP is a network of perceptrons. The neurons are placed in layers with outputs always flowing toward the output layer. If only one layer exists, it is called a perceptron. If multiple layers exist, it is an MLP.
(2) Back Propagation algorithm: a learning technique that adjusts the weights in a neural network by propagating weight changes backward from the sink to the source nodes.

Figure 6: The typical structure of a back propagation network
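The back-propagation scheme just described can be sketched numerically (a NumPy-based toy; the XOR data, layer sizes, learning rate and iteration count are illustrative assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)   # XOR targets

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)     # input -> middle layer
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)     # middle -> output layer
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5

for _ in range(10000):
    hidden = sigmoid(X @ W1 + b1)            # forward: flow toward the output layer
    out = sigmoid(hidden @ W2 + b2)
    err_out = (out - y) * out * (1 - out)    # backward: sink -> source
    err_hid = (err_out @ W2.T) * hidden * (1 - hidden)
    W2 -= lr * (hidden.T @ err_out)
    b2 -= lr * err_out.sum(axis=0)
    W1 -= lr * (X.T @ err_hid)
    b1 -= lr * err_hid.sum(axis=0)

print(out.round().ravel())  # tends toward [0. 1. 1. 0.] after training
```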


Advantages of Neural Network:
• Artificial neural networks make no assumptions about the nature of the distribution of the data and are therefore not biased in their analysis. Instead of making assumptions about the underlying population, neural networks with at least one middle layer use the data to develop an internal representation of the relationship between the variables.
• Since time-series data are dynamic in nature, it is necessary to have non-linear tools in order to discern relationships among time-series data. Neural networks are best at discovering nonlinear relationships.
• Neural networks perform well with missing or incomplete data. Whereas traditional regression analysis is not adaptive, typically processing all older data together with new data, neural networks adapt their weights as new input data becomes available.
Disadvantages of Neural Network:
• No estimation or prediction errors are calculated with an artificial neural network.
• Artificial neural networks are "black boxes," for it is impossible to figure out how the relations in the hidden layers are estimated.
Tasks of Neural Network:
The tasks to which artificial neural networks are applied tend to fall within the following broad categories:
• Function approximation, or regression analysis, including time series prediction and modeling.
• Classification, including pattern and sequence recognition, novelty detection and sequential decision making.
• Data processing, including filtering, clustering, blind source separation and compression.

5.1.2.3. Classification using Genetic Algorithm:
Genetic algorithms are heuristic optimization methods whose mechanisms are analogous to biological evolution. In a Genetic Algorithm, the solutions are called individuals or chromosomes. After the initial population is generated randomly, selection and variation functions are executed in a loop until some termination criterion is reached. Each run of the loop is called a generation. The selection operator is intended to improve the average quality of the population by giving individuals of higher quality a higher probability to be copied into the next generation. The quality of an individual is measured by a fitness function.
5.1.2.3.1. Genetic Operators
The genetic algorithm uses crossover and mutation operators to generate the offspring of the existing population. Before the genetic operators are applied, parents have to be selected for evolution to the next generation. The crossover and mutation algorithms are then used to produce the next generation. The probability of deploying the crossover and mutation operators can be changed by the user. In each generation, WTSD has been used as the fitness function.
5.1.2.3.2. End Condition
A GA needs an end condition to end the generation process. If there is no sufficient improvement in two or more consecutive generations, the GA process is stopped. In other cases, a time limitation can be used as a criterion for ending the process.
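The generation loop described above can be sketched as a toy Python program (the bit-string encoding, tournament selection and the ones-counting fitness are stand-ins; in particular the paper's WTSD fitness function is left unspecified, so a trivial fitness is assumed):

```python
import random

random.seed(1)
GENOME_LEN, POP_SIZE, P_MUT = 20, 30, 0.02

def fitness(ind):
    # Stand-in fitness: number of 1-bits (the paper's WTSD is unspecified).
    return sum(ind)

def select(pop):
    # Tournament selection: higher-quality individuals are more likely
    # to contribute offspring to the next generation.
    return max(random.sample(pop, 3), key=fitness)

def crossover(a, b):
    # One-point crossover produces an offspring from two parents.
    cut = random.randrange(1, GENOME_LEN)
    return a[:cut] + b[cut:]

def mutate(ind):
    # Flip each bit with a small, user-adjustable probability.
    return [bit ^ 1 if random.random() < P_MUT else bit for bit in ind]

population = [[random.randint(0, 1) for _ in range(GENOME_LEN)]
              for _ in range(POP_SIZE)]
best, stale = -1, 0
while stale < 2:  # end condition: no improvement in 2 consecutive generations
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POP_SIZE)]
    top = max(map(fitness, population))
    best, stale = (top, 0) if top > best else (best, stale + 1)
print(best)  # best fitness reached before the run stagnated
```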
5.2 Algorithm for Discovering Associations:
5.2.1 Objective: The objective is to discover the associations present in the data. The problem was formulated originally in the context of the transaction data at a supermarket. This market basket data, as it is popularly known, consists of transactions made by each customer. Each transaction contains items bought by the customer. The goal is to see if the occurrence of certain items in a transaction can be used to deduce the occurrence of other items, or in other words, to find associative relationships between items. Traditionally, association models are used to discover business trends by analyzing customer transactions. However, they can also be used effectively to predict Web page accesses for personalization. For example, assume that after mining the Web access log, Company X discovered an association rule "A and B implies C," with 80% confidence, where A, B, and C are Web page accesses. If a user has visited pages A and B, there is an 80% chance that he/she will visit page C in the same session. Page C may or may not have a direct link from A or B. This information can be used to create a dynamic link to page C from pages A or B so that the user can "click-through" to page C directly. This kind of information is particularly valuable for a Web server supporting an e-commerce site, to link the different product pages dynamically based on the customer interaction.
5.2.2 Type:
5.2.2.1. Parallel Algorithm for Discovering Associations: The problem can be stated as: given a set of items, association rules predict the occurrence of some other set of items with a certain degree of confidence. The goal is to discover all such interesting rules. There are several properties of association models that can be calculated.
5.2.2.2. Sequential Algorithm for finding Association: The concept of association rules can be generalized and made more useful by observing another fact about transactions. All transactions have a timestamp associated with them, i.e. the time at which the transaction occurred. If this information can be put to use, one can find relationships such as: if a customer bought a book today, then he/she is likely to buy a book in a few days time. The usefulness of this kind of rule gave birth to the problem of discovering sequential patterns or sequential associations. In general, a sequential pattern is a sequence of item-sets with various timing constraints imposed on the occurrences of items appearing in the pattern.
Example: Consider the instance that A, B, C, D are sets of transactions such that (A) (C,B) (D) encodes a relationship that event D occurs after an event-set (C,B), which in turn occurs after event A. Prediction of events, or identification of sequential rules that characterize different parts of the data, are some example applications of sequential patterns. Such patterns are not only important because they represent more powerful and predictive relationships, but they are also important


from the algorithmic point of view. Bringing in the sequential relationships increases the combinatorial complexity of the problem enormously. The reason is that the maximum number of sequences having k events is O(m^k 2^(k-1)), where m is the total number of distinct events in the input data. In contrast, there are only C(m, k) (m choose k) size-k item-sets possible while discovering non-sequential associations from m distinct items.
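The support/confidence arithmetic behind a rule such as "A and B implies C" can be shown directly on market-basket style data (a minimal Python illustration over hypothetical transactions):

```python
# Hypothetical market-basket data: one set of page accesses per session.
transactions = [
    {"A", "B", "C"}, {"A", "B", "C"}, {"A", "B", "C"},
    {"A", "B", "C"}, {"A", "B"}, {"B", "C"}, {"A", "C"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent), the strength of the rule."""
    return support(antecedent | consequent) / support(antecedent)

# Rule "A and B implies C": confidence = support({A,B,C}) / support({A,B}).
print(confidence({"A", "B"}, {"C"}))  # 4/5 = 0.8, i.e. 80% confidence
```

A sequential-pattern miner additionally orders the item-sets by timestamp, which is what inflates the search space as noted above.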
5.3. Clustering Algorithm:
5.3.1. Objective: Clustering is a division of data into groups of similar objects. Each group, called a cluster, consists of objects that are similar to one another and dissimilar to objects of other groups. Representing data by fewer clusters necessarily loses certain fine details (akin to lossy data compression), but achieves simplification: it represents many data objects by few clusters, and hence, it models data by its clusters. Data modelling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis.
5.3.2. Clustering Algorithms:
Clustering Algorithms are classified into the following two methods:
5.3.2.1. Hierarchical Methods: Hierarchical clustering builds a cluster hierarchy or, in other words, a tree of clusters, also known as a dendrogram. Every cluster node contains child clusters; sibling clusters partition the points covered by their common parent. Such an approach allows exploring data on different levels of granularity. Hierarchical clustering methods are categorized into agglomerative (bottom-up) and divisive (top-down).
An agglomerative clustering starts with one-point (singleton) clusters and recursively merges two or more of the most appropriate clusters.
A divisive clustering starts with one cluster of all data points and recursively splits the most appropriate cluster. The process continues until a stopping criterion (frequently, the requested number k of clusters) is achieved.
Advantages are: 1) embedded flexibility regarding the level of granularity; 2) ease of handling of any forms of similarity or distance; 3) consequently, applicability to any attribute types. Disadvantages are: 1) vagueness of termination criteria; 2) the fact that most hierarchical algorithms do not revisit once-constructed (intermediate) clusters with the purpose of their improvement.
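A bare-bones agglomerative (bottom-up) merge loop looks like the following (our own single-linkage sketch over made-up one-dimensional points; a document-clustering system would substitute a document similarity measure):

```python
# Hypothetical 1-D data points; each starts in its own singleton cluster.
points = [1.0, 1.2, 5.0, 5.1, 9.0]
clusters = [[p] for p in points]

def single_link(c1, c2):
    """Distance between clusters = closest pair of their points."""
    return min(abs(a - b) for a in c1 for b in c2)

k = 2  # stopping criterion: the requested number of clusters
while len(clusters) > k:
    # Find and merge the two most appropriate (closest) clusters.
    i, j = min(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
    )
    clusters[i] += clusters.pop(j)

print(clusters)  # [[1.0, 1.2, 5.0, 5.1], [9.0]] for k = 2
```

A divisive method would run the loop in the other direction, starting from one all-encompassing cluster and repeatedly splitting it.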
5.3.2.2 Partitioning Methods: Data partitioning algorithms divide data into several subsets. Since checking all possible subset systems is computationally infeasible, certain greedy heuristics are used in the form of iterative optimization. Specifically, this means different relocation schemes that iteratively reassign points between the k clusters. Unlike traditional hierarchical methods, in which clusters are not revisited after being constructed, relocation algorithms gradually improve clusters. With appropriate data, this results in high quality clusters. One approach to data partitioning is to take a conceptual point of view that identifies the cluster with a certain model whose unknown parameters have to be found. More specifically, probabilistic models assume that the data comes from a mixture of several populations whose distributions and priors we want to find. One advantage of probabilistic methods is the interpretability of the constructed clusters. Having a concise cluster representation also allows inexpensive computation of intra-cluster measures that give rise to a global objective function.

6. PROPOSALS

In this section we have made certain proposals that can be implemented as a modification to the existing Text Mining Algorithms as defined in the previous sections.
Association Algorithm: In the Sequential Algorithm, the sequential pattern between the data elements can be determined by associating a timestamp with each data item; this time is assigned based on the arrival time of each data item. If this information can be put to use, one can find relationships such as: if a customer bought a book today, then he/she is likely to buy a book in a few days time.
However, this technique of finding the sequential pattern between the data items is more applicable if we are considering a dynamic data set, where the number of data items in a dataset varies dynamically with time.
Advantages:
• Every data item can be uniquely identified. There is less possibility of overlapping.
• The mechanism is simple to implement and incurs less overhead than assigning a timestamp to each data item.
• This technique is suitable when considering a static data set.
Disadvantages:
• When using a very large dataset, it is required to generate a unique sequence number for each item; this may not always be feasible.
• There lies a chance that more than one same item may be assigned the same sequence number belonging to a different dataset.
Clustering Algorithm: In the Hierarchical Clustering Algorithm, there are two approaches


namely (a) the Top-Down Approach, in which a single cluster is divided into smaller clusters, and (b) the Bottom-Up Approach, in which several smaller clusters are merged into a single cluster.
These two approaches can however be combined into a single approach which is a sandwich of both the Top-Down and the Bottom-Up Approach. In this approach, if we start from a particular cluster, say C1, the cluster C1 at that level, along with the other clusters in the same level, can be combined to form a single cluster; similarly, the cluster C1 can be divided into smaller clusters.
Example: Each level consists of clusters. The top-down approach can be explained as: at the topmost level, there is the cluster country, India. This cluster can be further classified into several clusters containing metropolitan cities. The cities are further divided into clusters called districts. The bottom-up approach will finally generate the cluster INDIA.

Figure 6: Hierarchical Clustering (Country INDIA → Metropolitan Cities → Districts)

Figure 7: Sandwich Approach

The advantage of this approach is that, being a combination of both the top-down and the bottom-up approach, it utilizes the advantages of both approaches; the disadvantage is that it is more complicated than the other two approaches.
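A data-level sketch of this sandwich traversal on the India example is given below (illustrative Python only; the district names and the merge/split rules are our own assumptions):

```python
# Start from the middle level of the hierarchy (the metropolitan cities),
# then move bottom-up (merge) and top-down (split) from that level.
middle_level = {
    "Kolkata": ["North 24 Parganas", "Howrah"],
    "Mumbai": ["Mumbai City", "Mumbai Suburban"],
    "Chennai": ["Chennai District"],
}

# Bottom-up step: all clusters at the current level combine into one parent.
country = {"INDIA": sorted(middle_level)}

# Top-down step: each cluster at the current level splits into its children.
districts = {city: parts for city, parts in middle_level.items()}

print(country)    # {'INDIA': ['Chennai', 'Kolkata', 'Mumbai']}
print(districts)  # city -> list of district clusters
```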
7. WORKFLOW WITH SET OF SAMPLE TEST CASES

1. Implementation of Classification Algorithm:
Input: Newspaper Document and the location.
Design and Output:

Figure 8:

(a) Input: Newspaper Document (Specified topic)
Query: Determine whether the news is a follow-up or fresh news.
Modification Req: Yes

Figure 9:

(b) Input: Newspaper Document
Query: List of IT jobs with specialization in Oracle, with workplace preferably in Kolkata or otherwise
Modification Req: Yes

Figure 10:

3. Implementation of Clustering Algorithm


Figure 11:

Figure 12:

4. Depiction of Combined Approach of Classification, Association and Clustering Algorithm:

Figure 13:

• Consider the Central Publisher as the predefined base class. The base class is classified into the sub classes Kolkata Publisher, Mumbai Publisher and Chennai Publisher, based on the name of the city as the classification condition. (Here I have considered only three cities.)
• The association between the subclasses Bengali, English and Hindi Newspaper is established based on a unique sequence number.
• The agglomerative clustering is used to club the smaller clusters, i.e. the Bengali, English and Hindi newspapers of different cities, into the appropriate clusters Bengali, English and Hindi Newspaper, containing the news of different cities in the specified language.
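The combined flow of the three bullets above can be pictured end-to-end with a small sketch (Python; the newspaper records, the sequence numbers standing in for timestamps, and the grouping rules are illustrative assumptions):

```python
from collections import defaultdict

# Assumed newspaper records: (city, language, headline placeholder).
papers = [
    ("Kolkata", "Bengali", "..."), ("Kolkata", "English", "..."),
    ("Mumbai", "Hindi", "..."), ("Chennai", "English", "..."),
]

# 1. Classification: the Central Publisher base class is split into
#    city-publisher subclasses using the city name as the condition.
by_city = defaultdict(list)
for seq, (city, language, text) in enumerate(papers):
    # 2. Association: each record carries a unique sequence number (seq),
    #    standing in for the timestamp-based sequential association.
    by_city[city + " Publisher"].append((seq, language, text))

# 3. Agglomerative clustering: club the city-level items into
#    language clusters (Bengali/English/Hindi Newspaper).
by_language = defaultdict(list)
for records in by_city.values():
    for seq, language, text in records:
        by_language[language + " Newspaper"].append(seq)

print(dict(by_language))  # e.g. {'Bengali Newspaper': [0], 'English Newspaper': [1, 3], ...}
```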
8. CONCLUSIONS AND DISCUSSIONS

When a user gives a set of words as input for a search of specific information, Google performs the search on the existing documents available on the World Wide Web to find a match for the requisite information as per the user's query. While data mining is typically concerned with the detection of patterns in numeric data, very often important (e.g., critical to business) information is stored in the form of text. Unlike numeric data, text is often amorphous and difficult to deal with. Text mining generally consists of the analysis of (multiple) text documents by extracting key phrases, concepts, etc., and the preparation of the text processed in that manner for further analyses with numeric data mining techniques. A typical (first) goal in data mining is feature extraction, i.e., the identification of the terms and concepts most frequently used in the input documents; a second goal typically is to discover any associations between features (e.g., associations between symptoms as described by patients). Hence, a first step in text mining usually consists of "coding" the information in the input text; as a second step, various methods such as Association Rules algorithms may be applied to determine relations between features.

9. REFERENCES

[1] N. Jovanovic, V. Milutinovic, and Z. Obradovic, "Foundations of Predictive Data Mining", 2002.
[2] Yochanan Shachmurove (Department of Economics, The City College of the City University of New York and The University of Pennsylvania) and Dorota Witkowska (Department of Management, Technical University of Lodz), "Utilizing Artificial Neural Network Model to Predict Stock Markets", CARESS Working Paper 00-11, September 2000.
[3] Margaret H. Dunham, "Data Mining: Introductory and Advanced Topics", Pearson Education, 2003, pages 106-112.
[4] Michael W. Berry and Malu Castellanos (Editors), "Survey of Text Mining: Clustering, Classification, and Retrieval, Second Edition", Springer, September 30, 2007.
[5] http://en.wikipedia.org/wiki/Text_mining
[6] Ying Zhao and George Karypis, "Criterion Functions for Document Clustering: Experiments and Analysis", TR# 01-40, Department of Computer Science & Engineering, University of Minnesota, Minneapolis, 2000.
[7] Keno Buss, "Mining and Summarizing Customer Reviews", STRL, De Montfort University.


[8] Abdul-Baquee M. Sharaf, "The Qur'an Annotation for Text Mining", School of Computing, December 2009.
[9] Suman Chakraborty, Sudipta Roy, and Samir K. Bandyopadhyay, "Image Steganography Using DNA Sequence and Sudoku Solution Matrix", International Journal of Advanced Research in Computer Science and Software Engineering, Volume 2, Issue 2, February 2012.
[10] Minqing Hu and Bing Liu, "Mining and Summarizing Customer Reviews", Department of Computer Science, University of Illinois at Chicago, 851 South Morgan Street, Chicago, IL 60607-7053.
[11] Anthony Don, Elena Zheleva, Machon Gregory, Sureyya Tarkan, Loretta Auvil, Tanya Clement, Ben Shneiderman and Catherine Plaisant, "Discovering interesting usage patterns in text collections: Integrating text mining with visualization", http://hcil2.cs.umd.edu/trs/2007-08/2007-08.pdf
[12] http://store.elsevier.com/Practical-Text-Mining-and-Statistical-Analysis-for-Non-structured-Text-Data-Applications/Gary-Miner/isbn-9780123869791/
[13] http://www.autonlab.org/tutorials/
[14] Un Yong Nahm, "Text Mining with Information Extraction", National Science Foundation.
[15] Louise Francis, FCAS, MAAA, and Matt Flynn, "Text Mining Handbook", Casualty Actuarial Society E-Forum, Spring 2010.
[16] http://www.amazon.com/Principles-Adaptive-Computation-Machine-Learning/dp/026208290X
[17] Un Yong Nahm, Mikhail Bilenko and Raymond J. Mooney, "Two Approaches to Handling Noisy Variation in Text Mining", Proceedings of the ICML-2002 Workshop on Text Learning, pp. 18-27, Sydney, Australia, July 2002.

Mrs. Sayantani Ghosh
She has enrolled herself for a Ph.D. in the Department of Computer Science and Engineering, University of Calcutta, Kolkata, India. She received her B.Sc. degree in Computer Science from Bethune College in the year 2006, under the University of Calcutta. She ranked 1st class 3rd in the same university. She completed her M.Sc. in Computer and Information Science from University College of Science and Technology, University of Calcutta, in 2008. She did her M.Tech. in Computer Science and Engineering from the Dept. of Computer Science and Engineering, University of Calcutta, in the year 2010. Currently she has been working as an Assistant Professor at IERCEM, Institute of Information Technology, under Techno India Group (TIG), for over two years, since 2010.

Sudipta Roy
He is pursuing an M.Tech in the Dept. of Computer Science & Engineering, University of Calcutta, India. He received his B.Sc. (Physics Hons.) from Burdwan University and B.Tech from Calcutta University. He is the author of more than five publications in National and International Journals. His fields of interest are Biomedical Image Analysis, Image Processing, Steganography, Database Management Systems, Data Structures, Artificial Intelligence, Programming Languages, etc.

Samir K Bandyopadhyay
He is Professor of the Dept. of Computer Science & Engineering, University of Calcutta, Kolkata, India. He is Chairman, Science & Engineering Research Support Society (SERSC, Indian Part), Fellow of the Computer Society of India, and Sectional President of ICT of the Indian Science Congress Association, 2008-
