
Module VI

4.1 CLUSTER ANALYSIS

Cluster

A cluster is a group of objects that belong to the same class. In other words, similar objects are grouped
in one cluster and dissimilar objects are grouped in another cluster.

Clustering

Clustering is the process of grouping a set of data objects into multiple groups or clusters so that objects
within a cluster have high similarity, but are very dissimilar to objects in other clusters.

Clustering helps users understand the natural grouping or structure in a data set.

Cluster Analysis

Cluster analysis in data mining means finding groups of objects that are similar to one another within
the group but different from the objects in other groups.

The set of clusters resulting from a cluster analysis can be referred to as a clustering.

Clustering is unsupervised classification; no predefined classes

A good clustering method will produce high-quality clusters with

 High intra-class similarity

 Low inter-class similarity

The quality of a clustering result depends on both the similarity measure used by the method and its
implementation.

4.2 Applications of Clustering

Identification of Cancer Cells: The clustering algorithms are widely used for the identification of
cancerous cells. It divides the cancerous and non-cancerous data sets into different groups.

Search Engines: Search engines also work on the clustering technique. Search results appear based
on the objects closest to the search query. This is done by grouping similar data objects in one group,
far from the dissimilar objects. The accuracy of a query result depends on the quality of the
clustering algorithm used.
Customer Segmentation: It is used in market research to segment the customers based on their choice
and preferences.

Biology: It is used in the biology stream to classify different species of plants and animals using the
image recognition technique.

Land Use: The clustering technique is used to identify areas of similar land use in a GIS
database. This can be very useful for determining the purpose for which a particular piece of land is
most suitable.

Marketing: Finding groups of customers with similar behavior given a large database of customer data
containing their properties and past buying records.

Insurance: Identifying groups of motor insurance policy holders with a high average claim cost,
identifying frauds.

City planning: Identifying groups of houses according to their house type, value and geographical
location.

Earthquake Studies: Clustering observed earthquake epicenters to identify dangerous zones

4.3 REQUIREMENTS

The following are typical requirements of clustering in data mining.

 Scalability

o We need highly scalable clustering algorithms to deal with large databases.

 Ability to deal with different kinds of attributes

o Algorithms should be capable of being applied to any kind of data, such as interval-based


(numerical), categorical, and binary data.

 Discovery of clusters with arbitrary shape

o The clustering algorithm should be capable of detecting clusters of arbitrary shape. It
should not be bounded to distance measures that tend to find only spherical clusters of
small size.

 High dimensionality

o The clustering algorithm should not only be able to handle low-dimensional data but also
high-dimensional data.

 Interpretability
o The clustering results should be interpretable, comprehensible, and usable.
 Ability to deal with noisy data

o Databases contain noisy, missing or erroneous data. Some algorithms are sensitive to
such data and may lead to poor quality clusters.

4.4 CATEGORIES OF CLUSTERING

The major fundamental clustering methods can be classified into the following categories.

Partitioning methods:

 Given a set of n objects, the partitioning method constructs k partitions of the data.
 Each partition will represent a cluster and k ≤ n.
 It divides the data into k groups.
 Each group must contain at least one object.
 Each object must belong to exactly one group.
Most partitioning methods are distance-based.
 For a given number of partitions (k) to construct, a partitioning method creates an initial
partitioning.
 Then it uses an iterative relocation technique to improve the partitioning by moving objects from
one group to another.
 The general criterion of a good partitioning is that objects in the same cluster are “close” or
related to each other, whereas objects in different clusters are “far apart” or very different.
Most applications adopt popular heuristic methods
 k-means algorithm
Where each cluster is represented by the mean value of the objects in the cluster.
 k-medoids algorithm
Where each cluster is represented by one of the objects located near the center of the cluster.
To find clusters with complex shapes and for very large data sets, partitioning-based methods need to
be extended.

Fig: Partitioning clustering method

Hierarchical methods:

 The hierarchical method creates a hierarchical decomposition of the given set of data objects.
 The hierarchy is organized as a tree representation called a dendrogram.

A hierarchical method can be classified as being either agglomerative or divisive, based on how the
hierarchical decomposition is formed.

Agglomerative approach

 This approach is also called the bottom-up approach.


 It starts with each object forming a separate group.
 It successively merges the objects or groups close to one another, until all the groups are merged
into one, or a termination condition holds.

Divisive approach

 This approach is also called the top-down approach


 It starts with all the objects in the same cluster.
 In each successive iteration, a cluster is split into smaller clusters.
 This is done until each object is in its own cluster or the termination condition holds.
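As a brief illustration of the agglomerative (bottom-up) approach described above, here is a minimal
sketch using SciPy (assumed available), which merges the closest groups step by step and can draw the
resulting hierarchy as a dendrogram. The one-dimensional points are invented for illustration.

from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Each object starts as its own group; linkage() successively merges the
# two closest groups until only one remains (the bottom-up approach).
points = [[2], [3], [4], [10], [11], [12], [20]]
Z = linkage(points, method="single")  # single-link: closest-pair distance

dendrogram(Z)  # draw the merge hierarchy as a dendrogram
plt.show()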
Density-based methods:

 This method is based on the notion of density.


 Their general idea is to continue growing a given cluster as long as the density (number of
objects or data points) in the “neighborhood” exceeds some threshold.
 For example, for each data point within a given cluster, the neighborhood of a given radius has
to contain at least a minimum number of points.
 Such a method can be used to filter out noise or outliers and discover clusters of arbitrary shape.
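The best-known method of this kind is DBSCAN. As a short illustration (assuming scikit-learn is
installed), the sketch below uses exactly the two notions just described: eps, the neighborhood radius,
and min_samples, the minimum number of points the neighborhood must contain; the data points are
invented.

from sklearn.cluster import DBSCAN
import numpy as np

# Two dense regions plus one isolated point
X = np.array([[1, 2], [2, 2], [2, 3],
              [8, 7], [8, 8],
              [25, 80]])

# eps = neighborhood radius, min_samples = density threshold
labels = DBSCAN(eps=3, min_samples=2).fit_predict(X)
print(labels)  # [0 0 0 1 1 -1]; the label -1 marks noise/outliers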

Grid-based methods:

 In these methods, the data space is divided into cells that together form a grid.


 Grid-based methods quantize the object space into a finite number of cells that form a grid
structure.
 All the clustering operations are performed on the grid structure (i.e., on the quantized
space).
Advantages
 The major advantage of this method is fast processing time.
 It is dependent only on the number of cells in each dimension in the quantized space.

Model-based methods:

Model-based methods hypothesize a model for each of the clusters and find the best fit of the data to
the given model.

4.5 CHARACTERISTICS OF CLUSTERING TECHNIQUES

4.6 TYPES OF DATA IN CLUSTER ANALYSIS


Different Types of Data

1. Interval-scaled variables
2. Binary variables
3. Nominal or Categorical Variables
4. Variables of mixed types
Interval-Scaled Variables

 Interval-scaled variables are continuous measurements of a roughly linear scale.


 Typical examples include weight and height, latitude and longitude coordinates (e.g., when
clustering houses), and weather temperature.
Binary Variables

 A binary variable is a variable that can take only two values.


 For example, a gender variable can generally take two values, male and female.
Types:
o Symmetric binary variables
o Asymmetric binary variables

Categorical Variables

 Categorical variables are data that can be divided into categories.


 There are two types: nominal and ordinal variables.

Nominal variables

 There is no particular order to its categories.

Example : Gender (male or female) can be in any order.

Ordinal Variables
 An ordinal variable can be discrete or continuous.
 In this order is important.
Example : Temperature (low, medium, high) should be in an order.

Variables of mixed type


 A database may contain all six types of variables: symmetric binary, asymmetric binary,
nominal, ordinal, interval, and ratio.
 A combination of these is called mixed-type variables.
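To make the distinction between symmetric and asymmetric binary variables concrete, here is a small
sketch of the usual contingency-count dissimilarities: the symmetric measure counts the 0/0 matches in
its denominator, while the asymmetric (Jaccard-style) measure ignores them. The example objects are
invented.

def binary_dissimilarity(x, y, symmetric=True):
    # q = positions where both are 1, t = positions where both are 0
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    mismatches = sum(1 for a, b in zip(x, y) if a != b)  # r + s
    # Symmetric:  d = (r + s) / (q + r + s + t)
    # Asymmetric: d = (r + s) / (q + r + s), ignoring 0/0 matches
    denom = q + mismatches + (t if symmetric else 0)
    return mismatches / denom

x = [1, 0, 1, 1, 0]
y = [1, 1, 0, 1, 0]
print(binary_dissimilarity(x, y, symmetric=True))   # 2 / 5 = 0.4
print(binary_dissimilarity(x, y, symmetric=False))  # 2 / 4 = 0.5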

4.7 PARTITIONING METHODS


 The simplest and most fundamental version of cluster analysis is partitioning, which organizes
the objects of a set into several exclusive groups or clusters.

Suppose we are given a data set, D, of n objects, and k, the number of clusters to form. A partitioning
algorithm organizes the objects into k partitions (k ≤ n), where each partition represents a cluster.

It means that it will classify the data into k groups, which satisfy the following requirements:

 Each group contains at least one object


 Each object must belong to exactly one group.
For a given number of partitions (say k), the partitioning method will create an initial partitioning.

Then it uses an iterative relocation technique to improve the partitioning by moving objects from one
group to another.

There are two main categories of partition algorithms:

1. k-means algorithms
2. k-medoid algorithms

4.7.1 k-Means- A Centroid-Based Technique

 Given a data set D, of n objects, and k, the number of clusters to form, a partitioning algorithm
organizes the objects into k partitions, where each partition represents a cluster.
 The k-means algorithm takes the input parameter, k, and partitions a set of n objects into k
clusters.
 The k-means algorithm defines the centroid of a cluster as the mean value of the points within
the cluster.

The k-means algorithm:

Input: D is a data set containing n objects, k is the number of clusters


Output: A set of k clusters
Steps:
1. Choose a value of k, the number of clusters to be formed.

2. Randomly choose k objects from D as the initial cluster centroids.


3. For each of the objects in D do
a. Compute the distance between the current object and each of the k cluster centroids.
b. Assign the current object to the cluster to which it is closest.
4. Compute the “cluster centers” of each cluster. These become the new cluster centroids
5. Repeat steps 3 and 4 until the convergence criterion is satisfied.
Example:
Data set: {2, 4, 10, 12, 3, 20, 30, 11, 25}
Iteration 1
M1 and M2 are two randomly selected centroids (means), where
M1 = 4, M2 = 11
The initial clusters are
C1 = {4}, C2 = {11}
Calculate the Euclidean distance as

D(x, a) = sqrt((x - a)^2) = |x - a|

where D1 is the distance from M1 and D2 is the distance from M2.
Data point   D1   D2   Cluster
2             2    9    C1
4             0    7    C1
10            6    1    C2
12            8    1    C2
3             1    8    C1
20           16    9    C2
30           26   19    C2
11            7    0    C2
25           21   14    C2

In the table, 3 data points are assigned to cluster C1 and the other data points to cluster C2.
Therefore
C1 = {2,4,3}
C2 = {10,12,20,30,11,25}
Iteration 2

Calculate new mean of data points in C1 and C2


Therefore
M1 = (2+4+3)/3 = 3
M2 = (10+12+20+30+11+25) / 6 = 18

Calculating distance and updating clusters


Data point   D1   D2   Cluster
2             1   16    C1
3             0   15    C1
4             1   14    C1
10            7    8    C1
12            9    6    C2
20           17    2    C2
30           27   12    C2
11            8    7    C2
25           22    7    C2
New clusters
C1 = {2,3,4,10}
C2 = {12,20,30,11,25}
Iteration 3

Calculate new mean of data points in C1 and C2


Therefore
M1 = (2+3+4+10) / 4 = 4.75
M2 = (12+20+30+11+25) / 5 = 19.6

Calculating the distance and updating the cluster

Data point   D1      D2     Cluster
2             2.75   17.6    C1
4             0.75   15.6    C1
3             1.75   16.6    C1
10            5.25    9.6    C1
12            7.25    7.6    C1
20           15.25    0.4    C2
30           25.25   10.4    C2
11            6.25    8.6    C1
25           20.25    5.4    C2
New clusters

C1 = {2,3,4,10,12,11}
C2 = {20,30,25}
Iteration 4

Calculate the new mean of the data points in C1 and C2
Therefore
M1 = (2+3+4+10+12+11) / 6 = 7
M2 = (20+30+25) / 3 =25

Calculating distance and updating clusters

Data point   D1   D2   Cluster
2             5   23    C1
4             3   21    C1
3             4   22    C1
10            3   15    C1
12            5   13    C1
11            4   14    C1
20           13    5    C2
30           23    5    C2
25           18    0    C2
New clusters

C1 = {2,3,4,10,12,11}

C2 = {20,30,25}

The data points in clusters C1 and C2 in iteration 4 are the same as the data points in clusters C1 and
C2 of iteration 3.

This means that none of the data points has moved to another cluster.

Also, the means/centroids of these clusters are unchanged, so this becomes the stopping condition for
our algorithm.
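Below is a minimal Python sketch of the algorithm above, run on the same one-dimensional data set
with the same initial centroids M1 = 4 and M2 = 11. It assumes that no cluster ever becomes empty,
which holds for this example.

def k_means_1d(data, centroids, max_iter=100):
    for _ in range(max_iter):
        # Step 3: assign each point to the cluster with the closest centroid
        clusters = [[] for _ in centroids]
        for x in data:
            distances = [abs(x - c) for c in centroids]
            clusters[distances.index(min(distances))].append(x)
        # Step 4: recompute each centroid as the mean of its cluster
        new_centroids = [sum(c) / len(c) for c in clusters]
        # Step 5: stop when the centroids no longer change
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return clusters, centroids

clusters, centroids = k_means_1d([2, 4, 10, 12, 3, 20, 30, 11, 25], [4, 11])
print(clusters)   # [[2, 4, 10, 12, 3, 11], [20, 30, 25]]
print(centroids)  # [7.0, 25.0]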

4.7.2 k-Medoids: A Representative Object-Based Technique

There are three algorithms for K-medoids Clustering:

 PAM (Partitioning around medoids)

 CLARA (Clustering LARge Applications)

 CLARANS ("Randomized" CLARA).

Among these, PAM is known to be the most powerful and is the most widely used.

 PAM represents a cluster by a medoid.


 k-Medoids algorithm was proposed in 1987 by Kaufman and Rousseeuw.

A medoid is the most centrally located object in the cluster, that is, the point whose average
dissimilarity to all the other points in the cluster is minimum.

The dissimilarity of the medoid (Ci) and an object (Pi) is calculated as E = |Pi - Ci|.

Find representative objects, called medoids, in clusters

Algorithm: k-medoids. PAM, a k-medoids algorithm for partitioning based on medoid or central
objects.

Input:
 k: the number of clusters
 D: a data set containing n objects.
Output: a set of k clusters.
Methods
1. Select k objects in D as the initial representative objects (the initial k medoids).
2. Repeat
 Assign each point to the cluster with the closest medoid.
 Randomly select a non-representative object, oi.
 Compute the total cost, S, of swapping a medoid m with oi.
 If S < 0, swap m with oi to form the new set of medoids.
3. Until the convergence criterion is satisfied.
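Here is a compact sketch of the PAM idea in Python, using the absolute difference |Pi - Ci| from above
as the dissimilarity. For clarity it tries every possible swap instead of sampling a single
non-representative object at random, so it is a deterministic variant of the method rather than a
faithful transcription.

import random

def total_cost(data, medoids):
    # Sum over all points of the dissimilarity to the closest medoid
    return sum(min(abs(p - m) for m in medoids) for p in data)

def pam(data, k):
    medoids = random.sample(data, k)  # step 1: initial representative objects
    while True:
        improved = False
        # Step 2: consider swapping each medoid m with a non-medoid o
        for m in list(medoids):
            for o in data:
                if o in medoids:
                    continue
                candidate = [o if x == m else x for x in medoids]
                # Keep the swap only if it lowers the total cost (S < 0)
                if total_cost(data, candidate) < total_cost(data, medoids):
                    medoids, improved = candidate, True
        if not improved:  # step 3: no swap improves the cost any further
            return {m: [p for p in data
                        if m == min(medoids, key=lambda c: abs(p - c))]
                    for m in medoids}

print(pam([2, 4, 10, 12, 3, 20, 30, 11, 25], k=2))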
Advantages:

1. It is simple to understand and easy to implement.


2. k-Medoid Algorithm is fast and converges in a fixed number of steps.
3. PAM is less sensitive to outliers than other partitioning algorithms.
Disadvantages:

The main disadvantage of the k-medoids algorithm is that it is not suitable for clustering non-spherical
(arbitrarily shaped) groups of objects.

4.8 OUTLIERS DETECTION IN CLUSTERING

OUTLIERS

Outliers may represent errors in the data (perhaps a malfunctioning sensor recorded an incorrect data
value) or could be correct data values that are simply much different from the remaining data.

There exist data objects that do not comply with the general behavior or model of the data. Such data
objects, which are grossly different from or inconsistent with the remaining set of data, are called
outliers.

Some clustering techniques do not perform well in the presence of outliers. This problem is
illustrated in the figure below.

Fig: Outlier clustering problem.


Here if three clusters are found (solid line), the outlier will occur in a cluster by itself. However, if two
clusters are found (dashed line), the two (obviously) different sets of data will be placed in one cluster
because they are closer together than the outlier.

Many data mining algorithms try to minimize the influence of outliers or eliminate them altogether.

This, however, could result in the loss of important hidden information because one person’s noise
could be another person’s signal.

Outlier detection and analysis is an interesting data mining task, referred to as outlier mining.

It can be used in fraud detection, for example, by detecting unusual usage of credit cards or
telecommunication services.

In addition, it is useful in customized marketing for identifying the spending behavior of customers with
extremely low or extremely high incomes, or in medical analysis for finding unusual responses to
various medical treatments

Outlier mining can be described as follows:

Given a set of n data points or objects and k, the expected number of outliers, find the top k
objects that are considerably dissimilar, exceptional, or inconsistent with respect to the
remaining data.

The outlier mining problem can be viewed as two sub problems:

 Define what data can be considered as inconsistent in a given data set


 Find an efficient method to mine the outliers so defined.

OUTLIER DETECTION

Outlier detection, or outlier mining, is the process of identifying outliers in a set of data.

Types of outlier detection:

 Statistical Distribution-Based Outlier Detection


 Distance-Based Outlier Detection
 Density-Based Local Outlier Detection
 Deviation-Based Outlier Detection
Some outlier detection techniques are based on statistical methods.

These usually assume that the set of data follows a known distribution and that outliers can be detected
by well-known tests such as discordancy tests.

However, these tests are not very realistic for real-world data because real-world data values may not
follow well-defined distributions.
Also, most of these tests assume a single attribute value, whereas many attributes are involved in
real-world data sets.

Alternative detection techniques may be based on distance measures.
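As a simple illustration of the distance-based idea, the sketch below ranks objects by their average
distance to all other objects and reports the top k as outliers. This is only one of several
distance-based formulations, and the data values are invented.

def top_k_outliers(data, k):
    # Average distance from p to every other object in the data set
    def avg_distance(p):
        return sum(abs(p - q) for q in data) / (len(data) - 1)
    # The k objects farthest (on average) from the rest are the outliers
    return sorted(data, key=avg_distance, reverse=True)[:k]

print(top_k_outliers([2, 3, 4, 10, 11, 12, 30], k=1))  # [30]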

Web Mining
Web mining is the process of using data mining techniques and algorithms to extract information
directly from the Web by extracting it from Web documents and services, Web content, hyperlinks and
server logs. The goal of Web mining is to look for patterns in Web data by collecting and analyzing
information in order to gain insight into trends, the industry and users in general.
Web mining is a branch of data mining concentrating on the World Wide Web as the primary data
source, including all of its components, from Web content and server logs to everything in between. The
contents of data mined from the Web may be a collection of facts that Web pages are meant to contain,
and these may consist of text, structured data such as lists and tables, and even images, video and audio.

Categories of Web mining:

 Web content mining — This is the process of mining useful information from the contents of
Web pages and Web documents, which are mostly text, images and audio/video files.
Techniques used in this discipline have been heavily drawn from natural language processing
(NLP) and information retrieval.
 Web structure mining — This is the process of analyzing the nodes and connection structure of a
website through the use of graph theory. There are two things that can be obtained from this:
the structure of a website in terms of how it is connected to other sites and the document
structure of the website itself, as to how each page is connected.
 Web usage mining — This is the process of extracting patterns and information from server logs
to gain insight on user activity including where the users are from, how many clicked what item
on the site and the types of activities being done on the site.
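As a toy example of Web usage mining, the snippet below counts page requests per client address from
a few invented Apache-style access-log lines; real usage mining applies much richer pattern analysis
to such logs.

from collections import Counter

log_lines = [
    '192.168.1.5 - - [10/Oct/2023:13:55:36] "GET /index.html HTTP/1.1" 200',
    '192.168.1.5 - - [10/Oct/2023:13:56:02] "GET /cart.html HTTP/1.1" 200',
    '10.0.0.7 - - [10/Oct/2023:14:01:11] "GET /index.html HTTP/1.1" 200',
]

# The client address is the first whitespace-separated field of each line
hits_per_ip = Counter(line.split()[0] for line in log_lines)
print(hits_per_ip)  # Counter({'192.168.1.5': 2, '10.0.0.7': 1})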

Text Data Mining


Text data mining can be described as the process of extracting essential data from standard language
text. All the data that we generate via text messages, documents, emails, files are written in common
language text. Text mining is primarily used to draw useful insights or patterns from such data.

The text mining market has experienced exponential growth and adoption over the last few years and is
expected to see further growth and adoption in the coming years. One of the primary reasons behind
the adoption of text mining is higher competition in the business market, with many organizations
seeking value-added solutions to compete with other organizations. With increasing competition in
business and changing customer perspectives, organizations are making huge investments to find
solutions that are capable of analyzing customer and competitor data to improve competitiveness. The
primary sources of data are e-commerce websites, social media platforms, published articles, surveys,
and many more. The larger part of the generated data is unstructured, which makes it challenging and
expensive for organizations to analyze manually. This challenge, together with the exponential growth
in data generation, has led to the growth of analytical tools that can not only handle large volumes of
text data but also help in decision-making. Text mining software empowers a user to draw useful
information from a huge set of available data sources.

Areas of text mining in data mining:


The following are the areas of text mining:

o Information Extraction:
The automatic extraction of structured data, such as entities, relationships among entities, and
attributes describing entities, from an unstructured source is called information extraction.
o Natural Language Processing:
NLP stands for natural language processing. It enables computer software to understand human
language as it is spoken. NLP is primarily a component of artificial intelligence (AI). The
development of NLP applications is difficult because computers generally expect humans to
"speak" to them in a programming language that is accurate, clear, and highly structured,
whereas human speech is often ambiguous and depends on many complex variables,
including slang, social context, and regional dialects.
o Data Mining:
Data mining refers to the extraction of useful data, hidden patterns from large data sets. Data
mining tools can predict behaviors and future trends that allow businesses to make a better
data-driven decision. Data mining tools can be used to resolve many business problems that
have traditionally been too time-consuming.
o Information Retrieval:
Information retrieval deals with retrieving useful data from data that is stored in our systems.
Alternately, as an analogy, we can view search engines that happen on websites such as e-
commerce sites or any other sites as part of information retrieval.

Text Mining Process:


The text mining process incorporates the following steps to extract the data from the document.

o Text transformation
A text transformation is a technique that is used to control the capitalization of the text.
The two major ways of representing a document are:
a. Bag of words
b. Vector space
o Text Pre-processing
Pre-processing is a significant task and a critical step in Text Mining, Natural Language
Processing (NLP), and information retrieval(IR). In the field of text mining, data pre-processing
is used for extracting useful information and knowledge from unstructured text data.
Information Retrieval (IR) is a matter of choosing which documents in a collection should be
retrieved to fulfill the user's need.
o Feature selection:
Feature selection is a significant part of data mining. Feature selection can be defined as the
process of reducing the input of processing or finding the essential information sources. The
feature selection is also called variable selection.
o Data Mining:
Now, in this step, the text mining procedure merges with the conventional process. Classic data
mining techniques are applied to the structured database.
o Evaluate:
Afterward, the results are evaluated. Once the result has been evaluated, it is discarded.
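A minimal sketch of the pre-processing and bag-of-words steps described above, using only the Python
standard library; the stop-word list and the sample document are illustrative, not taken from any
standard corpus.

import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "of", "to", "and", "over"}

def bag_of_words(document):
    # Pre-processing: tokenize, lowercase, and drop stop words
    tokens = re.findall(r"[a-z]+", document.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

doc = "The quick brown fox jumps over the lazy dog."
print(bag_of_words(doc))
# Counter({'quick': 1, 'brown': 1, 'fox': 1, 'jumps': 1, 'lazy': 1, 'dog': 1})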
Text Mining Applications:

The following are some text mining applications:
o Risk Management:
Risk management is a systematic and logical procedure of analyzing, identifying, treating, and
monitoring the risks involved in any action or process in organizations. Insufficient risk analysis
is usually a leading cause of failure. This is particularly true in financial organizations,
where the adoption of risk management software based on text mining technology can effectively
enhance the ability to diminish risk. It enables the administration of millions of sources and
petabytes of text documents and gives the ability to connect the data, which helps to access the
appropriate data at the right time.
o Customer Care Service:
Text mining methods, particularly NLP, are finding increasing significance in the field of
customer care. Organizations are investing in text analytics software to improve their
overall experience by accessing the textual data from different sources such as customer
feedback, surveys, customer calls, etc. The primary objective of text analysis is to reduce the
response time of the organization and help to address customer complaints rapidly
and productively.
o Business Intelligence:
Companies and business firms have started to use text mining strategies as a major aspect of
their business intelligence. Besides providing significant insights into customer behavior and
trends, text mining strategies also help organizations to analyze the strengths and
weaknesses of their opponents, giving them a competitive advantage in the market.
o Social Media Analysis:
Social media analysis helps to track the online data, and there are numerous text mining tools
designed particularly for performance analysis of social media sites. These tools help to monitor
and interpret the text generated via the internet from news, emails, blogs, etc. Text mining
tools can precisely analyze the total number of posts, followers, and likes of your brand on
a social media platform, enabling you to understand the response of the individuals who are
interacting with your brand and content.

Text Mining Approaches in Data Mining:


These are the following text mining approaches that are used in data mining.

1. Keyword-based Association Analysis:

It collects sets of keywords or terms that often occur together and then discovers the association
relationships among them. First, it preprocesses the text data by parsing, stemming, removing stop
words, etc. Once the data has been preprocessed, it applies association mining algorithms. Here, human
effort is not required, so the number of unwanted results and the execution time are reduced.

2. Document Classification Analysis:

Automatic document classification:

This analysis is used for the automatic classification of huge numbers of online text documents, such as
web pages, emails, etc. Text document classification differs from the classification of relational data,
as document databases are not organized according to attribute-value pairs.

Numericizing text:
o Stemming algorithms
A significant pre-processing step before the indexing of input documents is the stemming
of words. The term "stemming" can be defined as the reduction of words to their roots, so that, for
example, different grammatical forms of a word such as "ordering" and "ordered" map to the same
root. The primary purpose of stemming is to ensure that similar words are recognized as such by
the text mining program.
o Support for different languages:
Some operations, such as stemming, synonym handling, and the set of letters that are allowed in
words, are highly language-dependent. Therefore, support for various languages is important.
o Exclude certain characters:
Excluding numbers, specific characters, sequences of characters, or words that are shorter or
longer than a specific number of letters can be done before the indexing of the input documents.
o Include lists, exclude lists (stop-words):
A particular list of words to be indexed can be defined, which is useful when we want to
search for specific words and classify the input documents based on the frequencies with
which those words occur. Additionally, "stop words," that is, terms that are to be excluded
from the indexing, can be defined. Normally, a default list of English stop words
includes "the," "a," "since," etc. These words are used very often in the respective language
but communicate very little information in the document.
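A toy suffix-stripping stemmer illustrating the idea; production systems use established algorithms
such as the Porter stemmer (available, for example, through the NLTK library).

def crude_stem(word):
    # Strip a few common English suffixes to approximate the word's root
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print([crude_stem(w) for w in ["ordering", "ordered", "orders", "order"]])
# ['order', 'order', 'order', 'order']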

Difference between Spatial and Temporal Data Mining


Spatial data mining refers to the process of extracting knowledge, spatial relationships, and
interesting patterns that are not explicitly stored in a spatial database; temporal data mining, on the
other hand, refers to the process of extracting knowledge about the occurrence of events, whether
they follow a random, cyclic, or seasonal variation, etc. Spatial means space, whereas temporal means
time. In this article, we will look at spatial and temporal data mining separately and then discuss the
difference between them.

What is Spatial Data Mining?

The emergence of spatial data and the extensive usage of spatial databases have led to spatial knowledge
discovery. Spatial data mining can be understood as a process that determines interesting and
potentially valuable patterns from spatial databases.

Several tools exist that assist in extracting information from geospatial data. These tools play a vital
role for organizations like NASA, the National Imagery and Mapping Agency (NIMA), the National
Cancer Institute (NCI), and the United States Department of Transportation (USDOT), which tend to
make big decisions based on large spatial data sets.

Earlier, general-purpose data mining tools such as Clementine, See5/C5.0, and Enterprise Miner were
used. These tools were utilized to analyze large commercial databases and were mainly designed for
understanding the buying patterns of customers from the database.

Besides, these general-purpose tools were also used to analyze scientific and engineering data,
astronomical data, multimedia data, genomic data, and web data.
The following specific features of geographical data prevent the use of general-purpose data
mining algorithms:

1. spatial relationships among the variables,


2. spatial structure of errors
3. observations that are not independent
4. spatial autocorrelation among the features
5. non-linear interaction in feature space.

Spatial data must have latitude or longitude, UTM easting or northing, or some other coordinates
denoting a point's location in space. Beyond that, spatial data can contain any number of attributes
pertaining to a place. You can choose the types of attributes you want to describe a place. Government
websites provide a resource by offering spatial data, but you need not be limited to what they have
produced. You can produce your own.

Say, for example, you wanted to log information about every location you've visited in the past week.
This might be useful to provide insight into your daily habits. You could capture your destination's
coordinates and list a number of attributes such as place name, the purpose of visit, duration of visit,
and more. You can then create a shapefile in Quantum GIS or similar software with this information and
use the software to query and visualize the data. For example, you could generate a heatmap of the
most visited places or select all places you've visited within a radius of 8 miles from home.
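For instance, the radius query just mentioned can be sketched directly in Python with the haversine
great-circle formula; the home and place coordinates below are made up for illustration.

import math

def haversine_miles(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (latitude, longitude) points
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

home = (40.7128, -74.0060)
places = {"cafe": (40.7306, -73.9866),
          "office": (40.7580, -73.9855),
          "airport": (40.6413, -73.7781)}

# Select all places within a radius of 8 miles from home
nearby = [name for name, (lat, lon) in places.items()
          if haversine_miles(home[0], home[1], lat, lon) <= 8]
print(nearby)  # ['cafe', 'office']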

Any data can be made spatial if it can be linked to a location, and one can even have spatiotemporal
data linked to locations in both space and time. For example, when geolocating tweets from Twitter in
the aftermath of a disaster, an animation might be generated that shows the spread of tweets from the
epicentre of the event.

Spatial data mining tasks


These are the primary tasks of spatial data mining.

Classification:

Classification determines a set of rules which find the class of the specified object as per its attributes.
Association rules:

Association rules are determined from the data set and describe patterns that frequently occur in the
database.

Characteristic rules:

Characteristic rules describe some parts of the data set.

Discriminant rules:

As the name suggests, discriminant rules describe the differences between two parts of the database,
such as calculating the difference between two cities as per the employment rate.

What is temporal data mining?

Temporal data mining refers to the process of extracting non-trivial, implicit, and potentially
important knowledge from huge sets of temporal data. Temporal data are sequences of a primary data
type, usually numerical values, and temporal data mining deals with gathering useful knowledge from
such data.

With the increase in stored data, interest in finding hidden data has grown rapidly in the last decade.
The search for hidden data has primarily been focused on classifying data, finding relationships, and
data clustering. A major difficulty that arises during the discovery process is treating data with
temporal dependencies: the attributes related to the temporal data present in this type of data set
must be treated differently from other types of attributes. Nevertheless, most data mining techniques
treat temporal data as an unordered collection of events, ignoring its temporal information.

Temporal data mining tasks


o Data characterization and comparison
o Cluster Analysis
o Classification
o Association rules
o Prediction and trend analysis
o Pattern Analysis

Difference between spatial and Temporal data mining

o Definition: Spatial data mining refers to the extraction of knowledge, spatial relationships, and
interesting patterns that are not explicitly stored in a spatial database. Temporal data mining refers
to the extraction of knowledge about the occurrence of an event, whether it follows a random,
cyclic, or seasonal variation, etc.

o Dimension: Spatial data mining deals with space. Temporal data mining deals with time.

o Data: Spatial data mining primarily deals with spatial data such as location and geo-referenced
data. Temporal data mining primarily deals with implicit and explicit temporal content from huge
sets of data.

o Rules: Spatial data mining involves characteristic rules, discriminant rules, evaluation rules, and
association rules. Temporal data mining targets mining new patterns and unknown knowledge that
take the temporal aspects of the data into account.

o Examples: Spatial data mining: finding hotspots or unusual locations. Temporal data mining: an
association rule such as "Any person who buys a motorcycle also buys a helmet" becomes, with the
temporal aspect, "Any person who buys a motorcycle also buys a helmet after that."

Data Mining tools


Data mining is the set of techniques that utilize specific algorithms, statistical analysis, artificial
intelligence, and database systems to analyze data from different dimensions and perspectives.

Data Mining tools have the objective of discovering patterns/trends/groupings among large sets of data
and transforming data into more refined information.

A data mining tool is a framework, such as RStudio or Tableau, that allows you to perform different
types of data mining analysis.

We can run various algorithms, such as clustering or classification, on our data set and visualize the
results. Such a framework provides better insight into our data and the phenomena the data
represent.

The market for data mining tools is shining: a recent report from ReportLinker noted that the
market would top $1 billion in sales by 2023, up from $591 million in 2018.

These are the most popular data mining tools:


1. Orange Data Mining:

Orange is an open-source machine learning and data mining software suite. It supports visualization
and is component-based software written in the Python language, developed at the bioinformatics
laboratory of the Faculty of Computer and Information Science, University of Ljubljana,
Slovenia.

Because the software is component-based, the components of Orange are called "widgets." These widgets
range from preprocessing and data visualization to the assessment of algorithms and predictive
modeling.

Widgets deliver significant functionalities such as:

o Displaying data tables and allowing feature selection


o Data reading
o Training predictors and comparing learning algorithms
o Visualizing data elements, etc.

Besides, Orange provides a more interactive and enjoyable atmosphere than dull analytical tools. It is
quite engaging to operate.

Why Orange?

Data coming into Orange is quickly formatted to the desired pattern, and the widgets can be easily
moved where needed. Orange is quite interesting to users: it allows them to make smarter
decisions in a short time by rapidly comparing and analyzing data. It is a good open-source data
visualization and evaluation tool for beginners and professionals alike. Data mining can be
performed via visual programming or Python scripting. Many analyses are feasible through its visual
programming interface (drag-and-drop connected widgets), and many visual tools are supported,
such as bar charts, scatterplots, trees, dendrograms, and heat maps. A substantial number of
widgets (more than 100) are supported.

The tool has machine learning components, add-ons for bioinformatics and text mining, and it is
packed with features for data analytics. It can also be used as a Python library.

Python scripts can run in a terminal window, in an integrated environment like PyCharm or
PythonWin, or in shells like IPython. Orange consists of a canvas interface onto which the user places
widgets and creates a data analysis workflow. The widgets provide fundamental operations, for
example, reading data, showing a data table, selecting features, training predictors, comparing
learning algorithms, visualizing data elements, etc. Orange runs on Windows, Mac OS X, and a
variety of Linux operating systems. Orange comes with multiple regression and classification
algorithms.

Orange can read documents in native and other data formats. Orange is dedicated to machine learning
techniques for classification, or supervised data mining. There are two types of objects used in
classification: learners and classifiers. Learners take class-labeled data and return a classifier.
Regression methods are very similar to classification in Orange; both are designed for supervised
data mining and require class-labeled data. Ensemble learning combines the predictions of
individual models for a gain in precision. The models can either come from different training data or
use different learners on the same sets of data.

Learners can also be diversified by altering their parameter sets. In Orange, ensembles are simply
wrappers around learners. They act like any other learner. Based on the data, they return models that
can predict the results of any data instance.
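A short scripting sketch of the learner/classifier pattern just described, assuming Orange 3 is
installed (pip install Orange3); the choice of the bundled iris data set and of the tree learner is
illustrative.

import Orange

# Class-labeled data: a learner takes it and returns a classifier (model)
data = Orange.data.Table("iris")
learner = Orange.classification.TreeLearner()
model = learner(data)

# The returned classifier predicts the class of new data instances
print(model(data[:5]))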

2. SAS Data Mining:


SAS stands for Statistical Analysis System. It is a product of the SAS Institute created for analytics and
data management. SAS can mine data, change it, manage information from various sources, and analyze
statistics. It offers a graphical UI for non-technical users.

SAS Data Miner allows users to analyze big data and provides accurate insight for timely decision-
making purposes. SAS has a distributed memory processing architecture that is highly scalable. It is
suitable for data mining, optimization, and text mining purposes.

3. DataMelt Data Mining:

DataMelt is a computation and visualization environment which offers an interactive structure for data
analysis and visualization. It is primarily designed for students, engineers, and scientists. It is also
known as DMelt.

DMelt is a multi-platform utility written in Java. It can run on any operating system that is
compatible with the JVM (Java Virtual Machine). It consists of science and mathematics libraries:

o Scientific libraries:
Scientific libraries are used for drawing the 2D/3D plots.
o Mathematical libraries:
Mathematical libraries are used for random number generation, algorithms, curve fitting, etc.

DMelt can be used for the analysis of the large volume of data, data mining, and statistical analysis. It is
extensively used in natural sciences, financial markets, and engineering.

4. Rattle:

Rattle is a GUI-based data mining tool that uses the R statistical programming language. Rattle
exposes the statistical power of R by offering significant data mining features. Besides its
comprehensive and well-developed user interface, it has an integrated log tab that records the R code
corresponding to any GUI operation.
The data set produced by Rattle can be viewed and edited. Rattle also offers the facility to review the
code, use it for many purposes, and extend the code without any restriction.

5. Rapid Miner:

Rapid Miner is one of the most popular predictive analysis systems, created by the company of the
same name. It is written in the Java programming language. It offers an integrated
environment for text mining, deep learning, machine learning, and predictive analysis.

The tool can be used for a wide range of applications, including business applications,
commercial applications, research, education, training, application development, and machine learning.

Rapid Miner provides the server on-site as well as in public or private cloud infrastructure. It is
based on a client/server model. Rapid Miner comes with template-based frameworks that enable fast
delivery with few errors (which are commonly expected in the manual coding process).
