Cluster Analysis
Cluster
A cluster is a group of objects that belong to the same class. In other words, similar objects are grouped in one cluster and dissimilar objects are grouped in another cluster.
Clustering
Clustering is the process of grouping a set of data objects into multiple groups or clusters so that objects
within a cluster have high similarity, but are very dissimilar to objects in other clusters.
Cluster Analysis
Cluster analysis in data mining means finding groups of objects such that the objects within a group are similar to one another but different from the objects in other groups.
The set of clusters resulting from a cluster analysis can be referred to as a clustering.
Identification of Cancer Cells: Clustering algorithms are widely used for the identification of cancerous cells. They divide the cancerous and non-cancerous data sets into different groups.
Search Engines: Search engines also work on the clustering technique. The search results appear based on the objects closest to the search query; this is done by grouping similar data objects in one group, far from the dissimilar objects. The accuracy of a query result depends on the quality of the clustering algorithm used.
Customer Segmentation: It is used in market research to segment the customers based on their choice
and preferences.
Biology: It is used in the biology stream to classify different species of plants and animals using the
image recognition technique.
Land Use: The clustering technique is used to identify areas of similar land use in a GIS database. This is very useful for determining the purpose for which a particular piece of land is most suitable.
Marketing: Finding groups of customers with similar behavior given a large database of customer data
containing their properties and past buying records.
Insurance: Identifying groups of motor insurance policy holders with a high average claim cost,
identifying frauds.
City planning: Identifying groups of houses according to their house type, value and geographical
location.
4.3 REQUIREMENTS
Scalability
o Many clustering algorithms work well on small data sets; however, a clustering algorithm should also be highly scalable so that it can handle large databases.
Ability to discover clusters of arbitrary shape
o The clustering algorithm should be capable of detecting clusters of arbitrary shape. It should not be bounded to distance measures that tend to find only spherical clusters of small size.
High dimensionality
o The clustering algorithm should be able to handle not only low-dimensional data but also high-dimensional data.
Interpretability
o The clustering results should be interpretable, comprehensible, and usable.
Ability to deal with noisy data
o Databases contain noisy, missing or erroneous data. Some algorithms are sensitive to
such data and may lead to poor quality clusters.
The major fundamental clustering methods can be classified into the following categories.
Partitioning methods:
Given a set of n objects, the partitioning method constructs k partitions of the data.
Each partition will represent a cluster and k ≤ n.
It divides the data into k groups.
Each group must contain at least one object.
Each object must belong to exactly one group.
Most partitioning methods are distance-based.
For a given number of partitions (k) to construct, a partitioning method creates an initial
partitioning.
Then it uses an iterative relocation technique to improve the partitioning by moving objects from
one group to another.
The general criterion of a good partitioning is that objects in the same cluster are “close” or
related to each other, whereas objects in different clusters are “far apart” or very different.
Most applications adopt popular heuristic methods
k-means algorithm
Where each cluster is represented by the mean value of the objects in the cluster.
k-medoids algorithm
Where each cluster is represented by one of the objects located near the center of the cluster.
To find clusters with complex shapes and for very large data sets, partitioning-based methods need to
be extended.
Hierarchical methods:
The hierarchical method creates a hierarchical decomposition of the given set of data objects.
The hierarchical decomposition is represented by a tree structure called a dendrogram.
A hierarchical method can be classified as being either agglomerative or divisive, based on how the
hierarchical decomposition is formed.
Agglomerative approach
Divisive approach
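A brief illustrative sketch of agglomerative (bottom-up) hierarchical clustering and its dendrogram is given below, using SciPy and matplotlib, which are assumed to be available; the sample points and labels are made up.

# Agglomerative hierarchical clustering and its dendrogram (illustrative sketch).
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

points = [[1, 2], [2, 1], [8, 8], [9, 8], [25, 30]]        # toy 2-D objects
merge_tree = linkage(points, method="single")               # bottom-up merging
dendrogram(merge_tree, labels=["a", "b", "c", "d", "e"])    # tree of merges
plt.show()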
Grid-based methods:
Grid-based methods quantize the object space into a finite number of cells that form a grid structure, and perform all clustering operations on this grid.
Model-based methods:
Model-based methods hypothesize a model for each of the clusters, and find the best fit of the data to the given model.
The dissimilarity between data objects is computed differently depending on the type of variable; the main types are:
1. Interval-scaled variables
2. Binary variables
3. Nominal or Categorical Variables
4. Variables of mixed types
Interval-Scaled Variables
Categorical Variables
Nominal variables
Ordinal Variables
An ordinal variable can be discrete or continuous. For ordinal variables, the order of the values is important.
Example: temperature (low, medium, high) has a meaningful order.
Suppose we are given a data set, D, of n objects, and k, the number of clusters to form, a partitioning
algorithm organizes the objects into k partitions (k ≤ n), where each partition represents a cluster.
It means that it will classify the data into k groups, which satisfy the requirements that each group contains at least one object and each object belongs to exactly one group.
Then it uses an iterative relocation technique to improve the partitioning by moving objects from one group to another.
1. k-means algorithms
2. k-medoid algorithms
Given a data set D, of n objects, and k, the number of clusters to form, a partitioning algorithm
organizes the objects into k partitions, where each partition represents a cluster.
The k-means algorithm takes the input parameter, k, and partitions a set of n objects in to k
clusters.
The k-means algorithm defines the centroid of a cluster as the mean value of the points within
the cluster.
D(x, a) = sqrt((x - a)^2) = |x - a|
In the table, 3 data points are added to cluster C1 and other data points are added to cluster C2.
Therefore
C1 = {2,4,3}
C2 = {10,12,20,30,11,25}
Iteration 2
Data point    D1      D2      Cluster
2             2.75    17.6    C1
4             0.75    15.6    C1
3             1.75    16.6    C1
10            5.25     9.6    C1
12            7.25     7.6    C1
20           15.25     0.4    C2
30           25.25    10.4    C2
11            6.25     8.6    C1
25           20.25     5.4    C2
New clusters
C1 = {2,3,4,10,12,11}
C2 = {20,30,25}
Calculate new mean of data points in C1 and C2
Therefore
M1 = (2+3+4+10+12+11) / 6 = 7
M2 = (20+30+25) / 3 =25
Data point    D1    D2    Cluster
2             5     23    C1
4             3     21    C1
3             4     22    C1
10            3     15    C1
12            5     13    C1
11            4     14    C1
20           13      5    C2
30           23      5    C2
25           18      0    C2
Iteration 4
New clusters
C1 = {2,3,4,10,12,11}
C2 = {20,30,25}
The data points in clusters C1 and C2 are the same as in the previous iteration, so the algorithm has converged; the final clusters are C1 = {2, 3, 4, 10, 11, 12} and C2 = {20, 25, 30}.
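As a minimal sketch (not part of the original example's tooling), the plain-Python function below reproduces this 1-D worked example; the initial means of 4 and 12 are an assumption, and any reasonable seeds converge to the same final clusters here.

# A minimal 1-D k-means sketch that reproduces the worked example above.
def kmeans_1d(points, means, max_iter=100):
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest mean.
        clusters = [[] for _ in means]
        for x in points:
            distances = [abs(x - m) for m in means]
            clusters[distances.index(min(distances))].append(x)
        # Update step: recompute each mean from its cluster members.
        new_means = [sum(c) / len(c) if c else m for c, m in zip(clusters, means)]
        if new_means == means:      # converged: the means stop changing
            break
        means = new_means
    return clusters, means

data = [2, 4, 3, 10, 12, 20, 30, 11, 25]
clusters, means = kmeans_1d(data, means=[4.0, 12.0])   # assumed initial means
print(clusters)   # [[2, 4, 3, 10, 12, 11], [20, 30, 25]]
print(means)      # [7.0, 25.0]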
Among the k-medoids algorithms, PAM (Partitioning Around Medoids) is considered the most powerful and is the most widely used.
A medoid is the most centrally located object in the cluster, that is, the point whose average dissimilarity to all the other objects in the cluster is minimum.
The dissimilarity of the medoid(Ci) and object(Pi) is calculated by using E = |Pi - Ci|
Algorithm: k-medoids. PAM, a k-medoids algorithm for partitioning based on medoid or central
objects.
Input:
k: the number of clusters
D: a data set containing n objects.
Output: a set of k clusters.
Method:
1. Select k objects in D as the initial representative objects (the initial k medoids).
2. repeat
assign each remaining point to the cluster with the closest medoid;
randomly select a non-representative object, oi;
compute the total cost, S, of swapping a medoid m with oi;
if S < 0 then swap m with oi to form the new set of k medoids;
3. until the convergence criterion is satisfied.
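To make the listed steps concrete, here is a simplified, hedged sketch of a PAM-style k-medoids procedure for 1-D data with absolute distance; it tries every possible swap instead of a single random one, which is a common simplification rather than the exact PAM procedure.

# Simplified PAM-style k-medoids sketch (1-D data, absolute distance).
import random

def total_cost(points, medoids):
    # Sum of each point's distance to its closest medoid (E = |Pi - Ci|).
    return sum(min(abs(p - m) for m in medoids) for p in points)

def k_medoids(points, k, seed=0):
    random.seed(seed)
    medoids = random.sample(points, k)              # step 1: initial medoids
    cost = total_cost(points, medoids)
    improved = True
    while improved:                                  # step 2: repeat
        improved = False
        for m in list(medoids):
            for o in points:
                if o in medoids:
                    continue
                candidate = [o if x == m else x for x in medoids]
                new_cost = total_cost(points, candidate)
                if new_cost < cost:                  # swap only if total cost drops
                    medoids, cost, improved = candidate, new_cost, True
    # step 3 (converged): assign every point to its closest medoid.
    return {m: [p for p in points if m == min(medoids, key=lambda c: abs(p - c))]
            for m in medoids}

print(k_medoids([2, 4, 3, 10, 12, 20, 30, 11, 25], k=2))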
Advantages:
k-medoids is more robust to noise and outliers than k-means, because a medoid is far less influenced by extreme values than a mean.
Disadvantages:
The main disadvantage of k-medoids algorithms is that they are not suitable for clustering non-spherical (arbitrary-shaped) groups of objects.
OUTLIERS
Outliers may represent errors in the data (perhaps a malfunctioning sensor recorded an incorrect data
value) or could be correct data values that are simply much different from the remaining data.
There exist data objects that do not comply with the general behavior or model of the data. Such data
objects, which are grossly different from or inconsistent with the remaining set of data, are called
outliers.
Some clustering techniques do not perform well in the presence of outliers; this problem is illustrated in the figure.
Many data mining algorithms try to minimize the influence of outliers or eliminate them altogether.
This, however, could result in the loss of important hidden information because one person’s noise
could be another person’s signal.
Outlier detection and analysis is an interesting data mining task, referred to as outlier mining.
It can be used in fraud detection, for example, by detecting unusual usage of credit cards or
telecommunication services.
In addition, it is useful in customized marketing for identifying the spending behavior of customers with
extremely low or extremely high incomes, or in medical analysis for finding unusual responses to
various medical treatments
Given a set of n data points or objects and k, the expected number of outliers, find the top k
objects that are considerably dissimilar, exceptional, or inconsistent with respect to the
remaining data.
OUTLIER DETECTION
Outlier detection, or outlier mining, is the process of identifying outliers in a set of data.
Statistical approaches to outlier detection usually assume that the set of data follows a known distribution and that outliers can be detected by well-known tests such as discordancy tests.
However, these tests are not very realistic for real-world data because real world data values may not
follow well-defined data distributions.
Also, most of these tests assume a single attribute value, whereas many attributes are involved in real-world data sets.
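As a hedged illustration of such a distribution-based test, the sketch below assumes a single numeric attribute that roughly follows a normal distribution and flags values far from the mean; the sensor readings and the z-score cutoff are made-up example values.

# Flag values whose z-score exceeds a cutoff (simple discordancy-style test).
import statistics

def zscore_outliers(values, threshold):
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [v for v in values if stdev and abs(v - mean) / stdev > threshold]

readings = [10.1, 9.8, 10.3, 9.9, 10.0, 42.0, 10.2]     # one faulty sensor value
print(zscore_outliers(readings, threshold=2.0))         # [42.0]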
Web Mining
Web mining is the process of using data mining techniques and algorithms to extract information
directly from the Web by extracting it from Web documents and services, Web content, hyperlinks and
server logs. The goal of Web mining is to look for patterns in Web data by collecting and analyzing
information in order to gain insight into trends, the industry and users in general.
Web mining is a branch of data mining concentrating on the World Wide Web as the primary data
source, including all of its components from Web content, server logs to everything in between. The
contents of data mined from the Web may be a collection of facts that Web pages are meant to contain,
and these may consist of text, structured data such as lists and tables, and even images, video and audio.
Web content mining — This is the process of mining useful information from the contents of
Web pages and Web documents, which are mostly text, images and audio/video files.
Techniques used in this discipline have been heavily drawn from natural language processing
(NLP) and information retrieval.
Web structure mining — This is the process of analyzing the nodes and connection structure of a
website through the use of graph theory. There are two things that can be obtained from this:
the structure of a website in terms of how it is connected to other sites and the document
structure of the website itself, as to how each page is connected.
Web usage mining — This is the process of extracting patterns and information from server logs
to gain insight on user activity including where the users are from, how many clicked what item
on the site and the types of activities being done on the site.
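As a small, hedged illustration of web usage mining, the sketch below counts requests per client and page from a server log; the file name "access.log" and the common log format layout (client IP first, requested URL as the seventh field) are assumptions for the example.

# Count page requests per (client IP, URL) pair from a web server log.
from collections import Counter

def page_hits(log_path):
    hits = Counter()
    with open(log_path) as log:
        for line in log:
            parts = line.split()
            if len(parts) > 6:
                ip, url = parts[0], parts[6]     # client IP and requested page
                hits[(ip, url)] += 1
    return hits

for (ip, url), count in page_hits("access.log").most_common(10):
    print(ip, url, count)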
The text mining market has experienced exponential growth and adoption over the last few years and is expected to see significant growth and adoption in the coming years. One of the primary reasons behind the adoption of text mining is higher competition in the business market; many organizations seek value-added solutions to compete with other organizations. With increasing competition in business and changing customer perspectives, organizations are making huge investments to find solutions that can analyze customer and competitor data to improve competitiveness. The primary sources of data are e-commerce websites, social media platforms, published articles, surveys, and many more. The larger part of the generated data is unstructured, which makes it challenging and expensive for organizations to analyze manually. This challenge, combined with the exponential growth in data generation, has led to the growth of analytical tools that can not only handle large volumes of text data but also support decision-making. Text mining software empowers a user to draw useful information from a huge set of available data sources.
o Information Extraction:
The automatic extraction of structured data such as entities, entity relationships, and attributes describing entities from an unstructured source is called information extraction.
o Natural Language Processing:
NLP stands for Natural Language Processing; it enables computer software to understand human language as it is spoken or written. NLP is primarily a component of artificial intelligence (AI). Developing NLP applications is difficult because computers generally expect humans to "speak" to them in a programming language that is accurate, clear, and exceptionally structured, whereas human speech is rarely that precise: it depends on many complex variables, including slang, social context, and regional dialects.
o Data Mining:
Data mining refers to the extraction of useful data, hidden patterns from large data sets. Data
mining tools can predict behaviors and future trends that allow businesses to make a better
data-driven decision. Data mining tools can be used to resolve many business problems that
have traditionally been too time-consuming.
o Information Retrieval:
Information retrieval deals with retrieving useful data from data that is stored in our systems.
Alternately, as an analogy, we can view search engines that happen on websites such as e-
commerce sites or any other sites as part of information retrieval.
o Text transformation
A text transformation converts a document into a representation that mining algorithms can work with, including normalizing the capitalization of the text. The two major ways of representing documents are:
a. Bag of words
b. Vector Space
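As a brief sketch of these two representations (the documents are made up for illustration), the code below builds a bag of words for each document and then a simple vector-space view over a shared vocabulary.

# Bag-of-words counts and a term-count vector-space representation.
from collections import Counter

docs = ["data mining finds patterns in data",
        "text mining finds patterns in text"]

bags = [Counter(doc.lower().split()) for doc in docs]             # bag of words
vocabulary = sorted({word for bag in bags for word in bag})
vectors = [[bag[word] for word in vocabulary] for bag in bags]    # vector space

print(vocabulary)
print(vectors)    # each document becomes a vector of term counts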
o Text Pre-processing
Pre-processing is a significant task and a critical step in Text Mining, Natural Language
Processing (NLP), and information retrieval(IR). In the field of text mining, data pre-processing
is used for extracting useful information and knowledge from unstructured text data.
Information Retrieval (IR) is a matter of choosing which documents in a collection should be
retrieved to fulfill the user's need.
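The following is a small illustrative pre-processing sketch: lowercasing, tokenizing, stop-word removal, and a crude suffix-stripping "stemmer". The stop-word list and suffix rules are toy assumptions, not a real stemming algorithm.

# Toy text pre-processing: tokenize, drop stop words, strip common suffixes.
import re

STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}
SUFFIXES = ("ing", "ed", "es", "s")

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())          # tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]   # remove stop words
    stems = []
    for token in tokens:                                  # naive stemming
        for suffix in SUFFIXES:
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                token = token[: -len(suffix)]
                break
        stems.append(token)
    return stems

print(preprocess("The miners are mining useful patterns in the documents"))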
o Feature selection:
Feature selection is a significant part of data mining. Feature selection can be defined as the
process of reducing the input of processing or finding the essential information sources. The
feature selection is also called variable selection.
o Data Mining:
Now, in this step, the text mining procedure merges with the conventional process. Classic Data
Mining procedures are used in the structural database.
o Evaluate:
Afterward, the results are evaluated. Once a result has been evaluated, it is either put to use or set aside.
o Applications:
The following are important text mining applications:
o Risk Management:
Risk Management is a systematic and logical procedure of identifying, analyzing, treating, and monitoring the risks involved in any action or process in an organization. Insufficient risk analysis is usually a leading cause of failure. This is particularly true in financial organizations, where the adoption of risk management software based on text mining technology can effectively enhance the ability to reduce risk. It enables the administration of millions of sources and petabytes of text documents and provides the ability to link the data, which helps in accessing the appropriate data at the right time.
o Customer Care Service:
Text mining methods, particularly NLP, are finding increasing significance in the field of customer care. Organizations are investing in text analytics software to improve the overall customer experience by accessing textual data from different sources such as customer feedback, surveys, customer calls, etc. The primary objective of text analysis is to reduce the response time of the organization and help address customer complaints rapidly and productively.
o Business Intelligence:
Companies and business firms have started to use text mining strategies as a major aspect of their business intelligence. Besides providing significant insights into customer behavior and trends, text mining strategies also help organizations analyze the strengths and weaknesses of their competitors, giving them a competitive advantage in the market.
o Social Media Analysis:
Social media analysis helps to track the online data, and there are numerous text mining tools
designed particularly for performance analysis of social media sites. These tools help to monitor
and interpret the text generated via the internet from news, emails, blogs, etc. Text mining tools can precisely analyze the total number of posts, followers, and likes of your brand on a social media platform, which enables you to understand the response of the individuals who are interacting with your brand and content.
Keyword-based association analysis:
It collects sets of keywords or terms that often occur together and then discovers the association relationships among them. First, it pre-processes the text data by parsing, stemming, removing stop words, etc. Once the data has been pre-processed, it invokes association mining algorithms. Here, human effort is not required, so the number of unwanted results and the execution time are reduced.
Document classification analysis:
This analysis is used for the automatic classification of huge numbers of online text documents, such as web pages, emails, etc. Text document classification differs from the classification of relational data, as document databases are not organized according to attribute-value pairs.
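As a hedged sketch of such automatic document classification (scikit-learn is assumed to be installed, and the tiny training set is made up), a bag-of-words representation is fed into a Naive Bayes classifier:

# Train a simple text classifier and predict the class of a new document.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = ["win a free prize now", "lowest price offer, buy now",
              "meeting agenda for monday", "please review the project report"]
train_labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)
print(model.predict(["free offer, claim your prize"]))   # likely ['spam']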
Numericizing text:
o Stemming algorithms
A significant pre-processing step before the indexing of input documents is the stemming of words. The term "stemming" can be defined as the reduction of words to their roots, so that different grammatical forms of a word are treated as the same term. The primary purpose of stemming is to ensure that similar words are recognized as the same word by the text mining program.
o Support for different languages:
There are some highly language-dependent operations such as stemming, synonyms, the letters
that are allowed in words. Therefore, support for various languages is important.
o Exclude certain characters:
Excluding numbers, specific characters, or series of characters, or words that are shorter or
longer than a specific number of letters can be done before the ordering of the input documents.
o Include lists, exclude lists (stop-words):
A particular list of words to be indexed can be defined, which is useful when we want to search for specific words and classify the input documents based on the frequencies with which those words occur. Additionally, "stop words," meaning terms that are to be excluded from the indexing, can be defined. Typically, a default list of English stop words includes "the," "a," "since," and so on. These words are used very often in the respective language but communicate very little information in the document.
The emergence of spatial data and the extensive usage of spatial databases have led to spatial knowledge discovery. Spatial data mining can be understood as a process that discovers interesting and potentially valuable patterns from spatial databases.
Several tools exist that assist in extracting information from geospatial data. These tools play a vital role for organizations like NASA, the National Imagery and Mapping Agency (NIMA), the National Cancer Institute (NCI), and the United States Department of Transportation (USDOT), which tend to make big decisions based on large spatial data sets.
Earlier, some general-purpose data mining tools such as Clementine, See5/C5.0, and Enterprise Miner were used.
These tools were utilized to analyze large commercial databases, and these tools were mainly designed
for understanding the buying patterns of all customers from the database.
Besides, the general-purpose tools were preferably used to analyze scientific and engineering data,
astronomical data, multimedia data, genomic data, and web data.
However, specific features of geographical data prevent the direct use of general-purpose data mining algorithms.
Spatial data must have latitude or longitude, UTM easting or northing, or some other coordinates
denoting a point's location in space. Beyond that, spatial data can contain any number of attributes
pertaining to a place. You can choose the types of attributes you want to describe a place. Government
websites provide a resource by offering spatial data, but you need not be limited to what they have
produced. You can produce your own.
Say, for example, you wanted to log information about every location you've visited in the past week.
This might be useful to provide insight into your daily habits. You could capture your destination's
coordinates and list a number of attributes such as place name, the purpose of visit, duration of visit,
and more. You can then create a shapefile in Quantum GIS or similar software with this information and
use the software to query and visualize the data. For example, you could generate a heatmap of the
most visited places or select all places you've visited within a radius of 8 miles from home.
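To make the radius-query idea concrete, here is an illustrative sketch (the coordinates are made up, and the great-circle haversine formula is used as the distance measure) that selects every stored place within 8 miles of home:

# Select all places within 8 miles of home using the haversine distance.
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    r = 3958.8                                          # Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

home = (40.7128, -74.0060)
places = {"office": (40.7306, -73.9866),
          "gym": (40.6782, -73.9442),
          "airport": (40.6413, -73.7781)}

nearby = {name: c for name, c in places.items() if haversine_miles(*home, *c) <= 8}
print(nearby)    # the office and the gym fall inside the 8-mile radius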
Any data can be made spatial if it can be linked to a location, and one can even have spatiotemporal
data linked to locations in both space and time. For example, when geolocating tweets from Twitter in
the aftermath of a disaster, an animation might be generated that shows the spread of tweets from the
epicentre of the event.
Classification:
Classification determines a set of rules which find the class of the specified object as per its attributes.
Association rules:
Association rules determine rules from the data sets and describe patterns that frequently occur in the database.
Characteristic rules:
Characteristic rules describe the general properties of a selected part of the database, such as the common features of houses in a particular area.
Discriminate rules:
As the name suggests, discriminate rules describe the differences between two parts of the database,
such as calculating the difference between two cities as per employment rate.
Temporal data mining refers to the process of extraction of non-trivial, implicit, and potentially
important information from huge sets of temporal data. Temporal data are sequences of a primary data type, usually numerical values, and temporal data mining deals with gathering useful knowledge from such data.
With the increase in stored data, interest in finding hidden data has soared in the last decade.
The finding of hidden data has primarily been focused on classifying data, finding relationships, and
data clustering. The major drawback that comes during the discovery process is treating data with
temporal dependencies. The attributes related to the temporal data present in this type of dataset must
be treated differently from other types of attributes. However, most data mining techniques treat temporal data as an unordered collection of events, ignoring its temporal ordering.
Spatial data mining primarily deals with spatial data such as location and geo-referenced data, whereas temporal data mining primarily deals with implicit and explicit temporal content drawn from huge sets of data.
Data Mining tools have the objective of discovering patterns/trends/groupings among large sets of data
and transforming data into more refined information.
It is a framework, such as RStudio or Tableau, that allows you to perform different types of data mining analysis.
Such a tool lets us run various algorithms, such as clustering or classification, on a data set and visualize the results, providing better insight into the data and the phenomenon the data represent. Such a framework is called a data mining tool.
The market for data mining tools is booming: a recent report from ReportLinker noted that the market would top $1 billion in sales by 2023, up from $591 million in 2018.
1. Orange:
Orange is a component-based machine learning and data mining software suite. It supports visualization and is written in the Python computing language; it was developed at the bioinformatics laboratory of the Faculty of Computer and Information Science, University of Ljubljana, Slovenia.
As it is component-based software, the components of Orange are called "widgets." These widgets range from pre-processing and data visualization to the assessment of algorithms and predictive modeling.
Besides, Orange provides a more interactive and enjoyable alternative to dull analytical tools, and it is quite exciting to operate.
Why Orange?
Data that comes into Orange is quickly formatted into the desired pattern, and the widgets can easily be moved where needed. Orange is quite interesting to users: it allows them to make smarter decisions in a short time by rapidly comparing and analyzing the data. It is a good open-source data visualization and evaluation tool that suits beginners as well as professionals. Data mining can be performed through visual programming or Python scripting. Many analyses are feasible through its visual programming interface (drag-and-drop connected widgets), and many visual tools are supported, such as bar charts, scatterplots, trees, dendrograms, and heat maps. A substantial number of widgets (more than 100) are supported.
The tool has machine learning components, add-ons for bioinformatics and text mining, and is packed with features for data analytics. It can also be used as a Python library.
Python scripts can run in a terminal window, an integrated environment such as PyCharm or PythonWin, or shells such as IPython. Orange comprises a canvas interface onto which the user places widgets to create a data analysis workflow. The widgets provide fundamental operations, for example reading the data, showing a data table, selecting features, training predictors, comparing learning algorithms, and visualizing data elements. Orange runs on Windows, Mac OS X, and a variety of Linux operating systems, and it comes with multiple regression and classification algorithms.
Orange can read documents in native and other data formats. Orange is dedicated to machine learning techniques for classification, or supervised data mining. There are two types of objects used in classification: learners and classifiers. Learners take class-labeled data and return a classifier. Regression methods are very similar to classification in Orange; both are designed for supervised data mining and require class-labeled data. Ensemble learning combines the predictions of individual models for a gain in precision. The models can either come from different training data or use different learners on the same set of data.
Learners can also be diversified by altering their parameter sets. In Orange, ensembles are simply wrappers around learners; they act like any other learner. Based on the data, they return models that can predict the outcome for any data instance.
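As a short, hedged illustration of this learner/classifier pattern through Orange's Python scripting interface (the Orange 3 API and its bundled "iris" data set are assumed), a learner is applied to class-labeled data and the returned model predicts the class of an instance:

# Orange scripting: a learner takes class-labeled data and returns a classifier.
import Orange

data = Orange.data.Table("iris")                   # bundled class-labeled data set
learner = Orange.classification.TreeLearner()      # a learner
model = learner(data)                              # learners return classifiers
print(model(data[0]))                              # predicted class of one instance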
2. SAS Data Mining:
SAS Data Miner allows users to analyze big data and provides accurate insight for timely decision-making. SAS has a distributed memory processing architecture that is highly scalable, and it is suitable for data mining, optimization, and text mining purposes.
3. DataMelt:
DataMelt is a computation and visualization environment that offers an interactive structure for data analysis and visualization. It is primarily designed for students, engineers, and scientists, and is also known as DMelt.
DMelt is a multi-platform utility written in JAVA. It can run on any operating system which is
compatible with JVM (Java Virtual Machine). It consists of Science and mathematics libraries.
o Scientific libraries:
Scientific libraries are used for drawing the 2D/3D plots.
o Mathematical libraries:
Mathematical libraries are used for random number generation, algorithms, curve fitting, etc.
DMelt can be used for the analysis of large volumes of data, data mining, and statistical analysis. It is extensively used in the natural sciences, financial markets, and engineering.
4. Rattle:
Rattle is a GUI-based data mining tool that uses the R statistical programming language. Rattle exposes the statistical power of R by offering significant data mining features. While Rattle has a comprehensive and well-developed user interface, it also has an integrated log code tab that produces the R code duplicating any GUI operation.
The data set produced by Rattle can be viewed and edited. Rattle also provides the facility to review the code, use it for many purposes, and extend it without restriction.
5. Rapid Miner:
RapidMiner is one of the most popular predictive analytics systems, created by the company of the same name. It is written in the Java programming language and offers an integrated environment for text mining, deep learning, machine learning, and predictive analytics.
The tool can be used for a wide range of applications, including business applications, commercial applications, research, education, training, application development, and machine learning.
RapidMiner provides a server on-site as well as in public or private cloud infrastructure, and it has a client/server model as its base. RapidMiner comes with template-based frameworks that enable fast delivery with few errors (which are common in the manual code-writing process).