Cluster Analysis
• Cluster: a collection of data objects
• Clustering splits data into several subsets. Each subset contains objects that
are similar to one another, and these subsets are called clusters.
Applications of cluster analysis in data mining:
• In many applications, clustering analysis is widely used, such as data analysis, market research,
• It assists marketers to find different groups in their client base and based on the purchasing
• Clustering is also used in tracking applications such as detection of credit card fraud.
• In biology, it can be used to determine plant and animal taxonomies, categorize genes with
similar functionality, and gain insight into structures inherent in populations.
• It helps identify areas of similar land use in an earth observation database, and groups of
houses in a city according to house type, value, and geographical location.
Requirements of Clustering in Data Mining:
• Scalability − We need highly scalable clustering algorithms to deal with large databases.
• Ability to deal with different kinds of attributes − Algorithms should be applicable to any
kind of data, such as interval-based (numerical), categorical, and binary data.
• Discovery of clusters with arbitrary shape − The clustering algorithm should be capable of detecting
clusters of arbitrary shape. It should not be limited to distance measures that tend to find only
small spherical clusters.
• High dimensionality − The clustering algorithm should be able to handle not only low-dimensional data
but also high-dimensional data.
• Ability to deal with noisy data − Databases contain noisy, missing or erroneous data. Some algorithms
are sensitive to such data and may lead to poor quality clusters.
Typical examples include weight and height, latitude and longitude coordinates and
weather temperature.
• Binary Variables- A binary variable is a variable that can take only 2 values: 0 when the
attribute is absent and 1 when it is present.
For example, for a binary variable recording bike ownership, 1 means the customer has a bike
and 0 means the customer doesn’t have a bike.
• Nominal or Categorical Variables- A generalization of the binary variable in that it
can take more than 2 states, e.g., a map uses more than two colors to indicate
states, such as red, yellow, blue, and green.
• Variables Of Mixed Type- A database may contain all six types of variables:
symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio.
Combined, these are called mixed-type variables.
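To illustrate why the variable type matters for clustering, here is a minimal Python sketch of a dissimilarity between two records with mixed-type attributes. It is an assumption for illustration, not a method given in these notes: interval values are compared by scaled absolute difference, while binary and nominal values count as a plain match or mismatch. The function name, the type labels, and the customer data are all hypothetical.

```python
def mixed_dissimilarity(a, b, kinds, ranges):
    """Average per-attribute dissimilarity between two records.

    kinds[i] is "interval", "binary", or "nominal" (labels assumed here).
    ranges[i] is the value range used to scale interval attributes;
    it is ignored for the other kinds.
    """
    total = 0.0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if kind == "interval":
            total += abs(x - y) / rng          # scaled into [0, 1]
        else:
            total += 0.0 if x == y else 1.0    # match / mismatch
    return total / len(kinds)


# Hypothetical customers: (height in cm, owns a bike, map colour)
kinds = ["interval", "binary", "nominal"]
ranges = [50.0, None, None]
alice = (170.0, 1, "red")
bob = (180.0, 0, "red")

d = mixed_dissimilarity(alice, bob, kinds, ranges)
print(round(d, 3))
```

Each attribute contributes a value between 0 and 1, so no single type dominates the result just because of its units.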
Clustering Methods:
• Partitioning Method
• Hierarchical Method
• Density-based Method
• Grid-Based Method
• Model-Based Method
• Constraint-based Method
Partitioning Method:
hierarchical methods on the basis of how the hierarchical decomposition is formed. There are two
• Agglomerative Approach- This approach is also known as the bottom-up approach. We
start with each object forming a separate group, then keep merging the objects or groups that
are close to one another until all of the groups are merged into one.
• Divisive Approach- This approach is also known as the top-down approach. We start with
all of the objects in the same cluster, and a cluster is repeatedly split into smaller clusters until each
object is in its own cluster. Once a merge or split is done, it can never be undone.
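The bottom-up procedure can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes 1-D points and single linkage (the distance between two groups is the distance between their closest members), and it stops at a chosen number of clusters instead of always merging down to one. The function and parameter names are hypothetical.

```python
def agglomerative(points, target_clusters=1):
    """Bottom-up clustering of 1-D points using single linkage."""
    clusters = [[p] for p in points]      # each object starts as its own group
    merges = []                           # merges recorded in the order they happen
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between the closest pair of members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] + clusters[j]
        # the two groups are replaced by their union; the merge is never undone
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters, merges


clusters, merges = agglomerative([1.0, 2.0, 9.0, 10.0, 25.0], target_clusters=2)
for left, right, dist in merges:
    print(left, "+", right, "at distance", dist)
```

The exhaustive pairwise search keeps the sketch short but is slow for large inputs; it is meant only to make the merge order visible.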
Density-based Method:
• In this, the objects together form a grid. The object space is divided
into a finite number of cells that form a grid structure.
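Dividing the object space into cells can be sketched as follows. This is a minimal Python example assuming 2-D points and square cells of a fixed size. The `min_points` threshold for calling a cell "dense" is an assumption added for illustration and is not taken from these notes.

```python
from collections import Counter


def grid_cell_counts(points, cell_size):
    """Map each 2-D point to a grid cell and count the points per cell."""
    counts = Counter()
    for x, y in points:
        cell = (int(x // cell_size), int(y // cell_size))
        counts[cell] += 1
    return counts


def dense_cells(counts, min_points):
    """Cells holding at least min_points objects (threshold is an assumption)."""
    return {cell for cell, n in counts.items() if n >= min_points}


points = [(0.5, 0.5), (0.7, 0.2), (0.9, 0.8), (3.1, 3.4), (3.6, 3.9), (7.0, 1.0)]
counts = grid_cell_counts(points, cell_size=1.0)
dense = dense_cells(counts, min_points=2)
print(sorted(dense))
```

Once the points are binned, later steps work on the cells, so their cost depends on the number of cells and not on the number of objects.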
Advantages
• In this method, a model is hypothesized for each cluster to find the best fit of the
data to that model. This method locates the clusters by clustering the density
function, which reflects the spatial distribution of the data points.
• Land use: Identification of areas of similar land use in an earth observation database
• Insurance: Identifying groups of motor insurance policy holders with a high average
claim cost
• City-planning: Identifying groups of houses according to their house type, value, and
geographical location
• Retail Industry
• Telecommunication Industry
• Integration of data mining with database systems, data warehouse systems and web database systems.
• Web mining.
• The main purpose of web mining is to discover useful information from the
World Wide Web and its usage patterns.
• Web content consists of several types of data: text, image, audio, video, etc.
• Text documents are related to text mining, machine learning and natural language
processing.
• This mining is also known as text mining. This type of mining performs scanning and
mining of the text, images and groups of web pages according to the content of the
input.
Web Structure Mining
• Web structure mining is the discovery of structure information from
the web.
• In web usage mining, users access data on the web and that activity is collected in
the form of logs. For this reason, web usage mining is also called log mining.
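As a small illustration of log mining, the following Python sketch counts page hits and reconstructs each user's navigation path. The log format ("user timestamp page", separated by whitespace) and the sample lines are hypothetical; real server logs have richer formats and would need a matching parser.

```python
from collections import Counter, defaultdict

# Hypothetical access-log lines: "<user> <ISO timestamp> <requested page>"
LOG_LINES = [
    "u1 2024-01-01T10:00:00 /home",
    "u1 2024-01-01T10:01:10 /products",
    "u2 2024-01-01T10:02:00 /home",
    "u1 2024-01-01T10:03:30 /cart",
    "u2 2024-01-01T10:04:00 /products",
    "malformed line",
]


def parse(lines):
    """Yield (user, timestamp, page) tuples, skipping malformed lines."""
    for line in lines:
        parts = line.split()
        if len(parts) == 3:
            yield tuple(parts)


def page_hits(lines):
    """Number of requests per page."""
    return Counter(page for _user, _ts, page in parse(lines))


def paths_by_user(lines):
    """Pages visited by each user, in timestamp order."""
    paths = defaultdict(list)
    # ISO timestamps sort correctly as plain strings
    for user, _ts, page in sorted(parse(lines), key=lambda rec: rec[1]):
        paths[user].append(page)
    return dict(paths)


hits = page_hits(LOG_LINES)
paths = paths_by_user(LOG_LINES)
print(hits.most_common(2))
print(paths)
```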
Applications of Web Mining
• Personalized marketing
• E-commerce
• Search engine optimization
• Fraud detection
• Web content analysis
• Customer service
• Healthcare
Text Data Mining
• All the data that we generate via text messages, documents, emails, and
files is written in natural language text.
Areas of text mining in data mining:
• Information Extraction: The automatic extraction of structured data such as
entities, relationships, and attributes describing entities from an unstructured
source is called information extraction.
• Data Mining: Data mining refers to the extraction of useful data, hidden patterns
from large data sets. Data mining tools can predict behaviors and future trends.
• Information Retrieval: Information retrieval deals with retrieving useful data from
data that is stored in our systems.
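Information extraction can be illustrated with a very small rule-based example. The Python sketch below pulls two kinds of structured values (email addresses and ISO-style dates) out of free text using regular expressions. The patterns are deliberately simple assumptions for illustration and would miss many real-world variants.

```python
import re

# Simplified patterns, assumed for illustration only
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract(text):
    """Return structured fields found in unstructured text."""
    return {
        "emails": EMAIL.findall(text),
        "dates": DATE.findall(text),
    }


text = (
    "Contact alice@example.com before 2024-03-15, "
    "or bob@example.org after 2024-04-01."
)
record = extract(text)
print(record)
```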
Text Mining Applications
• Digital Library: Various text mining strategies and tools are used to find patterns and
trends in journals and proceedings that are stored in text databases.
• Academic and Research Field: In the education field, different text-mining tools and strategies
are used to examine educational patterns in a specific region or research field.
• Life Science: Life science and healthcare industries are producing textual and mathematical data
regarding patient records, sicknesses, medicines, symptoms, and treatments of diseases, etc.
• Social-Media: Text mining is used to analyze web-based media applications in order to monitor and
investigate online content such as the plain text from internet news, web journals, emails, blogs, etc.
• Business Intelligence: Text mining plays an important role in business intelligence, helping
organizations and enterprises analyze their customers and competitors to make better
decisions.
Advantages of Text Mining
• It involves characteristic rules, evaluation rules, and association rules.
Examples: Finding hotspots, unusual locations.
• It targets mining new patterns and unknown knowledge, taking the temporal aspects of the data
into account.
Examples: An association rule such as "Any person who buys a motorcycle also buys a helmet".
With the temporal aspect, this rule becomes "Any person who buys a motorcycle also buys a
helmet after that."
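The difference between the plain rule and its temporal version can be made concrete by computing the rule's confidence both ways. The Python sketch below uses made-up purchase histories; the function name, the `ordered` flag, and the data are hypothetical and serve only to illustrate the idea.

```python
def rule_confidence(histories, antecedent, consequent, ordered=False):
    """Share of customers who bought `antecedent` and also bought `consequent`.

    histories maps a customer id to a list of (day, item) purchases.
    With ordered=True, the consequent only counts if it was bought after
    the customer's first purchase of the antecedent.
    """
    with_antecedent = 0
    with_both = 0
    for purchases in histories.values():
        ante_days = [day for day, item in purchases if item == antecedent]
        if not ante_days:
            continue
        with_antecedent += 1
        first = min(ante_days)
        cons_days = [day for day, item in purchases if item == consequent]
        if ordered:
            satisfied = any(day > first for day in cons_days)
        else:
            satisfied = bool(cons_days)
        if satisfied:
            with_both += 1
    return with_both / with_antecedent if with_antecedent else 0.0


# Made-up purchase histories: customer -> [(day, item), ...]
histories = {
    "c1": [(1, "motorcycle"), (3, "helmet")],
    "c2": [(2, "helmet"), (5, "motorcycle")],
    "c3": [(4, "motorcycle")],
    "c4": [(1, "helmet")],
}

plain = rule_confidence(histories, "motorcycle", "helmet")
temporal = rule_confidence(histories, "motorcycle", "helmet", ordered=True)
print(round(plain, 3), round(temporal, 3))
```

Customer c2 bought the helmet before the motorcycle, so that history supports the plain rule but not the temporal one.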
Rough Set Theory