Cluster Analysis

Uploaded by

nssaini1712

Cluster Analysis
• Cluster: a collection of data objects

• Similar to one another within the same cluster

• Dissimilar to the objects in other clusters

• Cluster analysis: finding similarities between data objects according to the
characteristics found in the data, and grouping similar data objects into
clusters

• Clustering splits data into several subsets. Each subset contains objects
that are similar to one another, and these subsets are called clusters.
Applications of cluster analysis in data mining:
• Cluster analysis is widely used in applications such as data analysis, market research,
pattern recognition, and image processing.

• It helps marketers find distinct groups in their customer base and characterize those
groups by their purchasing patterns.

• Clustering is also used in tracking applications such as detection of credit card fraud.

• In biology, it can be used to derive plant and animal taxonomies, categorize
genes with similar functionality, and gain insight into structures inherent in populations.

• It helps identify areas of similar land use in an earth observation database,
and groups of houses in a city according to house type, value, and geographical
location.
Requirements of Clustering in Data Mining:
• Scalability − We need highly scalable clustering algorithms to deal with large databases.

• Ability to deal with different kinds of attributes − Algorithms should be applicable to any
kind of data, such as interval-based (numerical), categorical, and binary data.

• Discovery of clusters with arbitrary shape − The clustering algorithm should be capable of detecting
clusters of arbitrary shape. It should not be restricted to distance measures that tend to find
small spherical clusters.

• High dimensionality − The clustering algorithm should be able to handle not only low-dimensional data
but also high-dimensional spaces.

• Ability to deal with noisy data − Databases contain noisy, missing, or erroneous data. Some algorithms
are sensitive to such data and may produce poor-quality clusters.

• Interpretability − The clustering results should be interpretable, comprehensible, and usable.


Types Of Data Used In Cluster Analysis
Are:
• Interval-Scaled variables
• Binary variables
• Nominal, ordinal, and ratio variables
• Variables of mixed types
Types Of Data Used In Cluster Analysis
Are:
• Interval-Scaled Variables: Interval-scaled variables are continuous measurements on a
roughly linear scale.

Typical examples include weight and height, latitude and longitude coordinates, and
weather temperature.

• Binary Variables: A binary variable can take only two values: 0 when the feature is
absent and 1 when it is present.

For example, for a binary variable recording bike ownership, 1 means the customer has a bike and 0
means the customer does not have a bike.

• Nominal or Categorical Variables: A generalization of the binary variable in that it
can take more than two states, e.g., a map color variable with states such as
red, yellow, blue, and green.

• Ordinal Variables: An ordinal variable can be discrete or continuous. The order of its
values is important, e.g., rank. It can be treated like an interval-scaled variable.

• Ratio-Scaled Variables: A ratio-scaled variable is a positive measurement on a
nonlinear scale, approximately an exponential scale, such as Ae^(Bt) or Ae^(−Bt).

• Variables of Mixed Types: A database may contain all six types of variables:
symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio.
Together these are called mixed-type variables.
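
One common way to handle a ratio-scaled variable, offered here as an assumption rather than something these slides prescribe, is to take its logarithm so that growth of the form Ae^(Bt) becomes linear in t. The values of A and B below are made up for illustration.

```python
import math

# Hypothetical parameters for a ratio-scaled measurement y = A * e^(B*t).
A, B = 2.0, 0.5

def ratio_scaled(t):
    """Value of the ratio-scaled variable at time t."""
    return A * math.exp(B * t)

def log_transform(y):
    """Map a positive ratio-scaled value onto a roughly linear scale."""
    return math.log(y)

# After the transform, equal steps in t give equal steps in the value,
# because log(A * e^(B*t)) = log(A) + B*t.
logs = [log_transform(ratio_scaled(t)) for t in range(4)]
steps = [logs[i + 1] - logs[i] for i in range(3)]
print(steps)
```

Each step is approximately B, so the transformed values can be treated like an interval-scaled variable.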
Clustering Methods:

Clustering methods can be classified into the following categories −

• Partitioning Method

• Hierarchical Method

• Density-based Method

• Grid-Based Method

• Model-Based Method

• Constraint-based Method
Partitioning Method:

• Suppose we are given a database of n objects. A partitioning method
constructs k partitions of the data, where each partition represents a
cluster and k ≤ n. That is, it classifies the data into k groups that
satisfy the following requirements −

• Each group contains at least one object.

• Each object must belong to exactly one group.
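
As a minimal sketch of a partitioning method, the snippet below uses k-means, which is one well-known choice and not named in the slides. The sample points, the helper name `kmeans`, and the "first k points" initialisation are all assumptions made for illustration.

```python
import math

def kmeans(points, k, iterations=10):
    """Very small k-means: returns a list of k clusters (lists of points)."""
    centroids = list(points[:k])  # naive initialisation: first k points (an assumption)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Each object goes to exactly one group: its nearest centroid.
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a group happens to be empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return clusters

points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 0.5), (8.5, 9.0)]
clusters = kmeans(points, k=2)
print(clusters)
```

Every point lands in exactly one group by construction. This sketch does not enforce the "at least one object per group" requirement, although it holds for this sample data.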


Hierarchical Methods:
• This method creates a hierarchical decomposition of the given set of data objects. Hierarchical
methods are classified by how the decomposition is formed. There are two
approaches − 1. Agglomerative Approach 2. Divisive Approach

• Agglomerative Approach: Also known as the bottom-up approach. We start with each object
forming a separate group, then keep merging the objects or groups that are close to one
another until all of the groups are merged into one or a termination condition holds.

• Divisive Approach: Also known as the top-down approach. We start with all of the objects
in the same cluster. In each iteration, a cluster is split into smaller clusters. This continues
until each object is in its own cluster or a termination condition holds.

• In both approaches, once a merge or split is done, it can never be undone.
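
The agglomerative (bottom-up) approach can be sketched as follows. The single-linkage distance, the `target_groups` stopping parameter, and the sample points are assumptions chosen for illustration; other linkage rules are equally valid.

```python
import math

def single_link(a, b):
    """Distance between two groups: closest pair of members (single linkage)."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerate(points, target_groups=1):
    """Bottom-up merging; returns the final groups and the history of merges."""
    groups = [[p] for p in points]  # start: every object is its own group
    history = []
    while len(groups) > target_groups:
        # Find the pair of groups that are closest to one another.
        i, j = min(
            ((i, j) for i in range(len(groups)) for j in range(i + 1, len(groups))),
            key=lambda ij: single_link(groups[ij[0]], groups[ij[1]]),
        )
        history.append((groups[i], groups[j]))
        merged = groups[i] + groups[j]  # this merge is never revisited later
        groups = [g for n, g in enumerate(groups) if n not in (i, j)] + [merged]
    return groups, history

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
groups, history = agglomerate(points, target_groups=2)
print(groups)
```

With `target_groups=1` the loop would run until everything is merged into a single group.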
Density-based Method:

• This method is based on the notion of density. The basic idea is to
continue growing a given cluster as long as the density in the
neighborhood exceeds some threshold, i.e., for each data point within
a given cluster, the neighborhood of a given radius has to contain at
least a minimum number of points.
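
A rough sketch of the cluster-growing idea, in the spirit of DBSCAN but much simplified. The function names, the `radius` and `min_points` values, and the sample points are assumptions for illustration.

```python
import math

def neighbors(points, p, radius):
    """All points within `radius` of p (p itself included)."""
    return [q for q in points if math.dist(p, q) <= radius]

def grow_cluster(points, seed, radius, min_points):
    """Grow one cluster from `seed` while neighbourhoods stay dense enough."""
    cluster, frontier = set(), [seed]
    while frontier:
        p = frontier.pop()
        if p in cluster:
            continue
        cluster.add(p)
        near = neighbors(points, p, radius)
        if len(near) >= min_points:  # dense neighbourhood: keep expanding from p
            frontier.extend(q for q in near if q not in cluster)
    return cluster

points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
cluster = grow_cluster(points, seed=(0, 0), radius=1.5, min_points=3)
print(sorted(cluster))
```

The isolated point (10, 10) is never reached. A fuller implementation would also label such points as noise and start new clusters from unvisited seeds.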
Grid-based Method:

• In this method, the object space is quantized into a finite number of
cells that form a grid structure, and clustering operates on the cells.

Advantages

• The major advantage of this method is its fast processing time.

• Processing time depends only on the number of cells in each dimension of the
quantized space, not on the number of data objects.
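
The quantization step can be sketched as below. The cell size, the density threshold, and the sample points are assumptions; real grid-based methods add further steps, such as merging adjacent dense cells.

```python
from collections import Counter

def grid_cells(points, cell_size):
    """Quantise each point to the integer index of the cell containing it."""
    return Counter(tuple(int(c // cell_size) for c in p) for p in points)

def dense_cells(points, cell_size, min_count):
    """Cells holding at least `min_count` points."""
    counts = grid_cells(points, cell_size)
    return {cell for cell, n in counts.items() if n >= min_count}

points = [(0.2, 0.3), (0.8, 0.1), (0.5, 0.9), (3.1, 3.4), (3.7, 3.9), (7.5, 0.2)]
cells = grid_cells(points, cell_size=1.0)
dense = dense_cells(points, cell_size=1.0, min_count=2)
print(cells, dense)
```

After the single pass that fills the counts, later work touches only the cells, which is where the speed advantage comes from.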
Model-based methods:

• In this method, a model is hypothesized for each cluster, and the method
finds the best fit of the data to the given model. It locates clusters by
clustering the density function, which reflects the spatial distribution
of the data points.

• This method also provides a way to automatically determine the
number of clusters based on standard statistics, taking outliers or noise
into account.
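
As one possible illustration of fitting a model per cluster, the sketch below assumes each cluster is a one-dimensional Gaussian and fits a mixture of two of them with expectation-maximization. The choice of Gaussian components, the fixed count of two, the initial guesses, and the data are all assumptions. Choosing the number of clusters automatically is not shown.

```python
import math

def gaussian(x, mean, var):
    """Density of a normal distribution at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_two_gaussians(xs, iterations=50):
    """EM for a 1-D mixture of two Gaussians; returns (weights, means, variances)."""
    means = [min(xs), max(xs)]  # crude initial guess (an assumption)
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            probs = [w * gaussian(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(probs)
            resp.append([p / total for p in probs])
        # M-step: re-estimate each component from its weighted points.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            spread = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(spread, 1e-6)  # floor avoids a zero variance
    return weights, means, variances

xs = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, means, variances = fit_two_gaussians(xs)
print(weights, means, variances)
```

The fitted means settle near the centers of the two groups of values, and each point's responsibilities indicate which cluster it most likely belongs to.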
Constraint-based Method:

• In this method, clustering is performed by incorporating user- or
application-oriented constraints. A constraint refers to the user's
expectations or to the desired properties of the clustering results.
Constraints provide an interactive way of communicating with the
clustering process. They can be specified by the user or by the
application requirements.
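
One common form of constraint, assumed here for illustration, is pairwise: "must-link" pairs have to share a cluster and "cannot-link" pairs must not. The sketch below only checks a finished clustering against such constraints. The names and labels are hypothetical.

```python
def satisfies_constraints(labels, must_link, cannot_link):
    """Check a clustering (object -> cluster label) against pairwise constraints."""
    def same_cluster(a, b):
        return labels[a] == labels[b]

    return all(same_cluster(a, b) for a, b in must_link) and not any(
        same_cluster(a, b) for a, b in cannot_link
    )

# Hypothetical clustering of four customers into two clusters.
labels = {"alice": 0, "bob": 0, "carol": 1, "dave": 1}

ok = satisfies_constraints(
    labels, must_link=[("alice", "bob")], cannot_link=[("bob", "carol")]
)
bad = satisfies_constraints(labels, must_link=[("alice", "carol")], cannot_link=[])
print(ok, bad)
```

A constraint-based algorithm would use a check like this during clustering, rejecting or repairing assignments that violate the user's constraints.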
Examples of Clustering Applications
• Marketing: Help marketers discover distinct groups in their customer bases, and then
use this knowledge to develop targeted marketing programs

• Land use: Identification of areas of similar land use in an earth observation database

• Insurance: Identifying groups of motor insurance policy holders with a high average
claim cost

• City-planning: Identifying groups of houses according to their house type, value, and
geographical location

• Earthquake studies: Observed earthquake epicenters should be clustered along
continental faults
Data Mining Applications

Here is the list of areas where data mining is widely used −

• Financial Data Analysis

• Retail Industry

• Telecommunication Industry

• Biological Data Analysis


Financial Data Analysis

• Design and construction of data warehouses for multidimensional
data analysis and data mining.

• Loan payment prediction and customer credit policy analysis.

• Classification and clustering of customers for targeted marketing.

• Detection of money laundering and other financial crimes.


Retail Industry

• Design and construction of data warehouses based on the benefits of
data mining.

• Multidimensional analysis of sales, customers, products, time, and
region.

• Analysis of effectiveness of sales campaigns.

• Product recommendation and cross-referencing of items.


Telecommunication Industry

• Multidimensional Analysis of Telecommunication data.

• Fraud pattern analysis.

• Identification of unusual patterns.

• Multidimensional association and sequential patterns analysis.

• Mobile Telecommunication services.

• Use of visualization tools in telecommunication data analysis.


Biological Data Analysis

• Semantic integration of heterogeneous, distributed genomic databases.

• Alignment, indexing, similarity search, and comparative analysis of multiple
nucleotide sequences.

• Discovery of structural patterns and analysis of genetic networks and protein
pathways.

• Association and path analysis.

• Visualization tools in genetic data analysis.


Trends in Data Mining
• Application Exploration.

• Integration of data mining with database systems, data warehouse systems and web database systems.

• Standardization of data mining query language.

• Web mining.

• Biological data mining.

• Data mining and software engineering.

• Distributed data mining.

• Real time data mining.

• Privacy protection and information security in data mining.


Web mining
• Web mining is the application of data mining techniques to extract information
from web documents and services.

• The main purpose of web mining is to discover useful information from the
World Wide Web and its usage patterns.

Web mining is further divided into three types:

• Web content mining

• Web structure mining

• Web usage mining


Web content mining
• Web content mining is the extraction of useful information from the content of web
documents.

• Web content consists of several types of data: text, images, audio, video, etc.

• It can reveal useful and interesting patterns about user needs.

• Mining text documents draws on text mining, machine learning, and natural language
processing.

• This type of mining is also known as text mining. It scans and mines the text,
images, and groups of web pages according to the content of the
input.
Web Structure Mining
• Web structure mining is the discovery of structure information from
the web.

• The structure of the web consists of web pages as nodes and
hyperlinks as edges connecting related pages.

• Structure mining produces a structured summary of a
particular website. It identifies relationships between web pages that are linked
by related information or by a direct link.
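
The nodes-and-edges view can be sketched with a small adjacency list. The site layout below is hypothetical, and counting inbound links is just one simple structural summary among many.

```python
from collections import defaultdict

# Hypothetical site: each page (node) maps to the pages it links to (edges).
links = {
    "home": ["products", "about"],
    "products": ["home", "cart"],
    "about": ["home"],
    "cart": ["home", "products"],
}

def in_degree(graph):
    """Count incoming hyperlinks for every page."""
    counts = defaultdict(int)
    for source, targets in graph.items():
        counts[source] += 0  # make sure pages with no inbound links still appear
        for target in targets:
            counts[target] += 1
    return dict(counts)

inbound = in_degree(links)
most_linked = max(inbound, key=inbound.get)
print(inbound, most_linked)
```

Pages with many inbound links tend to be hubs of the site, which is the kind of relationship structure mining looks for.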
Web Usage Mining

• Web usage mining is the discovery of interesting usage
patterns from large data sets.

• These patterns help you understand user behavior.

• As users access data on the web, their activity is recorded in the
form of logs. Web usage mining is therefore also called log mining.
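
A minimal sketch of log mining, assuming a made-up log format of "user path" per line. Real web server logs carry more fields and need a proper parser.

```python
from collections import Counter

# Hypothetical, simplified log lines: "<user> <path>".
log_lines = [
    "u1 /home",
    "u1 /products",
    "u2 /home",
    "u1 /cart",
    "u2 /products",
    "u3 /home",
]

def page_hits(lines):
    """Number of requests per page."""
    return Counter(line.split()[1] for line in lines)

def sessions(lines):
    """Ordered list of pages visited by each user."""
    paths = {}
    for line in lines:
        user, path = line.split()
        paths.setdefault(user, []).append(path)
    return paths

hits = page_hits(log_lines)
by_user = sessions(log_lines)
print(hits.most_common(1), by_user["u1"])
```

Per-page counts and per-user navigation paths are typical starting points for finding usage patterns.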
Applications of Web Mining
• Personalized marketing
• E-commerce
• Search engine optimization
• Fraud detection
• Web content analysis
• Customer service
• Healthcare
Text Data Mining

• Text data mining can be described as the process of extracting
essential information from natural language text.

• All the data that we generate via text messages, documents, emails, and
files is written in natural language text.
Areas of text mining in data mining:
• Information Extraction: The automatic extraction of structured data such as
entities, relationships, and attributes describing entities from an unstructured
source is called information extraction.

• Natural Language Processing: Natural language processing (NLP) enables
computer software to understand human language as it is spoken or written. NLP
is primarily a component of artificial intelligence (AI).

• Data Mining: Data mining refers to the extraction of useful data and hidden patterns
from large data sets. Data mining tools can predict behaviors and future trends.

• Information Retrieval: Information retrieval deals with retrieving useful data from
data that is stored in our systems.
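
Information extraction can be illustrated, very roughly, by pulling structured fields out of free text with regular expressions. The sample sentence and the two patterns are assumptions. Practical systems use far more robust techniques.

```python
import re

# Hypothetical unstructured source text.
text = "Contact Priya at priya@example.com or Tom at tom@example.org before 2024-05-01."

# Simplified patterns; they are not complete validators for emails or dates.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

def extract(source):
    """Turn free text into a small structured record."""
    return {"emails": EMAIL.findall(source), "dates": DATE.findall(source)}

record = extract(text)
print(record)
```

The result is a structured record (entities and their attributes) derived from an unstructured source.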
Text Mining Applications
• Digital Library: Various text mining strategies and tools are used to find patterns and
trends in journals and proceedings that are stored in text databases.

• Academic and Research Field: In education, different text mining tools and strategies
are used to examine educational trends in a specific region or research field.

• Life Science: Life science and healthcare industries produce large volumes of textual and numerical data
about patient records, illnesses, medicines, symptoms, and treatments of diseases, which text mining can help analyze.

• Social Media: Text mining is used to analyze web-based media applications, monitoring and investigating
online content such as the plain text of internet news, web journals, emails, and blogs.

• Business Intelligence: Text mining plays an important role in business intelligence, helping
organizations and enterprises analyze their customers and competitors to make better
decisions.
Advantages of Text Mining

• Large Amounts of Data: Text mining allows organizations to extract
information from large amounts of unstructured text data.

• Variety of Applications: Text mining has a wide range of applications,
including sentiment analysis, entity recognition, and topic modeling.

• Improved Decision Making

• Cost-effective: Text mining can be cost-effective, as it reduces
the need for manual data entry.
Difference between Spatial and Temporal Data Mining

Spatial Data Mining | Temporal Data Mining
Extraction of knowledge, spatial relationships, and interesting patterns that are not explicitly stored in a spatial database. | Extraction of knowledge about the occurrence of events, such as whether they follow random, cyclic, or seasonal variation.
It needs space. | It needs time.
Primarily, it deals with spatial data such as locations and geo-referenced data. | Primarily, it deals with temporal content drawn from a huge set of data.
It involves characteristic rules, evaluation rules, and association rules. | It targets mining new patterns and unknown knowledge, taking the temporal aspects of the data into account.
Examples: Finding hotspots, unusual locations. | Example: The association rule "Any person who buys a motorcycle also buys a helmet" becomes, with the temporal aspect, "Any person who buys a motorcycle also buys a helmet after that."
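
The difference between the plain and the temporal version of the motorcycle and helmet rule can be sketched by measuring how often each holds. The purchase histories below are invented for illustration.

```python
# Hypothetical purchase histories: (day, item) pairs per customer.
purchases = {
    "c1": [(1, "motorcycle"), (3, "helmet")],
    "c2": [(2, "helmet"), (5, "motorcycle")],
    "c3": [(4, "motorcycle"), (4, "gloves"), (9, "helmet")],
    "c4": [(6, "motorcycle")],
}

def bought_both(history, first, second):
    """Plain rule: both items appear, in any order."""
    items = {item for _, item in history}
    return first in items and second in items

def bought_after(history, first, second):
    """Temporal rule: `second` bought strictly later than the earliest `first`."""
    first_days = [day for day, item in history if item == first]
    return bool(first_days) and any(
        item == second and day > min(first_days) for day, item in history
    )

buyers = [h for h in purchases.values() if any(i == "motorcycle" for _, i in h)]
plain = sum(bought_both(h, "motorcycle", "helmet") for h in buyers) / len(buyers)
temporal = sum(bought_after(h, "motorcycle", "helmet") for h in buyers) / len(buyers)
print(plain, temporal)
```

Customer c2 bought the helmet first, so that history supports the plain rule but not the temporal one.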
Rough Set Theory

• It is a formal theory derived from fundamental research on the logical
properties of information systems.

• Rough set theory has been used as a methodology for database mining or
knowledge discovery in relational databases.

• We can use the rough set approach to discover structural relationships
within imprecise and noisy data.
Basic problems in data analysis solved by Rough Sets:
• Characterization of a set of objects in terms of attribute values.

• Finding dependency between the attributes.

• Reduction of superfluous attributes.

• Finding the most significant attributes.

• Decision rule generation.


Goals of Rough Set Theory
• The main goal of rough set analysis is the induction (learning) of
approximations of concepts. Rough sets are applied within knowledge discovery in databases (KDD), where they offer
mathematical tools to discover hidden patterns in data.

• It can be used for feature selection, feature extraction, data reduction,
decision rule generation, and pattern extraction (templates, association
rules), etc.

• It identifies partial or total dependencies in data, eliminates redundant data,
and offers an approach to null values, missing data, dynamic data, and others.
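
The "approximations of concepts" mentioned above are usually formalized as lower and upper approximations. A minimal sketch follows, in which the patients, attribute values, and the concept `flu` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical information system: object -> (headache, temperature).
attributes = {
    "p1": ("yes", "high"),
    "p2": ("yes", "high"),
    "p3": ("no", "high"),
    "p4": ("no", "normal"),
    "p5": ("yes", "normal"),
}
flu = {"p1", "p3"}  # the concept to approximate (hypothetical labels)

def equivalence_classes(attrs):
    """Group objects that are indiscernible, i.e. share all attribute values."""
    classes = defaultdict(set)
    for obj, values in attrs.items():
        classes[values].add(obj)
    return list(classes.values())

def approximations(attrs, concept):
    """Return the (lower, upper) approximations of `concept`."""
    lower, upper = set(), set()
    for block in equivalence_classes(attrs):
        if block <= concept:  # every member is in the concept: certainly belongs
            lower |= block
        if block & concept:   # some member is in the concept: possibly belongs
            upper |= block
    return lower, upper

lower, upper = approximations(attributes, flu)
print(sorted(lower), sorted(upper))
```

Objects p1 and p2 have identical attribute values but different labels, so they fall in the boundary region between the two approximations. This is how rough sets represent imprecise data.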
